Elasticsearch & ELK Stack Cheatsheet

Gamze Yılan
13 min read · Jun 11, 2023

Learn Elasticsearch with Logstash and Kibana. This article will get you from zero to hero in just 13 minutes!

Getting Started

Elasticsearch is a document-oriented search engine: it inserts, deletes, retrieves, analyzes and searches documents. It is often used to store and manage application or server logs. Keep in mind that a "document" here is a JSON object rather than a traditional document. You can also think of Elasticsearch as a non-relational database (a database without columns & rows).

Elasticsearch is essentially a search engine. It’s especially designed to search within documents, and it works very fast.

Elasticsearch uses a data structure called an inverted index. You can think of it like the index at the back of a book: instead of scanning documents one by one, Elasticsearch maps every word (term) to the list of documents that contain it. That is what allows it to run queries over huge amounts of data so quickly. A real-world example that helps illustrate the concept is product search on a site like Amazon: you can find the product you need among billions and billions of products, word by word, thanks to inverted indexing.

To get started, install Elasticsearch for your operating system (older releases require you to install Java first; recent releases bundle their own JDK). Make sure to also install Kibana, which gives you a visual control panel over Elasticsearch on your system.

Terminology and the HTTP methods

With Elasticsearch, what a classical database calls a table is called an index. Instead of a row we have a document, and instead of a column we have a field. So let's think of a "Menu" table containing a restaurant's food options and their prices:

Table: “Menu”

Here, "Meal Name" and "Price" are each called a field. The table name "Menu" is the index (index names must be lowercase, so in the examples below we'll call it menu). And the data below would be considered a document:

{
  "Meal Name": "Veggie Noodles",
  "Price": "18USD"
}

Just as in every database, we can POST, GET, PUT and DELETE data, i.e. perform HTTP methods on it. We can run these operations from the Kibana console at http://localhost:5601, under Management → Dev Tools in the hamburger menu.

Note: 5601 is the default port for Kibana, and you can change this while configuring it on your system.

Writing data to the database is called indexing, since Elasticsearch stores the data in an inverted index. Indexing covers both the POST and PUT behaviors: if a document with the same id already exists it is overwritten, otherwise a new one is created. To index a document we use the structure PUT /{index}/_doc/{id}. So to index a document into the menu index from before:

PUT /menu/_doc/4
{
  "Meal Name": "Burger",
  "Price": "18USD"
}

And when you perform the operation, you’ll get a response like this:

{
  "_index": "menu", // index name
  "_id": "4", // the id we gave, as a string
  "_version": 1, // this is the first version of this document; if we PUT again with the same id we overwrite it, and this becomes 2
  "result": "created", // result of the performed operation
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  }
}

Note that when you index data, Elasticsearch will detect the data type (long, text, etc.) and add it to the index's mapping automatically.

Another important thing to note: if we change a field's value within the document and run PUT again, it replaces the entire document, not just the changed field. To update a single field we use the _update endpoint instead:

POST /menu/_update/4
{
  "doc": {
    "Price": "20USD"
  }
}

We can get a certain document by id with the structure GET /{index}/_doc/{id}. If we try to retrieve a document with an id or an index that doesn't exist, the response will look like:

{
  ... // metadata created by Elasticsearch; the key names usually start with an underscore
  "found": false
}

And if we're only interested in a specific field instead of the whole document, we can use source filtering, e.g. GET /{index}/_doc/{id}?_source=Price to fetch only the Price field.

The paths we use for these operations, such as GET /menu/_doc/123, are all REST endpoints that Elasticsearch exposes automatically. You can call the same endpoint from your browser using the port your Elasticsearch instance runs on (9200 by default), e.g. http://localhost:9200/menu/_doc/123, and it will return the same data it did in the console.

To delete a document, we can run DELETE /{index}/_doc/{id}. We can also delete the entire index with DELETE /{index}. When we delete a document, Elasticsearch does not remove it immediately; it marks the document as deleted and stops returning it in queries. Hence, keep in mind that your occupied disk space will not decrease as soon as you delete some records.

Understanding the Index

Now that we have created an index, we can call GET on it and see its details. Simply run GET /{index} and you'll get something like:
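(The exact response depends on your Elasticsearch version and settings; a trimmed, illustrative response for our menu index could look like this, with the uuid, creation_date and version values being placeholders:)

{
  "menu": {
    "aliases": {},
    "mappings": {
      "properties": {
        "Meal Name": {
          "type": "text",
          "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
        },
        "Price": {
          "type": "text",
          "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
        }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1686480000000",
        "number_of_shards": "1",
        "number_of_replicas": "1",
        "uuid": "some-generated-uuid",
        "version": { "created": "8080099" },
        "provided_name": "menu"
      }
    }
  }
}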

Now let’s study the response. In the example above:

  • Aliases contain all the alternative names you can use to refer to the index instead of its direct name. You can define none, one or multiple.
  • Mappings contain the structure of the fields. Inside properties there's a list of all the fields in the index; each field is identified by its name and has a type describing the data it holds. If a field were of type long, for example, it wouldn't have a fields section inside. But text fields, to support both exact (keyword) matches and full-text searches, are mapped twice, as text and as keyword, which is why each text field has an inner fields section.

Note: You can get the mappings alone instead of all the details regarding that index via GET /{index}/_mapping.

Note: If you put a document into an index that doesn't match the mapping, Elasticsearch won't prevent you from doing so. It simply alters the mapping to cover your new document as well as the old ones; it's not restrictive like classical relational databases. You can, however, prevent that by setting "dynamic": "strict" when creating the index (see the example at the end of this section).

  • Settings contain the general settings for the index. provided_name holds the name of the index, creation_date holds the date it was created, uuid holds the unique id Elasticsearch assigned to the index automatically, and version.created records the Elasticsearch version the index was created with. We'll get to the concept of shards and replicas later.

Each field in an index has a single mapped type, so if you index a document whose value doesn't fit that type (for example text where a number is expected), Elasticsearch will reject it with a mapping exception.
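For example, a minimal sketch of creating the index with an explicit mapping and strict dynamic behavior could look like this (the field types here are just one reasonable choice for our example; with "dynamic": "strict", a document containing an unmapped field is rejected instead of silently extending the mapping):

PUT /menu
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "Meal Name": { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
      "Price": { "type": "keyword" }
    }
  }
}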

Shards, Replicas and How the Elasticsearch Works

We love Elasticsearch because it can query huge numbers of documents so quickly, and that is thanks to the concept of shards.

Basically, Elasticsearch usually runs on multiple computers within the same network, so it makes sense to put them all to work when running a query, and that is exactly what Elasticsearch does!

For example, imagine that our menu contains every meal that exists (fun fact: there are supposedly 75,287,520 recipes in the world) along with an average price for each in the same currency. If we're looking for the price of a burger, and our network has two computers (we call them nodes), it makes sense to split our menu data into two and have each computer search one piece. The pieces here are called shards.

Since each of these computers can communicate with the others, each can accept requests and queries directly. If the incoming request contains the id of the document we're looking for, the receiving node uses a hashing function to work out exactly which shard that document lives on and forwards the request to the right node. If the right node is busy performing some other task, it will redirect the query to a node that holds a replica of the same shard. They're just like co-workers!
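For reference, the default routing rule is, roughly:

shard_num = hash(_routing) % number_of_primary_shards

where _routing defaults to the document's id. This is also why the number of primary shards can't simply be changed after an index is created: every document would suddenly hash to a different shard.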

Now since we split the data and run operations on it, it's a good idea to keep copies of the shards somewhere, right? These copies are called replicas, aka replica shards, and the original shards are called primary shards. The group of nodes that holds all those shards and replicas, the machines that together make up Elasticsearch, is called a cluster.

So we've separated our data into pieces and put them on different computers, and just to play it safe, it makes sense to keep the replica of the first shard on node 2 and the replica of the second shard on node 1, right? You know, in case we lose one of the nodes. That's exactly what Elasticsearch does!

You can define the number of shards and replicas while creating the index as below; if you don't, Elasticsearch uses a default of one primary shard and one replica per index.

PUT /menu
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  }
}
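If you want to check where the shards of an index actually ended up, one option is the cat shards API; the output lists each shard's number, whether it is a primary (p) or a replica (r), and the node it lives on:

GET /_cat/shards/menu?v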

Queries & Filters

Elasticsearch can be simplified as JSON over HTTP: we perform HTTP methods on given indices under the URL of our local Elasticsearch instance, with JSON as the request body.

Above we've seen how we use JSON to create mappings, define documents, change settings and so on. This JSON syntax is called a Domain Specific Language (DSL). Since this is how Elasticsearch communicates, if you get the syntax or structure wrong you'll get an error.

Now there are two ways to search within our documents: querying and filtering. We use the same DSL for both.

Note: We can combine a query with a filter to create a more complex search.

Queries

We can run all queries using GET /{index}/_search with the search criteria as the request body.

You can send this request to http://localhost:9200/menu/_search with the query as the body from wherever you like: Postman, your browser, your terminal, the Dev Tools console in Kibana…

The criteria we put inside the query will, of course, vary. For example, if we're looking for all the noodles we have on the menu, we can use the "match" clause:

GET /menu/_search
{
  "query": {
    "match": {
      "Meal Name": "Noodle"
    }
  }
}

This returns a structure with general information about how the query went, such as the time it took and whether it was successful, plus a field called "hits" that contains the actual documents we searched for. The "total" field inside hits gives the number of documents that match the query, and each document carries a "_score" value showing how relevant it is to our query. Simply put, if we search for "Chicken Noodles" and our documents contain both "Chicken Noodles" and "Spicy Chicken Noodles", then "Chicken Noodles" will naturally get the higher score. Of course, there's more engineering and detail going on behind the scenes when it comes to scoring, but just know that you can trust Elasticsearch on that!
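(A trimmed, illustrative response to the query above could look like this; the scores, ids and timings are made-up example values:)

{
  "took": 3,
  "timed_out": false,
  "hits": {
    "total": { "value": 2, "relation": "eq" },
    "max_score": 1.3862942,
    "hits": [
      { "_index": "menu", "_id": "1", "_score": 1.3862942, "_source": { "Meal Name": "Chicken Noodles", "Price": "10USD" } },
      { "_index": "menu", "_id": "2", "_score": 0.98082906, "_source": { "Meal Name": "Spicy Chicken Noodles", "Price": "12USD" } }
    ]
  }
}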

One query can contain multiple match criteria with the help of "must"; just make sure to put them inside a "bool" clause and wrap each criterion in curly braces. We can also exclude documents matching a certain criterion with "must_not". So if we're looking for noodles we can get on a 10USD budget, but don't want anything with chicken:

GET /menu/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "Meal Name": "Noodle" } },
        { "match": { "Price": "10USD" } }
      ],
      "must_not": [
        { "match": { "Meal Name": "Chicken" } }
      ]
    }
  }
}

Note: The response to this will be compact and hard to read. If you'd like a more readable response, try http://localhost:9200/menu/_search?pretty instead.

Let's say some of the meals don't have their prices listed (hence no "Price" field), and we only want the ones that do have price data. We can do that with the "exists" clause:

GET /menu/_search
{
  "query": {
    "exists": {
      "field": "Price"
    }
  }
}

Note: If we have nested objects, we can reach the inner fields using a dot. So, for example, if our Price field looks like

"Price": {
  "price_usd": "10USD",
  "price_eu": "8EU"
}

and we're looking for the USD price, we can get it as:

GET /menu/_search
{
  "query": {
    "exists": {
      "field": "Price.price_usd"
    }
  }
}

If we're looking for meals between 10USD and 15USD, we can use "range" (assume the Price field holds just the numbers, 10 and 15, without the currency):

GET /menu/_search
{
  "query": {
    "range": {
      "Price": {
        "gte": 10, // greater than or equal; use gt for strictly greater than
        "lte": 15  // less than or equal; use lt for strictly less than
      }
    }
  }
}

Note: You can also use dates like "2020-03-05" in quote marks with range.
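As a quick sketch, assuming we also stored a date field (a hypothetical added_date, not part of our menu example so far), a date range query could look like:

GET /menu/_search
{
  "query": {
    "range": {
      "added_date": {
        "gte": "2020-03-05",
        "lte": "2023-06-11"
      }
    }
  }
}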

Now let's get all the meals and their prices, but sorted from cheapest to most expensive, using "match_all" to get everything and "sort" to sort:

GET /menu/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    { "Price": { "order": "asc" } } // asc for ascending, desc for descending
  ]
}

And if we wanted to look at, let's say, the 5 cheapest options:

GET /menu/_search
{
  "size": 5, // sorts first, then returns the top 5; also used for pagination
  "query": {
    "match_all": {}
  },
  "sort": [
    { "Price": { "order": "asc" } }
  ]
}
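For pagination, a minimal sketch of fetching the second page of five results combines "from" and "size", where "from" is the number of results to skip:

GET /menu/_search
{
  "from": 5,
  "size": 5,
  "query": { "match_all": {} },
  "sort": [ { "Price": { "order": "asc" } } ]
}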

Filters

Filters are pretty much the same as queries DSL-wise. Anything you can do with a query, you can do with a filter.

So just as we did above, if we want to filter our documents to get noodles that cost 10USD and don't contain chicken, we can do the same thing, but wrap what was originally inside our query in a "filter" clause and then a "bool" clause:

GET /menu/_search
{
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            { "match": { "Meal Name": "Noodle" } },
            { "match": { "Price": "10USD" } }
          ],
          "must_not": [
            { "match": { "Meal Name": "Chicken" } }
          ]
        }
      }
    }
  }
}

So why do we even have filters? Well:

  • Filters don't calculate a relevance score, so if you don't need that information, using a filter will do the same job faster.
  • Filter results are cached, so if you plan to run the same search over and over again, using a filter will do the same job faster.

Tip: Make it a habit to use the filter clause by default unless the scenario specifically requires scoring, for better performance (see the combined example below).
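A common pattern is to combine both: keep the full-text part in "must" (so it is scored) and move exact constraints into "filter" (so they are cached and not scored). A minimal sketch, assuming a numeric Price field as in the range example above:

GET /menu/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "Meal Name": "Noodle" } }
      ],
      "filter": [
        { "range": { "Price": { "lte": 15 } } }
      ]
    }
  }
}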

Aggregations

Aggregations are a way of getting statistics about your data, similar to the GROUP BY statement in SQL.

There are three types of aggregations:

  • Bucket Aggregations: Create a set (bucket) of documents matching a certain criterion. We can't access the individual documents within a bucket, but we can run metric aggregations on them (see the sketch after this list).
  • Metric Aggregations: Computes metrics over a bucket. You can, for example, get the number of documents within a bucket or the sum of a field in documents within a bucket.
  • Pipeline Aggregations: Gets statistics regarding other aggregations. For example, we can calculate how many percent of the meal prices are higher than the average of all meal prices.
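As a sketch of how bucket and metric aggregations nest, the request below groups meals by a hypothetical Category field (not part of our earlier examples) and computes the average price inside each bucket, assuming Price is numeric; "size": 0 simply suppresses the regular search hits:

GET /menu/_search
{
  "size": 0,
  "aggs": {
    "by_category": {
      "terms": { "field": "Category.keyword" },
      "aggs": {
        "avg_price": { "avg": { "field": "Price" } }
      }
    }
  }
}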

Now let's count how many meals we have at each price. For that we use the "aggs" clause and give the aggregation a name. Each distinct price becomes a bucket, and the meals with that price are collected inside it. The "field" setting holds the name of the field we group by, in our case "Price" (if Price were mapped as text rather than a number, we would aggregate on "Price.keyword" instead). So we do something like this:

GET /menu/_search
{
  "aggs": {
    "my_price_ranges": {
      "terms": {
        "field": "Price"
      }
    }
  }
}

We can get the average price using the "avg" aggregation:

GET /menu/_search
{
  "aggs": {
    "my_average_price": {
      "avg": {
        "field": "Price"
      }
    }
  }
}

We can perform all basic statistics just like that, for example in order to get the maximum price and minimum price:

GET /menu/_search
{
  "aggs": {
    "my_max_price": {
      "max": {
        "field": "Price"
      }
    },
    "my_min_price": {
      "min": {
        "field": "Price"
      }
    }
  }
}

Note: As shown above, you can calculate and get multiple aggregations at once.

We can get all the basic stats (max, min, average, sum, count) for a field at once:

GET /menu/_search
{
  "aggs": {
    "my_stats_on_price": {
      "stats": {
        "field": "Price"
      }
    }
  }
}

The ELK Stack

So far we've learned about Elasticsearch and used Kibana a bit. Elasticsearch is the heart of an ELK Stack. An ELK Stack is simply a set of technologies developed by Elastic that are often used together for various purposes. Although some of them are optional, the most basic and common form of the ELK Stack consists of Elasticsearch, Logstash and Kibana. Let's review what these are:

  • Elasticsearch: Works like a database, takes JSON and performs fast queries.
  • Kibana: A platform to analyze and visualize all things regarding Elasticsearch and our data within. Basically an Elasticsearch dashboard.
  • Logstash: A tool that processes logs (and other data) from your applications and sends them to Elasticsearch; basically a data processing pipeline. It takes input, for example by listening to a Kafka queue or reading a relational database, via input plugins, parses that input with filter plugins, and then ships the result onward with output plugins such as Elasticsearch, Kafka or a mail server (see the pipeline sketch after this list).
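As an illustration of that input → filter → output flow, a minimal Logstash pipeline configuration might look like the sketch below; the log file path, grok pattern and index name are assumptions made up for this example:

input {
  file {
    path => "/var/log/myapp/app.log"   # hypothetical application log file
    start_position => "beginning"
  }
}

filter {
  grok {
    # parse lines like "2023-06-11 12:00:00 INFO Something happened"
    match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:msg}" }
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "app-logs"
  }
}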

So the most basic ELK stack configuration uses Logstash to take our input data, parse it into a form Elasticsearch can read and analyze, and send it to Elasticsearch. Elasticsearch then stores the data, queries it, backs it up and exposes it over HTTP. Finally, Kibana manages, analyzes and visualizes all things Elasticsearch.

Another thing we can use with our stack, although optional, is Beats. Beats come in different flavors for different purposes: Filebeat for shipping log files, Packetbeat for capturing network packet data, Metricbeat for collecting system metrics… but the basic idea behind them is the same: capture and collect data to feed Logstash with.

We can also, optionally, use X-Pack to add a security and authentication layer, grant users different privileges, monitor the performance and health of the entire stack, and even run some machine learning jobs!

Conclusion

Elasticsearch is an engine for storing and backing up data and performing distributed search and analysis. It's scalable, fast, language-independent and very smart.

ELK is an acronym for a set of tools, with Elasticsearch at its heart, configured to work together as a system that collects and processes data from multiple sources. The ELK stack covers a project's search and analytics needs in a fast, smart, flexible manner.

Common uses of Elasticsearch and ELK Stack in general are log analytics, application performance monitoring, security monitoring, and business information analytics.

Everything mentioned in this article comes with very strong documentation that is updated regularly. For more information, and to keep up to date, make sure to visit the official Elastic documentation often.
