A comprehensive summary of ElasticSearch

What is ElasticSearch?

ElasticSearch is a search server that works on top of Lucene. A search engine is a technology that lets you run text searches very efficiently. The best-known example of this use case is Google, although there are others such as Yahoo or Bing. Although today you do not need a huge index to justify this kind of technology, it is widely used because it provides very good read speed. Today practically all websites use similar technology at some level of the application.

A search engine is based on a data structure called an inverted index, which maps each word to the documents that mention it. Beyond that, it lets you compute aggregations or sort documents very efficiently. Compared to a database, this kind of technology gives you very good read speed but is difficult to maintain when you have to make too many modifications per second.
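The inverted index idea can be sketched in a few lines. This is an illustration only: Lucene's real index also stores term positions, frequencies, and compressed posting lists, and the documents here are made up.

```python
from collections import defaultdict

# Minimal sketch of an inverted index: each word maps to the set of
# document IDs that contain it. (Illustrative only -- not Lucene's
# actual on-disk format.)
def build_inverted_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {
    1: "Samsung 55 inch QLED television",
    2: "LG OLED television stand",
    3: "Samsung Galaxy phone",
}
index = build_inverted_index(docs)

print(index["television"])                     # docs mentioning "television"
print(index["samsung"] & index["television"])  # intersection = AND query
```

Answering "which documents contain word X" is a single dictionary lookup, which is why reads are so cheap; the flip side is that every document update touches many posting lists, matching the write-cost trade-off described above.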

For a little context: Lucene is an open-source project founded in 1999, initially hosted at SourceForge, that moved to Apache in 2001. Under the Lucene and Apache umbrella other projects emerged, such as Tika, Mahout, or Nutch, and finally Apache Solr, the Apache Software Foundation's equivalent of ElasticSearch. Apache Solr was the server layer that exposed Lucene through a REST API, trying to simplify all of its internal complexity.

Following the creation of Apache Solr, other software appeared that also implemented a server layer on top of Lucene, tackling the parts of this complexity that Apache Solr did not solve especially well. One of them was ElasticSearch, in 2010.

What happened when Lucene entered the market?

The popularization of this type of software has allowed many companies to scale their growth costs very well: performing these kinds of massive read operations against a database can be prohibitively expensive, while performing them against an inverted index reduces the infrastructure cost significantly. The idea behind this open-source software is to have something similar to what Google had at the time: a relevance algorithm based on TF-IDF that can page through documents 10 at a time, search text efficiently, filter content, and so on.
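As a quick illustration of the TF-IDF relevance idea mentioned above, here is a toy scorer. This is the classic textbook formula, simplified; Lucene historically used a TF-IDF variant, and modern Elasticsearch defaults to BM25.

```python
import math

# Toy TF-IDF: score = term frequency in the document * inverse
# document frequency across the corpus. A rare term that appears
# often in a document scores high.
def tf_idf(term, doc, corpus):
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [
    ["cheap", "television", "deal"],
    ["television", "television", "wall", "mount"],
    ["laptop", "deal"],
]
# "television" appears twice in doc 1 and in 2 of the 3 documents:
score = tf_idf("television", corpus[1], corpus)
print(round(score, 3))
```

Documents are then returned sorted by this score, which is what makes the "top 10 results" pagination cheap: only the highest-scoring documents need to be materialized.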

Advantages of ElasticSearch

What ElasticSearch provides compared with other frameworks is a distributed architecture where it is relatively easy to add and remove nodes. In addition, Elastic has focused on exposing the search engine's operations through a REST API, which lets you build integrations in any programming language in a simple way, as long as you follow the rules the API defines. So we can find client implementations in PHP, Java, Ruby, Golang, etc.

ElasticSearch use cases

Product catalog and text search engines

What would a typical ElasticSearch use case look like? For example, a relatively large product catalog with 2-3 million documents where you need to search by text efficiently. Think of the FNAC product catalog: you go to the search box and try to locate a specific television model you want to buy. The documents have to be indexed by words so that when you search for a specific term (in this case the television model), Elastic returns the documents that mention that model.
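A catalog search like this is typically expressed as a full-text `match` query in the Elasticsearch Query DSL. The body below follows the standard DSL shape; the index name (`products`) and field name (`title`) are made up for the example.

```python
import json

# Body of a full-text search request against a hypothetical "products"
# index. It would be sent as: GET /products/_search
query = {
    "query": {
        "match": {
            "title": "Samsung QLED 55"   # free-text query, analyzed into terms
        }
    },
    "from": 0,
    "size": 10,  # paginate results 10 at a time
}

print(json.dumps(query, indent=2))
```

Because the API is plain JSON over HTTP, the same body works from any language or from `curl`, which is the integration simplicity described in the previous section.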

Time series and logs

Another typical use case is time series, like what we have in Graylog and Kibana, which let us store logs and search them in a relatively short time. This second use case seems to be Elasticsearch's current business niche, which is why they have released products such as Logstash, APM, and Kibana.

The Elastic family

Around the original search engine, known as ElasticSearch, complementary products such as Kibana, Logstash, Beats, and APM have emerged.

ElasticSearch is the search server that works on top of Lucene.

Logstash and Beats are designed to transform log lines (or other types of input, such as machine performance indicators) and ship them to Elasticsearch, so that the data can then be read as time series.

Kibana is a set of tools that lets you explore this data and build visualizations of it in a very compelling way.

APM is focused on the performance monitoring use case. It is an Elasticsearch + Beats + Kibana configuration optimized for the application performance monitoring use case.

Primary shards and replica shards

A basic Elastic concept is that indexes are divided into shards: the index for a product catalog could be divided into 4 parts, for instance. An index can also have multiple replicas, but replication is applied at the shard level. In practice, each shard acts almost as an independent index.
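The way a document ends up in one of those 4 parts is a hash-modulo routing scheme. Elasticsearch actually hashes the routing value (the document ID by default) with murmur3; the sketch below uses CRC32 purely to illustrate the modulo idea.

```python
import zlib

# Sketch of shard routing: a stable hash of the document ID, modulo
# the number of primary shards. (Elasticsearch uses murmur3; CRC32
# is just a stand-in that is equally deterministic.)
def route_to_shard(doc_id: str, num_primary_shards: int) -> int:
    return zlib.crc32(doc_id.encode()) % num_primary_shards

shard = route_to_shard("product-42", 4)
# The same ID always routes to the same shard:
assert shard == route_to_shard("product-42", 4)
print(shard)
```

This modulo dependency is also why the number of primary shards is fixed at index creation: changing it would reroute every existing document, so growing that number means reindexing.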

The important thing for high availability is that a primary shard and its replica are never on the same node. That way, if a node goes down, a copy of its contents is available on another node. Replicating your shards also improves read throughput: several nodes hold the same data and can serve queries, so more requests can be handled in parallel.

At the shard level it is important to size shards effectively. We have seen misconfigured clusters with up to 1500 indexes of 1 MB each, with 4 shards apiece. This makes no sense: you spend a lot of resources giving the cluster extra work that generates no performance gain. A cluster that receives relatively few requests would suffer with this configuration simply from having to manage so many indexes and shards.
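The arithmetic behind that misconfiguration makes the problem obvious: thousands of shards of cluster-state overhead for barely a gigabyte of data.

```python
# Overhead math for the misconfiguration described above:
# 1500 indexes x 1 MB each, 4 shards apiece.
indexes = 1500
shards_per_index = 4
index_size_mb = 1

total_primary_shards = indexes * shards_per_index     # shards to track
total_data_gb = indexes * index_size_mb / 1024        # actual data held
avg_shard_size_mb = index_size_mb / shards_per_index  # per-shard payload

print(total_primary_shards)          # 6000 primary shards...
print(round(total_data_gb, 2))       # ...for roughly 1.46 GB of data
print(avg_shard_size_mb)             # 0.25 MB per shard
```

Every one of those 6000 shards is a Lucene index the masters must track and the data nodes must keep open, which is pure overhead at these sizes.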

Elastic node types: Master node and data node

The master nodes are responsible for tasks such as creating and deleting indexes, keeping track of all the nodes in the cluster, and deciding where to place each shard based on that information. The data nodes, meanwhile, hold the documents you save in the indexes and handle search operations, sorting, aggregations, and so on.

Data nodes require much more memory than master nodes. The same node can perform both roles (master and data), but it is highly recommended to separate them onto different nodes to avoid performance degradation.

How many shards does your index need?

Choosing the number of shards never has a simple answer, but there are several basic guidelines. For instance, if we have a catalog that requires many writes per minute, it may be interesting to have more shards to isolate those writes. An example would be a time series or log index: we only want to write to today's index, leaving the indexes from previous days unchanged and fully optimized for queries.

In a scenario with more requests per second, we are interested in having more replicas. As we have seen before, this allows us to distribute the load among several nodes and serve more requests per second.

In a scenario with large datasets, we should follow the golden rule of keeping shards between 10 and 40 GB: that is the ideal size for not losing performance on the node at search time and not overloading the master nodes with cluster state.
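That golden rule translates into a simple back-of-the-envelope estimator. The 30 GB target below is an assumed midpoint of the 10-40 GB range, not an official figure; pick your own target within the range.

```python
import math

# Rough primary-shard estimate from the 10-40 GB rule of thumb.
# target_shard_gb = 30 is an assumed midpoint of that range.
def estimate_primary_shards(dataset_gb: float, target_shard_gb: float = 30) -> int:
    return max(1, math.ceil(dataset_gb / target_shard_gb))

print(estimate_primary_shards(600))  # large dataset: 600 GB -> 20 shards
print(estimate_primary_shards(5))    # small dataset -> 1 shard is enough
```

Remember that this only sizes primary shards; replicas multiply the total on-disk footprint and are chosen separately, based on the read load and availability requirements discussed above.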

In a second scenario with a small dataset, a single shard should be more than enough. If you need high availability, you can always add replicas to achieve good read throughput as well.
