Real-timish Elasticsearch

The product I work on uses Elasticsearch heavily. We use it for product filtering across many verticals selected by the user and filtered by our business rules. We also use it to show similar and recommended products (via similarity vectors). An uncompressed Elasticsearch response for the more important pages is typically 3-4 MB, so you can guess how complicated it gets. Peak traffic is also quite heavy, so we really have to care about Elasticsearch performance. Here are some tips on how to survive.

Don’t use it as storage

Store only the values your queries use, plus a resource identifier. Fetch as little data as possible, ideally just the ordered ids and maybe the score.
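A minimal sketch of what "fetch just ids and score" can look like in a request body, assuming you build the search body as a plain dict before sending it. The helper name build_ids_only_search and the example filter are made up for illustration. Setting "_source" to false asks Elasticsearch to leave the document source out of each hit, so hits carry only metadata such as _id and _score.

```python
import json


def build_ids_only_search(query, size=50):
    """Return a search body that asks for hits without their _source.

    `query` is any Elasticsearch query clause as a dict.
    """
    return {
        "query": query,
        "size": size,
        # Do not ship the stored document back; hits keep _id and _score.
        "_source": False,
    }


# Hypothetical filter, for illustration only.
body = build_ids_only_search({"term": {"category": "shoes"}}, size=20)
print(json.dumps(body, sort_keys=True))
```

The full product data can then be loaded by id from your primary storage.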

Optimize queries

It’s a very complex topic, of course, but:

Transport layer

It’s (gzipped) JSON over HTTP, so it is slower than standard DB connections. Make sure both the transport and the response parsing are really fast.
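As a rough illustration of keeping the parsing step small, here is a sketch that takes a raw gzipped response body and pulls out only (id, score) pairs. It assumes you have the raw bytes in hand (many HTTP clients decompress transparently, in which case pass gzipped=False). The function name and the sample payload are invented for the example; only the hits.hits / _id / _score layout comes from the Elasticsearch search response format.

```python
import gzip
import json


def parse_ids_and_scores(raw_body, gzipped=True):
    """Decompress (if needed) and parse a search response into (id, score) pairs."""
    payload = gzip.decompress(raw_body) if gzipped else raw_body
    response = json.loads(payload)
    # Touch only the fields we need instead of building richer objects.
    return [(hit["_id"], hit.get("_score")) for hit in response["hits"]["hits"]]


# Invented sample response, shaped like a search result with _source disabled.
sample = {
    "hits": {
        "total": 2,
        "hits": [
            {"_id": "42", "_score": 1.7},
            {"_id": "7", "_score": 0.9},
        ],
    }
}
raw = gzip.compress(json.dumps(sample).encode("utf-8"))
pairs = parse_ids_and_scores(raw)
print(pairs)
```

If parsing shows up in your profiles, swapping in a faster JSON parser is the kind of change this section is about; measure it on your own payloads first.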

Cache

The most obvious one. You can use a SHA (or other hash) of the Elasticsearch query as the cache key and serve slightly stale responses. In our case we used an abstraction over the query to make the cache key more stable. You can also add a grace period to cached Elasticsearch responses: if Elasticsearch crashes, you still have the old cached value.
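The idea above can be sketched as follows. This is not our production code, just a minimal in-process version under stated assumptions: the query is a JSON-serializable dict, the cache is a plain dict (in practice you would likely use Redis, memcached or similar), and the names cache_key, GraceCache, ttl and grace are made up for the example. Serializing with sorted keys is one simple way to make the key stable against key ordering.

```python
import hashlib
import json
import time


def cache_key(query):
    """Hash a canonical JSON form of the query so key order does not matter."""
    canonical = json.dumps(query, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class GraceCache:
    """Serve fresh entries for `ttl` seconds; on fetch failure, serve stale
    entries for up to `grace` more seconds."""

    def __init__(self, ttl, grace, clock=time.monotonic):
        self.ttl = ttl
        self.grace = grace
        self.clock = clock
        self._store = {}  # key -> (stored_at, value)

    def get(self, query, fetch):
        key = cache_key(query)
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] <= self.ttl:
            return entry[1]  # still fresh, skip Elasticsearch entirely
        try:
            value = fetch(query)  # `fetch` is whatever calls Elasticsearch
        except Exception:
            # Elasticsearch failed: fall back to the stale value if allowed.
            if entry is not None and now - entry[0] <= self.ttl + self.grace:
                return entry[1]
            raise
        self._store[key] = (now, value)
        return value
```

Usage is cache.get(query_dict, run_search), where run_search is your own function that sends the query and returns the parsed result.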

Index settings

Cluster settings

Reduce the pending search task queue to a small multiple of the number of search threads, e.g. if you have 40 worker threads per node, use 100 or 200. It’s better to reject some queries than to put the node into a high-CPU state with loads of pending tasks. It’s also worth noting that ES does not seem to enforce the limit exactly: I’ve seen around 8000 pending tasks with the limit set to 1000.
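To make the arithmetic concrete, here is a small sketch that derives a queue size from the thread count and renders it as a config line. The helper names and the default multiple of 5 are my own choices for illustration. The setting name thread_pool.search.queue_size is what I would expect in elasticsearch.yml on 5.x; treat it as an assumption and check the thread pool documentation for your version.

```python
def search_queue_size(search_threads, multiple=5):
    """Queue size as a small multiple of the search thread count."""
    if search_threads < 1 or multiple <= 0:
        raise ValueError("search_threads must be >= 1 and multiple must be > 0")
    return int(round(search_threads * multiple))


def render_setting(search_threads, multiple=5):
    """Render the value as an elasticsearch.yml line (setting name assumed)."""
    size = search_queue_size(search_threads, multiple)
    return "thread_pool.search.queue_size: %d" % size


print(render_setting(40))       # thread_pool.search.queue_size: 200
print(render_setting(40, 2.5))  # thread_pool.search.queue_size: 100
```

With a short queue, clients will see rejections under overload, so the calling side needs to handle them (for example by serving a cached response).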

Critical events

When is ES at its slowest? It becomes slow when the index is being updated with new data, both new documents and updates to existing documents. That’s OK when it happens during low-traffic hours, but updates during high traffic can hurt. Ideally your query cache should not get too many misses during the refresh period. Of course, the bigger and more complex your standard query, the worse it gets.

Some unknowns

You should know that we use an older Elasticsearch version, 5.6.10, so some tips may be out of date. However, when I worked on an upgrade project I saw that the newer version was not faster; performance seemed quite similar. Some defaults were different, which could affect out-of-the-box performance, but for our needs it remained the same (still, go with the defaults if you can).

If you feel the list is incomplete, please tweet your advice. I’m still looking for better ES performance.