How to Configure ZooKeeper for SOLR Instances Effectively


A common issue that seems to be poorly documented is how a SOLR ensemble should be set up. I once had to repair a badly configured ensemble that was built without this understanding, and it caused a lot of headaches.

So here are 4 questions I will address:

  • What is ZooKeeper even doing?
  • How many ZooKeeper instances and servers should I have?
  • How many SOLR instances and servers should I have?
  • How many resources should I allocate to each instance?

I will do my best to make these as concise as I can, despite them being fairly complex topics the deeper you dive into them.

In short, ZooKeeper is almost a perfect name, as it manages the beast that is SOLR. While SOLR can work on its own, ZooKeeper becomes necessary when you need to scale. ZooKeeper is the middleman between all of the SOLR instances and helps guide them where they need to be. Its setup can be simple or complex, with many configurations available. I advise looking at the documentation for your version on the ZooKeeper website for a deeper explanation.

Quick functionality list:

  • Houses schema configs for replication to the SOLR instances
  • Runs a pulse check to track which SOLR instance is the leader, so leadership can be reassigned if the leader goes down or becomes unresponsive
  • Can also house data for replication to the SOLR instances

These are just a few things, but the functionality can go beyond that.
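To make the ensemble side concrete, here is a minimal sketch of a three-node ZooKeeper configuration (`zoo.cfg`). The hostnames and data directory are placeholders, not values from this article:

```properties
# zoo.cfg - minimal three-node ensemble sketch (hostnames are placeholders)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
# One server.N line per ensemble member. Each host also needs a
# "myid" file in dataDir containing its own N.
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```

The same file is deployed to all three servers; only the `myid` file differs per host.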

How Many ZooKeepers to SOLR instances?

This is covered in both the ZooKeeper and Solr documentation, but in an easy-to-miss sentence, and it is hard to grasp what they really mean since they do not elaborate. This is especially true when working with older versions.

Take this quote from the administration guide:

“For a ZooKeeper service to be active, there must be a majority of non-failing machines that can communicate with each other. To create a deployment that can tolerate the failure of F machines, you should count on deploying 2xF+1 machines.”

I have heard this:

“But there is also documentation saying not to go above 5 nodes? What does this really mean, and where does Solr come in?”

Well, doing the math here: with a maximum of 5 ZooKeeper servers (2xF+1 with F=2), 2 of the 5 can fail and ZooKeeper remains functional. This has absolutely nothing to do with the fault tolerance of SOLR itself. However, ZooKeeper must remain up whenever you have 2+ Solr instances, whether they are on the same machine or different ones. SOLR has its own fault tolerance through replication and transaction logging on the servers holding the index. I believe that as of Solr 8.1 replicas default to NRT (near real time), so updates are replicated and committed fairly quickly across them all. However, if ZooKeeper goes down, Solr suddenly struggles to fail over between instances, leader confusion sets in, and it is a total mess.
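The majority rule from the quote can be checked with a few lines of arithmetic. This is just a sketch of the quorum math, not anything Solr- or ZooKeeper-specific:

```python
def tolerated_failures(n_servers: int) -> int:
    """ZooKeeper stays up only while a strict majority of servers is alive.

    With n servers, the quorum is n // 2 + 1, so the ensemble tolerates
    n - (n // 2 + 1) failures. Equivalently, 2*F + 1 servers tolerate F.
    """
    return n_servers - (n_servers // 2 + 1)

for n in (1, 3, 4, 5, 7):
    print(f"{n} servers -> tolerates {tolerated_failures(n)} failure(s)")
```

Note that an even-sized ensemble buys you nothing: 4 servers tolerate the same single failure as 3, which is why odd ensemble sizes are the usual recommendation.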

So, to answer the question, there is a one-to-many relationship between ZK and SOLR. A decent starting point is 4 SOLR instances, preferably all on different servers, and 3 ZK servers. Scaling beyond that depends entirely on your needs and the size of your index. You will have to do some math, but a ZK ensemble should be able to handle quite a few SOLR instances; I have heard of larger companies running hundreds of SOLR instances across 5 ZK servers. The two counts are not really dependent on each other.

How Many SOLR instances?

A SOLR cloud environment requires a minimum of 2 instances. If you have 2 SOLR instances, you need at least 3 ZK instances, and I advise a 1:1 server-to-instance ratio where possible: if a server goes down, it will take every instance on it down with it.
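Pointing Solr at the ensemble is typically done through the `ZK_HOST` variable in `solr.in.sh` (or the `-z` flag to `bin/solr`). A sketch, where the hostnames and the `/solr` chroot are placeholder assumptions:

```bash
# solr.in.sh - connect this Solr instance to the ZK ensemble
# (hostnames and the /solr chroot are placeholders)
ZK_HOST="zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/solr"
```

Listing every ensemble member in the connection string means Solr can still reach ZooKeeper when one ZK server is down.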

How Much RAM and CPU is needed?

Less is more here, in my opinion. Start small and move up as needed. Garbage collection (one of the biggest performance killers) becomes a bigger problem the larger the heap. Think of it as a “the bigger they are, the harder they fall” kind of situation.

This definitely leads into a bigger conversation about tuning the JVM (Java Virtual Machine) and its garbage collector, which can get complex. I would start with around 2GB of RAM dedicated to the JVM heap and 16GB overall dedicated to the server itself.

This gives you room to grow both the JVM and the server without oversizing them. Garbage collection becomes more of a problem above a 2GB heap; beyond that, expect pauses upwards of 500ms per collection during indexing, depending on the data. This is a much, much deeper conversation, however.

In short:

  • Start with a 2GB heap size and tune upward
  • Start with 4-8 cores and move up; memory is the usual bottleneck for SOLR, since that is what the JVM is using
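Those starting values map to a couple of lines in `solr.in.sh`; a sketch, assuming a stock Solr install:

```bash
# solr.in.sh - start with a modest heap and grow only as measurements demand
SOLR_HEAP="2g"
# GC flags can be overridden via GC_TUNE if tuning becomes necessary
```

Raise `SOLR_HEAP` only after observing real memory pressure, since each increase also raises the ceiling on GC pause times.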


Damien Rincon

From Debug to Deploy: Lessons from the Fullstack Trenches.