A SolrCloud cluster holds one or more distributed indexes which are called Collections. Each Collection is divided into shards (to increase write capacity) and each shard has one or more replicas (to increase query capacity). One replica from each shard is elected as a leader, who performs the additional task of adding a ‘version’ to each update before streaming it to available replicas. This means that write traffic for a particular shard hits the shard’s leader first and is then synchronously replicated to all available replicas. One Solr node (a JVM instance) may host a few replicas belonging to different shards or even different collections.
All nodes in a SolrCloud cluster talk to a Apache ZooKeeper ensemble (an odd number of nodes, typically 3 or more for a production deployment). Among other things, ZooKeeper stores:
- The cluster state of the system (details of collections, shards, replicas and leaders)
- The set of live nodes at any given time (determined by heartbeat messages sent by Solr nodes to ZooKeeper)
- The state of each replica (active, recovering or down)
- Leader election queues (a queue of live replicas for each shard such that the first in the list attempts to become the leader assuming it fulfils certain conditions)
- Configurations (schema etc) which is shared by each replica in a collection.
Each Solr replica keeps a Lucene index (partly in memory and disk for performance) and a write-ahead transaction log. An update request to SolrCloud is firstly, written to the transaction log, secondly to the lucene index and thirdly, if the replica is a leader, streamed synchronously to all available replicas of the same shard. A commit command flushes the lucene indexes to disk and makes new documents visible to searchers. Therefore, searches are eventually consistent. Real time “gets” of documents can be done at any time using the document’s unique key, returning content from either the Lucene index (for committed docs) or the transaction log for uncommitted docs. Such real time “gets” are always answered by the leader for consistency.If a leader dies, a new leader is elected from among the ‘live’ replicas as long as the last published state of the replica was active. The selected replica syncs from all other live replicas (in both directions) to make sure that it has the latest updates among all replicas in the cluster before becoming the leader.
If a leader is not able to send an update to a replica then it updates ZooKeeper to publish the replica as inactive and at the same time spawns a background thread to ask the replica to recover. Both of these actions together make sure that a replica which loses an update neither becomes a leader nor receives traffic from other Solr nodes and smart clients before recovering from a leader node.
Solr nodes (and Solrj clients) forward search requests to active replicas. The active replicas will continue to serve search traffic even if they lose connectivity to ZooKeeper. If the entire ZooKeeper cluster becomes unavailable then the cluster state is effectively frozen which means that searches continue to be served by whoever was active at the time of ZooKeeper going down, but replica recovery cannot happen during this time. However replicas which are not active are allowed to return results if, somehow, a non-distributed request reaches them. This is needed for the sync that happens during leader election as well as useful for debugging at times.
In summary, SolrCloud chooses consistency over availability during writes and it prefers consistency for searches but under severe conditions such as ZooKeeper being unavailable, degrades to possibly serving stale data.