Dynamo

Gossip

Todo

todo

Failure Detection

Todo

todo

Token Ring/Ranges

Todo

todo

Replication

The replication strategy of a keyspace determines which nodes are replicas for a given token range. The two main replication strategies are SimpleStrategy and NetworkTopologyStrategy.

SimpleStrategy

SimpleStrategy allows a single integer replication_factor to be defined. This determines the number of nodes that should contain a copy of each row. For example, if replication_factor is 3, then three different nodes should store a copy of each row.
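For example, such a keyspace can be created through any CQL client. The sketch below uses the DataStax Python driver and assumes a node reachable at 127.0.0.1; the keyspace name example_ks is made up for illustration.

    from cassandra.cluster import Cluster

    # Assumes a running node at 127.0.0.1; "example_ks" is a hypothetical name.
    session = Cluster(['127.0.0.1']).connect()
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS example_ks
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    """)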

SimpleStrategy treats all nodes identically, ignoring any configured datacenters or racks. To determine the replicas for a token range, Cassandra iterates through the tokens in the ring, starting with the token range of interest. For each token, it checks whether the owning node is already in the set of replicas; if it is not, it is added. This process continues until replication_factor distinct nodes have been added to the set of replicas.
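The following plain-Python sketch (not Cassandra's actual implementation) illustrates that ring walk. It assumes the ring is given as a list of (token, node) pairs sorted by token, where a node may own several tokens.

    from bisect import bisect_left

    def simple_strategy_replicas(ring, token, rf):
        # `ring` is a sorted list of (token, node) pairs; the primary replica
        # owns the first ring token at or after `token`, wrapping around.
        tokens = [t for t, _ in ring]
        start = bisect_left(tokens, token) % len(ring)
        replicas = []
        for offset in range(len(ring)):          # at most one full lap of the ring
            node = ring[(start + offset) % len(ring)][1]
            if node not in replicas:             # only distinct nodes count
                replicas.append(node)
            if len(replicas) == rf:
                break
        return replicas

    # Four tokens owned by three nodes: the walk starting at token 30 visits
    # tokens 50, 75, 0, 25 and collects the three distinct owners.
    ring = [(0, 'node1'), (25, 'node2'), (50, 'node3'), (75, 'node1')]
    print(simple_strategy_replicas(ring, 30, 3))   # ['node3', 'node1', 'node2']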

NetworkTopologyStrategy

NetworkTopologyStrategy allows a replication factor to be specified for each datacenter in the cluster. Even if your cluster only uses a single datacenter, NetworkTopologyStrategy should be preferred over SimpleStrategy to make it easier to add new physical or virtual datacenters to the cluster later.
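For example, a keyspace replicated three ways in each of two datacenters could be defined as follows (a sketch using the DataStax Python driver; the datacenter names must match those reported by the cluster's snitch, and 'dc1'/'dc2' are placeholders):

    from cassandra.cluster import Cluster

    # 'dc1' and 'dc2' must match the datacenter names the snitch reports
    # (e.g. as shown by `nodetool status`); they are placeholders here.
    session = Cluster(['127.0.0.1']).connect()
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS example_ks
        WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3}
    """)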

In addition to allowing the replication factor to be specified per-DC, NetworkTopologyStrategy also attempts to choose replicas within a datacenter from different racks. If the number of racks is greater than or equal to the replication factor for the DC, each replica will be chosen from a different rack. Otherwise, each rack will hold at least one replica, but some racks may hold more than one. Note that this rack-aware behavior has some potentially surprising implications. For example, if the racks do not contain an equal number of nodes, the data load on the smallest rack may be much higher. Similarly, if a single node is bootstrapped into a new rack, it will be considered a replica for the entire ring. For this reason, many operators choose to configure all nodes in a single “rack”.

Tunable Consistency

Cassandra supports a per-operation tradeoff between consistency and availability through Consistency Levels. Essentially, an operation’s consistency level specifies how many of the replicas need to respond to the coordinator in order to consider the operation a success.

The following consistency levels are available:

ONE
Only a single replica must respond.
TWO
Two replicas must respond.
THREE
Three replicas must respond.
QUORUM
A majority (n/2 + 1) of the replicas must respond.
ALL
All of the replicas must respond.
LOCAL_QUORUM
A majority of the replicas in the local datacenter (whichever datacenter the coordinator is in) must respond.
EACH_QUORUM
A majority of the replicas in each datacenter must respond.
LOCAL_ONE
Only a single replica must respond. In a multi-datacenter cluster, this also guarantees that read requests are not sent to replicas in a remote datacenter.
ANY
A single replica may respond, or the coordinator may store a hint. If a hint is stored, the coordinator will later attempt to replay the hint and deliver the mutation to the replicas. This consistency level is only accepted for write operations.
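Consistency levels are chosen per operation by the client. As a brief sketch with the DataStax Python driver (the keyspace and table are hypothetical):

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(['127.0.0.1']).connect('example_ks')  # hypothetical keyspace

    # Write at QUORUM: the coordinator waits for a majority of replicas to acknowledge.
    insert = SimpleStatement(
        "INSERT INTO users (id, name) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.QUORUM)
    session.execute(insert, (42, 'alice'))

    # Read at LOCAL_QUORUM: a majority of the replicas in the coordinator's datacenter.
    select = SimpleStatement(
        "SELECT name FROM users WHERE id = %s",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM)
    row = session.execute(select, (42,)).one()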

Write operations are always sent to all replicas, regardless of consistency level. The consistency level simply controls how many responses the coordinator waits for before responding to the client.
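The following minimal sketch (plain Python, not Cassandra's implementation) shows the shape of that write path: every replica receives the mutation, but the client is acknowledged as soon as enough replicas respond. Here send_to_replica stands in for the actual messaging layer.

    from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

    def coordinator_write(mutation, replicas, block_for, send_to_replica, timeout=2.0):
        # Send the mutation to every replica, regardless of consistency level.
        pool = ThreadPoolExecutor(max_workers=len(replicas))
        futures = {pool.submit(send_to_replica, r, mutation) for r in replicas}
        acks = 0
        # Wait only for `block_for` acknowledgements before answering the client;
        # the remaining replica writes continue in the background.
        while acks < block_for:
            done, futures = wait(futures, timeout=timeout, return_when=FIRST_COMPLETED)
            if not done:
                raise TimeoutError('not enough replica acknowledgements')
            acks += len(done)
        pool.shutdown(wait=False)
        return acks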

For read operations, the coordinator generally only issues read commands to enough replicas to satisfy the consistency level. There are a couple of exceptions to this:

  • Speculative retry may issue a redundant read request to an extra replica if the other replicas have not responded within a specified time window.
  • Based on read_repair_chance and dclocal_read_repair_chance (part of a table’s schema, shown in the example below), read requests may be randomly sent to all replicas in order to repair potentially inconsistent data.
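Both of these behaviors are configured through table options. A sketch via the Python driver (the table name is hypothetical, the values are only illustrative, and the read_repair_chance options apply to Cassandra versions that still support them):

    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('example_ks')
    # speculative_retry sets the time window for the extra read request;
    # the *_read_repair_chance options set the probability of reading from
    # all replicas. Values here are illustrative only.
    session.execute("""
        ALTER TABLE users
        WITH speculative_retry = '99PERCENTILE'
        AND read_repair_chance = 0.0
        AND dclocal_read_repair_chance = 0.1
    """)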

Picking Consistency Levels

It is common to pick read and write consistency levels that are high enough to overlap, resulting in “strong” consistency. This is typically expressed as W + R > RF, where W is the write consistency level, R is the read consistency level, and RF is the replication factor. For example, if RF = 3, a QUORUM request will require responses from at least two of the three replicas. If QUORUM is used for both writes and reads, at least one of the replicas is guaranteed to participate in both the write and the read request, which in turn guarantees that the latest write will be read. In a multi-datacenter environment, LOCAL_QUORUM can be used to provide a weaker but still useful guarantee: reads are guaranteed to see the latest write from within the same datacenter.
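The arithmetic behind this rule can be sketched in a few lines of Python (quorum here is a majority, computed as RF // 2 + 1):

    def quorum(rf):
        # A majority of rf replicas.
        return rf // 2 + 1

    def overlaps(w, r, rf):
        # Write and read replica sets are guaranteed to intersect when W + R > RF.
        return w + r > rf

    rf = 3
    assert overlaps(quorum(rf), quorum(rf), rf)      # QUORUM writes + QUORUM reads
    assert not overlaps(1, 1, rf)                    # ONE + ONE: no guarantee
    assert overlaps(rf, 1, rf)                       # ALL writes + ONE reads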

If this type of strong consistency isn’t required, lower consistency levels like ONE may be used to improve throughput, latency, and availability.