.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements.  See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership.  The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License.  You may obtain a copy of the License at
..
..     http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS,
.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
.. See the License for the specific language governing permissions and
.. limitations under the License.

Find The Misbehaving Nodes
==========================

The first step in troubleshooting a Cassandra issue is to use error messages,
metrics, and monitoring information to determine whether the issue lies with
the clients or the server. If it lies with the server, the next step is to find
the problematic nodes in the Cassandra cluster. The goal is to determine whether
this is a systemic issue (e.g. a query pattern that affects the entire cluster)
or one isolated to a subset of nodes (e.g. neighbors holding a shared token
range, or even a single node with bad hardware).

There are many sources of information that help determine where the problem
lies. Some of the most common are mentioned below.

Client Logs and Errors
----------------------
Clients of the cluster often leave the best breadcrumbs to follow. Perhaps
client latencies or error rates have increased in a particular datacenter
(likely eliminating the nodes in other datacenters), or clients are receiving
an error code that points to a specific kind of problem.
Troubleshooters can often rule out many failure modes just by reading the error
messages. In fact, many Cassandra error messages include the last coordinator
contacted to help operators find nodes to start with.

Some common errors (likely culprit in parentheses), assuming the client uses
error names similar to those of the DataStax :ref:`drivers <client-drivers>`:

* ``SyntaxError`` (**client**): This and other ``QueryValidationException``
  errors indicate that the client sent a malformed request. These are rarely
  server issues and usually indicate bad queries.
* ``UnavailableException`` (**server**): This means that the Cassandra
  coordinator node has rejected the query because it believes that insufficient
  replica nodes are available. If many coordinators are throwing this error, it
  likely means that nodes (typically several) really are down in the cluster,
  and you can identify them using :ref:`nodetool status <nodetool-status>`. If
  only a single coordinator is throwing this error, it may mean that node has
  been partitioned from the rest.
* ``OperationTimedOutException`` (**server**): This is the most frequent
  timeout message raised when clients set timeouts. It is a *client side*
  timeout, meaning that the query took longer than the timeout the client
  specified. The error message will include the coordinator node that was last
  tried, which is usually a good starting point. This error usually indicates
  either aggressive client timeout values or high-latency server
  coordinators/replicas.
* ``ReadTimeoutException`` or ``WriteTimeoutException`` (**server**): These
  are raised when clients do not specify lower timeouts, so the *coordinator*
  times out the request based on the values supplied in the ``cassandra.yaml``
  configuration file. They usually indicate a serious server side problem, as
  the default values are usually multiple seconds.
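
The triage described in this list can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the
``(error_name, coordinator)`` input format and both helper names are
hypothetical and not part of any driver, and the error names simply follow the
list above.

.. code-block:: python

    from collections import Counter

    # Likely culprit per error name, taken from the list above.
    LIKELY_CULPRIT = {
        "SyntaxError": "client",
        "UnavailableException": "server",
        "OperationTimedOutException": "server",
        "ReadTimeoutException": "server",
        "WriteTimeoutException": "server",
    }


    def likely_culprit(error_name):
        """Return "client", "server", or "unknown" for an error name."""
        return LIKELY_CULPRIT.get(error_name, "unknown")


    def coordinators_reporting(errors, error_name):
        """Count how often each coordinator reported ``error_name``.

        ``errors`` is a hypothetical iterable of (error_name, coordinator)
        pairs that you extracted from your client logs.
        """
        return Counter(coord for name, coord in errors if name == error_name)


    # Made-up sample data for illustration only.
    errors = [
        ("UnavailableException", "10.0.0.1"),
        ("UnavailableException", "10.0.0.1"),
        ("SyntaxError", "10.0.0.2"),
    ]
    unavailable = coordinators_reporting(errors, "UnavailableException")
    if len(unavailable) == 1:
        print("One coordinator reports unavailability; it may be partitioned.")
    elif len(unavailable) > 1:
        print("Several coordinators report unavailability; check node status.")

Counting distinct coordinators per error type is what separates the "many
coordinators" case from the "single coordinator" case described for
``UnavailableException``.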

Metrics
-------

If you have Cassandra :ref:`metrics <monitoring-metrics>` reporting to a
centralized location such as `Graphite <https://graphiteapp.org/>`_ or
`Grafana <https://grafana.com/>`_, you can typically use those to narrow down
the problem. At this stage, the main goal is to narrow the issue down to a
particular datacenter, rack, or even group of nodes. Some helpful metrics
to look at are:

Errors
^^^^^^
Cassandra refers to internode messaging errors as "drops", and provides a
number of :ref:`Dropped Message Metrics <dropped-metrics>` to help narrow
down errors. If particular nodes are actively dropping messages, they are
likely related to the issue.
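
One way to spot nodes that are *actively* dropping messages is to compare two
samples of the dropped message counts taken some interval apart. The sketch
below assumes you have already collected cumulative counts per node into plain
dictionaries; the function name and input shape are hypothetical.

.. code-block:: python

    def nodes_actively_dropping(previous, current):
        """Return nodes whose dropped-message count grew between two samples.

        ``previous`` and ``current`` are hypothetical dicts mapping a node
        name to its cumulative dropped-message count. A node that is missing
        from ``previous`` is treated as having started at zero.
        """
        return sorted(
            node
            for node, count in current.items()
            if count > previous.get(node, 0)
        )


    # Made-up sample data for illustration only.
    previous = {"n1": 10, "n2": 0, "n3": 7}
    current = {"n1": 10, "n2": 4, "n3": 9}
    print(nodes_actively_dropping(previous, current))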

Latency
^^^^^^^
For timeouts or latency related issues you can start with :ref:`Table
Metrics <table-metrics>` by comparing coordinator level metrics, e.g.
``CoordinatorReadLatency`` or ``CoordinatorWriteLatency``, with their associated
replica metrics, e.g. ``ReadLatency`` or ``WriteLatency``. Issues usually show
up on the ``99th`` percentile before they show up on the ``50th`` percentile or
the ``mean``. While ``maximum`` coordinator latencies are not typically very
helpful due to the exponentially decaying reservoir used internally to produce
metrics, ``maximum`` replica latencies that correlate with increased ``99th``
percentiles on coordinators can help narrow down the problem.

There are usually three main possibilities:

1. Coordinator latencies are high on all nodes, but only a few nodes' local
   read latencies are high. This points to slow replica nodes, with the high
   coordinator latencies being just a side effect. This usually happens when
   clients are not token aware.
2. Coordinator latencies and replica latencies increase at the
   same time on a few nodes. If clients are token aware this is almost
   always what happens, and it points to slow replicas of a subset of token
   ranges (only part of the ring).
3. Coordinator and local latencies are high on many nodes. This usually
   indicates either a tipping point in the cluster capacity (too many writes or
   reads per second), or a new query pattern.
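
The three cases above can be expressed as a rough classifier over per-node
``99th`` percentile latencies. This is a sketch rather than a definitive rule:
the function name, the dictionary inputs, the ``threshold_ms`` cut-off, and the
reading of "many nodes" as "more than half" are all assumptions chosen for
illustration.

.. code-block:: python

    def classify_latency_pattern(coordinator_p99, replica_p99, threshold_ms):
        """Map per-node p99 latencies onto the three cases above.

        Both dicts are hypothetical: node name -> p99 latency in
        milliseconds. ``threshold_ms`` is whatever "high" means for your
        cluster.
        """
        nodes = coordinator_p99.keys() & replica_p99.keys()
        slow_coordinators = {
            n for n in nodes if coordinator_p99[n] > threshold_ms
        }
        slow_replicas = {n for n in nodes if replica_p99[n] > threshold_ms}
        half = len(nodes) / 2  # assumption: "many" means more than half

        if len(slow_coordinators) > half and len(slow_replicas) > half:
            return "cluster-wide"       # case 3
        if len(slow_coordinators) > half and slow_replicas:
            return "slow-replicas"      # case 1
        if slow_coordinators & slow_replicas:
            return "slow-token-ranges"  # case 2
        return "unclear"


    # Made-up sample data: every coordinator is slow, one replica is slow.
    coordinator = {"n1": 80, "n2": 85, "n3": 90, "n4": 82}
    replica = {"n1": 4, "n2": 70, "n3": 5, "n4": 6}
    print(classify_latency_pattern(coordinator, replica, threshold_ms=20))

In practice the threshold is better derived from each node's own baseline than
from a fixed number, but a fixed value keeps the example short.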

It's important to remember that, depending on the client's load balancing
behavior and consistency levels, coordinator and replica metrics may or may
not correlate. In particular, if you use ``TokenAware`` policies the same
node's coordinator and replica latencies will often increase together, but if
you just use normal ``DCAwareRoundRobin``, coordinator latencies can increase
along with the latencies of unrelated replica nodes. For example:

* ``TokenAware`` + ``LOCAL_ONE``: should always have coordinator and replica
  latencies on the same node rise together.
* ``TokenAware`` + ``LOCAL_QUORUM``: should always have coordinator and
  multiple replica latencies rise together in the same datacenter.
* ``TokenAware`` + ``QUORUM``: replica latencies in other datacenters can
  affect coordinator latencies.
* ``DCAwareRoundRobin`` + ``LOCAL_ONE``: coordinator latencies and unrelated
  replica nodes' latencies will rise together.
* ``DCAwareRoundRobin`` + ``LOCAL_QUORUM``: different coordinator and replica
  latencies will rise together with little correlation.

Query Rates
^^^^^^^^^^^
Sometimes the :ref:`Table <table-metrics>` query rate metrics can help
narrow down load issues, as a "small" increase in coordinator queries per
second (QPS) may correlate with a very large increase in replica level QPS.
This most often happens with ``BATCH`` writes, where a client may send a single
``BATCH`` query containing 50 statements. If you have 9 copies (RF=3, three
datacenters), every coordinator ``BATCH`` write turns into 450 replica writes!
This is why keeping each ``BATCH`` to a single partition is so critical;
otherwise you can exhaust significant CPU capacity with a "single" query.
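
The amplification in this example is simple multiplication, sketched below.
The function name is hypothetical, and the calculation assumes the worst case
implied above: every statement in the ``BATCH`` targets a different partition
and every datacenter uses the same replication factor.

.. code-block:: python

    def replica_writes_per_batch(statements, replication_factor, datacenters):
        """Worst-case replica writes caused by one multi-partition BATCH.

        Assumes each statement hits a different partition and that all
        datacenters share the same replication factor.
        """
        copies = replication_factor * datacenters
        return statements * copies


    # 50 statements, RF=3, three datacenters, as in the example above.
    print(replica_writes_per_batch(50, 3, 3))  # 450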


Next Step: Investigate the Node(s)
----------------------------------

Once you have narrowed down the problem as much as possible (datacenter, rack,
node), log in to one of the nodes using SSH and proceed to debug using
:ref:`logs <reading-logs>`, :ref:`nodetool <use-nodetool>`, and
:ref:`os tools <use-os-tools>`. If you are not able to log in, you may still
have access to :ref:`logs <reading-logs>` and :ref:`nodetool <use-nodetool>`
remotely.