
.. _alerts_metrics:

Definitions of alert metrics
============================
From the Alerts area of OpsCenter Enterprise Edition, you can configure alert
thresholds for a number of Cassandra :ref:`cluster-wide <cluster-wide-alerts>`, :ref:`column family <adv-cf-alerts>`, and :ref:`operating
system metrics <os-system-alerts>`. This proactive monitoring feature is available only in OpsCenter Enterprise Edition.

Commonly watched alert metrics
------------------------------
OpsCenter provides the capability to configure alerts for the following most
commonly watched Cassandra and system metrics.

======================= ========================================================
Metric                  Definition
======================= ========================================================
Node Down               When a node is not responding to requests, it is marked
                        as down.
Write Requests          The number of write requests per second. Monitoring the
                        number of writes over a given time period can
                        give you an idea of system write workload and usage
                        patterns.
Write Request Latency   The response time (in milliseconds) for successful
                        write operations. The time period starts when a 
                        node receives a client write request, and ends when the
                        node responds back to the client.
Read Requests           The number of read requests per second. Monitoring the
                        number of reads over a given time period can
                        give you an idea of system read workload and usage
                        patterns.
Read Request Latency    The response time (in milliseconds) for successful read
                        operations. The time period starts when a 
                        node receives a client read request, and ends when the
                        node responds back to the client.
CPU Usage               The percentage of time that the CPU was busy, which is
                        calculated by subtracting the percentage of time the
                        CPU was idle from 100 percent.
Load                    Load is a measure of the amount of work that a computer
                        system performs. An idle computer has a load 
                        number of 0 and each process using or waiting for CPU
                        time increments the load number by 1.
======================= ========================================================
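
The CPU Usage definition above is simple arithmetic. The following is a minimal sketch of that calculation, not OpsCenter code; the function names and the 90 percent threshold are hypothetical and chosen only for illustration.

.. code-block:: python

    def cpu_usage_percent(idle_percent: float) -> float:
        """Return CPU usage as 100 percent minus the idle percentage."""
        if not 0.0 <= idle_percent <= 100.0:
            raise ValueError("idle_percent must be between 0 and 100")
        return 100.0 - idle_percent


    def exceeds_threshold(idle_percent: float, threshold_percent: float = 90.0) -> bool:
        """Return True when CPU usage is above the (hypothetical) alert threshold."""
        return cpu_usage_percent(idle_percent) > threshold_percent


    if __name__ == "__main__":
        # A CPU that was idle 12.5 percent of the time was busy 87.5 percent of the time.
        print(cpu_usage_percent(12.5))
        print(exceeds_threshold(5.0))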

.. _cluster-wide-alerts:

Advanced Cassandra alert metrics
--------------------------------
OpsCenter provides the ability to configure alerts for the following Cassandra
metrics. These metrics are aggregated across all nodes in the cluster.

================================= ==================================================================
Metric                            Definition
================================= ==================================================================
Heap Max                          The maximum amount of shared memory allocated
                                  to the JVM heap for Cassandra processes.
Heap Used                         The amount of shared memory in use by the JVM
                                  heap for Cassandra processes.
JVM CMS Collection Count          The number of concurrent mark-sweep (CMS) garbage collections performed by the JVM per second.
JVM ParNew Collection Count       The number of parallel new-generation garbage collections performed by the JVM per second. 
JVM CMS Collection Time           The time spent collecting CMS garbage in milliseconds per second (ms/sec).
JVM ParNew Collection Time        The time spent performing ParNew garbage collections in ms/sec. 
Data Size                         The size of column family data (in gigabytes)
                                  that has been loaded/inserted into Cassandra, including 
                                  any storage overhead and system metadata.
Compactions Pending               The number of compaction operations that are
                                  queued and waiting for system resources in order to run. 
                                  The optimal number of pending compactions is
                                  0 (or at most a very small number). A value
                                  greater than 0 indicates that read operations are in I/O
                                  contention with compaction operations, which
                                  usually manifests itself as declining read performance.
Total Bytes Compacted             The amount of SSTable data compacted, in bytes per second.
Total Compactions                 The number of compactions (minor or major) performed per second.
Flush Sorter Tasks Pending        The flush sorter process performs the first
                                  step in the overall process of flushing memtables to disk 
                                  as SSTables. The optimal number of pending
                                  flushes is 0 (or at most a very small number).
Flushes Pending                   The flush process flushes memtables to disk
                                  as SSTables. This metric shows the number of memtables queued 
                                  for the flush process. The optimal number of
                                  pending flushes is 0 (or at most a very small
                                  number).
Gossip Tasks Pending              Cassandra uses a protocol called gossip to
                                  discover location and state information about the other
                                  nodes participating in a Cassandra cluster. In
                                  Cassandra, the gossip process runs once per second on each
                                  node and exchanges state messages with up to three
                                  other nodes in the cluster. Gossip tasks pending shows the 
                                  number of gossip messages and acknowledgments queued and
                                  waiting to be sent or received. The optimal number of pending
                                  gossip tasks is 0 (or at most a very small number).
Hinted Handoff Pending            While a node is offline, other nodes in the
                                  cluster will save hints about rows that were updated 
                                  during the time the node was unavailable.
                                  When a node comes back online, its corresponding replicas 
                                  will begin streaming the missed writes to the node to catch it
                                  up. The hinted handoff pending metric 
                                  tracks the number of hints that are queued and waiting
                                  to be delivered once a failed node is back online 
                                  again. High numbers of pending hints are commonly seen
                                  when a node is brought back online after some down 
                                  time. Viewing this metric can help you determine when
                                  the recovering node has been made consistent again.
Internal Responses Pending        The number of pending tasks from various internal tasks
                                  such as nodes joining and leaving the cluster.
Manual Repair Tasks Pending       The number of operations still to be completed when you
                                  run anti-entropy repair on a node. It will only 
                                  show values greater than 0 when a repair is in progress.
                                  It is not unusual to see a large number of 
                                  pending tasks when a repair is running, but you should
                                  see the number of tasks progressively decreasing.
Memtable Post Flushers Pending    The memtable post flush process performs the final step
                                  in the overall process of flushing memtables to 
                                  disk as SSTables. The optimal number of pending flushes
                                  is 0 (or at most a very small number).
Migrations Pending                The number of pending tasks from system methods that
                                  have modified the schema. Schema updates have to 
                                  be propagated to all nodes, so pending tasks for this
                                  metric can manifest in schema disagreement errors.
Misc. Tasks Pending               The number of pending tasks from other miscellaneous
                                  operations that are not run frequently.
Read Requests Pending             The number of read requests that have arrived into the
                                  cluster but are waiting to be handled. During 
                                  low or moderate read load, you should see 0 pending read
                                  operations (or at most a very low number).
Read Repair Tasks Pending         The number of read repair operations that are queued and
                                  waiting for system resources in order to run. 
                                  The optimal number of pending read repairs is 0 (or at
                                  most a very small number). A value greater than 0 
                                  indicates that read repair operations are in I/O
                                  contention with other operations.
Replicate on Write Tasks Pending  When an insert or update to a row is written, the
                                  affected row is replicated to all other nodes that 
                                  manage a replica for that row. This is called the
                                  ReplicateOnWriteStage. This metric tracks the pending 
                                  tasks related to this stage of the write process. During
                                  low or moderate write load, you should see 0 
                                  pending replicate on write tasks (or at most a very low number).
Request Responses Pending         Streaming of data between nodes happens during
                                  operations such as bootstrap and decommission when one 
                                  node sends large numbers of rows to another node. The
                                  metric tracks the progress of the streamed rows 
                                  from the receiving node.
Streams Pending                   Streaming of data between nodes happens during
                                  operations such as bootstrap and decommission when 
                                  one node sends large numbers of rows to another node.
                                  The metric tracks the progress of the streamed 
                                  rows from the sending node.
Write Requests Pending            The number of write requests that have arrived into the
                                  cluster but are waiting to be handled. During 
                                  low or moderate write load, you should see 0 pending
                                  write operations (or at most a very low number). 
================================= ==================================================================
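
Many of the pending-task metrics above share the same guidance: the value should sit at 0 or a very small number, and during a repair it should fall steadily. The following sketch shows one way to express both checks over a series of sampled values. It is an illustration only; the function names, the ``small`` cutoff, and the ``window`` length are assumptions, not OpsCenter settings.

.. code-block:: python

    from typing import Sequence


    def has_sustained_backlog(samples: Sequence[int], small: int = 2, window: int = 3) -> bool:
        """Return True if the last `window` samples all exceed `small` pending tasks."""
        if window <= 0:
            raise ValueError("window must be positive")
        recent = samples[-window:]
        # Require a full window so a single early spike does not count as sustained.
        return len(recent) == window and all(s > small for s in recent)


    def is_draining(samples: Sequence[int]) -> bool:
        """Return True if the count never rises between samples and ends lower than it began."""
        if len(samples) < 2:
            return False
        non_increasing = all(b <= a for a, b in zip(samples, samples[1:]))
        return non_increasing and samples[-1] < samples[0]


    if __name__ == "__main__":
        print(has_sustained_backlog([0, 1, 5, 7, 9]))   # backlog persists across the window
        print(is_draining([120, 90, 90, 40, 0]))        # repair tasks working down to zero

Checking a window of samples rather than a single value avoids alerting on a brief spike that clears on its own.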

.. _adv-cf-alerts:

Advanced column family alert metrics
------------------------------------
OpsCenter provides the capability to configure alerts for the following column
family metrics. Column family metrics provide a granular level of detail for
certain Cassandra metrics as they relate to a particular column family.

======================================= ============================================================
Metric                                  Definition
======================================= ============================================================
Local Writes                            The write load on a column family measured in operations 
                                        per second. This metric includes all writes to a 
                                        given column family, including write requests forwarded
                                        from other nodes.
Local Write Latency                     The response time in milliseconds for successful write
                                        operations on a column family. The time period starts
                                        when a node receives a write request, and ends when the
                                        node responds.
Local Reads                             The read load on a column family measured in operations
                                        per second. This metric includes all reads to a 
                                        given column family, including read requests forwarded
                                        from other nodes.
Local Read Latency                      The response time in microseconds for successful read
                                        operations on a column family. The time period starts 
                                        when a node receives a read request, and ends when the
                                        node responds.
CF: KeyCache Hits                       The number of read requests that resulted in the
                                        requested row key being found in the key cache.
CF: KeyCache Requests                   The total number of read requests on the row key cache.
CF: KeyCache Hit Rate                   The key cache hit rate indicates the effectiveness of
                                        the key cache for a given column family by giving 
                                        the percentage of cache requests that resulted in a
                                        cache hit.
CF: RowCache Hits                       The number of read requests that resulted in the read
                                        being satisfied from the row cache.
CF: RowCache Requests                   The total number of read requests on the row cache.
CF: RowCache Hit Rate                   The row cache hit rate indicates the effectiveness of
                                        the row cache for a given column family by giving
                                        the percentage of cache requests that resulted in a
                                        cache hit.
Live Disk Used                          The current size of live SSTables for a column family.
                                        It is expected that SSTable size will grow over 
                                        time with your write load, as compaction processes
                                        continue doubling the size of SSTables. Using this 
                                        metric together with SSTable count, you can monitor the
                                        current state of compaction for a given column family.
Total Disk Used                         The current size of the data directories for the column
                                        family including space not reclaimed by obsolete objects.
SSTable Count                           The current number of SSTables for a column family. When
                                        column family memtables are persisted to disk as 
                                        SSTables, this metric increases to the configured
                                        maximum before the compaction cycle is repeated. Using 
                                        this metric together with live disk used, you can
                                        monitor the current state of compaction for a given 
                                        column family.
Pending Reads/Writes                    The number of pending reads and writes on a column
                                        family. Pending operations are an indication that 
                                        Cassandra is not keeping up with the workload. A value
                                        of zero indicates healthy throughput.
CF: Bloom Filter Space Used             The size of the bloom filter files on disk.
CF: Bloom Filter False Positives        The absolute number of false positives, which occur when
                                        the bloom filter reports that a row exists but it does not.
CF: Bloom Filter False Positive Ratio   The fraction of all bloom filter checks resulting in a
                                        false positive. 
======================================= ============================================================
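
The hit rate and false positive ratio metrics above are derived from two counters each. The following sketch shows the arithmetic implied by the definitions; the function names are hypothetical, and returning 0.0 when there have been no requests or checks is an assumption made here to avoid dividing by zero.

.. code-block:: python

    def hit_rate_percent(hits: int, requests: int) -> float:
        """Percentage of cache requests that resulted in a cache hit."""
        if requests == 0:
            return 0.0  # assumption: report 0 when the cache has not been queried
        return 100.0 * hits / requests


    def false_positive_ratio(false_positives: int, total_checks: int) -> float:
        """Fraction of all bloom filter checks that resulted in a false positive."""
        if total_checks == 0:
            return 0.0  # assumption: report 0 when no checks have been made
        return false_positives / total_checks


    if __name__ == "__main__":
        print(hit_rate_percent(850, 1000))       # 850 hits out of 1000 requests
        print(false_positive_ratio(5, 1000))     # 5 false positives in 1000 checks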

.. _os-system-alerts:

Advanced system alert metrics
-----------------------------
OpsCenter provides the capability to configure alerts for the following operating system metrics:

* `Linux Metrics`_
* `Windows Metrics`_
* `Mac OSX Metrics`_

As with any database system, Cassandra performance greatly depends on the
underlying systems on which it is running. To configure advanced system metric
alerts, you should first have an understanding of the baseline performance of
your hardware and the averages of these system metrics when the system is
handling a typical workload.
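
One possible way to turn a baseline into an alert threshold is to take the average of samples collected under a typical workload and add a margin based on their spread. The sketch below does this with the mean plus a multiple of the sample standard deviation. This rule, the function name, and the default multiplier ``k`` are assumptions for illustration; OpsCenter does not prescribe them.

.. code-block:: python

    import statistics
    from typing import Sequence


    def baseline_threshold(samples: Sequence[float], k: float = 3.0) -> float:
        """Return mean + k * sample standard deviation of the baseline samples."""
        if len(samples) < 2:
            raise ValueError("need at least two baseline samples")
        mean = statistics.mean(samples)
        spread = statistics.stdev(samples)  # sample standard deviation
        return mean + k * spread


    if __name__ == "__main__":
        # Hypothetical CPU user-time percentages recorded during a typical workload.
        typical = [40, 42, 38, 41, 39]
        print(round(baseline_threshold(typical), 2))

A larger ``k`` produces fewer alerts at the cost of reacting later to a real change in behavior.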

Linux metrics
^^^^^^^^^^^^^

On Linux, you can configure alerts on :ref:`memory <linux-mem>`, :ref:`CPU <linux-cpu>`, and :ref:`disk <linux-disk>` events.

.. _linux-mem:

**Memory metrics on Linux**

================================= =============================================================
Metric                            Definition
================================= =============================================================
Memory Free                       System memory that is not being used.
Memory Used                       System memory used by application processes.
Memory Buffered                   System memory used for caching file system metadata and 
                                  tracking in-flight pages.
Memory Shared                     System memory that is accessible to CPUs.
Memory Cached                     System memory used by the OS disk cache.  
================================= =============================================================

.. _linux-cpu:

**CPU metrics on Linux**
            
================================= =============================================================
Metric                            Definition
================================= =============================================================
Idle                              Percentage of time the CPU is idle.                  
Iowait                            Percentage of time the CPU is idle and there is a pending 
                                  disk I/O request.                             
Nice                              Percentage of time spent processing prioritized tasks.
                                  Niced tasks are also counted in system and user time. 
Steal                             Percentage of time a virtual CPU waits for a real CPU while 
                                  the hypervisor services another virtual processor.
System                            Percentage of time allocated to system processes.    
User                              Percentage of time allocated to user processes.       
================================= =============================================================

.. _linux-disk:

**Disk metrics on Linux**

================================= =============================================================
Metric                            Definition
================================= =============================================================
Disk Usage                        Percentage of disk space Cassandra uses at a given time.                                                  
Free Disk Space                   Available disk space in GB.
Used Disk Space                   Used disk space in GB.                                
Disk Read Throughput              Average disk throughput for read operations in megabytes per 
                                  second. Exceptionally high disk throughput values may indicate 
                                  I/O contention.        
Disk Write Throughput             Average disk throughput for write operations in megabytes per 
                                  second.                                 
Disk Read Rate                    Averaged disk speed for read operations.    
Disk Write Rate                   Averaged disk speed for write operations.  
Disk Latency                      Average time consumed by disk seeks in milliseconds. 
Disk Request Size                 Average size in sectors of requests issued to the disk.                                                 
Disk Queue Size                   Average number of requests queued due to disk latency.
Disk Utilization                  Percentage of CPU time consumed by disk I/O.  
================================= =============================================================
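
To get a feel for the disk space values before choosing thresholds, you can compute similar figures yourself. The sketch below uses Python's standard ``shutil.disk_usage`` call. It reports on the whole file system that contains ``path`` rather than on Cassandra data alone, and it assumes 1 GB is 1024\ :sup:`3` bytes; the function name and the returned keys are hypothetical.

.. code-block:: python

    import shutil

    GB = 1024 ** 3  # assumption: binary gigabytes


    def disk_space_metrics(path: str = ".") -> dict:
        """Return usage percentage plus free and used space in GB for the file system at `path`."""
        usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
        percent = 100.0 * usage.used / usage.total if usage.total else 0.0
        return {
            "usage_percent": percent,
            "free_gb": usage.free / GB,
            "used_gb": usage.used / GB,
        }


    if __name__ == "__main__":
        # Point `path` at a Cassandra data directory to inspect its file system.
        print(disk_space_metrics("."))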

Windows metrics
^^^^^^^^^^^^^^^

On Windows, you can configure alerts on :ref:`memory <win-mem>`, :ref:`CPU <win-cpu>`, and :ref:`disk <win-disk>` events.

.. _win-mem:

**Memory metrics on Windows**

================================= =============================================================
Metric                            Definition
================================= =============================================================
Available Memory                  Physical memory that is not being used.                
Pool Nonpaged                     Physical memory that stores the kernel and other system data
                                  structures.                              
Pool Paged Resident               Physical memory allocated to unused objects that can be 
                                  written to disk to free memory for reuse.
System Cache Resident             Physical pages of operating system code in the file system 
                                  cache.
================================= =============================================================

.. _win-cpu:
 
**CPU metrics on Windows**

================================= =============================================================
Metric                            Definition
================================= =============================================================
Idle                              Percentage of time the CPU is idle. 
Privileged                        Percentage of time the CPU spends executing kernel commands.                                             
User                              Percentage of time allocated to user processes. 
================================= =============================================================

.. _win-disk:

**Disk metrics on Windows**

================================= =============================================================
Metric                            Definition
================================= =============================================================
Disk Usage                        Percentage of disk space Cassandra uses at a given time.                                                 
Free Disk Space                   Available disk space in GB. 
Used Disk Space                   Used disk space in GB.
Disk Read Throughput              Average disk throughput for read operations in megabytes per 
                                  second. Exceptionally high disk throughput values may 
                                  indicate I/O contention.
Disk Write Throughput             Average disk throughput for write operations in megabytes per
                                  second.      
Disk Read Rate                    Averaged disk speed for read operations.
Disk Write Rate                   Averaged disk speed for write operations.
Disk Latency                      Average time consumed by disk seeks in milliseconds. 
Disk Request Size                 Average size of requests in KB issued to the disk.  
Disk Queue Size                   Average number of requests queued due to disk latency.
Disk Utilization                  Percentage of CPU time consumed by disk I/O.          
================================= =============================================================

Mac OSX metrics
^^^^^^^^^^^^^^^

On Mac OSX, you can configure alerts on :ref:`memory <mac-mem>`, :ref:`CPU <mac-cpu>`, and :ref:`disk <mac-disk>` events.

.. _mac-mem:

**Memory metrics on Mac OSX**

================================= =============================================================
Metric                            Definition
================================= =============================================================
Free Memory                       System memory that is not being used. 
Used Memory                       System memory that is being used by application processes.                                            
================================= =============================================================

.. _mac-cpu:

**CPU metrics on Mac OSX**

================================= =============================================================
Metric                            Definition
================================= =============================================================
Idle                              Percentage of time the CPU is idle.                                         
System                            Percentage of time allocated to system processes. 
User                              Percentage of time allocated to user processes.
================================= =============================================================

.. _mac-disk:

**Disk metrics on Mac OSX**

================================= =============================================================
Metric                            Definition
================================= =============================================================
Disk Usage                        Percentage of disk space Cassandra uses at a given time.                                          
Free Space                        Available disk space in GB. 
Used Disk Space                   Used disk space in GB.
Disk Throughput                   Average disk throughput for read/write operations in 
                                  megabytes per second. Exceptionally high disk throughput 
                                  values may indicate I/O contention.  
================================= =============================================================






