Monitoring with Metrics

Metrics are measurements of resource usage or behavior that can be observed and collected across SnappyData clusters. Using these metrics, you can monitor cluster health and statistics. Monitoring the clusters allows you to do the following:

- Increase availability by quickly detecting downtime or degradation.
- Facilitate performance monitoring by external tools. These external tools can handle functions such as metrics aggregation, alerting, and visualization.

SnappyData uses Spark's Metrics Subsystem for metrics collection. This subsystem lets you publish metrics to a variety of sinks that you can enable. Spark supports the following sinks, and SnappyData can send metrics to all of them. However, the MetricsServlet sink is enabled by default, and all metrics are published to it.

Sink	Description
MetricsServlet	Adds a servlet within the existing Spark UI to serve metrics data as JSON. Metrics are published here by default and are easily accessible via a web page. The servlet URL is http://<LeadNode-hostname>:5050/metrics/json.
JmxSink	Registers metrics for viewing in a JMX console. Monitoring tools or agents that understand the JMX protocol, such as the JMX exporter from Prometheus, can export this information to external monitoring systems for storage and visualization.
CsvSink	Exports metrics data to CSV files at regular intervals.
ConsoleSink	Logs metrics information to the console.
GraphiteSink	Sends metrics to a Graphite node.
Slf4jSink	Sends metrics to slf4j as log entries.

Note

Metrics cannot be stored for long-term retention or archival purposes. Detailed instructions for integrating with external monitoring systems, so that you can build monitoring dashboards from the metrics data in the sinks, will be published in a future release.

Whenever you start the cluster, the metrics are published and made available through commonly used sinks, which can be consumed by monitoring tools.

Note

Metrics are not published for Smart Connector mode.

The following types of metrics are available for SnappyData:

Availability Metrics: This type of metric reports the status and availability of the cluster.

Examples:
- Cluster and node status
- Cluster and node uptime/downtime
- Member Statistics
- Table Statistics

Performance Metrics: This type of metric provides insights into system performance.

Examples:
- Resource usage for system-level resources such as CPU, memory, disk, and network.
- Resource usage for SnappyData application/component resources, such as heap or off-heap memory and their percentage utilization.

Enabling the Sinks for Metrics Collection

If you want to publish the SnappyData metrics using any of the sinks, you must enable these sinks in the metrics.properties file, which is located in the conf folder of the SnappyData installation directory.

To enable a sink, do the following:

Open the conf folder and copy metrics.properties.template to a new file named metrics.properties:
```
cp metrics.properties.template metrics.properties
```

Uncomment the properties (remove the #) of the sink that you want to enable and provide the necessary property values. For example, to enable CsvSink, uncomment the following properties:

```
# Enable CsvSink for all instances by class name
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink

# Polling period for the CsvSink
*.sink.csv.period=1

# Unit of the polling period for the CsvSink
*.sink.csv.unit=minutes

# Polling directory for CsvSink
*.sink.csv.directory=/tmp/

# Polling period for the CsvSink specific for the worker instance
worker.sink.csv.period=10

# Unit of the polling period for the CsvSink specific for the worker instance
worker.sink.csv.unit=minutes
```

For the configuration to take effect, start the cluster by running this command from the SnappyData installation folder:
```
    ./sbin/snappy-start-all.sh
```

Note

If you change the metrics.properties file while the cluster is running, you must restart the cluster for the configuration changes to take effect.
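
The CsvSink example above mixes wildcard (`*`) entries with `worker`-specific entries. The following is a minimal sketch, not part of SnappyData or Spark, of how those entries can be read as "defaults plus per-instance overrides". It assumes the precedence described in Spark's metrics.properties.template (an instance-specific property wins over the wildcard). The helper names `parse_properties` and `effective_sink_settings`, and the `driver` instance name used for comparison, are illustrative assumptions.

```python
def parse_properties(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def effective_sink_settings(props, instance):
    """Merge '*' defaults with instance-specific overrides for every sink.

    Assumption: an instance-specific entry takes precedence over the
    wildcard entry with the same sink and setting name.
    """
    merged = {}
    for prefix in ("*", instance):  # defaults first, overrides second
        marker = prefix + ".sink."
        for key, value in props.items():
            if key.startswith(marker):
                sink, _, setting = key[len(marker):].partition(".")
                merged.setdefault(sink, {})[setting] = value
    return merged


# Same entries as the CsvSink example, plus one sink that is still commented out.
SAMPLE = """
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=1
*.sink.csv.unit=minutes
*.sink.csv.directory=/tmp/
worker.sink.csv.period=10
worker.sink.csv.unit=minutes
#*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
"""

if __name__ == "__main__":
    props = parse_properties(SAMPLE)
    for instance in ("worker", "driver"):  # instance names are illustrative
        print(instance, effective_sink_settings(props, instance))
```

Under these assumptions, the `worker` instance would poll every 10 minutes while other instances keep the 1-minute default, and the commented-out JmxSink entry contributes nothing.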

Accessing Metrics

You can check the SnappyData metrics collected through MetricsServlet, which is enabled by default, at the following URL: http://<LeadNode-hostname>:5050/metrics/json
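
As an illustration of consuming this endpoint from a script, here is a minimal Python sketch. It assumes the servlet returns the usual Dropwizard Metrics JSON layout, with a top-level "gauges" object whose entries each carry a "value" field; that layout, the helper names, the sample metric names, and the hostname in the comment are assumptions, not taken from the SnappyData documentation.

```python
import json
from urllib.request import urlopen


def fetch_metrics(host, port=5050, timeout=5.0):
    """Download and decode the JSON document served by MetricsServlet."""
    url = f"http://{host}:{port}/metrics/json"
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def gauge_values(payload):
    """Return {metric name: value} for every entry under 'gauges'.

    Assumption: each gauge entry looks like {"value": ...}.
    """
    gauges = payload.get("gauges", {})
    return {name: entry.get("value") for name, entry in gauges.items()}


# Hypothetical payload used only to show the assumed shape of the response.
SAMPLE_PAYLOAD = {
    "gauges": {
        "example.ClusterStatistics.totalCores": {"value": 16},
        "example.MemberStatistics.leadCount": {"value": 1},
    },
    "histograms": {},
}

if __name__ == "__main__":
    # Against a running cluster, replace SAMPLE_PAYLOAD with:
    #   fetch_metrics("lead-node.example.com")   # hypothetical hostname
    for name, value in sorted(gauge_values(SAMPLE_PAYLOAD).items()):
        print(name, value)
```

Keeping the download and the parsing in separate functions lets the parsing be checked without a running cluster.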

Accessing Metrics from JmxSink

Do the following to access Metrics from JmxSink:

If org.apache.spark.metrics.sink.JmxSink is enabled in the metrics.properties file, you can use JConsole to access Metrics captured through JmxSink. To enable JmxSink, uncomment the properties for the JmxSink in the metrics.properties file. For example:
```
    *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
```
Launch JConsole, connect to the host on which the primary lead node is running, and select the process ID of the SnappyData primary lead process.
Go to MBeans > metrics to access the SnappyData metrics.

Note

To access metrics for remote processes, add the following JMX remote properties to the node configuration file: -jmx-manager=true -jmx-manager-start=true -jmx-manager-port=<port_value>

Accessing Metrics from CsvSink

Do the following to access Metrics from CsvSink:

If org.apache.spark.metrics.sink.CsvSink is enabled in the metrics.properties file, you can access the metrics captured in the CSV files at the location specified in the *.sink.csv.directory property. To enable CsvSink, uncomment the properties for the CsvSink in the metrics.properties file. For example:
```
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
```

Note

You must ensure that the location mentioned in the *.sink.csv.directory property already exists and has write permissions.
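
The two points above (a directory that exists and is writable, and CSV files that can be read back) can be sketched in Python as follows. This is an illustration under stated assumptions: it assumes each gauge is written to its own file with a header row `t,value`, which is the layout used by the Dropwizard CSV reporter, and the function names are hypothetical. Verify the actual file layout against the files your cluster produces.

```python
import csv
import os


def directory_is_usable(directory):
    """True if the directory exists and the current user can write to it."""
    return os.path.isdir(directory) and os.access(directory, os.W_OK)


def latest_gauge_value(csv_path):
    """Return (timestamp, value) from the last data row of a gauge CSV file.

    Assumption: the file has a header row 't,value'. Returns None for an
    empty file or for a file without a 'value' column (for example, a
    histogram file with several statistic columns).
    """
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    if not rows or "value" not in rows[-1]:
        return None
    last = rows[-1]
    return int(last["t"]), last["value"]


if __name__ == "__main__":
    target = "/tmp/"  # matches *.sink.csv.directory in the example above
    print("usable:", directory_is_usable(target))
```

The value is returned as a string because gauges may hold numbers, booleans, or text such as table names.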

Disabling Metrics Collection

To disable metrics collection for a specific sink, edit the metrics.properties file and comment out the corresponding entries by prefixing the relevant lines with #. Save the file and then restart the cluster.
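
If you prefer to script this edit, the following is a minimal sketch of the commenting-out step. The function name `disable_sink` is hypothetical and the code only transforms text; saving the result to metrics.properties and restarting the cluster remain manual steps.

```python
def disable_sink(properties_text, sink_name):
    """Prefix every active '<instance>.sink.<sink_name>.*' entry with '#'."""
    marker = f".sink.{sink_name}."
    output = []
    for line in properties_text.splitlines():
        stripped = line.lstrip()
        key = stripped.split("=", 1)[0]
        if stripped and not stripped.startswith("#") and marker in key:
            output.append("#" + line)  # comment out the active entry
        else:
            output.append(line)        # leave comments and other sinks untouched
    return "\n".join(output) + "\n"


if __name__ == "__main__":
    original = (
        "*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink\n"
        "*.sink.csv.period=1\n"
        "*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink\n"
    )
    print(disable_sink(original, "csv"), end="")
```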

Statistics

The following statistics in SnappyData are collected when metrics monitoring is enabled.

Note

All memory statistics are published in MB units.

TableCountStatistics

Source	Description	Metric Type
embeddedTablesCount	Count of the row and column tables.	Gauge
rowTablesCount	Count of the row tables.	Gauge
columnTablesCount	Count of the column tables.	Gauge
externalTablesCount	Count of external tables.	Gauge

Table Statistics

Source	Description	Metric Type	Probable Values
tableName	For each table, the tableName is provided as the fully qualified table name.	Gauge
isColumnTable	Specifies whether it is a column table.	Gauge	Boolean value (True or False)
rowCount	Number of rows.	Gauge	Number of rows inserted; 0 if no rows have been inserted.
sizeInMemory	Table size in memory.	Gauge
sizeSpillToDisk	Table size spilled to disk.	Gauge
totalSize	Total size of the table.	Gauge
isReplicatedTable	Specifies whether it is a replicated table.	Gauge	Boolean value (True or False)
bucketCount	Number of buckets.	Gauge
redundancy	Specifies whether redundancy is enabled.	Gauge	The redundancy value provided when the table was created; 0 otherwise.
isRedundancyImpaired	Specifies whether redundancy is impaired (because one or more replicas are unavailable).	Gauge	Boolean value (True or False)
isAnyBucketLost	Specifies whether any buckets are lost. (The UI shows the bucket count in red.)	Gauge	Boolean value (True or False)
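
The two boolean gauges at the end of this table lend themselves to simple availability checks. The sketch below assumes you have already collected the per-table statistics into a dictionary keyed by table name; that input shape, the function names, and the sample table names are hypothetical, because the exact metric key layout published by the sinks is not described here.

```python
def as_bool(value):
    """Interpret a gauge value that may arrive as a bool or as 'True'/'False' text."""
    if isinstance(value, str):
        return value.strip().lower() == "true"
    return bool(value)


def tables_needing_attention(table_stats):
    """Return {table name: [reasons]} for tables with impaired redundancy or lost buckets."""
    problems = {}
    for table, stats in table_stats.items():
        reasons = []
        if as_bool(stats.get("isRedundancyImpaired")):
            reasons.append("redundancy impaired")
        if as_bool(stats.get("isAnyBucketLost")):
            reasons.append("bucket lost")
        if reasons:
            problems[table] = reasons
    return problems


if __name__ == "__main__":
    # Hypothetical snapshot; the table names and values are made up for illustration.
    snapshot = {
        "APP.ORDERS": {"isRedundancyImpaired": "False", "isAnyBucketLost": False},
        "APP.EVENTS": {"isRedundancyImpaired": True, "isAnyBucketLost": "true"},
    }
    print(tables_needing_attention(snapshot))
```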

ExternalTableStatistics

Source	Description	Metric Type	Probable Values
dataSourcePath	Data source path.	Gauge	Path of the file from which the data is loaded.
provider	Data source provider.	Gauge	csv, parquet, orc, json, etc.
tableName	Table name	Gauge
tableType	Table type	Gauge	EXTERNAL

MemberStatistics

Source	Description	Metric Type
totalMembersCount	Count of total members.	Gauge
leadCount	Count of leads.	Gauge
locatorCount	Count of locators.	Gauge
dataServerCount	Count of data servers.	Gauge
connectorCount	Count of connectors.	Gauge

MemberStatistics

Source	Description	Metric Type	Probable Values
memberId	Contains IP address, port and process ID.	Gauge
nameOrId	Contains IP address, port, and pid or the name	Gauge
host	IP address or name of the machine.	Gauge
shortDirName	Relative path of the log directory.	Gauge
fullDirName	Absolute path of log directory.	Gauge
logFile	Name of the log file.	Gauge
processId	Member's process ID.	Gauge
diskStoreUUID	Member's unique disk store UUID.	Gauge
diskStoreName	Member's disk store name.	Gauge
status	Current status of the member.	Gauge	Running / Stopped
memberType	Type of the member (Lead, Server, Locator, or Accessor).	Gauge	Lead / Locator / Data Server
isLocator	True if the member is a locator; false otherwise.	Gauge	Boolean value (True or False)
isDataServer	True if the member is a data server; false otherwise.	Gauge	Boolean value (True or False)
isLead	True if the member is a lead; false otherwise.	Gauge	Boolean value (True or False)
isActiveLead	True if the member is the primary lead; false otherwise.	Gauge	Boolean value (True or False)
cores	Total number of cores.	Gauge
cpuActive	Number of active CPUs.	Gauge
clients	Number of clients connected.	Gauge
jvmHeapMax	Max JVM heap size.	Gauge
jvmHeapUsed	Used JVM heap size.	Gauge
jvmHeapTotal	Total JVM heap size.	Gauge
jvmHeapFree	Free JVM heap size.	Gauge
heapStoragePoolUsed	Used heap storage pool.	Gauge
heapStoragePoolSize	Heap storage pool size.	Gauge
heapExecutionPoolUsed	Used heap execution pool.	Gauge
heapExecutionPoolSize	Heap execution pool size.	Gauge
heapMemorySize	Heap memory size.	Gauge
heapMemoryUsed	Used heap memory.	Gauge
offHeapStoragePoolUsed	Used off-heap storage pool.	Gauge
offHeapStoragePoolSize	Off-heap storage pool size.	Gauge
offHeapExecutionPoolUsed	Used off-heap execution pool.	Gauge
offHeapExecutionPoolSize	Off-heap execution pool size.	Gauge
offHeapMemorySize	Off-heap memory size.	Gauge
offHeapMemoryUsed	Used off-heap memory.	Gauge
diskStoreDiskSpace	Disk store disk space.	Gauge
cpuUsage	CPU usage.	Gauge
jvmUsage	JVM usage.	Gauge
heapUsage	Heap usage.	Gauge
heapStorageUsage	Heap storage usage.	Gauge
heapExecutionUsage	Heap execution usage.	Gauge
offHeapUsage	Off-heap usage.	Gauge
offHeapStorageUsage	Off-heap storage usage.	Gauge
offHeapExecutionUsage	Off-heap execution usage.	Gauge
aggrMemoryUsage	Aggregate memory usage.	Gauge
cpuUsageTrends	CPU usage trends.	Histogram
jvmUsageTrends	JVM usage trends.	Histogram
heapUsageTrends	Heap usage trends.	Histogram
heapStorageUsageTrends	Heap storage usage trends.	Histogram
heapExecutionUsageTrends	Heap execution usage trends.	Histogram
offHeapUsageTrends	Off-heap usage trends.	Histogram
offHeapStorageUsageTrends	Off-heap storage usage trends.	Histogram
offHeapExecutionUsageTrends	Off-heap execution usage trends.	Histogram
aggrMemoryUsageTrends	Aggregate memory usage trends.	Histogram
diskStoreDiskSpaceTrend	Disk store disk space trends.	Histogram
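
Because the memory gauges are published in MB, a used/size pair can be turned into a utilization percentage with simple arithmetic. The sketch below is illustrative only: the pairing of "used" and "size" gauges and the helper names are assumptions, and whether the published usage gauges (for example, heapUsage) already report the same ratio is not stated in this document.

```python
# Assumed pairing of "used" and "size" gauges from the table above.
MEMORY_PAIRS = {
    "jvmHeap": ("jvmHeapUsed", "jvmHeapMax"),
    "heap": ("heapMemoryUsed", "heapMemorySize"),
    "offHeap": ("offHeapMemoryUsed", "offHeapMemorySize"),
}


def utilization_percent(used_mb, size_mb):
    """Return used/size as a percentage rounded to one decimal, or None if size is not positive."""
    if size_mb <= 0:
        return None
    return round(100.0 * used_mb / size_mb, 1)


def memory_report(member_stats):
    """Compute utilization for each pair whose gauges are both present."""
    report = {}
    for label, (used_key, size_key) in MEMORY_PAIRS.items():
        if used_key in member_stats and size_key in member_stats:
            report[label] = utilization_percent(
                member_stats[used_key], member_stats[size_key]
            )
    return report


if __name__ == "__main__":
    # Hypothetical values in MB for a single member.
    member = {
        "heapMemoryUsed": 512,
        "heapMemorySize": 2048,
        "offHeapMemoryUsed": 0,
        "offHeapMemorySize": 0,
    }
    print(memory_report(member))
```

A size of 0 (for example, when off-heap memory is not configured) yields None rather than a division error.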

ClusterStatistics

Source	Description	Metric Type
totalCores	Total number of cores in the cluster.	Gauge
jvmUsageTrends	JVM usage trends in the cluster.	Histogram
heapUsageTrends	Heap usage trends in the cluster.	Histogram
heapStorageUsageTrends	Heap storage usage trends in the cluster.	Histogram
heapExecutionUsageTrends	Heap execution usage trends in the cluster.	Histogram
offHeapUsageTrends	Off-heap usage trends in the cluster.	Histogram
offHeapStorageUsageTrends	Off-heap storage usage trends in the cluster.	Histogram
offHeapExecutionUsageTrends	Off-heap execution usage trends in the cluster.	Histogram
aggrMemoryUsageTrends	Aggregate memory usage trends in the cluster.	Histogram
diskStoreDiskSpaceTrend	Disk store disk space trends in the cluster.	Histogram