Monitoring with Metrics¶
Metrics constitutes of the measurements of resource usage or behavior that can be observed and collected all over SnappyData clusters. Using the Metrics, you can monitor the cluster health and statistics. Monitoring the clusters allows you to do the following:
- Increase availability by quickly detecting downtime/degradation.
- Facilitate performance monitoring by external tools. These external tools can handle functions such as metrics aggregation, alerting, and visualization.
SnappyData uses the Spark’s Metrics Subsystem for metrics collection. This system allows you to publish the metrics to a variety of sinks that you can enable for metrics collection. Spark supports the following sinks for Metrics. SnappyData can send metrics to all these sinks. However, MetricsServlet sink is enabled by default and all the metrics get published here.
Sink | Description |
---|---|
MetricsServlet | Adds a servlet within the existing Spark UI to serve metrics data as JSON data. The metrics are published here by default and can be easily accessible via a web page. The servlet URL is http:// |
JmxSink | Register metrics for viewing in a JMX console. Monitoring tools/ agents that understand the JMX protocol, such as JMX exporter from Prometheus, can export this information to external monitoring systems for storage and visualization. |
CsvSink | Exports metrics data to CSV files at regular intervals. |
ConsoleSink | Logs metrics information to the console. |
GraphiteSink | Sends metrics to a Graphite node. |
Slf4jSink | Sends metrics to slf4j as log entries. |
Note
You cannot store metrics for long term retention as well as for archival purposes. Detailed instructions for integration with external monitoring systems to build monitoring dashboards by sourcing metrics data from the sinks will be published in a future release.
Whenever you start the cluster, the metrics are published and made available through commonly used sinks, which can be consumed by monitoring tools.
Note
Metrics are not published for Smart Connector mode.
The following type of metrics are made available for SnappyData:
-
Availability Metrics This type of metrics alerts about the status and availability of the cluster.
Examples:
- Cluster and node status
- Cluster and node uptime/downtime
- Member Statistics
- Table Statistics
-
Performance Metrics This type of metric provides insights into system performance.
Examples:
- Resource usage for system-level resources such as CPU, memory, disk, and network.
- Resource usage for SnappyData application/component resources such as heap or off-heap memory and their percentage utilization etc.
Enabling the Sinks for Metrics collection¶
If you want to publish the SnappyData metrics using any of the sinks, you must enable these sinks in the metrics.properties file, which is located in the conf folder of the SnappyData installation directory.
To enable a sink, do the following:
-
Open the conf folder and change the name of metrics.properties.template to metrics.properties file.
cp metrics.properties.template metrics.properties
-
Uncomment the properties (remove the #) of the sink that you want to enable and provide the necessary property values. For example, for enabling CsvSink, you must uncomment the following properties:
# Enable CsvSink for all instances by class name *.sink.csv.class=org.apache.spark.metrics.sink.CsvSink # Polling period for the CsvSink *.sink.csv.period=1 # Unit of the polling period for the CsvSink *.sink.csv.unit=minutes # Polling directory for CsvSink *.sink.csv.directory=/tmp/ # Polling period for the CsvSink specific for the worker instance worker.sink.csv.period=10 # Unit of the polling period for the CsvSink specific for the worker instance worker.sink.csv.unit=minutes
-
Start the cluster for the configurations to take effect by executing this command from the SnappyData installation folder:
./sbin/snappy-start-all.sh
Note
In case you make changes to metrics.properties file when the cluster is running, you must always restart the cluster for the configuration changes to reflect.
Accessing Metrics¶
You can check the SnappyData metrics collected through MetricsServlet, which is enabled by default, at the following URL: <LeadNode-hostname>:<5050>/metrics/json
Accessing Metrics from JmxSink¶
Do the following to access Metrics from JmxSink:
-
If org.apache.spark.metrics.sink.JmxSink is enabled in the metrics.properties file, you can use JConsole to access Metrics captured through JmxSink. To enable JmxSink, uncomment the properties for the JmxSink in the metrics.properties file. For example:
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
-
Launch JConsole and connect to the host on which primary lead node is running, select the process ID of SnappyData primary lead process.
- Go to MBeans > metrics, to access the SnappyData Metrics.
Note
To access metrics for remote processes, you need to add the following JMX remote properties in the node configuration file:-jmx-manager=true -jmx-manager-start=true -jmx-manager-port=<port_value>
Accessing Metrics from CsvSink¶
Do the following to access Metrics from CsvSink:
- If org.apache.spark.metrics.sink.CsvSink is enabled in the metrics.properties file, you can access the metrics captured in the CSV files at the location specified in the *.sink.csv.directory property. To enable CsvSink, uncomment the properties for the CsvSink in the metrics.properties file. For example,
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
Note
You must ensure that the location mentioned in the *.sink.csv.directory property already exists and has write permissions.
Disabling Metrics Collection¶
To disable the metrics collection for a specific Sink, edit the metrics.properties file and comment the corresponding entries by adding the prefix # to the required lines. Save the file and then restart the cluster.
Statistics¶
The following statistics in SnappyData are collected when you Metrics monitoring is enabled.
Note
All memory statistics are published in MB units.
TableCountStatistics¶
Source | Description | Metric Type | Probable Values |
---|---|---|---|
embeddedTablesCount | Count of the row and column tables. | Gauge | |
rowTablesCount | Count of the row tables. | Gauge | |
columnTablesCount | Count of the column tables. | Gauge | |
externalTablesCount | Count of external tables. | Gauge |
Table Statistics¶
Source | Description | Metric Type | Probable Values |
---|---|---|---|
tableName | For each table, the tableName is provided as the fully qualified table name. | Gauge | |
isColumnTable | Specifies if it is a column table. | Gauge | Boolean value(True or False). |
rowCount | Number of rows. | Gauge | Specifies the number of rows inserted which is otherwise 0. |
sizeInMemory | Table size in memory. | Gauge | |
sizeSpillToDisk | Table size spilled to disk. | Gauge | |
totalSize | Total size of the table. | Gauge | |
isReplicatedTable | Specifies if it is a replicated table. | Gauge | Boolean Values(True or False) |
bucketCount | Number of buckets. | Gauge | |
redundancy | Specifies if the redundancy is enabled. | Gauge | Specifies the redundancy value provided while creating a table which is otherwise 0. |
isRedundancyImpaired | Specifies if the redundancy is impaired. (since one or more replicas are unavailable) | Gauge | Boolean Values(True or False) |
isAnyBucketLost | Specifies if any buckets are lost. (UI shows bucket count in red color.) | Gauge | Boolean Values(True or False) |
ExternalTableStatistics¶
Source | Description | Metric Type | Probable Values |
---|---|---|---|
dataSourcePath | Data source path. | Gauge | Path of the file from which data to be loaded |
provider | Data source provider. | Gauge | csv, parquet, orc, json, etc |
tableName | Table name | Gauge | |
tableType | Table type | Gauge | EXTERNAL |
MemberStatistics¶
Source | Description | Metric Type | Probable Values |
---|---|---|---|
totalMembersCount | Count of total members. | Gauge | |
leadCount | Count of leads. | Gauge | |
locatorCount | Count of locators. | Gauge | |
dataServerCount | Count of data servers. | Gauge | |
connectorCount | Count of connectors. | Gauge |
MemberStatistics¶
Source | Description | Metric Type | Probable Values |
---|---|---|---|
memberId | Contains IP address, port and process ID. | Gauge | |
nameOrId | Contains IP address, port, and pid or the name | Gauge | |
host | IP address or name of the machine. | Gauge | |
shortDirName | Relative path of the log directory. | Gauge | |
fullDirName | Absolute path of log directory. | Gauge | |
logFile | Name of the log file. | Gauge | |
processId | Member's process ID. | Gauge | |
diskStoreUUID | Member's unique disk store UUID. | Gauge | |
diskStoreName | Member's disk store name. | Gauge | |
status | Current status of the member. | Gauge | Running / Stopped |
memberType | Type (Lead/ Server/ Locator/ Accessor). | Gauge | Lead / Locator/ Data Server |
isLocator | Flag returns true if the member is locator or false otherwise. | Gauge | Boolean value(True or false) |
isDataServer | Flag returns true if the member is data server or false otherwise. | Gauge | Boolean value(True or false) |
isLead | Flag returns true if the member is lead or false otherwise. | Gauge | Boolean value(True or false) |
isActiveLead | Flag returns true if the member is primary lead or false otherwise. | Gauge | Boolean value(True or false) |
cores | Total number of cores. | Gauge | |
cpuActive | Number of active CPUs. | Gauge | |
clients | Number of clients connected. | Gauge | |
jvmHeapMax | Max JVM heap size. | Gauge | |
jvmHeapUsed | Used JVM heap size. | Gauge | |
jvmHeapTotal | Total JVM heap size. | Gauge | |
jvmHeapFree | Free JVM heap size. | Gauge | |
heapStoragePoolUsed | Used heap storage pool. | Gauge | |
heapStoragePoolSize | Heap storage pool size. | Gauge | |
heapExecutionPoolUsed | Used heap execution pool. | Gauge | |
heapExecutionPoolSize | Heap execution pool size. | Gauge | |
heapMemorySize | Heap memory size. | Gauge | |
heapMemoryUsed | Used heap memory. | Gauge | |
offHeapStoragePoolUsed | Used off-heap storage pool. | Gauge | |
offHeapStoragePoolSize | Off-heap storage pool size. | Gauge | |
offHeapExecutionPoolUsed | Used off-heap execution pool. | Gauge | |
offHeapExecutionPoolSize | Off-heap execution pool size. | Gauge | |
offHeapMemorySize | Off-heap memory size. | Gauge | |
offHeapMemoryUsed | Used off-heap memory. | Gauge | |
diskStoreDiskSpace | Disk store disk space. | Gauge | |
cpuUsage | CPU usage. | Gauge | |
jvmUsage | JVM usage. | Gauge | |
heapUsage | Heap usage. | Gauge | |
heapStorageUsage | Heap storage usage. | Gauge | |
heapExecutionUsage | Heap execution usage. | Gauge | |
offHeapUsage | Off-heap usage. | Gauge | |
offHeapStorageUsage | Off-heap storage usage. | Gauge | |
offHeapExecutionUsage | Off-heap execution size | Gauge | |
aggrMemoryUsage | Aggregate memory usage. | Gauge | |
cpuUsageTrends | CPU usage trends. | Histogram | |
jvmUsageTrends | JVM usage trends. | Histogram | |
heapUsageTrends | Heap usage trends. | Histogram | |
heapStorageUsageTrends | Heap storage usage trends. | Histogram | |
heapExecutionUsageTrends | Heap execution usage trends. | Histogram | |
offHeapUsageTrends | Off-heap usage trends. | Histogram | |
offHeapStorageUsageTrends | Off-heap storage usage trends. | Histogram | |
offHeapExecutionUsageTrends | Off-heap execution usage trends. | Histogram | |
aggrMemoryUsageTrends | Aggregate memory usage trends. | Histogram | |
diskStoreDiskSpaceTrend | Disk store and space trends. | Histogram |
ClusterStatistics¶
Source | Description | Metric Type | Probable Values |
---|---|---|---|
totalCores | Totals number of cores in the cluster. | Gauge | |
jvmUsageTrends | JVM usage trends in the cluster. | Histogram | |
heapUsageTrends | Heap usage trends in the cluster. | Histogram | |
heapStorageUsageTrends | JVM usage trends in the cluster. | Histogram | |
heapExecutionUsageTrends | Heap execution usage trends in the cluster. | Histogram | |
offHeapUsageTrends | Off-heap usage trends in the cluster. | Histogram | |
offHeapStorageUsageTrends | Off-heap storage usage trends in the cluster. | Histogram | |
offHeapExecutionUsageTrends | Off-heap execution usage trends in the cluster. | Histogram | |
aggrMemoryUsageTrends | Aggregate memory usage trends in the cluster. | Histogram | |
diskStoreDiskSpaceTrend | Disk store and space trends in the cluster. | Histogram |