2. Monitoring QuasarDB Clusters#

2.1. Introduction and Goals#

Effective monitoring is essential for running QuasarDB in production. As a distributed, high-performance time series database, QuasarDB powers workloads that demand consistent uptime, low latency, and reliable ingestion under high throughput. Monitoring enables operators to detect anomalies early, investigate performance issues, and ensure the system meets business requirements.

This guide introduces a top-down approach to monitoring. It starts from the key business and operational goals, and drills down to the metrics, dashboards, and logs required to support them.

2.1.1. Goals of Monitoring a QuasarDB Cluster#

  • Ensure Cluster Availability: Detect and respond to node failures, replication issues, and cluster-wide disruptions.

  • Maintain Performance and Latency SLAs: Monitor query and ingestion latency to ensure the system responds predictably under load.

  • Track Ingestion Health: Ensure that incoming data is being ingested at expected rates and without backlog (a minimal check is sketched after this list).

  • Support Capacity Planning: Monitor storage growth, memory usage, and CPU utilization to anticipate scaling needs.

  • Reduce Time to Resolution (TTR): Enable fast root cause analysis during incidents by correlating metrics and logs.

  • Enable Auditability and Compliance: Provide historical insight into cluster health and behavior for audit trails or post-mortems.
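
To make these goals concrete, the sketch below shows one way an operator-side check could evaluate per-node health figures against simple thresholds. It is a minimal illustration only: the NodeStats fields, the threshold values, and the sample data are hypothetical examples, not QuasarDB APIs or recommended limits, and collecting the actual figures is left to your own metrics pipeline.

    from dataclasses import dataclass

    # Hypothetical per-node snapshot; in practice these figures come from
    # whatever metrics collection you run against the cluster.
    @dataclass
    class NodeStats:
        node_id: str
        reachable: bool
        query_latency_ms_p99: float   # 99th-percentile query latency
        ingestion_backlog_rows: int   # rows accepted but not yet persisted
        disk_used_ratio: float        # 0.0 - 1.0

    # Example thresholds; tune these to your own SLAs.
    MAX_P99_LATENCY_MS = 250.0
    MAX_BACKLOG_ROWS = 100_000
    MAX_DISK_USED_RATIO = 0.85

    def evaluate(stats: list[NodeStats]) -> list[str]:
        """Return human-readable findings for the whole cluster."""
        findings = []
        for node in stats:
            if not node.reachable:
                findings.append(f"{node.node_id}: node unreachable (availability)")
                continue
            if node.query_latency_ms_p99 > MAX_P99_LATENCY_MS:
                findings.append(f"{node.node_id}: p99 latency {node.query_latency_ms_p99:.0f} ms breaches SLA")
            if node.ingestion_backlog_rows > MAX_BACKLOG_ROWS:
                findings.append(f"{node.node_id}: ingestion backlog of {node.ingestion_backlog_rows} rows")
            if node.disk_used_ratio > MAX_DISK_USED_RATIO:
                findings.append(f"{node.node_id}: disk {node.disk_used_ratio:.0%} full (capacity planning)")
        return findings

    if __name__ == "__main__":
        # Fabricated example data for illustration only.
        sample = [
            NodeStats("node-a", True, 120.0, 2_000, 0.55),
            NodeStats("node-b", True, 310.0, 250_000, 0.91),
            NodeStats("node-c", False, 0.0, 0, 0.0),
        ]
        for finding in evaluate(sample):
            print(finding)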

2.1.2. Why Monitoring a Distributed Time Series Database Is Different#

QuasarDB is horizontally scalable, write-optimized, and designed for extremely fast time-based queries. These characteristics introduce unique monitoring needs:

  • Horizontal scale introduces new failure modes: Node-level metrics must be contextualized across the cluster to detect imbalances or bottlenecks (see the sketch after this list).

  • High-throughput ingestion is the norm: Monitoring must capture ingestion lag, batching efficiency, and write errors.

  • Query patterns evolve rapidly: Ad-hoc analytics or large scans can cause unpredictable load if not monitored.

  • Temporal patterns matter: Time-based anomalies (e.g., ingestion drops at specific hours) require visualization over time windows.
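
As an illustration of the first point, a node-level figure such as ingestion rate only becomes meaningful when compared against the rest of the cluster. The sketch below flags nodes whose rate deviates markedly from the cluster median; the node names, rates, and deviation factor are hypothetical, and gathering the per-node rates is again left to your metrics pipeline.

    from statistics import median

    def find_imbalanced_nodes(rates: dict[str, float], factor: float = 2.0) -> list[str]:
        """Flag nodes whose ingestion rate deviates from the cluster median
        by more than `factor` in either direction. `rates` maps node id to
        rows ingested per second."""
        if not rates:
            return []
        cluster_median = median(rates.values())
        flagged = []
        for node_id, rate in rates.items():
            if cluster_median > 0 and (rate > cluster_median * factor or rate < cluster_median / factor):
                flagged.append(f"{node_id}: {rate:.0f} rows/s vs cluster median {cluster_median:.0f} rows/s")
        return flagged

    # Fabricated example: node-c receives far less traffic than its peers,
    # which a fixed per-node threshold alone would not reveal.
    print(find_imbalanced_nodes({"node-a": 48_000, "node-b": 52_000, "node-c": 9_000}))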

2.1.3. Next Steps#

The next chapters will guide you through:

  • The most important metrics to monitor

  • How to visualize cluster health and performance using dashboards

  • How to define alerts

  • Using logs and traces for troubleshooting

Note

If you’re looking for a complete list of available metrics, refer to the Metrics Reference.