5. Observability#

5.1. Statistics#

If you have enabled statistics, QuasarDB will collect runtime statistics per node. QuasarDB uses its direct key/value API to store these statistics.

Each statistic is listed below with its key, the type of its value, and a description.

async_pipelines.pulled.total_bytes (int64)
    The total number of bytes pulled by the asynchronous pipelines
async_pipelines.pulled.total_count (int64)
    The total count of pull operations performed by the asynchronous pipelines
engine_build_date (blob)
    The QuasarDB engine build date
engine_version (blob)
    The QuasarDB engine version
evicted.count (int64)
    The total count of entries evicted from the in-memory cache of on-disk data
evicted.total_bytes (int64)
    The total number of bytes evicted
hardware_concurrency (int64)
    The number of concurrent hardware threads detected on the system
license.attribution_date (int64)
    The date when the license was attributed, as a Unix epoch timestamp
license.expiration_date (int64)
    The date of license expiration
license.memory (int64)
    The memory limit set by the license
license.remaining_days (int64)
    The number of days remaining until license expiration
license.support_until (int64)
    The date until which support is valid
startup (int64)
    The startup timestamp
startup_time (int64)
    The startup time
node_id (blob)
    A string representing the node id
operating_system (blob)
    A string representing the operating system
pageins.count (int64)
    The total count of data reads from disk into the in-memory cache
pageins.total_bytes (int64)
    The total number of bytes read from disk into the in-memory cache
partitions_count (int64)
    The number of partitions
chord.invalid_requests_count (int64)
    The number of times a request was incorrectly routed to this node by a client
chord.predecessor_changes_count (int64)
    The number of times the node's predecessor changed
chord.successor_changes_count (int64)
    The number of times the node's successor changed
chord.unstable_errors_count (int64)
    The number of times the node returned an "unstable" error to a client
cpu.idle (int64)
    The cumulative CPU idle time
cpu.system (int64)
    The cumulative CPU system time
cpu.user (int64)
    The cumulative CPU user time
disk.path (blob)
    The persistence path
disk.bytes_free (int64)
    The number of bytes free on the persistence path
disk.bytes_total (int64)
    The total number of bytes on the persistence path
memory.persistence.cache_bytes (int64)
    The number of bytes cached in memory for persistence
memory.persistence.memtable_bytes (int64)
    The number of bytes used by the memtable for persistence
memory.persistence.memtable_unflushed_bytes (int64)
    The number of bytes in the memtable waiting to be flushed for persistence
memory.persistence.table_reader_bytes (int64)
    The number of bytes used by the table reader for persistence
memory.persistence.total_bytes (int64)
    The total number of bytes used for persistence in memory
memory.bytes_resident_size (int64)
    The computed amount of RAM used for data by QuasarDB
memory.physmem.bytes_total (int64)
    The total amount of physical RAM, in bytes
memory.physmem.bytes_used (int64)
    The amount of physical RAM used, in bytes
memory.resident_bytes (int64)
    The number of bytes currently resident in memory
memory.resident_count (int64)
    The number of entries in RAM
memory.vm.bytes_total (int64)
    The total amount of virtual memory, in bytes
memory.vm.bytes_used (int64)
    The amount of virtual memory used, in bytes
network.current_users_count (int64)
    The current count of connected users
network.sessions.max_count (int64)
    The configured maximum number of sessions
network.sessions.available_count (int64)
    The current number of available sessions
network.sessions.unavailable_count (int64)
    The current number of sessions in use
persistence.bucket_read_count (int64)
    The total count of bucket reads
persistence.bucket_update_count (int64)
    The total count of bucket updates
persistence.capacity_bytes (int64)
    The persistence layer storage capacity, in bytes; may be 0 if the value is unknown
persistence.cloud_local_cache_bytes (int64)
    The number of bytes in the cloud local cache
persistence.utilized_bytes (int64)
    The number of bytes currently used in the persistence layer
persistence.read_bytes (int64)
    The cumulative number of bytes read
persistence.written_bytes (int64)
    The cumulative number of bytes written
persistence.entries_count (int64)
    The current number of entries in the persistence layer
persistence.info (blob)
    Additional information about the persistence layer
persistence.persistent_cache_bytes (int64)
    The number of bytes in the persistent cache
persistence.write_successes_count (int64)
    The total count of successful writes to the persistence layer
queries.microindex.aggregation.match (int64)
    The total number of matching aggregation queries
queries.microindex.filter.match (int64)
    The total number of matching filter queries
requests.bytes_in (int64)
    The total number of bytes received in requests
requests.errors_count (int64)
    The total count of request errors
requests.total_count (int64)
    The cumulative number of requests
requests.successes_count (int64)
    The cumulative number of successful operations
requests.bytes_out (int64)
    The total number of bytes sent in responses by the server
shutdown_time (int64)
    The shutdown timestamp
check.online (int64)
    The online status check result (1 for online, 0 for offline)
check.duration_ms (int64)
    The duration of the check operation in milliseconds

5.2. Retrieving Statistics from a QuasarDB Cluster#

5.2.1. Using QuasarDB Python Client Library#

This section provides a Python script that demonstrates how to use the QuasarDB Python API to connect to a QuasarDB cluster and retrieve the statistics of every node in the cluster. The script shows how to access the “cumulative” and “by_uid” statistics and provides guidance on interpreting the various types of stats.

Prerequisites

Before running the script, make sure you have the following installed:

  • Python 3

  • QuasarDB Python API (quasardb)

Python Script

import quasardb
import quasardb.stats as qdbst
import json

with quasardb.Cluster('qdb://127.0.0.1:2836') as conn:
    stats = qdbst.by_node(conn)
    print(json.dumps(stats, indent=4))
Example output:

{
   "127.0.0.1:2836": {
       "by_uid": {},
       "cumulative": {
           "async_pipelines.pulled.total_bytes": 0,
           "async_pipelines.pulled.total_count": 0,
           "cpu.idle": 6012100000,
           "cpu.system": 166390000,
           "cpu.user": 318260000,
           "disk.bytes_free": 221570932736,
           "disk.bytes_total": 274865303552,
           "disk.path": "insecure/db/0-0-0-1",
           "engine_build_date": "2023.12.06T20.59.15.000000000 UTC",
           "engine_version": "3.15.x",
           "evicted.count": 1,
           "evicted.total_bytes": 209,
           "hardware_concurrency": 8,
           "license.attribution_date": 1701900515,
           "license.expiration_date": 0,
           "license.memory": 8589934592,
           "license.remaining_days": 31337,
           "license.support_until": 0,
           "memory.persistence.cache_bytes": 104,
           "memory.persistence.memtable_bytes": 1071088,
           "memory.persistence.memtable_unflushed_bytes": 1071088,
           "memory.persistence.table_reader_bytes": 0,
           "memory.persistence.total_bytes": 1071192,
           "memory.physmem.bytes_total": 33105100800,
           "memory.physmem.bytes_used": 12619468800,
           "memory.resident_bytes": 0,
           "memory.resident_count": 0,
           "memory.vm.bytes_total": 140737488351232,
           "memory.vm.bytes_used": 1776730112,
           "network.current_users_count": 0,
           "network.sessions.available_count": 510,
           "network.sessions.max_count": 512,
           "network.sessions.unavailable_count": 2,
           "node_id": "0-0-0-1",
           "operating_system": "Linux 5.10.199-190.747.amzn2.x86_64",
           "partitions_count": 8,
           "perf.common.get_type_for_removal.deserialization.total_ns": 1374,
           "perf.common.get_type_for_removal.processing.total_ns": 8101,
           "perf.common.get_type_for_removal.total_ns": 9475,
           "perf.control.get_system_info.deserialization.total_ns": 4851,
           "perf.control.get_system_info.processing.total_ns": 599,
           "perf.control.get_system_info.total_ns": 5450,
           "perf.total_ns": 249523,
           "perf.ts.create_root.content_writing.total_ns": 16186,
           "perf.ts.create_root.deserialization.total_ns": 1717,
           "perf.ts.create_root.entry_writing.total_ns": 53294,
           "perf.ts.create_root.processing.total_ns": 131766,
           "perf.ts.create_root.serialization.total_ns": 128,
           "perf.ts.create_root.total_ns": 203091,
           "perf.ts.get_columns.deserialization.total_ns": 1409,
           "perf.ts.get_columns.processing.total_ns": 30098,
           "perf.ts.get_columns.total_ns": 31507,
           "persistence.capacity_bytes": 0,
           "persistence.cloud_local_cache_bytes": 0,
           "persistence.entries_count": 1,
           "persistence.info": "RocksDB 6.27",
           "persistence.persistent_cache_bytes": 0,
           "persistence.read_bytes": 10245,
           "persistence.utilized_bytes": 0,
           "persistence.written_bytes": 25929,
           "requests.bytes_in": 851,
           "requests.bytes_out": 32,
           "requests.errors_count": 2,
           "requests.successes_count": 4,
           "requests.total_count": 6,
           "startup": 1701900515,
           "startup_time": 1701900515,
           "check.online": 1,
           "check.duration_ms": 4
       }
   }
}
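
The result is a plain Python dictionary keyed by node URI, so individual values can be read directly. Below is a minimal sketch that walks over every node and prints two of the cumulative request counters; the statistic keys are taken from the table in section 5.1, everything else reuses the connection shown above:

import quasardb
import quasardb.stats as qdbst

with quasardb.Cluster('qdb://127.0.0.1:2836') as conn:
    stats = qdbst.by_node(conn)

    # One entry per node, keyed by the node URI (e.g. "127.0.0.1:2836").
    for node_uri, node_stats in stats.items():
        cumulative = node_stats["cumulative"]
        print(node_uri,
              "requests.total_count =", cumulative.get("requests.total_count"),
              "requests.errors_count =", cumulative.get("requests.errors_count"))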

5.2.2. Using qdbsh#

In qdbsh, you can use the direct_prefix_get command to retrieve a list of all statistics. Here’s an example:

qdbsh > direct_prefix_get $qdb.statistics 100
 1. $qdb.statistics.async_pipelines.pulled.total_bytes
 2. $qdb.statistics.async_pipelines.pulled.total_count
 3. $qdb.statistics.cpu.idle
 4. $qdb.statistics.cpu.system
 ...
52. $qdb.statistics.startup_time

qdbsh > direct_int_get $qdb.statistics.cpu.user
1575260000

qdbsh > direct_blob_get $qdb.statistics.node_id
c5fe30bf0154acc-5d63bd06e7878b9c-5f635b3cf7fc3560-dbe35df7b5080651
1:12

Note

The direct_prefix_get command retrieves a list of keys that match a specific prefix directly from a node. Here it is used to fetch the statistics keys, which all start with the prefix $qdb.statistics. The number 100 limits the maximum number of keys returned by the command.

5.3. Understanding “by_uid” and “cumulative” Statistics#

The retrieved statistics are organized into two main dictionaries: “by_uid” and “cumulative.”

  • “by_uid”: This dictionary contains per-user statistics on a secure cluster with users; statistics are grouped by the corresponding user IDs.

  • “cumulative”: This dictionary holds cumulative statistics for each node in the cluster. These statistics aggregate information across all users and are typically global values for the entire node, as illustrated in the sketch below.
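
The distinction is easy to see in code. The following sketch prints one node-wide value per node and then the per-user statistics, assuming that “by_uid” maps each user ID to that user's own statistics dictionary (it is empty on the insecure cluster shown in section 5.2.1):

import quasardb
import quasardb.stats as qdbst

with quasardb.Cluster('qdb://127.0.0.1:2836') as conn:
    for node_uri, node_stats in qdbst.by_node(conn).items():
        # Node-wide, aggregated values.
        print(node_uri, "total requests:",
              node_stats["cumulative"].get("requests.total_count"))

        # Per-user values; empty unless the cluster is secured with users.
        for user_id, user_stats in node_stats["by_uid"].items():
            print(node_uri, "user", user_id, "statistics:", user_stats)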

5.4. Interpreting Cumulative Statistics#

Many statistics, like CPU user usage, are cumulative. To interpret them correctly, you need to take the difference between two successive values and consider the time range between the measurements.

For example, if you retrieve the CPU user usage twice, like this:

qdbsh > direct_int_get $qdb.statistics.cpu.user
1576680000

qdbsh > direct_int_get $qdb.statistics.cpu.user
1576740000

In this scenario, the CPU user usage increased from 1576680000 to 1576740000 between the two measurements. To determine the CPU user usage during that time range, compute the difference between the two values (1576740000 - 1576680000 = 60000000), for example with a calculate_delta(prev, curr) helper, and relate it to the time elapsed between the two reads.
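
The same computation can be done with the Python API by sampling the statistics twice. The sketch below is illustrative only: the 10-second interval is arbitrary, and the delta is computed with a plain subtraction rather than a dedicated helper:

import time
import quasardb
import quasardb.stats as qdbst

SAMPLE_INTERVAL_S = 10  # arbitrary interval between the two measurements

with quasardb.Cluster('qdb://127.0.0.1:2836') as conn:
    first = qdbst.by_node(conn)
    time.sleep(SAMPLE_INTERVAL_S)
    second = qdbst.by_node(conn)

    for node_uri in second:
        prev = first[node_uri]["cumulative"]["cpu.user"]
        curr = second[node_uri]["cumulative"]["cpu.user"]
        # For a cumulative statistic, only the increase over the sampling
        # interval is meaningful, not the absolute value.
        print(node_uri, "cpu.user increased by", curr - prev,
              "over", SAMPLE_INTERVAL_S, "seconds")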

5.5. Statistic Type#

Statistics can be categorized into two types: “gauge” and “counter.”

  • “Gauge” Statistics: Gauge statistics represent instantaneous measurements of a value at a particular point in time. For example, “memory.vm.bytes_used” indicates the current virtual memory used.

  • “Counter” Statistics: Counter statistics represent continuously increasing values that keep track of changes over time. For example, “requests.total_count” records the total number of requests received since the node started.

To retrieve the statistic type of a specific statistic, you can use the QuasarDB Python API along with the following code:

import quasardb
import quasardb.stats as qdbst

with quasardb.Cluster('qdb://127.0.0.1:2836') as conn:
    stat_id = "requests.successes_count"
    stat_type = qdbst.stat_type(stat_id)
    print(f"The statistic type for '{stat_id}' is: {stat_type}")
The statistic type for 'requests.successes_count' is: ('counter', 'count')

Across these statistics you will find information about many aspects of the QuasarDB cluster, such as system resources, persistence, network, and queries. As described in section 5.3, the “by_uid” dictionary may contain per-user statistics on a secure cluster with users, while the “cumulative” dictionary holds the node-wide cumulative statistics for each node in the cluster.
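
If you want to apply this distinction programmatically, you can partition a node's cumulative statistics into gauges and counters. The sketch below assumes, as in the example above, that qdbst.stat_type returns a tuple whose first element is the category; keys the helper does not recognize are simply skipped:

import quasardb
import quasardb.stats as qdbst

with quasardb.Cluster('qdb://127.0.0.1:2836') as conn:
    for node_uri, node_stats in qdbst.by_node(conn).items():
        gauges, counters = {}, {}
        for key, value in node_stats["cumulative"].items():
            try:
                kind, _unit = qdbst.stat_type(key)
            except Exception:
                continue  # key not recognized by the helper
            if kind == 'counter':
                counters[key] = value
            else:
                gauges[key] = value
        print(node_uri, ":", len(counters), "counters,", len(gauges), "gauges")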

5.6. Performance Tracing#

If you have enabled performance profiling on the server side, you can run detailed performance traces of the operations that you execute on your cluster. This provides a detailed, step-by-step trace of the low-level operations, which is usually useful when you are debugging together with your Solution Architect.

If you have enabled statistics in conjunction with performance traces, you will get additional performance trace metrics in your statistics.
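
Because these metrics are regular statistics, they can be collected with the same Python API as before. Below is a minimal sketch that filters out the perf.* counters (which are expressed in nanoseconds, as in the perf.ts.create_root.* entries of the example output in section 5.2.1):

import quasardb
import quasardb.stats as qdbst

with quasardb.Cluster('qdb://127.0.0.1:2836') as conn:
    for node_uri, node_stats in qdbst.by_node(conn).items():
        # Keep only the performance trace metrics.
        perf = {key: value for key, value in node_stats["cumulative"].items()
                if key.startswith("perf.")}
        for key, total_ns in sorted(perf.items()):
            print(node_uri, key, total_ns, "ns")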

To use performance traces from the QuasarDB shell, use the enable_perf_trace command as follows:

$ qdbsh
quasardb shell version 3.5.0master build ce779a9 2019-10-23 00:04:01 +0000
Copyright (c) 2009-2019 quasardb. All rights reserved.
Need some help? Check out our documentation here:  https://doc.quasardb.net

qdbsh > enable_perf_trace
qdbsh > create table testable(col1 int64, col2 double)

*** begin performance trace

total time: 175 us

- function + ts.create_root - 175 us [100.00 %]
           |
           |              data received:         0 us - delta:         0 us [00.00 % - 00.00 %]
           |     deserialization starts:         2 us - delta:         2 us [01.14 % - 01.14 %]
           |       deserialization ends:         7 us - delta:         5 us [02.86 % - 02.86 %]
           |             entering chord:         9 us - delta:         2 us [01.14 % - 01.14 %]
           |                   dispatch:        14 us - delta:         5 us [02.86 % - 02.86 %]
           |     deserialization starts:        24 us - delta:        10 us [05.71 % - 05.71 %]
           |       deserialization ends:        26 us - delta:         2 us [01.14 % - 01.14 %]
           |          processing starts:        28 us - delta:         2 us [01.14 % - 01.14 %]
           |      entry trimming starts:        79 us - delta:        51 us [29.14 % - 29.14 %]
           |        entry trimming ends:        81 us - delta:         2 us [01.14 % - 01.14 %]
           |     content writing starts:        82 us - delta:         1 us [00.57 % - 00.57 %]
           |       content writing ends:       141 us - delta:        59 us [33.71 % - 33.71 %]
           |       entry writing starts:       141 us - delta:         0 us [00.00 % - 00.00 %]
           |         entry writing ends:       172 us - delta:        31 us [17.71 % - 17.71 %]
           |            processing ends:       175 us - delta:         3 us [01.71 % - 01.71 %]

*** end performance trace

In the trace above, for example, we can see the performance trace of the CREATE TABLE statement and get a detailed idea of where it spends most of its time. In this case, the total operation lasted 175 microseconds, and the breakdown of the low-level function timings shows that content writing (59 us) and entry writing (31 us) account for most of that time.