9. Troubleshooting Guide#

This troubleshooting guide is structured around clear questions and concise, actionable answers. Each section addresses a specific issue you might encounter with QuasarDB, accompanied by diagnostic steps and solutions. For ease of reference, every topic opens with a descriptive question that reflects common symptoms or error messages. Reviewing these guidelines will help you rapidly identify and resolve problems and keep your QuasarDB deployment running smoothly.

9.1. Connection Failures: Operation Timed Out#

Symptom: You encounter the following error when attempting to connect:

at qdb_connect: The operation timed out

Troubleshooting Steps:

  1. Determine whether the connection issue is permanent or sporadic:

    • Permanent issue:

      This is likely related to network configuration or firewall settings.

      • Verify connectivity by pinging the QuasarDB server from the client, ensuring ICMP packets are not dropped by any firewall between the client and server.

      • Review firewall rules and network security groups to confirm that the appropriate ports used by QuasarDB are open.

    • Sporadic issue: Sporadic connection timeouts typically point to resource exhaustion or configuration limitations.

      1. Check available server sessions: The most common cause of sporadic connection timeouts is session exhaustion.

        • Review monitoring dashboards and QuasarDB metrics specifically for:

          • network.sessions.available_count

          • network.sessions.unavailable_count

        • Note: If the server exhausts sessions, your metrics collector might also fail to connect, resulting in gaps or missing data. Confirm this by reviewing dashboards for missing metrics.

        • If session exhaustion is identified:

          • Increase the total_sessions parameter in your QuasarDB configuration file.

          • Additionally, review your application code to ensure sessions are not being held open longer than necessary, and promptly close unused sessions.

      2. Check open file descriptors:

        Although this is uncommon, QuasarDB may be running out of open file descriptors.

        • Inspect qdbd logs for an initialization error similar to:

          error during initialization: config/network: The number of allowed descriptors allowed by the OS is 2560. The current network configuration requires 1000000 descriptors. Please reduce the number of partitions or sessions.
          
        • If found, increase the OS limit for open file descriptors to meet the requirement reported by QuasarDB (see the sketch at the end of this section).

  2. Verify Peer-to-Peer Cluster Topology:

    QuasarDB uses a peer-to-peer model where each client automatically discovers all nodes in the cluster. If the IP address used to connect to the cluster does not match the IP addresses that the server nodes advertise internally, clients can encounter timeouts.

    • Check the IP configuration:

      • Confirm that the IP address you connect to is identical to the address the cluster nodes themselves bind to and advertise.

      • If you are connecting from a new environment or have introduced a NAT (Network Address Translation) layer, this step becomes especially critical.

    • Adjust the advertised IP if necessary. In your QuasarDB configuration file, you can set a different IP address to advertise using:

      "local.network.advertise_as": "10.0.10.1:2836"
      
      • This ensures the client is guided to the correct IP address, rather than the one the daemon is listening on by default.

    • Ensure cluster-wide connectivity:

      • All cluster nodes must also be able to reach each other through the advertised IP address.

      • If your environment uses advanced firewall or NAT rules, verify that they do not block or alter the advertised address.
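
      A minimal sketch of this check, assuming standard Linux tooling (netcat) and example addresses; run it from each node and from a client against every advertised endpoint:

        for endpoint in 10.0.10.1 10.0.10.2 10.0.10.3; do
            nc -vz "$endpoint" 2836   # data port; repeat with 2837 for the control port
        done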

By following the steps above—validating network connectivity, ensuring sufficient resources, and verifying peer-to-peer IP configurations—you can typically resolve or prevent “operation timed out” errors.
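
If step 1 pointed at file descriptor exhaustion rather than addressing, the sketch below shows one way to inspect and raise the OS limit. It assumes a Linux host where the daemon runs as a systemd service named qdbd; the limit value is only an example, so use the figure reported in the initialization error.

    # Inspect the limit currently applied to the running daemon
    cat /proc/$(pidof qdbd)/limits | grep "Max open files"

    # Raise the limit via a systemd drop-in, then restart the service
    sudo systemctl edit qdbd        # add: [Service] followed by LimitNOFILE=1000000
    sudo systemctl restart qdbd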

9.2. Connection Failures: Connection Refused#

Symptom: You encounter the following error when attempting to connect:

at qdb_connect: Connection refused.

Troubleshooting Steps:

  1. Check the Database Service

    • Even though the host is reachable, this error usually means the database service itself is not running.

    • Verify that your database service is up and operational before proceeding.

  2. Verify Firewall Configuration

    • If the service is running, the most likely cause is a firewall that does not allow the required ports.

    • Ensure that port 2836 (standard data port) is open.

    • Ensure that port 2837 (control port) is open. This port is used by servers and clients for cluster topology and liveness checks.

  3. Prioritize Low-Volume Control Traffic

    • If you apply QoS policies or additional firewall restrictions, be aware that traffic on port 2837 is low-volume.

    • Prioritize this port’s traffic to ensure stable cluster communication.

By confirming both the database is running and the firewall allows communication on the required ports, you can typically resolve the “Connection refused” error.
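
A minimal sketch of both checks, assuming standard Linux tooling (ss and netcat) and an example hostname:

    # On the server: confirm qdbd is listening on the data and control ports
    ss -tlnp | grep -E ':(2836|2837)'

    # From the client: confirm both ports are reachable through any firewalls
    nc -vz qdb-server.example.com 2836
    nc -vz qdb-server.example.com 2837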

9.3. Connection Failures: Client/Server Version Mismatch#

Symptom: You encounter the following error when attempting to connect:

at qdb_connect: The remote host and Client API versions mismatch

Troubleshooting Steps:

This error indicates that the version of the QuasarDB client library used is incompatible with the version of the QuasarDB daemon it is attempting to connect to.

QuasarDB supports backward compatibility for clients: older clients can connect to newer servers. However, forward compatibility is not guaranteed: newer clients may fail to connect to older servers.

Additionally, in rare cases, breaking protocol changes may be introduced, which prevent even backward compatibility. These are always explicitly mentioned in the release changelogs.

To resolve this issue:

  1. Check the version of your client:

    • For qdbsh, run:

      qdbsh --version
      
  2. Check the version of your QuasarDB daemon:

    • Run this on the server:

      qdbd --version
      
  3. Compare the versions:

    • The client version should be equal to or older than the server version.

    • Clients that are newer than the server may not be compatible and should be downgraded.

    • If you are using a client that is too old, refer to the changelogs to verify whether breaking changes have been introduced.

  4. Upgrade or downgrade as needed:

    • Prefer using matching versions for both client and server whenever possible.

    • If mismatched versions are unavoidable, always ensure that the client is older than or equal to the server version.
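
As an illustration, the sketch below compares the two versions in a deployment script and fails if the client is newer than the server. It assumes both binaries print a semantic version somewhere in their output and that the server is reachable over SSH; adjust the version extraction to your actual output format.

    client_ver=$(qdbsh --version | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1)
    server_ver=$(ssh qdb-server 'qdbd --version' | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1)

    # sort -V orders versions; the highest of the pair must be the server version
    if [ "$(printf '%s\n%s\n' "$client_ver" "$server_ver" | sort -V | tail -1)" != "$server_ver" ]; then
        echo "Client $client_ver is newer than server $server_ver; downgrade the client." >&2
        exit 1
    fi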

9.4. Transactional Conflicts: Operation Aborted#

Symptom: You encounter the following error during a transactional operation:

at query_query: The operation has been aborted as it conflicts with another ongoing operation.

Troubleshooting Steps:

This error indicates a transactional conflict caused by QuasarDB’s use of optimistic locking. Unlike traditional databases that take pessimistic locks, QuasarDB performs operations without acquiring locks up front and raises an error only when it detects that a concurrent operation has modified the same data.

This design ensures high throughput and low latency, but it does mean that conflicts can occur during concurrent modifications to the same data.

Follow these steps to address the issue:

  1. Identify the context of the failure:

    • If the error occurs during data insertion:

      • Ensure that multiple writers are not concurrently inserting into the same table.

      • Best practice:

        • One writer may insert into multiple tables.

        • Each table should be written to by only one writer at a time.

      • This prevents overlapping transactions and avoids MVCC conflicts.

    • If the error occurs during ALTER TABLE operations:

      • Conflicts may arise from other long-running or aborted transactions.

      • A previous operation may have timed out and left uncommitted data behind.

  2. Resolve persistent conflicts:

    • Use the TRIM TABLE command to clean up expired MVCC copies and roll back any aborted transactions.

      TRIM TABLE <table>
      
    • Caution:

      • Do not invoke this on a large number of tables too quickly, as trimming can temporarily increase write I/O load.
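
A hedged sketch of that caution in practice: trim a handful of tables sequentially with a pause between each rather than all at once. The table names, the pause length, and the qdbsh invocation (cluster URI and piping a statement over stdin) are illustrative assumptions; adapt them to how you normally execute statements.

    for table in trades quotes orders; do
        echo "TRIM TABLE ${table}" | qdbsh --cluster qdb://127.0.0.1:2836
        sleep 60   # let background write I/O settle before the next trim
    done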

9.5. Slow Queries or Poor Query Performance#

Symptom: Queries are noticeably slower than expected, either intermittently or consistently.

Overview: Diagnosing slow query performance involves identifying where the time is being spent: client-side, server-side, or during the transfer and transformation of data. Because the causes can be diverse and layered, it is important to ask follow-up questions and perform step-by-step measurements to isolate the bottleneck. This guide outlines a progressive approach to narrowing down the root cause.

Part 1: Determine if the delay is client-side

  • Observe CPU usage on the client machine when executing the slow query.

    • If CPU usage remains high for a prolonged time, the delay is likely due to the client’s processing overhead, such as converting raw native memory into the runtime’s internal format. This is particularly common in languages such as Python. For example, in Jupyter notebooks, displaying query results may appear slow because the client transforms the entire native buffer into Python memory objects, even if only a small portion is shown on screen.

  • Run the same query in qdbsh:

    • If the query is fast in `qdbsh` but slow in your environment, it strongly suggests the issue is on the client side (data conversion or rendering).

    • If the query is also slow in `qdbsh`, and the resultset is large, note that qdbsh may spend significant time rendering the data to the terminal. In this case, inspect the final output line:

      Returned 1 row in 3,538 us
      
      • This value reflects the actual query execution time on the server, separate from any rendering time.

  • Modify your query to include LIMIT 1 and execute it again in qdbsh or your client environment.

    • If the query becomes significantly faster, the issue is likely in rendering or transferring large volumes of data, not in the database engine itself.

    • If the query includes GROUP BY or ORDER BY, note that QuasarDB must fully process the entire dataset even with LIMIT 1. This helps distinguish between true execution overhead and final-stage rendering.

  • If the resultset is large, consider adding an offset:

    SELECT ... FROM ... WHERE ... LIMIT 1 OFFSET 100000;
    
    • If this still returns quickly, it confirms that the slowness stems from post-processing or result transformation on the client side.
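
To make the comparison concrete, the sketch below times the same statement end to end from the shell and contrasts the wall-clock duration with the "Returned N rows in X us" figure that qdbsh prints. The cluster URI, the example query, and piping the statement over stdin are illustrative assumptions.

    # A wall-clock time far above the reported execution time points at client-side
    # conversion or rendering overhead rather than the database engine.
    time (echo "SELECT * FROM trades IN RANGE(2024, +1d) LIMIT 1" | qdbsh --cluster qdb://127.0.0.1:2836)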

Part 2: Identify server-side performance bottlenecks

If the performance issue cannot be attributed to the client side, the problem is likely within the server-side execution path. To confirm this, enable performance tracing.

  • In qdbsh, enable performance tracing:

    qdbsh > enable_perf_trace
    
  • After running the query, observe the trace summary at the end:

    total time: 412,431 us
    
    *** end performance trace
    
    • If this number closely matches the total query execution time, the slowdown is occurring server-side.

Server-side bottlenecks generally fall into one of the following four categories:

  • I/O limitation:

    • The query touches large amounts of data, some of which may need to be fetched from cold storage (especially when using S3 as a backend).

    • Run the same query multiple times. If the second execution is much faster, this strongly indicates that the first execution was I/O-bound.

    • Recommendations:

      • If using S3 as a backend:

        • Increase the size of the local SSD disk cache.

        • Add more memory to improve in-memory caching.

      • If not using S3:

        • Increase system memory to improve data caching.

  • CPU limitation:

    • Complex aggregation functions (e.g., STDDEV) can cause CPU bottlenecks.

    • If the QuasarDB daemons show available CPU headroom, increase the client-side option_set_connection_per_address_soft_limit to allow more server threads to process the request in parallel.

    • If all CPU cores are fully utilized, consider scaling up the CPU capacity of your QuasarDB nodes.

  • Resource contention with other clients:

    Other users or processes may be competing for shared cluster resources.

    • Typical mitigation strategies include:

      • Splitting workloads across multiple clusters with distinct SLAs.

      • For example, use one cluster for real-time monitoring and another for data science or analytics tasks.

      • If using S3, you may deploy a read-only QuasarDB cluster dedicated to data science workloads that do not require real-time data access.

  • Network limitation:

    • If CPU and cache usage appear normal, slowdowns may be due to network bandwidth limitations.

    • This is common when using an S3-based backend:

      • QuasarDB can saturate 10 Gbit/s or 25 Gbit/s links when pulling cold data.

      • Monitor network throughput during query execution to validate this.

      • Background compaction can also contribute to network usage.

      • To reduce interference, temporarily disable automatic compaction with:

        qdbsh > cluster_disable_auto_compaction
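
To confirm which resource is the bottleneck while the slow query runs, you can watch CPU and network usage on the qdbd nodes with standard Linux tooling (pidstat and sar are part of the sysstat package; the 1-second interval is an example):

    # Per-process CPU usage of the daemon, refreshed every second
    pidstat -u -p $(pidof qdbd) 1

    # Per-interface network throughput, useful when cold data is pulled from S3
    sar -n DEV 1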
        

9.6. Heavy Disk I/O#

Symptom: Sustained high disk I/O, especially during periods of frequent writes, leading to performance degradation or elevated compaction overhead.

Overview: QuasarDB uses a transactional storage engine based on RocksDB with copy-on-write semantics. Small incremental writes to large shards can lead to severe write amplification. This section outlines how to detect such patterns and progressively optimize disk write behavior, starting from high-level options to advanced storage engine tuning.

Step 1: Detect small incremental inserts

  • Ensure the following QuasarDB daemon option is enabled:

    "log_small_append_percentage": 5
    
  • This causes the database to emit a warning log when inserts are too small to be efficient. Look for messages like:

    warning small incremental insert detected: append increased data size for
    shard <table>/<shard_id> by only 3.6% (below threshold of 10%). This
    negatively affects write performance.
    
  • Such logs indicate that small incremental appends are occurring, which cause full shard rewrites and increase I/O pressure.
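
    If these warnings appear, the sketch below counts them per table to show where small appends are concentrated. The log path is an assumption; point it at wherever your qdbd logs are written.

      grep "small incremental insert detected" /var/log/qdb/qdbd.log \
          | grep -oE 'shard [^/]+' | sort | uniq -c | sort -rn | head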

Step 2: High-level mitigations

Depending on how your application interacts with the database, several common strategies are available:

  • Not using async push:

    • Consider enabling async push. It buffers writes server-side and merges them on disk more efficiently.

  • Already using async push:

    • Tune the pipeline configuration:

      • Increase the flush interval (5 minutes is a common setting in production).

      • Increase the pipeline buffer size (e.g., 4 GB).

    • These adjustments allow QuasarDB to coalesce more writes and reduce incremental shard rewrites.

  • Business constraints prevent tuning async pipelines:

    • Consider reducing the shard size. Smaller shards reduce the cost of rewriting each shard and thus mitigate write amplification.
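
      For newly created tables, a hedged illustration of choosing a smaller shard (the column definition and the exact WITH syntax are illustrative; check your version's CREATE TABLE reference):

        CREATE TABLE sensor_readings (value DOUBLE) WITH (shard_size = 15min)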

Step 3: Advanced tuning (RocksDB layer)

If none of the above options yield satisfactory improvements, the following low-level configuration options can be tuned. These are applied via the column_family_options section in qdbd.conf.

  • Increase `table_mem_budget` (default: 1 GB):

    • RocksDB sizes its memtables at 25% of table_mem_budget, so a larger budget yields larger memtables.

    • Larger memtables flush larger L0 files, improving efficiency for write-heavy workloads. 4 GB or 8 GB budgets are common in write-heavy production setups.

  • Set `sst_partitioner_threshold`:

    • Default behavior causes SST files to be split based on file size rather than shard boundaries.

    • Set this value to 50% of the target SST file size (e.g., 64 MB if the target is 128 MB).

    • Benefits:

      • Reduces shard fragmentation across SST files.

      • Minimizes write amplification and improves efficiency, particularly for S3, where random access is not possible.

  • Set `target_file_size_base` (default: 128 MB):

    • This sets the desired size of L0 SST files.

    • Larger files improve compaction throughput but may slightly increase read amplification, particularly with S3 storage.

    • When increasing this value, always tune sst_partitioner_threshold accordingly.

  • Set `max_write_buffer_number` (default: 2):

    • This sets the number of RocksDB memtables (write buffers) allowed in memory before flushing.

    • Higher values produce larger L0 flushes, which improve efficiency by deduplicating writes within large batches.

    • In production, values like 16 have shown good performance.

    • Example calculation:

      • With table_mem_budget = 4 GB, memtables are 1 GB each.

      • With max_write_buffer_number = 16, up to 16 GB of SST files may be flushed to L0.

  • Set `level0_file_num_compaction_trigger` (default: 2):

    • Number of L0 SST files that must accumulate before compaction to L1.

    • Increasing this allows larger, more efficient compactions.

    • Example:

      • With max_write_buffer_number = 16 and target_file_size_base = 128 MB, an L0 compaction trigger of 4 would result in a compaction job of ~512 MB.

    • Trade-off:

      • Improves write efficiency.

      • Increases read amplification for recently written data (as L0 is not globally sorted).
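
Taken together, a hedged qdbd.conf excerpt combining the options discussed above might look like the following (the values are the examples used in this section, expressed in bytes; check whether your version expects raw bytes or accepts unit suffixes):

    "column_family_options": {
        "table_mem_budget": 4294967296,
        "max_write_buffer_number": 16,
        "target_file_size_base": 134217728,
        "sst_partitioner_threshold": 67108864,
        "level0_file_num_compaction_trigger": 4
    }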

Final Note: For write-heavy workloads—especially those with small or frequent inserts—it is critical to combine high-level architectural choices (e.g., async push and shard layout) with low-level RocksDB tuning to avoid overwhelming the disk subsystem and maximize throughput.