1.2. Async Pipelines

Async pipelines are a server-side mechanism in QuasarDB that optimizes high-frequency, incremental data ingestion by buffering writes in memory before flushing them to disk. This reduces write amplification and improves write performance while maintaining data integrity.

1.2.1. When to Use Async Pipelines

Async pipelines are particularly beneficial in scenarios involving:

Streaming or Real-Time Data Ingestion: Continuous insertion of small data points across multiple tables or shards.
High-Frequency Trading Systems: Environments where data is rapidly generated and needs efficient handling.
IoT Data Collection: Systems collecting time-series data from numerous sensors or devices.

In contrast, for batch insertions where large volumes of data are written infrequently, the traditional batch inserter may suffice. We still recommend using fast insertion mode in those cases.

1.2.2. Understanding Write Amplification

QuasarDB uses a Log-Structured Merge-tree (LSM-tree) architecture, which is efficient for write operations. However, incremental inserts can lead to write amplification: because of how LSM-trees organize data, storing a single data point can require several physical writes. This hurts both performance and storage efficiency.
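
Write amplification is commonly expressed as the ratio of bytes physically written to storage to bytes submitted by clients. The sketch below shows only that arithmetic. The function name and the figures are illustrative assumptions, not QuasarDB measurements or APIs.

```python
def write_amplification(bytes_written_to_disk: int, bytes_submitted: int) -> float:
    """Return physical bytes written divided by logical bytes submitted."""
    if bytes_submitted <= 0:
        raise ValueError("bytes_submitted must be positive")
    return bytes_written_to_disk / bytes_submitted


MIB = 1024 ** 2

# Illustrative numbers only, not QuasarDB measurements:
# clients submit 1 MiB of rows, and storage ends up writing 12 MiB
# once repeated rewrites of the same data are counted.
ratio = write_amplification(bytes_written_to_disk=12 * MIB, bytes_submitted=1 * MIB)
print(ratio)  # 12.0
```

A ratio close to 1 means each submitted byte is written roughly once; larger values indicate more rewriting.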

For more details on LSM-trees and compaction, refer to the Compaction and Trimming section.

1.2.3. How Async Pipelines Work

Async pipelines mitigate write amplification by:

Buffering Incoming Writes: Data is temporarily stored in memory, reducing immediate disk I/O.
Merging Operations: Similar write operations are combined, minimizing redundant writes.
Periodic Flushing: Buffered data is written to disk at configured intervals or when certain thresholds are met.

This process ensures efficient disk usage and improved write performance.
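
The three steps above can be modeled with a small in-memory sketch. This is a toy illustration of the buffer, merge, and flush idea, not QuasarDB's server implementation: the `BufferedPipeline` class, its parameters, the row shape, and the `sink` callback are all hypothetical names chosen for this example.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Row = Tuple[int, float]  # (timestamp, value); hypothetical row shape


class BufferedPipeline:
    """Toy model of buffer -> merge -> flush. Not QuasarDB's implementation."""

    def __init__(
        self,
        flush_interval_s: float,
        max_buffered_rows: int,
        sink: Callable[[str, List[Row]], None],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush_interval_s = flush_interval_s
        self._max_buffered_rows = max_buffered_rows
        self._sink = sink  # stands in for the disk write
        self._clock = clock
        self._buffer: Dict[str, List[Row]] = defaultdict(list)
        self._buffered_rows = 0
        self._last_flush = clock()

    def write(self, table: str, rows: List[Row]) -> None:
        # Buffering: rows stay in memory instead of going to the sink.
        # Merging: rows for the same table accumulate in one list, so they
        # later reach the sink as a single write.
        self._buffer[table].extend(rows)
        self._buffered_rows += len(rows)
        interval_elapsed = self._clock() - self._last_flush >= self._flush_interval_s
        if self._buffered_rows >= self._max_buffered_rows or interval_elapsed:
            self.flush()

    def flush(self) -> None:
        # Flushing: one sink call per table, then reset the buffer.
        for table, rows in self._buffer.items():
            self._sink(table, sorted(rows))
        self._buffer.clear()
        self._buffered_rows = 0
        self._last_flush = self._clock()


disk_writes: List[Tuple[str, List[Row]]] = []
pipeline = BufferedPipeline(
    flush_interval_s=60.0,
    max_buffered_rows=4,
    sink=lambda table, rows: disk_writes.append((table, rows)),
)
pipeline.write("trades", [(1, 10.0)])
pipeline.write("trades", [(2, 10.5)])
pipeline.write("quotes", [(1, 9.9)])
pipeline.write("trades", [(3, 11.0)])  # fourth row reaches the row threshold

print(len(disk_writes))  # 2
```

Four client writes result in two sink calls, one per table. In this model the time check only runs when a write arrives; a real server would flush on its own schedule.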

1.2.4. Configuration Options

Async pipelines can be tailored to specific workloads through various configuration parameters:

Number of Pipelines: Determines parallelism. Typical values are 4, 8, or 16.
Pipeline Buffer Size: Amount of memory allocated per pipeline. Common sizes range from 1GB to 4GB.
Flush Interval: Frequency at which data is flushed to disk. Configurable between 3 seconds and 900 seconds.

These settings can be adjusted in the QuasarDB configuration file. For detailed configuration instructions, see Asynchronous Timeseries Inserter.
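
Before changing these parameters, it can help to check how much memory a given combination may claim. The sketch below is a planning aid only: `AsyncPipelineSettings` and its field names are hypothetical and do not correspond to keys in the QuasarDB configuration file, and buffer sizes are assumed to be binary gigabytes.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class AsyncPipelineSettings:
    """Hypothetical container; field names are not QuasarDB configuration keys."""

    pipelines: int
    buffer_size_bytes: int
    flush_interval_s: int

    def validate(self) -> None:
        if self.pipelines < 1:
            raise ValueError("pipelines must be at least 1")
        if self.buffer_size_bytes < 1:
            raise ValueError("buffer_size_bytes must be positive")
        # Range taken from the text above: 3 to 900 seconds.
        if not 3 <= self.flush_interval_s <= 900:
            raise ValueError("flush_interval_s must be between 3 and 900")

    def total_buffer_bytes(self) -> int:
        """Upper bound on memory the pipeline buffers can hold at once."""
        return self.pipelines * self.buffer_size_bytes


settings = AsyncPipelineSettings(
    pipelines=8, buffer_size_bytes=2 * GIB, flush_interval_s=60
)
settings.validate()
print(settings.total_buffer_bytes() // GIB)  # 16
```

With 8 pipelines of 2 GiB each, the buffers alone can occupy up to 16 GiB, which has to fit alongside the server's other memory needs.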

1.2.5. Monitoring and Metrics

Monitoring async pipelines is crucial for maintaining optimal performance. Key metrics include:

`async_pipelines.busy_denied_count`: Number of write requests denied due to full buffers.
`async_pipelines.buffer.total_bytes`: Current memory usage of the pipelines.
`async_pipelines.write.bytes_total`: Total bytes written to disk.
`async_pipelines.write.elapsed_us`: Time taken for write operations.

For a comprehensive list of metrics and their interpretations, refer to Storage - Async Pipelines.
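
As a rough illustration of how these metrics might be combined, the sketch below derives rates from two metric snapshots. It assumes that `busy_denied_count`, `write.bytes_total`, and `write.elapsed_us` are cumulative counters and that `buffer.total_bytes` is a point-in-time value; verify this against the metrics reference. The `summarize` function, the `buffer_capacity_bytes` parameter, and the sample values are hypothetical.

```python
from typing import Dict

Snapshot = Dict[str, int]


def summarize(
    previous: Snapshot, current: Snapshot, buffer_capacity_bytes: int
) -> Dict[str, float]:
    """Derive rates from two snapshots, assuming cumulative counters."""

    def delta(name: str) -> int:
        return current[name] - previous[name]

    written = delta("async_pipelines.write.bytes_total")
    elapsed_s = delta("async_pipelines.write.elapsed_us") / 1_000_000
    return {
        # Bytes written per second of time spent in write operations.
        "write_bytes_per_second": written / elapsed_s if elapsed_s > 0 else 0.0,
        # New denials since the previous snapshot; nonzero suggests full buffers.
        "new_denied_requests": float(delta("async_pipelines.busy_denied_count")),
        # Share of the configured buffer capacity currently in use.
        "buffer_utilization": current["async_pipelines.buffer.total_bytes"]
        / buffer_capacity_bytes,
    }


GIB = 1024 ** 3

# Made-up sample values for illustration.
before = {
    "async_pipelines.busy_denied_count": 3,
    "async_pipelines.buffer.total_bytes": 2 * GIB,
    "async_pipelines.write.bytes_total": 1_000_000,
    "async_pipelines.write.elapsed_us": 500_000,
}
after = {
    "async_pipelines.busy_denied_count": 3,
    "async_pipelines.buffer.total_bytes": 4 * GIB,
    "async_pipelines.write.bytes_total": 5_000_000,
    "async_pipelines.write.elapsed_us": 2_500_000,
}

report = summarize(before, after, buffer_capacity_bytes=16 * GIB)
print(report["write_bytes_per_second"])  # 2000000.0
print(report["buffer_utilization"])      # 0.25
```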

1.2.6. Visualizing Async Pipeline Activity

To aid in understanding and troubleshooting, QuasarDB provides dashboards and charts illustrating async pipeline behavior. These visual tools can help identify patterns, bottlenecks, and anomalies.

Example charts include:

Memory Usage Over Time: Tracks buffer utilization.
Write Throughput: Displays the volume of data written to disk.
Error Rates: Highlights occurrences of denied write requests.

For more information and examples, see Async Pipelines.

1.2.7. Considerations and Best Practices

Resource Allocation: Ensure sufficient memory and CPU resources are available to accommodate the configured number of pipelines and buffer sizes.
Replication: If replication is enabled, async pipeline data is also replicated, impacting network and storage resources.
Data Consistency: Proper configuration and monitoring, in particular watching for write requests denied because buffers are full, help maintain data integrity and prevent loss during high-throughput operations.

By understanding and effectively configuring async pipelines, users can significantly enhance the performance and reliability of their QuasarDB deployments, especially in environments requiring real-time data processing.