7. RocksDB Tuning Guide

    This guide provides a collection of best practices to optimize RocksDB for specific use cases and troubleshoot performance issues. It is based on QuasarDB 3.14 and later.

    7.1. Introduction

    Tuning RocksDB involves making tradeoffs between various factors. At a high level, you’ll be balancing:

    1. Read Amplification vs. Write Amplification

    2. Space Amplification vs. Read/Write Amplification

    3. Memory Usage vs. Read/Write Amplification

    A read amplification of 10 means that in order to read 1GB of data, you actually need to retrieve 10GB from disk.

    A write amplification of 10 means that writing 1GB of data into the database results in 10GB being written to disk, because the data is re-written during flushes and compactions.
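    The two definitions above reduce to the same ratio: bytes physically moved on disk divided by bytes the application asked for. A minimal sketch of that arithmetic follows; the function name and the sample figures are illustrative assumptions, not part of QuasarDB or RocksDB.

```python
GB = 1000 ** 3


def amplification(physical_bytes: float, logical_bytes: float) -> float:
    """Ratio of bytes moved on disk to bytes requested by the application."""
    if logical_bytes <= 0:
        raise ValueError("logical_bytes must be positive")
    return physical_bytes / logical_bytes


# Reading 1GB of data required fetching 10GB from disk.
read_amp = amplification(10 * GB, 1 * GB)

# Writing 1GB of data caused 10GB of total disk writes.
write_amp = amplification(10 * GB, 1 * GB)

print(f"read amplification:  {read_amp:.1f}")
print(f"write amplification: {write_amp:.1f}")
```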

    7.1.1. General Configuration Options

    Config File

    Description

    block_size

    This determines the size of each RocksDB block. RocksDB reads and writes (and compresses) data in blocks. This should always be a multiple of your filesystem’s block size. A larger block size increases compression and storage efficiency, but may cause read amplification.

    bottommost_compression

    The compression algorithm RocksDB uses for the “oldest” data. Default is zstd. Use lz4 as a much faster alternative with a worse compression ratio.

    direct_read and direct_write

    Bypass filesystem buffers/caches when reading from (direct_read) or writing to (direct_write) disk. This can reduce memory usage, but use with caution.

    level_compaction_dynamic_level_bytes

    Use an algorithm to dynamically determine the optimal base level bytes for leveled compaction, based on the total dataset size. Leave this at its default unless you know exactly how much data you intend to store in the database. See also: Dynamic level.

    sync_every_write

    Causes flush() to be called after every write. Will increase durability and reliability at the expense of write throughput. Defaults to false.

    sst_partitioner_threshold

    The threshold for which the SST partitioner “cuts off” SST files. This can cause SST files to have perfectly aligned starts/ends of shards, and in turn may reduce read amplification. Especially useful with S3 storage backend, where there’s a high cost of accessing SST files.

    Should be a number between 0 and 100. Once an SST file has been filled to sst_partitioner_threshold percent, writing to that file stops as soon as a new table or shard is reached. Set to 0 to disable. Defaults to 0. See also: SST Partitioner.

    table_mem_budget

    The amount of memory allocated to each column family; this includes memtables and table caches. By default, the size of each memtable is 25% of the table_mem_budget. If you wish to increase the size of the memtables (advised in the Troubleshooting section), increase this value.

    data_cache

    Shared RocksDB-level data cache per column family. Leave this at its default value, as values that are too high prevent QuasarDB from caching at a higher level.
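    Two of the rules in the table above are simple arithmetic and can be checked before editing the configuration: block_size should be a multiple of the filesystem block size, and the default memtable size is 25% of table_mem_budget. The sketch below is a hedged illustration. The helper names and sample values are assumptions, and os.statvfs is only available on Unix-like systems.

```python
import os


def is_block_aligned(block_size: int, fs_block_size: int) -> bool:
    """True if block_size is a positive multiple of the filesystem block size."""
    return block_size > 0 and fs_block_size > 0 and block_size % fs_block_size == 0


def default_memtable_size(table_mem_budget: int) -> int:
    """Memtable size implied by the 25% default described in the table."""
    return table_mem_budget // 4


if __name__ == "__main__":
    # Hypothetical values for illustration only, not QuasarDB defaults.
    candidate_block_size = 16 * 1024
    budget = 1024 * 1024 * 1024

    # "/" stands in for the directory that holds the database files.
    fs_block_size = os.statvfs("/").f_bsize
    print("aligned:", is_block_aligned(candidate_block_size, fs_block_size))
    print("memtable bytes:", default_memtable_size(budget))
```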

    7.1.2. Low-Level RocksDB Options

    Troubleshooting may involve low-level RocksDB configuration options. These can be provided using the column_family_options RocksDB option, and should be separated using semicolons.

    For example:

    "column_family_options": "soft_pending_compaction_bytes_limit = 0; hard_pending_compaction_bytes_limit = 0;"
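    When several low-level options are tuned at once, it can be less error-prone to generate the semicolon-separated string than to type it by hand. The helper below is a hypothetical sketch (it is not part of any QuasarDB tooling) that reproduces the format of the example above.

```python
def build_column_family_options(options: dict) -> str:
    """Join option/value pairs as 'name = value;' entries separated by spaces."""
    return " ".join(f"{name} = {value};" for name, value in options.items())


overrides = {
    "soft_pending_compaction_bytes_limit": 0,
    "hard_pending_compaction_bytes_limit": 0,
}

# Emits a line in the same shape as the example in this section.
print(f'"column_family_options": "{build_column_family_options(overrides)}"')
```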
    

    7.2. Troubleshooting

    7.2.1. Slowing Down Writes

    If you’ve observed a slowdown in write operations, follow these troubleshooting steps:

    1. Check qdbd.log

    Review your qdbd.log for logs like the following:

    could not process table insert [tablename] (bucket ..ts.[tablename].bkt.000001628193600000) from client id 4: Resource temporarily unavailable
    

    When you see Resource temporarily unavailable, it indicates that all partitions are currently blocked, resulting in timeouts while acquiring a free partition.

    2. Analyze Disk I/O

    Analyze disk I/O using tools like iotop or dstat. If you observe minimal disk I/O, the issue is unlikely to be related to RocksDB. Continue reading if significant disk I/O is evident.

    3. High Read-IO, Low Write-IO

    If you notice high read-IO but low write-IO, it suggests potential read amplification and the presence of uncompacted data. Perform the following checks:

    • Is auto-compaction disabled? If yes, increase the frequency of running manual compaction.

    • If auto-compaction is enabled, it implies that you’re writing data faster than it can be compacted.

    To address this, consider:

    • Increasing the number of rocksdb lo threads or the number of subcompactions allowed.

    • Fine-tuning RocksDB settings.

    4. High Read-IO, High Write-IO

    When both read and write IO are high, RocksDB may be frequently performing compactions and flushes. Use htop to monitor which QuasarDB threads are active:

    • Many rocksdb hi threads indicate frequent flushes.

    • Numerous rocksdb lo threads imply heavy compactions.

    Grep the RocksDB logs for entries such as:

    • “Stopping writes because of [reason]”

    • “Stalling writes because of [reason]”

    If found, you’ve identified the bottleneck. RocksDB is slowing down new writes due to ingestion constraints.
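    Grepping for these messages can be automated. The sketch below classifies stall and stop messages by the causes covered in this section. The sample lines are copied from the log excerpts in this guide; the cause labels and function name are illustrative assumptions, and real RocksDB log lines typically carry additional prefixes, which the search tolerates.

```python
import re

STALL_PATTERN = re.compile(r"(Stalling|Stopping) writes because (.+)")

# Substring of the logged reason -> cause heading used in this guide.
CAUSES = {
    "immutable memtables": "too many memtables",
    "level-0 files": "too many level-0 SST files",
    "pending compaction bytes": "too much pending compaction data",
}


def classify(line: str):
    """Return (action, cause) for a stall/stop log line, or None otherwise."""
    match = STALL_PATTERN.search(line)
    if match is None:
        return None
    action, reason = match.group(1), match.group(2)
    for needle, cause in CAUSES.items():
        if needle in reason:
            return action, cause
    return action, "other: " + reason.strip()


sample_lines = [
    "Stalling writes because we have 20 level-0 files",
    "Stopping writes because of estimated pending compaction bytes 10000000",
    "an unrelated log line",
]

for line in sample_lines:
    print(classify(line))
```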

    Causes:

    Too Many Memtables

    If there are too many memtables, RocksDB struggles to flush data to disk. Look for logs like:

    Stalling writes because we have 5 immutable memtables (waiting for flush), `max_write_buffer_number` is set to 6
    

    Possible mitigations, in order of preference:

    1. Flush Faster:

      • This works if you are not already at the write I/O limit of your underlying block device.

      • Ensure you have at least 2 rocksdb:hi threads.

      • Consider allocating more hi threads and setting max_background_flushes accordingly.

      • Consider using tiered storage.

    2. Flush Less:

      • Flush less often by increasing buffer sizes or the minimum number of buffers to merge, which automatically deduplicates keys that are updated.

      • This strategy is effective if you frequently update shards in short timespans (e.g., lots of incremental inserts). It has no effect if you’re already using async pipelines.

      • Increase write_buffer_size (by default, 1/4th of table memory budget) to reduce write amplification.

      • Increase min_write_buffer_number_to_merge to a higher value to encourage larger initial memtable to L0 flushes.

    3. Buffer Spikes in Writes:

      • If you believe the slowdown is due to a temporary spike in writes, consider increasing the buffer number max_write_buffer_number to a higher value. For example, if you’re doing hourly batch loads of data, you may want a large buffer.

      • You will also need to increase the WAL size accordingly. For instance, if your write buffer number is 10, and each can be 128MB, you need at least a 10 * 128MB WAL size.

    4. Disable Auto-Compaction:

      • This is useful only in bulk-loading scenarios and can result in reads becoming slower over time.

      • If you choose to disable auto-compaction, at least consider periodically calling a manual compaction.
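    The WAL sizing rule under "Buffer Spikes in Writes" is a single multiplication: the WAL must be able to hold every write buffer at once. A minimal sketch, using the figures from that example (the function name is an assumption):

```python
MB = 1024 * 1024


def min_wal_size(max_write_buffer_number: int, write_buffer_size: int) -> int:
    """Smallest WAL size, in bytes, that can hold all write buffers at once."""
    return max_write_buffer_number * write_buffer_size


# 10 write buffers of 128MB each, as in the example above.
required = min_wal_size(10, 128 * MB)
print(f"minimum WAL size: {required // MB}MB")
```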

    Too Many Level-0 SST Files

    When you have an excess of level-0 SST files, RocksDB compaction can’t match memtable flush speed. Look for logs like:

    Stalling writes because we have 20 level-0 files
    

    Possible mitigations:

    • Increase level0_slowdown_writes_trigger and level0_stop_writes_trigger.

    Implications:

    • Reads of recent data will be slower due to additional read amplification

    • Additional write amplification due to intra-L0 compactions

    • Can potentially be mitigated by allow_cache_after_insert = True in qdbd.conf

    Too Much Pending Compaction Data

    If compaction can’t keep up with new data, consider:

    • Increasing the number of RocksDB lo threads if you have sufficient I/O bandwidth.

    • Ensure that RocksDB has a high enough max_subcompactions number. Each compaction thread can sustain about 100MB/sec of read/write throughput; for example, if you have 2GB/sec of read/write bandwidth, consider setting this number to 16 - 24.

    Keep in mind:

    QuasarDB automatically sets this to the number of rocksdb:lo threads. Note that the number of compaction threads is not the same thing as the number of subcompactions.

    You could have 16 compaction jobs (each running on a different rocksdb “thread”), each with 16 different subcompactions.

    Subcompactions only affect L0->L1 compactions; higher-level compactions cannot use subcompactions (but they can run in parallel).

    • Increasing max_background_compactions can enable multiple parallel compactions if you have enough read/write bandwidth.

    Note:

    This may trigger Intra-L0 compactions (discussed below)
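    The rule of thumb above (roughly 100MB/sec per compaction thread) gives a quick starting estimate for compaction parallelism. The sketch below only restates that arithmetic; the function name is an assumption and the result is a starting point to validate against observed I/O, not a recommended setting.

```python
def estimate_compaction_parallelism(disk_bandwidth_mb_s: float,
                                    per_thread_mb_s: float = 100.0) -> int:
    """Number of compaction threads needed to use the given disk bandwidth."""
    if per_thread_mb_s <= 0:
        raise ValueError("per_thread_mb_s must be positive")
    return max(1, int(disk_bandwidth_mb_s // per_thread_mb_s))


# 2GB/sec of read/write bandwidth, as in the example above.
print(estimate_compaction_parallelism(2000))
```

    For 2GB/sec this yields 20, which falls inside the 16 - 24 range suggested above.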

    Too Much Uncompacted Data

    Look for logs like:

    Stalling writes because of estimated pending compaction bytes 5000000
    Stopping writes because of estimated pending compaction bytes 10000000
    

    This means that RocksDB automatically slows down writes, or stops them altogether, so that compaction can keep up with the write rate. The goal is to maintain query speed by limiting read amplification, at the expense of write throughput.

    Possible mitigation:

    • Tune the soft_pending_compaction_bytes_limit and hard_pending_compaction_bytes_limit RocksDB configuration variables.

    • Set them to 0 to disable these slowdowns altogether.

    7.3. Compaction

    QuasarDB employs leveled compaction by default, or tiered+leveled when selecting universal.

    7.3.1. Compaction Process

    Compactions take overlapping SST files from Level N and “compact” them into Level N+1. This process is initially triggered if the number of L0 SST files exceeds level0_file_num_compaction_trigger. All L0 files, as they overlap, are merged into L1.

    Non-L0 levels have a size target. After the L0->L1 compaction, if the size of L1 surpasses its target, an additional compaction L1->L2 will be triggered, and so on. Each level typically increases the target size by an order of magnitude. For instance, L1 might be set to 1GB, L2 to 10GB, L3 to 100GB, and so forth. The multiplier between levels is configured using max_bytes_for_level_multiplier (default is 10).

    The target size of L1 is set using max_bytes_for_level_base, which defaults to 256MB. Subsequent levels are determined by multiplying this base size.

    For instance, with defaults of 256MB and multiplier=10, L7 will have a target of 256MB * (10^6) = 256TB.
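    The level targets described above follow a geometric progression, which the short sketch below tabulates. It uses decimal megabytes to match the arithmetic in the example, and it assumes static level sizing; with level_compaction_dynamic_level_bytes enabled, RocksDB derives the targets differently. The function name is illustrative.

```python
MB = 10 ** 6
TB = 10 ** 12


def level_target_bytes(level: int,
                       max_bytes_for_level_base: int = 256 * MB,
                       max_bytes_for_level_multiplier: int = 10) -> int:
    """Target size of level N (N >= 1): base * multiplier^(N - 1)."""
    if level < 1:
        raise ValueError("L0 has no size target; level must be >= 1")
    return max_bytes_for_level_base * max_bytes_for_level_multiplier ** (level - 1)


for level in range(1, 8):
    print(f"L{level}: {level_target_bytes(level) // MB}MB")
```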

    7.3.2. Advice

    Ensure that max_bytes_for_level_base is at least max_write_buffer_number * write_buffer_size. Otherwise, the L0 files combined will be larger than what can fit in L1 (or L2). In other words, if you increase write buffer sizes, also increase the size targets for each level.
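    The advice above can be expressed as a one-line check. The sketch below is an illustration with hypothetical values; the function name is an assumption and the figures are not QuasarDB defaults.

```python
MB = 1024 * 1024


def level_base_is_sufficient(max_bytes_for_level_base: int,
                             max_write_buffer_number: int,
                             write_buffer_size: int) -> bool:
    """True if L1's target can hold all write buffers flushed at full size."""
    return max_bytes_for_level_base >= max_write_buffer_number * write_buffer_size


# Hypothetical settings: 6 buffers of 64MB need 384MB, more than a 256MB base.
print(level_base_is_sufficient(256 * MB, 6, 64 * MB))

# Raising the base to 512MB satisfies the rule.
print(level_base_is_sufficient(512 * MB, 6, 64 * MB))
```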

    7.3.3. Intra-L0 Compaction

    Sometimes, “Intra-L0” compactions are triggered, combining multiple L0 SST files into a single L0 SST file. Normally, when level0_file_num_compaction_trigger is reached, it attempts to sort all L0 files into L1. However, this is not always possible, especially when max_background_compactions is greater than 1.

    Intra-L0 compaction merges files from L0 into larger L0 SST files, identifiable by logs like:

    Compacting 15@0 files to L0
    

    In this context, you can observe that it takes 15 L0 files and outputs them as L0.
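    To find out how often this happens, the log can be scanned for lines of that shape. The sketch below is a minimal example: the first sample line is the one shown above, while the second is an illustrative regular-compaction line whose exact format is an assumption.

```python
import re

# Matches compactions whose only input is L0 and whose output is also L0.
INTRA_L0 = re.compile(r"Compacting (\d+)@0 files to L0\b")


def intra_l0_file_count(line: str):
    """Return the number of L0 input files for an intra-L0 compaction, else None."""
    match = INTRA_L0.search(line)
    return int(match.group(1)) if match else None


print(intra_l0_file_count("Compacting 15@0 files to L0"))
print(intra_l0_file_count("Compacting 4@0 + 2@1 files to L1"))
```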

    These compactions can cause significant memory usage as they write larger SST files.

    Mitigations to avoid intra-L0 compactions and high memory usage include:

    • Keeping the number of write buffers small.

    • Increasing the size of individual write buffers.

    • Raising the base size of L1.

    • Increasing compaction speed as described in the previous section.

    7.4. Tuning: Compaction Strategy

    RocksDB offers two distinct compaction strategies:

    1. Leveled (QuasarDB Default): This strategy prioritizes reducing read amplification but comes at the cost of increased write amplification. It’s suitable when:

      • Storage resources are limited or expensive (e.g., SSD).

      • You aim to optimize read and query performance, even if it means higher write overhead.

    Please note that write amplification levels of > 10 are common with this strategy.

    2. Universal / Tiered: This approach focuses on minimizing write amplification while accepting increased read and space amplification as trade-offs. Universal compaction selects the “oldest” data files in a level, assuming that less recent data is less likely to change. Consider this strategy when:

      • Storage costs are low.

      • Your primary goal is to maximize throughput and reduce I/O overhead.

      Please note that there is a known double size issue that may require a full recompaction of all data in the database when using the universal strategy.

    You have the flexibility to switch between these compaction strategies even after creating a database. For instance, you can start with universal compaction and later transition to leveled compaction if you find that the disk usage overhead becomes a concern.