1.1. Batch inserter
1.1.1. Purpose
The QuasarDB batch inserter provides you with an interface to send data to the QuasarDB cluster. The data is buffered client-side and sent in a single logical batch, ensuring efficiency and transactional consistency.
The batch inserter supports inserting into multiple tables in a single, atomic operation.
1.1.2. APIs
There are currently three different batch insertion APIs:
API | Status | Description |
---|---|---|
Regular | Deprecated | Regular batch inserter, which exposes a row-based insertion API. This API has been deprecated for performance reasons. |
Pinned | Deprecated | Pinned writer, which exposes a column-oriented API. This API has been deprecated for performance reasons. |
Exp(erimental) | Recommended | Next generation batch insertion API, available since 3.13.1. It exposes a column-oriented API, and all new features and functionality are available under this API. |
We strongly recommend using the new, experimental batch writer API for any newly written code. Old code should be ported to this API.
1.1.3. Insertion modes
The batch writer has various modes of operation, each with different tradeoffs:
Insertion mode | Description | Use case(s) |
---|---|---|
Default | Transactional insertion mode that employs Copy-on-Write. | General purpose |
Fast | Transactional insert that does not employ Copy-on-Write. Newly written data may be visible to queries before the transaction is fully completed. | High-volume inserts constrained by disk I/O |
Asynchronous | Data is buffered in-memory in the QuasarDB daemon nodes before writing to disk. Data from multiple sources is buffered together, and periodically flushed to disk. | Streaming data where multiple processes simultaneously write into the same table(s) |
Truncate | Replaces any existing data with the provided data in a single transactional operation. | Replay of historical data |
If you intend to do many small, incremental inserts into the same table, we recommend using the asynchronous insertion mode.
If you process and insert data in larger batches, we recommend using the fast insertion mode.
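The recommendations above can be summarized as a small decision helper. This is only a sketch: the `InsertionMode` enum and the `recommend_mode` function are hypothetical names made up for illustration and are not part of the QuasarDB API.

```python
from enum import Enum


class InsertionMode(Enum):
    """Illustrative labels for the four modes; not QuasarDB identifiers."""
    DEFAULT = "default"
    FAST = "fast"
    ASYNCHRONOUS = "asynchronous"
    TRUNCATE = "truncate"


def recommend_mode(replaying_history: bool = False,
                   small_incremental_inserts: bool = False,
                   large_batches: bool = False) -> InsertionMode:
    """Map a workload description onto the mode suggested in the text."""
    if replaying_history:
        # Existing data is replaced by the provided data.
        return InsertionMode.TRUNCATE
    if small_incremental_inserts:
        # Many small writes to the same table(s).
        return InsertionMode.ASYNCHRONOUS
    if large_batches:
        # Larger batches, typically limited by disk I/O.
        return InsertionMode.FAST
    # General-purpose fallback.
    return InsertionMode.DEFAULT


print(recommend_mode(small_incremental_inserts=True).value)
print(recommend_mode(large_batches=True).value)
```

The order of the checks is a design choice of this sketch; a real application would weigh these properties against its own consistency and throughput requirements.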
1.1.4. Options
In addition to the different insertion modes, the batch writer API provides various options and parameters that affect its operation, which are documented below.
Name | Description |
---|---|
Truncate ranges | When using Truncate insertion mode, these ranges specify which ranges to truncate. All newly inserted data must fall within these ranges. |
Duplicates | Allows specification of deduplication options. See the documentation on deduplication for more information. |
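Because all newly inserted data must fall within the truncate ranges, it can help to check timestamps on the client before pushing. The sketch below is not a QuasarDB API: `rows_outside_ranges` is a hypothetical helper, and treating each range as half-open (start inclusive, end exclusive) is an assumption that should be verified against the API you use.

```python
from datetime import datetime
from typing import Iterable, List, Tuple

TimeRange = Tuple[datetime, datetime]


def rows_outside_ranges(timestamps: Iterable[datetime],
                        ranges: List[TimeRange]) -> List[datetime]:
    """Return the timestamps not covered by any range.

    Assumption: ranges are half-open, i.e. start <= t < end.
    """
    return [t for t in timestamps
            if not any(start <= t < end for start, end in ranges)]


# One truncate range covering 1 January 2024.
ranges = [(datetime(2024, 1, 1), datetime(2024, 1, 2))]
timestamps = [datetime(2024, 1, 1, 12), datetime(2024, 1, 2)]

outside = rows_outside_ranges(timestamps, ranges)
if outside:
    print(f"{len(outside)} timestamp(s) fall outside the truncate ranges")
```

Here the second timestamp equals the end of the range and is therefore reported as outside under the half-open assumption.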
1.1.5. Batch size
The batch inserter parallelizes its operations based on the configured number of threads. It first groups the entire operation into batches of tasks, and then executes these in parallel.
You can configure the maximum size of a batch by setting the client_max_batch_load option. For example, if you are running your application with a connection parallelism of 4, your max batch load is 21, and you attempt to insert data into 210 different shards, these tasks will be grouped into 10 batches (210 / 21) and executed in parallel using 4 threads.
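In Python, you can configure it as follows:

    # Assumes the connection pool (`pool`) has already been initialized.
    with pool.instance().connect() as conn:
        # Group work into batches of at most 42 tasks.
        conn.options().set_client_max_batch_load(42)
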
The batch inserter will now group the operation into batches of up to 42 tasks.
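The grouping arithmetic can be illustrated in plain Python. This sketch only reproduces the numbers from the example (210 tasks, a max batch load of 21 or 42); `group_into_batches` is a hypothetical helper, and how the client actually schedules the batches across threads is not modeled here.

```python
from typing import List, Sequence


def group_into_batches(tasks: Sequence[int], max_batch_load: int) -> List[Sequence[int]]:
    """Split tasks into consecutive batches of at most max_batch_load items."""
    return [tasks[i:i + max_batch_load]
            for i in range(0, len(tasks), max_batch_load)]


# One task per shard, as in the example above.
shard_tasks = list(range(210))

batches_21 = group_into_batches(shard_tasks, 21)
batches_42 = group_into_batches(shard_tasks, 42)

print(len(batches_21))  # 10 batches of 21 tasks
print(len(batches_42))  # 5 batches of 42 tasks
```

A larger max batch load therefore means fewer, bigger batches for the same number of shards.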
1.1.6. Usage
The steps involved in using the batch writer API are as follows:
1. Initialize a local batch inserter instance, providing it with the tables and columns you want to insert data for. Note that specifying multiple tables is supported: this allows you to insert data into multiple tables in one atomic operation.
2. Prepare/buffer the batch you want to insert. Buffering locally before sending lets the data be transmitted at maximum throughput, which keeps the server side efficient.
3. Push the batch to the cluster.
4. If necessary, go back to step 2 to send additional batches.
For any insertion mode that is not asynchronous, we recommend batch sizes as large as possible: batches of millions of rows are not uncommon and are encouraged.
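The initialize, buffer, push, repeat cycle described above can be sketched with a stand-in class. Everything in this example is hypothetical: `LocalBatch` is not the QuasarDB batch writer, the `trades` and `quotes` table names are made up, and the `send` callback merely stands in for the network push so that the flow can be run without a cluster.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Row = Tuple[int, float]  # (timestamp, value); purely illustrative


class LocalBatch:
    """Hypothetical client-side buffer; NOT the QuasarDB API."""

    def __init__(self, tables: List[str],
                 send: Callable[[Dict[str, List[Row]]], None]) -> None:
        self._tables = set(tables)
        self._send = send
        self._buffer: Dict[str, List[Row]] = defaultdict(list)

    def add(self, table: str, row: Row) -> None:
        """Step 2: buffer a row locally; nothing is sent yet."""
        if table not in self._tables:
            raise KeyError(f"unknown table: {table}")
        self._buffer[table].append(row)

    def push(self) -> None:
        """Step 3: hand over all buffered tables in one call, then reset."""
        self._send(dict(self._buffer))
        self._buffer.clear()


sent: List[Dict[str, List[Row]]] = []

# Step 1: declare the tables up front (two tables, pushed together).
batch = LocalBatch(["trades", "quotes"], send=sent.append)

for _ in range(2):            # Step 4: repeat for additional batches.
    for ts in range(3):       # Step 2: buffer rows for both tables.
        batch.add("trades", (ts, 1.0))
        batch.add("quotes", (ts, 2.0))
    batch.push()              # Step 3: one push per buffered batch.

print(len(sent), "batches pushed")
```

In this sketch each push carries the rows for both tables together, mirroring the idea of inserting into multiple tables in one operation; the atomicity itself is provided by the cluster and is not modeled here.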