1.3. Deduplication
1.3.1. Purpose
Data deduplication is a technique for eliminating repeated or redundant data. QuasarDB offers first-class support for dealing with duplicate data as part of the insertion process.
Our support for deduplication aims to solve the following problems:
Rogue processes sending repeating data: by automatically dropping duplicate data upon insertion, the impact of these processes is minimized and no data is modified on disk;
Auditability: deduplication ensures that once data is written, it is never modified, which is often a requirement for audit and compliance;
Referential transparency: upstream processes can retry any number of times (for example in case of partial failure), and the result remains the same.
1.3.2. Implementation
QuasarDB natively supports deduplication as part of the batch writer API, and provides various modes as described in the table below:
Mode | Effects
---|---
Disabled | No deduplication is performed
Full | New records will be dropped if (and only if) all values of a row match data already found in the table
Column-wise | New records will be dropped if (and only if) the values of a certain set of columns match data already found in the table
Often, from a business perspective, there are invariants that can be enforced using deduplication: for example, when you know that only a single measurement will ever be recorded per instrument_id for any given timestamp.
1.3.2.1. Example
Below is an example of how to use each of the deduplication modes with the Python pandas API:
import quasardb
import quasardb.pandas as qdbpd
import pandas as pd
def _create_dataframe():
    data = [[pd.to_datetime('2017-01-01'), 10, "1.0"],
            [pd.to_datetime('2017-01-04'), 11, "1.1"]]
    columns = ["$timestamp", "instrument_id", "version"]
    df = pd.DataFrame(data, columns=columns)

    # Tables are indexed by $timestamp; sort the index so rows are
    # inserted in chronological order.
    return df.set_index('$timestamp').sort_index()
df = _create_dataframe()
# Acquire connection to the cluster and a reference to the table we want to
# insert into
conn = quasardb.Cluster("qdb://127.0.0.1:2836")
table = conn.table("measurements")
# Disable deduplication (the default) by explicitly setting deduplicate to False;
# create the table if it doesn't exist
qdbpd.write_dataframe(df, conn, table, deduplicate=False, create=True)
# Full deduplication by setting deduplicate to True
qdbpd.write_dataframe(df, conn, table, deduplicate=True)
# Column-wise deduplication by providing a list of columns to use for deduplication.
# In this example, we deduplicate based on the timestamp, instrument_id and version:
qdbpd.write_dataframe(df, conn, table, deduplicate=['$timestamp', 'instrument_id', 'version'])
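To enforce the business invariant mentioned earlier, at most one measurement per instrument_id for any given timestamp, you would deduplicate on those two columns only:
# Column-wise deduplication on just the timestamp and instrument_id: a second
# measurement for the same instrument at the same timestamp is dropped, even
# if its version differs.
qdbpd.write_dataframe(df, conn, table, deduplicate=['$timestamp', 'instrument_id'])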
1.3.3. Alternatives
There are various alternatives to deduplication which, although related, differ in their use case and logic.
1.3.3.1. Versioning
Versioning is a technique of keeping track of multiple versions of the same row: rather than deduplicating or updating a record, one inserts a new version of the same row into the table. This enables keeping track of how a record or measurement evolves over time, while retaining the ability to efficiently query the latest version and/or reconstruct the state of the database at a certain point in time.
The notable differences from deduplication are:
New data is always inserted, rather than dropped;
Determining which data to process is done upon query time, rather than insertion time.
QuasarDB supports versioning using the RESTRICT TO query statement. See the howto on forecasting for a real-world example of how this works with QuasarDB.
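As a sketch of this pattern with the pandas API from the example above: a correction is written as a new row with the same timestamp and instrument_id but a higher version, with deduplication disabled so that both versions are retained. The table and column names are the ones assumed in the earlier example:
# Hypothetical correction: a second version of the 2017-01-01 measurement
# for instrument 10. Both rows remain in the table.
correction = pd.DataFrame(
    [[pd.to_datetime('2017-01-01'), 10, "2.0"]],
    columns=["$timestamp", "instrument_id", "version"],
).set_index('$timestamp')

# Insert without deduplication: the new version is appended alongside the
# original row rather than replacing it.
qdbpd.write_dataframe(correction, conn, table, deduplicate=False)
Selecting the latest version per row is then done at query time using RESTRICT TO.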
1.3.3.2. Insert truncation
Truncated inserts enable truncating an entire range of data and replacing it with new data in a single transactional process.
This is most commonly seen in data warehouses and ETL processes, where you want to (re-)load data into a table, without having any old/overlapping data in there.
Although the two are very similar, there are differences:
Rather than comparing data on a row-by-row basis as is done with deduplication, truncation always removes all old data;
New data is always written, never dropped;
Although it does provide referential transparency, it does not protect against rogue processes and does not provide auditability.
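As an illustration of a truncated insert with the same pandas API, and assuming your client version exposes the batch writer's truncation mode through a truncate parameter on write_dataframe:
# A minimal sketch, assuming write_dataframe accepts a truncate flag that
# erases the time range covered by the dataframe before inserting its rows
# in a single transaction.
df = _create_dataframe()
qdbpd.write_dataframe(df, conn, table, truncate=True)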
1.3.3.3. Upsert
Upsert is a common technique to tell a database to “insert or update” a record, and (often) works in conjunction with a primary key.
Upon invocation, the database first attempts to insert the record. If a primary key conflict is detected, the database updates the record to reflect the provided content. Upserts are most commonly used in relational databases and/or by ORMs where all you want is to make sure a certain record is persisted in the database, even if this means implicitly updating a previous version.
Although similar on the surface, upsert is very different from deduplication: every invocation of an upsert modifies the record. This has the following implications:
Every invocation always writes new data, and as such does not protect against rogue processes putting a lot of pressure on the database by repeatedly inserting data with the same key;
Auditability: because data is implicitly updated in place, the old version is lost and there is no way to reconstruct the state of the database at a given point in time.
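For contrast, here is a minimal sketch of an upsert against a relational database, using Python's built-in sqlite3 module and SQLite's ON CONFLICT clause; the table and column names are illustrative only:
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE measurements (instrument_id INTEGER PRIMARY KEY, version TEXT)")

# The first invocation inserts the row; every later invocation with the same
# primary key overwrites it in place, so the old version is lost.
for version in ("1.0", "1.1"):
    db.execute(
        "INSERT INTO measurements (instrument_id, version) VALUES (?, ?) "
        "ON CONFLICT(instrument_id) DO UPDATE SET version = excluded.version",
        (10, version),
    )

print(db.execute("SELECT * FROM measurements").fetchall())  # [(10, '1.1')]
With deduplication, by contrast, the second write would simply be dropped and the original row left untouched.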