1.3. Deduplication
1.3.1. Purpose
Data deduplication is a technique for eliminating repeated or redundant data. QuasarDB offers first-class support for dealing with duplicate data as part of the insertion process.
Our support for deduplication aims to solve the following problems:
Rogue processes sending repeating data: by automatically dropping duplicate data upon insertion, the impact of these processes is minimized and no data is modified on disk;
Auditability: deduplication ensures that once data is written, it is never modified, which is often a requirement for audit and compliance;
Referential transparency: upstream processes can retry any number of times (for example in case of partial failure), and the result remains the same.
1.3.2. Implementation
QuasarDB natively supports deduplication as part of the batch writer API, and provides various modes as described in the table below:
Mode | Effects
---|---
Disabled | No deduplication is performed
Full | New records will be dropped if (and only if) all values of a row match data already found in the table
Column-wise | New records will be dropped if (and only if) the values of a certain set of columns match data already found in the table
Often, from a business perspective, there are invariants that can be enforced using deduplication: for example, when you know that only a single measurement will ever be recorded per instrument_id for any given timestamp.
1.3.2.1. Example
Below is an example of how to use each of the deduplication modes with the Python pandas API:
import quasardb
import quasardb.pandas as qdbpd
import pandas as pd
def _create_dataframe():
    data = [[pd.to_datetime('2017-01-01'), 10, "1.0"],
            [pd.to_datetime('2017-01-04'), 11, "1.1"]]
    columns = ["$timestamp", "instrument_id", "version"]
    df = pd.DataFrame(data, columns=columns)

    # Tables are indexed by $timestamp; sort the index so rows are
    # inserted in chronological order.
    return df.set_index('$timestamp').sort_index()
df = _create_dataframe()
# Acquire connection to the cluster and a reference to the table we want to
# insert into
conn = quasardb.Cluster("qdb://127.0.0.1:2836")
table = conn.table("measurements")
# Disable deduplication (the default) by explicitly setting deduplicate to False;
# create the table if it doesn't exist
qdbpd.write_dataframe(df, conn, table, deduplicate=False, create=True)
# Full deduplication by setting deduplicate to True
qdbpd.write_dataframe(df, conn, table, deduplicate=True)
# Column-wise deduplication by providing a list of columns to use for deduplication.
# In this example, we deduplicate based on the timestamp, instrument_id and version:
qdbpd.write_dataframe(df, conn, table, deduplicate=['$timestamp', 'instrument_id', 'version'])
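To enforce the business invariant mentioned earlier, at most one measurement per instrument_id for any given timestamp, you would deduplicate on those two columns only:
# Column-wise deduplication on just the timestamp and instrument_id: a second
# measurement for the same instrument at the same timestamp is dropped, even
# if its version differs.
qdbpd.write_dataframe(df, conn, table, deduplicate=['$timestamp', 'instrument_id'])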
1.3.3. Alternatives
There are various alternatives to deduplication which, although related, differ in their use case and logic.
1.3.3.1. Versioning
Versioning is a technique of keeping track of multiple versions of the same row: rather than deduplicating or updating a record, one inserts a new version of the same row into the table. This enables keeping track of how a record or measurement evolves over time, while retaining the ability to efficiently query the latest version and/or reconstruct the state of the database at a certain point in time.
The notable differences from deduplication are:
New data is always inserted, rather than dropped;
Determining which data to process is done upon query time, rather than insertion time.
QuasarDB supports versioning using the RESTRICT TO query statement. See the howto on forecasting for a real-world example of how this works with QuasarDB.
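As a sketch of this pattern with the pandas API from the example above: a correction is written as a new row with the same timestamp and instrument_id but a higher version, with deduplication disabled so that both versions are retained. The table and column names are the ones assumed in the earlier example:
# Hypothetical correction: a second version of the 2017-01-01 measurement
# for instrument 10. Both rows remain in the table.
correction = pd.DataFrame(
    [[pd.to_datetime('2017-01-01'), 10, "2.0"]],
    columns=["$timestamp", "instrument_id", "version"],
).set_index('$timestamp')

# Insert without deduplication: the new version is appended alongside the
# original row rather than replacing it.
qdbpd.write_dataframe(correction, conn, table, deduplicate=False)
Selecting the latest version per row is then done at query time using RESTRICT TO.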
1.3.3.2. Insert truncation
Truncated inserts enable truncating an entire range of data and replacing it with new data in a single transactional process.
This is most commonly seen in data warehouses and ETL processes, where you want to (re-)load data into a table, without having any old/overlapping data in there.
Although the two are very similar, there are differences:
Rather than comparing data on a row-by-row basis as is done with deduplication, truncation always removes all old data;
New data is always written, never dropped;
Although it does provide referential transparency, it does not protect against rogue processes and does not provide auditability.
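As an illustration of a truncated insert with the same pandas API, and assuming your client version exposes the batch writer's truncation mode through a truncate parameter on write_dataframe:
# A minimal sketch, assuming write_dataframe accepts a truncate flag that
# erases the time range covered by the dataframe before inserting its rows
# in a single transaction.
df = _create_dataframe()
qdbpd.write_dataframe(df, conn, table, truncate=True)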
1.3.3.3. Upsert
Upsert is a common technique to tell a database to “insert or update” a record, and (often) works in conjunction with a primary key.
Upon invocation, the database first attempts to insert the record. If a primary key conflict is detected, the database updates the record to reflect the provided content. Upserts are most commonly used in relational databases and/or by ORMs where all you want is to make sure a certain record is persisted in the database, even if this means implicitly updating a previous version.
Although similar on the surface, upsert is very different from deduplication: every invocation of an upsert modifies the record. This has the following implications:
Every invocation always writes new data, and as such does not protect against rogue processes putting a lot of pressure on the database by repeatedly inserting data with the same key;
Auditability: because data is implicitly updated in place, the old version is lost and there is no way to reconstruct the state of the database at a given point in time.
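For contrast, here is a minimal sketch of an upsert against a relational database, using Python's built-in sqlite3 module and SQLite's ON CONFLICT clause; the table and column names are illustrative only:
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE measurements (instrument_id INTEGER PRIMARY KEY, version TEXT)")

# The first invocation inserts the row; every later invocation with the same
# primary key overwrites it in place, so the old version is lost.
for version in ("1.0", "1.1"):
    db.execute(
        "INSERT INTO measurements (instrument_id, version) VALUES (?, ?) "
        "ON CONFLICT(instrument_id) DO UPDATE SET version = excluded.version",
        (10, version),
    )

print(db.execute("SELECT * FROM measurements").fetchall())  # [(10, '1.1')]
With deduplication, by contrast, the second write would simply be dropped and the original row left untouched.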