1.3. Deduplication

    1.3.1. Purpose

    Data deduplication is a technique that enables the elimination of repeating or redundant data. QuasarDB offers first-class support for dealing with duplicate data as part of the insertion process.

    Our support for deduplication aims to solve the following problems:

    • Rogue processes sending repeating data: by automatically dropping duplicate data upon insertion, the impact of these processes is minimal and no data is modified on disk;

    • Auditability: deduplication ensures that once data is written, it is never modified, which is often a requirement for audit and compliance;

    • Referential transparency: upstream processes can retry any number of times (for example in case of partial failure), and the result remains the same.

    1.3.2. Implementation

    QuasarDB natively supports deduplication as part of the batch writer API, and provides various modes as described in the table below:

    Mode          Effects
    ------------  ------------------------------------------------------------------------
    Disabled      No deduplication is performed.
    Full          New records will be dropped if (and only if) all values of a row match data already found in the table.
    Column-wise   New records will be dropped if (and only if) a certain set of columns’ values match data already found in the table.

    Often, from a business perspective, there are invariants that can be enforced using deduplication: for example, when you know that only a single measurement will ever be recorded per instrument_id for a given timestamp.

    1.3.2.1. Example

    Below is an example of how to use each of the deduplication modes with the Python Pandas API:

    import quasardb
    import quasardb.pandas as qdbpd
    
    # Assuming `df` is a pandas DataFrame
    df = _create_dataframe(..)
    
    # Acquire connection to the cluster and a reference to the table we want to
    # insert into
    conn = quasardb.Cluster("qdb://127.0.0.1:2836")
    table = conn.table("measurements")
    
    # Disable deduplication (the default) by explicitly setting drop_duplicates to False
    qdbpd.write_dataframe(df, conn, table, drop_duplicates=False)
    
    # Full deduplication by setting drop_duplicates to True
    qdbpd.write_dataframe(df, conn, table, drop_duplicates=True)
    
    # Column-wise deduplication by providing a list of columns to use for deduplication.
    # In this example, we deduplicate based on the timestamp, instrument_id and version:
    qdbpd.write_dataframe(df, conn, table, drop_duplicates=['instrument_id', 'version'])
    

    1.3.3. Alternatives

    There are various alternatives to deduplication which, although related, differ in their use case and logic.

    1.3.3.1. Versioning

    Versioning is a technique of keeping track of multiple versions of the same row: rather than deduplicating or updating a record, one inserts a new version of the same row into the table. This enables keeping track of how a record or measurement evolves over time, while retaining the ability to efficiently query the latest version and/or reconstruct the state of the database at a certain point in time.

    The notable differences compared to deduplication are:

    • New data is always inserted, rather than dropped;

    • Determining which data to process is done at query time, rather than at insertion time.

    QuasarDB supports versioning using the RESTRICT TO query statement. See the howto on forecasting for a real-world example of how this works with QuasarDB.
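
    Below is a minimal sketch of the versioning pattern using the same Pandas API as above. The table name measurements and the instrument_id, version and value columns are assumptions for this example, and the table is assumed to already exist with a matching schema; selecting the latest version is then handled at query time with RESTRICT TO.

    import pandas as pd

    import quasardb
    import quasardb.pandas as qdbpd

    conn = quasardb.Cluster("qdb://127.0.0.1:2836")
    table = conn.table("measurements")

    # Two versions of the same measurement, for the same instrument and the same
    # timestamp: instead of deduplicating or updating, we simply insert both.
    ts = pd.Timestamp("2023-01-01 00:00:00")
    df = pd.DataFrame({"instrument_id": ["eurusd", "eurusd"],
                       "version": [1, 2],
                       "value": [1.0712, 1.0715]},
                      index=pd.DatetimeIndex([ts, ts]))

    # No deduplication: every new version is appended to the table.
    qdbpd.write_dataframe(df, conn, table)

    # Picking the latest version happens at query time (e.g. with RESTRICT TO);
    # see the howto on forecasting for the exact query.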

    1.3.3.2. Insert truncation

    A truncated insert erases an entire range of existing data and replaces it with new data in a single transactional operation.

    This is most commonly seen in data warehouses and ETL processes, where you want to (re-)load data into a table without keeping any old or overlapping data.

    Although very similar, there are differences:

    • Rather than comparing data on a row-by-row basis as is done with deduplication, truncation always removes all old data;

    • New data is always written, never dropped;

    • Although it does provide referential transparency, it does not protect against rogue processes and does not provide auditability.
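
    Sketched below with the same Pandas API, under two assumptions: a measurements table with instrument_id and value columns, and a truncate option on write_dataframe (verify the exact parameter in the pandas API reference of your client version). The intent is that the time range covered by the dataframe is erased and the new rows are written in its place, in a single transaction.

    import pandas as pd

    import quasardb
    import quasardb.pandas as qdbpd

    conn = quasardb.Cluster("qdb://127.0.0.1:2836")
    table = conn.table("measurements")

    # Freshly (re-)loaded data for the period we are replacing; the columns
    # below are assumptions for this example.
    idx = pd.DatetimeIndex(["2023-01-01 00:00:00", "2023-01-01 00:01:00"])
    df = pd.DataFrame({"instrument_id": ["eurusd", "eurusd"],
                       "value": [1.0712, 1.0715]},
                      index=idx)

    # Truncated insert: erase the time range covered by `df` and write the new
    # rows in its place, transactionally. NOTE: `truncate=True` is an assumption
    # here; check your client version's pandas API reference for the exact flag.
    qdbpd.write_dataframe(df, conn, table, truncate=True)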

    1.3.3.3. Upsert

    Upsert is a common technique to tell a database to “insert or update” a record, and (often) works in conjunction with a primary key.

    Upon invocation, the database first attempts to insert the record. If a primary key conflict is detected, the database updates the record to reflect the provided content. Upserts are most commonly used in relational databases and by ORMs, where all you want is to make sure a certain record is persisted, even if this means implicitly updating a previous version.

    Although similar, it is very different from deduplication: for every invocation of an upsert, the record is modified. This has the following implications:

    • Every invocation always writes new data, and as such does not protect against rogue processes putting a lot of pressure on the database by repeatedly inserting data with the same key;

    • Auditability: because data is implicitly updated in-place, the old version is lost and there is no way to reconstruct the state of the database at a certain point in time.
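
    For contrast, below is a small, self-contained illustration of upsert semantics using Python’s built-in sqlite3 module (nothing QuasarDB-specific); it shows how every invocation overwrites the record in place, losing the previous version.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE measurements (instrument_id TEXT PRIMARY KEY, value REAL)")

    upsert = ("INSERT INTO measurements (instrument_id, value) VALUES (?, ?) "
              "ON CONFLICT(instrument_id) DO UPDATE SET value = excluded.value")

    # First invocation inserts the record...
    conn.execute(upsert, ("eurusd", 1.0712))

    # ...subsequent invocations silently update it in place: the previous value
    # (1.0712) is lost and cannot be reconstructed afterwards.
    conn.execute(upsert, ("eurusd", 1.0715))

    print(conn.execute("SELECT * FROM measurements").fetchall())
    # [('eurusd', 1.0715)]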