5.6. How to handle multiple versions of the same data

    5.6. How to handle multiple versions of the same data

    Let’s say you’re working on forecast data and that this forecast data is updated multiple times a day.

    While it could be tempting to “update” the value in your database to the latest forecast (and keep only that), you may be interested in maintaining all versions.

    For example, you may want to evaluate the accuracy of the different forecasts. In trading, another reason could be to do backtesting with the forecast value “as it was”.

    How to maintain multiple versions of a value?

    In this guide, we will:

    • See how you can model data to support multiple versions of a value for a given timestamp

    • How to query a specific version of the data present in the database

    • How to render a specific version of the data in a curve

    We will be using Python, Pandas, and plotly. You can achieve similar results with any programming language or tool that can query Quasar.

    5.6.1. Preparation

    5.6.1.1. Install and launch QuasarDB

    Please install & launch QuasarDB for your environment; the free community edition is sufficient.

    For more information and instructions on how to install QuasarDB, please refer to our installation guide.

    5.6.1.2. Prepare your Jupyter notebook

    Please follow the steps below to create and set up your Jupyter notebook:

    # Install requirements
    $ python3 -m pip install jupyter \
                             pandas \
                             numpy \
                             quasardb
    
    # Launch local notebook
    $ jupyter notebook ./qdb-howto.ipynb
    
    [I 17:16:02.496 NotebookApp] Serving notebooks from local directory: /home/user/qdb-howto/
    [I 17:16:02.496 NotebookApp] Jupyter Notebook 6.4.6 is running at:
    [I 17:16:02.496 NotebookApp] http://localhost:8888/?token=...
    [I 17:16:02.496 NotebookApp]  or http://127.0.0.1:8888/?token=...
    [I 17:16:02.496 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
    

    A new jupyter notebook should automatically open, otherwise please manually navigate and you can navigate your browser to http://localhost:8888/ with the environment prepared.

    You are now completed with the preparations.

    5.6.2. Modeling multiple versions

    Let’s say we’re tasked with storing temperature forecasts with the requirements:

    • Forecasts are made on a daily basis;

    • Each forecast describes the day ahead and the 7 days following;

    • Each forecast has data points in intervals of 1 hour;

    • We want to keep track of the entire history of all forecasts, i.e. old forecasts will never be deleted;

    • When querying data, we typically want to use only the most recent forecast.

    We can generate data for a single forecast as follows:

    import math
    import numpy as np
    import pandas as pd
    
    basetime = np.datetime64('2021-09-01')
    basetemp = 18
    d = 0.05
    def temp_for(i):
        # semi-natural temperature forecast
        t = basetemp / float(i + 1)
        return basetemp * (math.exp(-d/t))
    
    def gen_forecast(start):
        offset = (start - basetime) / np.timedelta64(1, 'h')
        return [{'When': start + np.timedelta64(i, 'h'),
                 'Forecast made at': start,
                 'Temperature': temp_for(i) + offset} for i in range(0, (24 * 60))]
    
    pd.DataFrame(data=gen_forecast(basetime))
    
    When Forecast made at Temperature
    0 2021-09-01 00:00:00 2021-09-01 17.950069
    1 2021-09-01 01:00:00 2021-09-01 17.900277
    2 2021-09-01 02:00:00 2021-09-01 17.850623
    3 2021-09-01 03:00:00 2021-09-01 17.801107
    4 2021-09-01 04:00:00 2021-09-01 17.751728
    ... ... ... ...
    1435 2021-10-30 19:00:00 2021-09-01 0.333365
    1436 2021-10-30 20:00:00 2021-09-01 0.332440
    1437 2021-10-30 21:00:00 2021-09-01 0.331518
    1438 2021-10-30 22:00:00 2021-09-01 0.330599
    1439 2021-10-30 23:00:00 2021-09-01 0.329681

    1440 rows × 3 columns

    In order to model this data, we can observe the following:

    • As the column When describes the time the data point describes, and is most likely to be queried for, we use this as our primary $timestamp index;

    • We can use Forecast made at to find the most recent forecast for a query: it effectively describes the version of the data.

    In other words, this data is bitemporal: one timestamp describing the time/temperature being forecasted, and another timestamp of when the forecast itself was made.

    In Quasar, we can store the data as-is, in a table with three columns:

    • The primary $timestamp index;

    • A version timestamp column;

    • A temperature double column.

    5.6.3. Writing to Quasar

    We can take this forecast data and write it to Quasar.

    Let’s first establish a connection with Quasar and create the table:

    import quasardb
    import quasardb.pandas as qdbpd
    
    conn = quasardb.Cluster('qdb://127.0.0.1:2836')
    table = conn.table('forecasts')
    
    try:
       table.remove()
    except:
       # Table did not yet exist
       pass
    
    cols = [quasardb.ColumnInfo(quasardb.ColumnType.Timestamp, "version"),
            quasardb.ColumnInfo(quasardb.ColumnType.Double, "temperature")]
    
    table.create(cols)
    

    Let’s generate 3 different forecasts, and insert them into Quasar:

    def insert_forecast(forecast):
      # Convert to DataFrame
      df = pd.DataFrame(data=forecast)
      df = df.rename(columns={'When': '$timestamp', 'Forecast made at': 'version', 'Temperature': 'temperature'})
      df = df.round({'temperature': 3})
      df = df.set_index('$timestamp').reindex()
    
      # Invoke QuasarDB Pandas API to store this forecast
      qdbpd.write_dataframe(df, conn, table)
    
    # Generate 10 forecasts, one for each hour starting at 2021-09-01
    timestamps = (basetime + np.timedelta64(i, 'h') for i in range(10))
    forecasts = [gen_forecast(np.datetime64(timestamp)) for timestamp in timestamps]
    
    # Use first 3 forecasts
    for forecast in forecasts[:3]:
       insert_forecast(forecast)
    

    5.6.4. How to query the data

    We can retrieve the first 10 rows of the data by using a simple Quasar query for the QuasarDB pandas API:

    q = "SELECT $timestamp, version, temperature FROM forecasts ORDER BY $timestamp LIMIT 10"
    qdbpd.query(conn, q)
    
    $timestamp version temperature
    0 2021-09-01 00:00:00 2021-09-01 00:00:00 17.950
    1 2021-09-01 01:00:00 2021-09-01 00:00:00 17.900
    2 2021-09-01 01:00:00 2021-09-01 01:00:00 18.950
    3 2021-09-01 02:00:00 2021-09-01 00:00:00 17.851
    4 2021-09-01 02:00:00 2021-09-01 01:00:00 18.900
    5 2021-09-01 02:00:00 2021-09-01 02:00:00 19.950
    6 2021-09-01 03:00:00 2021-09-01 00:00:00 17.801
    7 2021-09-01 03:00:00 2021-09-01 01:00:00 18.851
    8 2021-09-01 03:00:00 2021-09-01 02:00:00 19.900
    9 2021-09-01 04:00:00 2021-09-01 00:00:00 17.752

    As you can see above, we indeed have all the data in the database, and it should be apparant that we indeed have 3 versions of the data.

    In order to focus on a single version, your intuition might be to make a query with a WHERE clause specifying a specific version number or a specific forecast time. For example, the following could work:

    version = "2021-09-01T05:00:00"
    q = """
    SELECT
      $timestamp,
      version,
      temperature
    
    FROM forecasts
    WHERE version = DATETIME({})
    """.format(version)
    qdbpd.query(conn, q)
    
    0

    This is all well and good, but what if not every row has the same version of the forecast? In the data above we’re missing the first few rows, because of this reason.

    Often, what we’re interested in is “the last version of the forecast available, whatever that is”, but we don’t know the exact version timestamp.

    Fortunately, Quasar has a built-in query for this with the RESTRICT TO keyword. The RESTRICT TO keyword will restrict the query to a specific “version” based on the value of a given column.

    In our example, if we want to have the “last version” of the forecast available, whatever that is, the query becomes:

    q = """
    SELECT
      $timestamp,
      version,
      temperature
    
    FROM forecasts
    RESTRICT TO MAX(version)
    """
    qdbpd.query(conn, q)
    
    $timestamp version temperature
    0 2021-09-01 00:00:00 2021-09-01 00:00:00 17.950
    1 2021-09-01 01:00:00 2021-09-01 01:00:00 18.950
    2 2021-09-01 02:00:00 2021-09-01 02:00:00 19.950
    3 2021-09-01 03:00:00 2021-09-01 02:00:00 19.900
    4 2021-09-01 04:00:00 2021-09-01 02:00:00 19.851
    ... ... ... ...
    1437 2021-10-30 21:00:00 2021-09-01 02:00:00 2.333
    1438 2021-10-30 22:00:00 2021-09-01 02:00:00 2.332
    1439 2021-10-30 23:00:00 2021-09-01 02:00:00 2.332
    1440 2021-10-31 00:00:00 2021-09-01 02:00:00 2.331
    1441 2021-10-31 01:00:00 2021-09-01 02:00:00 2.330

    1442 rows × 3 columns

    As you can see, here it automatically does the right thing: for the first few rows, it uses the last available forecast at that time.

    5.6.5. Point in Time

    There is an additional advantage to using timestamped versions: it allows us to reconstruct the data as of a certain time. This is very useful functionality to back-test algorithms, resolve quality issues, and/or help with auditing.

    To achieve this, we take the following steps:

    • Before filtering for the most recent version, apply a WHERE clause that removes any rows with a version after a certain date;

    • Use the same RESTRICT TO syntax as described above to keep only the most recent version.

    If, for example, we want to figure out what the state of the database is at 2021-09-01T01:15, we can do so as follows:

    q = """
    SELECT
      $timestamp,
      version,
      temperature
    
    FROM forecasts
    WHERE version <= DATETIME(2021-09-01T01:15)
    RESTRICT TO MAX(version)
    """
    qdbpd.query(conn, q)
    
    $timestamp version temperature
    0 2021-09-01 00:00:00 2021-09-01 00:00:00 17.95
    1 2021-09-01 01:00:00 2021-09-01 01:00:00 18.95

    As you can see, it now returns the data as it was at that exact point in time.

    5.6.6. Plotting multiple versions

    Of course, we can use this data to plot multiple versions in the same chart. This can be very useful to visualize the behavior of a forecasting algorithm over time.

    Let’s start out with plotting multiple versions as multiple lines in a chart. We’ll be using plotly for these visualizations.

    import plotly.graph_objects as go
    
    q = """
    SELECT
      $timestamp,
      version,
      temperature
    
    FROM forecasts
    
    IN RANGE (2021-09-03T00:00:00, +1month)
    """
    df = qdbpd.query(conn, q).set_index('$timestamp').reindex()
    groups = df.groupby('version', axis=0)
    
    fig = go.Figure()
    for (version, df_) in groups:
      fig.add_trace(go.Scatter(x=df_.index, y=df_['temperature'], name='version={}'.format(str(version))))
    
    fig.update_layout(template='plotly_dark')
    fig.show()