# 5.6. How to handle multiple versions of the same data¶

Let’s say you’re working on forecast data and that this forecast data is updated multiple times a day.

While it could be tempting to “update” the value in your database to the latest forecast (and keep only that), you may be interested in maintaining all versions.

For example, you may want to evaluate the accuracy of the different forecasts. In trading, another reason could be to do backtesting with the forecast value “as it was”.

How to maintain multiple versions of a value?

In this guide, we will:

• See how you can model data to support multiple versions of a value for a given timestamp

• How to query a specific version of the data present in the database

• How to render a specific version of the data in a curve

We will be using Python, Pandas, and plotly. You can achieve similar results with any programming language or tool that can query Quasar.

## 5.6.1. Preparation¶

### 5.6.1.1. Install and launch QuasarDB¶

Please install & launch QuasarDB for your environment; the free community edition is sufficient.

### 5.6.1.2. Prepare your Jupyter notebook¶

# Install requirements
$python3 -m pip install jupyter \ pandas \ numpy \ quasardb # Launch local notebook$ jupyter notebook ./qdb-howto.ipynb

[I 17:16:02.496 NotebookApp] Serving notebooks from local directory: /home/user/qdb-howto/
[I 17:16:02.496 NotebookApp] Jupyter Notebook 6.4.6 is running at:
[I 17:16:02.496 NotebookApp] http://localhost:8888/?token=...
[I 17:16:02.496 NotebookApp]  or http://127.0.0.1:8888/?token=...
[I 17:16:02.496 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).


A new jupyter notebook should automatically open, otherwise please manually navigate and you can navigate your browser to http://localhost:8888/ with the environment prepared.

You are now completed with the preparations.

## 5.6.2. Modeling multiple versions¶

Let’s say we’re tasked with storing temperature forecasts with the requirements:

• Forecasts are made on a daily basis;

• Each forecast describes the day ahead and the 7 days following;

• Each forecast has data points in intervals of 1 hour;

• We want to keep track of the entire history of all forecasts, i.e. old forecasts will never be deleted;

• When querying data, we typically want to use only the most recent forecast.

We can generate data for a single forecast as follows:

import math
import numpy as np
import pandas as pd

basetime = np.datetime64('2021-09-01')
basetemp = 18
d = 0.05
def temp_for(i):
# semi-natural temperature forecast
t = basetemp / float(i + 1)
return basetemp * (math.exp(-d/t))

def gen_forecast(start):
offset = (start - basetime) / np.timedelta64(1, 'h')
return [{'When': start + np.timedelta64(i, 'h'),
'Temperature': temp_for(i) + offset} for i in range(0, (24 * 60))]

pd.DataFrame(data=gen_forecast(basetime))

0 2021-09-01 00:00:00 2021-09-01 17.950069
1 2021-09-01 01:00:00 2021-09-01 17.900277
2 2021-09-01 02:00:00 2021-09-01 17.850623
3 2021-09-01 03:00:00 2021-09-01 17.801107
4 2021-09-01 04:00:00 2021-09-01 17.751728
... ... ... ...
1435 2021-10-30 19:00:00 2021-09-01 0.333365
1436 2021-10-30 20:00:00 2021-09-01 0.332440
1437 2021-10-30 21:00:00 2021-09-01 0.331518
1438 2021-10-30 22:00:00 2021-09-01 0.330599
1439 2021-10-30 23:00:00 2021-09-01 0.329681

1440 rows × 3 columns

In order to model this data, we can observe the following:

• As the column When describes the time the data point describes, and is most likely to be queried for, we use this as our primary $timestamp index; • We can use Forecast made at to find the most recent forecast for a query: it effectively describes the version of the data. In other words, this data is bitemporal: one timestamp describing the time/temperature being forecasted, and another timestamp of when the forecast itself was made. In Quasar, we can store the data as-is, in a table with three columns: • The primary $timestamp index;

• A version timestamp column;

• A temperature double column.

## 5.6.3. Writing to Quasar¶

We can take this forecast data and write it to Quasar.

Let’s first establish a connection with Quasar and create the table:

import quasardb
import quasardb.pandas as qdbpd

conn = quasardb.Cluster('qdb://127.0.0.1:2836')
table = conn.table('forecasts')

try:
table.remove()
except:
# Table did not yet exist
pass

cols = [quasardb.ColumnInfo(quasardb.ColumnType.Timestamp, "version"),
quasardb.ColumnInfo(quasardb.ColumnType.Double, "temperature")]

table.create(cols)


Let’s generate 3 different forecasts, and insert them into Quasar:

def insert_forecast(forecast):
# Convert to DataFrame
df = pd.DataFrame(data=forecast)
df = df.rename(columns={'When': '$timestamp', 'Forecast made at': 'version', 'Temperature': 'temperature'}) df = df.round({'temperature': 3}) df = df.set_index('$timestamp').reindex()

# Invoke QuasarDB Pandas API to store this forecast
qdbpd.write_dataframe(df, conn, table)

# Generate 10 forecasts, one for each hour starting at 2021-09-01
timestamps = (basetime + np.timedelta64(i, 'h') for i in range(10))
forecasts = [gen_forecast(np.datetime64(timestamp)) for timestamp in timestamps]

# Use first 3 forecasts
for forecast in forecasts[:3]:
insert_forecast(forecast)

coercing deduplicate option from args =  {'deduplicate': False, 'deduplication_mode': 'drop'}
deduplication mode = {} drop
coercing deduplicate option from args =  {'deduplicate': False, 'deduplication_mode': 'drop'}
deduplication mode = {} drop
coercing deduplicate option from args =  {'deduplicate': False, 'deduplication_mode': 'drop'}
deduplication mode = {} drop


## 5.6.4. How to query the data¶

We can retrieve the first 10 rows of the data by using a simple Quasar query for the QuasarDB pandas API:

q = "SELECT $timestamp, version, temperature FROM forecasts ORDER BY$timestamp LIMIT 10"
qdbpd.query(conn, q)

$timestamp version temperature 0 2021-09-01 00:00:00 2021-09-01 00:00:00 17.950 1 2021-09-01 01:00:00 2021-09-01 00:00:00 17.900 2 2021-09-01 01:00:00 2021-09-01 01:00:00 18.950 3 2021-09-01 02:00:00 2021-09-01 00:00:00 17.851 4 2021-09-01 02:00:00 2021-09-01 01:00:00 18.900 5 2021-09-01 02:00:00 2021-09-01 02:00:00 19.950 6 2021-09-01 03:00:00 2021-09-01 00:00:00 17.801 7 2021-09-01 03:00:00 2021-09-01 01:00:00 18.851 8 2021-09-01 03:00:00 2021-09-01 02:00:00 19.900 9 2021-09-01 04:00:00 2021-09-01 00:00:00 17.752 As you can see above, we indeed have all the data in the database, and it should be apparant that we indeed have 3 versions of the data. In order to focus on a single version, your intuition might be to make a query with a WHERE clause specifying a specific version number or a specific forecast time. For example, the following could work: version = "2021-09-01T05:00:00" q = """ SELECT$timestamp,
version,
temperature

FROM forecasts
WHERE version = DATETIME({})
""".format(version)
qdbpd.query(conn, q)

0

This is all well and good, but what if not every row has the same version of the forecast? In the data above we’re missing the first few rows, because of this reason.

Often, what we’re interested in is “the last version of the forecast available, whatever that is”, but we don’t know the exact version timestamp.

Fortunately, Quasar has a built-in query for this with the RESTRICT TO keyword. The RESTRICT TO keyword will restrict the query to a specific “version” based on the value of a given column.

In our example, if we want to have the “last version” of the forecast available, whatever that is, the query becomes:

q = """
SELECT
$timestamp, version, temperature FROM forecasts RESTRICT TO MAX(version) """ qdbpd.query(conn, q) $timestamp version temperature
0 2021-09-01 00:00:00 2021-09-01 00:00:00 17.950
1 2021-09-01 01:00:00 2021-09-01 01:00:00 18.950
2 2021-09-01 02:00:00 2021-09-01 02:00:00 19.950
3 2021-09-01 03:00:00 2021-09-01 02:00:00 19.900
4 2021-09-01 04:00:00 2021-09-01 02:00:00 19.851
... ... ... ...
1437 2021-10-30 21:00:00 2021-09-01 02:00:00 2.333
1438 2021-10-30 22:00:00 2021-09-01 02:00:00 2.332
1439 2021-10-30 23:00:00 2021-09-01 02:00:00 2.332
1440 2021-10-31 00:00:00 2021-09-01 02:00:00 2.331
1441 2021-10-31 01:00:00 2021-09-01 02:00:00 2.330

1442 rows × 3 columns

As you can see, here it automatically does the right thing: for the first few rows, it uses the last available forecast at that time.

## 5.6.5. Point in Time¶

There is an additional advantage to using timestamped versions: it allows us to reconstruct the data as of a certain time. This is very useful functionality to back-test algorithms, resolve quality issues, and/or help with auditing.

To achieve this, we take the following steps:

• Before filtering for the most recent version, apply a WHERE clause that removes any rows with a version after a certain date;

• Use the same RESTRICT TO syntax as described above to keep only the most recent version.

If, for example, we want to figure out what the state of the database is at 2021-09-01T01:15, we can do so as follows:

q = """
SELECT
$timestamp, version, temperature FROM forecasts WHERE version <= DATETIME(2021-09-01T01:15) RESTRICT TO MAX(version) """ qdbpd.query(conn, q) $timestamp version temperature
0 2021-09-01 00:00:00 2021-09-01 00:00:00 17.95
1 2021-09-01 01:00:00 2021-09-01 01:00:00 18.95

As you can see, it now returns the data as it was at that exact point in time.

## 5.6.6. Plotting multiple versions¶

Of course, we can use this data to plot multiple versions in the same chart. This can be very useful to visualize the behavior of a forecasting algorithm over time.

Let’s start out with plotting multiple versions as multiple lines in a chart. We’ll be using plotly for these visualizations.

import plotly.graph_objects as go

q = """
SELECT
$timestamp, version, temperature FROM forecasts IN RANGE (2021-09-03T00:00:00, +1month) """ df = qdbpd.query(conn, q).set_index('$timestamp').reindex()
groups = df.groupby('version', axis=0)

fig = go.Figure()
for (version, df_) in groups: