5.5. Flexible sensor data modeling#
In data science, pivoting a table is an operation that "rotates" the representation of data. In this howto, we will walk you through a common use case for this operation and how Quasar supports it.
More precisely, in this guide, we will:

- Model your data as discrete timeseries of key/value pairs instead of wide tables;
- Discuss the advantages and disadvantages of modeling your data this way;
- Query this data using the PIVOT clause to change the data representation.
For the example code, we will be using Python and Pandas. You can achieve similar results with any programming language or tool that can query Quasar.
5.5.1. Preparation#
5.5.1.1. Install and launch QuasarDB#
Please install and launch QuasarDB for your environment; the free community edition is sufficient.
For more information and instructions on how to install QuasarDB, please refer to our installation guide.
5.5.1.2. Prepare your Jupyter notebook#
Please follow the steps below to create and set up your Jupyter notebook:
# Install requirements
$ python3 -m pip install jupyter \
pandas \
numpy \
quasardb
# Launch local notebook
$ jupyter notebook ./qdb-howto.ipynb
[I 17:16:02.496 NotebookApp] Serving notebooks from local directory: /home/user/qdb-howto/
[I 17:16:02.496 NotebookApp] Jupyter Notebook 6.4.6 is running at:
[I 17:16:02.496 NotebookApp] http://localhost:8888/?token=...
[I 17:16:02.496 NotebookApp] or http://127.0.0.1:8888/?token=...
[I 17:16:02.496 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
A new Jupyter notebook should open automatically; if it does not, navigate your browser to http://localhost:8888/ manually.
You have now completed the preparations.
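Optionally, you can verify from inside the notebook that the client library can reach your cluster. A minimal sketch, assuming QuasarDB is running locally on its default port 2836:

```python
import quasardb

# Connect to the local cluster; 2836 is QuasarDB's default client port.
# If the cluster is unreachable, this should fail rather than return a
# usable connection.
conn = quasardb.Cluster('qdb://127.0.0.1:2836')
print(conn)
```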
5.5.2. Pivot tables#
From Wikipedia:
A pivot table is a table of grouped values that aggregates the individual items of a more extensive table (such as from a database, spreadsheet, or business intelligence program) within one or more discrete categories.
As a concrete example of how Quasar supports pivot tables, consider this dataset:
$timestamp | dimension | value
---|---|---
2022-01-01 | VibrationX | 1.234
2022-01-01 | VibrationY | 2.345
2022-01-01 | VibrationZ | 3.456
2022-01-01 | Temperature | 87.1
2022-01-02 | VibrationX | 1.237
2022-01-02 | VibrationY | 2.348
2022-01-02 | VibrationZ | 3.459
2022-01-02 | Temperature | 87.3
In this dataset, we have modeled all data as key/value pairs with an associated timestamp index.
When pivoting this dataset, we take the key (the dimension column) and project the associated values as discrete column values:
$timestamp | VibrationX | VibrationY | VibrationZ | Temperature
---|---|---|---|---
2022-01-01 | 1.234 | 2.345 | 3.456 | 87.1
2022-01-02 | 1.237 | 2.348 | 3.459 | 87.3
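This rotation is exactly what pandas' DataFrame.pivot performs, so you can reproduce it locally before involving the database. A small illustration using the dataset above:

```python
import pandas as pd

# The long ("key/value") representation: one row per (timestamp, dimension) pair
idx = pd.to_datetime(['2022-01-01'] * 4 + ['2022-01-02'] * 4)
long_df = pd.DataFrame({'dimension': ['VibrationX', 'VibrationY', 'VibrationZ', 'Temperature'] * 2,
                        'value': [1.234, 2.345, 3.456, 87.1,
                                  1.237, 2.348, 3.459, 87.3]},
                       index=idx)

# Rotate: one column per dimension, indexed by timestamp
wide_df = long_df.pivot(columns='dimension', values='value')
print(wide_df)
```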
5.5.3. Why key/value timeseries?#
An astute reader may observe that you could simply model the table with those columns directly, which removes the need for pivoting in the first place. So why would you want to model data as key/value pairs?
To answer this question, we need to look at the advantages of storing the data as timeseries of key/values.
5.5.3.1. Advantages#
Consider that you’re tasked with building an ingestion pipeline which captures data for a number of IoT devices.
The pipeline has (at least) the following characteristics:
- Data for different dimensions arrives in different messages / streams;
- Data may arrive out-of-order;
- Additional dimensions may be added in the future.
In this case, using dedicated columns for each of the dimensions is suboptimal. To illustrate, let’s assume we receive two data payloads, one for the vibration X/Y/Z and another one for the temperature.
Creating the table and ingesting the first payload would look as follows:
CREATE TABLE measurements(VibrationX DOUBLE, VibrationY DOUBLE, VibrationZ DOUBLE, Temperature DOUBLE)
INSERT INTO measurements($timestamp, VibrationX, VibrationY, VibrationZ) VALUES (2022-01-01, 1.234, 2.345, 3.456)
SELECT * FROM measurements
This will yield:
$timestamp | VibrationX | VibrationY | VibrationZ | Temperature
---|---|---|---|---
2022-01-01 | 1.234 | 2.345 | 3.456 | (void)
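If you are following along in the notebook rather than in a shell, the same statements can be issued through the Python API. A minimal sketch, assuming the qdbpd.query helper used later in this guide also accepts CREATE TABLE and INSERT statements:

```python
import quasardb
import quasardb.pandas as qdbpd

conn = quasardb.Cluster('qdb://127.0.0.1:2836')

# The same statements as above, issued from Python
qdbpd.query(conn, "CREATE TABLE measurements (VibrationX DOUBLE, VibrationY DOUBLE, VibrationZ DOUBLE, Temperature DOUBLE)")
qdbpd.query(conn, "INSERT INTO measurements ($timestamp, VibrationX, VibrationY, VibrationZ) VALUES (2022-01-01, 1.234, 2.345, 3.456)")
df = qdbpd.query(conn, "SELECT * FROM measurements")
```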
Afterwards, we ingest the second payload with the temperature:
INSERT INTO measurements($timestamp, Temperature) VALUES (2022-01-01, 87.1)
SELECT * FROM measurements
This will yield:
$timestamp | VibrationX | VibrationY | VibrationZ | Temperature
---|---|---|---|---
2022-01-01 | 1.234 | 2.345 | 3.456 | (void)
2022-01-01 | (void) | (void) | (void) | 87.1
We now have two separate rows for the same timestamp, with a lot of voids. What happened?
While Quasar indexes based on timestamps, these timestamps are not necessarily unique. As such, it treats every new INSERT as a new row.
There are a couple of ways we could work around this:
- Instead of using two INSERT statements, use UPDATE for the second payload. This creates significant complexity in our ingestion pipeline, because we now need to know for each payload whether it is the first write for a row or a later one.
- Buffer data before insertion. This allows us to multiplex the different columns into a single row before insertion, avoiding the issue altogether (see the sketch after this list). This may work for certain use cases, but has several disadvantages:
  - Increased latency before the data is available within the database;
  - The buffering window must be large enough to capture all out-of-order data, which may be larger than the maximum acceptable latency;
  - Maintaining consistency between the input data streams and what is stored inside the database becomes complicated: if the ingestion process crashes, it loses the active buffer and needs to re-synchronize based on the state in the database.
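A minimal sketch of such a buffering step is shown below; insert_row is a hypothetical callback standing in for the actual database write:

```python
from collections import defaultdict

# timestamp -> partially assembled row ({column: value})
buffer = defaultdict(dict)

def on_payload(payload):
    row = dict(payload)     # avoid mutating the caller's payload
    ts = row.pop('Timestamp')
    buffer[ts].update(row)  # multiplex columns into a single pending row

def flush(insert_row):
    # insert_row(ts, row) is a hypothetical callback that writes one complete
    # row. If the process crashes before flush(), all pending rows are lost --
    # the consistency problem described above.
    for ts, row in sorted(buffer.items()):
        insert_row(ts, row)
    buffer.clear()
```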
The better alternative, however, is to model your data as timeseries of key/values, which addresses these issues.
5.5.3.2. Disadvantages#
Of course, there are downsides to this type of data model:
- It works best if you have the same column type for all dimensions: combining ints / doubles / strings can be tricky, and would likely require a separate value column for each type;
- Queries become more complex: while fast, a PIVOT isn't free, and it adds complexity when querying;
- Compression will not be as efficient: Quasar uses delta compression for numerical data, which works best when subsequent values in a column are close to each other. This is usually the case when a column holds data for a single sensor / device, but interleaving multiple sensors in one column defeats it, as the toy example below illustrates.
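To make the compression point concrete, compare the successive deltas of a single-sensor column with those of a column that interleaves two sensors (a toy illustration of the principle, not Quasar's actual codec):

```python
import numpy as np

single_sensor = np.array([1.234, 1.237, 1.240])       # one sensor: tiny deltas
interleaved   = np.array([1.234, 87.1, 1.237, 87.3])  # two sensors mixed

print(np.diff(single_sensor))  # [0.003 0.003]             -> compresses well
print(np.diff(interleaved))    # [ 85.866 -85.863  86.063] -> compresses poorly
```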
As such, we recommend using this type of data model only when your use case needs it, and avoiding it otherwise.
5.5.4. Example#
Let’s consider a full example of using this data model. We’ll start by generating some data as it could arrive to your ingestion pipeline:
import numpy as np

# Four payloads as they might arrive at the ingestion pipeline: two vibration
# payloads and two temperature payloads, for the same two timestamps.
payloads = [{'VibrationX': 1.234,
             'VibrationY': 2.345,
             'VibrationZ': 3.456,
             'Timestamp': np.datetime64('2022-01-01')},
            {'VibrationX': 1.237,
             'VibrationY': 2.348,
             'VibrationZ': 3.459,
             'Timestamp': np.datetime64('2022-01-02')},
            {'Temperature': 87.1,
             'Timestamp': np.datetime64('2022-01-01')},
            {'Temperature': 87.3,
             'Timestamp': np.datetime64('2022-01-02')}]
That gives us 4 different payloads we want to insert.
Let’s continue with creating the table:
import quasardb

conn = quasardb.Cluster('qdb://127.0.0.1:2836')
table = conn.table('measurements')

# Start from a clean slate: drop the table if it already exists
try:
    table.remove()
except Exception:
    # Table did not yet exist
    pass

# One string column for the dimension name, one double column for the value;
# the $timestamp index is implicit in every Quasar table.
cols = [quasardb.ColumnInfo(quasardb.ColumnType.String, "dimension"),
        quasardb.ColumnInfo(quasardb.ColumnType.Double, "value")]

table.create(cols)
Let’s insert the data. For demonstration purposes, we’re going to ingest data row-by-row, to mirror a real-world streaming use case where you cannot buffer data.
import pandas as pd
import quasardb.pandas as qdbpd

for payload in payloads:
    # Convert one payload into long-format rows: one row per dimension
    xs = []
    for (k, v) in payload.items():
        if k != 'Timestamp':
            xs.append({'dimension': k, 'value': v})

    # Every row of this payload shares the payload's timestamp as its index
    idx = np.full(len(xs), payload['Timestamp'])
    df = pd.DataFrame(xs, index=idx)

    qdbpd.write_dataframe(df, conn, table)
Great, we have now inserted all the payloads into our table. Let's retrieve all data from the table to verify it's there:
df = qdbpd.read_dataframe(table)
df
$timestamp | dimension | value
---|---|---
2022-01-01 | VibrationX | 1.234
2022-01-01 | VibrationY | 2.345
2022-01-01 | VibrationZ | 3.456
2022-01-01 | Temperature | 87.100
2022-01-02 | VibrationX | 1.237
2022-01-02 | VibrationY | 2.348
2022-01-02 | VibrationZ | 3.459
2022-01-02 | Temperature | 87.300
Now that we have verified the data is all there, we can put everything together and use the PIVOT clause to project the dimensions into discrete columns:
q = """
SELECT
$timestamp,
dimension,
value
FROM
measurements
LEFT ASOF JOIN measurements
PIVOT ON dimension
"""
df = qdbpd.query(conn, q, index='$timestamp')
df
$timestamp | Temperature.value | VibrationX.value | VibrationY.value | VibrationZ.value
---|---|---|---|---
2022-01-01 | 87.1 | 1.234 | 2.345 | 3.456
2022-01-02 | 87.3 | 1.237 | 2.348 | 3.459
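If you prefer to keep the database query simple, the same wide representation can also be produced on the client by pivoting the long-format DataFrame with pandas; this transfers every raw row to the client first, trading bandwidth for query simplicity. A minimal sketch, reusing the table handle from above:

```python
# Client-side alternative: read the raw key/value rows, then rotate locally.
long_df = qdbpd.read_dataframe(table)
wide_df = long_df.pivot(columns='dimension', values='value')
print(wide_df)
```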
And that's all there is to it.
5.5.5. Conclusion#
In this howto you learned how to model your data as timeseries of key/values, and when it is appropriate to do so:

- When data that should be on the same row arrives out-of-order;
- When you need the flexibility to add columns / dimensions to your dataset without changing the table structure.
5.5.5.1. Further reading#

- ../api/tutorial on how to get started with Quasar;
- The PIVOT query documentation.