5.5. Flexible sensor data modeling

In data science, pivoting a table is an operation that “rotates” the representation of data. In this howto, we will walk you through a common use case for this operation, and show how Quasar supports it.

More precisely, in this guide, we will:

  • Model your data as discrete timeseries instead of tables;

  • Discuss the advantages and disadvantages of modeling your data this way;

  • Query this data using the PIVOT clause to change the data representation.

For the example code, we will be using Python and Pandas. You can achieve similar results with any programming language or tool that can query Quasar.

5.5.1. Preparation

5.5.1.1. Install and launch QuasarDB

Please install & launch QuasarDB for your environment; the free community edition is sufficient.

For more information and instructions on how to install QuasarDB, please refer to our installation guide.

5.5.1.2. Prepare your Jupyter notebook

Please follow the steps below to create and set up your Jupyter notebook:

# Install requirements
$ python3 -m pip install jupyter \
                         pandas \
                         numpy \
                         quasardb

# Launch local notebook
$ jupyter notebook ./qdb-howto.ipynb

[I 17:16:02.496 NotebookApp] Serving notebooks from local directory: /home/user/qdb-howto/
[I 17:16:02.496 NotebookApp] Jupyter Notebook 6.4.6 is running at:
[I 17:16:02.496 NotebookApp] http://localhost:8888/?token=...
[I 17:16:02.496 NotebookApp]  or http://127.0.0.1:8888/?token=...
[I 17:16:02.496 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

A new Jupyter notebook should open automatically; if it does not, navigate your browser to http://localhost:8888/ manually.

You have now completed the preparations.

5.5.2. Pivot tables

From Wikipedia:

A pivot table is a table of grouped values that aggregates the individual items of a more extensive table (such as from a database, spreadsheet, or business intelligence program) within one or more discrete categories.

As a concrete example of how Quasar supports pivot tables, consider this dataset:

$timestamp    dimension     value
2022-01-01    VibrationX    1.234
2022-01-01    VibrationY    2.345
2022-01-01    VibrationZ    3.456
2022-01-01    Temperature   87.1
2022-01-02    VibrationX    1.237
2022-01-02    VibrationY    2.348
2022-01-02    VibrationZ    3.459
2022-01-02    Temperature   87.3

In this dataset, we have modeled all data as key/value pairs with an associated timestamp index.

When pivoting this dataset, we take the key (the dimension column) and project the associated values as discrete column values:

$timestamp    VibrationX    VibrationY    VibrationZ    Temperature
2022-01-01    1.234         2.345         3.456         87.1
2022-01-02    1.237         2.348         3.459         87.3
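The same rotation can be reproduced locally with plain pandas, which may help build intuition before running it inside the database. This sketch uses only pandas (no Quasar involved) and the data from the tables above:

```python
import pandas as pd

# The key/value representation: one row per (timestamp, dimension) pair
df = pd.DataFrame({
    '$timestamp': ['2022-01-01', '2022-01-01', '2022-01-01', '2022-01-01',
                   '2022-01-02', '2022-01-02', '2022-01-02', '2022-01-02'],
    'dimension':  ['VibrationX', 'VibrationY', 'VibrationZ', 'Temperature'] * 2,
    'value':      [1.234, 2.345, 3.456, 87.1,
                   1.237, 2.348, 3.459, 87.3]})

# Pivot: project each distinct dimension into its own column
wide = df.pivot(index='$timestamp', columns='dimension', values='value')
print(wide)
```

This is exactly the transformation the PIVOT clause performs server-side, without first transferring the long-form data to the client.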

5.5.3. Why key/value timeseries?

An astute reader may observe that you could simply model the table using those columns directly, removing the need for pivoting in the first place. So why would you want to use this model?

To answer this question, we need to look at the advantages of storing the data as timeseries of key/values.

5.5.3.1. Advantages

Consider that you’re tasked with building an ingestion pipeline which captures data for a number of IoT devices.

The pipeline has (at least) the following characteristics:

  • Data for different dimensions arrives in different messages / streams;

  • Data may arrive out-of-order;

  • Additional dimensions may be added in the future.

In this case, using dedicated columns for each of the dimensions is suboptimal. To illustrate, let’s assume we receive two data payloads, one for the vibration X/Y/Z and another one for the temperature.

Creating the table and ingesting the first payload would look as follows:

CREATE TABLE measurements(VibrationX DOUBLE, VibrationY DOUBLE, VibrationZ DOUBLE, Temperature DOUBLE)

INSERT INTO measurements($timestamp, VibrationX, VibrationY, VibrationZ) VALUES (2022-01-01, 1.234, 2.345, 3.456)

SELECT * FROM measurements

Will yield:

$timestamp    VibrationX    VibrationY    VibrationZ    Temperature
2022-01-01    1.234         2.345         3.456         (void)

Afterwards, we ingest the second payload with the temperature:

INSERT INTO measurements($timestamp, Temperature) VALUES (2022-01-01, 87.1)

SELECT * FROM measurements

Will yield:

$timestamp    VibrationX    VibrationY    VibrationZ    Temperature
2022-01-01    1.234         2.345         3.456         (void)
2022-01-01    (void)        (void)        (void)        87.1

We now have two separate rows for the same timestamp, with a lot of voids. What happened?

While Quasar indexes based on timestamps, these timestamps may not necessarily be unique. As such, it treats every new INSERT as a new row.

There are a couple of ways we could work around this:

  • Instead of using two INSERT statements, use UPDATE for the second payload.

    This creates significant complexity in our ingestion pipeline, because we now need to know for each payload whether it is the first write for a row or a later update.

  • Buffer data before insertion. This allows us to multiplex the different columns into a single row before insertion, avoiding the issue altogether.

    This may work for certain use cases, but has several disadvantages:

      • Increased latency before the data is available within the database;

      • The buffering window must be large enough to capture all out-of-order data, which may be larger than the maximum acceptable latency;

      • Maintaining consistency between the input data streams and what’s stored inside the database becomes more complicated: if the ingestion process crashes, it loses the active buffer and would need to re-synchronize based on the state in the database.
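To make these drawbacks concrete, here is a minimal sketch of what such a buffering/multiplexing stage could look like. All helper names here are hypothetical (this is plain Python, not part of the Quasar API), and note how much bookkeeping is needed before a single row can be flushed:

```python
from collections import defaultdict

# Hypothetical buffering stage: merge per-dimension payloads that share a
# timestamp into a single wide row before insertion.
buffer = defaultdict(dict)

def on_payload(payload):
    # Accumulate every non-timestamp field under the payload's timestamp.
    ts = payload['Timestamp']
    for k, v in payload.items():
        if k != 'Timestamp':
            buffer[ts][k] = v

def flush_if_complete(ts, expected_dims):
    # A row can only be flushed once every dimension has arrived; knowing
    # which dimensions to expect is part of the bookkeeping burden.
    if expected_dims.issubset(buffer[ts].keys()):
        return buffer.pop(ts)
    return None

on_payload({'Timestamp': '2022-01-01', 'VibrationX': 1.234,
            'VibrationY': 2.345, 'VibrationZ': 3.456})
on_payload({'Timestamp': '2022-01-01', 'Temperature': 87.1})

row = flush_if_complete('2022-01-01',
                        {'VibrationX', 'VibrationY', 'VibrationZ', 'Temperature'})
```

If the process crashes between `on_payload` and `flush_if_complete`, the buffered row is lost, which is exactly the consistency problem described above.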

A better alternative is to model your data as timeseries of key/values, which avoids these issues altogether.

5.5.3.2. Disadvantages

Of course, there are downsides to this type of data model:

  • It works best if you have the same column type for all dimensions: combining ints / doubles / strings can be tricky, and would likely require different columns for each of them;

  • Queries become more complex: while fast, a PIVOT isn’t free, and it causes additional complexity when querying;

  • The compression will not be as efficient: Quasar uses delta compression for numerical data, which works best when subsequent values in a column are close to each other. That is usually the case when a column only contains data for a single sensor / device, but interleaving multiple sensors in a single column defeats it.
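The delta-compression point can be illustrated with a quick numeric sketch (plain NumPy, values taken from the example dataset; this is an intuition aid, not a measurement of Quasar’s actual compression ratio):

```python
import numpy as np

# Readings of a single sensor drift slowly, so successive deltas are tiny
# and compress well.
temperature = np.array([87.1, 87.3, 87.2, 87.4])

# Interleaving several dimensions in one 'value' column makes neighbouring
# values jump wildly, producing large deltas.
interleaved = np.array([1.234, 87.1, 1.237, 87.3])

print(np.abs(np.diff(temperature)).max())   # deltas stay well below 1.0
print(np.abs(np.diff(interleaved)).max())   # deltas jump above 80
```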

As such, we recommend using this type of data model only when your use case needs it, and avoiding it otherwise.

5.5.4. Example

Let’s consider a full example of using this data model. We’ll start by generating some data as it could arrive to your ingestion pipeline:

import numpy as np

payloads = [{'VibrationX': 1.234,
             'VibrationY': 2.345,
             'VibrationZ': 3.456,
             'Timestamp': np.datetime64('2022-01-01')},
            {'VibrationX': 1.237,
             'VibrationY': 2.348,
             'VibrationZ': 3.459,
             'Timestamp': np.datetime64('2022-01-02')},
            {'Temperature': 87.1,
             'Timestamp': np.datetime64('2022-01-01')},
            {'Temperature': 87.3,
             'Timestamp': np.datetime64('2022-01-02')}]

That gives us 4 different payloads we want to insert.

Let’s continue with creating the table:

import quasardb

conn = quasardb.Cluster('qdb://127.0.0.1:2836')
table = conn.table('measurements')

try:
    table.remove()
except:
    # Table did not yet exist
    pass

cols = [quasardb.ColumnInfo(quasardb.ColumnType.String, "dimension"),
        quasardb.ColumnInfo(quasardb.ColumnType.Double, "value")]

table.create(cols)

Let’s insert the data. For demonstration purposes, we’re going to ingest data row-by-row, to mirror a real-world streaming use case where you cannot buffer data.

import pandas as pd
import quasardb.pandas as qdbpd

for payload in payloads:
    # Split the payload into one (dimension, value) row per sensor reading
    xs = []
    for (k, v) in payload.items():
        if k != 'Timestamp':
            xs.append({'dimension': k, 'value': v})

    idx = np.full(len(xs), payload['Timestamp'])
    df = pd.DataFrame(xs, index=idx)

    qdbpd.write_dataframe(df, conn, table)

Great, we have now inserted all payloads into our table. Let’s retrieve all data from the table to verify it’s all there:

qdbpd.read_dataframe(table)
              dimension   value
2022-01-01   VibrationX   1.234
2022-01-01   VibrationY   2.345
2022-01-01   VibrationZ   3.456
2022-01-01  Temperature  87.100
2022-01-02   VibrationX   1.237
2022-01-02   VibrationY   2.348
2022-01-02   VibrationZ   3.459
2022-01-02  Temperature  87.300

Now that we verified the data is all there, we can put everything together and use the PIVOT clause to put the dimensions into discrete columns:

q = """
SELECT
   $timestamp,
   dimension,
   value

FROM
   measurements

LEFT ASOF JOIN measurements

PIVOT ON dimension
"""
qdbpd.query(conn, q, index='$timestamp')
            Temperature.value  VibrationX.value  VibrationY.value  VibrationZ.value
2022-01-01               87.1             1.234             2.345             3.456
2022-01-02               87.3             1.237             2.348             3.459
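If the `.value` suffixes on the pivoted column names are unwanted, they can be stripped client-side with plain pandas. A sketch (the DataFrame below stands in for the result of `qdbpd.query` above; `str.removesuffix` requires Python 3.9+):

```python
import pandas as pd

# Stand-in for the pivoted query result shown above
df = pd.DataFrame({'Temperature.value': [87.1, 87.3],
                   'VibrationX.value': [1.234, 1.237]},
                  index=['2022-01-01', '2022-01-02'])

# Strip the '.value' suffix that PIVOT appends to each projected column
df = df.rename(columns=lambda c: c.removesuffix('.value'))
print(df.columns.tolist())
```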

And that’s all there is to it.

5.5.5. Conclusion

In this howto you learned how to model your data as timeseries of key/values, and when it is appropriate to use:

  • When data that should be on the same row arrives out-of-order;

  • When you need the flexibility to add columns / dimensions to your dataset without changing the table structure.

5.5.5.1. Further reading

  • ../api/tutorial on how to get started with Quasar;

  • PIVOT query documentation.