10. Aggregated tables#

10.1. Synopsis#

CREATE AGGREGATED TABLE [ IF NOT EXISTS ] <table_name> ( [
{ <name> <type> }
[, ... ]
] )
[ WATERMARK <duration> ]
[ TTL <duration> ]
{ TUMBLINGWINDOW (<duration>) | HOPPINGWINDOW (<duration>, <duration>)}
AS
SELECT $timestamp, { <aggregation> OVER $window AS <name> }

10.2. Definition#

Aggregated tables, also referred to as continuous queries in other platforms, offer a powerful feature that allows the database to store pre-aggregated results instead of the raw data. With this technique, the system calculates and preserves aggregate values like averages, sums, counts, maximums, or minimums over specific intervals or windows, leading to summarized data.

Key Points:

Aggregated computations involve storing summarized or aggregated results in the database.
Instead of retaining individual data points, the system saves pre-computed aggregate values for efficiency.
Aggregations are performed over specific intervals or windows, providing a structured way to analyze data.

10.3. Use Case: Optimized Aggregation and Reduced Storage Costs#

With QuasarDB’s Aggregated Tables, users can gain significant performance improvements and cost savings for use cases where raw data retention is less critical, and the primary focus is on aggregated results.

Key Benefits:

Optimized Data Storage: Aggregated Tables efficiently manage storage costs by storing precomputed aggregated results, reducing the need for extensive raw data storage.
Predictable Analytics: Aggregated Tables prove highly effective in environments where predefined queries and aggregations play a pivotal role, benefiting industries with steady query patterns. This includes monitoring applications, where consistent queries are well-defined and rarely change.
Efficient Resource Management: Aggregated Tables optimize data processing and storage in resource-constrained scenarios like embedded devices or systems with limited capacity. This is particularly advantageous for use cases where resources are restricted, such as IoT devices or low-memory systems.

By utilizing Aggregated Tables, users can achieve cost-effective storage, experience faster query responses, and benefit from optimal performance for well-defined aggregation requirements.

When Not to Use Aggregated Tables:

Exploratory Data Analytics: Aggregated Tables are less suited for exploratory data analysis tasks that involve ad-hoc queries (queries that are executed on-the-fly) and dynamic aggregations. In scenarios where the analysis requirements are continuously evolving and demand flexibility, traditional storage strategies might be more appropriate.
Unpredictable Query Patterns: If the queries or aggregations frequently change or exhibit highly dynamic patterns, Aggregated Tables may not be the best fit. Traditional storage approaches can better accommodate such situations, allowing for more agile and adaptable data processing.

10.4. WATERMARK#

The WATERMARK is a crucial feature in Aggregated Tables that allows users to handle out-of-order data arrival. In the first version of aggregated tables, it was assumed that all data arrived in chronological order, meaning that all new data being written has a $timestamp more recent than any previously written data. However, real-world scenarios often involve data that arrives out of sequence, which can pose challenges for data processing and analytics.

Purpose

The WATERMARK enables users to specify a time duration during which they want to allow out-of-order data to arrive. During this specified duration, the system will ignore any older data, ensuring that only data with timestamps within the defined WATERMARK window is considered for aggregation and storage. This feature is particularly useful for handling late-arriving data or data with unexpected delays.

10.5. Row-count based windows#

In the initial version of Aggregated Tables, the system only supported time-based windows, which meant aggregating data at regular time intervals (e.g., every 5 minutes). However, in response to real-world use cases, the latest version introduces support for row-count-based windows, a feature designed to meet specific needs and challenges.

Customized Aggregation Intervals:

For instance, in a continuous stream of data, you can aggregate every 1000 data points, ensuring alignment with specific analysis requirements.
Granularity for High-Frequency Data Streams:

For instance, data streams that arrive irregularly and rapidly. This granularity ensures data accuracy and facilitates insights even during frequent data updates.

10.6. Parameters#

table_name: The name of the aggregated table that you want to create. This parameter is mandatory.
name: The name of the column in the table. Each column must have a name associated with it, and this name must be unique within the table.
type: The data type of the column. It specifies the kind of data that can be stored in the column, such as INTEGER, FLOAT, STRING, etc.
duration: It represents a time interval or duration that is used in different contexts within the CREATE AGGREGATED TABLE statement. The syntax is described in section Durations. It is a placeholder for an actual time value that you need to specify based on your requirements. Please see the details in Note.
aggregation: The aggregation parameter in the CREATE AGGREGATED TABLE statement represents the specific aggregation function, such as AVG, MAX, MIN, SUM, or COUNT, that is applied to the data within the defined aggregation window specified by TUMBLINGWINDOW or HOPPINGWINDOW, and the result is aliased with the given name.

Note

TUMBLINGWINDOW(<duration>): In the context of TUMBLINGWINDOW, <duration> specifies the size of the tumbling window. It defines the time interval for which data will be grouped and aggregated together before being saved to the table.

Example: If you use TUMBLINGWINDOW(5min), it means that the tumbling window has a duration of 5 minutes. This means that data will be aggregated and saved in 5-minute intervals.

HOPPINGWINDOW(<duration>, <duration>): In the context of HOPPINGWINDOW, the first <duration> parameter specifies the size of the hopping window, and the second <duration> parameter specifies the hopping step or advance. The hopping window defines overlapping time intervals for aggregating data. It moves forward in fixed steps.

Example: If you use HOPPINGWINDOW(5min, 1min), it means that the hopping window has a size of 5 minutes, and it advances by 1 minute at a time. So, the windows will overlap by 4 minutes (5 minutes - 1 minute).

[ TTL <duration> ]: In the context of TTL, the <duration> parameter specifies the Time-To-Live for the data in the aggregated table. It represents the duration after which the data will be automatically removed or discarded from the table.

Example: If you use [ TTL 1d ], it means that the data will expire and be removed from the table after 1 day (24 hours).

[ WATERMARK <duration> ]: In the context of WATERMARK, the <duration> parameter specifies the allowable time delay for incoming data points. This parameter defines a temporal boundary within which data is considered valid for aggregation and storage. Data points with timestamps older than the specified <duration> will be disregarded, ensuring that only relatively recent data is processed and included in the aggregated results.

Example: If you set a WATERMARK(10min), it means that data points arriving with timestamps up to 10 minutes behind the latest timestamp in the table will be considered for aggregation. Any data points with timestamps exceeding this duration will be excluded.

Please note that in each case, you need to replace <duration> and <num> with actual time values and integers that suit your specific use case and requirements. The actual values you use for <duration> will depend on the time scale and granularity you want for your data aggregation.

10.7. Using “OVER” in Aggregated Tables#

In Aggregated Tables, “OVER” is used to specify how data is aggregated and stored within fixed time intervals, referred to as windows.

For a detailed explanation and comparison of “OVER” in Aggregated Tables and window functions, please see the Differentiating “OVER” for Window Functions and Aggregated Tables.

10.8. Examples#

To store the average by 5 minutes in an aggregated table named ‘t’, with a 1-day time-to-live (TTL), and calculated over TUMBLINGWINDOW, including the timestamp and computed average.

CREATE AGGREGATED TABLE t(col DOUBLE) TTL 1d TUMBLINGWINDOW(5min) AS
SELECT $timestamp, avg(col) OVER $window AS avg

To store data in an aggregated table named ‘t’, it uses a HOPPINGWINDOW with a size of 5 minutes and advances by 1 minute intervals. It calculates the average using the ‘d’ column and includes the $timestamp to perform the aggregation.

CREATE AGGREGATED TABLE t (col DOUBLE) HOPPINGWINDOW(5min, 1min) AS
SELECT $timestamp, avg(col) OVER $window AS avg

10.8.1. Tumbling Window with Sum#

Create a simple aggregated table with a tumbling window of 1 hour and calculate the sum of values.

CREATE AGGREGATED TABLE simple_sum_table (col DOUBLE) TUMBLINGWINDOW(1 hour)
AS SELECT $timestamp, sum(col) OVER $window AS sum

Insert some sample data into the aggregated table.

INSERT INTO simple_sum_table ($timestamp, col) VALUES
(2023-01-01T00:00:00, 10),
(2023-01-01T00:30:00, 20),
(2023-01-01T01:00:00, 15)

Query data from the aggregated table.

SELECT $timestamp, sum FROM simple_sum_table

$timestamp                                  sum
------------------------------------------------
2023-01-01T00:00:00.000000000Z               30
2023-01-01T01:00:00.000000000Z               15

Returned 2 rows in 2,026 us
Scanned 2 points in 2,026 us (986 rows/sec)

10.8.2. Hopping Window with Multiple Aggregations#

Create an aggregated table with a hopping window of 15 minutes and calculate multiple aggregates.

CREATE AGGREGATED TABLE complex_aggregation_table (col DOUBLE) HOPPINGWINDOW(15min, 5min)
AS SELECT $timestamp,
      count(col) OVER $window AS count_values,
      sum(col) OVER $window AS total_sum,
      avg(col) OVER $window AS average_value,
      max(col) OVER $window AS max_value,
      min(col) OVER $window AS min_value

Insert some sample data into the aggregated table.

INSERT INTO complex_aggregation_table ($timestamp, col) VALUES
   (2023-01-01T00:00:00, 10),
   (2023-01-01T00:05:00, 15),
   (2023-01-01T00:10:00, 5),
   (2023-01-01T00:15:00, 25)

Query data from the aggregated table.

SELECT $timestamp, count_values, total_sum, average_value, max_value, min_value FROM complex_aggregation_table

$timestamp                       count_values        total_sum    average_value        max_value        min_value
------------------------------------------------------------------------------------------------------------------
2022-12-31T23:50:00.000000000Z              1               10               10               10               10
2022-12-31T23:55:00.000000000Z              2               25             12.5               15               10
2023-01-01T00:00:00.000000000Z              3               30               10               15                5
2023-01-01T00:05:00.000000000Z              3               45               15               25                5
2023-01-01T00:10:00.000000000Z              2               30               15               25                5
2023-01-01T00:15:00.000000000Z              1               25               25               25               25

Returned 6 rows in 2,192 us
Scanned 30 points in 2,192 us (13,683 rows/sec)

10.8.3. Watermark#

Create an aggregated table with a watermark of 10 minutes and a tumbling window of 1 hour to perform calculations on the incoming data, such as counting and summing values over the specified time intervals, ensuring data accuracy and timeliness in the results.

CREATE AGGREGATED TABLE table_example_watermark (col DOUBLE) WATERMARK 10min TUMBLINGWINDOW(1hour)
AS SELECT $timestamp,
      count(col) OVER $window AS count,
      sum(col) OVER $window AS sum

Insert data points into the aggregated table.

INSERT INTO table_example_watermark ($timestamp, col) VALUES (2023-01-01T00:12:00, 2);
INSERT INTO table_example_watermark ($timestamp, col) VALUES
   (2023-01-01T00:01:00, 1),
   (2023-01-01T00:04:00, 4);

Query data from the aggregated table.

SELECT $timestamp, count, sum FROM table_example_watermark

$timestamp                       count              sum
--------------------------------------------------------
2023-01-01T00:00:00.000000000Z       2                6

Returned 1 row in 14,691 us
Scanned 2 points in 14,691 us (136 rows/sec)

10.8.4. Row-based window#

Create an aggregated table with a HOPPINGWINDOW of size 3 and an advance of 2

CREATE AGGREGATED TABLE table_row_based_example (col DOUBLE) HOPPINGWINDOW(3, 2)
AS SELECT $timestamp,
      count(col) OVER $window AS count,
      sum(col) OVER $window AS sum,
      min(col) OVER $window AS min,
      max(col) OVER $window AS max,
      avg(col) OVER $window AS avg

Explanation:

Window Size (3 rows): This means that each aggregation window, in this case, will include a total of 3 rows of data. These rows will be grouped together for aggregation calculations. So, when the aggregation process begins, it takes the first 3 rows it encounters and performs calculations on them.

Advancing (2 rows at a time): After processing the first window of 3 rows, the window advances by 2 rows. In other words, it moves forward in the data stream by skipping the next 2 rows and then creates a new window starting from the 4th row. This process continues as data arrives, creating overlapping windows that help in aggregating data at regular intervals.

Insert data points into the table_row_based_example table

INSERT INTO  table_row_based_example ($timestamp, col) VALUES
   (2023-01-01T00:01:00, 1),
   (2023-01-01T00:12:00, 2),
   (2023-01-01T00:40:00, 3),
   (2023-01-01T01:02:00, 4),
   (2023-01-01T01:10:00, 5)

Query the table_row_based_example table to retrieve aggregated results

SELECT $timestamp, count, sum, min, max, avg FROM  table_row_based_example;

$timestamp                       count              sum              min              max              avg
-----------------------------------------------------------------------------------------------------------
2023-01-01T00:01:00.000000000Z       3                6                1                3                2
2023-01-01T00:40:00.000000000Z       3               12                3                5                4
2023-01-01T01:10:00.000000000Z       1                5                5                5                5

Returned 3 rows in 2,408 us
Scanned 15 points in 2,408 us (6,226 rows/sec)

These results display a comprehensive snapshot of your data within specific time intervals. This provides you with a detailed overview of your data’s behavior over time, facilitating informed decision-making based on these summarized insights.