Columnar DB performance in action: DuckDB

This article is intended to demonstrate one simple thing: the fundamental difference between the columnar family and standard row-based databases, shown by example. Roughly speaking, you may need a column-based database when working with a complex OLAP-oriented workload: reading vast amounts of data, grouping it, and computing metrics. With a row-based database such queries take noticeably long; a columnar database drastically reduces that time.

Since the example has to be simple and easily reproducible, we compare the performance of columnar DuckDB with SQLite as its row-based counterpart.

You may read the article as a simplified proof of concept.

Steps to do

  • Fill the tables of both databases with random data.
  • Emulate an OLAP-workload SQL query.
  • Measure how much time is consumed by the query in both cases.
  • Visualize performance.

Prerequisites

Simply install the following Python dependencies with pip install duckdb plotly typer, and the SQLite CLI with sudo apt install sqlite3. Along with the Python duckdb bindings we will use the DuckDB CLI; download it as v0.2.7-duckdb_cli-linux-amd64.zip

Fill the databases

To notice a difference in performance, the amount of stored data has to be large enough, so we generate random data and write it into both databases.

Each table consists of a group id field and three value fields. All values are randomly generated integers.

Table schema:

random_data(
    grp UINTEGER,
    value1 UINTEGER,
    value2 UINTEGER,
    value3 UINTEGER);

The script database.py provides the auxiliary commands to prepare the experimental setup. For now, we create the base table in DuckDB and run the data generation.

$ python database.py fill-duckdb --sample-size 1000 --samples-count 10000
...
Database random.duckdb is filled up
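
For reference, here is a minimal sketch of what the fill-duckdb command could look like inside database.py. The batch sizes mirror the CLI options above; the value ranges and the exact insertion strategy are assumptions, so the real script may differ.

import random
import duckdb

SAMPLE_SIZE = 1000      # rows per batch, mirrors --sample-size
SAMPLES_COUNT = 10000   # number of batches, mirrors --samples-count

con = duckdb.connect("random.duckdb")
con.execute("""CREATE TABLE IF NOT EXISTS random_data(
    grp UINTEGER, value1 UINTEGER, value2 UINTEGER, value3 UINTEGER)""")

for _ in range(SAMPLES_COUNT):
    # one batch of randomly generated rows; the value ranges are assumptions
    batch = [(random.randint(0, 100),
              random.randint(0, 1_000_000),
              random.randint(0, 1_000_000),
              random.randint(0, 1_000_000))
             for _ in range(SAMPLE_SIZE)]
    con.executemany("INSERT INTO random_data VALUES (?, ?, ?, ?)", batch)

con.close()
print("Database random.duckdb is filled up")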

After a while, when filling in the 10M records has completed, we dump all the data to a CSV file and load it into a new SQLite database. You can take the archived 165 MB file random_data.csv.tar.gz to skip the data generation step.

Now to the SQLite database: dump the data from random.duckdb as a CSV file and import it into a random.sqlite table with the same schema.

$ python database.py dump-csv --path dump
$ python database.py fill-sqlite dump/0_random_data.csv
Database random.sqlite is filled up
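
A hedged sketch of what dump-csv and fill-sqlite might do under the hood; the COPY option syntax can vary between DuckDB versions, and the actual script may stream the data differently.

import csv
import os
import sqlite3
import duckdb

# dump-csv: export the DuckDB base table to a CSV file
os.makedirs("dump", exist_ok=True)
duck = duckdb.connect("random.duckdb")
duck.execute("COPY random_data TO 'dump/0_random_data.csv' (FORMAT CSV, HEADER)")
duck.close()

# fill-sqlite: load the CSV into SQLite using the same schema
lite = sqlite3.connect("random.sqlite")
lite.execute("""CREATE TABLE IF NOT EXISTS random_data(
    grp INTEGER, value1 INTEGER, value2 INTEGER, value3 INTEGER)""")
with open("dump/0_random_data.csv") as f:
    rows = csv.reader(f)
    next(rows)  # skip the header line
    lite.executemany(
        "INSERT INTO random_data VALUES (?, ?, ?, ?)",
        ((int(g), int(v1), int(v2), int(v3)) for g, v1, v2, v3 in rows))
lite.commit()
lite.close()
print("Database random.sqlite is filled up")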

The base tables are filled and we could already run the SQL query to see whether performance differs, but a single result would not reveal how performance changes with the amount of data.

To see how execution time grows, we split each base table into a series of incrementally larger tables by reading the data and adding it into a new table, as sketched below.
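
One possible way to do the splitting on the DuckDB side is sketched here. The table naming and the even 1M-row step are assumptions, and judging by its output the real split-duckdb command creates and fills each table in separate steps, but the end result is the same series of growing tables.

import duckdb

TABLES_COUNT = 10  # mirrors --tables-count

con = duckdb.connect("random.duckdb")
total = con.execute("SELECT count(*) FROM random_data").fetchone()[0]

for i in range(1, TABLES_COUNT + 1):
    limit = total * i // TABLES_COUNT  # 1M, 2M, ... up to all 10M rows
    con.execute(f"CREATE TABLE random_data{i} AS "
                f"SELECT * FROM random_data LIMIT {limit}")
    print("Table created: ", f"random_data{i}")

con.close()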

Now we create the tables for both databases.

$ python database.py split-duckdb --tables-count 10
...
Table created:  random_data10
...
Table filling:   random_data10

$ python database.py split-sqlite --tables-count 10
...
Table created:  random_data10
...
Table filling:   random_data10

At the moment, everything is ready to run the query.

Run the query

As we know, OLAP implies querying a database and computing metrics that can yield insights about a business domain. Therefore, we emulate such a workload and compute min(), max(), avg(), and sum() values on each column. The SQL code we run is in query.sql.
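
query.sql itself is not reproduced here; based on the description it is presumably an aggregate query of roughly the following shape (grouping by grp is an assumption). The snippet also shows how the same query could be timed from Python instead of the shell.

import sqlite3
import time
import duckdb

# assumed shape of query.sql: min/max/avg/sum over every value column
QUERY = """
SELECT grp,
       min(value1), max(value1), avg(value1), sum(value1),
       min(value2), max(value2), avg(value2), sum(value2),
       min(value3), max(value3), avg(value3), sum(value3)
FROM random_data
GROUP BY grp;
"""

def timed(label, run):
    start = time.perf_counter()
    run()
    print(f"{label}: {time.perf_counter() - start:.3f} s")

lite = sqlite3.connect("random.sqlite")
duck = duckdb.connect("random.duckdb")

timed("sqlite", lambda: lite.execute(QUERY).fetchall())
timed("duckdb", lambda: duck.execute(QUERY).fetchall())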

Using shell scripting, we run the query and measure its time in a for loop, replacing the base table name with each split table's name.

$ for i in `seq 1 10`; do \
sed 's/random_data/random_data'"$i"'/g' query.sql > edited-query.sql; \
time sqlite3 random.sqlite < edited-query.sql >/dev/null; \
done

sqlite3 random.sqlite < edited-query.sql > /dev/null  1,42s user 0,04s system 99% cpu 1,464 total
sqlite3 random.sqlite < edited-query.sql > /dev/null  2,84s user 0,12s system 99% cpu 2,966 total
sqlite3 random.sqlite < edited-query.sql > /dev/null  4,21s user 0,18s system 99% cpu 4,412 total
sqlite3 random.sqlite < edited-query.sql > /dev/null  5,75s user 0,22s system 99% cpu 5,971 total
sqlite3 random.sqlite < edited-query.sql > /dev/null  7,19s user 0,22s system 99% cpu 7,406 total
sqlite3 random.sqlite < edited-query.sql > /dev/null  8,66s user 0,26s system 99% cpu 8,951 total
sqlite3 random.sqlite < edited-query.sql > /dev/null  10,07s user 0,37s system 99% cpu 10,459 total
sqlite3 random.sqlite < edited-query.sql > /dev/null  11,59s user 0,42s system 99% cpu 12,015 total
sqlite3 random.sqlite < edited-query.sql > /dev/null  13,23s user 0,49s system 99% cpu 13,747 total
sqlite3 random.sqlite < edited-query.sql > /dev/null  14,52s user 0,58s system 99% cpu 15,126 total

The longest query takes over 14 seconds, which is clearly noticeable to a human. What if you run hundreds of such queries?

OK, we are at the key point: DuckDB comes into play.

$ for i in `seq 1 10`; do \
sed 's/random_data/random_data'"$i"'/g' query.sql > edited-query.sql; \
time duckdb random.duckdb < edited-query.sql >/dev/null; \
done

duckdb random.duckdb < edited-query.sql > /dev/null  0,12s user 0,07s system 99% cpu 0,186 total
duckdb random.duckdb < edited-query.sql > /dev/null  0,21s user 0,02s system 99% cpu 0,226 total
duckdb random.duckdb < edited-query.sql > /dev/null  0,23s user 0,07s system 99% cpu 0,296 total
duckdb random.duckdb < edited-query.sql > /dev/null  0,31s user 0,07s system 99% cpu 0,382 total
duckdb random.duckdb < edited-query.sql > /dev/null  0,36s user 0,07s system 99% cpu 0,431 total
duckdb random.duckdb < edited-query.sql > /dev/null  0,38s user 0,11s system 99% cpu 0,492 total
duckdb random.duckdb < edited-query.sql > /dev/null  0,49s user 0,09s system 99% cpu 0,586 total
duckdb random.duckdb < edited-query.sql > /dev/null  0,53s user 0,10s system 99% cpu 0,630 total
duckdb random.duckdb < edited-query.sql > /dev/null  0,56s user 0,17s system 99% cpu 0,728 total
duckdb random.duckdb < edited-query.sql > /dev/null  0,62s user 0,16s system 99% cpu 0,781 total

What would you say? The fundamental difference between the databases is clearly evident: DuckDB never exceeded 1 second on any run, while SQLite went beyond 14 seconds.

The job is done, and it is time to prettify the output by visualizing each execution time.

Visualize the results

Here we take the resulting output, convert it into Python lists, and build a performance chart from it. The scripts convertation.py and chart.py do this job.

$ python chart.py
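
The chart.py below is a hedged sketch: the real script presumably consumes the lists produced by convertation.py, while here the total times from the shell output above are hard-coded for brevity.

import plotly.graph_objects as go

# total execution time in seconds, taken from the `time` output above
records = [i * 1_000_000 for i in range(1, 11)]
sqlite_times = [1.464, 2.966, 4.412, 5.971, 7.406, 8.951, 10.459, 12.015, 13.747, 15.126]
duckdb_times = [0.186, 0.226, 0.296, 0.382, 0.431, 0.492, 0.586, 0.630, 0.728, 0.781]

fig = go.Figure()
fig.add_trace(go.Scatter(x=records, y=sqlite_times, name="SQLite"))
fig.add_trace(go.Scatter(x=records, y=duckdb_times, name="DuckDB"))
fig.update_layout(title="OLAP query execution time",
                  xaxis_title="Records in table",
                  yaxis_title="Total time, s")
fig.show()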

Look at the chart and try to extrapolate the measurements to 100M records and beyond: assuming the growth stays roughly linear, SQLite would need well over two minutes per query, while DuckDB would stay within about ten seconds.

Emphasize the concept

The experiment illustrates the main idea: for complex computational (OLAP) queries, an extreme performance boost can be achieved by storing data on disk column by column. That boost is exactly what we see on the chart.

Columnize it all.