Schemaless

Schemaless Data Ingestion and Querying

ClapDB enterprise edition supports schemaless data ingestion and querying. This feature is designed to handle both flexible and performance.

What is Schemaless

The Role of a Schema

Perform schema validation during data ingestion.
Compress data appropriately based on type during storage.
Index data appropriately.
Use type-appropriate utility functions during queries.
Different data types have different comparison methods, affecting how data is ordered.

From the above points, the main functions of a schema are evident during data writing and querying phases.

Why Schemaless is Needed

Data in OLAP databases is read-only and historical data does not change. Therefore, when an alter table operation occurs, old data should retain its previous type and still be accessible. Users may frequently perform DDL (Data Definition Language) operations such as: Adding a new column at any time. Changing a column’s type at any time. Creating a new value for an enum type. For various reasons, data ingested may involve a mixture of different data types.

Schemaless in ClapDB

Constraints

Under normal working conditions, columns cannot be deleted.
Data can only relinquish its past schema definition after being archived.
At the same point in time, within the same database, tables cannot have duplicate names, and within the same table, columns cannot have duplicate names.

Definition

A database consists of several tables, which in turn are made up of several columns.
Columns are defined by type, but these are not unique definitions. Instead, they are dynamic types composed of multiple types. Each row of data should belong to one of these types. At any given moment, a column possesses a list of types. If the ingested data does not specify a type, the system will attempt to convert it using these types in order until successful.
When querying, you can specify which type of data in a column to query.
Modifications to the schema do not affect old data; they only constrain the validation and casting of new data during ingestion.

Concepts

Immutable

ClapDB is a data platform designed for store & query, and the data is immutable. Once data is written, it cannot be modified by default. This is a fundamental principle of ClapDB. So update is not supported in ClapDB, and the data is append-only. (user could just use the truncate or delete to remove the data.)

Table Schema Mode

Static

This mode behaves like a traditional database, the column schema is fixed, and the data type is fixed.

Dynamic

This mode allows the column schema to be changed, and the data type is flexible.

Dynamic table column schema constraint

In dynamic schema table, user can set the column dynamic detail manually.

If you want a column is a union of multiple types, you can __n as the suffix of the column name, and the n is the index of the union type, n should start from 0, and unique.

if you do not use __n, the column is struct type of a fixed type.

DDL

create table foo(
    score__0 int,
    score__1 double,
    score__2 text,
    age int,
    zip text,
    name text,
) with (schema_mode = 'dynamic');

Ingesting Data

Ingesting data can specify the type of the data, just use column_name__n as the column name, and the n is the index of the union type.
Ingesting data can do not specify the type of the data, the system will try to convert the data to the column type, the order is from 0 to largest index.
if the column’s last index type is text, the data will be always ingesting success(may be truncated to a shorter length).

Query Data

If the table is static, the ClapDB will behave like a traditional database.
With dynmaic schema table, you can query the data with the type of the column use the __n suffiex.
Or use ::type to specify the sub type of the column, if you do not want to use __n suffix(maybe forgotten).
If not use __n suffix either ::type, ClapDB will try to all sub types of the column with the query.
If ::type is used with __n suffix, ::type still works like a cast operator.

# will filter the score__0 column with int type and score__1 column with double type
select age from foo where score > 0;


# will just filter the score__0 column with int type
select age from foo where score__0 > 0;

# will just filter the score__0 column with int type
select age from foo where score::int > 0;

# will cast the score__1 column to int type and filter the score__1 column with int type
select age from foo where score__1::int > 0;