Transform Definition Language

Release 5.0, April 2025

Table of Contents

General Description
DS
··· DS Partitioning
··· DS Life Cycle
··· ··· Create DS
··· ··· Transform DS
··· ··· Store DS
Type System
··· DS Types
··· ··· PlainText
··· ··· Structured
··· ··· Columnar
··· ··· Spatial DS Types
··· ··· ··· Point
··· ··· ··· Track
··· ··· ··· Polygon
··· Attribute, Variable, and Parameter Types
··· ··· Names
··· ··· String Literals
··· ··· Numerics
··· ··· Boolean Type
··· ··· NULL
··· ··· Arrays
··· Custom Object Types
Script Structure
··· Top-level Structure
··· Expressions
··· ··· Automatic Evaluation in Strings and Names
··· ··· Expression Operators
··· ··· Expression Functions
··· Parameter Lists
Input and Output Scopes
··· Simple Input Scope
··· Simple Output Scope
··· Wildcard Input and Output Scopes
··· UNION Input Scope
··· JOIN Input Scope
Data Storage Interface Statements
··· CREATE DS Statement
··· COPY DS Statement
Data Manipulation Statements
··· ALTER DS Statement
··· SELECT Statement
··· ··· Query Constraints
··· CALL Statement
··· ··· Transform-type Pluggables
··· ··· Operation-type Pluggables
··· ··· Procedure Calls
··· ANALYZE Statement
Control Flow Statements
··· LET Statement
··· IF Statement
··· LOOP Statement
··· RAISE Statement
··· CREATE PROCEDURE Statement
··· DROP PROCEDURE Statement
··· CREATE FUNCTION Statement
··· DROP FUNCTION Statement
Environment Control
··· OPTIONS Statement

General Description

Transform Definition Language (TDL) is a dialect of SQL specifically customized for ETL processes. The strongest emphasis is on the Transformation phase.

While standard SQL is mostly declarative, TDL directly supports imperative programming with flow control operators and procedural elements, and has some other significant semantic differences that stem from Data Cooker's differences from traditional RDBMS. The key difference from SQL engines is that Data Cooker requires neither a global data catalog nor an information schema. A partial schema is defined ad hoc by language statements only if required by the current context. It doesn't control data types, integrity, or consistency, nor does it enforce any constraints on the data. Instead of SQL tables that consist of column-based rows with a fixed structure, where each column is strongly typed, Data Cooker operates with collections of Record Objects called Data Streams (DS for short), which are strongly typed at the level of the DS, while the Records themselves have loose, ad-hoc typing and structure.

Also, the Transformation phase of ETL does not imply analytic workloads or the complex record-level data manipulation that traditional SQL facilities usually provide. Instead, the entire DS is modified at each step. Because TDL describes high-level manipulation of a DS as a whole, it intentionally omits features such as aggregation, window functions, and anything else that reduces several Records of a given DS to a single value, except basic statistics via the ANALYZE operator.

At the same time, Data Cooker's language interpreter provides built-in support for 'Pluggables' (written in Java) to allow more complex manipulations of DS. By implementing a Pluggable interface, developers can seamlessly integrate custom logic into ETL processes, including generation of new DS from external data sources and arbitrary changes to the data inside them. With Pluggables, it is also possible to change the type of DS Record Objects. There is also an interface to expose any Java-based Expression Functions and Operators into the TDL context as an Evaluator implementation.

TDL has its own type system, customized for DS that contain not only relatively flat Records but also complex Spatial Objects with embedded geometry data, because Data Cooker's heritage is a geo-informatics processing toolchain.

DS

In terms of Data Cooker, a DS is the basic entity of every ETL process.

A DS is a typed collection of Record Objects that may or may not have a complex, type-dependent internal structure and/or a set of simple Attributes, accessible from the ETL process by Attribute Names. Each Object, together with a Key (which may be some opaque value or even another Object), forms the elementary unit of processing, the Record.

DS Partitioning

DS are partitioned, that is, split into chunks called Partitions, to balance the load of Spark executors, which process a single Partition at a time. In most cases the number of Partitions must be set manually; if it is not, it may be defined by the Input Adapter (a Pluggable that creates a DS) using intrinsic properties of the data source and/or execution environment.

Records are assigned to Partitions by hash codes of their Keys. Correct definition of Keys is crucial to avoid 'data skew' (a situation when a small number of Partitions contain most of the records while all others are empty).

In most cases, Data Cooker allows consumption of a DS either as a whole or as a subset of Partitions (any number from a single part to all parts). If such a subset is specified, a new DS is transparently created from the parent one on the fly. For efficient processing on a per-Partition basis, Keys must be stable and have a uniform (or at least predictable) distribution across the entire DS.

DS Life Cycle

According to the definition of ETL, a DS can be created (once), transformed (many times), and stored (several times).

In Data Cooker, a DS can be created by several means, from an external source of data or from other existing DS.

Extraction Phase: Create DS

Method #1. Pluggable Java-based Input Adapters extract data from a data source, which could be an object or file-based storage, a traditional database, or something else. Each Input Adapter supports creation of a single type of DS, but may support different data sources.

The initial Key of each newly created Record is set by the Input Adapter, and the DS is then partitioned using those initial Keys. The rules for a concrete Input Adapter are documented by its developer.

Method #2. A DS can be derived from one or many existing ones by performing an SQL-like SELECT query. The type and partitioning of a derived DS are inherited from the parent DS. See the SELECT section of this manual for details.

Method #3 is to call a Pluggable that generates a new DS from existing ones using its own rules (either Transform-like or more complex). Some common patterns are described below in the CALL section, but the exact rules are documented by the developer of the Pluggable.

In methods #2 and #3, a subset of Partitions of the source DS can be taken, thus creating a DS from some part of the source data.

Transformation Phase: Transform DS

After a DS is created, it may be transformed by Pluggables and the ALTER operator, changing its Record type and/or Keys, as well as the number of Partitions. The number of Objects can also change if one source Object corresponds to many resulting ones, or vice versa. As many consecutive transformations as needed may be performed on a given DS.

If Record Keys are changed by a transformation, the affected Records may be reassigned to other Partitions, if needed.

Loading Phase: Store DS

The final step of the DS life cycle is storage to some external data store, performed by an Output Adapter (a Pluggable that, like an Input Adapter, interfaces with external storage, but in the opposite direction). An Output Adapter may store different types of DS into different formats. The rules, as usual, are documented by its developer.

Type System

Unlike in SQL, TDL data types are defined first at the level of the DS as a whole (for each Record Object), and then at the level of Record Attributes and language constructs, ad hoc.

DS Types

Record Objects may or may not have an internal structure. If the Record Object type allows Attributes, each Attribute's type is dynamically coalesced to match the context of the expression that references it, if possible.

Supported DS types are:

PlainText

After a DS is loaded from an unstructured text source, each line is represented as text in the system encoding (usually UTF-8) or as an opaque byte array. Before performing meaningful transformations on the data contained in such a DS, it should be converted to another, more structured type.

Conversely, before a structured DS is saved to a text-based file format, it should be converted to PlainText.

Structured

Each Object of a Structured DS is backed by a JSON representation, so it can be arbitrary as long as it fits within JSON limitations. Namely, String, Numeric, Boolean, and NULL properties are supported, as well as Arrays. The top-level Object may be an Array itself.

Each top-level property is an Attribute itself (read-write), while nested properties have compound Attribute names (read-only). Compound names may include an array index or range part, too.

In a SELECT over a Structured DS, compound property names must be used to access nested properties, using dot notation ("top.inner.innermost") and zero-based array indices or ranges to access array members. For example, "prop.inner[0]" refers to the 1st member of the "prop.inner" array, and "[2].prop" to the property "prop" of the 3rd topmost array member. Ranges may be specified only for innermost properties (no dots are allowed afterward): "prop[*]" or "prop[]" for the entire array "prop", "prop[1:]" from the 2nd member onwards, "prop[:5]" up to the 6th, "prop[2:4]" from the 3rd to the 5th. Range boundaries may be negative, which means 'Nth member from the array end'. If the right boundary goes past the array size, the result will be padded with NULLs. If the boundaries overlap, or both go beyond the beginning, an empty array will be returned.
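
For illustration, a minimal sketch of such access paths, assuming a hypothetical Structured DS "events" with made-up property names:

SELECT "user.name", "tags[0]", "scores[1:3]" AS "mid_scores"
	FROM "events"
	INTO "events_flat";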

Columnar

Each Object of a DS of this type is a record consisting of uniquely named columns, enumerated in a specified order. Once enumerated, they can be accessed via Attributes of the same names. This type of DS is the closest to traditional SQL tables.

However, the data contained in a column doesn't have an explicit type, nor does it have any metadata except the name. Columns omitted from a record are automatically assigned NULL values.

Spatial DS Types

Because Data Cooker inherited the legacy of a preceding geo-informatics project, TDL natively supports geometry-specific Spatial data types, implemented as extensions of JTS Topology Suite and GeoJSON types.

Point

Objects of this type are abstractions of Points on a map, with an optional set of arbitrary Attributes.

Being an extension of the JTS Point, each Object has the mandatory internal attributes of geographic 'latitude' and 'longitude'. For GIS purposes, Points may also have a 'radius' property, turning them into 'Points of Interest' (POI).

Track

A Track Object is an abstraction of a route traveled by some physical object over a real-world map. It consists of Points grouped together by an appointed 'tracking' Attribute, sorted by another 'timestamp' Attribute, and optionally cut into Segments by a 'segmentation' Attribute. If the segmentation Attribute is not defined, there is still a single Segment that includes all Points of the Track.

Each level of nested objects (Track, Segments, Points) may have an arbitrary number of other named Attributes.

Polygon

A region on a map, defined as a collection of a single external border outline and any number of internal cut-out outlines, with an arbitrary number of other named attributes.

There is no direct access to the vertices of Polygon outlines, but there is an automatically computed Centroid POI, with its radius set to the distance to the most distant border vertex.

Attribute, Variable, and Parameter Types

Attributes of a DS adhere to the type system too, although not as strictly as in SQL. Outside a DS, there are also script-level Variables and Expression- / Pluggable-level Parameters that have the same types, more or less similar to SQL: Names, Strings, Numerics, Booleans, NULL, and Arrays.

Names

Names are used to define and identify any entities that have to be referred to in the script, i.e. DS, Attributes, Parameters, Variables, and so on.

The canonical representation of a Name is a string of Latin letters, digits (in any position except the first), and underscores, or a string of any valid Unicode characters enclosed in double quotes. If a Name contains a double quote character itself, it must be doubled. If a Name clashes with a TDL keyword, it must be quoted.

ALL Names are case-sensitive.

Quoted representation of names provides automatic expression evaluation (see later in a dedicated paragraph).

Names could be joined by a dot to create compound names not only to access Structured deep-level Attributes, but also encountered in JOIN context (we'll stop on this later).

Names of Variables are always explicitly identified by preceding 'dollar sign' $ sigil. Names of Parameters are likewise preceded by 'commercial at' @ sigil.

Data Cooker ETL pre-populates global Variables from system environment. These Variables have prefix ENV: ($"ENV:HOME", $"ENV:PATH" etc), and are read-only (can't be unset or overwritten). Also, special Variable for current working directory named $CWD is automatically created.
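
As a brief illustration (the new Variable Names are arbitrary, and the quoted one is used only to show quoting syntax), reading the pre-populated Variables into new ones:

LET $work_dir = $CWD;
LET $"home dir" = $"ENV:HOME";   -- a quoted Name may contain spaces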

String Literals

String literals are fragments of arbitrary Unicode text enclosed in apostrophes. If a String contains an apostrophe, it must be doubled.

String literals are always automatically evaluated (see later).

Numerics

There are several allowed representations of immediate Numeric values: [s]99 for Integers, [s]99L for Longs, 0xHH for hexadecimal Integers, [s]99.99 for Doubles, and [s]99[.99]E[s]99 for Doubles in scientific notation.

Here 99 means a sequence of decimal digits, HH a sequence of hexadecimal digits, and s a sign sigil (+ or -). Parts in square braces are optional; the symbols E, L, . (dot), and 0x are used as is.
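
For illustration, a few literals in these forms (assuming the markers above keep their usual meanings):

42    -7    1984L    0x1F    3.14    2.5E-3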

Boolean Type

Literal TRUE is used for Boolean truth and literal FALSE for Boolean falseness.

NULL

Like in SQL, NULL is used for non-values.

Most expressions that include NULL evaluate to NULL, except those that use a truth table similar to the OR operator, and some other specific cases (see the dedicated topic). As in SQL, no NULL is equal to another NULL.

A Variable or Attribute that is not defined in the current context also always evaluates to NULL. An undefined Parameter falls back to its default value if it is optional, or to NULL if it is mandatory.

Arrays

TDL supports loosely typed Arrays. Arrays may be used in LOOP control structures, with the IN operator, to pass collections to Pluggables as Parameters, and in other contexts.

Virtually anything can be collected into Arrays by Functions, but literal array constructors support only simple literals. The constructor syntax is just a list of values separated by commas, enclosed in square brackets (for example, ... WHERE "month" IN ['Nov', 'Dec', 11, 12]). The optional keyword ARRAY may be used before the opening bracket (for example, LOOP $i ARRAY['a', 'b', 'c'] BEGIN ...). Arrays may be empty (just []).

Arrays of Names are supported only in the context of Pluggable Parameters (CALL something(@affect_columns=["first","third","fourth"]) ...) and Variable definitions, and cause an error otherwise. Unlike other simple types, Names in an Array constructor cannot be mixed with other literals.

For Arrays that consist of Numeric values over some continuous range, there is a special constructor syntax as well: RANGE[left_boundary, right_boundary].

Ranges may be ascending or descending and always include both boundaries. For example, RANGE[1,-1] is equivalent to ARRAY[1,0,-1]. If both boundaries of a Range are of the Integer type, then all values in the Range will be Integers. If either is a Long, then all will be Longs (and Doubles are converted to Longs).

Custom Object Types

Some Pluggables and even Functions define their own specific subtypes of data types that they interpret with their own set of rules. Those types are referred to as 'Something-like Objects'.

For example, date/time manipulation functions operate on DateTime-like Objects, which are represented by Long Epoch seconds or milliseconds, or by Strings formatted as ISO 8601-like timestamps.

Script Structure

Semantically, TDL is parsed as a sequence of top-level language Operators, which include Expressions and Parameter Lists.

Top-level Structure

TDL script is a sequence of any number of statements, each ending with a mandatory semicolon.

Whitespace characters are ignored in most cases and may be freely used for readability, except in Expressions, where they are required to distinguish the boundaries of Expression Operators. A script may be empty; that is not an error.

Each simple language statement is executed in the current execution context, starting with the top-level one, while compound statements create their own context for nested statements. The list of language statements is as follows:

-- Data Storage Interface
CREATE DS
COPY DS
-- Data Manipulation
ALTER DS
SELECT
CALL
ANALYZE
-- Flow Control
LET $variable = expression
LET $variable = SELECT
LOOP
IF
RAISE
CREATE PROCEDURE
DROP PROCEDURE
CREATE FUNCTION
RETURN
DROP FUNCTION
-- Environment Control
OPTIONS

Expressions

Expressions are evaluated in two distinct contexts. The first is the 'Loose Expression', which can refer to any script-level entities except non-sigil Names. The other is the 'Record-level Expression', which can also refer to Record Attributes by their Names and is encountered only when iterating Records within a DS Partition, or within the definition of parameters for such a process.

So, Record-level evaluation rules apply only where they make sense: in 'item' sub-clauses of SELECT, in WHERE and KEY clauses, and in arrays of Attribute Names in LET and CALL-like parameters.

Automatic Evaluation in Strings and Names

Automatic expression evaluation is performed inside quoted Names and String literals. Any text between curly braces { and } inside a quoted Name or String is evaluated as an expression, and the result is substituted as a substring in place.

To turn off automatic evaluation, the braces should be escaped with a backslash \.

Please note that evaluation of quoted entities happens at parse time of the language Operator, not at call time. As such, quoted entities won't re-evaluate themselves in SELECT clauses on a per-Record basis. Use Expression Functions there instead.
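
A small sketch of both behaviors (the message text is arbitrary); the braced expression is substituted, while the escaped braces are kept verbatim:

RAISE INFO 'Current working directory is {$CWD}, but \{this\} is left as is';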

Expression Operators

There are three SQL-specific 'peculiar' Operators implemented right inside the TDL interpreter: IN, IS, and BETWEEN. Each is listed below with its priority, associativity, number of operands, and description:

operand [NOT] IN array_expression
	Priority 35, right-associative, 2 operands. TRUE if the left operand is present in the right one, cast to Array, FALSE otherwise. Vice versa for the NOT variant.

operand IS [NOT] NULL
	Priority 35, right-associative, 1 operand. TRUE if the left operand is NULL, FALSE otherwise. Vice versa for the NOT variant.

operand [NOT] BETWEEN numeric_min AND numeric_max
	Priority 35, right-associative, 3 operands. TRUE if the left operand is inclusively between the min and max operands, cast to Numerics, FALSE otherwise. For the NOT variant, TRUE if it is outside the range, excluding boundaries, FALSE otherwise.

All other Operators that can be used in Expressions are either directly translated into Java operators (down to their symbolic representation) or implemented elsewhere in the classpath as a kind of Pluggable Evaluator (and thus have a keyword-like name); their list is extensible and depends on the distribution.

If an Operator requires operands of a concrete data type, they will be directly cast (in the case of type-casting Operators) or stringified and then parsed to that type (in general-purpose Operators). If the cast or parsing is impossible, Data Cooker throws an error and halts script execution.

Operators with higher priority are evaluated before lower-priority Operators. In case of equal priority, evaluation goes from left to right, unless the Operator is right-associative or unary.

Most Operators result in NULL if any of their operands is NULL, but some handle NULL in a specific way.

For the exact list of available Operators, their Priority, Associativity, and NULL handling, look into the automatically generated distro docs.

Expression Functions

A function call has the following syntax:

function_name([argument_expression[, argument_expression]...])

Functions may have any number of strictly positional Arguments (1st, 2nd, 3rd, and so on). Some have a fixed list of strongly typed Arguments, some accept any number of Arguments that are automatically cast to a single type. There are also Functions that have no Arguments at all, or that implicitly refer to the Record Object as a whole and access its internal data, such as a Spatial Object's geometry.

In any case, Arguments are evaluated from left to right, and parentheses are mandatory.

An expression in parentheses is treated as a special case of a function call, to an identity function that returns its only argument unchanged. This allows parentheses to be used for subexpression grouping and for changing the evaluation order of compound expressions.

Record-level Functions can be called only from Record-level Attribute Expressions, and are invalid in the Loose Expression context. Calling them outside SELECT, WHERE, or KEY clauses will cause an error.

Expression Functions are provided in two ways. First, same as Operators, by Java implementations from the classpath; for the complete list of Pluggable Functions and their return and Argument types, refer to your distro docs. Second, Language-level Functions may be defined directly in the script, or in a library script. This method is discussed further in this document.

Parameter Lists

Parameter lists are used either to pass a set of named values to Pluggables, Procedures, Functions, and even the execution context itself, or to define such a set.

Each parameter is a name-to-value pair:

@parameter_name[ = value_expression][, @parameter_name[ = value_expression]]...

In the defining context, a value expression evaluates to the default for that Parameter. If it is omitted, the Parameter becomes mandatory and must be present in the passing context.

In the passing context, value expressions can't be omitted, but an entire pair can be skipped if its Parameter was previously defined as optional.
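
A sketch of both contexts, using a hypothetical Procedure name and Parameters (see CREATE PROCEDURE and CALL later in this document):

-- defining context: @source is mandatory, @limit is optional and defaults to 100
CREATE PROCEDURE my_proc(@source, @limit = 100) AS BEGIN
	RAISE INFO $source;
END PROCEDURE;

-- passing context: the @limit pair is skipped, so its default is used
CALL my_proc(@source = 'signals');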

Input and Output Scopes

Different language operators use unified logic to specify their input and output DS as input and output scopes.

Simple Input Scope

In the simplest form, an input scope is a single DS, referred to just by its Name, optionally narrowed to a range of parts by the PARTITION specifier.

"ds_name" [PARTITION parts_expression]

The partition specifier may be any expression that evaluates to an Array of part numbers (possibly a range, or even a discontinuous set) or to a single part number.

Simple Output Scope

A single output is referred to by its Name.

"ds_name"

This doesn't have any specifiers.

Wildcard Input and Output Scopes

An input scope consisting of multiple DS with the same prefix is also referred to by Name, followed by a * (the wildcard Operator).

"ds_name_prefix" * [PARTITION parts]

If it is narrowed to a range of parts by the PARTITION specifier, the specifier is applied to all matching inputs independently, so they all must have the referred parts.

For each wildcard input scope, a wildcard output scope is required.

"ds_name" *

For each matching input Name, a suffix is extracted and appended to the output prefix Name. For example, if there are inputs named "source/Jan", "source/Feb", and "source/Mar", the input scope FROM "source/" * and output scope INTO "result/" * produce output DS named "result/Jan", "result/Feb", and "result/Mar".

UNION Input Scope

The UNION input scope is intended for consuming multiple DS at once as if they were one DS, referring to them either by a list of Names or by a prefix and a wildcard.

UNION [SPEC] "ds_1" [PARTITION parts], "ds_2" [PARTITION parts], ...
UNION [SPEC] "ds_name_prefix" * [PARTITION parts]

All participating DS must be of the same type and are required to have exactly the same set of top-level Attributes. The scoped DS Name is either inherited from the first participating DS (in the list syntax) or set to the prefix (in the wildcard syntax).

The SPEC specifier can be one of the following logical operators:

CONCAT or CAT
	All participating DS Records are appended to the scoped DS, in order of occurrence (just like in standard SQL).

XOR
	Only Records that are unique to their source DS appear in the scoped DS. If a Record appears in more than one DS, it is discarded.

AND
	Only Records that occur in all participating DS appear in the scoped DS. Any Record that doesn't appear in all DS is discarded.

If omitted, CAT is used.
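
A sketch with hypothetical DS Names; both inputs are assumed to be of the same type and to share the same top-level Attributes:

SELECT * FROM UNION XOR "visits_2023", "visits_2024" INTO "visits_unique";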

JOIN Input Scope

The JOIN input scope is used to combine Records from several input DS into one by comparing their Keys.

[SPEC] JOIN "ds_1" [PARTITION parts], "ds_2" [PARTITION parts], ...

Resulting Record Objects of straight JOINs have the top-level Attributes of the first meaningful Object encountered in the DS iteration order, while additional Attributes are added from all subsequent/preceding DS. Attribute Names in this case are compound, using dot notation (the first part is the source DS Name, then the source Attribute Name). For subtracting JOINs, however, only one DS is the source of the resulting Objects, so only that Object's Attributes are available, using simple Names.

The supported JOIN specifiers, their kind, Record Attributes evaluation rules, and the DS from which the scoped DS inherits its type and Name are:

INNER (straight; type and Name inherited from the 1st DS)
	Records that have matching Keys in all participating DS are combined into one, others are discarded. Records always have all Attributes from all participating DS.

LEFT (straight; type and Name inherited from the 1st DS)
	DS are iterated from left to right. Resulting Records always have Attributes from the first DS, and then from subsequent DS Records with the same Keys. If there is no Record with a matching Key in the next DS, NULL values are used for all its Attributes.

LEFT ANTI (subtracting; type and Name inherited from the 1st DS)
	DS are iterated from left to right. Records from the first DS that have no matching Keys in all subsequent DS go to the scoped DS unchanged, others are discarded.

RIGHT (straight; type and Name inherited from the last DS)
	DS are iterated from right to left. Resulting Records always have Attributes from the last DS, and then from preceding DS Records with the same Keys. If there is no Record with a matching Key in the next DS, NULL values are used for all its Attributes.

RIGHT ANTI (subtracting; type and Name inherited from the last DS)
	DS are iterated from right to left. Records from the last DS that have no matching Keys in all preceding DS go to the scoped DS unchanged, others are discarded.

OUTER (straight; type and Name inherited from the 1st DS)
	DS are iterated from left to right. Resulting Records have Attributes from all DS Records with the same Keys. If there is no Record with a matching Key in any DS, NULL values are used for all its Attributes.

If omitted, INNER is used.
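
A sketch combining two hypothetical DS by their Keys; note the compound Attribute Names in the item list:

SELECT "users.name", "orders.total"
	FROM LEFT JOIN "users", "orders"
	INTO "users_with_orders";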

Data Storage Interface Statements

There is a subset of Operators to link DS with their physical storage, external to Data Cooker's Spark context. Because they can be used to create or materialize a DS, they're roughly equivalent to DDL in standard SQL.

CREATE DS Statement

Creates a new DS from external data using an Input Adapter Pluggable:

(CREATE | LOAD) [DS] "new_ds_name" [*]
	input_adapter_name[([parameters])]
	[attributes]...
	FROM path_expression
	[PARTITION parts] [BY SPEC];

The new DS Name must be unique. Some Input Adapters support creation of multiple DS at once if the Name is followed by a wildcard. In this case each resulting DS Name uses the specified prefix, and the suffixes are decided by the Input Adapter using its own internal rules (and the path may have a specific form).

The list of available Input Adapters depends on the distribution, so look up their Names and Parameters in the distro docs. The path expression evaluates to a path String that is also specific to the external data source. Each Input Adapter has its own path syntax, which could be a URI, a database query, or something more specific.

The number of partitions is set by the PARTITION expression, which evaluates to a Java Integer. If it is not set, a single-partition DS is created.

Initial Object keying, and thus initial partitioning, can be changed by the BY SPEC algorithm clause:

HASHCODE
	Sets the initial Record Key to the Java .hashCode() of the Record Object itself, so identical Objects will be placed into the same partitions.

SOURCE
	Derives Keys from the source file name (or a storage-dependent analog, if it's not file-based), so all Records from the same source 'file' will go to the same partition.

RANDOM
	Sets Keys to random values, effectively shuffling Records evenly between partitions.

The default is HASHCODE. In any case, the initial Record Key is an Integer Numeric.

An Attributes specification has one of the following syntaxes:

[SET] [Level] [ATTRIBUTES | COLUMNS | PROPERTIES] ("name"[, "name"]...)
[SET] [Level] [ATTRIBUTES | COLUMNS | PROPERTIES] ($array_of_names)

Depending on the DS type, there can be multiple Attribute specifications (if there are multiple Record Object Levels in that type), required or optional (depending on the Input Adapter), or none at all:

PlainText
	No Attribute specifications: there are no Attributes.

Columnar
	Value: column names of the Record Object.

Structured
	Value: property names of the Record Object.

Point
	Point, POI, or Value: Attributes of the Point Object.

Track
	Track, SegmentedTrack, or Value: Attributes of the Track (top-level Object).
	Segment or TrackSegment: Attributes of Segments (level 2 Objects).
	Point or POI: Attributes of Points (level 3 Objects).

Polygon
	Polygon or Value: Attributes of the Polygon Object.

The SET Level keywords are all equivalent, but for expressiveness, COLUMNS could be used for Columnar, PROPERTIES for Spatial, and ATTRIBUTES for all other DS types.
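
A minimal sketch; the Input Adapter name columnarText, its @delimiter Parameter, the column list, and the path form are all assumptions, since actual Input Adapters and their Parameters are distribution-specific:

CREATE DS "signals"
	columnarText(@delimiter = ',')
	SET COLUMNS ("userid", "lat", "lon", "timestamp")
	FROM 'file:{$CWD}/data/signals.csv'
	PARTITION 8 BY RANDOM;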

COPY DS Statement

Saves an existing DS to external storage via an Output Adapter Pluggable.

(COPY | SAVE) (wildcard_input_scope | input_scope[, input_scope])
	output_adapter_name[([parameters])]
	[attributes]...
	INTO path_expression;

Likewise, INTO expression evaluates to a path String (a URI, database table name, and so on) pointing to a data storage location, with a syntax specific to each Output Adapter.

For each DS from the input scope, the Pluggable is called with the same set of Parameters and the same path. Usually, each DS is saved into a subdirectory (or an analog) under that path, named after the DS.

Like Input Adapters, some Output Adapters may require Attributes specifications (possibly even multiple ones), or none at all.
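
A matching sketch; the Output Adapter name and the path form are assumptions:

COPY "signals"
	columnarText(@delimiter = ',')
	INTO 'file:{$CWD}/output/signals';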

Data Manipulation Statements

The Transformation phase of ETL maps to Operators that serve as a more direct counterpart of DML in standard SQL.

ALTER DS Statement

Modifies properties of a DS and/or its Records. It can also invoke a Pluggable that converts the DS from one type to another, and/or its Records from one representation to another.

ALTER [DS] "ds_name" [*]
	[[TRANSFORM] pluggable_name[([parameters])]]
	[attributes]...
	[KEY keying_expression]
	[PARTITION [parts]];

If a star qualifier is used after the DS Name, it is treated as a wildcard prefix, and all matching DS will be transformed with the same parameters (independently of each other, in no particular order).

The exact list of transform-type Pluggables depends on the distribution, and thus they are documented separately. Some Pluggables may require Attributes (multiple specifications, as well), or none at all. Some of them change the type of the DS, while others only modify, filter, or combine its Records using some internal set of rules. If there is no TRANSFORM, Attributes may still be specified to trim Records down to only the supplied list of Attributes.

If a KEY Record-level expression is specified, each DS Record will be assigned a newly calculated Key. In most cases the expression refers to Record Object Attributes in their state after the transform, but some Pluggables change the available list to the pre-existing DS Attributes (for example, if the DS is transformed to PlainText, which has no Attributes).

If a PARTITION clause is also present, Records will then be redistributed between new partitions according to their new Keys (the number of partitions is set by an expression that evaluates to an Integer). Be aware that the raw result of the KEY expression is used to set the Record Key, so if it evaluates to a constant, all Records will go into one partition. To keep the existing number of partitions, just omit the partitioning expression, leaving only the keyword.

Otherwise, if there is no PARTITION after KEY, only the Keys will be changed, and no redistribution is performed.

But if the statement includes only a PARTITION clause without KEY, the DS will be repartitioned to the desired number of partitions after transforming, using HASHCODE partitioning over the existing Keys.
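
A minimal sketch that re-keys a hypothetical Columnar DS by one of its assumed Attributes and redistributes it into 16 partitions:

ALTER DS "signals"
	KEY "userid"
	PARTITION 16;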

SELECT Statement

Performs an SQL-like query on input DS scopes to create output scoped DS.

SELECT [DISTINCT] (* | item[, item]...)
	(FROM input_scope)...
	(INTO output_scope)...
	[WHERE [Level] where_expression]
	[LIMIT limit_expression[%]];

The query part is either a wildcard (meaning 'select all available Attributes') or a list of SELECT 'items':

record_level_expression [AS [Level] "resulting_attribute_name"]

Resulting DS Attributes are either named directly (if AS is specified) or after the stringified expression (its text with all whitespace removed), and are created at the specified Record Object Level, or as top-level Attributes if no Level is given.

If the same name is set for several expressions at the same Level, the rightmost one will overwrite the results of all preceding ones (there can't be multiple Attributes with the same name at the same Level in any DS Object).

Item expressions are bound to the Record-level context of all DS participating in the SELECT.

The number and types of input and output scopes must match, counting from left to right: any wildcard input must have a corresponding wildcard output, and any simple/UNION/JOIN input must have a matching simple output.
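
A sketch over a hypothetical Columnar DS "signals" with assumed "userid" and "lat" Attributes, showing most clauses at once:

SELECT DISTINCT "userid", "lat" AS "latitude"
	FROM "signals" PARTITION [0, 1]
	INTO "northern_users"
	WHERE "lat" BETWEEN 60 AND 90
	LIMIT 10%;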

Query Constraints

There are three types of constraints available to narrow down the query results: the WHERE, DISTINCT, and LIMIT clauses.

The WHERE constraint is a single Boolean expression (applicable only to one Level of Record Object Attributes; the topmost, if unset), evaluated for each Record before all 'item' expressions and all other constraints. If the result is TRUE, the Record is queried; otherwise it is discarded.

DISTINCT, if specified, is applied after the query itself has been evaluated. Of all Records that evaluated to exactly the same Key and Object, only one is kept, and any repeats are discarded.

LIMIT is applied last of all, and comes in two flavors: by count and by percentage. A count can be specified in any valid Numeric format, and the resulting DS will have at most the specified number of Records, randomly sampled. A percentage must fall into the range from 0 (exclusive) to 100 (inclusive), and the result will be randomly trimmed down to the set percentage of Records. Specifying an invalid percentage leads to an error.

To summarize, the order of SELECT evaluation is as follows:

  1. Input scopes are created and the corresponding output scoped DS are checked for non-existence
  2. WHERE constraint expression calculated for each input Record
  3. 'Item' expressions calculated from left to right and output Records created
  4. DISTINCT constraint applied
  5. LIMIT constraint applied

CALL Statement

Serves to call a Pluggable to create a new DS from existing ones, or to invoke a Procedure.

Because each Pluggable may require different input and output scopes, and Procedures don't have them at all, the syntax varies.

Transform-type Pluggables

Pluggables that take a single input DS and produce a single scoped DS work like TRANSFORM, just without setting additional output parameters:

CALL pluggable_name[([parameters])]
	(INPUT [FROM] wildcard_input_scope)...
	(OUTPUT [INTO] wildcard_output_scope)...;
CALL pluggable_name[([parameters])]
	(INPUT [FROM] input_scope[, input_scope]...)...
	(OUTPUT [INTO] simple_output_scope[, simple_output_scope]...)...;

Both types of INPUT/OUTPUT specifications can be repeated and mixed, but must match each other (wildcard to wildcard, and simple/JOIN/UNION to simple). In the case of many inputs and outputs, they are consumed and created in order from left to right, independently of each other.
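
A sketch with a hypothetical Pluggable name someTransform, which is assumed to take one input and produce one output:

CALL someTransform
	INPUT FROM "signals"
	OUTPUT INTO "signals_clean";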

Operation-type Pluggables

Pluggables can also take more than one input scope at once and produce several scoped DS from them. Such Pluggables are called Operations, and they refer to the different scopes by their internal names:

CALL pluggable_name[([parameters])]
	(INPUT internal_name [FROM] input_scope(, internal_name [FROM] input_scope)...)...
	(OUTPUT internal_name [INTO] output_scope(, internal_name [INTO] output_scope)...)...;

Some Operations take only one input scope but produce several output scoped DS from it, or vice versa. In that case the internal name of the single scope doesn't matter and should be omitted.

Some of the scopes may even be optional; in that case their entry can be entirely omitted from the INPUT or OUTPUT list.

As with Transform-type Pluggables, multiple INPUT/OUTPUT pairs are consumed and created in order from left to right, independently of each other.
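
A sketch of an Operation call; the Pluggable name someMatcher and its internal scope names left, right, and matched are assumptions that a real Operation's docs would define:

CALL someMatcher
	INPUT left FROM "signals", right FROM "areas"
	OUTPUT matched INTO "signals_in_areas";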

Procedure Calls

Procedures are called in the global script context and always have access to all current DS. Because of that, they aren't restricted to scopes, and their syntax is the simplest:

CALL procedure_name[([parameters])];

If a Procedure has to refer to particular DS in a given invocation, for example as output scopes, their Names should be passed as Parameters.

ANALYZE Statement

Performs calculation of basic statistics about the specified DS (or multiple DS in the wildcard syntax):

ANALYZE [DS] "ds_name" [*]
	[KEY key_expression]
	[PARTITION];

This statement uses the results of the KEY expression, evaluated for each DS Record, as 'counting values' for DS statistics, like in the ALTER (TRANSFORM) statement. If the KEY clause is omitted, the Record's own Key is used automatically.

After the counting value is decided, all records in the DS are iterated over to calculate the following indicators:

_name: the DS Name
_type: the DS Record type
_parts: number of DS partitions
_counter: the KEY expression
_total: count of Records in the DS, like COUNT(*) in standard SQL
_unique: number of unique counting values
_average: average number of Records per unique counting value
_median: median number of Records per unique counting value

These indicators are appended as a new Record to a special Columnar DS named _metrics. Its Record Keys are always the Names of the analyzed DS.

Because each DS may be altered multiple times during its life cycle, and because ANALYZE may be called again for the same DS (e.g. with a different KEY expression), each successive invocation of ANALYZE for the same DS adds a new Record to the _metrics DS instead of replacing the existing one, keeping the history of all ANALYZE calls.
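
A minimal sketch, assuming a DS "signals" with a "userid" Attribute used as the counting value:

ANALYZE DS "signals" KEY "userid";   -- appends a Record with "signals" statistics to _metrics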

If the PARTITION clause is specified, an additional round of statistics calculation is performed for each partition of the analyzed DS. In that case, an additional special Columnar DS named _metrics_<DS name> is created for each analyzed DS. It has as many Records as the analyzed DS has partitions, and their fields contain the following indicators:

_part: partition number, starting with 0
_counter: the KEY expression
_total: count of Records in the partition
_unique: number of unique counting values in the partition
_average: average number of Records per unique counting value in the partition
_median: median number of Records per unique counting value in the partition

Its Record Keys are always the partition numbers of the analyzed DS. If some partition is empty, its indicators amount to zero.

Each subsequent ANALYZE with the PARTITION clause replaces the per-partition statistics DS _metrics_<DS name>, so if multiple results are required for further analysis (e.g. with different KEY expressions), make sure to copy them under a different Name.

Control Flow Statements

Control Flow Statements are compound: they have nested statements inside their clauses that execute under different conditions, such as branching.

LET Statement

Defines a new script Variable in the current execution context, or removes it.

LET $variable_name = expression;
LET $variable_name = SELECT [DISTINCT] record_level_expression FROM input_scope [WHERE where_expression] [LIMIT limit_expression[%]];

The resulting Variable type is defined by the expression itself. Because the context of expressions for Variable definitions is the same as for Parameter expressions, it is possible to define a Name Array. If the expression result is NULL, the Variable is removed from the context.

To retrieve values derived from DS Records as an Array, a limited form of the SELECT query can be used: only one expression is allowed in the 'item' list, and it doesn't have a resulting Attribute name. Query constraints work as expected.
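
A sketch of both forms; the DS "signals" and its "userid" Attribute are assumptions:

LET $threshold = 100;
LET $user_ids = SELECT DISTINCT "userid" FROM "signals";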

IF Statement

Performs control flow branching. It evaluates a Boolean expression and executes the statements nested in the THEN clause if the result is TRUE. Otherwise, it executes the statements in the ELSE clause, if present, or just does nothing.

IF boolean_expression THEN
	statements...
[ELSE
	statements...]
END [IF];

Either branch may contain any number of nested TDL statements (even none).

LOOP Statement

Executes the statements nested in the BEGIN clause as many times as there are values in the controlling Array, each time setting the loop Variable to the next value. If the expression evaluates to NULL or the Array is empty, and an ELSE clause is present, its nested statements are executed once instead.

LOOP $controlling_variable IN array_expression BEGIN
	statements...
[ELSE
	statements...]
END [LOOP];

If there is an existing Variable with the same name as the loop Variable in the outer context, it will be unaffected, because LOOP creates a new nested context.
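
A minimal sketch, assuming $months was set earlier (possibly to NULL or an empty Array):

LOOP $month IN $months BEGIN
	RAISE INFO $month;
ELSE
	RAISE WARNING 'no months to process';
END LOOP;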

RAISE Statement

This statement can be used to log arbitrary messages to the standard output or error streams, as well as to interrupt script execution with an error message.

RAISE [Level] expression;

Supported log levels:

DEBUG, INFO, LOG, NOTICE: log the message to stdout
WARN, WARNING: log the message to stderr
ERROR, EXCEPTION: halt execution and throw an exception with the message as the reason

If the log level is omitted, it defaults to ERROR.
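
A small sketch of both behaviors:

RAISE INFO 'transformation phase started';
RAISE 'mandatory input DS is missing';   -- level omitted, so this halts the script as ERROR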

CREATE PROCEDURE Statement

Creates a named sequence of statements in the current global execution context that can be CALLed as a whole, with its own context of Variables.

[CREATE [OR REPLACE]] PROCEDURE procedure_name[([parameters])] [AS] BEGIN
	statements...
END [PROCEDURE];

Procedures exist in the same namespace as Operations, and they can't override existing Pluggables. However, they can be replaced further down in the script code. For example, if your library script defines some Procedures, you may redefine them in your script by using the OR REPLACE clause.

Each @-Parameter becomes a $-Variable inside the BEGIN clause, with the value assigned in the CALL context.

Parameters can be optional; in that case they should have a default value set by an expression, evaluated in the current context at the time of definition (not in the CALL context). If defined without a default value (just as @name), a Parameter becomes mandatory and must always be given a value at CALL.
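
A sketch with an illustrative name and body; @message is mandatory, @prefix is optional:

CREATE OR REPLACE PROCEDURE report(@message, @prefix = 'INFO') AS BEGIN
	RAISE INFO $prefix;
	RAISE INFO $message;
END PROCEDURE;

CALL report(@message = 'loading finished');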

DROP PROCEDURE Statement

Use this statement to remove one or many Procedures from the current global execution context.

DROP PROCEDURE procedure_name[, procedure_name]...;

CREATE FUNCTION Statement

Language-level Functions provide a convenient way to call the same expression many times, with different values passed as Parameters.

[CREATE [OR REPLACE]] FUNCTION function_name[([parameters])] [RECORD] [AS]
	RETURN expression;
[CREATE [OR REPLACE]] FUNCTION function_name[([parameters])] [RECORD] [AS] BEGIN
	control_flow_statements...
END [FUNCTION];

A Function must always return some value. In the simplest form, the function body is just a single RETURN expression statement.

In the more complex case, Functions are similar to Procedures in the sense that they are also sequences of statements executed in their own context, but these statements are limited to LET (simple form, no SELECT), IF, LOOP, RAISE, and RETURN expression. It is an error if any Function execution path doesn't end with an explicit RETURN expression.

Language-level Functions exist in the same namespace as the Expression Functions provided by Pluggable modules, and they can't override them. However, they can be replaced further down in the script code. For example, if your library script defines some Functions, you may redefine them in your script by using the OR REPLACE clause.

Functions that have a RECORD specifier become Record-level, must always be called in Record-level contexts, and can call other Record-level Functions. They receive the Record Key and Object as implicit arguments.

Each @-Parameter becomes a $-Variable inside the function body, with the value assigned at the calling site. At the calling site, Parameters are always positional, never named.

Parameters can be optional; in that case they should have a default value set by an expression (evaluated in the current context, not in the calling site context), which is used if NULL is passed at the calling site. If defined without a default value (just as @name), a Parameter becomes mandatory and must always be given a value at the calling site.
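
A sketch of both forms, with illustrative names; the second Function calls the first and ends every execution path with RETURN:

CREATE FUNCTION is_winter(@month) AS
	RETURN $month IN ['Dec', 'Jan', 'Feb', 12, 1, 2];

CREATE FUNCTION month_kind(@month) AS BEGIN
	IF is_winter($month) THEN
		RETURN 'winter';
	ELSE
		RETURN 'other';
	END IF;
END FUNCTION;

LET $kind = month_kind('Jan');   -- calling site Parameters are positional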

DROP FUNCTION Statement

Use this statement to remove one or many Functions from the current global execution context.

DROP FUNCTION function_name[, function_name]...;

Environment Control

Like a traditional DBMS, Data Cooker supports limited control over its execution engine Parameters from within scripts.

OPTIONS Statement

Sets global execution context parameters to specified values.

OPTIONS parameter[, parameter]...;

Unlike all other statements, OPTIONS statements are pulled up into the topmost execution context and executed once, before all other script contents are evaluated. They are completely ignored thereafter, in the normal script execution flow.

The Parameters that may be set by OPTIONS depend on the Data Cooker distribution and version, so please see the dedicated doc in the distribution for exact definitions.