Transform Definition Language

Version 4 release 4.4, January 2025

Table of Contents

General Description
DS
··· DS Partitioning
··· DS Life Cycle
··· ··· Create DS
··· ··· Transform DS
··· ··· Store DS
Type System
··· DS Types
··· ··· PlainText
··· ··· Structured
··· ··· Passthru
··· ··· Columnar
··· ··· Spatial DS Types
··· ··· ··· Point
··· ··· ··· Track
··· ··· ··· Polygon
··· Attribute, Variable, and Parameter Types
··· ··· Names
··· ··· String Literals
··· ··· Numerics
··· ··· Boolean Type
··· ··· NULL
··· ··· Arrays
··· Custom Object Types
Script Structure
··· Top-level Structure
··· Expressions
··· ··· Automatic Evaluation in Strings and Names
··· ··· Expression Operators
··· ··· Expression Functions
··· Parameter List Expressions
Data Storage Interface Statements
··· CREATE DS Statement
··· COPY DS Statement
Data Transformation Statements
··· TRANSFORM DS Statement
Data Manipulation Statements
··· SELECT Statement
··· CALL Statement
··· ANALYZE Statement
··· ··· Per-partition
Compound Control Flow Statements
··· LET Statement
··· IF Statement
··· LOOP Statement
··· CREATE PROCEDURE Statement
··· DROP PROCEDURE Statement
··· CREATE FUNCTION Statement
··· DROP FUNCTION Statement
Settings Statements
··· OPTIONS Statement

General Description

Transform Definition Language version 4 (TDL4) is a domain-specific language designed to describe ETL processes performed with the Data Cooker ETL tool. The strongest emphasis is on the Transformation phase.

The syntax of TDL4 is inspired by and based on SQL, but has substantial semantic differences due to Data Cooker's differences from traditional RDBMS. Instead of SQL tables that consist of column-based rows, Data Cooker operates with strongly typed collections of Record Objects called 'DS' (for Data Sets / Data Streams). Also, the Transformation phase of ETL does not imply analytic workloads or the complex data manipulation that traditional SQL facilities usually provide.

As an SQL dialect, TDL4 is mainly declarative but also has substantial means of flow control, and even procedural elements.

Another difference from SQL engines is that Data Cooker requires neither a global data catalog nor an information schema. A rudimentary partial schema is defined ad hoc by language statements only if required by the current context. It doesn't provide control over data types, integrity, or consistency, nor does it enforce data constraints.

TDL4 allows high level manipulation over DSs as a whole, and intentionally omits features like aggregation, window functions, and anything that reduces several records of given DS to a single value. Only record-level expression evaluations for record attributes are provided by SELECT and similar statements. Those Expression Operators and Functions are implemented as Java classes and thus are extensible.

At the same time, Data Cooker's language interpreter also provides built-in support for pluggable 'Operations' and 'Transforms' to allow more complex manipulations of DSs written in Java. By implementing an Operation interface, developers can seamlessly integrate custom logic into ETL processes, including generating new DSs and changing the data inside them as they like. With Transforms, it is possible to change the type of DS Record Objects.

Another level of extensibility is provided by 'Storage Adapters' layer of plug-ins that covers Extract and Load phases of ETL processing.

TDL4 has a specific type system, customized for DSs that contain not only relatively flat Records, but also complex Spatial Objects with embedded geometry data, because Data Cooker's heritage is a geo-informatics processing toolchain.

DS

In terms of Data Cooker, a DS (short for Data Stream and/or Data Set) is a basic entity of every ETL process.

A DS is a typed collection of Objects that may or may not have a complex type-dependent internal structure, and/or a set of simple Attributes, accessible from the ETL process by Names. Each Object, together with a Key (which may be an opaque value or even another Object), forms an elementary unit of processing, the Record.

DS Partitioning

DSs are partitioned, i.e. split into chunks called Partitions, to balance the load of Spark executors, because each executor processes a single Partition at a time. The number of Partitions in most cases must be set manually, but if not, it'll be defined by the Input Adapter using intrinsic properties of the data source and execution environment.

Records are assigned to Partitions by hash codes of their Keys. Correct definition of Keys is crucial to avoid 'data skew' (a situation when a small number of Partitions contain most of the records while all others are empty).

In most cases, Data Cooker allows consumption of a DS either as a whole or as a subset of Partitions (from a single Partition to all of them). If such a subset is specified, a new DS is transparently created from the parent one on the fly. For efficient processing on a per-Partition basis, Keys must be stable and have a uniform distribution across the entire DS.

DS Life Cycle

According to definition of ETL, DSs can be created (once), transformed (many times), and stored (several times).

In Data Cooker, a DS can be created by several means, from an external source of data or from other existing DSs.

Create DS

Create method #1. Input Adapters extract data from the data source, which could be some object or file-based storage, a traditional database, or something else. Each Input Adapter supports creation of a single type of DS, but may support different data sources.

Initial Key of newly created Record is set by Input Adapter, then DS is partitioned using those initial keys. Rules for a concrete Input Adapter are documented by its developer.

Method #2. A DS can be derived from one or many existing DSs by performing an SQL-like SELECT query. Type and partitioning of a derived DS are inherited from its parent DSs. See the SELECT section of this manual for details.

Method #3 to create a DS is to call an Operation that generates new DS from existing ones using its own rules. There are some common patterns described below in the CALL section, but exact rules are documented by the developer of the Operation.

Transform DS

After a DS is created, it may be transformed by Transforms, changing its Record type and/or Key, as well as the number of Partitions. The number of objects can also change if one source object corresponds to many resulting ones, and vice versa. As many consecutive Transforms as needed can be performed on a given DS. As Transforms are pluggable, transformation rules are documented by their developers.

If Record Keys are changed by applying a Transform, affected Records may be assigned to other Partitions, if needed.

Store DS

Final step of DS's life cycle is storage to some external data store, performed by Output Adapter. An Output Adapter may store different types of DSs into different formats. Rules, as usual, are documented by developer.

Type System

Unlike SQL, TDL4 data types are defined first at the level of the DS as a whole for each Record Object, and then at the level of Record Attributes and language constructs.

DS Types

Record Objects may or may not have an internal structure, and if the Record Object type does allow Attributes, each Attribute type is dynamically coalesced to match the context of the expression that references it, if possible.

Supported types are:

PlainText

After loading a DS from an unstructured text source, each line is represented as text in the system encoding (usually UTF8), or as a byte array. Before performing meaningful transformations on the data contained in such a set, it should be converted to another, structured type.

Vice versa, before saving structured DSs to text-based file formats, they should be converted to PlainText.

Structured

Each object of a Structured DS is backed by a JSON representation, so it can be as arbitrary as it fits within JSON limitations. Namely, String, Numeric, Boolean, and NULL properties are supported, as well as homogeneous arrays (with every member of the same type). The top-level object may be an array itself.

Each top-level property is an Attribute itself (read-write), and nested properties' Attribute names are compound (read only). They may include array index or range part too.
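
For illustration, a hypothetical Structured Record backed by the JSON below would expose the Attribute names listed after it (the record contents and names are made up for this example; the index and range notation is described in the SELECT section):

    { "city": "Riga", "geo": { "lat": 56.95, "lon": 24.11 }, "tags": ["a", "b", "c"] }

    "city"        top-level property, a read-write Attribute
    "geo.lat"     nested property, a compound read-only Attribute
    "tags[0]"     1st member of the "tags" array
    "tags[1:]"    range of the "tags" array, from the 2nd member onwards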

Passthru

This type is not a data type per se, but just a marker reported from API calls to plugin Transforms, meaning that the Transform doesn't change the data type of the set.

Columnar

Each object of a DS of this type is a record consisting of uniquely named columns, enumerated in the specified order. Once enumerated, they can be accessed via Attributes of the same name.

Data in a column doesn't have an explicit type, nor does it have any metadata except the name. Columns omitted from the record are automatically assigned NULL values.

Spatial DS Types

Because Data Cooker inherited the legacy of a preceding geo-informatics project, TDL4 natively supports geometry-specific Spatial data types implemented as extensions to the JTS Topology Suite and GeoJSON.

Point

Objects of this type are abstractions of Points on a map with an optional set of arbitrary Attributes.

Being an extension of JTS Point, each object has mandatory internal attributes of geographic 'latitude' and 'longitude'.

Track

An Object of a Track is an abstraction of a route traveled by some physical object over the real-world map. It consists of Points, grouped together by an appointed 'tracking' Attribute and sorted by another 'timestamp' Attribute, optionally cut into Segments by a 'segmentation' Attribute. If the segmentation Attribute is not defined, there will still be a single Segment that includes all Points of a Track.

Each level of nested objects (Track, Segments, Points) may have an arbitrary number of other named Attributes.

Polygon

A region on a map, defined as a collection of a single external border outline and any number of internal cut-out outlines, with an arbitrary number of other named attributes.

There is no direct access to the vertices of Polygon outlines, but there is an automatically computed Centroid Point, whose radius is set to the distance to the most distant border vertex.

Attribute, Variable, and Parameter Types

Outside DSs, there are also script-level Variables, and Expression- / Pluggable Entity-level Parameters, that may have their own types, more or less similar to SQL: Names, String, Numeric, Boolean, NULL, and Arrays.

Simple literals are defined like this:

literal

Names

Names are used to define and identify any entities that have to be referred to in the script, i.e. DSs, Attributes, Parameters, Variables, and so on.

The canonical representation of a Name is a string of Latin alphabet letters, numbers (in any position except the first), and underscores, or a string of any valid Unicode characters enclosed in double quotes. If a Name should contain a double quote character itself, it must be doubled. If a Name clashes with a TDL4 keyword, it must be quoted.

ALL Names are case-sensitive.

Quoted representation of names provides automatic expression evaluation (see later in a dedicated paragraph).

Names could be joined by a dot to create compound names, used in SELECTs with JOINs.

Names of Variables are always explicitly identified by preceding 'dollar sign' $ sigil. Names of Parameters are likewise preceded by 'commercial at' @ sigil.
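
A few illustrative examples of the Name forms described above (the identifiers themselves are made up):

    signals_2024       canonical Name: Latin letters, digits, underscores
    "month of year"    quoted Name: any valid Unicode characters
    "15"" monitors"    quoted Name containing a double quote (doubled inside)
    "select"           quoted Name clashing with a TDL4 keyword
    $threshold         Variable Name, preceded by the $ sigil
    @affect_columns    Parameter Name, preceded by the @ sigil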

String Literals

String literals are fragments of arbitrary Unicode text enclosed in apostrophes. If a String should contain an apostrophe, it must be doubled.

String literals are always automatically evaluated (see later).

Numerics

There are different representations of immediate Numeric values allowed:

99 here means a sequence of decimal digits, HH a sequence of hexadecimal digits, and s is the sign sigil (+ or -). Parts in square braces are optional; the symbols E, L, . (dot) and 0x are used as is.
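
For illustration, the following immediate Numeric values follow the notation above (the list is not exhaustive, and the type names given are the presumed Java counterparts):

    42        Integer
    -7        signed Integer
    100L      Long, denoted by the L suffix
    0xFF      hexadecimal Integer
    3.14      floating-point number, with the . (dot)
    1.5E3     floating-point number in exponential notation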

Boolean Type

Literal TRUE is used for Boolean truth and literal FALSE for Boolean falseness.

NULL

Like in SQL, NULL is used for non-value.

Most of the expressions that include NULL evaluate to NULL, except ones that use a truth table similar to the OR operator, and some other specific cases (see the dedicated topic). As in SQL, no NULL is equal to another NULL.

A Variable or parameter that is not defined in the current context evaluates to NULL as well.

Arrays

TDL supports loosely-typed Arrays. Arrays may be used in LOOP control structures, IN operator, to pass collections to Pluggable Entities as Parameters, and in other contexts.

Virtually anything can be collected into Arrays by Pluggable Entities such as Functions, but literal array constructors support only simple literals. Constructor syntax is just a list of values separated by commas, enclosed in square brackets (for example, ... WHERE "month" IN ['Nov', 'Dec', 11, 12]). The optional keyword ARRAY may be used before the opening bracket (for example, LOOP $i ARRAY['a', 'b', 'c'] BEGIN ...). Arrays may be empty (just []).

Arrays of Names are supported only in context of Pluggable Entity Parameters (CALL something(@affect_columns=["first","third","fourth"]) ...), and throw an error otherwise. Names cannot be mixed with other literals.

For Arrays that consist of Numeric values over some continuous range, there is a special constructor syntax as well.

array

Ranges may be ascending or descending and always include both boundaries. For example, RANGE[1,-1] is the equivalent of ARRAY[1,0,-1]. If both boundaries of a Range have the type of Integer, then all values in the Range will be Integers. If either is a Long, then all will be Longs (and Doubles are converted to Longs).

Custom Object Types

Some Operations and even Expression Functions define their own specific subtypes of data types that they interpret with their own set of rules. Those types are referred to as 'Something-like Objects'.

For example, standard date/time manipulation functions operate with DateTime-like Objects, which are represented by Long Epoch seconds or milliseconds, or a String formatted as an ISO8601-like timestamp.

Another example, some spatial Operations may extend existing Point Record type with custom 'radius' attribute measured in meters, that makes them POI-like Objects (for Points of Interest).

Script Structure

Top-level Structure

TDL4 script is a sequence of any number of statements, each ending with a mandatory semicolon.

script

Each simple language statement is executed in the current execution context, starting with the top-level one, while compound statements run nested statements in their own nested context. The list of language statements is as follows:

statement

Whitespace and tabulation characters are ignored and may be freely used for readability. A script may be empty; that is not an error.

Expressions

Expressions can be evaluated in two distinct contexts. First is 'Loose Expression' that can refer to any script-level entities. Another one is 'Record-level Expression' which can also refer to Record Attributes by their Names.

expression

Record-level evaluation rules take place only where it makes sense (SELECT, WHERE, KEY clauses contexts).

Automatic Evaluation in Strings and Names

There is automatic expression evaluation performed inside quoted Names and String literals. Any text between curly braces { and } inside quoted Name or String is evaluated as an expression, and result is substituted as a substring in-place.

To turn off automatic evaluation, braces should be prepended by a backslash \.

Please note that evaluation of quoted entities happens at parse time of the language Operator, not at call time. As such, quoted entities won't re-evaluate themselves in SELECT clauses on a per-Record basis. Use Expression Functions there instead.
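
For illustration, assuming a Variable $y already holds the Numeric value 2025 at the moment of parsing:

    'report for {$y}'     evaluates to 'report for 2025'
    'report for \{$y}'    evaluation is turned off, the text between the braces is kept literally
    "stats_{$y}"          quoted Name that evaluates to "stats_2025"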

Expression Operators

There are three SQL-specific 'peculiar' Operators implemented right by TDL4 interpreter: IN, IS, and BETWEEN.

Operand [NOT] IN ARRAY[]
Operand [NOT] IN $array_var
Operand [NOT] IN array_attr
    Priority 35, right-associative, 2 operands. TRUE if the left operand is present in the right one, cast to Array, FALSE otherwise. Vice versa for the NOT variant.

Operand IS [NOT] NULL
    Priority 35, right-associative, 1 operand. TRUE if the left operand is NULL, FALSE otherwise. Vice versa for the NOT variant.

Operand [NOT] BETWEEN Min AND Max
    Priority 35, right-associative, 3 operands. TRUE if the left operand is inclusively between the Min and Max operands, cast to Numerics, FALSE otherwise. For the NOT variant, TRUE if it is outside the range, excluding boundaries.

All other Operators that can be used in Expressions are modeled after Java operators (down to the symbolic representation), and/or implemented elsewhere in the classpath (and thus have a keyword-like name), and their list is extensible and depends on the distribution.

If an Operator requires operands of concrete data type, they will be directly cast or stringified-then-parsed to that type. If cast or parsing is impossible, Data Cooker throws an error and halts script execution.

Operators with higher priority are evaluated before lower priority operators. In case of same priority, evaluation is left to right except if Operator is right-associative or unary. For subexpression grouping and changing of evaluation order, parentheses can be used.

Most Operators result in NULL if any of their operands are NULL, but some do handle NULLs in a specific way.

For the exact list of available Operators, their Priority, Associativity, and NULL handling, look into automatically generated distro docs.

Expression Functions

A function call has the following syntax:

func_call

Functions may have any number of strictly positional Arguments (1st, 2nd, 3rd, and so on). Some have a fixed list of strongly typed Arguments, some accept any number of Arguments that are automatically cast to a single type. There are also functions that have no arguments and therefore return some predefined value, or even refer to Object Record as a whole and access its internal data, such as Spatial Object's geometry.

In any case, Arguments are evaluated from left to right.

Expressions in parentheses are considered a special case of a function call of an identity function (which returns its only argument unchanged), to allow parentheses to change the evaluation order of compound expressions.

Record-level Functions can be called only from Record-level Attribute Expressions, and are invalid in the Loose Expression context. Calling them outside SELECT, WHERE, or KEY clauses will cause an error.

Expression Functions are provided in two ways. First, same as Operators, by Java implementations from the classpath. For the complete list of Pluggable Functions and their return and Argument types, refer to your distro docs. Second, Language-level Functions may be defined immediately in the script, or in a library script. This method is discussed further in this document.

Parameter List Expressions

Parameter lists are expressions composed of parameters (at least one), separated by commas.

params_expr

Each parameter is a name to value (which may be expression or array literal) pair.

param

Parameter lists are usually passed to named function-like expressions (TRANSFORM, CALL, and so on). In that case parameter lists are bound to Record-level context and can refer to DS Record Attributes.
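
Following the example in the Arrays section, a parameter list might look like the following sketch (the parameter names and values here are hypothetical):

    @delimiter = ',', @affect_columns = ["first", "third", "fourth"], @limit = $max_records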

Data Storage Interface Statements

CREATE DS Statement

Creates a new DS from data source. Its name must be new and unique.

create_stmt

Functional expression includes Input Adapter name and its parameter list. List of available Input Adapters depends on the distribution.

FROM expression evaluates to a path String pointing to data source. Each Input Adapter has its own syntax of path, which could be a URI, a query, or something more specific.

Number of partitions is set by PARTITION clause. Its expression evaluates to Java Integer number of partitions. If not set, single partition DS is created.

Initial object keying, and thus initial partitioning, can be changed by the BY algorithm clause. The default algorithm is HASHCODE, which sets the initial Record Key to the Java hashcode of the Record Object itself, so identical objects will be placed into the same partitions. SOURCE derives Keys from the source file name (or a storage-dependent analog, if it's not file-based), so all Records from the same source file will go to the same partition. RANDOM sets keys to random values, effectively shuffling records evenly between partitions.

In any case, initial key is a Java Integer numeric.
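
A minimal sketch of the statement, with an assumed clause order (consult the create_stmt diagram for the authoritative syntax; the 'textFiles' Input Adapter, its parameter, and the path are hypothetical):

    CREATE DS "signals" textFiles(@encoding = 'UTF-8')
        FROM 'hdfs://data/signals/2025'
        PARTITION 64 BY SOURCE;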

COPY DS Statement

Saves existing DS to storage.

copy_stmt

Likewise, INTO expression evaluates to a path String (usually, a URI) pointing to a data storage location, with a syntax specific to each Output Adapter. List of available Output Adapters depends on the distribution.

If a star is used after DS name, it is treated as a prefix, and all matching DSs will be copied to storage (usually, into subdirectories under specified INTO path for file-based storages). Each partition would usually go to a separate file.

This Statement supports per-Partition processing.

ds_parts

If needed, DS name could have additional PARTITION specifier, which has an expression that evaluates to a single Integer Partition number, or to Integer ARRAY or RANGE of Partitions to address. If star-syntax is used, all prefixed DSs must have addressed partitions.
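
A sketch of copying a single DS (with a partition subset) and a prefixed group of DSs, with an assumed clause order (see the copy_stmt and ds_parts diagrams; the 'textFiles' Output Adapter, its parameter, and the paths are hypothetical):

    COPY DS "signals" PARTITION RANGE[0, 3] textFiles(@codec = 'gzip') INTO 'hdfs://output/signals';
    COPY DS "metrics_*" textFiles(@codec = 'gzip') INTO 'hdfs://output/metrics';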

Data Transformation Statements

TRANSFORM DS Statement

Invokes a Transform to convert a DS from one type to another. Instead of keyword TRANSFORM, ALTER may be used.

transform_stmt

If a star is used after the DS name, it is treated as a prefix, and all matching DSs will be transformed with the same parameters.

Functional expression includes name of Transform and list of its parameters. As transforms are pluggable, exact list depends on a distribution, and thus they're documented separately.

An item of column definitions of a certain type has the following structure:

columns_item

In case of the COLUMNS list with the ($variable) syntax, the referenced variable must be of the Array of Names type.

'Type of columns' in this context depends on the DS type.

DS Type      Type of Attributes                 Record Object Level
PlainText    VALUE                              There are no Attributes
Columnar     VALUE                              Column names of the record object
Structured   VALUE                              Property names of the record object
Point        POINT or POI or VALUE              Attributes of the Point object
Track        TRACK or SEGMENTEDTRACK or VALUE   Attributes of the Track (level 1 object)
             SEGMENT or TRACKSEGMENT            Attributes of Segments (level 2 objects)
             POINT or POI                       Attributes of Points (level 3 objects)
Polygon      POLYGON or VALUE                   Attributes of the Polygon object

Key item is a Record-level expression, referring to existing DS attributes:

key_item

If a KEY expression is specified, each DS Record will be assigned a newly calculated Key.

The point at which Keys are evaluated depends on the Transform. Most evaluate Keys before the record is transformed, but some do it after (for example, if the source Record is PlainText, no Attributes are available before the transformation). See the Transforms documentation to find out.

If there is a PARTITION clause, Records will then be redistributed between new partitions according to their new Keys (with the number of partitions given by the PARTITION expression, evaluated to an Integer). Be aware that the raw result of the KEY expression is used to set the Record Key, so if it evaluates to a constant, all records will go into one partition. If there is no PARTITION clause, only Keys will be changed, and no redistribution is performed.

To keep existing number of partitions, just omit PARTITION expression, leaving only a keyword.

If the statement includes a PARTITION clause without KEY, the DS will be repartitioned to the desired number of partitions after transforming, using HASHCODE partitioning over existing Keys.
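
A sketch of a typical transformation, with an assumed clause order and spelling (see the transform_stmt, columns_item, and key_item diagrams for the authoritative syntax; the 'textToColumnar' Transform and its parameter are hypothetical):

    TRANSFORM DS "signals" textToColumnar(@delimiter = ',')
        COLUMNS ("userid", "lat", "lon", "timestamp")
        KEY "userid"
        PARTITION 128;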

Data Manipulation Statements

SELECT Statement

Performs an SQL-like query on one or more existing DSs and creates a new, resulting DS. Its name, specified in the INTO clause, must be new and unique.

select_stmt

The query part is either a star (meaning 'select all available attributes') or a list of 'what to select' expressions. Resulting DS attributes are named according to aliases (if specified) or the stringified expression (the expression text with all whitespace removed):

what_expr

If the same alias is set for several expressions, the rightmost will overwrite the results of all preceding ones (there can't be multiple attributes with the same name in a DS object). Aliases are set per column type (same as in TRANSFORM).

SELECT expressions are bound to Record-level context for all DS participating in the SELECT.

For Structured SELECTs, compound property names must be used to access nested properties using dot notation ("top.inner.innermost"), and zero-based array indices or ranges to access array members. For example, "prop.inner[0]" for the 1st member of the "prop.inner" array, or "[2].prop" for the property "prop" of the 3rd topmost array member. Ranges may be specified only for innermost properties (no dots allowed afterward): "prop[*]" or "prop[]" for the entire array "prop", "prop[1:]" from the 2nd member onwards, "prop[:5]" up to the 6th, "prop[2:4]" from the 3rd to the 5th. Range boundaries may be negative, which means 'Nth member from the array end'. If the right boundary goes past the array size, the result will be NULL-padded. If the boundaries overlap or both go beyond the beginning, an empty array will be returned.

In case of straight JOINs, fully-qualified compound names for attributes from different DSs must be used in the same dot notation.

Source scope is specified in the FROM clause:

from_scope

Queries can be performed on a single DS of any type, on many DSs at once (with UNION), or on multiple JOINed KeyValue DSs.

A query to a single DS is equivalent to iterating over all its data and simply evaluating the what-to-select items, then applying constraints.

In the case of a single DS with * query and no other constraints, a 'fast shallow copy' of DS is performed (just a new name to same underlying data is assigned).

UNIONs can unite either all listed DSs, or, if the star syntax is used, all DSs whose names have the specified prefix. There are several types of UNION operation available:

union_op
UNION Operation           Result
CONCAT or CAT (default)   Records of all DSs are queried, in order of occurrence
XOR                       Only records that are unique to their source DSs are queried, others are discarded
AND                       Only records that occur in all participating DSs are queried, others are discarded

UNION can be applied to DSs of same type only. Also, for UNION to be successful, all top-level objects of participating DSs must have top-level attributes of same names.

JOINs can be used to form new objects from the participating DSs' Objects that have the same Record Keys. The following JOIN operations are available:

join_op

Resulting record objects of straight JOINs have internal data from the first meaningful Object encountered in the DS iteration order, while Attributes are read from all JOINed sets, using a compound name with dot notation (the 1st part is the set name, then the attribute name). For subtracting JOINs, however, only one set is the source of resulting objects, thus only that object's Attributes are available, using simple names.

INNER (default). Straight JOIN, result type: 1st set.
    Records that have matching keys in all participating DSs are queried, others discarded. Records always have all Attributes from all participating sets.

LEFT. Straight JOIN, result type: 1st set.
    DSs are iterated from left to right. Result records always have columns from the 1st DS, and from records of subsequent DSs with the same keys. If there is no record with a matching key, NULLs will be used for its attributes.

LEFT ANTI. Subtracting JOIN, result type: 1st set.
    DSs are iterated from left to right. Records from the 1st DS that have no matching keys in any of the subsequent DSs will be queried, others discarded.

RIGHT. Straight JOIN, result type: last set.
    DSs are iterated from right to left. Result records always have columns from the last DS, and from records of preceding DSs with the same keys. If there is no record with a matching key, NULLs will be used for its attributes.

RIGHT ANTI. Subtracting JOIN, result type: last set.
    DSs are iterated from right to left. Records from the last DS that have no matching keys in any of the subsequent DSs (in iteration order) will be queried, others discarded.

OUTER. Straight JOIN, result type: 1st set.
    DSs are iterated from left to right. Result records always have columns from all DSs' records with the same keys. If there is no record with a matching key, NULLs will be used for its attributes.

Note that any referenced DS in the FROM clause may have its own PARTITION specifier, if needed. In case of the star-syntax, all prefixed DSs must have the referenced parts.

There are 3 types of constraints available to narrow down the query results: the WHERE, DISTINCT, and LIMIT clauses.

The WHERE constraint is a single Boolean expression (applicable only to one level of object, selected by type alias, if there are many), evaluated for each record before the 'what-to-select' expressions and all other constraints. If the result is TRUE, the record is queried, otherwise it is discarded.

where_expr

DISTINCT, if specified, is applied after the resulting query is evaluated. Of all records that have been evaluated to exactly the same key and object, only one will be left, and any repeats are discarded.

LIMIT is applied at the very last and comes in two flavors: by count and by percentage. The count can be specified in any valid Numeric format, and the resulting DS will have at most the specified number of records, randomly selected. If set by percentage, it must fall into a range from 0 (exclusive) to 100 (inclusive), and the result will be randomly trimmed down to the set percentage of records. Specifying an invalid percentage leads to an error.

limit_expr

To summarize, the order of SELECT evaluation is as follows:

  1. Check that the INTO DS name is new and unique
  2. Iteration, JOIN, or UNION over source scope of DSs
  3. WHERE constraint expression
  4. 'What-to-select' expressions, from left to right
  5. DISTINCT constraint
  6. LIMIT constraint
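
Putting it together, a hypothetical single-DS query might be sketched as follows (DS and Attribute names are made up, and the clause order shown here is an assumption; see the select_stmt diagram for the authoritative syntax):

    SELECT "userid", "lat", "lon"
        INTO "november_signals"
        FROM "signals"
        WHERE "month" IN [11, 12]
        LIMIT 100000;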

CALL Statement

Serves to call pluggable Operations, akin to SQL stored procedures (but with a substantial semantic difference), or TDL Procedures (see further below).

call_stmt

The functional expression includes the name of an Operation or Procedure and its parameters. The full list of available Operations depends on the distribution, thus they and their parameters are documented separately. TDL Procedures are ephemeral and exist only in the current execution context.

func_expr

Operations always create new DSs from existing and require INPUT FROM and OUTPUT INTO DS references (in either order).

operation_io

Like in SELECT, there is a source scope of DSs, set by the INPUT clause. There are two flavors of source scope, depending on the Operation.

Positional INPUT is either a list of DS names, or a prefix, followed by a star. In case of a list, Operation tries to use matching inputs in the specified order. In case of star syntax, order doesn't matter. Some Operations may require an exact number of positional inputs, while some will use all positional inputs to create an equal number of positional outputs. Please refer to Operation documentation for its requirements.

from_positional

Named INPUTs have internal aliases, specific to Operations. In that case, each DS has an explicit role for the calling Operation. Some Operations may list some named inputs as optional; for the exact meaning of an omitted input please see their respective docs.

from_named

Note that any referenced DS in the INOUT clause may have its own PARTITION specifier, if needed. In case of the star-syntax, all prefixed DSs must have the referenced parts.

Quite unlike SELECT, an Operation may produce a number of resulting DSs. There are 3 distinct varieties of them. Filtered outputs have records internally unchanged from the input, but selected according to some set of rules. Augmented outputs extend or otherwise change the internal structure of records coming from inputs. Finally, there may be outputs consisting of records generated from the input anew. Please refer to the Operation docs for each case.

Positional OUTPUT has a meaning similar to positional INPUT. The specified star name is used to set the prefix of the resulting DSs (with suffixes coming from the INPUT names), or a literal list of names is used, if they are comma-separated.

into_positional

Named OUTPUTs are explicitly different: for each Operation OUTPUT alias, the resulting DS generally has a specified structure. Some Operations may have optional named OUTPUTs.

into_named

An Operation may have positional INPUTs and named OUTPUTs, or vice versa.
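
For example, a call to a hypothetical 'dedup' Operation with one positional input and one positional output might be sketched like this (the Operation name and its parameter are made up; the INPUT FROM and OUTPUT INTO spellings follow the description above, while the overall clause order is given by the call_stmt and operation_io diagrams):

    CALL dedup(@key_attr = "userid")
        INPUT FROM "signals"
        OUTPUT INTO "signals_deduped";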

ANALYZE Statement

Performs calculation of basic statistics about the specified DS (or multiple DSs, found by prefix in the star syntax).

analyze_stmt

This statement uses results of KEY expression evaluated for each DS Record as 'counting values' for DS statistics, like in TRANSFORM statement. If KEY clause is omitted, Record's own Key is used automatically.

After the counting value is decided, all records in the DS are iterated over to calculate the following indicators:

Statistical indicator   Meaning
_name                   DS name
_type                   DS Record type
_parts                  Number of DS partitions
_counter                KEY expression
_total                  Count of records in the DS, like COUNT * in SQL
_unique                 Number of unique counting values
_average                Average number of records per each unique counting value
_median                 Median number of records per each unique counting value

They are appended to a special Columnar DS named "_metrics" as a new record. Its record keys are always the names of the analyzed DSs.

Because each DS may be altered multiple times in its life cycle, and because ANALYZE may be called again for the same DS (e.g. with a different KEY expression), each successive invocation of ANALYZE for the same DS adds a new record to the "_metrics" DS instead of replacing it, keeping the history of all ANALYZE calls.

Per-partition

If the PARTITION clause is specified, an additional round of statistics calculation is performed for each partition of the analyzed DS. In that case, an additional special Columnar DS is created for each analyzed DS, named "_metrics_<DS name>". There are as many records in that DS as the analyzed DS has partitions, and their fields contain the following indicators:

Statistical indicator   Meaning
_part                   Partition number, starting with 0
_counter                KEY expression
_total                  Count of records in the partition
_unique                 Number of unique counting values per partition
_average                Average number of records per each unique counting value, per partition
_median                 Median number of records per each unique counting value, per partition

Its record keys are always the partition numbers of the analyzed DSs. If some partition is empty, its indicators amount to zero.

Each subsequent ANALYZE PARTITION replaces the per-partition statistics DS "_metrics_<DS name>", so if multiple results are required (e.g. with different KEY expressions), make sure to copy it under different names.
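
A sketch, with the clause order assumed here (the DS and Attribute names are made up; see the analyze_stmt diagram for the authoritative syntax):

    ANALYZE "signals" KEY "userid" PARTITION;

After this statement, overall indicators for "signals" are appended to the "_metrics" DS, and per-partition indicators replace the "_metrics_signals" DS.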

Compound Control Flow Statements

Compound statements have nested statements (including other compound statements) inside their clauses that execute under various conditions in a separate execution context.

LET Statement

Defines (or undefines) a new script Variable in the current execution context.

let_stmt

A simple Variable expression is the same as a Parameter expression (which means Name Array definitions are allowed). If the expression result is NULL, the Variable is removed from the context.

Arrays of values derived from DS Records may be returned as results from the limited SELECT query:

sub_query

There are a few differences from a regular SELECT query. First, only one DS may be queried (neither JOINs nor UNIONs are allowed). Second, only one expression in the what-to-select list is allowed. Query constraints work as expected, and the PARTITION specifier is supported as well.
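
Two illustrative forms, a simple expression and a limited sub-query (the assignment spelling and clause order here are assumptions; see the let_stmt and sub_query diagrams):

    LET $threshold = 100;
    LET $months = SELECT DISTINCT "month" FROM "signals" WHERE "month" IN [11, 12];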

IF Statement

Performs control flow branching. Evaluates a Boolean expression, and executes statements nested in THEN clause for TRUE. Otherwise, executes statements in ELSE clause, if it is present, or just does nothing.

if_stmt

Either branch can contain any number of nested TDL4 statements (even none).
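
A sketch of branching on a Variable (the layout and the END spelling that closes the statement are assumptions; see the if_stmt diagram):

    IF $threshold IS NULL THEN
        LET $threshold = 100;
    ELSE
        LET $threshold = $threshold + 100;
    END;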

LOOP Statement

Executes the statements nested in the BEGIN clause as many times as there are values in the array, each time setting the loop Variable to the next value. If the expression evaluates to NULL or the array is empty, and an ELSE clause is present, its nested statements will be executed once instead.

loop_stmt

If there was an existing Variable with the same name as the loop Variable in the outer context, it will be restored to its previous value, because LOOP creates a new nested context.
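
A sketch reusing the LOOP header form shown in the Arrays section (the END spelling that closes the BEGIN clause, and re-evaluation of the quoted Name on each iteration, are assumptions; the optional ELSE clause is omitted here):

    LOOP $m ARRAY['nov', 'dec'] BEGIN
        ANALYZE "signals_{$m}" KEY "userid";
    END;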

CREATE PROCEDURE Statement

Creates a sequence of statements in the current global execution context, that can be CALLed as a whole with its own context of Variables.

create_proc

Procedures exist in the same namespace as Operations, and they can't override existing Operations. But they can be replaced further down in the script code. For example, if your library script defines some Procedures, you may redefine some in your script by using OR REPLACE clause.

Each @-Parameter becomes a $-Variable inside BEGIN clause with the value assigned at the CALL context. Because Parameters become Variables, they are always named, never positional.

Parameters can be optional; in that case they should have a default value set by an expression (evaluated in the current context, not in the CALL context). If defined without a default value (just as a @Name), Parameters become mandatory and must always be given a value at CALL time.

proc_param
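
A sketch of a Procedure with one mandatory and one optional Parameter, followed by a CALL to it (the parameter list spelling, the END that closes the BEGIN clause, and the COPY statement inside reuse the assumptions made in earlier sketches; consult the create_proc and proc_param diagrams for the authoritative syntax):

    CREATE OR REPLACE PROCEDURE store_signals(@path, @codec = 'gzip') BEGIN
        COPY DS "signals" textFiles(@codec = $codec) INTO $path;
    END;

    CALL store_signals(@path = 'hdfs://output/signals');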

DROP PROCEDURE Statement

Use this statement to remove one or many PROCEDUREs from the current global execution context.

drop_proc

CREATE FUNCTION Statement

Language-level Functions provide a convenient way to call same expression many times but with different values passed as Variables.

create_func

They must always return some value. In the simplest form, function body is just a single "RETURN <expression>" statement.

return_func

In the more complex case they are similar to Procedures in the sense that they also are sequences of statements executed in their own context, but those statements are limited to LET, IF, LOOP, and RETURN. There will be an error if a Function execution path doesn't end with RETURN.

func_stmt

Compound Function statements are limited as well.

let_func

LET can't have subqueries.

if_func

IF branches are limited to other function statements.

loop_func

Same applies to LOOP.

Language-level Functions exist in the same namespace as Expression Functions provided by Pluggable modules, and they can't override them. But they can be replaced further down in the script code. For example, if your library script defines some Functions, you may redefine some in your script by using OR REPLACE clause.

Functions that have a RECORD specifier become Record-level, must always be called in Record-level contexts, and can call other Record-level Functions. They receive Record Key and Object as implicit arguments.

Each @-Parameter becomes a $-Variable inside BEGIN clause with the value assigned at the CALL context. In the calling site Parameters are always positional, never named.

Parameters can be optional; in that case they should have a default value set by an expression (evaluated in the current context, not in the call context), which is used if NULL is passed at the calling site. If defined without a default value (just as a @Name), Parameters become mandatory and must always be given a value at the calling site.
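
A sketch of the simplest form, a single RETURN expression (the parameter list spelling and overall layout are assumptions; see the create_func and return_func diagrams):

    CREATE OR REPLACE FUNCTION add_vat(@amount, @rate = 21)
        RETURN $amount + $amount * $rate / 100;

It could then be used in an expression as add_vat(100) or add_vat(100, 12), with positional arguments.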

DROP FUNCTION Statement

Use this statement to remove one or many FUNCTIONs from the current global execution context.

drop_func

Settings Statements

OPTIONS Statement

Sets internal Data Cooker parameters to specified values.

options_stmt

Unlike all other statements, Data Cooker pulls all OPTIONS to topmost execution context and executes them once, before evaluating all other script contents. They are completely ignored thereafter, in the normal script execution flow.

The set of parameters that may be set by OPTIONS depends on the Data Cooker distribution and version, so please see the dedicated doc in the distribution for exact definitions.