Transform Definition Language

Version 4 release 4.2, August 2024

Table of Contents

General Description
DS
··· DS Life Cycle
··· ··· Create DS
··· ··· Transform DS
··· ··· Store DS
Type System
··· DS Types
··· ··· PlainText
··· ··· Structured
··· ··· Passthru
··· ··· Columnar
··· ··· Spatial DS Types
··· ··· ··· Point
··· ··· ··· Track
··· ··· ··· Polygon
··· Attribute, Variable, and Parameter Types
··· ··· Names
··· ··· String Literals
··· ··· Numerics
··· ··· Boolean Type
··· ··· NULL
··· ··· Typed Arrays
··· Custom Object Types
Script Structure
··· Top-level Structure
··· Expressions
··· ··· Automatic Evaluation in Strings and Names
··· ··· Expression Operators
··· ··· Expression Functions
··· Parameter List Expressions
Data Storage Interface Statements
··· CREATE DS Statement
··· COPY DS Statement
Data Transformation Statements
··· TRANSFORM DS Statement
Data Manipulation Statements
··· SELECT Statement
··· CALL Statement
··· ANALYZE Statement
Compound Control Flow Statements
··· LET Statement
··· IF Statement
··· LOOP Statement
··· CREATE PROCEDURE Statement
··· DROP PROCEDURE Statement
Settings Statements
··· OPTIONS Statement

General Description

Transform Definition Language version 4 (TDL4) is a domain-specific language designed to describe ETL processes to be performed with the Data Cooker ETL tool. The strongest emphasis is on the Transformation phase.

The syntax of TDL4 is inspired by and based on SQL, but it has substantial semantic differences because Data Cooker differs from a traditional RDBMS. Instead of SQL tables that consist of column-based rows, Data Cooker operates on strongly typed collections of Record Objects called 'DS' (for Data Sets / Data Streams). Also, the Transformation phase of ETL does not imply analytic workloads or the complex data manipulation that traditional SQL facilities usually provide.

As an SQL dialect, TDL4 is mainly declarative but also has substantial means of flow control, and even procedural elements.

Another difference from SQL engines is that Data Cooker requires neither a global data catalog nor an information schema. A rudimentary partial schema is defined ad hoc by language statements only if required by the current context. It doesn't imply control over data types, nor does it enforce integrity, consistency, or data constraints.

TDL4 allows high-level manipulation of DSs as a whole, and intentionally omits features like aggregation, window functions, and anything that reduces several records of a given DS to a single value. Only record-level expression evaluation for record attributes is provided by SELECT and similar statements. Those Expression Operators and Functions are implemented as Java classes and thus are extensible.

At the same time, Data Cooker's language interpreter also provides built-in support for pluggable 'Operations' and 'Transforms' to allow more complex manipulations of DSs, written in Java. By implementing an Operation interface, developers can seamlessly integrate custom logic into ETL processes, including generation of new DSs and arbitrary changes to the data inside them. With Transforms, it is possible to change the type of DS Record Objects.

Another level of extensibility is provided by the 'Storage Adapters' layer of plug-ins, which covers the Extract and Load phases of ETL processing.

TDL4 has a specific type system, customized for DSs that contain not only relatively flat Records but also complex Spatial Objects with embedded geometry data, because Data Cooker's heritage begins with a geo-informatics processing toolchain.

DS

In terms of Data Cooker, a DS is a basic entity of every ETL process.

A DS is a typed collection of Objects that may or may not have a complex, type-dependent internal structure and/or a set of simple Attributes accessible from the process by Names. Each Object, together with an opaque Key, forms an elementary unit of processing, the Record.

DSs are partitioned — split into chunks called Partitions — to balance the load of executors, because each executor processes a single Partition at a time. The number of Partitions in most cases must be set manually, but if not, it is defined by the Input Adapter using intrinsic properties of the data source and execution environment. Records are assigned to Partitions by the hashcodes of their Keys. Correct definition of Keys is crucial to avoid 'data skew' (a situation where a small number of Partitions contain most of the records while all others are empty).

DS Life Cycle

According to the definition of ETL, DSs can be created (once), transformed (many times), and stored (several times).

In Data Cooker, a DS can be created by several means, from an external source of data or from other existing DSs.

Create DS

Create method #1. Input Adapters extract data from the data source, which could be some object or file-based storage, a traditional database, or something else. Each Input Adapter supports creation of a single type of DS, but may support different data sources.

The initial Key of a newly created Record is set by the Input Adapter, then the DS is partitioned using those initial Keys. Rules for a concrete Input Adapter are documented by its developer.

Method #2. A DS can be derived from one or many existing DSs by performing an SQL-like SELECT query. Type and partitioning of a derived DS are inherited from its parent DSs. See the SELECT section of this manual for details.

Method #3 is to call an Operation that generates a new DS from existing ones using its own rules. There are some common patterns described below in the CALL section, but exact rules are documented by the developer of the Operation.

Transform DS

After a DS is created, it may be transformed by Transforms, changing its Record type and/or Key, as well as the number of partitions. The number of objects can also change if one source object corresponds to many resulting ones, and vice versa. As many consecutive Transforms as needed can be performed on a given DS. As Transforms are pluggable, transformation rules are documented by their developers.

If Record Keys are changed by applying a Transform, affected Records may be assigned to other partitions.

Store DS

The final step of a DS's life cycle is storage to some external data store, performed by an Output Adapter. An Output Adapter may store different types of DSs into different formats. Rules, as usual, are documented by the developer.

Type System

Differing from SQL, TDL4 data types are defined first at the level of the DS as a whole (for each Record Object), and then at the level of Record Attributes and language constructs.

DS Types

Record Objects may or may not have an internal structure, and if the Record Object type does allow Attributes, each Attribute's type is dynamically coalesced to match the context of the expression that references it, if possible.

Supported types are:

PlainText

After loading a DS from an unstructured text source, each line is represented as text in the system encoding (usually UTF8) or as a byte array. Before performing meaningful transformations on the data contained in such a set, it should be converted to another, structured type.

Vice versa, before saving structured DSs to text-based file formats, they should be converted to PlainText.

Structured

Each object of a Structured DS is backed by a JSON representation, so it can be as arbitrary as JSON limitations allow. Namely, String, Numeric, Boolean, and NULL properties are supported, as well as homogeneous arrays (with every member of the same type). The top-level object may be an array itself.

Each top-level property is an Attribute itself (read-write), and nested properties' Attribute names are compound (read-only). They may include an array index or range part too.

Passthru

This type is not a data type per se, but just a marker reported via API calls by pluggable Transforms, meaning that the Transform doesn't change the data type of the set.

Columnar

Each object of a DS of this type is a record consisting of uniquely named columns, enumerated in the specified order. Once enumerated, they can be accessed via Attributes of the same names.

Data in a column doesn't have an explicit type, nor does it have any metadata except the name. Columns omitted from the record are automatically assigned NULL values.

Spatial DS Types

Because Data Cooker inherited the legacy of a preceding geo-informatics project, TDL4 natively supports geometry-specific Spatial data types implemented as extensions to the JTS Topology Suite and GeoJSON.

Point

Objects of this type are abstractions of Points on a map with an optional set of arbitrary Attributes.

Being an extension of JTS Point, each object has mandatory internal attributes of geographic 'latitude' and 'longitude'.

Track

An object of a Track is an abstraction of a route traveled by some physical object over the real-world map. It consists of Points, grouped together by an appointed 'tracking' Attribute and sorted by another 'timestamp' Attribute, optionally cut into segments by a 'segmentation' Attribute. If the segmentation Attribute is not defined, there will still be a single segment that includes all points of the track.

Each level of nested objects (Track, Segments, Points) may have an arbitrary number of other named Attributes.

Polygon

A region on a map, defined as a collection of a single external border outline and any number of internal cut-out outlines, with an arbitrary number of other named attributes.

There is no direct access to the vertices of Polygon outlines, but there is an automatically computed Centroid Point, with a radius set to the distance to its most distant border vertex.

Attribute, Variable, and Parameter Types

Outside DSs, there are also script-level Variables and Expression- / Pluggable Entity-level Parameters, which may have their own types, more or less similar to SQL: Names, Strings, Numerics, Booleans, NULL, and Typed Arrays.

Names

Names are used to define and identify any entities that have to be referred to in the script, i.e. DSs, Attributes, Parameters, Variables, and so on.

The canonical representation of a Name is a string of Latin alphabet letters, digits (in any position except the first), and underscores, or a string of any valid Unicode characters enclosed in double quotes. If a Name contains a double quote character itself, it must be doubled. If a Name clashes with a TDL4 keyword, it must be quoted.

All Names are case-sensitive.

The quoted representation of Names provides automatic expression evaluation (see the dedicated paragraph later).

Names can be joined by a dot to create compound names, used in SELECTs with JOINs.

Names of Variables are always explicitly identified by preceding 'dollar sign' $ sigil. Names of Parameters are likewise preceded by 'commercial at' @ sigil.

String Literals

String literals are fragments of arbitrary Unicode text enclosed in apostrophes. If a String should contain an apostrophe, it must be doubled.

String literals are always automatically evaluated (see later).

Numerics

There are different representations of immediate Numeric values allowed:

Here, 99 means a sequence of decimal digits, HH a sequence of hexadecimal digits, and s a sign sigil (+ or -). Parts in square braces are optional; the symbols E, L, . (dot) and 0x are used as-is.

Boolean Type

Literal TRUE is used for Boolean truth and literal FALSE for Boolean falseness.

NULL

Like in SQL, NULL is used for non-value.

Most expressions that include NULL evaluate to NULL, except the ones that use a truth table similar to the OR operator, and some other specific cases (see the dedicated topic). As in SQL, no NULL is equal to another NULL.

A Variable or parameter that is not defined in the current context evaluates to NULL as well.

Typed Arrays

Names, String Literals, and Numerics can be collected into Arrays. The syntax for an array constructor is a simple comma-separated list of values enclosed in square brackets. The optional keyword ARRAY may be used before the opening bracket.

All values in the array must be of the same type as the first one; otherwise Data Cooker throws an error and halts script execution. Arrays may be empty.

For Numeric Arrays that consist of Long values over some continuous range, there is a special syntax as well.

ARRAY[]
ARRAY[String, String, ...]
ARRAY[Numeric, Numeric, ...]
ARRAY[Name, Name, ...]
RANGE[Numeric, Numeric]

Ranges may be ascending or descending and include both boundaries. For example, RANGE[1,-1] is the equivalent of ARRAY[1,0,-1].

Arrays may be used in LOOP control structures, with the IN operator, to pass collections to CALL parameters, and in other contexts.
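
For illustration, a few array constructors assigned to Variables via the LET statement (described further below); the variable names and values are arbitrary:

  LET $levels = ARRAY['info', 'warn', 'error'];
  LET $ids = [1, 2, 3];
  LET $years = RANGE[2020, 2024];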

Custom Object Types

Some Operations and even Expression Functions define their own specific subtypes of data types that they interpret with their own set of rules. Those types are referred to as 'Something-like Objects'.

For example, standard date/time manipulation functions operate with DateTime-like Objects, which are represented by Long Epoch seconds or milliseconds, or by a String formatted as an ISO8601-like timestamp.

Another example: some spatial Operations may extend the existing Point Record type with a custom 'radius' attribute measured in meters, which makes them POI-like Objects (for Points of Interest).

Script Structure

Top-level Structure

TDL4 script is a sequence of any number of statements, each ending with a mandatory semicolon.

statement ; [ statement ; ... ] EOF

Each simple language statement is executed in the current execution context, starting with the top-level one, while compound statements run nested statements in their own nested contexts. The list of language statements is as follows:

create_stmt | transform_stmt | copy_stmt | let_stmt | loop_stmt | if_stmt | select_stmt | call_stmt | analyze_stmt | options_stmt | create_proc | drop_proc

Whitespace and tabulation characters are ignored and may be freely used for readability. A script may be empty; that is not an error.

Expressions

Expressions can be evaluated in two distinct contexts. The first is the 'Loose Expression' context, which can refer to any script-level entities. The other is the 'Record-level Expression' context, which can also refer to Record Attributes by their Names.

Loose Expression context:
    is_op, between_op, in_op, comparison_op, var_name, Numeric, String, NULL, TRUE, FALSE, expression_op, digest_op, bool_op, default_op, func_call, array

Record-level Expression context:
    is_op, between_op, in_op, comparison_op, var_name, Numeric, String, NULL, TRUE, FALSE, expression_op, digest_op, bool_op, default_op, func_attr, array, attr

Record-level evaluation rules take place only where they make sense.

Automatic Evaluation in Strings and Names

Automatic expression evaluation is performed inside quoted Names and String literals. Any text between curly braces { and } inside a quoted Name or String is evaluated as an expression, and the result is substituted as a substring in place.

To turn off automatic evaluation, the opening brace should be escaped with a backslash \.
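
A minimal sketch of both behaviors, assuming a Variable $suffix has been defined earlier in the script:

  LET $target = 'events_{$suffix}';
  LET $no_eval = 'braces kept verbatim: \{$suffix}';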

Expression Operators

There are three SQL-specific 'peculiar' Operators implemented directly by the TDL4 interpreter: IN, IS, and BETWEEN.

Operand [NOT] IN ARRAY[...]
Operand [NOT] IN $array_var
Operand [NOT] IN array_attr
    Priority 35, right-associative, 2 operands. TRUE if the left operand is present in the right one, cast to Array, FALSE otherwise. Vice versa for the NOT variant.

Operand IS [NOT] NULL
    Priority 35, right-associative, 1 operand. TRUE if the left operand is NULL, FALSE otherwise. Vice versa for the NOT variant.

Operand [NOT] BETWEEN Min AND Max
    Priority 35, right-associative, 3 operands. TRUE if the left operand is inclusively between the Min and Max operands, cast to Numerics, FALSE otherwise. For the NOT variant, TRUE if it is outside the range, excluding boundaries.
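
As a hedged illustration, these Operators could appear in WHERE clauses of SELECT statements (described later); the DS name 'events' and its attributes are invented for the example:

  SELECT * INTO events_known FROM events WHERE category IN ['click', 'view'];
  SELECT * INTO events_keyed FROM events WHERE user_id IS NOT NULL;
  SELECT * INTO events_bounded FROM events WHERE ts BETWEEN 0 AND 86400;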

All other Operators that can be used in Expressions are modeled after Java operators, implemented elsewhere in the classpath, and their list is extensible and dependent on the distribution.

If an Operator requires operands of a concrete data type, they are directly cast, or stringified and then parsed, to that type. If the cast or parsing is impossible, Data Cooker throws an error and halts script execution.

Operators with higher priority are evaluated before lower-priority operators. In the case of equal priority, evaluation is left to right, unless the Operator is right-associative. For expression grouping and changing of priority, parentheses can be used.

Most Operators result in NULL if any of their operands are NULL, but some do handle NULLs in a specific way.

For exact list of available Operators, their Priority, Associativity, and NULL handling, look into automatically generated distro docs.

Expression Functions

A function call has the following syntax:

func ( expression , expression , ... )
func ( )

Functions may have any number of strictly positional Arguments (1st, 2nd, 3rd, and so on). Some have a fixed list of strongly typed Arguments, some accept any number of Arguments that are automatically cast to a single type. There are also functions that have no Arguments and therefore return some predefined value, or even refer to the Record Object as a whole and access its internal data, such as a Spatial Object's geometry.

In any case, Arguments are evaluated from left to right.

Record-level Functions can be called only from Record-level Expressions, and are invalid in the Loose Expression context.

Functions are provided in the same way as Operators, by Java implementations from the classpath. For the complete list of Functions, their return and Argument types refer to your distro docs.

Parameter List Expressions

Parameter lists are expressions composed of parameters (at least one), separated by commas.

param , param , ...

Each parameter is a name-to-value pair, where the value may be an expression or an array literal.

@Name = expression
@Name = array

Parameter lists are usually passed to named function-like expressions (TRANSFORM, CALL, and so on).

func ( params_expr )

Parameter lists are bound to Record-level context and can refer to DS Attributes.
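
For illustration only, a functional expression with a parameter list could look like the fragment below; the entity name parseDelimited and its parameters @delimiter and @columns are purely hypothetical and not part of any particular distribution:

  parseDelimited(@delimiter = ';', @columns = ['id', 'name', 'value'])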

Data Storage Interface Statements

CREATE DS Statement

Creates a new DS from a data source. Its name must be new and unique.

CREATE DS ds_name func_expr FROM expression [ partition ] [ BY ( HASHCODE | SOURCE | RANDOM ) ]

The functional expression includes the Input Adapter name and its parameter list. The list of available Input Adapters depends on the distribution.

The FROM expression evaluates to a path String pointing to the data source. Each Input Adapter has its own path syntax, which could be a URI, a query, or something more specific.

The number of partitions is set by the PARTITION clause. Its expression evaluates to a Java Integer number of partitions. If not set, a single-partition DS is created.

PARTITION expression

Initial object keying, and thus initial partitioning, can be changed by the BY algorithm clause. The default algorithm is HASHCODE, which sets the initial Record Key to the Java hashcode of the Record Object itself, so identical objects will be placed into the same partitions. SOURCE derives Keys from the source file name (or a storage-dependent analog, if it's not file-based), so all Records from the same source file will go to the same partition. RANDOM sets keys to random values, effectively evenly shuffling records between partitions.

In any case, the initial key is a Java Integer numeric.
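
A hedged example of a complete statement; the Input Adapter name textFile, its @delimiter parameter, and the source path are hypothetical and depend on your distribution:

  CREATE DS raw_events textFile(@delimiter = ',') FROM 'hdfs:///source/events' PARTITION 16 BY SOURCE;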

COPY DS Statement

Saves existing DS to storage.

COPY DS ds_name [ * ] func_expr INTO expression

Likewise, the INTO expression evaluates to a path String (usually a URI) pointing to a data storage location, with a syntax specific to each Output Adapter. The list of available Output Adapters depends on the distribution.

If a star is used after the DS name, it is treated as a prefix, and all matching DSs will be copied to storage (usually into subdirectories under the specified INTO path, for file-based storages). Each partition usually goes to a separate file.
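
A hedged sketch using the prefix form; the Output Adapter name textFile, its parameter, and the destination path are invented for illustration:

  COPY DS report* textFile(@delimiter = ',') INTO 's3://bucket/reports';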

Data Transformation Statements

TRANSFORM DS Statement

Invokes a Transform to convert a DS from one type to another. Instead of keyword TRANSFORM, ALTER may be used.

TRANSFORM DS ds_name [ * ] func_expr [ columns_item ... ] [ key_item ] [ partition ]

If a star is used after the DS name, it is treated as a prefix, and all matching DSs will be transformed with the same parameters.

The functional expression includes the name of the Transform and the list of its parameters. As Transforms are pluggable, the exact list depends on the distribution, and thus they're documented separately.

An item of column definitions of a certain type has the following structure:

SET type_columns COLUMNS ( Name , Name , ... )
SET type_columns COLUMNS ( var_name )

In the case of the COLUMNS list with the ($variable) syntax, the referenced variable must be of the type Array of Names.

The 'type of columns' in this context depends on the DS type.

DS Type      Type of Attributes                 Record Object Level
PlainText    VALUE                              There are no Attributes
Columnar     VALUE                              Column names of the record object
Structured   VALUE                              Property names of the record object
Point        POINT or POI or VALUE              Attributes of the Point object
Track        TRACK or SEGMENTEDTRACK or VALUE   Attributes of the Track (level 1 object)
             SEGMENT or TRACKSEGMENT            Attributes of Segments (level 2 objects)
             POINT or POI                       Attributes of Points (level 3 objects)
Polygon      POLYGON or VALUE                   Attributes of the Polygon object

Key item is a simple expression, referring to existing DS columns:

KEY expression

If the key is set, DS Records will be reassigned to new partitions according to the new Keys. Be aware that the raw result of the key expression is used to set the Record Key, so if it evaluates to a constant, all records will go into one partition.

The moment of key evaluation depends on the Transform. Most evaluate keys before the record is transformed, but some do it after (for example, if the source Record is PlainText, no Attributes are available before transformation). See the Transforms documentation to find out.

If the statement includes a partition clause, the DS will be repartitioned to the desired number of partitions after both transforming and rekeying, even if that number is the same as before.
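
A hedged sketch of the full statement, assuming a hypothetical Transform named textToColumnar exists in the distribution (its parameters and the column names are likewise invented):

  TRANSFORM DS raw_events textToColumnar(@delimiter = ',') SET VALUE COLUMNS (user_id, ts, action) KEY user_id PARTITION 32;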

Data Manipulation Statements

SELECT Statement

Performs an SQL-like query against one or more existing DSs and creates a new, resulting DS. Its name, specified in the INTO clause, must be new and unique.

SELECT [ DISTINCT ] ( * | what_expr , what_expr , ... )
    ( INTO ds_name FROM from_scope | FROM from_scope INTO ds_name )
    [ WHERE where_expr ] [ LIMIT limit_expr ]

The query part is a star (meaning 'select all available attributes') or a list of 'what to select' expressions. Resulting DS attributes are named according to aliases (if specified) or to the stringified expression (the expression text with all whitespace removed):

expression [ AS [ type_alias ] alias ]

If the same alias is set for several expressions, the rightmost will overwrite the results of all preceding ones (there can't be multiple attributes with the same name in a DS object). Aliases are set per column type (same as in TRANSFORM).

SELECT expressions are bound to Record-level context for all DS participating in the SELECT.

For Structured SELECTs, compound property names must be used to access nested properties using dot notation ("top.inner.innermost"), and zero-based array indices or ranges to access array members. For example, "prop.inner[0]" for the 1st member of the "prop.inner" array, or "[2].prop" for the 3rd topmost array member's property "prop". Ranges may be specified only for innermost properties (no dots allowed afterward): "prop[*]" or "prop[]" for the entire array "prop", "prop[1:]" from the 2nd member onwards, "prop[:5]" up to the 6th, "prop[2:4]" from the 3rd to the 5th. Range boundaries may be negative, which means 'Nth member from the array end'. If the right boundary goes past the array size, the result will be NULL-padded. If the boundaries overlap or both go beyond the beginning, an empty array will be returned.
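
A hedged illustration of compound property access on a Structured DS; the DS name 'sessions' and its properties are invented:

  SELECT "user.id" AS uid, "events[0].type" AS first_event, "tags[1:]" AS later_tags INTO flattened FROM sessions;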

In the case of straight JOINs, fully qualified compound names for attributes from different DSs must be used in the same dot notation.

Source scope is specified in the FROM clause:

ds_name
join_op ds_name , ds_name , ...
union_op ds_name , ds_name , ...
union_op ds_name *

Queries can be performed against a single DS of any type, against many DSs (with UNION), or against multiple JOINed KeyValue DSs.

A query to a single DS is the equivalent of iterating over all its data and simply evaluating the what-to-select items, then applying constraints.

In the case of a single DS with a * query and no other constraints, a 'fast shallow copy' of the DS is performed (just a new name is assigned to the same underlying data).

UNIONs can unite either all listed DSs or, if the star syntax is used, all DSs whose names have the specified prefix. There are several types of UNION operation available:

UNION [ CONCAT | XOR | AND ]
UNION Operation           Result
CONCAT or CAT (default)   Records of all DSs are queried, in order of occurrence
XOR                       Only records that are unique to their source DSs are queried, others are discarded
AND                       Only records that occur in all participating DSs are queried, others are discarded

UNION can be applied to DSs of the same type only. Also, for a UNION to be successful, all top-level objects of the participating DSs must have top-level attributes of the same names.
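
A rough sketch following the from_scope grammar above, with the union_op preceding the list of source DSs; the DS names are invented, and the exact form accepted by your distribution should be checked against its docs:

  SELECT * INTO events_all FROM UNION events_2023, events_2024;
  SELECT * INTO events_unique FROM UNION XOR events_a, events_b;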

JOINs can be used to form new objects from participating DSs' Objects that have the same Record Keys. The following JOIN operations are available:

[ INNER | LEFT [ ANTI ] | RIGHT [ ANTI ] | OUTER ] JOIN

Resulting record objects of straight JOINs have internal data from the first meaningful Object encountered in the DS iteration order, while Attributes are read from all JOINed sets, using compound names with dot notation (the 1st part is the set name, then the attribute name). For subtracting JOINs, however, only one set is the source of the resulting objects, thus only that object's Attributes are available, using simple names.

INNER (default; straight)
    Records that have matching keys in all participating DSs are queried, others discarded. Records always have all Attributes from all participating sets. Result type: 1st set.

LEFT (straight)
    DSs are iterated from left to right. Result records always have columns from the 1st DS, and from subsequent DSs' records with the same keys. If there is no record with a matching key, NULLs are used for its attributes. Result type: 1st set.

LEFT ANTI (subtracting)
    DSs are iterated from left to right. Records from the 1st DS that have no matching keys in all subsequent DSs are queried, others discarded. Result type: 1st set.

RIGHT (straight)
    DSs are iterated from right to left. Result records always have columns from the last DS, and from preceding DSs' records with the same keys. If there is no record with a matching key, NULLs are used for its attributes. Result type: last set.

RIGHT ANTI (subtracting)
    DSs are iterated from right to left. Records from the last DS that have no matching keys in all subsequent DSs (in iteration order) are queried, others discarded. Result type: last set.

OUTER (straight)
    DSs are iterated from left to right. Result records always have columns from all DSs' records with the same keys. If there is no record with a matching key, NULLs are used for its attributes. Result type: 1st set.
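
A hedged sketch of a straight JOIN with compound attribute names, following the prefix from_scope grammar above; the DS names 'users' and 'orders' and their attributes are invented:

  SELECT users.name AS name, orders.total AS total INTO user_orders FROM JOIN users, orders WHERE orders.total IS NOT NULL;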

There are 3 types of constraints available to narrow down the query results: the WHERE, DISTINCT, and LIMIT clauses.

The WHERE constraint is a single Boolean expression (applicable only to one level of objects, selected by the type alias, if there are many levels), evaluated before the 'what-to-select' expressions and all other constraints, for each record. If the result is TRUE, the record is queried, otherwise discarded.

[ type_alias ] expression

DISTINCT, if specified, is applied after the resulting query is evaluated. Of all records that evaluate to exactly the same key and object, only one is left, and any repeats are discarded.

LIMIT is applied at the very last and comes in two flavors: by count and by percentage. The count can be specified in any valid Numeric format, and the resulting DS will have at most the specified number of records, randomly selected. If set by percentage, it must fall into a range from 0 (exclusive) to 100 (inclusive), and the result will be randomly trimmed down to the set percentage of records. If an invalid percentage is specified, it will lead to an error.

Numeric [ % ]

To summarize, the order of SELECT evaluation is as follows:

  1. Check of existence of INTO DS name
  2. Iteration, JOIN, or UNION over source scope of DSs
  3. WHERE constraint expression
  4. 'What-to-select' expressions, from left to right
  5. DISTINCT constraint
  6. LIMIT constraint
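
A hedged example combining the constraints above; the DS name 'events' and its attributes are invented, and the literal boundaries are arbitrary:

  SELECT DISTINCT user_id INTO sampled_users FROM events WHERE ts BETWEEN 1000 AND 2000 LIMIT 10%;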

CALL Statement

Serves to call pluggable Operations, akin to SQL stored procedures (but with a substantial semantic difference), or TDL Procedures (see further below).

CALL func_expr operation_io
CALL func_expr

The functional expression includes the name of an Operation or Procedure and its parameters. The full list of available Operations depends on the distribution, thus they and their parameters are documented separately. TDL Procedures are ephemeral and exist only in the current execution context.

Operations always create new DSs from existing ones and require INPUT FROM and OUTPUT INTO DS references (in either order).

( from_positional | from_named ) ( into_positional | into_named )
( into_positional | into_named ) ( from_positional | from_named )

Like SELECT, there is a source scope of DSs, set by the INPUT clause. There are two flavors of source scope, depending on the Operation.

Positional INPUT is either a list of DS names or a prefix followed by a star. In the case of a list, the Operation tries to use matching inputs in the specified order. In the case of the star syntax, order doesn't matter. Some Operations may require an exact number of positional inputs, while some will use all positional inputs to create an equal number of positional outputs. Please refer to the Operation documentation for its requirements.

INPUT FROM ds_name *
INPUT FROM ds_name , ds_name , ...

Named INPUTs have internal aliases, specific to Operations. In that case, each DS has an explicit role for the calling Operation. Some Operations may list some named inputs as optional; for the exact meaning of an omitted input, please see their respective docs.

INPUT ds_alias FROM ds_name , ds_alias FROM ds_name , ...

Quite unlike SELECT, an Operation may produce a number of resulting DSs. There are 3 distinct varieties of them. Filtered outputs have records internally unchanged from the input, but selected according to some set of rules. Augmented outputs extend or otherwise change the internal structure of records coming from the inputs. Finally, there may be outputs consisting of records generated from the input anew. Please refer to the Operation docs for each case.

Positional OUTPUT has a meaning similar to positional INPUT. A star name is used for setting the resulting DSs' prefix (with suffixes coming from the INPUT names), or a literal list of names is used, if they are comma-separated.

OUTPUT INTO ds_name *
OUTPUT INTO ds_name , ds_name , ...

Named OUTPUTs are explicitly different: for each Operation OUTPUT alias, the resulting DS generally has a specified structure. Some Operations may have optional named OUTPUTs.

OUTPUT ds_alias INTO ds_name , ds_alias INTO ds_name , ...

An Operation may have positional INPUTs and named OUTPUTs, or vice versa.
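
A hedged sketch of a CALL with a named input and a positional output; the Operation name countUniques, its parameter @byColumn, the input alias 'source', and the DS names are all invented for illustration:

  CALL countUniques(@byColumn = 'user_id') INPUT source FROM events OUTPUT INTO unique_counts;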

ANALYZE Statement

Performs calculation of basic statistics about the specified DS (or multiple DSs, found by prefix in the star syntax).

ANALYZE DS ds_name [ * ] [ KEY property_name ]

This statement uses a single selected Attribute as a 'counting key' for statistics. If not set, the record's own key is used automatically. After the 'counting key' is decided, all objects of the DS are iterated, and the following stats are calculated:

Statistical indicator   Meaning
_streamName             DS name
_streamType             DS type
_counterColumn          Property name set in the KEY clause, or NULL if inapplicable
_totalCount             Count of objects in the DS, like COUNT(*) in SQL
_uniqueCounters         Number of unique 'counting key' values
_counterAverage         Average number of objects per unique 'counting key' value
_counterMedian          Median number of objects per unique 'counting key' value

They are appended to a special Columnar DS named '_metrics' as a new record. Its record keys are always the names of the analyzed DSs.

Because each DS may be TRANSFORMed multiple times in its life cycle, and because ANALYZE may be called again for a DS (i.e. with a different 'counting key'), each successive invocation of ANALYZE for the same DS adds a new record to the '_metrics' DS instead of replacing the existing one.
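
A brief illustration of both forms; the DS names and the counting key attribute are invented:

  ANALYZE DS events KEY user_id;
  ANALYZE DS raw*;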

Compound Control Flow Statements

Compound statements have nested statements (including other compound statements) inside their clauses that execute under various conditions in a separate execution context.

LET Statement

Defines (or undefines) a new script Variable in the current execution context.

LET var_name = expression
LET var_name = sub_query

A simple Variable expression is the same as a parameter expression, as is a simple Array definition. If the expression result is NULL, the Variable is removed from the context.

Arrays may be returned as results from a limited SELECT query:

SELECT [ DISTINCT ] what_expr FROM ds_name [ WHERE where_expr ] [ LIMIT limit_expr ]

There are a few differences from a regular SELECT query. First, only one DS may be queried (no JOINs nor UNIONs are allowed). Second, only one expression in the what-to-select list is allowed. Non-Numeric results will be stringified, because Arrays don't support data types other than Numerics and Strings. Query constraints work as expected.
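
A hedged illustration of both LET forms; the variable names, DS name, and attribute are invented:

  LET $threshold = 100;
  LET $user_ids = SELECT DISTINCT user_id FROM events LIMIT 1000;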

IF Statement

Performs control flow branching. Evaluates a Boolean expression and executes the statements nested in the THEN clause for TRUE. Otherwise, it executes the statements in the ELSE clause, if present, or just does nothing.

IF let_expr THEN statements [ ELSE statements ] END IF

Nested statements in either branch can be any number of TDL4 statements (even none).
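
A hedged sketch following the grammar fragment above; the Variable $user_ids and the DS names are invented and assumed to be defined earlier:

  IF $user_ids IS NULL THEN
      ANALYZE DS events;
  ELSE
      SELECT * INTO events_filtered FROM events WHERE user_id IN $user_ids;
  END IF;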

LOOP Statement

Executes the statements nested in the BEGIN clause as many times as there are values in the array, each time setting the loop Variable to the next value. If the expression evaluates to NULL or the array is empty, and an ELSE clause is present, its nested statements will be executed once instead.

LOOP var_name IN let_expr BEGIN statements [ ELSE statements ] END LOOP

If there was an existing Variable with the same name as the loop Variable in the outer context, it will be restored to its previous value, because LOOP creates a new nested context.
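
A hedged sketch that also uses automatic evaluation in quoted Names and String literals; the Input Adapter textFile, its parameter, and the paths are invented, and quoted Names are assumed to be acceptable as DS names here:

  LOOP $year IN RANGE[2020, 2024] BEGIN
      CREATE DS "events_{$year}" textFile(@delimiter = ',') FROM 'hdfs:///source/{$year}';
  END LOOP;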

CREATE PROCEDURE Statement

Creates a sequence of statements in the current global execution context that can be CALLed as a whole, with its own context of Variables.

CREATE [ OR REPLACE ] PROCEDURE func ( proc_param , proc_param , ... ) AS BEGIN statements END PROCEDURE

Procedures exist in the same namespace as Operations, and they can't override existing Operations. But they can be replaced further down in the script code. For example, if your library script defines some Procedures, you may redefine some of them in your script by using the OR REPLACE clause.

Each @-Parameter becomes a $-Variable inside BEGIN clause with the value assigned at the CALL context. Because Parameters become Variables, they are always named, never positional.

Parameters can be optional; in that case they should have a default value set by an expression (evaluated in the current context, not in the CALL context). If defined without a default value (just as @Name), a Parameter becomes mandatory and must always have a CALL value.

param
@Name
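
A hedged sketch of defining and then CALLing a Procedure; the Procedure name, its parameters, the Input Adapter textFile, and the paths are all invented for illustration:

  CREATE PROCEDURE ingest_year(@year, @delim = ',') AS BEGIN
      CREATE DS "raw_{$year}" textFile(@delimiter = $delim) FROM 'hdfs:///source/{$year}';
  END PROCEDURE;

  CALL ingest_year(@year = 2024);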

DROP PROCEDURE Statement

Use this statement to remove one or many PROCEDUREs from the current global execution context.

DROP PROCEDURE func , func , ...

Settings Statements

OPTIONS Statement

Sets internal Data Cooker parameters to specified values.

OPTIONS params_expr

Unlike all other statements, Data Cooker pulls all OPTIONS into the topmost execution context and executes them once, before evaluating all other script contents. They are completely ignored thereafter, in the normal script execution flow.

The parameters that may be set by OPTIONS depend on the Data Cooker distribution and version, so please see the dedicated doc in the distribution for exact definitions.
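
For illustration only, an OPTIONS statement follows the usual parameter list syntax; the option name @log_level is invented and almost certainly differs in your distribution:

  OPTIONS @log_level = 'INFO';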