Transform Definition Language

Version 4 release 4.4, January 2025

Table of Contents

General Description
DS
··· DS Partitioning
··· DS Life Cycle
··· ··· Create DS
··· ··· Transform DS
··· ··· Store DS
Type System
··· DS Types
··· ··· PlainText
··· ··· Structured
··· ··· Passthru
··· ··· Columnar
··· ··· Spatial DS Types
··· ··· ··· Point
··· ··· ··· Track
··· ··· ··· Polygon
··· Attribute, Variable, and Parameter Types
··· ··· Names
··· ··· String Literals
··· ··· Numerics
··· ··· Boolean Type
··· ··· NULL
··· ··· Arrays
··· Custom Object Types
Script Structure
··· Top-level Structure
··· Expressions
··· ··· Automatic Evaluation in Strings and Names
··· ··· Expression Operators
··· ··· Expression Functions
··· Parameter List Expressions
Data Storage Interface Statements
··· CREATE DS Statement
··· COPY DS Statement
Data Transformation Statements
··· TRANSFORM DS Statement
Data Manipulation Statements
··· SELECT Statement
··· CALL Statement
··· ANALYZE Statement
··· ··· Per-partition
Compound Control Flow Statements
··· LET Statement
··· IF Statement
··· LOOP Statement
··· CREATE PROCEDURE Statement
··· DROP PROCEDURE Statement
··· CREATE FUNCTION Statement
··· DROP FUNCTION Statement
Settings Statements
··· OPTIONS Statement

General Description

Transform Definition Language version 4 (TDL4) is a domain-specific language designed to describe ETL processes performed with the Data Cooker ETL tool. The strongest emphasis is on the Transformation phase.

The syntax of TDL4 is inspired by and based on SQL, but has substantial semantic differences due to Data Cooker's differences from traditional RDBMS. Instead of SQL tables that consist of column-based rows, Data Cooker operates with strongly typed collections of Record Objects called 'DS' (for Data Sets / Data Streams). Also, the Transformation phase of ETL does not imply analytic workloads or the complex data manipulation that traditional SQL facilities usually provide.

As an SQL dialect, TDL4 is mainly declarative but also has substantial means of flow control, and even procedural elements.

Another difference from SQL engines is that Data Cooker requires neither a global data catalog nor an information schema. A rudimentary partial schema is defined ad hoc by language statements only if required by the current context. It doesn't provide control over data types, integrity, or consistency, nor does it enforce data constraints.

TDL4 allows high level manipulation over DSs as a whole, and intentionally omits features like aggregation, window functions, and anything that reduces several records of given DS to a single value. Only record-level expression evaluations for record attributes are provided by SELECT and similar statements. Those Expression Operators and Functions are implemented as Java classes and thus are extensible.

At the same time, Data Cooker's language interpreter also provides built-in support for pluggable 'Operations' and 'Transforms' to allow more complex manipulations of DSs written in Java. By implementing an Operation interface, developers can seamlessly integrate custom logic into ETL processes, including generating new DSs and changing the data inside them as they like. With Transforms, it is possible to change the type of DS Record Objects.

Another level of extensibility is provided by 'Storage Adapters' layer of plug-ins that covers Extract and Load phases of ETL processing.

TDL4 has a specific type system, customized for DSs that contain not only relatively flat Records, but also complex Spatial Objects with embedded geometry data, because Data Cooker's heritage is a geo-informatics processing toolchain.

DS

In terms of Data Cooker, a DS (short for Data Stream and/or Data Set) is a basic entity of every ETL process.

A DS is a typed collection of Objects that may or may not have a complex type-dependent internal structure, and/or a set of simple Attributes, accessible from the ETL process by Names. Each Object, together with a Key (which may be an opaque value or even another Object), forms an elementary unit of processing, the Record.

DS Partitioning

DSs are partitioned, i.e. split into chunks called Partitions, to balance the load of Spark executors, because each executor processes a single Partition at a time. The number of Partitions in most cases must be set manually, but if not, it'll be defined by the Input Adapter using intrinsic properties of the data source and execution environment.

Records are assigned to Partitions by hash codes of their Keys. Correct definition of Keys is crucial to avoid 'data skew' (a situation when a small number of Partitions contain most of the records while all others are empty).

In most cases, Data Cooker allows consumption of a DS either as a whole or as a subset of Partitions (from a single Partition to all of them). If such a subset is specified, a new DS is transparently created from the parent one on the fly. For efficient processing on a per-Partition basis, Keys must be stable and have a uniform distribution across the entire DS.

DS Life Cycle

According to definition of ETL, DSs can be created (once), transformed (many times), and stored (several times).

In Data Cooker, a DS can be created by several means, from an external source of data or from other existing DSs.

Create DS

Create method #1. Input Adapters extract data from the data source, which could be some object or file-based storage, a traditional database, or something else. Each Input Adapter supports creation of a single type of DS, but may support different data sources.

Initial Key of newly created Record is set by Input Adapter, then DS is partitioned using those initial keys. Rules for a concrete Input Adapter are documented by its developer.

Method #2. A DS can be derived from one or many existing DSs by performing an SQL-like SELECT query. Type and partitioning of a derived DS are inherited from its parent DSs. See the SELECT section of this manual for details.

Method #3 to create a DS is to call an Operation that generates new DS from existing ones using its own rules. There are some common patterns described below in the CALL section, but exact rules are documented by the developer of the Operation.

Transform DS

After a DS is created, it may be transformed by Transforms, changing its Record type and/or Key, as well as the number of Partitions. The number of objects can also change if one source object corresponds to many resulting ones, and vice versa. As many consecutive Transforms as needed can be performed on a given DS. As Transforms are pluggable, transformation rules are documented by their developers.

If Record Keys are changed by applying a Transform, affected Records may be assigned to other Partitions, if needed.

Store DS

Final step of DS's life cycle is storage to some external data store, performed by Output Adapter. An Output Adapter may store different types of DSs into different formats. Rules, as usual, are documented by developer.

Type System

Unlike SQL, TDL4 data types are defined first at the level of the DS as a whole for each Record Object, and then at the level of Record Attributes and language constructs.

DS Types

Record Objects may or may not have an internal structure, and if the Record Object type does allow Attributes, each Attribute type is dynamically coalesced to match the context of the expression that references it, if possible.

Supported types are:

PlainText

After loading a DS from an unstructured text source, each line is represented as text in the system encoding (usually UTF8), or as a byte array. Before performing meaningful transformations on the data contained in such a set, it should be converted to another, structured type.

Vice versa, before saving structured DSs to text-based file formats, they should be converted to PlainText.

Structured

Each object of a Structured DS is backed by a JSON representation, so it can be as arbitrary as it fits within JSON limitations. Namely, String, Numeric, Boolean, and NULL properties are supported, as well as homogeneous arrays (with every member of the same type). The top-level object may be an array itself.

Each top-level property is an Attribute itself (read-write), and nested properties' Attribute names are compound (read only). They may include array index or range part too.
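
For illustration, a hypothetical Structured Record backed by the JSON below would expose the Attribute names listed after it (the record contents and names are made up for this example; the index and range notation is described in the SELECT section):

    { "city": "Riga", "geo": { "lat": 56.95, "lon": 24.11 }, "tags": ["a", "b", "c"] }

    "city"        top-level property, a read-write Attribute
    "geo.lat"     nested property, a compound read-only Attribute
    "tags[0]"     1st member of the "tags" array
    "tags[1:]"    range of the "tags" array, from the 2nd member onwards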

Passthru

This type is not a data type per se, but just a marker reported from API calls to plugin Transforms, meaning that the Transform doesn't change the data type of the set.

Columnar

Each object of a DS of this type is a record consisting of uniquely named columns, enumerated in the specified order. Once enumerated, they can be accessed via Attributes of the same name.

Data in a column doesn't have an explicit type, nor does it have any metadata except the name. Columns omitted from the record are automatically assigned NULL values.

Spatial DS Types

Because Data Cooker inherited the legacy of a preceding geo-informatics project, TDL4 natively supports geometry-specific Spatial data types implemented as extensions to the JTS Topology Suite and GeoJSON.

Point

Objects of this type are abstractions of Points on a map with an optional set of arbitrary Attributes.

Being an extension of JTS Point, each object has mandatory internal attributes of geographic 'latitude' and 'longitude'.

Track

An Object of a Track is an abstraction of a route traveled by some physical object over the real-world map. It consists of Points, grouped together by an appointed 'tracking' Attribute and sorted by another 'timestamp' Attribute, optionally cut into Segments by a 'segmentation' Attribute. If the segmentation Attribute is not defined, there will still be a single Segment that includes all Points of a Track.

Each level of nested objects (Track, Segments, Points) may have an arbitrary number of other named Attributes.

Polygon

A region on a map, defined as a collection of a single external border outline and any number of internal cut-out outlines, with an arbitrary number of other named attributes.

There is no direct access to the vertices of Polygon outlines, but there is an automatically computed Centroid Point, whose radius is set to the distance to the most distant border vertex.

Attribute, Variable, and Parameter Types

Outside DSs, there are also script-level Variables, and Expression- / Pluggable Entity-level Parameters, that may have their own types, more or less similar to SQL: Names, String, Numeric, Boolean, NULL, and Arrays.

Simple literals are defined like this:

literal

Names

Names are used to define and identify any entities that have to be referred to in the script, i.e. DSs, Attributes, Parameters, Variables, and so on.

The canonical representation of a Name is a string of Latin alphabet letters, numbers (in any position except the first), and underscores, or a string of any valid Unicode characters enclosed in double quotes. If a Name should contain a double quote character itself, it must be doubled. If a Name clashes with a TDL4 keyword, it must be quoted.

ALL Names are case-sensitive.

Quoted representation of names provides automatic expression evaluation (see later in a dedicated paragraph).

Names could be joined by a dot to create compound names, used in SELECTs with JOINs.

Names of Variables are always explicitly identified by preceding 'dollar sign' $ sigil. Names of Parameters are likewise preceded by 'commercial at' @ sigil.
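
A few illustrative examples of the Name forms described above (the identifiers themselves are made up):

    signals_2024       canonical Name: Latin letters, digits, underscores
    "month of year"    quoted Name: any valid Unicode characters
    "15"" monitors"    quoted Name containing a double quote (doubled inside)
    "select"           quoted Name clashing with a TDL4 keyword
    $threshold         Variable Name, preceded by the $ sigil
    @affect_columns    Parameter Name, preceded by the @ sigil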

String Literals

String literals are fragments of arbitrary Unicode text enclosed in apostrophes. If a String should contain an apostrophe, it must be doubled.

String literals are always automatically evaluated (see later).

Numerics

There are different representations of immediate Numeric values allowed:

99 here means a sequence of decimal digits, HH a sequence of hexadecimal digits, and s is the sign sigil (+ or -). Parts in square braces are optional; the symbols E, L, . (dot) and 0x are used as is.
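
For illustration, the following immediate Numeric values follow the notation above (the list is not exhaustive, and the type names given are the presumed Java counterparts):

    42        Integer
    -7        signed Integer
    100L      Long, denoted by the L suffix
    0xFF      hexadecimal Integer
    3.14      floating-point number, with the . (dot)
    1.5E3     floating-point number in exponential notation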

Boolean Type

Literal TRUE is used for Boolean truth and literal FALSE for Boolean falseness.

NULL

Like in SQL, NULL is used for non-value.

Most of the expressions that include NULL evaluate to NULL, except ones that use a truth table similar to the OR operator, and some other specific cases (see the dedicated topic). As in SQL, no NULL is equal to another NULL.

A Variable or parameter that is not defined in the current context evaluates to NULL as well.

Arrays

TDL supports loosely-typed Arrays. Arrays may be used in LOOP control structures, IN operator, to pass collections to Pluggable Entities as Parameters, and in other contexts.

Virtually anything can be collected into Arrays by Pluggable Entities such as Functions, but literal array constructors support only simple literals. Constructor syntax is just a list of values separated by commas, enclosed in square brackets (for example, ... WHERE "month" IN ['Nov', 'Dec', 11, 12]). The optional keyword ARRAY may be used before the opening bracket (for example, LOOP $i ARRAY['a', 'b', 'c'] BEGIN ...). Arrays may be empty (just []).

Arrays of Names are supported only in context of Pluggable Entity Parameters (CALL something(@affect_columns=["first","third","fourth"]) ...), and throw an error otherwise. Names cannot be mixed with other literals.

For Arrays that consist of Numeric values over some continuous range, there is a special constructor syntax as well.

array

Ranges may be ascending or descending and always include both boundaries. For example, RANGE[1,-1] is the equivalent of ARRAY[1,0,-1]. If both boundaries of a Range have the type of Integer, then all values in the Range will be Integers. If either is a Long, then all will be Longs (and Doubles are converted to Longs).

Custom Object Types

Some Operations and even Expression Functions define their own specific subtypes of data types that they interpret with their own set of rules. Those types are referred to as 'Something-like Objects'.

For example, standard date/time manipulation functions operate with DateTime-like Objects, which are represented by Long Epoch seconds or milliseconds, or a String formatted as an ISO8601-like timestamp.

Another example, some spatial Operations may extend existing Point Record type with custom 'radius' attribute measured in meters, that makes them POI-like Objects (for Points of Interest).

Script Structure

Top-level Structure

TDL4 script is a sequence of any number of statements, each ending with a mandatory semicolon.

script

Each simple language statement is executed in the current execution context, starting with the top-level one, while compound statements run nested statements in their own nested context. The list of language statements is as follows:

statement

Whitespace and tabulation characters are ignored and may be freely used for readability. A script may be empty; that is not an error.

Expressions

Expressions can be evaluated in two distinct contexts. First is 'Loose Expression' that can refer to any script-level entities. Another one is 'Record-level Expression' which can also refer to Record Attributes by their Names.

expression

Record-level evaluation rules take place only where it makes sense (SELECT, WHERE, KEY clauses contexts).

Automatic Evaluation in Strings and Names

There is automatic expression evaluation performed inside quoted Names and String literals. Any text between curly braces { and } inside quoted Name or String is evaluated as an expression, and result is substituted as a substring in-place.

To turn off automatic evaluation, braces should be prepended by a backslash \.

Please note that evaluation of quoted entities happens at parse time of the language Operator, not at call time. As such, quoted entities won't re-evaluate themselves in SELECT clauses on a per-Record basis. Use Expression Functions there instead.
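
For illustration, assuming a Variable $y already holds the Numeric value 2025 at the moment of parsing:

    'report for {$y}'     evaluates to 'report for 2025'
    'report for \{$y}'    evaluation is turned off, the text between the braces is kept literally
    "stats_{$y}"          quoted Name that evaluates to "stats_2025"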

Expression Operators

There are three SQL-specific 'peculiar' Operators implemented right by TDL4 interpreter: IN, IS, and BETWEEN.

Operand [NOT] IN ARRAY[]
Operand [NOT] IN $array_var
Operand [NOT] IN array_attr
    Priority 35, right-associative, 2 operands. TRUE if the left operand is present in the right one, cast to Array, FALSE otherwise. Vice versa for the NOT variant.

Operand IS [NOT] NULL
    Priority 35, right-associative, 1 operand. TRUE if the left operand is NULL, FALSE otherwise. Vice versa for the NOT variant.

Operand [NOT] BETWEEN Min AND Max
    Priority 35, right-associative, 3 operands. TRUE if the left operand is inclusively between the Min and Max operands, cast to Numerics, FALSE otherwise. For the NOT variant, TRUE if it is outside the range, excluding boundaries.

All other Operators that can be used in Expressions are modeled after Java operators (down to the symbolic representation), and/or implemented elsewhere in the classpath (and thus have a keyword-like name), and their list is extensible and depends on the distribution.

If an Operator requires operands of concrete data type, they will be directly cast or stringified-then-parsed to that type. If cast or parsing is impossible, Data Cooker throws an error and halts script execution.

Operators with higher priority are evaluated before lower priority operators. In case of same priority, evaluation is left to right except if Operator is right-associative or unary. For subexpression grouping and changing of evaluation order, parentheses can be used.

Most Operators result in NULL if any of their operands are NULL, but some do handle NULLs in a specific way.

For the exact list of available Operators, their Priority, Associativity, and NULL handling, look into automatically generated distro docs.

Expression Functions

A function call has the following syntax:

func_call

Functions may have any number of strictly positional Arguments (1st, 2nd, 3rd, and so on). Some have a fixed list of strongly typed Arguments, some accept any number of Arguments that are automatically cast to a single type. There are also functions that have no arguments and therefore return some predefined value, or even refer to Object Record as a whole and access its internal data, such as Spatial Object's geometry.

In any case, Arguments are evaluated from left to right.

Expressions in parentheses are considered a special case of a function call of an identity function (which returns its only argument unchanged), to allow parentheses to change the evaluation order of compound expressions.

Record-level Functions can be called only from Record-level Attribute Expressions, and are invalid in the Loose Expression context. Calling them outside SELECT, WHERE, or KEY clauses will cause an error.

Expression Functions are provided in two ways. First, same as Operators, by Java implementations from the classpath. For the complete list of Pluggable Functions and their return and Argument types, refer to your distro docs. Second, Language-level Functions may be defined immediately in the script, or in a library script. This method is discussed further in this document.

Parameter List Expressions

Parameter lists are expressions composed of parameters (at least one), separated by commas.

params_expr

Each parameter is a name to value (which may be expression or array literal) pair.

param

Parameter lists are usually passed to named function-like expressions (TRANSFORM, CALL, and so on). In that case parameter lists are bound to Record-level context and can refer to DS Record Attributes.
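
Following the example in the Arrays section, a parameter list might look like the following sketch (the parameter names and values here are hypothetical):

    @delimiter = ',', @affect_columns = ["first", "third", "fourth"], @limit = $max_records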

Data Storage Interface Statements

CREATE DS Statement

Creates a new DS from data source. Its name must be new and unique.

create_stmt

Functional expression includes Input Adapter name and its parameter list. List of available Input Adapters depends on the distribution.

FROM expression evaluates to a path String pointing to data source. Each Input Adapter has its own syntax of path, which could be a URI, a query, or something more specific.

Number of partitions is set by PARTITION clause. Its expression evaluates to Java Integer number of partitions. If not set, single partition DS is created.

Initial object keying, and thus initial partitioning, can be changed by the BY algorithm clause. The default algorithm is HASHCODE, which sets the initial Record Key to the Java hashcode of the Record Object itself, so identical objects will be placed into the same partitions. SOURCE derives Keys from the source file name (or a storage-dependent analog, if it's not file-based), so all Records from the same source file will go to the same partition. RANDOM sets keys to random values, effectively shuffling records evenly between partitions.

In any case, initial key is a Java Integer numeric.
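
A minimal sketch of the statement, with an assumed clause order (consult the create_stmt diagram for the authoritative syntax; the 'textFiles' Input Adapter, its parameter, and the path are hypothetical):

    CREATE DS "signals" textFiles(@encoding = 'UTF-8')
        FROM 'hdfs://data/signals/2025'
        PARTITION 64 BY SOURCE;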

COPY DS Statement

Saves existing DS to storage.

copy_stmt

Likewise, INTO expression evaluates to a path String (usually, a URI) pointing to a data storage location, with a syntax specific to each Output Adapter. List of available Output Adapters depends on the distribution.

If a star is used after DS name, it is treated as a prefix, and all matching DSs will be copied to storage (usually, into subdirectories under specified INTO path for file-based storages). Each partition would usually go to a separate file.

This Statement supports per-Partition processing.

ds_parts

If needed, DS name could have additional PARTITION specifier, which has an expression that evaluates to a single Integer Partition number, or to Integer ARRAY or RANGE of Partitions to address. If star-syntax is used, all prefixed DSs must have addressed partitions.
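
A sketch of copying a single DS (with a partition subset) and a prefixed group of DSs, with an assumed clause order (see the copy_stmt and ds_parts diagrams; the 'textFiles' Output Adapter, its parameter, and the paths are hypothetical):

    COPY DS "signals" PARTITION RANGE[0, 3] textFiles(@codec = 'gzip') INTO 'hdfs://output/signals';
    COPY DS "metrics_*" textFiles(@codec = 'gzip') INTO 'hdfs://output/metrics';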

Data Transformation Statements

TRANSFORM DS Statement

Invokes a Transform to convert a DS from one type to another. Instead of keyword TRANSFORM, ALTER may be used.

transform_stmt

If a star is used after the DS name, it is treated as a prefix, and all matching DSs will be transformed with the same parameters.

Functional expression includes name of Transform and list of its parameters. As transforms are pluggable, exact list depends on a distribution, and thus they're documented separately.

An item of column definitions of a certain type has the following structure:

columns_item

In case of the COLUMNS list with the ($variable) syntax, the referenced variable must be of the Array of Names type.

'Type of columns' in this context depends on the DS type.

DS Type      Type of Attributes                 Record Object Level
PlainText    VALUE                              There are no Attributes
Columnar     VALUE                              Column names of the record object
Structured   VALUE                              Property names of the record object
Point        POINT or POI or VALUE              Attributes of the Point object
Track        TRACK or SEGMENTEDTRACK or VALUE   Attributes of the Track (level 1 object)
             SEGMENT or TRACKSEGMENT            Attributes of Segments (level 2 objects)
             POINT or POI                       Attributes of Points (level 3 objects)
Polygon      POLYGON or VALUE                   Attributes of the Polygon object

Key item is a Record-level expression, referring to existing DS attributes:

key_item

If a KEY expression is specified, each DS Record will be assigned a newly calculated Key.

The point at which Keys are evaluated depends on the Transform. Most evaluate Keys before the record is transformed, but some do it after (for example, if the source Record is PlainText, no Attributes are available before the transformation). See the Transforms documentation to find out.

If there is a PARTITION clause, Records will then be redistributed between new partitions according to their new Keys (with the number of partitions given by the PARTITION expression, evaluated to an Integer). Be aware that the raw result of the KEY expression is used to set the Record Key, so if it evaluates to a constant, all records will go into one partition. If there is no PARTITION clause, only Keys will be changed, and no redistribution is performed.

To keep existing number of partitions, just omit PARTITION expression, leaving only a keyword.

If the statement includes a PARTITION clause without KEY, the DS will be repartitioned to the desired number of partitions after transforming, using HASHCODE partitioning over existing Keys.
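
A sketch of a typical transformation, with an assumed clause order and spelling (see the transform_stmt, columns_item, and key_item diagrams for the authoritative syntax; the 'textToColumnar' Transform and its parameter are hypothetical):

    TRANSFORM DS "signals" textToColumnar(@delimiter = ',')
        COLUMNS ("userid", "lat", "lon", "timestamp")
        KEY "userid"
        PARTITION 128;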

Data Manipulation Statements

SELECT Statement

Performs an SQL-like query on one or more existing DSs and creates a new, resulting DS. Its name, specified in the INTO clause, must be new and unique.

select_stmt

The query part is either a star (meaning 'select all available attributes') or a list of 'what to select' expressions. Resulting DS attributes are named according to aliases (if specified) or the stringified expression (the expression text with all whitespace removed):

what_expr

If the same alias is set for several expressions, the rightmost will overwrite the results of all preceding ones (there can't be multiple attributes with the same name in a DS object). Aliases are set per column type (same as in TRANSFORM).

SELECT expressions are bound to Record-level context for all DS participating in the SELECT.

For Structured SELECTs, compound property names must be used to access nested properties using dot notation ("top.inner.innermost"), and zero-based array indices or ranges to access array members. For example, "prop.inner[0]" for the 1st member of the "prop.inner" array, or "[2].prop" for the property "prop" of the 3rd topmost array member. Ranges may be specified only for innermost properties (no dots allowed afterward): "prop[*]" or "prop[]" for the entire array "prop", "prop[1:]" from the 2nd member onwards, "prop[:5]" up to the 6th, "prop[2:4]" from the 3rd to the 5th. Range boundaries may be negative, which means 'Nth member from the array end'. If the right boundary goes past the array size, the result will be NULL-padded. If the boundaries overlap or both go beyond the beginning, an empty array will be returned.

In case of straight JOINs, fully-qualified compound names for attributes from different DSs must be used in the same dot notation.

Source scope is specified in the FROM clause:

from_scope

Queries can be performed on a single DS of any type, on many DSs at once (with UNION), or on multiple JOINed KeyValue DSs.

A query to a single DS is equivalent to iterating over all its data and simply evaluating the what-to-select items, then applying constraints.

In the case of a single DS with * query and no other constraints, a 'fast shallow copy' of DS is performed (just a new name to same underlying data is assigned).

UNIONs can unite either all listed DSs, or, if the star syntax is used, all DSs whose names have the specified prefix. There are several types of UNION operation available:

union_op
UNION Operation           Result
CONCAT or CAT (default)   Records of all DSs are queried, in order of occurrence
XOR                       Only records that are unique to their source DSs are queried, others are discarded
AND                       Only records that occur in all participating DSs are queried, others are discarded

UNION can be applied to DSs of same type only. Also, for UNION to be successful, all top-level objects of participating DSs must have top-level attributes of same names.

JOINs can be used to form new objects from the participating DSs' Objects that have the same Record Keys. The following JOIN operations are available:

join_op

Resulting record objects of straight JOINs have internal data from the first meaningful Object encountered in the DS iteration order, while Attributes are read from all JOINed sets, using a compound name with dot notation (the 1st part is the set name, then the attribute name). For subtracting JOINs, however, only one set is the source of resulting objects, thus only that object's Attributes are available, using simple names.

INNER (default). Straight JOIN, result type: 1st set.
    Records that have matching keys in all participating DSs are queried, others discarded. Records always have all Attributes from all participating sets.

LEFT. Straight JOIN, result type: 1st set.
    DSs are iterated from left to right. Result records always have columns from the 1st DS, and from records of subsequent DSs with the same keys. If there is no record with a matching key, NULLs will be used for its attributes.

LEFT ANTI. Subtracting JOIN, result type: 1st set.
    DSs are iterated from left to right. Records from the 1st DS that have no matching keys in any of the subsequent DSs will be queried, others discarded.

RIGHT. Straight JOIN, result type: last set.
    DSs are iterated from right to left. Result records always have columns from the last DS, and from records of preceding DSs with the same keys. If there is no record with a matching key, NULLs will be used for its attributes.

RIGHT ANTI. Subtracting JOIN, result type: last set.
    DSs are iterated from right to left. Records from the last DS that have no matching keys in any of the subsequent DSs (in iteration order) will be queried, others discarded.

OUTER. Straight JOIN, result type: 1st set.
    DSs are iterated from left to right. Result records always have columns from all DSs' records with the same keys. If there is no record with a matching key, NULLs will be used for its attributes.

Note that any referenced DS in the FROM clause may have its own PARTITION specifier, if needed. In case of the star-syntax, all prefixed DSs must have the referenced parts.

There are 3 types of constraints available to narrow down the query results: the WHERE, DISTINCT, and LIMIT clauses.

The WHERE constraint is a single Boolean expression (applicable only to one level of object, selected by type alias, if there are many), evaluated for each record before the 'what-to-select' expressions and all other constraints. If the result is TRUE, the record is queried, otherwise it is discarded.

where_expr

DISTINCT, if specified, is applied after the resulting query is evaluated. Of all records that have been evaluated to exactly the same key and object, only one will be left, and any repeats are discarded.

LIMIT is applied at the very last and comes in two flavors: by count and by percentage. The count can be specified in any valid Numeric format, and the resulting DS will have at most the specified number of records, randomly selected. If set by percentage, it must fall into a range from 0 (exclusive) to 100 (inclusive), and the result will be randomly trimmed down to the set percentage of records. Specifying an invalid percentage leads to an error.

limit_expr

To summarize, the order of SELECT evaluation is as follows:

  1. Check that the INTO DS name is new and unique
  2. Iteration, JOIN, or UNION over source scope of DSs
  3. WHERE constraint expression
  4. 'What-to-select' expressions, from left to right
  5. DISTINCT constraint
  6. LIMIT constraint
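
Putting it together, a hypothetical single-DS query might be sketched as follows (DS and Attribute names are made up, and the clause order shown here is an assumption; see the select_stmt diagram for the authoritative syntax):

    SELECT "userid", "lat", "lon"
        INTO "november_signals"
        FROM "signals"
        WHERE "month" IN [11, 12]
        LIMIT 100000;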

CALL Statement

Serves to call pluggable Operations, akin to SQL stored procedures (but with a substantial semantic difference), or TDL Procedures (see further below).

call_stmt

The functional expression includes the name of an Operation or Procedure and its parameters. The full list of available Operations depends on the distribution, thus they and their parameters are documented separately. TDL Procedures are ephemeral and exist only in the current execution context.

func_expr

Operations always create new DSs from existing and require INPUT FROM and OUTPUT INTO DS references (in either order).

operation_io

Like in SELECT, there is a source scope of DSs, set by the INPUT clause. There are two flavors of source scope, depending on the Operation.

Positional INPUT is either a list of DS names, or a prefix, followed by a star. In case of a list, Operation tries to use matching inputs in the specified order. In case of star syntax, order doesn't matter. Some Operations may require an exact number of positional inputs, while some will use all positional inputs to create an equal number of positional outputs. Please refer to Operation documentation for its requirements.

from_positional

Named INPUTs have internal aliases, specific to Operations. In that case, each DS has an explicit role for the calling Operation. Some Operations may list some named inputs as optional; for the exact meaning of an omitted input please see their respective docs.

from_named

Note that any referenced DS in the INOUT clause may have its own PARTITION specifier, if needed. In case of the star-syntax, all prefixed DSs must have the referenced parts.

Quite unlike SELECT, an Operation may produce a number of resulting DSs. There are 3 distinct varieties of them. Filtered outputs have records internally unchanged from the input, but selected according to some set of rules. Augmented outputs extend or otherwise change the internal structure of records coming from inputs. Finally, there may be outputs consisting of records generated from the input anew. Please refer to the Operation docs for each case.

Positional OUTPUT has a meaning similar to positional INPUT. The specified star name is used to set the prefix of the resulting DSs (with suffixes coming from the INPUT names), or a literal list of names is used, if they are comma-separated.

into_positional

Named OUTPUTs are explicitly different: for each Operation OUTPUT alias, the resulting DS generally has a specified structure. Some Operations may have optional named OUTPUTs.

into_named

An Operation may have positional INPUTs and named OUTPUTs, or vice versa.
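
For example, a call to a hypothetical 'dedup' Operation with one positional input and one positional output might be sketched like this (the Operation name and its parameter are made up; the INPUT FROM and OUTPUT INTO spellings follow the description above, while the overall clause order is given by the call_stmt and operation_io diagrams):

    CALL dedup(@key_attr = "userid")
        INPUT FROM "signals"
        OUTPUT INTO "signals_deduped";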

ANALYZE Statement

Performs calculation of basic statistics about the specified DS (or multiple DSs, found by prefix in the star syntax).

analyze_stmt

This statement uses results of KEY expression evaluated for each DS Record as 'counting values' for DS statistics, like in TRANSFORM statement. If KEY clause is omitted, Record's own Key is used automatically.

After the counting value is decided, all records in the DS are iterated over to calculate the following indicators:

Statistical indicator   Meaning
_name                   DS name
_type                   DS Record type
_parts                  Number of DS partitions
_counter                KEY expression
_total                  Count of records in the DS, like COUNT * in SQL
_unique                 Number of unique counting values
_average                Average number of records per each unique counting value
_median                 Median number of records per each unique counting value

They are appended to a special Columnar DS named "_metrics" as a new record. Its record keys are always the names of the analyzed DSs.

Because each DS may be altered multiple times in its life cycle, and because ANALYZE may be called again for the same DS (e.g. with a different KEY expression), each successive invocation of ANALYZE for the same DS adds a new record to the "_metrics" DS instead of replacing it, keeping the history of all ANALYZE calls.

Per-partition

If the PARTITION clause is specified, an additional round of statistics calculation is performed for each partition of the analyzed DS. In that case, an additional special Columnar DS is created for each analyzed DS, named "_metrics_<DS name>". There are as many records in that DS as the analyzed DS has partitions, and their fields contain the following indicators:

Statistical indicator   Meaning
_part                   Partition number, starting with 0
_counter                KEY expression
_total                  Count of records in the partition
_unique                 Number of unique counting values per partition
_average                Average number of records per each unique counting value, per partition
_median                 Median number of records per each unique counting value, per partition

Its record keys are always the partition numbers of the analyzed DSs. If some partition is empty, its indicators amount to zero.

Each subsequent ANALYZE PARTITION replaces the per-partition statistics DS "_metrics_<DS name>", so if multiple results are required (e.g. with different KEY expressions), make sure to copy it under different names.
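
A sketch, with the clause order assumed here (the DS and Attribute names are made up; see the analyze_stmt diagram for the authoritative syntax):

    ANALYZE "signals" KEY "userid" PARTITION;

After this statement, overall indicators for "signals" are appended to the "_metrics" DS, and per-partition indicators replace the "_metrics_signals" DS.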

Compound Control Flow Statements

Compound statements have nested statements (including other compound statements) inside their clauses that execute under various conditions in a separate execution context.

LET Statement

Defines (or undefines) a new script Variable in the current execution context.

let_stmt

A simple Variable expression is the same as a Parameter expression (which means Name Array definitions are allowed). If the expression result is NULL, the Variable is removed from the context.

Arrays of values derived from DS Records may be returned as results from the limited SELECT query:

sub_query

There are a few differences from a regular SELECT query. First, only one DS may be queried (neither JOINs nor UNIONs are allowed). Second, only one expression in the what-to-select list is allowed. Query constraints work as expected, and the PARTITION specifier is supported as well.
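
Two illustrative forms, a simple expression and a limited sub-query (the assignment spelling and clause order here are assumptions; see the let_stmt and sub_query diagrams):

    LET $threshold = 100;
    LET $months = SELECT DISTINCT "month" FROM "signals" WHERE "month" IN [11, 12];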

IF Statement

Performs control flow branching. Evaluates a Boolean expression, and executes statements nested in THEN clause for TRUE. Otherwise, executes statements in ELSE clause, if it is present, or just does nothing.

if_stmt

Either branch can contain any number of nested TDL4 statements (even none).
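
A sketch of branching on a Variable (the layout and the END spelling that closes the statement are assumptions; see the if_stmt diagram):

    IF $threshold IS NULL THEN
        LET $threshold = 100;
    ELSE
        LET $threshold = $threshold + 100;
    END;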

LOOP Statement

Executes the statements nested in the BEGIN clause as many times as there are values in the array, each time setting the loop Variable to the next value. If the expression evaluates to NULL or the array is empty, and an ELSE clause is present, its nested statements will be executed once instead.

loop_stmt

If there was an existing Variable with the same name as the loop Variable in the outer context, it will be restored to its previous value, because LOOP creates a new nested context.
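
A sketch reusing the LOOP header form shown in the Arrays section (the END spelling that closes the BEGIN clause, and re-evaluation of the quoted Name on each iteration, are assumptions; the optional ELSE clause is omitted here):

    LOOP $m ARRAY['nov', 'dec'] BEGIN
        ANALYZE "signals_{$m}" KEY "userid";
    END;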

CREATE PROCEDURE Statement

Creates a sequence of statements in the current global execution context, that can be CALLed as a whole with its own context of Variables.

create_proc

Procedures exist in the same namespace as Operations, and they can't override existing Operations. But they can be replaced further down in the script code. For example, if your library script defines some Procedures, you may redefine some in your script by using OR REPLACE clause.

Each @-Parameter becomes a $-Variable inside BEGIN clause with the value assigned at the CALL context. Because Parameters become Variables, they are always named, never positional.

Parameters can be optional; in that case they should have a default value set by an expression (evaluated in the current context, not in the CALL context). If defined without a default value (just as a @Name), Parameters become mandatory and must always be given a value at CALL time.

proc_param
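
A sketch of a Procedure with one mandatory and one optional Parameter, followed by a CALL to it (the parameter list spelling, the END that closes the BEGIN clause, and the COPY statement inside reuse the assumptions made in earlier sketches; consult the create_proc and proc_param diagrams for the authoritative syntax):

    CREATE OR REPLACE PROCEDURE store_signals(@path, @codec = 'gzip') BEGIN
        COPY DS "signals" textFiles(@codec = $codec) INTO $path;
    END;

    CALL store_signals(@path = 'hdfs://output/signals');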

DROP PROCEDURE Statement

Use this statement to remove one or many PROCEDUREs from the current global execution context.

drop_proc

CREATE FUNCTION Statement

Language-level Functions provide a convenient way to call same expression many times but with different values passed as Variables.

create_func

They must always return some value. In the simplest form, function body is just a single "RETURN <expression>" statement.

return_func

In the more complex case they are similar to Procedures in the sense that they also are sequences of statements executed in their own context, but those statements are limited to LET, IF, LOOP, and RETURN. There will be an error if a Function execution path doesn't end with RETURN.

func_stmt

Compound Function statements are limited as well.

let_func

LET can't have subqueries.

if_func

IF branches are limited to other function statements.

loop_func

Same applies to LOOP.

Language-level Functions exist in the same namespace as Expression Functions provided by Pluggable modules, and they can't override them. But they can be replaced further down in the script code. For example, if your library script defines some Functions, you may redefine some in your script by using OR REPLACE clause.

Functions that have a RECORD specifier become Record-level, must always be called in Record-level contexts, and can call other Record-level Functions. They receive Record Key and Object as implicit arguments.

Each @-Parameter becomes a $-Variable inside BEGIN clause with the value assigned at the CALL context. In the calling site Parameters are always positional, never named.

Parameters can be optional; in that case they should have a default value set by an expression (evaluated in the current context, not in the call context), which is used if NULL is passed at the calling site. If defined without a default value (just as a @Name), Parameters become mandatory and must always be given a value at the calling site.
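
A sketch of the simplest form, a single RETURN expression (the parameter list spelling and overall layout are assumptions; see the create_func and return_func diagrams):

    CREATE OR REPLACE FUNCTION add_vat(@amount, @rate = 21)
        RETURN $amount + $amount * $rate / 100;

It could then be used in an expression as add_vat(100) or add_vat(100, 12), with positional arguments.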

DROP FUNCTION Statement

Use this statement to remove one or many FUNCTIONs from the current global execution context.

drop_func

Settings Statements

OPTIONS Statement

Sets internal Data Cooker parameters to specified values.

options_stmt

Unlike all other statements, Data Cooker pulls all OPTIONS to topmost execution context and executes them once, before evaluating all other script contents. They are completely ignored thereafter, in the normal script execution flow.

The set of parameters that may be set by OPTIONS depends on the Data Cooker distribution and version, so please see the dedicated doc in the distribution for exact definitions.