Traditional ETL (Extract Transform Load) tools broadly used in Data Warehouse environments tend to have two common deficiencies:
- emphasis on graphical user interface (and lack of a more efficient code interface) makes the design process slow and inflexible;
- dedicated ETL server generally means one extra hop for the data being transferred, which might be unacceptable considering today's data loads.
Enter XDTL (eXtensible Data Transformation Language). XDTL is an XML based descriptional language designed for specifying data transformations from one database/storage to another. XDTL syntax is defined in an XML Schema document (http://xdtl.org/xdtl). XML Schema of XDTL has semantic annotations linking it to XDTL ontology model. XDTL documents are interpreted by an XDTL Runtime Engine. XDTL/XDTL Runtime Engine is built not from the perspective of a slick IDE or a cool engine, but an efficient language for describing the data transformations. The goal is to produce a lightweight ETL development/runtime environment that would handle most of the common requirements with better efficiency than traditional jack-of-all-trades tools... see more.
(1) Runtime Engine interpreting XDTL scripts. XDTL is just a language used to build the scripts describing data transformations, so it needs an execution mechanism to (pre)process and run those scripts. An XDTL engine (interpreter) assembles the scripts, mappings and templates into a series of executable commands basically consisting of file and database operations and runs them. There can be more than one XDTL runtime, each one designed for its own purpose and implementing a specific subset of the language definition. An XDTL runtime could also beembedded into another system to provide the low-level plumbing for an application that has to accomplish some ELT functions internally.
(2) Mappings, stored either in MMX Repository or directly inside XDTL script. Mappings' concept is based on the ideas laid out in [Stöhr, T.; Müller, R.; Rahm, E.: An Integrative and Uniform Model for Metadata Management in Data Warehousing Environments, 1999]. Mappings express the structural dependencies and data dependencies between data 'sources' and 'targets' during different stages of a transformation process and "...describe all data movement-related dependencies as a set of Mapping instances. One instance represents a "single-stage" link between a set of source Entity instances and a set of target Entity instances where every entity plays only a single role, either source ot target."
(3) SQL templates turned into executable SQL statements. Being an ELT language, SQL statement represents its single most important functional part. The big question with SQL automation is: how far you want to go with substituting SQL code with something more abstract? In theory, you could decompose all your SQL statements into series of binary expressions, store them away, and assemble into SQL statements again as needed, driven by syntax of one particular SQL dialect. However, usually this approach fails to produce something useful... see more.
SQL is and probably remains the main workforce behind any ETL (and especially ELT flavour of ETL) tool. Automating SQL generation has arguably always been the biggest obstacle in building an ideal ETL tool - ie. completely metadata-driven, with small foot-print, multiple platform support on single code base... and, naturally, capable of generating complex SQL in an easy and flexible manner, with no rocket scientists required nearby. While SQL stands for Structured Query Language, ironically the language itself is not too well 'structured', and the abundance of vendor dialects and extensions does not help either.
Based on our experience (and tons of ETL code written) we have extracted a set of SQL 'patterns' common to ETL (ELT) tasks. The patterns are converted into templates for processing by a template engine (eg. Apache Velocity), each one realizing a separate SQL fragment, a full SQL statement or a complete sequence of commands implementing a complex process. The important thing about this is: templates are generic and can be used with multiple different mappings/data structures. The mappings are generic as well and can be used in multiple different patterns/templates... see more.
XDTL Runtime Version 1.0 (XDTL RT 1.0) is finished and running live! The runtime is written in Java (Sun JRE 1.6 required) and uses Velocity (http://velocity.apache.org/) for template processing. So here's a short primer.
There are two individually addressable units of execution in XDTL: a package is a container containing tasks, both of them can be invoked by name. A task consists of steps denoting individual commands that are executed sequentially and cannot be addressed individually. Besides tasks, a package contains three collections: parameters, variables and connections. There are file commands and database commands in XDTL... see more.