Data Lineage

Overview

Data lineage is metadata added to records and fields indicating where they were loaded from. It can be useful for audits and reconciliation as well as troubleshooting. See examples of data lineage in action.

Enabling Data Lineage

Data lineage is a feature specific to DataReaders. It is turned off by default because it adds a small processing and memory overhead to each job when enabled. To turn it on, call setSaveLineage(true) on your DataReader instance before using it in a job.

Not all DataReaders support data lineage. In some cases, we are planning to add support soon, while in other cases, like MemoryReader, it just isn't appropriate. You can test whether a particular DataReader supports data lineage by calling isLineageSupported(). An exception will be thrown if you attempt to enable data lineage on a DataReader that doesn't support it.
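
The check-before-enable pattern described above can be sketched without the library on the classpath. The LineageCapable interface and StubReader class below are hypothetical stand-ins that only mirror the two method names mentioned in the text (isLineageSupported() and setSaveLineage(boolean)); they are not DataPipeline types, and the exception type thrown by a real DataReader is an assumption here.

```java
// Hypothetical stand-in mirroring the two lineage methods named in the text;
// this is NOT a DataPipeline type.
interface LineageCapable {
    boolean isLineageSupported();
    void setSaveLineage(boolean saveLineage);
    boolean isSaveLineage();
}

// Minimal stub reader, for illustration only.
class StubReader implements LineageCapable {
    private final boolean supported;
    private boolean saveLineage;

    StubReader(boolean supported) {
        this.supported = supported;
    }

    @Override
    public boolean isLineageSupported() {
        return supported;
    }

    @Override
    public void setSaveLineage(boolean saveLineage) {
        // Assumed behaviour: enabling lineage on an unsupported reader fails.
        // The exception type is a guess, not taken from the library.
        if (saveLineage && !supported) {
            throw new UnsupportedOperationException("lineage not supported");
        }
        this.saveLineage = saveLineage;
    }

    @Override
    public boolean isSaveLineage() {
        return saveLineage;
    }
}

class LineageGuard {
    // Enables lineage only when the reader supports it.
    // Returns true if lineage was enabled, false if the reader cannot save it.
    static boolean enableIfSupported(LineageCapable reader) {
        if (!reader.isLineageSupported()) {
            return false;
        }
        reader.setSaveLineage(true);
        return true;
    }
}
```

Guarding the call this way lets a job that accepts arbitrary readers degrade gracefully instead of failing when it is handed a reader such as MemoryReader.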

Data Lineage Properties

There are two types of data lineage in DataPipeline: record lineage and field lineage. Record lineage tells you the starting location a record was loaded from, while field lineage tells you the starting location of each individual field.

Data lineage properties are stored as key-value pairs on Record and Field using the Session interface. They are prefixed with "dataLineage." to prevent collision with your own metadata. While you can access these properties directly via their keys, we've created RecordLineage and FieldLineage wrapper classes to ease setting and retrieving lineage properties.
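
To illustrate why the prefix and the wrapper classes help, here is a minimal, library-free sketch. PropertyBag and RecordLineageSketch are hypothetical names invented for this example; they stand in for the Session key-value store and the RecordLineage wrapper and do not reproduce the actual DataPipeline API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the key-value properties held by a record;
// NOT the DataPipeline Session interface.
class PropertyBag {
    private final Map<String, Object> properties = new HashMap<>();

    void put(String key, Object value) {
        properties.put(key, value);
    }

    Object get(String key) {
        return properties.get(key);
    }
}

// Hypothetical wrapper in the spirit of RecordLineage: it adds the
// "dataLineage." prefix and hides the casts behind typed accessors.
class RecordLineageSketch {
    static final String PREFIX = "dataLineage.";

    private final PropertyBag bag;

    RecordLineageSketch(PropertyBag bag) {
        this.bag = bag;
    }

    void setRecordNumber(long recordNumber) {
        bag.put(PREFIX + "recordNumber", recordNumber);
    }

    // Returns null when no record number was saved.
    Long getRecordNumber() {
        return (Long) bag.get(PREFIX + "recordNumber");
    }
}
```

Because the lineage value lives under "dataLineage.recordNumber", your own metadata stored under a plain "recordNumber" key is left untouched.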

Read Record Lineage Properties

Read Field Lineage Properties

Read Data Lineage Properties In A Job

Record Lineage Properties

# | Property Name | Java Type | Description
1 | file | File | The java.io.File, if one was used to create the DataReader.
2 | fileLineNumber | Long | The line number in the input file starting with 0.
3 | fileColumnNumber | Long | The column number in the input file starting with 0.
4 | recordNumber | Long | The sequential record number starting with 0.
5 | databaseUrl | String | The JDBC connection URL.
6 | databaseQuery | String | The SQL statement used to query the database.
7 | recordSchema | String | The schema for ORC and Parquet files.
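
Since fileLineNumber and fileColumnNumber start at 0 while most editors count from 1, an audit message usually needs a small conversion. The sketch below is one way to do that; LineageLocation and describe are hypothetical names, and the values are passed in as plain arguments rather than read through the library. It also assumes a property may be null when a reader does not set it.

```java
import java.io.File;

class LineageLocation {
    // Builds "path:line:column" using 1-based positions for human readers.
    // Parts that are null (not provided by the reader) are left out.
    static String describe(File file, Long fileLineNumber, Long fileColumnNumber) {
        StringBuilder text = new StringBuilder(
                file == null ? "<unknown source>" : file.getPath());
        if (fileLineNumber != null) {
            // Lineage values are 0-based, so add 1 for display.
            text.append(':').append(fileLineNumber + 1);
            if (fileColumnNumber != null) {
                text.append(':').append(fileColumnNumber + 1);
            }
        }
        return text.toString();
    }
}
```

For example, a record loaded from the first line of input.csv at column index 4 would be described as input.csv:1:5.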

Field Lineage Properties

Field lineage inherits all the properties of record lineage and adds the following ones of its own.

# | Property Name | Java Type | Description
1 | originalFieldIndex | Integer | The index of a field set by the DataReader before any transformation or operation was performed.
2 | originalFieldName | String | The name of a field set by the DataReader before any transformation or operation was performed.
3 | databaseColumnName | String | The column name returned from the database for this field.
4 | fieldSchema | String | The field's schema for ORC and Parquet files.
5 | databaseColumnSqlType | Integer | The column type returned from the database for this field.
6 | databaseColumnDisplaySize | Integer | The column display size returned from the database for this field.
7 | databaseColumnScale | Integer | The column scale returned from the database for this field.
8 | databaseColumnPrecision | Integer | The column precision returned from the database for this field.
9 | databaseColumnSchemaName | String | The schema name of a column returned from the database for this field.
10 | databaseColumnCatalogName | String | The catalog name of a column returned from the database for this field.
11 | databaseColumnTableName | String | The table name of a column returned from the database for this field.
12 | databaseColumnTypeName | String | The column type name returned from the database for this field.
13 | databaseColumnClassName | String | The class name of a column returned from the database for this field.
14 | databaseColumnSigned | Boolean | The column isSigned value returned from the database for this field.
15 | databaseColumnNullable | Boolean | The column isNullable value returned from the database for this field.
16 | databaseColumnJavaType | Class | The Java type mapped from java.sql.Types.
17 | databaseColumnFieldType | FieldType | The FieldType mapped from the field's Java type.
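
To give a feel for what "mapped from java.sql.Types" means for databaseColumnJavaType, the sketch below maps a few common java.sql.Types constants to Java classes, following the usual JDBC type-to-object conventions. SqlTypeMapping and javaTypeFor are hypothetical names, and the mapping DataPipeline actually applies may differ; treat this as an illustration rather than the library's table.

```java
import java.math.BigDecimal;
import java.sql.Types;

class SqlTypeMapping {
    // Maps a java.sql.Types constant to a Java class.
    // Only a handful of common types are covered; anything else falls back to Object.
    static Class<?> javaTypeFor(int sqlType) {
        switch (sqlType) {
            case Types.CHAR:
            case Types.VARCHAR:
            case Types.LONGVARCHAR:
                return String.class;
            case Types.NUMERIC:
            case Types.DECIMAL:
                return BigDecimal.class;
            case Types.BIT:
            case Types.BOOLEAN:
                return Boolean.class;
            case Types.INTEGER:
                return Integer.class;
            case Types.BIGINT:
                return Long.class;
            case Types.REAL:
                return Float.class;
            case Types.FLOAT:
            case Types.DOUBLE:
                return Double.class;
            case Types.DATE:
                return java.sql.Date.class;
            case Types.TIME:
                return java.sql.Time.class;
            case Types.TIMESTAMP:
                return java.sql.Timestamp.class;
            default:
                return Object.class;
        }
    }
}
```

The integer stored in databaseColumnSqlType is the kind of value you would pass as sqlType here.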

Supported Record Lineage Properties By Data Reader

The properties in this table are set at both the record and field levels.

# Reader/Record Lineage file fileLineNumber fileColumnNumber recordNumber databaseUrl databaseQuery recordSchema
1 CombinedLogReader      
2 CSVReader      
3 ExcelReader      
4 FixedWidthReader      
5 JavaBeanReader      
6 JdbcReader          
7 JsonReader      
8 JsonRecordReader      
9 OrcDataReader          
10 ParquetDataReader          
11 SimpleJsonReader      
12 SimpleXmlReader      
13 XmlReader      
14 XmlRecordReader      

Supported Field Lineage Properties By Data Reader

# Reader/Field Lineage originalFieldIndex originalFieldName databaseColumnName fieldSchema databaseColumnSqlType databaseColumnDisplaySize databaseColumnScale databaseColumnPrecision databaseColumnSchemaName databaseColumnCatalogName databaseColumnTableName databaseColumnTypeName databaseColumnClassName databaseColumnSigned databaseColumnNullable databaseColumnJavaType databaseColumnFieldType
1 CombinedLogReader                              
2 CSVReader                              
3 ExcelReader                              
4 FixedWidthReader                              
5 JavaBeanReader                              
6 JdbcReader  
7 JsonReader                              
8 JsonRecordReader                              
9 OrcDataReader                            
10 ParquetDataReader                            
11 SimpleJsonReader                              
12 SimpleXmlReader                              
13 XmlReader                              
14 XmlRecordReader                              
