Data Lineage

Overview

Data lineage is metadata added to records and fields indicating where they were loaded from. It can be useful for audits and reconciliation as well as troubleshooting. See examples of data lineage in action.

Enabling Data Lineage

Data lineage is a feature specific to DataReaders. It is turned off by default because it adds a small processing and memory overhead to each job when enabled. To turn it on, call setSaveLineage(true) on your DataReader instance before using it in a job.

Not all DataReaders support data lineage. In some cases, we are planning to add support soon, while in other cases, like MemoryReader, it just isn't appropriate. You can test whether a particular DataReader supports data lineage by calling isLineageSupported(). An exception will be thrown if you attempt to enable data lineage on a DataReader that doesn't support it.
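
The check-then-enable pattern described above can be sketched without the library on the classpath. This is only an illustration: LineageCapable and StubReader are hypothetical stand-ins that borrow the method names isLineageSupported() and setSaveLineage() from the text, and the exception type used here is an assumption, not necessarily what DataPipeline throws.

```java
public class LineageGuardSketch {

    /** Hypothetical stand-in for the two reader methods named in the text. */
    interface LineageCapable {
        boolean isLineageSupported();
        void setSaveLineage(boolean saveLineage);
    }

    /** Stub reader (not a DataPipeline class) that rejects lineage when unsupported. */
    static final class StubReader implements LineageCapable {
        private final String name;
        private final boolean supported;
        private boolean saveLineage;

        StubReader(String name, boolean supported) {
            this.name = name;
            this.supported = supported;
        }

        @Override
        public boolean isLineageSupported() {
            return supported;
        }

        @Override
        public void setSaveLineage(boolean saveLineage) {
            // Assumption: the exception type is chosen for this sketch only.
            if (saveLineage && !supported) {
                throw new UnsupportedOperationException(name + " does not support data lineage");
            }
            this.saveLineage = saveLineage;
        }

        boolean isSaveLineage() {
            return saveLineage;
        }
    }

    /** Enables lineage only when the reader supports it; returns true if it was enabled. */
    static boolean enableLineageIfSupported(LineageCapable reader) {
        if (!reader.isLineageSupported()) {
            return false; // skip quietly instead of triggering the exception
        }
        reader.setSaveLineage(true);
        return true;
    }

    public static void main(String[] args) {
        StubReader fileLike = new StubReader("file-like reader", true);
        StubReader memoryLike = new StubReader("memory-like reader", false);
        System.out.println(enableLineageIfSupported(fileLike));
        System.out.println(enableLineageIfSupported(memoryLike));
    }
}
```

Guarding with the support check first is useful when the same job setup code handles several reader types, some of which may not support lineage.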

Data Lineage Properties

There are two types of data lineage in DataPipeline: record lineage and field lineage. Record lineage tells you the starting location a record was loaded from, while field lineage tells you the starting location of each individual field.

Data lineage properties are stored as key-value pairs on Record and Field using the Session interface. They are prefixed with "dataLineage." to prevent collision with your own metadata. While you can access these properties directly via their keys, we've created RecordLineage and FieldLineage wrapper classes to ease setting and retrieving lineage properties.
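
To show why the "dataLineage." prefix matters, here is a minimal stand-alone sketch of prefixed key-value storage with a small typed wrapper. It uses a plain java.util.Map rather than the Session interface, and RecordLineageView is a hypothetical class invented for this illustration; it is not the library's RecordLineage.

```java
import java.util.HashMap;
import java.util.Map;

public class LineagePropertiesSketch {

    /** Prefix named in the text; keeps lineage keys apart from user metadata. */
    static final String PREFIX = "dataLineage.";

    /** Hypothetical typed wrapper over a plain property map. */
    static final class RecordLineageView {
        private final Map<String, Object> properties;

        RecordLineageView(Map<String, Object> properties) {
            this.properties = properties;
        }

        RecordLineageView setRecordNumber(long recordNumber) {
            properties.put(PREFIX + "recordNumber", recordNumber);
            return this;
        }

        /** Returns null when the property was never set. */
        Long getRecordNumber() {
            return (Long) properties.get(PREFIX + "recordNumber");
        }
    }

    public static void main(String[] args) {
        Map<String, Object> properties = new HashMap<>();

        // User metadata that happens to share the unprefixed name.
        properties.put("recordNumber", "my own metadata");

        // The lineage value is stored under the prefixed key, so nothing is overwritten.
        RecordLineageView lineage = new RecordLineageView(properties).setRecordNumber(41L);

        System.out.println(properties.get("recordNumber"));
        System.out.println(lineage.getRecordNumber());
    }
}
```

The wrapper hides the key strings and the cast, which is the same convenience the text attributes to the RecordLineage and FieldLineage classes.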

Read Record Lineage Properties

Read Field Lineage Properties

Read Data Lineage Properties In A Job

Record Lineage Properties

#  Property Name     Java Type  Description
1  file              File       The java.io.File, if one was used to create the DataReader.
2  fileLineNumber    Long       The line number in the input file starting with 0.
3  fileColumnNumber  Long       The column number in the input file starting with 0.
4  recordNumber      Long       The sequential record number starting with 0.
5  databaseUrl       String     The JDBC connection URL.
6  databaseQuery     String     The SQL statement used to query the database.
7  recordSchema      String     The schema for ORC and Parquet files.
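
Because the line and column numbers above start at 0, they usually need a +1 before being shown next to what a text editor displays. The helper below is a small sketch of that conversion. The describe method and its parameter names are hypothetical, and it assumes a property that a reader did not set is passed in as null.

```java
public class LineageLocationSketch {

    /**
     * Builds a "file:line:column" label using 1-based numbers for display.
     * Parts that are null (assumed to mean "not set by the reader") are left out.
     */
    static String describe(String fileName, Long fileLineNumber, Long fileColumnNumber) {
        StringBuilder text = new StringBuilder(fileName == null ? "<unknown source>" : fileName);
        if (fileLineNumber != null) {
            text.append(':').append(fileLineNumber + 1);
            // A column is only meaningful when the line is known.
            if (fileColumnNumber != null) {
                text.append(':').append(fileColumnNumber + 1);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("input.csv", 0L, 0L));   // input.csv:1:1
        System.out.println(describe("input.csv", 4L, null)); // input.csv:5
    }
}
```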

Field Lineage Properties

Field lineage inherits all the properties of record lineage and adds the following ones of its own.

#  Property Name       Java Type  Description
1  originalFieldIndex  Integer    The index of a field set by the DataReader before any transformation or operation was performed.
2  originalFieldName   String     The name of a field set by the DataReader before any transformation or operation was performed.
3  databaseColumnName  String     The column name returned from the database for this field.
4  fieldSchema         String     The field's schema for ORC and Parquet files.

Supported Record Lineage Properties By Data Reader

The properties in this table are set at both the record and field levels.

# Reader/Record Lineage file fileLineNumber fileColumnNumber recordNumber databaseUrl databaseQuery recordSchema
1 CombinedLogReader      
2 CSVReader      
3 ExcelReader      
4 FixedWidthReader      
5 JavaBeanReader      
6 JdbcReader          
7 JsonReader      
8 JsonRecordReader      
9 OrcDataReader          
10 ParquetDataReader          
11 SimpleJsonReader      
12 SimpleXmlReader      
13 XmlReader      
14 XmlRecordReader      

Supported Field Lineage Properties By Data Reader

# Reader/Field Lineage originalFieldIndex originalFieldName databaseColumnName fieldSchema
1 CombinedLogReader    
2 CSVReader    
3 ExcelReader    
4 FixedWidthReader    
5 JavaBeanReader    
6 JdbcReader  
7 JsonReader    
8 JsonRecordReader    
9 OrcDataReader  
10 ParquetDataReader  
11 SimpleJsonReader    
12 SimpleXmlReader    
13 XmlReader    
14 XmlRecordReader    
