Data Lineage

Overview

Data lineage is metadata added to records and fields indicating where they were loaded from. It can be useful for audits and reconciliation as well as troubleshooting. See examples of data lineage in action.

Enabling Data Lineage

Data lineage is a feature specific to DataReaders. It is turned off by default because it adds a small processing and memory overhead to each job when enabled. To turn it on, call setSaveLineage(true) on your DataReader instance before using it in a job.

Not all DataReaders support data lineage. In some cases, we are planning to add support soon, while in other cases, like MemoryReader, it just isn't appropriate. You can test whether a particular DataReader supports data lineage by calling isLineageSupported(). An exception will be thrown if you attempt to enable data lineage on a DataReader that doesn't support it.
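
The check-then-enable pattern described above can be sketched as follows. This is a minimal illustration, not DataPipeline code: the LineageAware interface and StubReader class are hypothetical stand-ins so the example runs without the library, and the exception type thrown by the stub is an assumption (the text only says that an exception is thrown). Only the method names setSaveLineage and isLineageSupported come from this page.

```java
public class LineageGuardSketch {

    // Hypothetical stand-in for the two methods named in the text.
    // This is NOT the DataPipeline DataReader class.
    interface LineageAware {
        boolean isLineageSupported();
        void setSaveLineage(boolean saveLineage);
        boolean isSaveLineage();
    }

    // Stub reader whose lineage support is fixed at construction time.
    static class StubReader implements LineageAware {
        private final boolean supported;
        private boolean saveLineage;

        StubReader(boolean supported) {
            this.supported = supported;
        }

        @Override
        public boolean isLineageSupported() {
            return supported;
        }

        @Override
        public void setSaveLineage(boolean saveLineage) {
            // Assumption: the real exception type is not stated in the text.
            if (saveLineage && !supported) {
                throw new UnsupportedOperationException("data lineage is not supported");
            }
            this.saveLineage = saveLineage;
        }

        @Override
        public boolean isSaveLineage() {
            return saveLineage;
        }
    }

    // Enables lineage only when the reader supports it.
    // Returns true if lineage was enabled, false if it was skipped.
    static boolean enableLineageIfSupported(LineageAware reader) {
        if (!reader.isLineageSupported()) {
            return false;
        }
        reader.setSaveLineage(true);
        return true;
    }

    public static void main(String[] args) {
        System.out.println(enableLineageIfSupported(new StubReader(true)));
        System.out.println(enableLineageIfSupported(new StubReader(false)));
    }
}
```

Guarding the call this way lets a job that accepts arbitrary readers run without lineage instead of failing when a reader such as MemoryReader does not support it.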

Data Lineage Properties

There are two types of data lineage in DataPipeline: record lineage and field lineage. Record lineage tells you the starting location the record was loaded from, while field lineage tells you the starting location of each individual field.

Data lineage properties are stored as key-value pairs on Record and Field using the Session interface. They are prefixed with "dataLineage." to prevent collision with your own metadata. While you can access these properties directly via their keys, we've created RecordLineage and FieldLineage wrapper classes to ease setting and retrieving lineage properties.
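
To illustrate why the prefix matters, here is a small sketch that uses a plain java.util.Map as a stand-in for the key-value store on a record. It assumes the full key is simply the "dataLineage." prefix followed by a property name from the tables on this page, which the text implies but does not spell out. The LineageKeySketch class and its key helper are hypothetical and are not part of DataPipeline.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LineageKeySketch {

    // Prefix quoted in the text above.
    static final String PREFIX = "dataLineage.";

    // Builds the assumed full key for a lineage property name such as "recordNumber".
    static String key(String propertyName) {
        return PREFIX + propertyName;
    }

    public static void main(String[] args) {
        // A plain map stands in for the key-value pairs stored on a record.
        Map<String, Object> properties = new LinkedHashMap<>();

        // Your own metadata, stored under an unprefixed key.
        properties.put("recordNumber", "batch-7");

        // Lineage metadata under the prefixed key; the two entries do not collide.
        properties.put(key("recordNumber"), 0L);

        System.out.println(properties.get("recordNumber"));
        System.out.println(properties.get(key("recordNumber")));
    }
}
```

In practice the RecordLineage and FieldLineage wrappers mentioned above spare you from building these keys by hand.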

Read Record Lineage Properties

Read Field Lineage Properties

Read Data Lineage Properties In A Job

Record Lineage Properties

#	Property Name	Java Type	Description
1	file	File	The java.io.File, if one was used to create the DataReader.
2	fileLineNumber	Long	The line number in the input file starting with 0.
3	fileColumnNumber	Long	The column number in the input file starting with 0.
4	recordNumber	Long	The sequential record number starting with 0.
5	databaseUrl	String	The JDBC connection URL.
6	databaseQuery	String	The SQL statement used to query the database.
7	recordSchema	String	The schema for ORC and Parquet files.
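
Because fileLineNumber and fileColumnNumber start at 0 while most text editors count from 1, an audit report may want to shift them before display. The sketch below shows one way to do that. The LineageLocationSketch class, the describe helper, and the file name input.csv are hypothetical, and the values are passed in directly rather than read from a record.

```java
public class LineageLocationSketch {

    // Converts zero-based lineage numbers to a 1-based "file:line:column" label.
    static String describe(String fileName, long fileLineNumber, long fileColumnNumber) {
        return fileName + ":" + (fileLineNumber + 1) + ":" + (fileColumnNumber + 1);
    }

    public static void main(String[] args) {
        // Zero-based line 2, column 0 is the third line, first column in an editor.
        System.out.println(describe("input.csv", 2L, 0L)); // prints input.csv:3:1
    }
}
```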

Field Lineage Properties

Field lineage inherits all the properties of record lineage and adds the following ones of its own.

#	Property Name	Java Type	Description
1	originalFieldIndex	Integer	The index of a field set by the DataReader before any transformation or operation was performed.
2	originalFieldName	String	The name of a field set by the DataReader before any transformation or operation was performed.
3	databaseColumnName	String	The column name returned from the database for this field.
4	fieldSchema	String	The field's schema for ORC and Parquet files.
5	databaseColumnSqlType	Integer	The column type returned from the database for this field.
6	databaseColumnDisplaySize	Integer	The column display size returned from the database for this field.
7	databaseColumnScale	Integer	The column scale returned from the database for this field.
8	databaseColumnPrecision	Integer	The column precision returned from the database for this field.
9	databaseColumnSchemaName	String	The schema name of a column returned from the database for this field.
10	databaseColumnCatalogName	String	The catalog name of a column returned from the database for this field.
11	databaseColumnTableName	String	The table name of a column returned from the database for this field.
12	databaseColumnTypeName	String	The column type name returned from the database for this field.
13	databaseColumnClassName	String	The class name of a column returned from the database for this field.
14	databaseColumnSigned	Boolean	The column isSigned value returned from the database for this field.
15	databaseColumnNullable	Boolean	The column isNullable value returned from the database for this field.
16	databaseColumnJavaType	Class	The Java type mapped from java.sql.Types.
17	databaseColumnFieldType	FieldType	The FieldType mapped from the field's Java type.

Supported Record Lineage Properties By Data Reader

The properties in this table are set at both the record and field levels.

#	Reader/Record Lineage	file	fileLineNumber	fileColumnNumber	recordNumber	databaseUrl	databaseQuery	recordSchema
1	CombinedLogReader
2	CSVReader
3	ExcelReader
4	FixedWidthReader
5	JavaBeanReader
6	JdbcReader
7	JsonReader
8	JsonRecordReader
9	OrcDataReader
10	ParquetDataReader
11	SimpleJsonReader
12	SimpleXmlReader
13	XmlReader
14	XmlRecordReader

Supported Field Lineage Properties By Data Reader

#	Reader/Field Lineage	originalFieldIndex	originalFieldName	databaseColumnName	fieldSchema	databaseColumnSqlType	databaseColumnDisplaySize	databaseColumnScale	databaseColumnPrecision	databaseColumnSchemaName	databaseColumnCatalogName	databaseColumnTableName	databaseColumnTypeName	databaseColumnClassName	databaseColumnSigned	databaseColumnNullable	databaseColumnJavaType	databaseColumnFieldType
1	CombinedLogReader
2	CSVReader
3	ExcelReader
4	FixedWidthReader
5	JavaBeanReader
6	JdbcReader
7	JsonReader
8	JsonRecordReader
9	OrcDataReader
10	ParquetDataReader
11	SimpleJsonReader
12	SimpleXmlReader
13	XmlReader
14	XmlRecordReader