Data Lineage
Overview
Data lineage is metadata added to records and fields indicating where they were loaded from. It can be useful for audits and reconciliation as well as troubleshooting. See examples of data lineage in action.
Enabling Data Lineage
Data lineage is a feature specific to DataReaders. It is turned off by default because it adds a small processing and memory overhead to each job when enabled. To turn it on, call setSaveLineage(true) on your instance of DataReader before using it in a job.
Not all DataReaders support data lineage. In some cases, we are planning to add support soon, while in other cases, like MemoryReader, it just isn't appropriate. You can test whether a particular DataReader supports data lineage by calling isLineageSupported(). An exception will be thrown if you attempt to enable data lineage on a DataReader that doesn't support it.
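The check-then-enable pattern described above can be sketched as follows. This is a minimal, self-contained illustration rather than DataPipeline code: LineageCapable and FakeReader are hypothetical stand-ins that model only the two methods named in the text (isLineageSupported() and setSaveLineage(true)), and the exception type thrown by FakeReader is an assumption, since the text does not say which exception the library throws.

```java
class LineageGuardSketch {

    // Hypothetical stand-in for a DataReader; only the two documented methods are modeled.
    interface LineageCapable {
        boolean isLineageSupported();
        void setSaveLineage(boolean saveLineage);
    }

    // Enables lineage only when the reader reports support; returns true if it was enabled.
    static boolean enableLineageIfSupported(LineageCapable reader) {
        if (!reader.isLineageSupported()) {
            return false; // skip the call that would otherwise throw
        }
        reader.setSaveLineage(true);
        return true;
    }

    // Toy reader used only to exercise the guard (not a DataPipeline class).
    static final class FakeReader implements LineageCapable {
        private final boolean supported;
        boolean saveLineage;

        FakeReader(boolean supported) {
            this.supported = supported;
        }

        @Override
        public boolean isLineageSupported() {
            return supported;
        }

        @Override
        public void setSaveLineage(boolean saveLineage) {
            // Assumed exception type; mirrors the documented "an exception will be thrown".
            if (saveLineage && !supported) {
                throw new UnsupportedOperationException("lineage not supported");
            }
            this.saveLineage = saveLineage;
        }
    }

    public static void main(String[] args) {
        FakeReader fileLike = new FakeReader(true);
        FakeReader memoryLike = new FakeReader(false);
        System.out.println("file-like reader enabled: " + enableLineageIfSupported(fileLike));
        System.out.println("memory-like reader enabled: " + enableLineageIfSupported(memoryLike));
    }
}
```

Guarding the call this way lets one job setup routine handle a mix of readers without wrapping each setSaveLineage(true) call in a try/catch.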
Data Lineage Properties
There are two types of data lineage in DataPipeline: record lineage and field lineage. Record lineage tells you the starting location a record was loaded from, while field lineage tells you the starting location of each individual field.
Data lineage properties are stored as key-value pairs on Record and Field using the Session interface. They are prefixed with "dataLineage." to prevent collision with your own metadata. While you can access these properties directly via their keys, we've created RecordLineage and FieldLineage wrapper classes to ease setting and retrieving lineage properties.
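To illustrate why a key prefix and a wrapper class are useful, here is a minimal sketch that uses a plain Map in place of a record's session properties. RecordLineageView is a hypothetical stand-in, not the library's RecordLineage class, and the full key string (the documented "dataLineage." prefix joined to the property name recordNumber) is an assumption about how keys are formed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LineageKeysSketch {

    // Prefix quoted in the text; the complete key strings built from it are assumed.
    static final String PREFIX = "dataLineage.";

    // Hypothetical typed view over a property map, standing in for a lineage wrapper.
    static final class RecordLineageView {
        private final Map<String, Object> properties;

        RecordLineageView(Map<String, Object> properties) {
            this.properties = properties;
        }

        RecordLineageView setRecordNumber(long value) {
            properties.put(PREFIX + "recordNumber", value); // autoboxed to Long
            return this;
        }

        Long getRecordNumber() {
            return (Long) properties.get(PREFIX + "recordNumber");
        }
    }

    public static void main(String[] args) {
        Map<String, Object> properties = new LinkedHashMap<>();

        // User metadata that happens to share the short name "recordNumber".
        properties.put("recordNumber", "INV-0042");

        // The prefixed lineage key is stored alongside it without overwriting it.
        RecordLineageView lineage = new RecordLineageView(properties).setRecordNumber(7L);

        System.out.println("user value:    " + properties.get("recordNumber"));
        System.out.println("lineage value: " + lineage.getRecordNumber());
    }
}
```

The wrapper keeps key strings and casts in one place, so calling code works with typed getters and setters instead of raw keys.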
Read Record Lineage Properties
Read Field Lineage Properties
Read Data Lineage Properties In A Job
Record Lineage Properties
# | Property Name | Java Type | Description |
---|---|---|---|
1 | file | File | The java.io.File, if one was used to create the DataReader. |
2 | fileLineNumber | Long | The line number in the input file starting with 0. |
3 | fileColumnNumber | Long | The column number in the input file starting with 0. |
4 | recordNumber | Long | The sequential record number starting with 0. |
5 | databaseUrl | String | The JDBC connection URL. |
6 | databaseQuery | String | The SQL statement used to query the database. |
7 | recordSchema | String | The schema for ORC and Parquet files. |
Field Lineage Properties
Field lineage inherits all the properties of record lineage and adds the following ones of its own.
# | Property Name | Java Type | Description |
---|---|---|---|
1 | originalFieldIndex | Integer | The index of a field set by the DataReader before any transformation or operation was performed. |
2 | originalFieldName | String | The name of a field set by the DataReader before any transformation or operation was performed. |
3 | databaseColumnName | String | The column name returned from the database for this field. |
4 | fieldSchema | String | The field's schema for ORC and Parquet files. |
5 | databaseColumnSqlType | Integer | The column type returned from the database for this field. |
6 | databaseColumnDisplaySize | Integer | The column display size returned from the database for this field. |
7 | databaseColumnScale | Integer | The column scale returned from the database for this field. |
8 | databaseColumnPrecision | Integer | The column precision returned from the database for this field. |
9 | databaseColumnSchemaName | String | The schema name of a column returned from the database for this field. |
10 | databaseColumnCatalogName | String | The catalog name of a column returned from the database for this field. |
11 | databaseColumnTableName | String | The table name of a column returned from the database for this field. |
12 | databaseColumnTypeName | String | The column type name returned from the database for this field. |
13 | databaseColumnClassName | String | The class name of a column returned from the database for this field. |
14 | databaseColumnSigned | Boolean | The column isSigned value returned from the database for this field. |
15 | databaseColumnNullable | Boolean | The column isNullable value returned from the database for this field. |
16 | databaseColumnJavaType | Class | The Java type mapped from java.sql.Types. |
17 | databaseColumnFieldType | FieldType | The FieldType mapped from the field's Java type. |
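As a rough illustration of what "mapped from java.sql.Types" can mean for databaseColumnJavaType, the sketch below maps a handful of java.sql.Types constants to Java classes, loosely following the conventional JDBC type mappings. The javaTypeFor helper is hypothetical, and the table it encodes is an assumption: DataPipeline's actual mapping may cover more types or choose different classes.

```java
import java.math.BigDecimal;
import java.sql.Types;

class SqlTypeMappingSketch {

    // Hypothetical helper: maps a java.sql.Types constant to a Java class.
    // The choices below are illustrative, not the library's mapping.
    static Class<?> javaTypeFor(int sqlType) {
        switch (sqlType) {
            case Types.CHAR:
            case Types.VARCHAR:
            case Types.LONGVARCHAR:
                return String.class;
            case Types.INTEGER:
                return Integer.class;
            case Types.BIGINT:
                return Long.class;
            case Types.DOUBLE:
                return Double.class;
            case Types.DECIMAL:
            case Types.NUMERIC:
                return BigDecimal.class;
            case Types.BIT:
            case Types.BOOLEAN:
                return Boolean.class;
            case Types.DATE:
                return java.sql.Date.class;
            case Types.TIMESTAMP:
                return java.sql.Timestamp.class;
            default:
                return Object.class; // fallback for types this sketch does not cover
        }
    }

    public static void main(String[] args) {
        System.out.println("VARCHAR -> " + javaTypeFor(Types.VARCHAR).getName());
        System.out.println("NUMERIC -> " + javaTypeFor(Types.NUMERIC).getName());
        System.out.println("ARRAY   -> " + javaTypeFor(Types.ARRAY).getName());
    }
}
```

The integer passed in corresponds to the kind of value described for databaseColumnSqlType, which is why that property is typed as Integer in the table above.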
Supported Record Lineage Properties By Data Reader
The properties in this table are set at both the record and field levels.
# | Reader/Record Lineage | file | fileLineNumber | fileColumnNumber | recordNumber | databaseUrl | databaseQuery | recordSchema |
---|---|---|---|---|---|---|---|---|
1 | CombinedLogReader | |||||||
2 | CSVReader | |||||||
3 | ExcelReader | |||||||
4 | FixedWidthReader | |||||||
5 | JavaBeanReader | |||||||
6 | JdbcReader | |||||||
7 | JsonReader | |||||||
8 | JsonRecordReader | |||||||
9 | OrcDataReader | |||||||
10 | ParquetDataReader | |||||||
11 | SimpleJsonReader | |||||||
12 | SimpleXmlReader | |||||||
13 | XmlReader | |||||||
14 | XmlRecordReader |
Supported Field Lineage Properties By Data Reader
# | Reader/Field Lineage | originalFieldIndex | originalFieldName | databaseColumnName | fieldSchema | databaseColumnSqlType | databaseColumnDisplaySize | databaseColumnScale | databaseColumnPrecision | databaseColumnSchemaName | databaseColumnCatalogName | databaseColumnTableName | databaseColumnTypeName | databaseColumnClassName | databaseColumnSigned | databaseColumnNullable | databaseColumnJavaType | databaseColumnFieldType |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | CombinedLogReader | |||||||||||||||||
2 | CSVReader | |||||||||||||||||
3 | ExcelReader | |||||||||||||||||
4 | FixedWidthReader | |||||||||||||||||
5 | JavaBeanReader | |||||||||||||||||
6 | JdbcReader | |||||||||||||||||
7 | JsonReader | |||||||||||||||||
8 | JsonRecordReader | |||||||||||||||||
9 | OrcDataReader | |||||||||||||||||
10 | ParquetDataReader | |||||||||||||||||
11 | SimpleJsonReader | |||||||||||||||||
12 | SimpleXmlReader | |||||||||||||||||
13 | XmlReader | |||||||||||||||||
14 | XmlRecordReader |