Use Data Lineage with ExcelReader
Updated: May 30, 2022
Data lineage is a metadata added to records and fields indicating where they were loaded from. It is useful for audits and reconciliation as well as troubleshooting.
Data lineage can also be used with other readers, for example Data Lineage with CSVReader and Data Lineage with FixedWidthReader.
Java Code listing
package com.northconcepts.datapipeline.examples.cookbook;
import java.io.File;
import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.core.Field;
import com.northconcepts.datapipeline.core.Record;
import com.northconcepts.datapipeline.excel.ExcelDocument;
import com.northconcepts.datapipeline.excel.ExcelReader;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.lineage.FieldLineage;
import com.northconcepts.datapipeline.lineage.RecordLineage;
public class UseDataLineageWithExcelReader {
public static void main(String[] args) {
ExcelDocument document = new ExcelDocument().open(new File("example/data/input/credit-balance-01.xls"));
DataReader reader = new ExcelReader(document)
.setSheetName("credit-balance")
.setFieldNamesInFirstRow(true)
.setSaveLineage(true);
Job.run(reader, new LineageWriter());
}
public final static class LineageWriter extends DataWriter {
@Override
protected void writeImpl(Record record) throws Throwable {
System.out.println(record);
RecordLineage recordLineage = new RecordLineage().setRecord(record);
System.out.println("Record Lineage");
System.out.println(" File: " + recordLineage.getFile());
System.out.println(" File Line: " + recordLineage.getFileLineNumber());
System.out.println(" File Column: " + recordLineage.getFileColumnNumber());
System.out.println(" Record: " + recordLineage.getRecordNumber());
System.out.println();
FieldLineage fieldLineage = new FieldLineage();
System.out.println("Field Lineage");
for (int i = 0; i < record.getFieldCount(); i++) {
Field field = record.getField(i);
fieldLineage.setField(field);
System.out.println(" " + field.getName());
System.out.println(" File: " + fieldLineage.getFile());
System.out.println(" File Line: " + fieldLineage.getFileLineNumber());
System.out.println(" File Column: " + fieldLineage.getFileColumnNumber());
System.out.println(" Record: " + fieldLineage.getRecordNumber());
System.out.println(" Field Index: " + fieldLineage.getOriginalFieldIndex());
System.out.println(" Field Name: " + fieldLineage.getOriginalFieldName());
}
System.out.println("---------------------------------------------------------");
System.out.println();
}
}
}
Code walkthrough
- ExcelDocument is an in-memory abstraction for an Excel workbook. We use it to open
credit-balance-01.xls - ExcelReaderis used to obtain the record from the Microsoft Excel document.
setSaveLineage(true)enable lineage support since it is turned off by defaultJob.run()method which transfers data from thereaderto theLineageWriter()is then called.
RecordLineage
RecordLineageinforms us of the starting location where the record was loaded.recordLineage.getFile()- The java.io.File, if one was used to create the DataReader.recordLineage.getFileLineNumber()-The line number in the input file starting with 0.recordLineage.getFileColumnNumber()-The column number in the input file starting with 0.recordLineage.getRecordNumber()-The sequential record number starting with 0.
FieldLineage
FieldLineageinforms us of the starting location for each individual fieldfieldLineage.getOriginalFieldIndex()-The index of a field set by the DataReader before any transformation or operation was performed.fieldLineage.getOriginalFieldName()-The name of a field set by the DataReader before any transformation or operation was performed.
Output
Record {
0:[Account]:INT=[101]:Integer
1:[LastName]:STRING=[Reeves]:String
2:[FirstName]:STRING=[Keanu]:String
3:[Balance]:DOUBLE=[9315.45]:Double
4:[CreditLimit]:INT=[10000]:Integer
5:[AccountCreated]:STRING=[1/17/1998]:String
}
Record Lineage
File: example\data\input\credit-balance-01.xls
File Line: 1
File Column: 0
Record: 0
Field Lineage
Account
File: example\data\input\credit-balance-01.xls
File Line: 1
File Column: 0
Record: 0
Field Index: 0
Field Name: Account
LastName
File: example\data\input\credit-balance-01.xls
File Line: 1
File Column: 1
Record: 0
Field Index: 1
Field Name: LastName
FirstName
File: example\data\input\credit-balance-01.xls
File Line: 1
File Column: 2
Record: 0
Field Index: 2
Field Name: FirstName
Balance
File: example\data\input\credit-balance-01.xls
File Line: 1
File Column: 3
Record: 0
Field Index: 3
Field Name: Balance
CreditLimit
File: example\data\input\credit-balance-01.xls
File Line: 1
File Column: 4
Record: 0
Field Index: 4
Field Name: CreditLimit
AccountCreated
File: example\data\input\credit-balance-01.xls
File Line: 1
File Column: 5
Record: 0
Field Index: 5
Field Name: AccountCreated
---------------------------------------------------------
Record {
0:[Account]:INT=[312]:Integer
1:[LastName]:STRING=[Butler]:String
2:[FirstName]:STRING=[Gerard]:String
3:[Balance]:INT=[90]:Integer
4:[CreditLimit]:INT=[1000]:Integer
5:[AccountCreated]:DATE=[2003-06-08]:Date
}
Record Lineage
File: example\data\input\credit-balance-01.xls
File Line: 2
File Column: 0
Record: 1
Field Lineage
Account
File: example\data\input\credit-balance-01.xls
File Line: 2
File Column: 0
Record: 1
Field Index: 0
Field Name: Account
LastName
File: example\data\input\credit-balance-01.xls
File Line: 2
File Column: 1
Record: 1
Field Index: 1
Field Name: LastName
FirstName
File: example\data\input\credit-balance-01.xls
File Line: 2
File Column: 2
Record: 1
Field Index: 2
Field Name: FirstName
Balance
File: example\data\input\credit-balance-01.xls
File Line: 2
File Column: 3
Record: 1
Field Index: 3
Field Name: Balance
CreditLimit
File: example\data\input\credit-balance-01.xls
File Line: 2
File Column: 4
Record: 1
Field Index: 4
Field Name: CreditLimit
AccountCreated
File: example\data\input\credit-balance-01.xls
File Line: 2
File Column: 5
Record: 1
Field Index: 5
Field Name: AccountCreated
---------------------------------------------------------
Record {
0:[Account]:INT=[868]:Integer
1:[LastName]:STRING=[Hewitt]:String
2:[FirstName]:STRING=[Jennifer Love]:String
3:[Balance]:INT=[0]:Integer
4:[CreditLimit]:INT=[17000]:Integer
5:[AccountCreated]:STRING=[5/25/1985]:String
}
Record Lineage
File: example\data\input\credit-balance-01.xls
File Line: 3
File Column: 0
Record: 2
Field Lineage
Account
File: example\data\input\credit-balance-01.xls
File Line: 3
File Column: 0
Record: 2
Field Index: 0
Field Name: Account
LastName
File: example\data\input\credit-balance-01.xls
File Line: 3
File Column: 1
Record: 2
Field Index: 1
Field Name: LastName
FirstName
File: example\data\input\credit-balance-01.xls
File Line: 3
File Column: 2
Record: 2
Field Index: 2
Field Name: FirstName
Balance
File: example\data\input\credit-balance-01.xls
File Line: 3
File Column: 3
Record: 2
Field Index: 3
Field Name: Balance
CreditLimit
File: example\data\input\credit-balance-01.xls
File Line: 3
File Column: 4
Record: 2
Field Index: 4
Field Name: CreditLimit
AccountCreated
File: example\data\input\credit-balance-01.xls
File Line: 3
File Column: 5
Record: 2
Field Index: 5
Field Name: AccountCreated
---------------------------------------------------------
Record {
0:[Account]:INT=[761]:Integer
1:[LastName]:STRING=[Pinkett-Smith]:String
2:[FirstName]:STRING=[Jada]:String
3:[Balance]:DOUBLE=[49654.87]:Double
4:[CreditLimit]:INT=[100000]:Integer
5:[AccountCreated]:DATE=[2006-05-12]:Date
}
Record Lineage
File: example\data\input\credit-balance-01.xls
File Line: 4
File Column: 0
Record: 3
Field Lineage
Account
File: example\data\input\credit-balance-01.xls
File Line: 4
File Column: 0
Record: 3
Field Index: 0
Field Name: Account
LastName
File: example\data\input\credit-balance-01.xls
File Line: 4
File Column: 1
Record: 3
Field Index: 1
Field Name: LastName
FirstName
File: example\data\input\credit-balance-01.xls
File Line: 4
File Column: 2
Record: 3
Field Index: 2
Field Name: FirstName
Balance
File: example\data\input\credit-balance-01.xls
File Line: 4
File Column: 3
Record: 3
Field Index: 3
Field Name: Balance
CreditLimit
File: example\data\input\credit-balance-01.xls
File Line: 4
File Column: 4
Record: 3
Field Index: 4
Field Name: CreditLimit
AccountCreated
File: example\data\input\credit-balance-01.xls
File Line: 4
File Column: 5
Record: 3
Field Index: 5
Field Name: AccountCreated
---------------------------------------------------------
Record {
0:[Account]:INT=[317]:Integer
1:[LastName]:STRING=[Murray]:String
2:[FirstName]:STRING=[Bill]:String
3:[Balance]:DOUBLE=[789.65]:Double
4:[CreditLimit]:INT=[5000]:Integer
5:[AccountCreated]:DATE=[2007-05-02]:Date
}
Record Lineage
File: example\data\input\credit-balance-01.xls
File Line: 5
File Column: 0
Record: 4
Field Lineage
Account
File: example\data\input\credit-balance-01.xls
File Line: 5
File Column: 0
Record: 4
Field Index: 0
Field Name: Account
LastName
File: example\data\input\credit-balance-01.xls
File Line: 5
File Column: 1
Record: 4
Field Index: 1
Field Name: LastName
FirstName
File: example\data\input\credit-balance-01.xls
File Line: 5
File Column: 2
Record: 4
Field Index: 2
Field Name: FirstName
Balance
File: example\data\input\credit-balance-01.xls
File Line: 5
File Column: 3
Record: 4
Field Index: 3
Field Name: Balance
CreditLimit
File: example\data\input\credit-balance-01.xls
File Line: 5
File Column: 4
Record: 4
Field Index: 4
Field Name: CreditLimit
AccountCreated
File: example\data\input\credit-balance-01.xls
File Line: 5
File Column: 5
Record: 4
Field Index: 5
Field Name: AccountCreated
---------------------------------------------------------
