Use Data Lineage with ExcelReader
Updated: May 30, 2022
Data lineage is a metadata added to records and fields indicating where they were loaded from. It is useful for audits and reconciliation as well as troubleshooting.
Data lineage can also be used with other readers, for example Data Lineage with CSVReader and Data Lineage with FixedWidthReader.
Java Code listing
package com.northconcepts.datapipeline.examples.cookbook; import java.io.File; import com.northconcepts.datapipeline.core.DataReader; import com.northconcepts.datapipeline.core.DataWriter; import com.northconcepts.datapipeline.core.Field; import com.northconcepts.datapipeline.core.Record; import com.northconcepts.datapipeline.excel.ExcelDocument; import com.northconcepts.datapipeline.excel.ExcelReader; import com.northconcepts.datapipeline.job.Job; import com.northconcepts.datapipeline.lineage.FieldLineage; import com.northconcepts.datapipeline.lineage.RecordLineage; public class UseDataLineageWithExcelReader { public static void main(String[] args) { ExcelDocument document = new ExcelDocument().open(new File("example/data/input/credit-balance-01.xls")); DataReader reader = new ExcelReader(document) .setSheetName("credit-balance") .setFieldNamesInFirstRow(true) .setSaveLineage(true); Job.run(reader, new LineageWriter()); } public final static class LineageWriter extends DataWriter { @Override protected void writeImpl(Record record) throws Throwable { System.out.println(record); RecordLineage recordLineage = new RecordLineage().setRecord(record); System.out.println("Record Lineage"); System.out.println(" File: " + recordLineage.getFile()); System.out.println(" File Line: " + recordLineage.getFileLineNumber()); System.out.println(" File Column: " + recordLineage.getFileColumnNumber()); System.out.println(" Record: " + recordLineage.getRecordNumber()); System.out.println(); FieldLineage fieldLineage = new FieldLineage(); System.out.println("Field Lineage"); for (int i = 0; i < record.getFieldCount(); i++) { Field field = record.getField(i); fieldLineage.setField(field); System.out.println(" " + field.getName()); System.out.println(" File: " + fieldLineage.getFile()); System.out.println(" File Line: " + fieldLineage.getFileLineNumber()); System.out.println(" File Column: " + fieldLineage.getFileColumnNumber()); System.out.println(" Record: " + fieldLineage.getRecordNumber()); System.out.println(" Field Index: " + fieldLineage.getOriginalFieldIndex()); System.out.println(" Field Name: " + fieldLineage.getOriginalFieldName()); } System.out.println("---------------------------------------------------------"); System.out.println(); } } }
Code walkthrough
- ExcelDocument is an in-memory abstraction for an Excel workbook. We use it to open
credit-balance-01.xls
- ExcelReaderis used to obtain the record from the Microsoft Excel document.
setSaveLineage(true)
enable lineage support since it is turned off by defaultJob.run()
method which transfers data from thereader
to theLineageWriter()
is then called.
RecordLineage
RecordLineage
informs us of the starting location where the record was loaded.recordLineage.getFile()
- The java.io.File, if one was used to create the DataReader.recordLineage.getFileLineNumber()
-The line number in the input file starting with 0.recordLineage.getFileColumnNumber()
-The column number in the input file starting with 0.recordLineage.getRecordNumber()
-The sequential record number starting with 0.
FieldLineage
FieldLineage
informs us of the starting location for each individual fieldfieldLineage.getOriginalFieldIndex()
-The index of a field set by the DataReader before any transformation or operation was performed.fieldLineage.getOriginalFieldName()
-The name of a field set by the DataReader before any transformation or operation was performed.
Output
Record { 0:[Account]:INT=[101]:Integer 1:[LastName]:STRING=[Reeves]:String 2:[FirstName]:STRING=[Keanu]:String 3:[Balance]:DOUBLE=[9315.45]:Double 4:[CreditLimit]:INT=[10000]:Integer 5:[AccountCreated]:STRING=[1/17/1998]:String } Record Lineage File: example\data\input\credit-balance-01.xls File Line: 1 File Column: 0 Record: 0 Field Lineage Account File: example\data\input\credit-balance-01.xls File Line: 1 File Column: 0 Record: 0 Field Index: 0 Field Name: Account LastName File: example\data\input\credit-balance-01.xls File Line: 1 File Column: 1 Record: 0 Field Index: 1 Field Name: LastName FirstName File: example\data\input\credit-balance-01.xls File Line: 1 File Column: 2 Record: 0 Field Index: 2 Field Name: FirstName Balance File: example\data\input\credit-balance-01.xls File Line: 1 File Column: 3 Record: 0 Field Index: 3 Field Name: Balance CreditLimit File: example\data\input\credit-balance-01.xls File Line: 1 File Column: 4 Record: 0 Field Index: 4 Field Name: CreditLimit AccountCreated File: example\data\input\credit-balance-01.xls File Line: 1 File Column: 5 Record: 0 Field Index: 5 Field Name: AccountCreated --------------------------------------------------------- Record { 0:[Account]:INT=[312]:Integer 1:[LastName]:STRING=[Butler]:String 2:[FirstName]:STRING=[Gerard]:String 3:[Balance]:INT=[90]:Integer 4:[CreditLimit]:INT=[1000]:Integer 5:[AccountCreated]:DATE=[2003-06-08]:Date } Record Lineage File: example\data\input\credit-balance-01.xls File Line: 2 File Column: 0 Record: 1 Field Lineage Account File: example\data\input\credit-balance-01.xls File Line: 2 File Column: 0 Record: 1 Field Index: 0 Field Name: Account LastName File: example\data\input\credit-balance-01.xls File Line: 2 File Column: 1 Record: 1 Field Index: 1 Field Name: LastName FirstName File: example\data\input\credit-balance-01.xls File Line: 2 File Column: 2 Record: 1 Field Index: 2 Field Name: FirstName Balance File: example\data\input\credit-balance-01.xls File Line: 2 File Column: 3 Record: 1 Field Index: 3 Field Name: Balance CreditLimit File: example\data\input\credit-balance-01.xls File Line: 2 File Column: 4 Record: 1 Field Index: 4 Field Name: CreditLimit AccountCreated File: example\data\input\credit-balance-01.xls File Line: 2 File Column: 5 Record: 1 Field Index: 5 Field Name: AccountCreated --------------------------------------------------------- Record { 0:[Account]:INT=[868]:Integer 1:[LastName]:STRING=[Hewitt]:String 2:[FirstName]:STRING=[Jennifer Love]:String 3:[Balance]:INT=[0]:Integer 4:[CreditLimit]:INT=[17000]:Integer 5:[AccountCreated]:STRING=[5/25/1985]:String } Record Lineage File: example\data\input\credit-balance-01.xls File Line: 3 File Column: 0 Record: 2 Field Lineage Account File: example\data\input\credit-balance-01.xls File Line: 3 File Column: 0 Record: 2 Field Index: 0 Field Name: Account LastName File: example\data\input\credit-balance-01.xls File Line: 3 File Column: 1 Record: 2 Field Index: 1 Field Name: LastName FirstName File: example\data\input\credit-balance-01.xls File Line: 3 File Column: 2 Record: 2 Field Index: 2 Field Name: FirstName Balance File: example\data\input\credit-balance-01.xls File Line: 3 File Column: 3 Record: 2 Field Index: 3 Field Name: Balance CreditLimit File: example\data\input\credit-balance-01.xls File Line: 3 File Column: 4 Record: 2 Field Index: 4 Field Name: CreditLimit AccountCreated File: example\data\input\credit-balance-01.xls File Line: 3 File Column: 5 Record: 2 Field Index: 5 Field Name: AccountCreated --------------------------------------------------------- Record { 0:[Account]:INT=[761]:Integer 1:[LastName]:STRING=[Pinkett-Smith]:String 2:[FirstName]:STRING=[Jada]:String 3:[Balance]:DOUBLE=[49654.87]:Double 4:[CreditLimit]:INT=[100000]:Integer 5:[AccountCreated]:DATE=[2006-05-12]:Date } Record Lineage File: example\data\input\credit-balance-01.xls File Line: 4 File Column: 0 Record: 3 Field Lineage Account File: example\data\input\credit-balance-01.xls File Line: 4 File Column: 0 Record: 3 Field Index: 0 Field Name: Account LastName File: example\data\input\credit-balance-01.xls File Line: 4 File Column: 1 Record: 3 Field Index: 1 Field Name: LastName FirstName File: example\data\input\credit-balance-01.xls File Line: 4 File Column: 2 Record: 3 Field Index: 2 Field Name: FirstName Balance File: example\data\input\credit-balance-01.xls File Line: 4 File Column: 3 Record: 3 Field Index: 3 Field Name: Balance CreditLimit File: example\data\input\credit-balance-01.xls File Line: 4 File Column: 4 Record: 3 Field Index: 4 Field Name: CreditLimit AccountCreated File: example\data\input\credit-balance-01.xls File Line: 4 File Column: 5 Record: 3 Field Index: 5 Field Name: AccountCreated --------------------------------------------------------- Record { 0:[Account]:INT=[317]:Integer 1:[LastName]:STRING=[Murray]:String 2:[FirstName]:STRING=[Bill]:String 3:[Balance]:DOUBLE=[789.65]:Double 4:[CreditLimit]:INT=[5000]:Integer 5:[AccountCreated]:DATE=[2007-05-02]:Date } Record Lineage File: example\data\input\credit-balance-01.xls File Line: 5 File Column: 0 Record: 4 Field Lineage Account File: example\data\input\credit-balance-01.xls File Line: 5 File Column: 0 Record: 4 Field Index: 0 Field Name: Account LastName File: example\data\input\credit-balance-01.xls File Line: 5 File Column: 1 Record: 4 Field Index: 1 Field Name: LastName FirstName File: example\data\input\credit-balance-01.xls File Line: 5 File Column: 2 Record: 4 Field Index: 2 Field Name: FirstName Balance File: example\data\input\credit-balance-01.xls File Line: 5 File Column: 3 Record: 4 Field Index: 3 Field Name: Balance CreditLimit File: example\data\input\credit-balance-01.xls File Line: 5 File Column: 4 Record: 4 Field Index: 4 Field Name: CreditLimit AccountCreated File: example\data\input\credit-balance-01.xls File Line: 5 File Column: 5 Record: 4 Field Index: 5 Field Name: AccountCreated ---------------------------------------------------------