Use Data Lineage with ExcelReader

Data lineage is a metadata added to records and fields indicating where they were loaded from. It is useful for audits and reconciliation as well as troubleshooting.

Data lineage can also be used with other readers, for example Data Lineage with CSVReader and Data Lineage with FixedWidthReader.

Java Code listing

package com.northconcepts.datapipeline.examples.cookbook;

import java.io.File;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.core.Field;
import com.northconcepts.datapipeline.core.Record;
import com.northconcepts.datapipeline.excel.ExcelDocument;
import com.northconcepts.datapipeline.excel.ExcelReader;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.lineage.FieldLineage;
import com.northconcepts.datapipeline.lineage.RecordLineage;

public class UseDataLineageWithExcelReader {
    public static void main(String[] args) {
        ExcelDocument document = new ExcelDocument().open(new File("example/data/input/credit-balance-01.xls"));
        DataReader reader = new ExcelReader(document)
                .setSheetName("credit-balance")
                .setFieldNamesInFirstRow(true)
                .setSaveLineage(true);

        Job.run(reader, new LineageWriter());
    }

    public final static class LineageWriter extends DataWriter {

        @Override
        protected void writeImpl(Record record) throws Throwable {
            System.out.println(record);

            RecordLineage recordLineage = new RecordLineage().setRecord(record);

            System.out.println("Record Lineage");
            System.out.println("    File: " + recordLineage.getFile());
            System.out.println("    File Line: " + recordLineage.getFileLineNumber());
            System.out.println("    File Column: " + recordLineage.getFileColumnNumber());
            System.out.println("    Record: " + recordLineage.getRecordNumber());

            System.out.println();

            FieldLineage fieldLineage = new FieldLineage();

            System.out.println("Field Lineage");
            for (int i = 0; i < record.getFieldCount(); i++) {
                Field field = record.getField(i);
                fieldLineage.setField(field);
                System.out.println("    " + field.getName());
                System.out.println("        File: " + fieldLineage.getFile());
                System.out.println("        File Line: " + fieldLineage.getFileLineNumber());
                System.out.println("        File Column: " + fieldLineage.getFileColumnNumber());
                System.out.println("        Record: " + fieldLineage.getRecordNumber());
                System.out.println("        Field Index: " + fieldLineage.getOriginalFieldIndex());
                System.out.println("        Field Name: " + fieldLineage.getOriginalFieldName());
            }
            System.out.println("---------------------------------------------------------");
            System.out.println();
        }

    }
}

Code walkthrough

  1. ExcelDocument is an in-memory abstraction for an Excel workbook. We use it to open credit-balance-01.xls
  2. ExcelReaderis used to obtain the record from the Microsoft Excel document.
  3. setSaveLineage(true) enable lineage support since it is turned off by default
  4. Job.run() method which transfers data from the reader to the LineageWriter()is then called.

RecordLineage

  1. RecordLineage informs us of the starting location where the record was loaded.
  2. recordLineage.getFile() - The java.io.File, if one was used to create the DataReader.
  3. recordLineage.getFileLineNumber() -The line number in the input file starting with 0.
  4. recordLineage.getFileColumnNumber() -The column number in the input file starting with 0.
  5. recordLineage.getRecordNumber() -The sequential record number starting with 0.

FieldLineage

  1. FieldLineage informs us of the starting location for each individual field
  2. fieldLineage.getOriginalFieldIndex() -The index of a field set by the DataReader before any transformation or operation was performed.
  3. fieldLineage.getOriginalFieldName() -The name of a field set by the DataReader before any transformation or operation was performed.

Output

Record {
    0:[Account]:INT=[101]:Integer
    1:[LastName]:STRING=[Reeves]:String
    2:[FirstName]:STRING=[Keanu]:String
    3:[Balance]:DOUBLE=[9315.45]:Double
    4:[CreditLimit]:INT=[10000]:Integer
    5:[AccountCreated]:STRING=[1/17/1998]:String
}

Record Lineage
    File: example\data\input\credit-balance-01.xls
    File Line: 1
    File Column: 0
    Record: 0

Field Lineage
    Account
        File: example\data\input\credit-balance-01.xls
        File Line: 1
        File Column: 0
        Record: 0
        Field Index: 0
        Field Name: Account
    LastName
        File: example\data\input\credit-balance-01.xls
        File Line: 1
        File Column: 1
        Record: 0
        Field Index: 1
        Field Name: LastName
    FirstName
        File: example\data\input\credit-balance-01.xls
        File Line: 1
        File Column: 2
        Record: 0
        Field Index: 2
        Field Name: FirstName
    Balance
        File: example\data\input\credit-balance-01.xls
        File Line: 1
        File Column: 3
        Record: 0
        Field Index: 3
        Field Name: Balance
    CreditLimit
        File: example\data\input\credit-balance-01.xls
        File Line: 1
        File Column: 4
        Record: 0
        Field Index: 4
        Field Name: CreditLimit
    AccountCreated
        File: example\data\input\credit-balance-01.xls
        File Line: 1
        File Column: 5
        Record: 0
        Field Index: 5
        Field Name: AccountCreated
---------------------------------------------------------

Record {
    0:[Account]:INT=[312]:Integer
    1:[LastName]:STRING=[Butler]:String
    2:[FirstName]:STRING=[Gerard]:String
    3:[Balance]:INT=[90]:Integer
    4:[CreditLimit]:INT=[1000]:Integer
    5:[AccountCreated]:DATE=[2003-06-08]:Date
}

Record Lineage
    File: example\data\input\credit-balance-01.xls
    File Line: 2
    File Column: 0
    Record: 1

Field Lineage
    Account
        File: example\data\input\credit-balance-01.xls
        File Line: 2
        File Column: 0
        Record: 1
        Field Index: 0
        Field Name: Account
    LastName
        File: example\data\input\credit-balance-01.xls
        File Line: 2
        File Column: 1
        Record: 1
        Field Index: 1
        Field Name: LastName
    FirstName
        File: example\data\input\credit-balance-01.xls
        File Line: 2
        File Column: 2
        Record: 1
        Field Index: 2
        Field Name: FirstName
    Balance
        File: example\data\input\credit-balance-01.xls
        File Line: 2
        File Column: 3
        Record: 1
        Field Index: 3
        Field Name: Balance
    CreditLimit
        File: example\data\input\credit-balance-01.xls
        File Line: 2
        File Column: 4
        Record: 1
        Field Index: 4
        Field Name: CreditLimit
    AccountCreated
        File: example\data\input\credit-balance-01.xls
        File Line: 2
        File Column: 5
        Record: 1
        Field Index: 5
        Field Name: AccountCreated
---------------------------------------------------------

Record {
    0:[Account]:INT=[868]:Integer
    1:[LastName]:STRING=[Hewitt]:String
    2:[FirstName]:STRING=[Jennifer Love]:String
    3:[Balance]:INT=[0]:Integer
    4:[CreditLimit]:INT=[17000]:Integer
    5:[AccountCreated]:STRING=[5/25/1985]:String
}

Record Lineage
    File: example\data\input\credit-balance-01.xls
    File Line: 3
    File Column: 0
    Record: 2

Field Lineage
    Account
        File: example\data\input\credit-balance-01.xls
        File Line: 3
        File Column: 0
        Record: 2
        Field Index: 0
        Field Name: Account
    LastName
        File: example\data\input\credit-balance-01.xls
        File Line: 3
        File Column: 1
        Record: 2
        Field Index: 1
        Field Name: LastName
    FirstName
        File: example\data\input\credit-balance-01.xls
        File Line: 3
        File Column: 2
        Record: 2
        Field Index: 2
        Field Name: FirstName
    Balance
        File: example\data\input\credit-balance-01.xls
        File Line: 3
        File Column: 3
        Record: 2
        Field Index: 3
        Field Name: Balance
    CreditLimit
        File: example\data\input\credit-balance-01.xls
        File Line: 3
        File Column: 4
        Record: 2
        Field Index: 4
        Field Name: CreditLimit
    AccountCreated
        File: example\data\input\credit-balance-01.xls
        File Line: 3
        File Column: 5
        Record: 2
        Field Index: 5
        Field Name: AccountCreated
---------------------------------------------------------

Record {
    0:[Account]:INT=[761]:Integer
    1:[LastName]:STRING=[Pinkett-Smith]:String
    2:[FirstName]:STRING=[Jada]:String
    3:[Balance]:DOUBLE=[49654.87]:Double
    4:[CreditLimit]:INT=[100000]:Integer
    5:[AccountCreated]:DATE=[2006-05-12]:Date
}

Record Lineage
    File: example\data\input\credit-balance-01.xls
    File Line: 4
    File Column: 0
    Record: 3

Field Lineage
    Account
        File: example\data\input\credit-balance-01.xls
        File Line: 4
        File Column: 0
        Record: 3
        Field Index: 0
        Field Name: Account
    LastName
        File: example\data\input\credit-balance-01.xls
        File Line: 4
        File Column: 1
        Record: 3
        Field Index: 1
        Field Name: LastName
    FirstName
        File: example\data\input\credit-balance-01.xls
        File Line: 4
        File Column: 2
        Record: 3
        Field Index: 2
        Field Name: FirstName
    Balance
        File: example\data\input\credit-balance-01.xls
        File Line: 4
        File Column: 3
        Record: 3
        Field Index: 3
        Field Name: Balance
    CreditLimit
        File: example\data\input\credit-balance-01.xls
        File Line: 4
        File Column: 4
        Record: 3
        Field Index: 4
        Field Name: CreditLimit
    AccountCreated
        File: example\data\input\credit-balance-01.xls
        File Line: 4
        File Column: 5
        Record: 3
        Field Index: 5
        Field Name: AccountCreated
---------------------------------------------------------

Record {
    0:[Account]:INT=[317]:Integer
    1:[LastName]:STRING=[Murray]:String
    2:[FirstName]:STRING=[Bill]:String
    3:[Balance]:DOUBLE=[789.65]:Double
    4:[CreditLimit]:INT=[5000]:Integer
    5:[AccountCreated]:DATE=[2007-05-02]:Date
}

Record Lineage
    File: example\data\input\credit-balance-01.xls
    File Line: 5
    File Column: 0
    Record: 4

Field Lineage
    Account
        File: example\data\input\credit-balance-01.xls
        File Line: 5
        File Column: 0
        Record: 4
        Field Index: 0
        Field Name: Account
    LastName
        File: example\data\input\credit-balance-01.xls
        File Line: 5
        File Column: 1
        Record: 4
        Field Index: 1
        Field Name: LastName
    FirstName
        File: example\data\input\credit-balance-01.xls
        File Line: 5
        File Column: 2
        Record: 4
        Field Index: 2
        Field Name: FirstName
    Balance
        File: example\data\input\credit-balance-01.xls
        File Line: 5
        File Column: 3
        Record: 4
        Field Index: 3
        Field Name: Balance
    CreditLimit
        File: example\data\input\credit-balance-01.xls
        File Line: 5
        File Column: 4
        Record: 4
        Field Index: 4
        Field Name: CreditLimit
    AccountCreated
        File: example\data\input\credit-balance-01.xls
        File Line: 5
        File Column: 5
        Record: 4
        Field Index: 5
        Field Name: AccountCreated
---------------------------------------------------------
Mobile Analytics