Use Data Lineage with ExcelReader

Data lineage is a metadata added to records and fields indicating where they were loaded from. It is useful for audits and reconciliation as well as troubleshooting.

Data lineage can also be used with other readers, for example Data Lineage with CSVReader and Data Lineage with FixedWidthReader.

Java Code listing

  1. package com.northconcepts.datapipeline.examples.cookbook;
  2.  
  3. import java.io.File;
  4.  
  5. import com.northconcepts.datapipeline.core.DataReader;
  6. import com.northconcepts.datapipeline.core.DataWriter;
  7. import com.northconcepts.datapipeline.core.Field;
  8. import com.northconcepts.datapipeline.core.Record;
  9. import com.northconcepts.datapipeline.excel.ExcelDocument;
  10. import com.northconcepts.datapipeline.excel.ExcelReader;
  11. import com.northconcepts.datapipeline.job.Job;
  12. import com.northconcepts.datapipeline.lineage.FieldLineage;
  13. import com.northconcepts.datapipeline.lineage.RecordLineage;
  14.  
  15. public class UseDataLineageWithExcelReader {
  16. public static void main(String[] args) {
  17. ExcelDocument document = new ExcelDocument().open(new File("example/data/input/credit-balance-01.xls"));
  18. DataReader reader = new ExcelReader(document)
  19. .setSheetName("credit-balance")
  20. .setFieldNamesInFirstRow(true)
  21. .setSaveLineage(true);
  22.  
  23. Job.run(reader, new LineageWriter());
  24. }
  25.  
  26. public final static class LineageWriter extends DataWriter {
  27.  
  28. @Override
  29. protected void writeImpl(Record record) throws Throwable {
  30. System.out.println(record);
  31.  
  32. RecordLineage recordLineage = new RecordLineage().setRecord(record);
  33.  
  34. System.out.println("Record Lineage");
  35. System.out.println(" File: " + recordLineage.getFile());
  36. System.out.println(" File Line: " + recordLineage.getFileLineNumber());
  37. System.out.println(" File Column: " + recordLineage.getFileColumnNumber());
  38. System.out.println(" Record: " + recordLineage.getRecordNumber());
  39.  
  40. System.out.println();
  41.  
  42. FieldLineage fieldLineage = new FieldLineage();
  43.  
  44. System.out.println("Field Lineage");
  45. for (int i = 0; i < record.getFieldCount(); i++) {
  46. Field field = record.getField(i);
  47. fieldLineage.setField(field);
  48. System.out.println(" " + field.getName());
  49. System.out.println(" File: " + fieldLineage.getFile());
  50. System.out.println(" File Line: " + fieldLineage.getFileLineNumber());
  51. System.out.println(" File Column: " + fieldLineage.getFileColumnNumber());
  52. System.out.println(" Record: " + fieldLineage.getRecordNumber());
  53. System.out.println(" Field Index: " + fieldLineage.getOriginalFieldIndex());
  54. System.out.println(" Field Name: " + fieldLineage.getOriginalFieldName());
  55. }
  56. System.out.println("---------------------------------------------------------");
  57. System.out.println();
  58. }
  59.  
  60. }
  61. }
  62.  

Code walkthrough

  1. ExcelDocument is an in-memory abstraction for an Excel workbook. We use it to open credit-balance-01.xls
  2. ExcelReaderis used to obtain the record from the Microsoft Excel document.
  3. setSaveLineage(true) enable lineage support since it is turned off by default
  4. Job.run() method which transfers data from the reader to the LineageWriter()is then called.

RecordLineage

  1. RecordLineage informs us of the starting location where the record was loaded.
  2. recordLineage.getFile() - The java.io.File, if one was used to create the DataReader.
  3. recordLineage.getFileLineNumber() -The line number in the input file starting with 0.
  4. recordLineage.getFileColumnNumber() -The column number in the input file starting with 0.
  5. recordLineage.getRecordNumber() -The sequential record number starting with 0.

FieldLineage

  1. FieldLineage informs us of the starting location for each individual field
  2. fieldLineage.getOriginalFieldIndex() -The index of a field set by the DataReader before any transformation or operation was performed.
  3. fieldLineage.getOriginalFieldName() -The name of a field set by the DataReader before any transformation or operation was performed.

Output

  1. Record {
  2. 0:[Account]:INT=[101]:Integer
  3. 1:[LastName]:STRING=[Reeves]:String
  4. 2:[FirstName]:STRING=[Keanu]:String
  5. 3:[Balance]:DOUBLE=[9315.45]:Double
  6. 4:[CreditLimit]:INT=[10000]:Integer
  7. 5:[AccountCreated]:STRING=[1/17/1998]:String
  8. }
  9.  
  10. Record Lineage
  11. File: example\data\input\credit-balance-01.xls
  12. File Line: 1
  13. File Column: 0
  14. Record: 0
  15.  
  16. Field Lineage
  17. Account
  18. File: example\data\input\credit-balance-01.xls
  19. File Line: 1
  20. File Column: 0
  21. Record: 0
  22. Field Index: 0
  23. Field Name: Account
  24. LastName
  25. File: example\data\input\credit-balance-01.xls
  26. File Line: 1
  27. File Column: 1
  28. Record: 0
  29. Field Index: 1
  30. Field Name: LastName
  31. FirstName
  32. File: example\data\input\credit-balance-01.xls
  33. File Line: 1
  34. File Column: 2
  35. Record: 0
  36. Field Index: 2
  37. Field Name: FirstName
  38. Balance
  39. File: example\data\input\credit-balance-01.xls
  40. File Line: 1
  41. File Column: 3
  42. Record: 0
  43. Field Index: 3
  44. Field Name: Balance
  45. CreditLimit
  46. File: example\data\input\credit-balance-01.xls
  47. File Line: 1
  48. File Column: 4
  49. Record: 0
  50. Field Index: 4
  51. Field Name: CreditLimit
  52. AccountCreated
  53. File: example\data\input\credit-balance-01.xls
  54. File Line: 1
  55. File Column: 5
  56. Record: 0
  57. Field Index: 5
  58. Field Name: AccountCreated
  59. ---------------------------------------------------------
  60.  
  61. Record {
  62. 0:[Account]:INT=[312]:Integer
  63. 1:[LastName]:STRING=[Butler]:String
  64. 2:[FirstName]:STRING=[Gerard]:String
  65. 3:[Balance]:INT=[90]:Integer
  66. 4:[CreditLimit]:INT=[1000]:Integer
  67. 5:[AccountCreated]:DATE=[2003-06-08]:Date
  68. }
  69.  
  70. Record Lineage
  71. File: example\data\input\credit-balance-01.xls
  72. File Line: 2
  73. File Column: 0
  74. Record: 1
  75.  
  76. Field Lineage
  77. Account
  78. File: example\data\input\credit-balance-01.xls
  79. File Line: 2
  80. File Column: 0
  81. Record: 1
  82. Field Index: 0
  83. Field Name: Account
  84. LastName
  85. File: example\data\input\credit-balance-01.xls
  86. File Line: 2
  87. File Column: 1
  88. Record: 1
  89. Field Index: 1
  90. Field Name: LastName
  91. FirstName
  92. File: example\data\input\credit-balance-01.xls
  93. File Line: 2
  94. File Column: 2
  95. Record: 1
  96. Field Index: 2
  97. Field Name: FirstName
  98. Balance
  99. File: example\data\input\credit-balance-01.xls
  100. File Line: 2
  101. File Column: 3
  102. Record: 1
  103. Field Index: 3
  104. Field Name: Balance
  105. CreditLimit
  106. File: example\data\input\credit-balance-01.xls
  107. File Line: 2
  108. File Column: 4
  109. Record: 1
  110. Field Index: 4
  111. Field Name: CreditLimit
  112. AccountCreated
  113. File: example\data\input\credit-balance-01.xls
  114. File Line: 2
  115. File Column: 5
  116. Record: 1
  117. Field Index: 5
  118. Field Name: AccountCreated
  119. ---------------------------------------------------------
  120.  
  121. Record {
  122. 0:[Account]:INT=[868]:Integer
  123. 1:[LastName]:STRING=[Hewitt]:String
  124. 2:[FirstName]:STRING=[Jennifer Love]:String
  125. 3:[Balance]:INT=[0]:Integer
  126. 4:[CreditLimit]:INT=[17000]:Integer
  127. 5:[AccountCreated]:STRING=[5/25/1985]:String
  128. }
  129.  
  130. Record Lineage
  131. File: example\data\input\credit-balance-01.xls
  132. File Line: 3
  133. File Column: 0
  134. Record: 2
  135.  
  136. Field Lineage
  137. Account
  138. File: example\data\input\credit-balance-01.xls
  139. File Line: 3
  140. File Column: 0
  141. Record: 2
  142. Field Index: 0
  143. Field Name: Account
  144. LastName
  145. File: example\data\input\credit-balance-01.xls
  146. File Line: 3
  147. File Column: 1
  148. Record: 2
  149. Field Index: 1
  150. Field Name: LastName
  151. FirstName
  152. File: example\data\input\credit-balance-01.xls
  153. File Line: 3
  154. File Column: 2
  155. Record: 2
  156. Field Index: 2
  157. Field Name: FirstName
  158. Balance
  159. File: example\data\input\credit-balance-01.xls
  160. File Line: 3
  161. File Column: 3
  162. Record: 2
  163. Field Index: 3
  164. Field Name: Balance
  165. CreditLimit
  166. File: example\data\input\credit-balance-01.xls
  167. File Line: 3
  168. File Column: 4
  169. Record: 2
  170. Field Index: 4
  171. Field Name: CreditLimit
  172. AccountCreated
  173. File: example\data\input\credit-balance-01.xls
  174. File Line: 3
  175. File Column: 5
  176. Record: 2
  177. Field Index: 5
  178. Field Name: AccountCreated
  179. ---------------------------------------------------------
  180.  
  181. Record {
  182. 0:[Account]:INT=[761]:Integer
  183. 1:[LastName]:STRING=[Pinkett-Smith]:String
  184. 2:[FirstName]:STRING=[Jada]:String
  185. 3:[Balance]:DOUBLE=[49654.87]:Double
  186. 4:[CreditLimit]:INT=[100000]:Integer
  187. 5:[AccountCreated]:DATE=[2006-05-12]:Date
  188. }
  189.  
  190. Record Lineage
  191. File: example\data\input\credit-balance-01.xls
  192. File Line: 4
  193. File Column: 0
  194. Record: 3
  195.  
  196. Field Lineage
  197. Account
  198. File: example\data\input\credit-balance-01.xls
  199. File Line: 4
  200. File Column: 0
  201. Record: 3
  202. Field Index: 0
  203. Field Name: Account
  204. LastName
  205. File: example\data\input\credit-balance-01.xls
  206. File Line: 4
  207. File Column: 1
  208. Record: 3
  209. Field Index: 1
  210. Field Name: LastName
  211. FirstName
  212. File: example\data\input\credit-balance-01.xls
  213. File Line: 4
  214. File Column: 2
  215. Record: 3
  216. Field Index: 2
  217. Field Name: FirstName
  218. Balance
  219. File: example\data\input\credit-balance-01.xls
  220. File Line: 4
  221. File Column: 3
  222. Record: 3
  223. Field Index: 3
  224. Field Name: Balance
  225. CreditLimit
  226. File: example\data\input\credit-balance-01.xls
  227. File Line: 4
  228. File Column: 4
  229. Record: 3
  230. Field Index: 4
  231. Field Name: CreditLimit
  232. AccountCreated
  233. File: example\data\input\credit-balance-01.xls
  234. File Line: 4
  235. File Column: 5
  236. Record: 3
  237. Field Index: 5
  238. Field Name: AccountCreated
  239. ---------------------------------------------------------
  240.  
  241. Record {
  242. 0:[Account]:INT=[317]:Integer
  243. 1:[LastName]:STRING=[Murray]:String
  244. 2:[FirstName]:STRING=[Bill]:String
  245. 3:[Balance]:DOUBLE=[789.65]:Double
  246. 4:[CreditLimit]:INT=[5000]:Integer
  247. 5:[AccountCreated]:DATE=[2007-05-02]:Date
  248. }
  249.  
  250. Record Lineage
  251. File: example\data\input\credit-balance-01.xls
  252. File Line: 5
  253. File Column: 0
  254. Record: 4
  255.  
  256. Field Lineage
  257. Account
  258. File: example\data\input\credit-balance-01.xls
  259. File Line: 5
  260. File Column: 0
  261. Record: 4
  262. Field Index: 0
  263. Field Name: Account
  264. LastName
  265. File: example\data\input\credit-balance-01.xls
  266. File Line: 5
  267. File Column: 1
  268. Record: 4
  269. Field Index: 1
  270. Field Name: LastName
  271. FirstName
  272. File: example\data\input\credit-balance-01.xls
  273. File Line: 5
  274. File Column: 2
  275. Record: 4
  276. Field Index: 2
  277. Field Name: FirstName
  278. Balance
  279. File: example\data\input\credit-balance-01.xls
  280. File Line: 5
  281. File Column: 3
  282. Record: 4
  283. Field Index: 3
  284. Field Name: Balance
  285. CreditLimit
  286. File: example\data\input\credit-balance-01.xls
  287. File Line: 5
  288. File Column: 4
  289. Record: 4
  290. Field Index: 4
  291. Field Name: CreditLimit
  292. AccountCreated
  293. File: example\data\input\credit-balance-01.xls
  294. File Line: 5
  295. File Column: 5
  296. Record: 4
  297. Field Index: 5
  298. Field Name: AccountCreated
  299. ---------------------------------------------------------
Mobile Analytics