Remove/Retain Duplicate Fields
This example shows how to remove duplicate fields as well as other ways of handling records with repeating fields using user-defined policies. By utilizing policies, you can tailor the duplicate field removal or processing to your specific data processing needs.
This can be used for data cleaning and data transformation tasks where duplicate fields can cause inconsistencies and errors in downstream processes. For example, when working with large datasets from different sources, DataPipeline can be used to standardize data structures by removing duplicate fields and ensuring data integrity. Additionally, it can help improve data query performance by reducing unnecessary duplication in data.
Input CSV file
Account,Name,LastName,Name 101,Keanu,Reeves,Keanu Reeves 102,John,Doe,John Doe 103,Jane,Doe,Jane Doe
Java Code Listing
package com.northconcepts.datapipeline.examples.cookbook; import java.io.File; import com.northconcepts.datapipeline.core.DataReader; import com.northconcepts.datapipeline.core.StreamWriter; import com.northconcepts.datapipeline.csv.CSVReader; import com.northconcepts.datapipeline.job.Job; import com.northconcepts.datapipeline.transform.RemoveDuplicateFields; import com.northconcepts.datapipeline.transform.RemoveDuplicateFields.DuplicateFieldsPolicy; import com.northconcepts.datapipeline.transform.TransformingReader; public class RemoveDuplicateFieldsExample { public static void main(String[] args) { retainAllDuplicateFields(); retainFirstDuplicateFields(); retainLastDuplicateFields(); makeArrayOfDuplicateFields(); renameDuplicateFields(); } private static void retainAllDuplicateFields() { System.out.println("================================Retain All Duplicate Fields================================--"); DataReader reader = new CSVReader(new File("example/data/input/duplicate_fields.csv")) .setFieldNamesInFirstRow(true); reader = new TransformingReader(reader) .add(new RemoveDuplicateFields(DuplicateFieldsPolicy.RETAIN_ALL)); Job.run(reader, new StreamWriter(System.out)); } private static void retainFirstDuplicateFields() { System.out.println("\n\n================================Retain First field from Duplicate Fields================================"); DataReader reader = new CSVReader(new File("example/data/input/duplicate_fields.csv")) .setFieldNamesInFirstRow(true); reader = new TransformingReader(reader) .add(new RemoveDuplicateFields(DuplicateFieldsPolicy.RETAIN_FIRST)); Job.run(reader, new StreamWriter(System.out)); } private static void retainLastDuplicateFields() { System.out.println("\n\n================================Retain Last field from Duplicate Fields================================"); DataReader reader = new CSVReader(new File("example/data/input/duplicate_fields.csv")) .setFieldNamesInFirstRow(true); reader = new TransformingReader(reader) .add(new RemoveDuplicateFields(DuplicateFieldsPolicy.RETAIN_LAST)); Job.run(reader, new StreamWriter(System.out)); } private static void makeArrayOfDuplicateFields() { System.out.println("\n\n================================Create An Array Of All Duplicate Fields================================"); DataReader reader = new CSVReader(new File("example/data/input/duplicate_fields.csv")) .setFieldNamesInFirstRow(true); reader = new TransformingReader(reader) .add(new RemoveDuplicateFields(DuplicateFieldsPolicy.MAKE_ARRAY)); Job.run(reader, new StreamWriter(System.out)); } private static void renameDuplicateFields() { System.out.println("\n\n================================Rename Duplicate Fields================================"); DataReader reader = new CSVReader(new File("example/data/input/duplicate_fields.csv")) .setFieldNamesInFirstRow(true); reader = new TransformingReader(reader) .add(new RemoveDuplicateFields(DuplicateFieldsPolicy.RENAME)); Job.run(reader, new StreamWriter(System.out)); } }
Code Walkthrough
- Inside the main method, five methods are called corresponding to the different duplicate policy names within DataPipeline.
- In each method body, the following pattern is repeated:
- CSVReader is created corresponding to the input file
duplicate_fields.csv
. - TransformingReader instance is used to sequentially apply changes to the incoming data.
- RemoveDuplicateFields is invoked with different
DuplicateFieldsPolicy
parameters. Each policy is explained below. - Job.run() is used to transfer the data from the
reader
toStreamWriter(System.out)
. See how to compile and run data pipeline jobs.
- CSVReader is created corresponding to the input file
Duplicate Fields Policies
RETAIN_ALL
The "RETAIN_ALL" policy retains all fields, including duplicates, in the records. In this policy, if there are multiple occurrences of the same field in a record, they are all retained, resulting in multiple columns with the same field name.
For example, in the provided output for "RETAIN_ALL," each record contains both "Name" and "LastName" fields, and the duplicate "Name" field is retained along with the original field.
RETAIN_FIRST
The "RETAIN_FIRST" policy retains the first occurrence of each field in a record and removes subsequent duplicates. In this policy, only the first occurrence of each field is kept, and any duplicate fields with the same name are removed from the record.
For example, in the provided output for "RETAIN_FIRST," each record has only one "Name" and one "LastName" field, and the duplicate "Name" field is removed from the record after the first occurrence.
RETAIN_LAST
The "RETAIN_LAST" policy retains the last occurrence of each field in a record and removes all previous duplicates. In this policy, only the last occurrence of each field is kept, and any duplicate fields with the same name are removed from the record, retaining only the last one.
For example, in the provided output for "RETAIN_LAST," each record has only one "Name" and one "LastName" field, and all previous occurrences of the "Name" field are removed, leaving only the last occurrence of the field in the record.
MAKE_ARRAY
The "MAKE_ARRAY" policy transforms the duplicate fields into arrays of values, combining all duplicate values into a single array for each field. In this policy, instead of removing duplicates, the duplicate field values are consolidated into an array containing all occurrences of the field in the record.
For example, in the provided output for "MAKE_ARRAY," the "Name" field is transformed into an array that contains both "Keanu" and "Keanu Reeves," and the "LastName" field remains unaffected as it does not have any duplicates.
RENAME
The "RENAME" policy modifies the duplicate fields by renaming them with a unique suffix, such as appending a numerical index to the field name. This ensures that each duplicate field has a distinct name, preventing conflicts in the record.
In the provided output for "RENAME," the duplicate "Name" field is renamed as "Name1" to differentiate it from the original "Name" field. Similarly, the "LastName" field remains unaffected since it does not have any duplicates.
Console Output
================================Retain All Duplicate Fields================================
----------------------------------------------- 0 - Record { 0:[Account]:STRING=[101]:String 1:[Name]:STRING=[Keanu]:String 2:[LastName]:STRING=[Reeves]:String 3:[Name]:STRING=[Keanu Reeves]:String } ----------------------------------------------- 1 - Record { 0:[Account]:STRING=[102]:String 1:[Name]:STRING=[John]:String 2:[LastName]:STRING=[Doe]:String 3:[Name]:STRING=[John Doe]:String } ----------------------------------------------- 2 - Record { 0:[Account]:STRING=[103]:String 1:[Name]:STRING=[Jane]:String 2:[LastName]:STRING=[Doe]:String 3:[Name]:STRING=[Jane Doe]:String } ----------------------------------------------- 3 records ================================Retain First field from Duplicate Fields================================ ----------------------------------------------- 0 - Record (MODIFIED) { 0:[Account]:STRING=[101]:String 1:[Name]:STRING=[Keanu]:String 2:[LastName]:STRING=[Reeves]:String } ----------------------------------------------- 1 - Record (MODIFIED) { 0:[Account]:STRING=[102]:String 1:[Name]:STRING=[John]:String 2:[LastName]:STRING=[Doe]:String } ----------------------------------------------- 2 - Record (MODIFIED) { 0:[Account]:STRING=[103]:String 1:[Name]:STRING=[Jane]:String 2:[LastName]:STRING=[Doe]:String } ----------------------------------------------- 3 records ================================Retain Last field from Duplicate Fields================================ ----------------------------------------------- 0 - Record (MODIFIED) { 0:[Account]:STRING=[101]:String 1:[LastName]:STRING=[Reeves]:String 2:[Name]:STRING=[Keanu Reeves]:String } ----------------------------------------------- 1 - Record (MODIFIED) { 0:[Account]:STRING=[102]:String 1:[LastName]:STRING=[Doe]:String 2:[Name]:STRING=[John Doe]:String } ----------------------------------------------- 2 - Record (MODIFIED) { 0:[Account]:STRING=[103]:String 1:[LastName]:STRING=[Doe]:String 2:[Name]:STRING=[Jane Doe]:String } ----------------------------------------------- 3 records ================================Create An Array Of All Duplicate Fields================================ ----------------------------------------------- 0 - Record (MODIFIED) { 0:[Account]:STRING=[101]:String 1:[Name]:ARRAY of STRING=[[Keanu, Keanu Reeves]]:ArrayValue 2:[LastName]:STRING=[Reeves]:String } ----------------------------------------------- 1 - Record (MODIFIED) { 0:[Account]:STRING=[102]:String 1:[Name]:ARRAY of STRING=[[John, John Doe]]:ArrayValue 2:[LastName]:STRING=[Doe]:String } ----------------------------------------------- 2 - Record (MODIFIED) { 0:[Account]:STRING=[103]:String 1:[Name]:ARRAY of STRING=[[Jane, Jane Doe]]:ArrayValue 2:[LastName]:STRING=[Doe]:String } ----------------------------------------------- 3 records ================================Rename Duplicate Fields================================ ----------------------------------------------- 0 - Record (MODIFIED) { 0:[Account]:STRING=[101]:String 1:[Name]:STRING=[Keanu]:String 2:[LastName]:STRING=[Reeves]:String 3:[Name1]:STRING=[Keanu Reeves]:String } ----------------------------------------------- 1 - Record (MODIFIED) { 0:[Account]:STRING=[102]:String 1:[Name]:STRING=[John]:String 2:[LastName]:STRING=[Doe]:String 3:[Name1]:STRING=[John Doe]:String } ----------------------------------------------- 2 - Record (MODIFIED) { 0:[Account]:STRING=[103]:String 1:[Name]:STRING=[Jane]:String 2:[LastName]:STRING=[Doe]:String 3:[Name1]:STRING=[Jane Doe]:String } ----------------------------------------------- 3 records