Remove Duplicates

In this example, you will learn how to use DataPipeline to identify and remove duplicates from a dataset, leaving clean data that is free of redundant records. Duplicates are identified by the fields you specify, so you decide which values make two records count as the same.

You can use this example to make data analysis and reporting more reliable. Removing duplicates prevents inflated counts and biased results.

Input CSV file

Account,LastName,FirstName,Balance,CreditLimit,AccountCreated,Rating
101,Reeves,Keanu,9315.45,10000.00,1/17/1998,A
312,Butler,Gerard,90.00,10000.00,8/6/2003,B
868,Hewitt,Jennifer Love,0,10000.00,5/25/1985,B
761,Pinkett-Smith,Jada,49654.87,10000.00,12/5/2006,A
317,Murray,Bill,789.65,10000.00,2/5/2007,C

Java Code Listing

package com.northconcepts.datapipeline.examples.cookbook;

import java.io.File;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.FieldList;
import com.northconcepts.datapipeline.core.RemoveDuplicatesReader;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.csv.CSVReader;
import com.northconcepts.datapipeline.job.Job;

public class RemoveDuplicates {
    
    public static void main(String[] args) throws Throwable {
        DataReader reader = new CSVReader(new File("example/data/input/credit-balance-05.csv"))
            .setFieldNamesInFirstRow(true);
        
        // keeps one record for each distinct Rating/CreditLimit combination
        reader = new RemoveDuplicatesReader(
                reader,
                new FieldList("Rating", "CreditLimit")
                );
        
        Job.run(reader, new StreamWriter(System.out));
    }
    
}

Code walkthrough

  1. A CSVReader is created for the input file credit-balance-05.csv.
  2. The CSVReader.setFieldNamesInFirstRow(true) method is invoked so that the values in the first row are used as field names.
  3. A RemoveDuplicatesReader is created to remove records that repeat the same value in one field or in a combination of fields.
  4. The RemoveDuplicatesReader accepts a required DataReader and an optional FieldList or String variable arguments naming the fields to compare. In this example, "Rating" and "CreditLimit" are passed in a FieldList, so each Rating-CreditLimit combination appears only once in the output.
  5. Data is transferred from the reader to the StreamWriter(System.out) via the Job.run() method.
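
To make the field-based comparison concrete, the following is a minimal sketch in plain Java of first-occurrence deduplication on a combination of fields. It does not use DataPipeline and is not how RemoveDuplicatesReader is implemented; the class DedupSketch and its removeDuplicates and record helpers are hypothetical names introduced only for illustration, and records are modeled as simple maps with three of the fields from the input file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; not part of the DataPipeline API.
class DedupSketch {

    // Keeps the first record seen for each distinct combination of key-field values.
    static List<Map<String, String>> removeDuplicates(
            List<Map<String, String>> records, String... keyFields) {
        Set<List<String>> seen = new HashSet<>();
        List<Map<String, String>> unique = new ArrayList<>();
        for (Map<String, String> record : records) {
            // Build the comparison key from the selected fields only
            List<String> key = new ArrayList<>();
            for (String field : keyFields) {
                key.add(record.get(field));
            }
            // Set.add returns false when an equal key was already stored
            if (seen.add(key)) {
                unique.add(record);
            }
        }
        return unique;
    }

    // Hypothetical helper that builds a small record for the demo
    static Map<String, String> record(String account, String creditLimit, String rating) {
        Map<String, String> r = new LinkedHashMap<>();
        r.put("Account", account);
        r.put("CreditLimit", creditLimit);
        r.put("Rating", rating);
        return r;
    }

    public static void main(String[] args) {
        // Same accounts, credit limits, and ratings as the input CSV file
        List<Map<String, String>> input = Arrays.asList(
                record("101", "10000.00", "A"),
                record("312", "10000.00", "B"),
                record("868", "10000.00", "B"),
                record("761", "10000.00", "A"),
                record("317", "10000.00", "C"));

        for (Map<String, String> r : removeDuplicates(input, "Rating", "CreditLimit")) {
            System.out.println(r.get("Account") + " " + r.get("Rating"));
        }
    }
}
```

Run against the five input accounts, the sketch keeps accounts 101, 312, and 317, the same three records shown in the output of the DataPipeline example. A hash set of key lists keeps the check for a previously seen combination cheap, at the cost of holding every distinct key in memory.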

Output

-----------------------------------------------
0 - Record {
    0:[Account]:STRING=[101]:String
    1:[LastName]:STRING=[Reeves]:String
    2:[FirstName]:STRING=[Keanu]:String
    3:[Balance]:STRING=[9315.45]:String
    4:[CreditLimit]:STRING=[10000.00]:String
    5:[AccountCreated]:STRING=[1/17/1998]:String
    6:[Rating]:STRING=[A]:String
}

-----------------------------------------------
1 - Record {
    0:[Account]:STRING=[312]:String
    1:[LastName]:STRING=[Butler]:String
    2:[FirstName]:STRING=[Gerard]:String
    3:[Balance]:STRING=[90.00]:String
    4:[CreditLimit]:STRING=[10000.00]:String
    5:[AccountCreated]:STRING=[8/6/2003]:String
    6:[Rating]:STRING=[B]:String
}

-----------------------------------------------
2 - Record {
    0:[Account]:STRING=[317]:String
    1:[LastName]:STRING=[Murray]:String
    2:[FirstName]:STRING=[Bill]:String
    3:[Balance]:STRING=[789.65]:String
    4:[CreditLimit]:STRING=[10000.00]:String
    5:[AccountCreated]:STRING=[2/5/2007]:String
    6:[Rating]:STRING=[C]:String
}

-----------------------------------------------
3 records