Show Unique Values in Column

This example analyzes a dataset to determine the unique values within its columns along with their respective counts. This helps you understand the distribution and occurrence of distinct values, which supports data profiling and exploration tasks. It can be used for data quality assessment, anomaly detection, and understanding data patterns. By identifying unique values and their frequencies, you can gain insight into the characteristics of your data, leading to better-informed decisions and more effective data cleansing.

 

Input CSV File

Sell,List,Living,Rooms,Beds,Baths,Age,Acres,Taxes,Rating,Garage
142,160,28,10,5,,60,0.28,3167,A,
1,193,28,11,,3,60,0.28,3168,B,
10,110,28,,5,3,60,0.28,3169,B,
3,132,,12,5,3,60,0.28,3170,C,
5,,180,13,5,3,60,0.28,3171,C,

 

Java Code Listing

package com.northconcepts.datapipeline.foundations.examples.pipeline;

import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

import com.northconcepts.datapipeline.foundations.file.LocalFileSource;
import com.northconcepts.datapipeline.foundations.pipeline.Pipeline;
import com.northconcepts.datapipeline.foundations.pipeline.dataset.Column;
import com.northconcepts.datapipeline.foundations.pipeline.dataset.Dataset;
import com.northconcepts.datapipeline.foundations.pipeline.dataset.MemoryDataset;
import com.northconcepts.datapipeline.foundations.pipeline.input.CsvPipelineInput;

public class ShowUniqueValuesInColumn {

    public static void main(String[] args) {

        Pipeline pipeline = new Pipeline();

        CsvPipelineInput pipelineInput = new CsvPipelineInput()
                .setFileSource(new LocalFileSource().setPath("data/input/Listing.csv"))
                .setFieldNamesInFirstRow(true);

        pipeline.setInput(pipelineInput);

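        // Hold the pipeline's data in memory and collect each column's unique values
        Dataset dataset = new MemoryDataset(pipeline).setCollectUniqueValues(true);
        // Wait until the records, and then the column statistics, have finished loading
        dataset.load().waitForRecordsToLoad();
        dataset.load().waitForColumnStatsToLoad();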

        for (Column column : dataset.getColumns()) {
            System.out.println("Column Name: " + column.getName());
            for (Map.Entry<Object, LongAdder> entry : column.getUniqueValuesByCount()) {
                System.out.println("  " + entry.getKey() + ": " + entry.getValue());
            }
            System.out.println("===============================");
        }

        dataset.close();
    }
}

 

Code Walkthrough

  1. A Pipeline object is created.
  2. A CsvPipelineInput instance is created to read data from the input file Listing.csv and is set as the input of the pipeline object.
  3. The setFieldNamesInFirstRow(true) method is invoked to specify that the values in the first row should be used as field names.
  4. A MemoryDataset is created to hold the pipeline's data in memory. setCollectUniqueValues(true) enables collection of the unique values within the dataset.
  5. Record and column information is loaded using the load().waitForRecordsToLoad() and load().waitForColumnStatsToLoad() methods.
  6. The columns are then iterated using a for loop. For each column, the column.getUniqueValuesByCount() method is invoked, and the key (unique value) and value (count) of each resulting entry are printed to the console.
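
To make the idea behind the last step concrete, the following is a minimal plain-Java sketch of counting unique values and ordering them by frequency. It is not DataPipeline's implementation: the class UniqueValueCountSketch and the method countByFrequency are hypothetical names used only for illustration, and the sketch depends only on the JDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

public class UniqueValueCountSketch {

    // Hypothetical helper (not part of DataPipeline): tallies each distinct value,
    // then orders the entries from most to least frequent.
    public static List<Map.Entry<Object, LongAdder>> countByFrequency(List<Object> values) {
        Map<Object, LongAdder> counts = new HashMap<>(); // HashMap permits a null key
        for (Object value : values) {
            counts.computeIfAbsent(value, key -> new LongAdder()).increment();
        }
        List<Map.Entry<Object, LongAdder>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Long.compare(b.getValue().sum(), a.getValue().sum()));
        return entries;
    }

    public static void main(String[] args) {
        // Values of the Living column from the sample input; the blank field is modeled as null.
        List<Object> living = Arrays.asList("28", "28", "28", null, "180");
        for (Map.Entry<Object, LongAdder> entry : countByFrequency(living)) {
            System.out.println("  " + entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

In this sketch, entries with equal counts appear in no particular order, because HashMap does not guarantee iteration order.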

 

Console Output

Column Name: Sell
  1: 1
  3: 1
  5: 1
  142: 1
  10: 1
===============================
Column Name: List
  132: 1
  110: 1
  null: 1
  160: 1
  193: 1
===============================
Column Name: Living
  28: 3
  null: 1
  180: 1
===============================
Column Name: Rooms
  11: 1
  12: 1
  13: 1
  null: 1
  10: 1
===============================
Column Name: Beds
  5: 4
  null: 1
===============================
Column Name: Baths
  3: 4
  null: 1
===============================
Column Name: Age
  60: 5
===============================
Column Name: Acres
  0.28: 5
===============================
Column Name: Taxes
  3170: 1
  3169: 1
  3168: 1
  3167: 1
  3171: 1
===============================
Column Name: Rating
  B: 2
  C: 2
  A: 1
===============================
Column Name: Garage
  null: 5
===============================
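
The null entries in the output line up with the blank fields in the input file. The sketch below shows one way a blank CSV field could be mapped to null. It is a naive illustration under stated assumptions, not the parser used by CsvPipelineInput: BlankFieldSketch and parseRow are hypothetical names, and quoted fields are not handled.

```java
import java.util.ArrayList;
import java.util.List;

public class BlankFieldSketch {

    // Hypothetical helper: naive comma split with no support for quoted fields.
    public static List<String> parseRow(String line) {
        List<String> fields = new ArrayList<>();
        // A limit of -1 keeps trailing empty strings, so the blank Garage field is not dropped.
        for (String field : line.split(",", -1)) {
            fields.add(field.isEmpty() ? null : field);
        }
        return fields;
    }

    public static void main(String[] args) {
        // First data row of the sample input: Baths and Garage are blank.
        List<String> row = parseRow("142,160,28,10,5,,60,0.28,3167,A,");
        System.out.println(row.size() + " fields, Baths=" + row.get(5) + ", Garage=" + row.get(10));
    }
}
```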