Show Unique Values in Column
Updated: Aug 10, 2023
This example analyzes a dataset to determine the unique values in each column along with their respective counts. Knowing which distinct values occur, and how often, supports data profiling and exploration tasks such as data quality assessment, anomaly detection, and understanding data patterns. By identifying unique values and their frequencies, you can gain insight into the characteristics of your data, which in turn informs data cleansing and decision-making.
Input CSV File
Sell,List,Living,Rooms,Beds,Baths,Age,Acres,Taxes,Rating,Garage
142,160,28,10,5,,60,0.28,3167,A,
1,193,28,11,,3,60,0.28,3168,B,
10,110,28,,5,3,60,0.28,3169,B,
3,132,,12,5,3,60,0.28,3170,C,
5,,180,13,5,3,60,0.28,3171,C,
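Several fields in this sample are blank, including the trailing Garage field on every row, and they appear as null in the console output of this example. The following is a minimal sketch in plain Java, not DataPipeline code, of how blank fields in one row can be mapped to null. The class BlankFieldsSketch and the helper parseRow are hypothetical names introduced only for illustration, and the naive comma split assumes the data has no quoted or escaped fields.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BlankFieldsSketch {

    // Hypothetical helper (not part of DataPipeline): splits one CSV line into a
    // field-name -> value map, storing blank fields as null.
    // Assumes no quoted or escaped commas in the data.
    public static Map<String, String> parseRow(String[] header, String line) {
        // The -1 limit keeps trailing empty strings, so the final blank field is not dropped.
        String[] values = line.split(",", -1);
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < header.length; i++) {
            String value = i < values.length ? values[i] : "";
            row.put(header[i], value.isEmpty() ? null : value);
        }
        return row;
    }

    public static void main(String[] args) {
        String[] header = "Sell,List,Living,Rooms,Beds,Baths,Age,Acres,Taxes,Rating,Garage".split(",");
        // First data row of the sample: Baths and Garage are blank.
        Map<String, String> row = parseRow(header, "142,160,28,10,5,,60,0.28,3167,A,");
        System.out.println(row);
    }
}
```

Without the -1 limit, String.split discards trailing empty strings, so the Garage field would be missing from the row instead of being recorded as a blank.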
Java Code Listing
package com.northconcepts.datapipeline.foundations.examples.pipeline;

import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

import com.northconcepts.datapipeline.foundations.file.LocalFileSource;
import com.northconcepts.datapipeline.foundations.pipeline.Pipeline;
import com.northconcepts.datapipeline.foundations.pipeline.dataset.Column;
import com.northconcepts.datapipeline.foundations.pipeline.dataset.Dataset;
import com.northconcepts.datapipeline.foundations.pipeline.dataset.MemoryDataset;
import com.northconcepts.datapipeline.foundations.pipeline.input.CsvPipelineInput;

public class ShowUniqueValuesInColumn {

    public static void main(String[] args) {
        // Read Listing.csv, using its first row as field names
        Pipeline pipeline = new Pipeline();
        CsvPipelineInput pipelineInput = new CsvPipelineInput()
                .setFileSource(new LocalFileSource().setPath("data/input/Listing.csv"))
                .setFieldNamesInFirstRow(true);
        pipeline.setInput(pipelineInput);

        // Load the pipeline into memory and collect unique values per column
        Dataset dataset = new MemoryDataset(pipeline).setCollectUniqueValues(true);
        dataset.load().waitForRecordsToLoad();
        dataset.load().waitForColumnStatsToLoad();

        for (Column column : dataset.getColumns()) {
            System.out.println("Column Name: " + column.getName());
            // The source listing is cut off after "for (Map.Entry". The rest of this loop is
            // reconstructed from the walkthrough and console output; the entry type is
            // assumed from the imports.
            for (Map.Entry<Object, LongAdder> entry : column.getUniqueValuesByCount().entrySet()) {
                System.out.println(entry.getKey() + ": " + entry.getValue());
            }
            System.out.println("===============================");
        }
    }
}
Code Walkthrough
- A Pipeline object is created.
- A CsvPipelineInput instance is created to read data from the input file Listing.csv and is set as the input of the pipeline object.
- The setFieldNamesInFirstRow(true) method is invoked to specify that the values in the first row should be used as field names.
- A MemoryDataset is created to hold the pipeline's data in memory. .setCollectUniqueValues(true) enables the collection of unique values within the dataset.
- Record and column information is loaded using the load().waitForRecordsToLoad() and load().waitForColumnStatsToLoad() methods.
- The columns are then iterated using a for loop. For each column, the column.getUniqueValuesByCount() method is invoked, and the key (unique value) and value (count) of each resulting entry are printed to the console.
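To make the idea behind the last step concrete, here is a minimal sketch in plain Java of tallying unique values and their counts for a single column. It does not use DataPipeline and is not the library's implementation. The class UniqueValueCountSketch and the helper countUniqueValues are hypothetical names, and the Map<Object, LongAdder> result type is an assumption suggested by the imports in the listing.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

public class UniqueValueCountSketch {

    // Hypothetical helper (not a DataPipeline API): tallies how often each
    // distinct value occurs. null is counted like any other value.
    public static Map<Object, LongAdder> countUniqueValues(List<?> values) {
        // LinkedHashMap keeps first-seen order and permits a null key.
        Map<Object, LongAdder> counts = new LinkedHashMap<>();
        for (Object value : values) {
            counts.computeIfAbsent(value, key -> new LongAdder()).increment();
        }
        return counts;
    }

    public static void main(String[] args) {
        // The Beds column of Listing.csv in file order; the blank field is represented as null.
        List<String> beds = Arrays.asList("5", null, "5", "5", "5");
        for (Map.Entry<Object, LongAdder> entry : countUniqueValues(beds).entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

This sketch iterates in first-seen order, which may differ from the order in which the library returns its entries.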
Console Output
Column Name: Sell
1: 1
3: 1
5: 1
142: 1
10: 1
===============================
Column Name: List
132: 1
110: 1
null: 1
160: 1
193: 1
===============================
Column Name: Living
28: 3
null: 1
180: 1
===============================
Column Name: Rooms
11: 1
12: 1
13: 1
null: 1
10: 1
===============================
Column Name: Beds
5: 4
null: 1
===============================
Column Name: Baths
3: 4
null: 1
===============================
Column Name: Age
60: 5
===============================
Column Name: Acres
0.28: 5
===============================
Column Name: Taxes
3170: 1
3169: 1
3168: 1
3167: 1
3171: 1
===============================
Column Name: Rating
B: 2
C: 2
A: 1
===============================
Column Name: Garage
null: 5
===============================
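Counts like these can feed simple data quality checks. The following plain-Java sketch classifies a column from its value-to-count map, for example to flag columns that are entirely null or hold a single constant value. The class ColumnProfileSketch, the helper describe, and the category labels are hypothetical and introduced for illustration only. The counts in main are copied by hand from the console output above, not read from DataPipeline.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ColumnProfileSketch {

    // Hypothetical helper: labels a column based on its value -> count map.
    public static String describe(Map<Object, Long> counts) {
        long total = 0;
        for (long count : counts.values()) {
            total += count;
        }
        if (total == 0) return "empty";
        long nulls = counts.getOrDefault(null, 0L);
        if (nulls == total) return "all null";
        if (counts.size() == 1) return "constant";
        if (counts.size() == total) return "all distinct";
        return "mixed";
    }

    public static void main(String[] args) {
        Map<Object, Long> garage = new LinkedHashMap<>();
        garage.put(null, 5L);                       // Garage: null: 5

        Map<Object, Long> age = new LinkedHashMap<>();
        age.put("60", 5L);                          // Age: 60: 5

        Map<Object, Long> rating = new LinkedHashMap<>();
        rating.put("B", 2L);                        // Rating: B: 2, C: 2, A: 1
        rating.put("C", 2L);
        rating.put("A", 1L);

        System.out.println("Garage: " + describe(garage));
        System.out.println("Age: " + describe(age));
        System.out.println("Rating: " + describe(rating));
    }
}
```

In this sample, Garage would be flagged as all null and Age as constant, which are typical candidates for review during data cleansing.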