Show Column Statistics
With this example, you can learn how to use Data Pipeline to get column statistics and insights for data analysis and exploration. It offers functionalities such as counting null and non-null values, determining the data types of columns, and calculating the maximum and minimum lengths of column data.
Real-life use cases for this example could include data quality assessment, data profiling, and data preprocessing tasks. Users can leverage column statistics to identify missing or inconsistent data, understand the distribution of values in different columns, and gain insights into the structure and characteristics of their datasets. This information can be valuable for data cleaning, data validation, and making informed decisions in data-driven applications and analytics processes.
Input CSV file
Sell,List,Living,Rooms,Beds,Baths,Age,Acres,Taxes,Rating,Garage 142,160,28,10,5,,60,0.28,3167,A, 1,193,28,11,,3,60,0.28,3168,B, 10,110,28,,5,3,60,0.28,3169,B, 3,132,,12,5,3,60,0.28,3170,C, 5,,180,13,5,3,60,0.28,3171,C,
Java Code
package com.northconcepts.datapipeline.foundations.examples.pipeline; import java.util.Map.Entry; import java.util.concurrent.atomic.LongAdder; import com.northconcepts.datapipeline.core.FieldType; import com.northconcepts.datapipeline.foundations.file.LocalFileSource; import com.northconcepts.datapipeline.foundations.pipeline.Pipeline; import com.northconcepts.datapipeline.foundations.pipeline.dataset.Column; import com.northconcepts.datapipeline.foundations.pipeline.dataset.Dataset; import com.northconcepts.datapipeline.foundations.pipeline.dataset.MemoryDataset; import com.northconcepts.datapipeline.foundations.pipeline.input.CsvPipelineInput; import com.northconcepts.datapipeline.foundations.time.DateTimePattern; public class ShowColumnStatistics { public static void main(String[] args) { CsvPipelineInput pipelineInput = new CsvPipelineInput() .setFileSource(new LocalFileSource().setPath("data/input/Listing.csv")) .setFieldNamesInFirstRow(true); Pipeline pipeline = new Pipeline(); pipeline.setInput(pipelineInput); Dataset dataset = new MemoryDataset(pipeline); dataset.load().waitForColumnStatsToLoad(); System.out.println("Column Count: " + dataset.getColumnCount()); System.out.println("Record Count: " + dataset.getRecordCount()); for(Column column : dataset.getColumns()) { System.out.println("Name: " + column.getName()); System.out.println("Value Count: " + column.getValueCount()); System.out.println("Null Count: " + column.getNullCount()); System.out.println("Non Null Count: " + column.getNonNullCount()); System.out.println("Blank Count: " + column.getBlankCount()); System.out.println("Non Null Non Blank Count: " + column.getNonNullNonBlankCount()); System.out.println("Unique Value Count: " + column.getUniqueValueCount()); System.out.println("Is Numeric Column: " + column.isNumeric()); System.out.println(" Inferred Numeric Value Count: " + column.getInferredNumericValueCount()); if (column.isNumeric()) { System.out.println(" " + column.getNumberDescriptor() + " -- " + column.getNumberDescriptor().getFieldType()); } System.out.println("Is Temporal Column: " + column.isTemporal()); System.out.println(" Inferred Temporal Value Count: " + column.getInferredTemporalValueCount()); if (column.isTemporal()) { for (Entryentry : column.getTemporalPatterns().entrySet()) { System.out.println(" " + entry.getKey().getPattern() + " -- " + entry.getValue().longValue()); } } System.out.println("Is Boolean Column: " + column.isBoolean()); System.out.println(" Inferred Boolean Value Count: " + column.getInferredBooleanValueCount()); System.out.println("Is Array Column: " + column.isArray()); if(column.isArray()){ System.out.println(" Array Value Count: " + column.getArrayValueCount()); System.out.println(" Minimum Array Elements: " + column.getMinimumArrayElements()); System.out.println(" Maximum Array Elements: " + column.getMaximumArrayElements()); } System.out.println("Minimum Length: " + column.getMinimumLength()); System.out.println("Maximum Length: " + column.getMaximumLength()); System.out.println("Length Sum: " + column.getLengthSum()); System.out.println("Average Length: " + column.getAverageLength()); System.out.println("Sample Value: " + column.getSampleValue()); System.out.println("Inferred Type Name: " + column.getInferredTypeName()); System.out.println("Inferred Field Type: " + column.getInferredFieldType()); System.out.println("Field Type: " + column.getFieldType()); System.out.println("Field Type Count: " + column.getFieldTypeCount()); System.out.println("Best Fit Field Type: " + column.getBestFitFieldType()); for (Entry entry : column.getFieldTypes().entrySet()) { System.out.println(" " + entry.getKey() + " -- " + entry.getValue().longValue()); } System.out.println("Null Values For Field Types:"); for (Entry entry : column.getNullValueFieldTypes().entrySet()) { System.out.println(" " + entry.getKey() + " -- " + entry.getValue().longValue()); } System.out.println("==============================="); } } }
Code Walkthrough
- After creating a Pipeline using the provided input file, it is subsequently loaded into a MemoryDataset to facilitate additional processing.
- The column statistics are then extracted from the dataset, making them accessible for further utilization.
- Various statistics pertaining to each column in the Dataset are printed on the console.
Console Output
14:07:23,509 DEBUG [main] datapipeline:37 - DataPipeline v8.1.0-SNAPSHOT by North Concepts Inc. 14:07:23,710 DEBUG [job-1] datapipeline:615 - Job[1,job-1,Mon Jun 19 14:07:23 EEST 2023]::Start 14:07:24,352 DEBUG [job-1] datapipeline:661 - job::Success Column Count: 11 Record Count: 5 Name: Sell Value Count: 5 Null Count: 0 Non Null Count: 5 Blank Count: 0 Non Null Non Blank Count: 5 Unique Value Count: 0 Is Numeric Column: true Inferred Numeric Value Count: 5 NumberDescriptor [negative=false, wholeDigits=3, fractionDigits=0, fieldType=null] -- INT Is Temporal Column: false Inferred Temporal Value Count: 0 Is Boolean Column: false Inferred Boolean Value Count: 0 Is Array Column: false Minimum Length: 1 Maximum Length: 3 Length Sum: 8 Average Length: 1 Sample Value: 1 Inferred Type Name: Int Inferred Field Type: INT Field Type: STRING Field Type Count: 1 Best Fit Field Type: STRING STRING -- 5 Null Values For Field Types: =============================== Name: List Value Count: 5 Null Count: 1 Non Null Count: 4 Blank Count: 0 Non Null Non Blank Count: 4 Unique Value Count: 0 Is Numeric Column: true Inferred Numeric Value Count: 4 NumberDescriptor [negative=false, wholeDigits=3, fractionDigits=0, fieldType=null] -- INT Is Temporal Column: false Inferred Temporal Value Count: 0 Is Boolean Column: false Inferred Boolean Value Count: 0 Is Array Column: false Minimum Length: 3 Maximum Length: 3 Length Sum: 12 Average Length: 2 Sample Value: 193 Inferred Type Name: Int Inferred Field Type: INT Field Type: STRING Field Type Count: 1 Best Fit Field Type: STRING STRING -- 5 Null Values For Field Types: STRING -- 1 =============================== Name: Living Value Count: 5 Null Count: 1 Non Null Count: 4 Blank Count: 0 Non Null Non Blank Count: 4 Unique Value Count: 0 Is Numeric Column: true Inferred Numeric Value Count: 4 NumberDescriptor [negative=false, wholeDigits=3, fractionDigits=0, fieldType=null] -- INT Is Temporal Column: false Inferred Temporal Value Count: 0 Is Boolean Column: false Inferred Boolean Value Count: 0 Is Array Column: false Minimum Length: 2 Maximum Length: 3 Length Sum: 9 Average Length: 1 Sample Value: 28 Inferred Type Name: Int Inferred Field Type: INT Field Type: STRING Field Type Count: 1 Best Fit Field Type: STRING STRING -- 5 Null Values For Field Types: STRING -- 1 =============================== Name: Rooms Value Count: 5 Null Count: 1 Non Null Count: 4 Blank Count: 0 Non Null Non Blank Count: 4 Unique Value Count: 0 Is Numeric Column: true Inferred Numeric Value Count: 4 NumberDescriptor [negative=false, wholeDigits=2, fractionDigits=0, fieldType=null] -- INT Is Temporal Column: false Inferred Temporal Value Count: 0 Is Boolean Column: false Inferred Boolean Value Count: 0 Is Array Column: false Minimum Length: 2 Maximum Length: 2 Length Sum: 8 Average Length: 1 Sample Value: 11 Inferred Type Name: Int Inferred Field Type: INT Field Type: STRING Field Type Count: 1 Best Fit Field Type: STRING STRING -- 5 Null Values For Field Types: STRING -- 1 =============================== Name: Beds Value Count: 5 Null Count: 1 Non Null Count: 4 Blank Count: 0 Non Null Non Blank Count: 4 Unique Value Count: 0 Is Numeric Column: true Inferred Numeric Value Count: 4 NumberDescriptor [negative=false, wholeDigits=1, fractionDigits=0, fieldType=null] -- INT Is Temporal Column: false Inferred Temporal Value Count: 0 Is Boolean Column: false Inferred Boolean Value Count: 0 Is Array Column: false Minimum Length: 1 Maximum Length: 1 Length Sum: 4 Average Length: 0 Sample Value: 5 Inferred Type Name: Int Inferred Field Type: INT Field Type: STRING Field Type Count: 1 Best Fit Field Type: STRING STRING -- 5 Null Values For Field Types: STRING -- 1 =============================== Name: Baths Value Count: 5 Null Count: 1 Non Null Count: 4 Blank Count: 0 Non Null Non Blank Count: 4 Unique Value Count: 0 Is Numeric Column: true Inferred Numeric Value Count: 4 NumberDescriptor [negative=false, wholeDigits=1, fractionDigits=0, fieldType=null] -- INT Is Temporal Column: false Inferred Temporal Value Count: 0 Is Boolean Column: false Inferred Boolean Value Count: 0 Is Array Column: false Minimum Length: 1 Maximum Length: 1 Length Sum: 4 Average Length: 0 Sample Value: 3 Inferred Type Name: Int Inferred Field Type: INT Field Type: STRING Field Type Count: 1 Best Fit Field Type: STRING STRING -- 5 Null Values For Field Types: STRING -- 1 =============================== Name: Age Value Count: 5 Null Count: 0 Non Null Count: 5 Blank Count: 0 Non Null Non Blank Count: 5 Unique Value Count: 0 Is Numeric Column: true Inferred Numeric Value Count: 5 NumberDescriptor [negative=false, wholeDigits=2, fractionDigits=0, fieldType=null] -- INT Is Temporal Column: false Inferred Temporal Value Count: 0 Is Boolean Column: false Inferred Boolean Value Count: 0 Is Array Column: false Minimum Length: 2 Maximum Length: 2 Length Sum: 10 Average Length: 2 Sample Value: 60 Inferred Type Name: Int Inferred Field Type: INT Field Type: STRING Field Type Count: 1 Best Fit Field Type: STRING STRING -- 5 Null Values For Field Types: =============================== Name: Acres Value Count: 5 Null Count: 0 Non Null Count: 5 Blank Count: 0 Non Null Non Blank Count: 5 Unique Value Count: 0 Is Numeric Column: true Inferred Numeric Value Count: 5 NumberDescriptor [negative=false, wholeDigits=1, fractionDigits=2, fieldType=null] -- DOUBLE Is Temporal Column: false Inferred Temporal Value Count: 0 Is Boolean Column: false Inferred Boolean Value Count: 0 Is Array Column: false Minimum Length: 4 Maximum Length: 4 Length Sum: 20 Average Length: 4 Sample Value: 0.28 Inferred Type Name: Double Inferred Field Type: DOUBLE Field Type: STRING Field Type Count: 1 Best Fit Field Type: STRING STRING -- 5 Null Values For Field Types: =============================== Name: Taxes Value Count: 5 Null Count: 0 Non Null Count: 5 Blank Count: 0 Non Null Non Blank Count: 5 Unique Value Count: 0 Is Numeric Column: true Inferred Numeric Value Count: 5 NumberDescriptor [negative=false, wholeDigits=4, fractionDigits=0, fieldType=null] -- INT Is Temporal Column: true Inferred Temporal Value Count: 5 yyyy -- 5 Is Boolean Column: false Inferred Boolean Value Count: 0 Is Array Column: false Minimum Length: 4 Maximum Length: 4 Length Sum: 20 Average Length: 4 Sample Value: 3167 Inferred Type Name: Int Inferred Field Type: INT Field Type: STRING Field Type Count: 1 Best Fit Field Type: STRING STRING -- 5 Null Values For Field Types: =============================== Name: Rating Value Count: 5 Null Count: 0 Non Null Count: 5 Blank Count: 0 Non Null Non Blank Count: 5 Unique Value Count: 0 Is Numeric Column: false Inferred Numeric Value Count: 0 Is Temporal Column: false Inferred Temporal Value Count: 0 Is Boolean Column: false Inferred Boolean Value Count: 0 Is Array Column: false Minimum Length: 1 Maximum Length: 1 Length Sum: 5 Average Length: 1 Sample Value: A Inferred Type Name: String Inferred Field Type: STRING Field Type: STRING Field Type Count: 1 Best Fit Field Type: STRING STRING -- 5 Null Values For Field Types: =============================== Name: Garage Value Count: 5 Null Count: 5 Non Null Count: 0 Blank Count: 0 Non Null Non Blank Count: 0 Unique Value Count: 0 Is Numeric Column: false Inferred Numeric Value Count: 0 Is Temporal Column: false Inferred Temporal Value Count: 0 Is Boolean Column: false Inferred Boolean Value Count: 0 Is Array Column: false Minimum Length: -1 Maximum Length: -1 Length Sum: 0 Average Length: 0 Sample Value: null Inferred Type Name: String Inferred Field Type: STRING Field Type: STRING Field Type Count: 1 Best Fit Field Type: STRING STRING -- 5 Null Values For Field Types: STRING -- 5 ===============================