Show Column Statistics

With this example, you can learn how to use Data Pipeline to get column statistics and insights for data analysis and exploration. It offers functionalities such as counting null and non-null values, determining the data types of columns, and calculating the maximum and minimum lengths of column data.

Real-life use cases for this example could include data quality assessment, data profiling, and data preprocessing tasks. Users can leverage column statistics to identify missing or inconsistent data, understand the distribution of values in different columns, and gain insights into the structure and characteristics of their datasets. This information can be valuable for data cleaning, data validation, and making informed decisions in data-driven applications and analytics processes.

 

Input CSV file

Sell,List,Living,Rooms,Beds,Baths,Age,Acres,Taxes,Rating,Garage
142,160,28,10,5,,60,0.28,3167,A,
1,193,28,11,,3,60,0.28,3168,B,
10,110,28,,5,3,60,0.28,3169,B,
3,132,,12,5,3,60,0.28,3170,C,
5,,180,13,5,3,60,0.28,3171,C,

 

Java Code

package com.northconcepts.datapipeline.foundations.examples.pipeline;

import java.util.Map.Entry;
import java.util.concurrent.atomic.LongAdder;

import com.northconcepts.datapipeline.core.FieldType;
import com.northconcepts.datapipeline.foundations.file.LocalFileSource;
import com.northconcepts.datapipeline.foundations.pipeline.Pipeline;
import com.northconcepts.datapipeline.foundations.pipeline.dataset.Column;
import com.northconcepts.datapipeline.foundations.pipeline.dataset.Dataset;
import com.northconcepts.datapipeline.foundations.pipeline.dataset.MemoryDataset;
import com.northconcepts.datapipeline.foundations.pipeline.input.CsvPipelineInput;
import com.northconcepts.datapipeline.foundations.time.DateTimePattern;

public class ShowColumnStatistics {

    public static void main(String[] args) {

        CsvPipelineInput pipelineInput = new CsvPipelineInput()
                .setFileSource(new LocalFileSource().setPath("data/input/Listing.csv"))
                .setFieldNamesInFirstRow(true);

        Pipeline pipeline = new Pipeline();

        pipeline.setInput(pipelineInput);

        Dataset dataset = new MemoryDataset(pipeline);
        dataset.load().waitForColumnStatsToLoad();

        System.out.println("Column Count: " + dataset.getColumnCount());
        System.out.println("Record Count: " + dataset.getRecordCount());

        for(Column column : dataset.getColumns()) {
            System.out.println("Name: " + column.getName());
            System.out.println("Value Count: " + column.getValueCount());
            System.out.println("Null Count: " + column.getNullCount());
            System.out.println("Non Null Count: " + column.getNonNullCount());
            System.out.println("Blank Count: " + column.getBlankCount());
            System.out.println("Non Null Non Blank Count: " + column.getNonNullNonBlankCount());
            System.out.println("Unique Value Count: " + column.getUniqueValueCount());

            System.out.println("Is Numeric Column: " + column.isNumeric());
            System.out.println("    Inferred Numeric Value Count: " + column.getInferredNumericValueCount());
            if (column.isNumeric()) {
                System.out.println("    " + column.getNumberDescriptor() + " -- " + column.getNumberDescriptor().getFieldType());
            }

            System.out.println("Is Temporal Column: " + column.isTemporal());
            System.out.println("    Inferred Temporal Value Count: " + column.getInferredTemporalValueCount());
            if (column.isTemporal()) {
                for (Entry entry : column.getTemporalPatterns().entrySet()) {
                    System.out.println("    " + entry.getKey().getPattern() + "  --  " + entry.getValue().longValue());
                }
            }

            System.out.println("Is Boolean Column: " + column.isBoolean());
            System.out.println("    Inferred Boolean Value Count: " + column.getInferredBooleanValueCount());
            System.out.println("Is Array Column: " + column.isArray());
            if(column.isArray()){
                System.out.println("    Array Value Count: " + column.getArrayValueCount());
                System.out.println("    Minimum Array Elements: " + column.getMinimumArrayElements());
                System.out.println("    Maximum Array Elements: " + column.getMaximumArrayElements());
            }
            System.out.println("Minimum Length: " + column.getMinimumLength());
            System.out.println("Maximum Length: " + column.getMaximumLength());
            System.out.println("Length Sum: " + column.getLengthSum());
            System.out.println("Average Length: " + column.getAverageLength());
            System.out.println("Sample Value: " + column.getSampleValue());
            System.out.println("Inferred Type Name: " + column.getInferredTypeName());
            System.out.println("Inferred Field Type: " + column.getInferredFieldType());
            System.out.println("Field Type: " + column.getFieldType());
            System.out.println("Field Type Count: " + column.getFieldTypeCount());
            System.out.println("Best Fit Field Type: " + column.getBestFitFieldType());
            for (Entry entry : column.getFieldTypes().entrySet()) {
                System.out.println("    " + entry.getKey() + "  --  " + entry.getValue().longValue());
            }
            System.out.println("Null Values For Field Types:");
            for (Entry entry : column.getNullValueFieldTypes().entrySet()) {
                System.out.println("    " + entry.getKey() + "  --  " + entry.getValue().longValue());
            }
            System.out.println("===============================");

        }

    }
}

 

Code Walkthrough

  1. After creating a Pipeline using the provided input file, it is subsequently loaded into a MemoryDataset to facilitate additional processing.
  2. The column statistics are then extracted from the dataset, making them accessible for further utilization.
  3. Various statistics pertaining to each column in the Dataset are printed on the console.

 

Console Output

14:07:23,509 DEBUG [main] datapipeline:37 - DataPipeline v8.1.0-SNAPSHOT by North Concepts Inc.
14:07:23,710 DEBUG [job-1] datapipeline:615 - Job[1,job-1,Mon Jun 19 14:07:23 EEST 2023]::Start
14:07:24,352 DEBUG [job-1] datapipeline:661 - job::Success
Column Count: 11
Record Count: 5
Name: Sell
Value Count: 5
Null Count: 0
Non Null Count: 5
Blank Count: 0
Non Null Non Blank Count: 5
Unique Value Count: 0
Is Numeric Column: true
    Inferred Numeric Value Count: 5
    NumberDescriptor [negative=false, wholeDigits=3, fractionDigits=0, fieldType=null] -- INT
Is Temporal Column: false
    Inferred Temporal Value Count: 0
Is Boolean Column: false
    Inferred Boolean Value Count: 0
Is Array Column: false
Minimum Length: 1
Maximum Length: 3
Length Sum: 8
Average Length: 1
Sample Value: 1
Inferred Type Name: Int
Inferred Field Type: INT
Field Type: STRING
Field Type Count: 1
Best Fit Field Type: STRING
    STRING  --  5
Null Values For Field Types:
===============================
Name: List
Value Count: 5
Null Count: 1
Non Null Count: 4
Blank Count: 0
Non Null Non Blank Count: 4
Unique Value Count: 0
Is Numeric Column: true
    Inferred Numeric Value Count: 4
    NumberDescriptor [negative=false, wholeDigits=3, fractionDigits=0, fieldType=null] -- INT
Is Temporal Column: false
    Inferred Temporal Value Count: 0
Is Boolean Column: false
    Inferred Boolean Value Count: 0
Is Array Column: false
Minimum Length: 3
Maximum Length: 3
Length Sum: 12
Average Length: 2
Sample Value: 193
Inferred Type Name: Int
Inferred Field Type: INT
Field Type: STRING
Field Type Count: 1
Best Fit Field Type: STRING
    STRING  --  5
Null Values For Field Types:
    STRING  --  1
===============================
Name: Living
Value Count: 5
Null Count: 1
Non Null Count: 4
Blank Count: 0
Non Null Non Blank Count: 4
Unique Value Count: 0
Is Numeric Column: true
    Inferred Numeric Value Count: 4
    NumberDescriptor [negative=false, wholeDigits=3, fractionDigits=0, fieldType=null] -- INT
Is Temporal Column: false
    Inferred Temporal Value Count: 0
Is Boolean Column: false
    Inferred Boolean Value Count: 0
Is Array Column: false
Minimum Length: 2
Maximum Length: 3
Length Sum: 9
Average Length: 1
Sample Value: 28
Inferred Type Name: Int
Inferred Field Type: INT
Field Type: STRING
Field Type Count: 1
Best Fit Field Type: STRING
    STRING  --  5
Null Values For Field Types:
    STRING  --  1
===============================
Name: Rooms
Value Count: 5
Null Count: 1
Non Null Count: 4
Blank Count: 0
Non Null Non Blank Count: 4
Unique Value Count: 0
Is Numeric Column: true
    Inferred Numeric Value Count: 4
    NumberDescriptor [negative=false, wholeDigits=2, fractionDigits=0, fieldType=null] -- INT
Is Temporal Column: false
    Inferred Temporal Value Count: 0
Is Boolean Column: false
    Inferred Boolean Value Count: 0
Is Array Column: false
Minimum Length: 2
Maximum Length: 2
Length Sum: 8
Average Length: 1
Sample Value: 11
Inferred Type Name: Int
Inferred Field Type: INT
Field Type: STRING
Field Type Count: 1
Best Fit Field Type: STRING
    STRING  --  5
Null Values For Field Types:
    STRING  --  1
===============================
Name: Beds
Value Count: 5
Null Count: 1
Non Null Count: 4
Blank Count: 0
Non Null Non Blank Count: 4
Unique Value Count: 0
Is Numeric Column: true
    Inferred Numeric Value Count: 4
    NumberDescriptor [negative=false, wholeDigits=1, fractionDigits=0, fieldType=null] -- INT
Is Temporal Column: false
    Inferred Temporal Value Count: 0
Is Boolean Column: false
    Inferred Boolean Value Count: 0
Is Array Column: false
Minimum Length: 1
Maximum Length: 1
Length Sum: 4
Average Length: 0
Sample Value: 5
Inferred Type Name: Int
Inferred Field Type: INT
Field Type: STRING
Field Type Count: 1
Best Fit Field Type: STRING
    STRING  --  5
Null Values For Field Types:
    STRING  --  1
===============================
Name: Baths
Value Count: 5
Null Count: 1
Non Null Count: 4
Blank Count: 0
Non Null Non Blank Count: 4
Unique Value Count: 0
Is Numeric Column: true
    Inferred Numeric Value Count: 4
    NumberDescriptor [negative=false, wholeDigits=1, fractionDigits=0, fieldType=null] -- INT
Is Temporal Column: false
    Inferred Temporal Value Count: 0
Is Boolean Column: false
    Inferred Boolean Value Count: 0
Is Array Column: false
Minimum Length: 1
Maximum Length: 1
Length Sum: 4
Average Length: 0
Sample Value: 3
Inferred Type Name: Int
Inferred Field Type: INT
Field Type: STRING
Field Type Count: 1
Best Fit Field Type: STRING
    STRING  --  5
Null Values For Field Types:
    STRING  --  1
===============================
Name: Age
Value Count: 5
Null Count: 0
Non Null Count: 5
Blank Count: 0
Non Null Non Blank Count: 5
Unique Value Count: 0
Is Numeric Column: true
    Inferred Numeric Value Count: 5
    NumberDescriptor [negative=false, wholeDigits=2, fractionDigits=0, fieldType=null] -- INT
Is Temporal Column: false
    Inferred Temporal Value Count: 0
Is Boolean Column: false
    Inferred Boolean Value Count: 0
Is Array Column: false
Minimum Length: 2
Maximum Length: 2
Length Sum: 10
Average Length: 2
Sample Value: 60
Inferred Type Name: Int
Inferred Field Type: INT
Field Type: STRING
Field Type Count: 1
Best Fit Field Type: STRING
    STRING  --  5
Null Values For Field Types:
===============================
Name: Acres
Value Count: 5
Null Count: 0
Non Null Count: 5
Blank Count: 0
Non Null Non Blank Count: 5
Unique Value Count: 0
Is Numeric Column: true
    Inferred Numeric Value Count: 5
    NumberDescriptor [negative=false, wholeDigits=1, fractionDigits=2, fieldType=null] -- DOUBLE
Is Temporal Column: false
    Inferred Temporal Value Count: 0
Is Boolean Column: false
    Inferred Boolean Value Count: 0
Is Array Column: false
Minimum Length: 4
Maximum Length: 4
Length Sum: 20
Average Length: 4
Sample Value: 0.28
Inferred Type Name: Double
Inferred Field Type: DOUBLE
Field Type: STRING
Field Type Count: 1
Best Fit Field Type: STRING
    STRING  --  5
Null Values For Field Types:
===============================
Name: Taxes
Value Count: 5
Null Count: 0
Non Null Count: 5
Blank Count: 0
Non Null Non Blank Count: 5
Unique Value Count: 0
Is Numeric Column: true
    Inferred Numeric Value Count: 5
    NumberDescriptor [negative=false, wholeDigits=4, fractionDigits=0, fieldType=null] -- INT
Is Temporal Column: true
    Inferred Temporal Value Count: 5
    yyyy  --  5
Is Boolean Column: false
    Inferred Boolean Value Count: 0
Is Array Column: false
Minimum Length: 4
Maximum Length: 4
Length Sum: 20
Average Length: 4
Sample Value: 3167
Inferred Type Name: Int
Inferred Field Type: INT
Field Type: STRING
Field Type Count: 1
Best Fit Field Type: STRING
    STRING  --  5
Null Values For Field Types:
===============================
Name: Rating
Value Count: 5
Null Count: 0
Non Null Count: 5
Blank Count: 0
Non Null Non Blank Count: 5
Unique Value Count: 0
Is Numeric Column: false
    Inferred Numeric Value Count: 0
Is Temporal Column: false
    Inferred Temporal Value Count: 0
Is Boolean Column: false
    Inferred Boolean Value Count: 0
Is Array Column: false
Minimum Length: 1
Maximum Length: 1
Length Sum: 5
Average Length: 1
Sample Value: A
Inferred Type Name: String
Inferred Field Type: STRING
Field Type: STRING
Field Type Count: 1
Best Fit Field Type: STRING
    STRING  --  5
Null Values For Field Types:
===============================
Name: Garage
Value Count: 5
Null Count: 5
Non Null Count: 0
Blank Count: 0
Non Null Non Blank Count: 0
Unique Value Count: 0
Is Numeric Column: false
    Inferred Numeric Value Count: 0
Is Temporal Column: false
    Inferred Temporal Value Count: 0
Is Boolean Column: false
    Inferred Boolean Value Count: 0
Is Array Column: false
Minimum Length: -1
Maximum Length: -1
Length Sum: 0
Average Length: 0
Sample Value: null
Inferred Type Name: String
Inferred Field Type: STRING
Field Type: STRING
Field Type Count: 1
Best Fit Field Type: STRING
    STRING  --  5
Null Values For Field Types:
    STRING  --  5
===============================
Mobile Analytics