Convert CSV to Inferred Data Types

This example demonstrates how to automatically detect and convert CSV field values using DataPipeline’s data type inference feature.  DataPipeline converts the text values in a CSV file to numbers, dates/time, booleans, etc.  This approach helps with data ingestion by automating the crucial step of data profiling and schema definition.

 

Input CSV file

Type-BOOLEAN,Type-CHAR,Type-DATE,Type-DATETIME,Type-DOUBLE,Type-INT,Type-LONG,Type-STRING,Type-TIME
true,A,2014-12-25,Fri Dec 26 02:41:57 PST 2014,2048.1024,819,1152921504606846976,A basic numeric constant is considered an integer.,13:41:57
true,B,2015-12-26,Sat Dec 27 04:42:59 PST 2014,2049.1027542352,829,1152921504606846977,A basic numeric constant is considered an integer.,13:42:58
false,C,2018-12-27,Sun Dec 28 03:43:58 PST 2014,2050.1026333,83,1152921504606846978,A basic numeric constant is considered an integer.,14:43:59

 

Java Code

package com.northconcepts.datapipeline.foundations.examples.pipeline;

import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.csv.CSVReader;
import com.northconcepts.datapipeline.foundations.pipeline.Pipeline;
import com.northconcepts.datapipeline.foundations.pipeline.dataset.Dataset;
import com.northconcepts.datapipeline.foundations.pipeline.dataset.MemoryDataset;
import com.northconcepts.datapipeline.foundations.schema.EntityDef;
import com.northconcepts.datapipeline.foundations.tools.GenerateEntityFromDataset;

import java.io.File;

public class ConvertCsvToInferredDataTypes {
    public static void main(String[] args) {

        Pipeline pipeline = new Pipeline();
        try (Dataset dataset = new MemoryDataset(pipeline)) {

            pipeline.setInputAsDataReaderFactory(() -> new CSVReader(new File("example/data/input/mix-types.csv")).setFieldNamesInFirstRow(true));
            dataset.load().waitForColumnStatsToLoad();

            GenerateEntityFromDataset generateEntityFromDataset = new GenerateEntityFromDataset();
            generateEntityFromDataset.setFieldTypeSelectionMode(GenerateEntityFromDataset.FieldTypeSelectionMode.BEST_FIT_INFERRED);
            EntityDef entity = generateEntityFromDataset.createEntity(dataset);
            pipeline.setTargetEntity(entity);

            pipeline.setOutputAsDataWriterFactory(() -> StreamWriter.newSystemOutWriter());
            pipeline.run();
        }
    }
}

 

Code Walkthrough

  1. After creating a Pipeline using the provided input file reader, we create an in-memory representation of the data using MemoryDataset.
  2. dataset.load().waitForColumnStatsToLoad() executes the data loading and calculates column statistics to be used for type inference
  3. Column names and their data types are automatically determined based on the column statistics.
  4. Generate an EntityDef. - a schema containing the inferred columns.
  5. pipeline.setTargetEntity(entity) sets the EntityDef as the schema for the pipeline's output
  6. The pipeline's output destination is set to the system console
  7. pipeline.run() executes the entire Pipeline, reading the data, converting it to the structure defined by the inferred EntityDef, and printing the resulting records to the console.

 

Console Output

-----------------------------------------------
0 - Record (MODIFIED) {
    0:[Type-BOOLEAN]:BOOLEAN=[true]:Boolean
    1:[Type-CHAR]:STRING=[A]:String
    2:[Type-DATE]:DATE=[2014-12-25]:Date
    3:[Type-DATETIME]:DATETIME=[Fri Dec 26 02:41:57 PST 2014]:Date
    4:[Type-DOUBLE]:DOUBLE=[2048.1024]:Double
    5:[Type-INT]:INT=[819]:Integer
    6:[Type-LONG]:LONG=[1152921504606846976]:Long
    7:[Type-STRING]:STRING=[A basic numeric constant is considered an integer.]:String
    8:[Type-TIME]:TIME=[13:41:57]:Time
}

-----------------------------------------------
1 - Record (MODIFIED) {
    0:[Type-BOOLEAN]:BOOLEAN=[true]:Boolean
    1:[Type-CHAR]:STRING=[B]:String
    2:[Type-DATE]:DATE=[2015-12-26]:Date
    3:[Type-DATETIME]:DATETIME=[Sat Dec 27 04:42:59 PST 2014]:Date
    4:[Type-DOUBLE]:DOUBLE=[2049.1027542352]:Double
    5:[Type-INT]:INT=[829]:Integer
    6:[Type-LONG]:LONG=[1152921504606846977]:Long
    7:[Type-STRING]:STRING=[A basic numeric constant is considered an integer.]:String
    8:[Type-TIME]:TIME=[13:42:58]:Time
}

-----------------------------------------------
2 - Record (MODIFIED) {
    0:[Type-BOOLEAN]:BOOLEAN=[false]:Boolean
    1:[Type-CHAR]:STRING=[C]:String
    2:[Type-DATE]:DATE=[2018-12-27]:Date
    3:[Type-DATETIME]:DATETIME=[Sun Dec 28 03:43:58 PST 2014]:Date
    4:[Type-DOUBLE]:DOUBLE=[2050.1026333]:Double
    5:[Type-INT]:INT=[83]:Integer
    6:[Type-LONG]:LONG=[1152921504606846978]:Long
    7:[Type-STRING]:STRING=[A basic numeric constant is considered an integer.]:String
    8:[Type-TIME]:TIME=[14:43:59]:Time
}

-----------------------------------------------
3 records
Mobile Analytics