Convert CSV to Inferred Data Types
Updated: Oct 18, 2025
This example demonstrates how to automatically detect and convert CSV field values using DataPipeline’s data type inference feature. DataPipeline converts the text values in a CSV file to numbers, dates, times, booleans, and other types. This helps with data ingestion by automating two crucial steps: data profiling and schema definition.
Input CSV file
Type-BOOLEAN,Type-CHAR,Type-DATE,Type-DATETIME,Type-DOUBLE,Type-INT,Type-LONG,Type-STRING,Type-TIME
true,A,2014-12-25,Fri Dec 26 02:41:57 PST 2014,2048.1024,819,1152921504606846976,A basic numeric constant is considered an integer.,13:41:57
true,B,2015-12-26,Sat Dec 27 04:42:59 PST 2014,2049.1027542352,829,1152921504606846977,A basic numeric constant is considered an integer.,13:42:58
false,C,2018-12-27,Sun Dec 28 03:43:58 PST 2014,2050.1026333,83,1152921504606846978,A basic numeric constant is considered an integer.,14:43:59
Java Code
package com.northconcepts.datapipeline.foundations.examples.pipeline;

import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.csv.CSVReader;
import com.northconcepts.datapipeline.foundations.pipeline.Pipeline;
import com.northconcepts.datapipeline.foundations.pipeline.dataset.Dataset;
import com.northconcepts.datapipeline.foundations.pipeline.dataset.MemoryDataset;
import com.northconcepts.datapipeline.foundations.schema.EntityDef;
import com.northconcepts.datapipeline.foundations.tools.GenerateEntityFromDataset;

import java.io.File;

public class ConvertCsvToInferredDataTypes {

    public static void main(String[] args) {
        Pipeline pipeline = new Pipeline();

        try (Dataset dataset = new MemoryDataset(pipeline)) {
            pipeline.setInputAsDataReaderFactory(() -> new CSVReader(new File("example/data/input/mix-types.csv"))
                    .setFieldNamesInFirstRow(true));

            dataset.load().waitForColumnStatsToLoad();

            GenerateEntityFromDataset generateEntityFromDataset = new GenerateEntityFromDataset();
            generateEntityFromDataset.setFieldTypeSelectionMode(
                    GenerateEntityFromDataset.FieldTypeSelectionMode.BEST_FIT_INFERRED);
            EntityDef entity = generateEntityFromDataset.createEntity(dataset);

            pipeline.setTargetEntity(entity);
            pipeline.setOutputAsDataWriterFactory(() -> StreamWriter.newSystemOutWriter());

            pipeline.run();
        }
    }
}
Code Walkthrough
- We create a Pipeline that reads the input CSV file, and a MemoryDataset that holds an in-memory representation of the pipeline's data.
- dataset.load().waitForColumnStatsToLoad() loads the data and calculates the column statistics used for type inference.
- Column names and their data types are automatically determined based on the column statistics.
- GenerateEntityFromDataset.createEntity(dataset) generates an EntityDef, a schema containing the inferred columns.
- pipeline.setTargetEntity(entity) sets the EntityDef as the schema for the pipeline's output.
- pipeline.setOutputAsDataWriterFactory(...) sets the pipeline's output destination to the system console.
- pipeline.run() executes the entire Pipeline: it reads the data, converts it to the structure defined by the inferred EntityDef, and prints the resulting records to the console.
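To make the idea of "best fit" inference more concrete, the following is a minimal, JDK-only sketch of how a column's type could be chosen from its text values. It is not DataPipeline's implementation: the class TypeInferenceSketch, the method inferType, and the candidate ordering are all hypothetical, and the sketch leaves out the DATETIME case entirely. It only illustrates the general technique of trying candidate types from most to least specific and falling back to STRING.

```java
import java.time.LocalDate;
import java.time.LocalTime;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical illustration only; not part of the DataPipeline API.
class TypeInferenceSketch {

    // Candidate types in order from most to least specific (assumed ordering).
    private static final Map<String, Predicate<String>> CANDIDATES = new LinkedHashMap<>();

    static {
        CANDIDATES.put("BOOLEAN", v -> v.equalsIgnoreCase("true") || v.equalsIgnoreCase("false"));
        CANDIDATES.put("INT", v -> parses(() -> Integer.parseInt(v)));
        CANDIDATES.put("LONG", v -> parses(() -> Long.parseLong(v)));
        CANDIDATES.put("DOUBLE", v -> parses(() -> Double.parseDouble(v)));
        CANDIDATES.put("DATE", v -> parses(() -> LocalDate.parse(v)));   // ISO yyyy-MM-dd
        CANDIDATES.put("TIME", v -> parses(() -> LocalTime.parse(v)));   // ISO HH:mm[:ss]
    }

    // Returns true if the parser runs without a parse failure.
    private static boolean parses(Runnable parser) {
        try {
            parser.run();
            return true;
        } catch (NumberFormatException | DateTimeParseException e) {
            return false;
        }
    }

    // Picks the first candidate type that every value in the column matches;
    // falls back to STRING when no candidate fits or the column is empty.
    static String inferType(List<String> values) {
        if (!values.isEmpty()) {
            for (Map.Entry<String, Predicate<String>> candidate : CANDIDATES.entrySet()) {
                if (values.stream().allMatch(candidate.getValue())) {
                    return candidate.getKey();
                }
            }
        }
        return "STRING";
    }

    public static void main(String[] args) {
        // A few columns taken from the input CSV file above.
        Map<String, List<String>> columns = new LinkedHashMap<>();
        columns.put("Type-BOOLEAN", Arrays.asList("true", "true", "false"));
        columns.put("Type-CHAR", Arrays.asList("A", "B", "C"));
        columns.put("Type-DATE", Arrays.asList("2014-12-25", "2015-12-26", "2018-12-27"));
        columns.put("Type-INT", Arrays.asList("819", "829", "83"));
        columns.put("Type-LONG", Arrays.asList("1152921504606846976", "1152921504606846977"));

        columns.forEach((name, values) -> System.out.println(name + " -> " + inferType(values)));
    }
}
```

The ordering matters in this sketch: every INT value also parses as a LONG and a DOUBLE, so the narrower types have to be tried first. A real implementation would also need to handle nulls, blank values, and additional date and time patterns.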
Console Output
-----------------------------------------------
0 - Record (MODIFIED) {
    0:[Type-BOOLEAN]:BOOLEAN=[true]:Boolean
    1:[Type-CHAR]:STRING=[A]:String
    2:[Type-DATE]:DATE=[2014-12-25]:Date
    3:[Type-DATETIME]:DATETIME=[Fri Dec 26 02:41:57 PST 2014]:Date
    4:[Type-DOUBLE]:DOUBLE=[2048.1024]:Double
    5:[Type-INT]:INT=[819]:Integer
    6:[Type-LONG]:LONG=[1152921504606846976]:Long
    7:[Type-STRING]:STRING=[A basic numeric constant is considered an integer.]:String
    8:[Type-TIME]:TIME=[13:41:57]:Time
}
-----------------------------------------------
1 - Record (MODIFIED) {
    0:[Type-BOOLEAN]:BOOLEAN=[true]:Boolean
    1:[Type-CHAR]:STRING=[B]:String
    2:[Type-DATE]:DATE=[2015-12-26]:Date
    3:[Type-DATETIME]:DATETIME=[Sat Dec 27 04:42:59 PST 2014]:Date
    4:[Type-DOUBLE]:DOUBLE=[2049.1027542352]:Double
    5:[Type-INT]:INT=[829]:Integer
    6:[Type-LONG]:LONG=[1152921504606846977]:Long
    7:[Type-STRING]:STRING=[A basic numeric constant is considered an integer.]:String
    8:[Type-TIME]:TIME=[13:42:58]:Time
}
-----------------------------------------------
2 - Record (MODIFIED) {
    0:[Type-BOOLEAN]:BOOLEAN=[false]:Boolean
    1:[Type-CHAR]:STRING=[C]:String
    2:[Type-DATE]:DATE=[2018-12-27]:Date
    3:[Type-DATETIME]:DATETIME=[Sun Dec 28 03:43:58 PST 2014]:Date
    4:[Type-DOUBLE]:DOUBLE=[2050.1026333]:Double
    5:[Type-INT]:INT=[83]:Integer
    6:[Type-LONG]:LONG=[1152921504606846978]:Long
    7:[Type-STRING]:STRING=[A basic numeric constant is considered an integer.]:String
    8:[Type-TIME]:TIME=[14:43:59]:Time
}
-----------------------------------------------
3 records