Convert CSV to Inferred Data Types
Updated: Nov 13, 2025
This example demonstrates how to automatically detect and convert CSV field values using DataPipeline’s data type inference feature. DataPipeline converts the text values in a CSV file to numbers, dates/time, booleans, etc. This approach helps with data ingestion by automating the crucial step of data profiling and schema definition.
Input CSV file
Type-BOOLEAN,Type-CHAR,Type-DATE,Type-DATETIME,Type-DOUBLE,Type-INT,Type-LONG,Type-STRING,Type-TIME true,A,2014-12-25,Fri Dec 26 02:41:57 PST 2014,2048.1024,819,1152921504606846976,A basic numeric constant is considered an integer.,13:41:57 true,B,2015-12-26,Sat Dec 27 04:42:59 PST 2014,2049.1027542352,829,1152921504606846977,A basic numeric constant is considered an integer.,13:42:58 false,C,2018-12-27,Sun Dec 28 03:43:58 PST 2014,2050.1026333,83,1152921504606846978,A basic numeric constant is considered an integer.,14:43:59
Java Code
package com.northconcepts.datapipeline.foundations.examples.pipeline;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.csv.CSVReader;
import com.northconcepts.datapipeline.foundations.pipeline.Pipeline;
import com.northconcepts.datapipeline.foundations.pipeline.dataset.Dataset;
import com.northconcepts.datapipeline.foundations.pipeline.dataset.MemoryDataset;
import com.northconcepts.datapipeline.foundations.schema.EntityDef;
import com.northconcepts.datapipeline.foundations.tools.GenerateEntityFromDataset;
import java.io.File;
public class ConvertCsvToInferredDataTypes {
public static void main(String[] args) {
Pipeline pipeline = new Pipeline();
try (Dataset dataset = new MemoryDataset(pipeline)) {
pipeline.setInputAsDataReaderFactory(() -> new CSVReader(new File("example/data/input/mix-types.csv")).setFieldNamesInFirstRow(true));
dataset.load().waitForColumnStatsToLoad();
GenerateEntityFromDataset generateEntityFromDataset = new GenerateEntityFromDataset();
generateEntityFromDataset.setFieldTypeSelectionMode(GenerateEntityFromDataset.FieldTypeSelectionMode.BEST_FIT_INFERRED);
EntityDef entity = generateEntityFromDataset.createEntity(dataset);
pipeline.setTargetEntity(entity);
pipeline.setOutputAsDataWriterFactory(() -> StreamWriter.newSystemOutWriter());
pipeline.run();
}
}
}
Code Walkthrough
- After creating a Pipeline using the provided input file reader, we create an in-memory representation of the data using MemoryDataset.
dataset.load().waitForColumnStatsToLoad()executes the data loading and calculates column statistics to be used for type inference- Column names and their data types are automatically determined based on the column statistics.
- Generate an EntityDef. - a schema containing the inferred columns.
pipeline.setTargetEntity(entity)sets theEntityDefas the schema for the pipeline's output- The pipeline's output destination is set to the system console
pipeline.run()executes the entirePipeline, reading the data, converting it to the structure defined by the inferredEntityDef, and printing the resulting records to the console.
Console Output
-----------------------------------------------
0 - Record (MODIFIED) {
0:[Type-BOOLEAN]:BOOLEAN=[true]:Boolean
1:[Type-CHAR]:STRING=[A]:String
2:[Type-DATE]:DATE=[2014-12-25]:Date
3:[Type-DATETIME]:DATETIME=[Fri Dec 26 02:41:57 PST 2014]:Date
4:[Type-DOUBLE]:DOUBLE=[2048.1024]:Double
5:[Type-INT]:INT=[819]:Integer
6:[Type-LONG]:LONG=[1152921504606846976]:Long
7:[Type-STRING]:STRING=[A basic numeric constant is considered an integer.]:String
8:[Type-TIME]:TIME=[13:41:57]:Time
}
-----------------------------------------------
1 - Record (MODIFIED) {
0:[Type-BOOLEAN]:BOOLEAN=[true]:Boolean
1:[Type-CHAR]:STRING=[B]:String
2:[Type-DATE]:DATE=[2015-12-26]:Date
3:[Type-DATETIME]:DATETIME=[Sat Dec 27 04:42:59 PST 2014]:Date
4:[Type-DOUBLE]:DOUBLE=[2049.1027542352]:Double
5:[Type-INT]:INT=[829]:Integer
6:[Type-LONG]:LONG=[1152921504606846977]:Long
7:[Type-STRING]:STRING=[A basic numeric constant is considered an integer.]:String
8:[Type-TIME]:TIME=[13:42:58]:Time
}
-----------------------------------------------
2 - Record (MODIFIED) {
0:[Type-BOOLEAN]:BOOLEAN=[false]:Boolean
1:[Type-CHAR]:STRING=[C]:String
2:[Type-DATE]:DATE=[2018-12-27]:Date
3:[Type-DATETIME]:DATETIME=[Sun Dec 28 03:43:58 PST 2014]:Date
4:[Type-DOUBLE]:DOUBLE=[2050.1026333]:Double
5:[Type-INT]:INT=[83]:Integer
6:[Type-LONG]:LONG=[1152921504606846978]:Long
7:[Type-STRING]:STRING=[A basic numeric constant is considered an integer.]:String
8:[Type-TIME]:TIME=[14:43:59]:Time
}
-----------------------------------------------
3 records
