Set Default Values for Missing Data
Updated: Jun 24, 2023
This example illustrates how to make sure that missing data in a file is handled properly by setting default values.
Setting default values comes in handy for many different scenarios such as:
- Data Integration between two or more systems
- Data Preprocessing; validating the data before or after specific flows carried in a system
- Database Constraints; where incomplete data is often not allowed
Here is how you can do it.
Input File
ID, BOOLEAN1, STRING1, BYTE1, CHAR1, DATE1, DATETIME1, TIME1, INT1, LONG1, DOUBLE1, FLOAT1, SHORT1,BIGDECIMAL1,BIGINTEGER1
1, true, "hello world", 7, T, 1999-08-21, 1999-08-21T15:31:50.000Z, 15:31:50, 789456, 987654321, 123.456, 963.258, 17,92233720368547758079223372036854775807.92233720368547758079223372036854775807,92233720368547758079223372036854775807
2, false, goodbye world, 127, z, 1786-12-30, 1786-12-30T01:05:17.000Z, 1:05:17, "2147483647", "9223372036854775807", 1.79769313486231570e+308d, 3.40282346638528860e+38, "32767",-9406564584124654494065645841246544.9406564584124654494065645841246544,-92233720368547758079223372036854775807
3, , , , , , , , -2147483648, "-9223372036854775808", -4.94065645841246544e-324d, 1.40129846432481707e-45, -32768,0,0
4, , , , , , , , , , , , , ,
5
Java Code Listing
package com.northconcepts.datapipeline.foundations.examples.datamapping; import com.northconcepts.datapipeline.core.DataReader; import com.northconcepts.datapipeline.core.DataWriter; import com.northconcepts.datapipeline.core.FieldType; import com.northconcepts.datapipeline.core.StreamWriter; import com.northconcepts.datapipeline.csv.CSVReader; import com.northconcepts.datapipeline.foundations.schema.BooleanFieldDef; import com.northconcepts.datapipeline.foundations.schema.EntityDef; import com.northconcepts.datapipeline.foundations.schema.NumericFieldDef; import com.northconcepts.datapipeline.foundations.schema.SchemaDef; import com.northconcepts.datapipeline.foundations.schema.SchemaTransformer; import com.northconcepts.datapipeline.foundations.schema.TemporalFieldDef; import com.northconcepts.datapipeline.foundations.schema.TextFieldDef; import com.northconcepts.datapipeline.job.Job; import com.northconcepts.datapipeline.transform.TransformingReader; import java.io.File; public class SetDefaultValuesForMissingData { public static void main(String... args) throws Throwable { DataReader reader = new CSVReader(new File("example/data/input/schema/csv-with-null-values.csv")) .setFieldNamesInFirstRow(true); SchemaDef schema = new SchemaDef() .addEntity(new EntityDef().setName("Entity") .addField(new NumericFieldDef("ID", FieldType.INT) .setDefaultValueExpression("0")) .addField(new BooleanFieldDef("BOOLEAN1", FieldType.BOOLEAN) .setDefaultValueExpression("false")) .addField(new TextFieldDef("STRING1", FieldType.STRING) .setDefaultValueExpression("'DEFAULT STRING VALUE'")) .addField(new NumericFieldDef("BYTE1", FieldType.BYTE) .setDefaultValueExpression("88")) .addField(new TextFieldDef("CHAR1", FieldType.CHAR) .setDefaultValueExpression("'z'")) .addField(new TemporalFieldDef("DATE1", FieldType.DATE) .setDefaultValueExpression("toDate('1999-08-21T15:31:50.000Z')")) .addField(new TemporalFieldDef("DATETIME1", FieldType.DATETIME) .setDefaultValueExpression("toDatetime('1999-08-21T15:31:50.000Z')")) .addField(new TemporalFieldDef("TIME1", FieldType.TIME) .setDefaultValueExpression("toDatetime('1999-08-21T15:31:50.000Z')")) .addField(new NumericFieldDef("INT1", FieldType.INT) .setDefaultValueExpression("99")) .addField(new NumericFieldDef("LONG1", FieldType.LONG) .setDefaultValueExpression("1152921504606846976")) .addField(new NumericFieldDef("DOUBLE1", FieldType.DOUBLE) .setDefaultValueExpression("10.0")) .addField(new NumericFieldDef("FLOAT1", FieldType.FLOAT) .setDefaultValueExpression("99")) .addField(new NumericFieldDef("SHORT1", FieldType.SHORT) .setDefaultValueExpression("99")) .addField(new NumericFieldDef("BIGDECIMAL1", FieldType.BIG_DECIMAL) .setDefaultValueExpression("'111111111111111111111111111111111111111.111111111111111111111111111111111111111'")) .addField(new NumericFieldDef("BIGINTEGER1", FieldType.BIG_INTEGER) .setDefaultValueExpression("'99999999999999999999999999999999999999'")) ); EntityDef entity = schema.getEntity("Entity"); reader = new TransformingReader(reader) .add(new SchemaTransformer(entity)); // DataWriter writer = new NullWriter(); DataWriter writer = StreamWriter.newSystemOutWriter(); Job job = Job.run(reader, writer); System.out.println("Records Transferred: " + job.getRecordsTransferred()); System.out.println("Running Time: " + job.getRunningTimeAsString()); } }
Code Walkthrough
- Beginning the execution, a
DataReader
is initialized with the input CSV filecsv-with-null-values.csv
where we also invokesetFieldNamesInFirstRow(true)
to pick the column names from the first row. - Then, a
SchemaDef
is initialized where we set the default properties. - Fields are associated with Entities, thus an entity is passed to the
schema
by callingaddEntity()
method. .addField()
method can be used to define default values with desired datatypes.- A
TransformingReader
is initialized through aSchemaTransformer
to associate default fields as defined in theentity
above. - Finally, a job is run which transforms and displays this data on the console by using the
reader
and aStreamWriter
. getRecordsTransferred()
andgetRunningTimeAsString()
methods are called to get additional information about the transformation.
Console Output
----------------------------------------------- 0 - Record (MODIFIED) { 0:[ID]:INT=[1]:Integer 1:[BOOLEAN1]:BOOLEAN=[true]:Boolean 2:[STRING1]:STRING=[hello world]:String 3:[BYTE1]:BYTE=[7]:Byte 4:[CHAR1]:CHAR=[T]:Character 5:[DATE1]:DATE=[1999-08-21]:Date 6:[DATETIME1]:DATETIME=[Sat Aug 21 15:31:50 PKT 1999]:Date 7:[TIME1]:TIME=[15:31:50]:Time 8:[INT1]:INT=[789456]:Integer 9:[LONG1]:LONG=[987654321]:Long 10:[DOUBLE1]:DOUBLE=[123.456]:Double 11:[FLOAT1]:FLOAT=[963.258]:Float 12:[SHORT1]:SHORT=[17]:Short 13:[BIGDECIMAL1]:BIG_DECIMAL=[92233720368547758079223372036854775807.92233720368547758079223372036854775807]:BigDecimal 14:[BIGINTEGER1]:BIG_INTEGER=[92233720368547758079223372036854775807]:BigInteger } ----------------------------------------------- 1 - Record (MODIFIED) { 0:[ID]:INT=[2]:Integer 1:[BOOLEAN1]:BOOLEAN=[false]:Boolean 2:[STRING1]:STRING=[goodbye world]:String 3:[BYTE1]:BYTE=[127]:Byte 4:[CHAR1]:CHAR=[z]:Character 5:[DATE1]:DATE=[1786-12-30]:Date 6:[DATETIME1]:DATETIME=[Sat Dec 30 01:05:17 PKT 1786]:Date 7:[TIME1]:TIME=[01:05:17]:Time 8:[INT1]:INT=[2147483647]:Integer 9:[LONG1]:LONG=[9223372036854775807]:Long 10:[DOUBLE1]:DOUBLE=[1.7976931348623157E308]:Double 11:[FLOAT1]:FLOAT=[3.4028235E38]:Float 12:[SHORT1]:SHORT=[32767]:Short 13:[BIGDECIMAL1]:BIG_DECIMAL=[-9406564584124654494065645841246544.9406564584124654494065645841246544]:BigDecimal 14:[BIGINTEGER1]:BIG_INTEGER=[-92233720368547758079223372036854775807]:BigInteger } ----------------------------------------------- 2 - Record (MODIFIED) { 0:[ID]:INT=[3]:Integer 1:[BOOLEAN1]:BOOLEAN=[false]:Boolean 2:[STRING1]:STRING=[DEFAULT STRING VALUE]:String 3:[BYTE1]:BYTE=[88]:Byte 4:[CHAR1]:CHAR=[z]:Character 5:[DATE1]:DATE=[1999-08-21]:Date 6:[DATETIME1]:DATETIME=[Sat Aug 21 15:31:50 PKT 1999]:Date 7:[TIME1]:TIME=[15:31:50]:Time 8:[INT1]:INT=[-2147483648]:Integer 9:[LONG1]:LONG=[-9223372036854775808]:Long 10:[DOUBLE1]:DOUBLE=[-4.9E-324]:Double 11:[FLOAT1]:FLOAT=[1.4E-45]:Float 12:[SHORT1]:SHORT=[-32768]:Short 13:[BIGDECIMAL1]:BIG_DECIMAL=[0]:BigDecimal 14:[BIGINTEGER1]:BIG_INTEGER=[0]:BigInteger } ----------------------------------------------- 3 - Record (MODIFIED) { 0:[ID]:INT=[4]:Integer 1:[BOOLEAN1]:BOOLEAN=[false]:Boolean 2:[STRING1]:STRING=[DEFAULT STRING VALUE]:String 3:[BYTE1]:BYTE=[88]:Byte 4:[CHAR1]:CHAR=[z]:Character 5:[DATE1]:DATE=[1999-08-21]:Date 6:[DATETIME1]:DATETIME=[Sat Aug 21 15:31:50 PKT 1999]:Date 7:[TIME1]:TIME=[15:31:50]:Time 8:[INT1]:INT=[99]:Integer 9:[LONG1]:LONG=[1152921504606846976]:Long 10:[DOUBLE1]:DOUBLE=[10.0]:Double 11:[FLOAT1]:FLOAT=[99.0]:Float 12:[SHORT1]:SHORT=[99]:Short 13:[BIGDECIMAL1]:BIG_DECIMAL=[111111111111111111111111111111111111111.111111111111111111111111111111111111111]:BigDecimal 14:[BIGINTEGER1]:BIG_INTEGER=[99999999999999999999999999999999999999]:BigInteger } ----------------------------------------------- 4 - Record (MODIFIED) { 0:[ID]:INT=[5]:Integer 1:[BOOLEAN1]:BOOLEAN=[false]:Boolean 2:[STRING1]:STRING=[DEFAULT STRING VALUE]:String 3:[BYTE1]:BYTE=[88]:Byte 4:[CHAR1]:CHAR=[z]:Character 5:[DATE1]:DATE=[1999-08-21]:Date 6:[DATETIME1]:DATETIME=[Sat Aug 21 15:31:50 PKT 1999]:Date 7:[TIME1]:TIME=[15:31:50]:Time 8:[INT1]:INT=[99]:Integer 9:[LONG1]:LONG=[1152921504606846976]:Long 10:[DOUBLE1]:DOUBLE=[10.0]:Double 11:[FLOAT1]:FLOAT=[99.0]:Float 12:[SHORT1]:SHORT=[99]:Short 13:[BIGDECIMAL1]:BIG_DECIMAL=[111111111111111111111111111111111111111.111111111111111111111111111111111111111]:BigDecimal 14:[BIGINTEGER1]:BIG_INTEGER=[99999999999999999999999999999999999999]:BigInteger } ----------------------------------------------- 5 records Records Transferred: 5 Running Time: 34 Milliseconds