Set Default Values for Missing Data
Updated: Jun 24, 2023
This example illustrates how to make sure that missing data in a file is handled properly by setting default values.
Setting default values comes in handy for many different scenarios such as:
- Data Integration between two or more systems
- Data Preprocessing; validating the data before or after specific flows carried in a system
- Database Constraints; where incomplete data is often not allowed
Here is how you can do it.
Input File
ID, BOOLEAN1, STRING1, BYTE1, CHAR1, DATE1, DATETIME1, TIME1, INT1, LONG1, DOUBLE1, FLOAT1, SHORT1,BIGDECIMAL1,BIGINTEGER1
1, true, "hello world", 7, T, 1999-08-21, 1999-08-21T15:31:50.000Z, 15:31:50, 789456, 987654321, 123.456, 963.258, 17,92233720368547758079223372036854775807.92233720368547758079223372036854775807,92233720368547758079223372036854775807
2, false, goodbye world, 127, z, 1786-12-30, 1786-12-30T01:05:17.000Z, 1:05:17, "2147483647", "9223372036854775807", 1.79769313486231570e+308d, 3.40282346638528860e+38, "32767",-9406564584124654494065645841246544.9406564584124654494065645841246544,-92233720368547758079223372036854775807
3, , , , , , , , -2147483648, "-9223372036854775808", -4.94065645841246544e-324d, 1.40129846432481707e-45, -32768,0,0
4, , , , , , , , , , , , , ,
5
Java Code Listing
package com.northconcepts.datapipeline.foundations.examples.datamapping;
import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.core.FieldType;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.csv.CSVReader;
import com.northconcepts.datapipeline.foundations.schema.BooleanFieldDef;
import com.northconcepts.datapipeline.foundations.schema.EntityDef;
import com.northconcepts.datapipeline.foundations.schema.NumericFieldDef;
import com.northconcepts.datapipeline.foundations.schema.SchemaDef;
import com.northconcepts.datapipeline.foundations.schema.SchemaTransformer;
import com.northconcepts.datapipeline.foundations.schema.TemporalFieldDef;
import com.northconcepts.datapipeline.foundations.schema.TextFieldDef;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.transform.TransformingReader;
import java.io.File;
public class SetDefaultValuesForMissingData {
public static void main(String... args) throws Throwable {
DataReader reader = new CSVReader(new File("example/data/input/schema/csv-with-null-values.csv"))
.setFieldNamesInFirstRow(true);
SchemaDef schema = new SchemaDef()
.addEntity(new EntityDef().setName("Entity")
.addField(new NumericFieldDef("ID", FieldType.INT)
.setDefaultValueExpression("0"))
.addField(new BooleanFieldDef("BOOLEAN1", FieldType.BOOLEAN)
.setDefaultValueExpression("false"))
.addField(new TextFieldDef("STRING1", FieldType.STRING)
.setDefaultValueExpression("'DEFAULT STRING VALUE'"))
.addField(new NumericFieldDef("BYTE1", FieldType.BYTE)
.setDefaultValueExpression("88"))
.addField(new TextFieldDef("CHAR1", FieldType.CHAR)
.setDefaultValueExpression("'z'"))
.addField(new TemporalFieldDef("DATE1", FieldType.DATE)
.setDefaultValueExpression("toDate('1999-08-21T15:31:50.000Z')"))
.addField(new TemporalFieldDef("DATETIME1", FieldType.DATETIME)
.setDefaultValueExpression("toDatetime('1999-08-21T15:31:50.000Z')"))
.addField(new TemporalFieldDef("TIME1", FieldType.TIME)
.setDefaultValueExpression("toDatetime('1999-08-21T15:31:50.000Z')"))
.addField(new NumericFieldDef("INT1", FieldType.INT)
.setDefaultValueExpression("99"))
.addField(new NumericFieldDef("LONG1", FieldType.LONG)
.setDefaultValueExpression("1152921504606846976"))
.addField(new NumericFieldDef("DOUBLE1", FieldType.DOUBLE)
.setDefaultValueExpression("10.0"))
.addField(new NumericFieldDef("FLOAT1", FieldType.FLOAT)
.setDefaultValueExpression("99"))
.addField(new NumericFieldDef("SHORT1", FieldType.SHORT)
.setDefaultValueExpression("99"))
.addField(new NumericFieldDef("BIGDECIMAL1", FieldType.BIG_DECIMAL)
.setDefaultValueExpression("'111111111111111111111111111111111111111.111111111111111111111111111111111111111'"))
.addField(new NumericFieldDef("BIGINTEGER1", FieldType.BIG_INTEGER)
.setDefaultValueExpression("'99999999999999999999999999999999999999'"))
);
EntityDef entity = schema.getEntity("Entity");
reader = new TransformingReader(reader)
.add(new SchemaTransformer(entity));
// DataWriter writer = new NullWriter();
DataWriter writer = StreamWriter.newSystemOutWriter();
Job job = Job.run(reader, writer);
System.out.println("Records Transferred: " + job.getRecordsTransferred());
System.out.println("Running Time: " + job.getRunningTimeAsString());
}
}
Code Walkthrough
- Beginning the execution, a
DataReaderis initialized with the input CSV filecsv-with-null-values.csvwhere we also invokesetFieldNamesInFirstRow(true)to pick the column names from the first row. - Then, a
SchemaDefis initialized where we set the default properties. - Fields are associated with Entities, thus an entity is passed to the
schemaby callingaddEntity()method. .addField()method can be used to define default values with desired datatypes.- A
TransformingReaderis initialized through aSchemaTransformerto associate default fields as defined in theentityabove. - Finally, a job is run which transforms and displays this data on the console by using the
readerand aStreamWriter. getRecordsTransferred()andgetRunningTimeAsString()methods are called to get additional information about the transformation.
Console Output
-----------------------------------------------
0 - Record (MODIFIED) {
0:[ID]:INT=[1]:Integer
1:[BOOLEAN1]:BOOLEAN=[true]:Boolean
2:[STRING1]:STRING=[hello world]:String
3:[BYTE1]:BYTE=[7]:Byte
4:[CHAR1]:CHAR=[T]:Character
5:[DATE1]:DATE=[1999-08-21]:Date
6:[DATETIME1]:DATETIME=[Sat Aug 21 15:31:50 PKT 1999]:Date
7:[TIME1]:TIME=[15:31:50]:Time
8:[INT1]:INT=[789456]:Integer
9:[LONG1]:LONG=[987654321]:Long
10:[DOUBLE1]:DOUBLE=[123.456]:Double
11:[FLOAT1]:FLOAT=[963.258]:Float
12:[SHORT1]:SHORT=[17]:Short
13:[BIGDECIMAL1]:BIG_DECIMAL=[92233720368547758079223372036854775807.92233720368547758079223372036854775807]:BigDecimal
14:[BIGINTEGER1]:BIG_INTEGER=[92233720368547758079223372036854775807]:BigInteger
}
-----------------------------------------------
1 - Record (MODIFIED) {
0:[ID]:INT=[2]:Integer
1:[BOOLEAN1]:BOOLEAN=[false]:Boolean
2:[STRING1]:STRING=[goodbye world]:String
3:[BYTE1]:BYTE=[127]:Byte
4:[CHAR1]:CHAR=[z]:Character
5:[DATE1]:DATE=[1786-12-30]:Date
6:[DATETIME1]:DATETIME=[Sat Dec 30 01:05:17 PKT 1786]:Date
7:[TIME1]:TIME=[01:05:17]:Time
8:[INT1]:INT=[2147483647]:Integer
9:[LONG1]:LONG=[9223372036854775807]:Long
10:[DOUBLE1]:DOUBLE=[1.7976931348623157E308]:Double
11:[FLOAT1]:FLOAT=[3.4028235E38]:Float
12:[SHORT1]:SHORT=[32767]:Short
13:[BIGDECIMAL1]:BIG_DECIMAL=[-9406564584124654494065645841246544.9406564584124654494065645841246544]:BigDecimal
14:[BIGINTEGER1]:BIG_INTEGER=[-92233720368547758079223372036854775807]:BigInteger
}
-----------------------------------------------
2 - Record (MODIFIED) {
0:[ID]:INT=[3]:Integer
1:[BOOLEAN1]:BOOLEAN=[false]:Boolean
2:[STRING1]:STRING=[DEFAULT STRING VALUE]:String
3:[BYTE1]:BYTE=[88]:Byte
4:[CHAR1]:CHAR=[z]:Character
5:[DATE1]:DATE=[1999-08-21]:Date
6:[DATETIME1]:DATETIME=[Sat Aug 21 15:31:50 PKT 1999]:Date
7:[TIME1]:TIME=[15:31:50]:Time
8:[INT1]:INT=[-2147483648]:Integer
9:[LONG1]:LONG=[-9223372036854775808]:Long
10:[DOUBLE1]:DOUBLE=[-4.9E-324]:Double
11:[FLOAT1]:FLOAT=[1.4E-45]:Float
12:[SHORT1]:SHORT=[-32768]:Short
13:[BIGDECIMAL1]:BIG_DECIMAL=[0]:BigDecimal
14:[BIGINTEGER1]:BIG_INTEGER=[0]:BigInteger
}
-----------------------------------------------
3 - Record (MODIFIED) {
0:[ID]:INT=[4]:Integer
1:[BOOLEAN1]:BOOLEAN=[false]:Boolean
2:[STRING1]:STRING=[DEFAULT STRING VALUE]:String
3:[BYTE1]:BYTE=[88]:Byte
4:[CHAR1]:CHAR=[z]:Character
5:[DATE1]:DATE=[1999-08-21]:Date
6:[DATETIME1]:DATETIME=[Sat Aug 21 15:31:50 PKT 1999]:Date
7:[TIME1]:TIME=[15:31:50]:Time
8:[INT1]:INT=[99]:Integer
9:[LONG1]:LONG=[1152921504606846976]:Long
10:[DOUBLE1]:DOUBLE=[10.0]:Double
11:[FLOAT1]:FLOAT=[99.0]:Float
12:[SHORT1]:SHORT=[99]:Short
13:[BIGDECIMAL1]:BIG_DECIMAL=[111111111111111111111111111111111111111.111111111111111111111111111111111111111]:BigDecimal
14:[BIGINTEGER1]:BIG_INTEGER=[99999999999999999999999999999999999999]:BigInteger
}
-----------------------------------------------
4 - Record (MODIFIED) {
0:[ID]:INT=[5]:Integer
1:[BOOLEAN1]:BOOLEAN=[false]:Boolean
2:[STRING1]:STRING=[DEFAULT STRING VALUE]:String
3:[BYTE1]:BYTE=[88]:Byte
4:[CHAR1]:CHAR=[z]:Character
5:[DATE1]:DATE=[1999-08-21]:Date
6:[DATETIME1]:DATETIME=[Sat Aug 21 15:31:50 PKT 1999]:Date
7:[TIME1]:TIME=[15:31:50]:Time
8:[INT1]:INT=[99]:Integer
9:[LONG1]:LONG=[1152921504606846976]:Long
10:[DOUBLE1]:DOUBLE=[10.0]:Double
11:[FLOAT1]:FLOAT=[99.0]:Float
12:[SHORT1]:SHORT=[99]:Short
13:[BIGDECIMAL1]:BIG_DECIMAL=[111111111111111111111111111111111111111.111111111111111111111111111111111111111]:BigDecimal
14:[BIGINTEGER1]:BIG_INTEGER=[99999999999999999999999999999999999999]:BigInteger
}
-----------------------------------------------
5 records
Records Transferred: 5
Running Time: 34 Milliseconds
