Set Default Values for Missing Data

This example shows how to handle missing data in a file by assigning default values to empty or absent fields.

Setting default values comes in handy in many scenarios, such as:

  • Data integration between two or more systems
  • Data preprocessing, where data is validated before or after specific flows in a system
  • Database constraints, where incomplete data is often not allowed

Here is how you can do it.

Input File

ID, BOOLEAN1, STRING1, BYTE1, CHAR1, DATE1, DATETIME1, TIME1, INT1, LONG1, DOUBLE1, FLOAT1, SHORT1,BIGDECIMAL1,BIGINTEGER1
1, true, "hello world", 7, T, 1999-08-21, 1999-08-21T15:31:50.000Z, 15:31:50, 789456, 987654321, 123.456, 963.258, 17,92233720368547758079223372036854775807.92233720368547758079223372036854775807,92233720368547758079223372036854775807
2, false, goodbye world, 127, z, 1786-12-30, 1786-12-30T01:05:17.000Z, 1:05:17, "2147483647", "9223372036854775807", 1.79769313486231570e+308d, 3.40282346638528860e+38, "32767",-9406564584124654494065645841246544.9406564584124654494065645841246544,-92233720368547758079223372036854775807
3, , , , , , , , -2147483648, "-9223372036854775808", -4.94065645841246544e-324d, 1.40129846432481707e-45, -32768,0,0
4, , , , , , , , , , , , , ,
5


Java Code Listing

package com.northconcepts.datapipeline.foundations.examples.datamapping;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.core.FieldType;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.csv.CSVReader;
import com.northconcepts.datapipeline.foundations.schema.BooleanFieldDef;
import com.northconcepts.datapipeline.foundations.schema.EntityDef;
import com.northconcepts.datapipeline.foundations.schema.NumericFieldDef;
import com.northconcepts.datapipeline.foundations.schema.SchemaDef;
import com.northconcepts.datapipeline.foundations.schema.SchemaTransformer;
import com.northconcepts.datapipeline.foundations.schema.TemporalFieldDef;
import com.northconcepts.datapipeline.foundations.schema.TextFieldDef;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.transform.TransformingReader;

import java.io.File;

public class SetDefaultValuesForMissingData {

    public static void main(String... args) throws Throwable {
        DataReader reader = new CSVReader(new File("example/data/input/schema/csv-with-null-values.csv"))
                .setFieldNamesInFirstRow(true);
        
        SchemaDef schema = new SchemaDef()
            .addEntity(new EntityDef().setName("Entity")
                .addField(new NumericFieldDef("ID", FieldType.INT)
                    .setDefaultValueExpression("0"))
                .addField(new BooleanFieldDef("BOOLEAN1", FieldType.BOOLEAN)
                    .setDefaultValueExpression("false"))
                .addField(new TextFieldDef("STRING1", FieldType.STRING)
                    .setDefaultValueExpression("'DEFAULT STRING VALUE'"))
                .addField(new NumericFieldDef("BYTE1", FieldType.BYTE)
                    .setDefaultValueExpression("88"))
                .addField(new TextFieldDef("CHAR1", FieldType.CHAR)
                    .setDefaultValueExpression("'z'"))
                .addField(new TemporalFieldDef("DATE1", FieldType.DATE)
                    .setDefaultValueExpression("toDate('1999-08-21T15:31:50.000Z')"))
                .addField(new TemporalFieldDef("DATETIME1", FieldType.DATETIME)
                    .setDefaultValueExpression("toDatetime('1999-08-21T15:31:50.000Z')"))
                .addField(new TemporalFieldDef("TIME1", FieldType.TIME)
                    .setDefaultValueExpression("toDatetime('1999-08-21T15:31:50.000Z')"))
                .addField(new NumericFieldDef("INT1", FieldType.INT)
                    .setDefaultValueExpression("99"))
                .addField(new NumericFieldDef("LONG1", FieldType.LONG)
                    .setDefaultValueExpression("1152921504606846976"))
                .addField(new NumericFieldDef("DOUBLE1", FieldType.DOUBLE)
                    .setDefaultValueExpression("10.0"))
                .addField(new NumericFieldDef("FLOAT1", FieldType.FLOAT)
                    .setDefaultValueExpression("99"))
                .addField(new NumericFieldDef("SHORT1", FieldType.SHORT)
                    .setDefaultValueExpression("99"))
                .addField(new NumericFieldDef("BIGDECIMAL1", FieldType.BIG_DECIMAL)
                    .setDefaultValueExpression("'111111111111111111111111111111111111111.111111111111111111111111111111111111111'"))
                .addField(new NumericFieldDef("BIGINTEGER1", FieldType.BIG_INTEGER)
                    .setDefaultValueExpression("'99999999999999999999999999999999999999'"))
            );


        EntityDef entity = schema.getEntity("Entity");
        reader = new TransformingReader(reader)
                .add(new SchemaTransformer(entity));

        // Swap in "new NullWriter()" to discard the output instead of printing it
        DataWriter writer = StreamWriter.newSystemOutWriter();
        
        Job job = Job.run(reader, writer);
        System.out.println("Records Transferred: " + job.getRecordsTransferred());
        System.out.println("Running Time: " + job.getRunningTimeAsString());
    }
}


Code Walkthrough

  1. A DataReader is created for the input CSV file csv-with-null-values.csv; calling setFieldNamesInFirstRow(true) tells the reader to take column names from the first row.
  2. A SchemaDef is created to hold the default-value rules.
  3. Fields belong to entities, so an EntityDef is added to the schema via the addEntity() method.
  4. Each addField() call defines a field with its data type, and setDefaultValueExpression() supplies the expression used when that field is missing or empty.
  5. A TransformingReader wraps the original reader and applies a SchemaTransformer built from the entity, filling in the defaults defined above.
  6. Finally, Job.run() transfers records from the reader to a StreamWriter, which prints them to the console.
  7. getRecordsTransferred() and getRunningTimeAsString() report how many records were processed and how long the job took.
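To make the idea behind steps 4 and 5 concrete, here is a minimal, library-free sketch of what "apply a default when a field is missing or blank" means. DefaultValueFiller is a hypothetical helper written for this illustration; it is not part of the DataPipeline API, which performs this work (plus type conversion and expression evaluation) inside SchemaTransformer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper illustrating per-field default substitution:
// if a field is absent or blank in a record, fall back to its configured default.
public class DefaultValueFiller {

    private final Map<String, Object> defaults = new HashMap<>();

    public DefaultValueFiller addDefault(String field, Object value) {
        defaults.put(field, value);
        return this;
    }

    // Returns a copy of the record with defaults applied to missing/blank fields.
    public Map<String, Object> apply(Map<String, Object> record) {
        Map<String, Object> result = new HashMap<>(record);
        defaults.forEach((field, value) -> {
            Object current = result.get(field);
            if (current == null || "".equals(current)) {
                result.put(field, value);
            }
        });
        return result;
    }

    public static void main(String[] args) {
        DefaultValueFiller filler = new DefaultValueFiller()
                .addDefault("STRING1", "DEFAULT STRING VALUE")
                .addDefault("INT1", 99);

        Map<String, Object> record = new HashMap<>();
        record.put("ID", 4);
        record.put("STRING1", "");   // blank, so the default applies
        // INT1 is absent entirely, so its default applies too

        Map<String, Object> filled = filler.apply(record);
        System.out.println(filled.get("STRING1")); // DEFAULT STRING VALUE
        System.out.println(filled.get("INT1"));    // 99
    }
}
```

Note that SchemaTransformer goes further than this sketch: its defaults are expressions (for example "toDate('1999-08-21T15:31:50.000Z')"), so they are evaluated and converted to the field's declared type rather than copied in verbatim.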


Console Output

-----------------------------------------------
0 - Record (MODIFIED) {
    0:[ID]:INT=[1]:Integer
    1:[BOOLEAN1]:BOOLEAN=[true]:Boolean
    2:[STRING1]:STRING=[hello world]:String
    3:[BYTE1]:BYTE=[7]:Byte
    4:[CHAR1]:CHAR=[T]:Character
    5:[DATE1]:DATE=[1999-08-21]:Date
    6:[DATETIME1]:DATETIME=[Sat Aug 21 15:31:50 PKT 1999]:Date
    7:[TIME1]:TIME=[15:31:50]:Time
    8:[INT1]:INT=[789456]:Integer
    9:[LONG1]:LONG=[987654321]:Long
    10:[DOUBLE1]:DOUBLE=[123.456]:Double
    11:[FLOAT1]:FLOAT=[963.258]:Float
    12:[SHORT1]:SHORT=[17]:Short
    13:[BIGDECIMAL1]:BIG_DECIMAL=[92233720368547758079223372036854775807.92233720368547758079223372036854775807]:BigDecimal
    14:[BIGINTEGER1]:BIG_INTEGER=[92233720368547758079223372036854775807]:BigInteger
}

-----------------------------------------------
1 - Record (MODIFIED) {
    0:[ID]:INT=[2]:Integer
    1:[BOOLEAN1]:BOOLEAN=[false]:Boolean
    2:[STRING1]:STRING=[goodbye world]:String
    3:[BYTE1]:BYTE=[127]:Byte
    4:[CHAR1]:CHAR=[z]:Character
    5:[DATE1]:DATE=[1786-12-30]:Date
    6:[DATETIME1]:DATETIME=[Sat Dec 30 01:05:17 PKT 1786]:Date
    7:[TIME1]:TIME=[01:05:17]:Time
    8:[INT1]:INT=[2147483647]:Integer
    9:[LONG1]:LONG=[9223372036854775807]:Long
    10:[DOUBLE1]:DOUBLE=[1.7976931348623157E308]:Double
    11:[FLOAT1]:FLOAT=[3.4028235E38]:Float
    12:[SHORT1]:SHORT=[32767]:Short
    13:[BIGDECIMAL1]:BIG_DECIMAL=[-9406564584124654494065645841246544.9406564584124654494065645841246544]:BigDecimal
    14:[BIGINTEGER1]:BIG_INTEGER=[-92233720368547758079223372036854775807]:BigInteger
}

-----------------------------------------------
2 - Record (MODIFIED) {
    0:[ID]:INT=[3]:Integer
    1:[BOOLEAN1]:BOOLEAN=[false]:Boolean
    2:[STRING1]:STRING=[DEFAULT STRING VALUE]:String
    3:[BYTE1]:BYTE=[88]:Byte
    4:[CHAR1]:CHAR=[z]:Character
    5:[DATE1]:DATE=[1999-08-21]:Date
    6:[DATETIME1]:DATETIME=[Sat Aug 21 15:31:50 PKT 1999]:Date
    7:[TIME1]:TIME=[15:31:50]:Time
    8:[INT1]:INT=[-2147483648]:Integer
    9:[LONG1]:LONG=[-9223372036854775808]:Long
    10:[DOUBLE1]:DOUBLE=[-4.9E-324]:Double
    11:[FLOAT1]:FLOAT=[1.4E-45]:Float
    12:[SHORT1]:SHORT=[-32768]:Short
    13:[BIGDECIMAL1]:BIG_DECIMAL=[0]:BigDecimal
    14:[BIGINTEGER1]:BIG_INTEGER=[0]:BigInteger
}

-----------------------------------------------
3 - Record (MODIFIED) {
    0:[ID]:INT=[4]:Integer
    1:[BOOLEAN1]:BOOLEAN=[false]:Boolean
    2:[STRING1]:STRING=[DEFAULT STRING VALUE]:String
    3:[BYTE1]:BYTE=[88]:Byte
    4:[CHAR1]:CHAR=[z]:Character
    5:[DATE1]:DATE=[1999-08-21]:Date
    6:[DATETIME1]:DATETIME=[Sat Aug 21 15:31:50 PKT 1999]:Date
    7:[TIME1]:TIME=[15:31:50]:Time
    8:[INT1]:INT=[99]:Integer
    9:[LONG1]:LONG=[1152921504606846976]:Long
    10:[DOUBLE1]:DOUBLE=[10.0]:Double
    11:[FLOAT1]:FLOAT=[99.0]:Float
    12:[SHORT1]:SHORT=[99]:Short
    13:[BIGDECIMAL1]:BIG_DECIMAL=[111111111111111111111111111111111111111.111111111111111111111111111111111111111]:BigDecimal
    14:[BIGINTEGER1]:BIG_INTEGER=[99999999999999999999999999999999999999]:BigInteger
}

-----------------------------------------------
4 - Record (MODIFIED) {
    0:[ID]:INT=[5]:Integer
    1:[BOOLEAN1]:BOOLEAN=[false]:Boolean
    2:[STRING1]:STRING=[DEFAULT STRING VALUE]:String
    3:[BYTE1]:BYTE=[88]:Byte
    4:[CHAR1]:CHAR=[z]:Character
    5:[DATE1]:DATE=[1999-08-21]:Date
    6:[DATETIME1]:DATETIME=[Sat Aug 21 15:31:50 PKT 1999]:Date
    7:[TIME1]:TIME=[15:31:50]:Time
    8:[INT1]:INT=[99]:Integer
    9:[LONG1]:LONG=[1152921504606846976]:Long
    10:[DOUBLE1]:DOUBLE=[10.0]:Double
    11:[FLOAT1]:FLOAT=[99.0]:Float
    12:[SHORT1]:SHORT=[99]:Short
    13:[BIGDECIMAL1]:BIG_DECIMAL=[111111111111111111111111111111111111111.111111111111111111111111111111111111111]:BigDecimal
    14:[BIGINTEGER1]:BIG_INTEGER=[99999999999999999999999999999999999999]:BigInteger
}

-----------------------------------------------
5 records
Records Transferred: 5
Running Time: 34 Milliseconds
