Filter Columns with All Null Values

This example identifies and excludes columns with all null values in a dataset.  This is useful in data cleaning and preparation tasks before using the data for analysis and visualization.

Input CSV File

Sell,List,Living,Rooms,Beds,Baths,Age,Acres,Taxes,Rating,Garage
142,160,28,10,5,,60,0.28,3167,A,
1,193,28,11,,3,60,0.28,3168,B,
10,110,28,,5,3,60,0.28,3169,B,
3,132,,12,5,3,60,0.28,3170,C,
5,,180,13,5,3,60,0.28,3171,C,

 

Java Code Listing

package com.northconcepts.datapipeline.foundations.examples.pipeline;

import com.northconcepts.datapipeline.foundations.file.LocalFileSink;
import com.northconcepts.datapipeline.foundations.file.LocalFileSource;
import com.northconcepts.datapipeline.foundations.pipeline.Pipeline;
import com.northconcepts.datapipeline.foundations.pipeline.action.transform.SelectArrangeFieldsAction;
import com.northconcepts.datapipeline.foundations.pipeline.dataset.Column;
import com.northconcepts.datapipeline.foundations.pipeline.dataset.Dataset;
import com.northconcepts.datapipeline.foundations.pipeline.dataset.MemoryDataset;
import com.northconcepts.datapipeline.foundations.pipeline.input.CsvPipelineInput;
import com.northconcepts.datapipeline.foundations.pipeline.output.ExcelPipelineOutput;


public class FilterColumnsWithAllNullValues {

    public static void main(String[] args) {

        CsvPipelineInput pipelineInput = new CsvPipelineInput()
                .setFileSource(new LocalFileSource().setPath("data/input/Listing.csv"))
                .setFieldNamesInFirstRow(true);

        ExcelPipelineOutput pipelineOutput = new ExcelPipelineOutput()
                .setFileSink(new LocalFileSink().setPath("data/output/NoNull.xlsx"))
                .setFieldNamesInFirstRow(true);

        Pipeline pipeline = new Pipeline()
            .setInput(pipelineInput)
            .setOutput(pipelineOutput);

        Dataset dataset = new MemoryDataset(pipeline);
        dataset.load().waitForRecordsToLoad();
        dataset.waitForColumnStatsToLoad();

        SelectArrangeFieldsAction action = new SelectArrangeFieldsAction();

        for(Column column : dataset.getColumns()) {
            if(column.getValueCount() != column.getNullCount()) {
                action.add(column.getName());
            }
        }

        pipeline.addAction(action);

        pipeline.run();

        dataset.close();
    }
}

 

Code Walkthrough

  1. CsvPipelineInput instance is created to read data from the input file Listing.csv.
  2. The setFieldNamesInFirstRow(true) method is invoked to specify that the names specified in the first row should be used as field names.
  3. ExcelPipelineOutput instance is created to specify the output file output.xlsx.
  4. A Pipeline object is created. Input and output instances in the previous steps are declared in the pipeline instance.
  5. MemoryDataset is created to store the pipeline in the in-memory database.
  6. Next, SelectArrangeFieldsAction instance is created to define the arrange the order of fields in the dataset.
  7. The columns with existing values are added to the action object using for loop and if conditional. In the given example, Garage field is removed from the dataset.
  8. The SelectArrangeFieldsAction is added to the pipeline to process the data.
  9. The pipeline is executed and the memory database is closed.

 

Output File

Sell	List	Living	Rooms	Beds	Baths	Age	Acres	Taxes	Rating
142	160	28	10	5		60	0.28	3167	A
1	193	28	11		3	60	0.28	3168	B
10	110	28		5	3	60	0.28	3169	B
3	132		12	5	3	60	0.28	3170	C
5		180	13	5	3	60	0.28	3171	C

 

Mobile Analytics