Filter Columns with All Null Values
Updated: Jul 15, 2023
This example identifies and excludes columns with all null values in a dataset. This is useful in data cleaning and preparation tasks before using the data for analysis and visualization.
Input CSV File
Sell,List,Living,Rooms,Beds,Baths,Age,Acres,Taxes,Rating,Garage 142,160,28,10,5,,60,0.28,3167,A, 1,193,28,11,,3,60,0.28,3168,B, 10,110,28,,5,3,60,0.28,3169,B, 3,132,,12,5,3,60,0.28,3170,C, 5,,180,13,5,3,60,0.28,3171,C,
Java Code Listing
package com.northconcepts.datapipeline.foundations.examples.pipeline;
import com.northconcepts.datapipeline.foundations.file.LocalFileSink;
import com.northconcepts.datapipeline.foundations.file.LocalFileSource;
import com.northconcepts.datapipeline.foundations.pipeline.Pipeline;
import com.northconcepts.datapipeline.foundations.pipeline.action.transform.SelectArrangeFieldsAction;
import com.northconcepts.datapipeline.foundations.pipeline.dataset.Column;
import com.northconcepts.datapipeline.foundations.pipeline.dataset.Dataset;
import com.northconcepts.datapipeline.foundations.pipeline.dataset.MemoryDataset;
import com.northconcepts.datapipeline.foundations.pipeline.input.CsvPipelineInput;
import com.northconcepts.datapipeline.foundations.pipeline.output.ExcelPipelineOutput;
public class FilterColumnsWithAllNullValues {
public static void main(String[] args) {
CsvPipelineInput pipelineInput = new CsvPipelineInput()
.setFileSource(new LocalFileSource().setPath("data/input/Listing.csv"))
.setFieldNamesInFirstRow(true);
ExcelPipelineOutput pipelineOutput = new ExcelPipelineOutput()
.setFileSink(new LocalFileSink().setPath("data/output/NoNull.xlsx"))
.setFieldNamesInFirstRow(true);
Pipeline pipeline = new Pipeline()
.setInput(pipelineInput)
.setOutput(pipelineOutput);
Dataset dataset = new MemoryDataset(pipeline);
dataset.load().waitForRecordsToLoad();
dataset.waitForColumnStatsToLoad();
SelectArrangeFieldsAction action = new SelectArrangeFieldsAction();
for(Column column : dataset.getColumns()) {
if(column.getValueCount() != column.getNullCount()) {
action.add(column.getName());
}
}
pipeline.addAction(action);
pipeline.run();
dataset.close();
}
}
Code Walkthrough
- CsvPipelineInput instance is created to read data from the input file
Listing.csv. - The
setFieldNamesInFirstRow(true)method is invoked to specify that the names specified in the first row should be used as field names. - ExcelPipelineOutput instance is created to specify the output file
output.xlsx. - A Pipeline object is created. Input and output instances in the previous steps are declared in the pipeline instance.
- MemoryDataset is created to store the pipeline in the in-memory database.
- Next,
SelectArrangeFieldsActioninstance is created to define the arrange the order of fields in the dataset. - The columns with existing values are added to the
actionobject using for loop and if conditional. In the given example,Garagefield is removed from the dataset. - The
SelectArrangeFieldsActionis added to the pipeline to process the data. - The pipeline is executed and the memory database is closed.
Output File
Sell List Living Rooms Beds Baths Age Acres Taxes Rating 142 160 28 10 5 60 0.28 3167 A 1 193 28 11 3 60 0.28 3168 B 10 110 28 5 3 60 0.28 3169 B 3 132 12 5 3 60 0.28 3170 C 5 180 13 5 3 60 0.28 3171 C
