Filter Columns with All Null Values
Updated: Jul 15, 2023
This example identifies and excludes columns with all null values in a dataset. This is useful in data cleaning and preparation tasks before using the data for analysis and visualization.
Input CSV File
Sell,List,Living,Rooms,Beds,Baths,Age,Acres,Taxes,Rating,Garage 142,160,28,10,5,,60,0.28,3167,A, 1,193,28,11,,3,60,0.28,3168,B, 10,110,28,,5,3,60,0.28,3169,B, 3,132,,12,5,3,60,0.28,3170,C, 5,,180,13,5,3,60,0.28,3171,C,
Java Code Listing
package com.northconcepts.datapipeline.foundations.examples.pipeline; import com.northconcepts.datapipeline.foundations.file.LocalFileSink; import com.northconcepts.datapipeline.foundations.file.LocalFileSource; import com.northconcepts.datapipeline.foundations.pipeline.Pipeline; import com.northconcepts.datapipeline.foundations.pipeline.action.transform.SelectArrangeFieldsAction; import com.northconcepts.datapipeline.foundations.pipeline.dataset.Column; import com.northconcepts.datapipeline.foundations.pipeline.dataset.Dataset; import com.northconcepts.datapipeline.foundations.pipeline.dataset.MemoryDataset; import com.northconcepts.datapipeline.foundations.pipeline.input.CsvPipelineInput; import com.northconcepts.datapipeline.foundations.pipeline.output.ExcelPipelineOutput; public class FilterColumnsWithAllNullValues { public static void main(String[] args) { CsvPipelineInput pipelineInput = new CsvPipelineInput() .setFileSource(new LocalFileSource().setPath("data/input/Listing.csv")) .setFieldNamesInFirstRow(true); ExcelPipelineOutput pipelineOutput = new ExcelPipelineOutput() .setFileSink(new LocalFileSink().setPath("data/output/NoNull.xlsx")) .setFieldNamesInFirstRow(true); Pipeline pipeline = new Pipeline() .setInput(pipelineInput) .setOutput(pipelineOutput); Dataset dataset = new MemoryDataset(pipeline); dataset.load().waitForRecordsToLoad(); dataset.waitForColumnStatsToLoad(); SelectArrangeFieldsAction action = new SelectArrangeFieldsAction(); for(Column column : dataset.getColumns()) { if(column.getValueCount() != column.getNullCount()) { action.add(column.getName()); } } pipeline.addAction(action); pipeline.run(); dataset.close(); } }
Code Walkthrough
- CsvPipelineInput instance is created to read data from the input file
Listing.csv
. - The
setFieldNamesInFirstRow(true)
method is invoked to specify that the names specified in the first row should be used as field names. - ExcelPipelineOutput instance is created to specify the output file
output.xlsx
. - A Pipeline object is created. Input and output instances in the previous steps are declared in the pipeline instance.
- MemoryDataset is created to store the pipeline in the in-memory database.
- Next,
SelectArrangeFieldsAction
instance is created to define the arrange the order of fields in the dataset. - The columns with existing values are added to the
action
object using for loop and if conditional. In the given example,Garage
field is removed from the dataset. - The
SelectArrangeFieldsAction
is added to the pipeline to process the data. - The pipeline is executed and the memory database is closed.
Output File
Sell List Living Rooms Beds Baths Age Acres Taxes Rating 142 160 28 10 5 60 0.28 3167 A 1 193 28 11 3 60 0.28 3168 B 10 110 28 5 3 60 0.28 3169 B 3 132 12 5 3 60 0.28 3170 C 5 180 13 5 3 60 0.28 3171 C