Use Streaming Excel for Reading

Updated: Jun 4, 2023

This example shows you how to use Data Pipeline's ExcelReader to read records from a Microsoft Excel document.

In this demo code you are going to use Apache POI's streaming API (XSSF_SAX) as an Excel document provider to read records from an Excel file. By using this API you can achieve less memory footprint when reading very large spreedsheets. Data Pipeline gives you an option to use this API while reading records from an Excel file.

The advantage of streaming to read files is that you can process the data as it is being produced or created.

This example can easily be modified to show how to use streaming excel for writing.

Java Code listing

package com.northconcepts.datapipeline.examples.cookbook;

import java.io.File;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.excel.ExcelDocument;
import com.northconcepts.datapipeline.excel.ExcelDocument.ProviderType;
import com.northconcepts.datapipeline.excel.ExcelReader;
import com.northconcepts.datapipeline.job.Job;

public class UseStreamingExcelReading {
    
    public static void main(String[] args) {
        ExcelDocument document = new ExcelDocument(ProviderType.POI_XSSF_SAX)
                .open(new File("example/data/input/call-center-inbound-call.xlsx"));
        DataReader reader = new ExcelReader(document)
            .setFieldNamesInFirstRow(true);

        Job.run(reader, new StreamWriter(System.out)); // prints 100 records
    }

}

Code Walkthrough

  1. First, ExcelDocument is created to open call-center-inbound-call.xlsx.
  2. ExcelDocument.ProviderType indicates an Excel provider which is used to open the input Excel file. For this example POI_XSSF_SAX provider is selected.
  3. ExcelDocument.open() opens the specified File or InputStream.
  4. ExcelReader is created to read from the input Excel file.
  5. Data are transferred from ExcelReader to the console via Job.run() method. See how to compile and run data pipeline jobs

ExcelDocument

The in-memory abstraction for an Excel workbook. It is not thread-safe and will throw an exception if used in multiple ExcelReaders and/or ExcelWriter concurrently. Since its data is stored in memory, reading and re-reading from it multiple times is very cheap (think millions of reads per second).

ExcelDocument.ProviderType

ExcelDocument.ProviderType is an enum class which contains a list of Excel providers.

POI_XSSF_SAX

POI_XSSF_SAX reads .xlsx files (Excel 2007+) using Apache POI's streaming API (XSSF_SAX) to reduce memory usage.

ExcelReader

Obtains records from a Microsoft Excel document. A method ExcelReader.setFieldNamesInFirstRow(true) in this class causes the ExcelReader to use the names specified in the first row of the input data as field names. If this method is not invoked, the fields would be named as A1, A2. If those fields names need to be changed, a rename transformation can be added on top of ExcelReader or any other type (Refer Rename a field for example).

Console output

0 - Record {
    0:[event_type]:STRING=[STARTED]:String
    1:[id]:INT=[1]:Integer
    2:[agent_id]:INT=[7]:Integer
    3:[phone_number]:STRING=[(437) 689-5268]:String
    4:[start_time]:DATETIME=[Fri Mar 04 22:39:21 EAT 2016]:Date
    5:[end_time]:UNDEFINED=[null]
    6:[disposition]:UNDEFINED=[null]
}

-----------------------------------------------
1 - Record {
    0:[event_type]:STRING=[ENDED]:String
    1:[id]:INT=[1]:Integer
    2:[agent_id]:INT=[7]:Integer
    3:[phone_number]:STRING=[(437) 689-5268]:String
    4:[start_time]:DATETIME=[Fri Mar 04 22:39:21 EAT 2016]:Date
    5:[end_time]:DATETIME=[Fri Mar 04 22:39:24 EAT 2016]:Date
    6:[disposition]:STRING=[PRODUCT_QUESTION]:String
}

-----------------------------------------------
2 - Record {
    0:[event_type]:STRING=[STARTED]:String
    1:[id]:INT=[2]:Integer
    2:[agent_id]:INT=[19]:Integer
    3:[phone_number]:STRING=[(343) 8314-0603]:String
    4:[start_time]:DATETIME=[Fri Mar 04 22:39:26 EAT 2016]:Date
    5:[end_time]:UNDEFINED=[null]
    6:[disposition]:UNDEFINED=[null]
}

-----------------------------------------------
3 - Record {
    0:[event_type]:STRING=[STARTED]:String
    1:[id]:INT=[3]:Integer
    2:[agent_id]:INT=[2]:Integer
    3:[phone_number]:STRING=[(365) 4675-1663]:String
    4:[start_time]:DATETIME=[Fri Mar 04 22:39:28 EAT 2016]:Date
    5:[end_time]:UNDEFINED=[null]
    6:[disposition]:UNDEFINED=[null]
}
.
.
.
.
.
99 - Record {
    0:[event_type]:STRING=[STARTED]:String
    1:[id]:INT=[59]:Integer
    2:[agent_id]:INT=[2]:Integer
    3:[phone_number]:STRING=[(289) 2053-7379]:String
    4:[start_time]:DATETIME=[Fri Mar 04 22:42:40 EAT 2016]:Date
    5:[end_time]:UNDEFINED=[null]
    6:[disposition]:UNDEFINED=[null]
}

-----------------------------------------------
Mobile Analytics