Use Streaming Excel for Reading
This example shows you how to use Data Pipeline's ExcelReader to read records from a Microsoft Excel document.
In this demo code you are going to use Apache POI's streaming API (XSSF_SAX) as an Excel document provider to read records from an Excel file. By using this API you can achieve less memory footprint when reading very large spreedsheets. Data Pipeline gives you an option to use this API while reading records from an Excel file.
The advantage of streaming to read files is that you can process the data as it is being produced or created.
This example can easily be modified to show how to use streaming excel for writing.
Java Code listing
package com.northconcepts.datapipeline.examples.cookbook; import java.io.File; import com.northconcepts.datapipeline.core.DataReader; import com.northconcepts.datapipeline.core.StreamWriter; import com.northconcepts.datapipeline.excel.ExcelDocument; import com.northconcepts.datapipeline.excel.ExcelDocument.ProviderType; import com.northconcepts.datapipeline.excel.ExcelReader; import com.northconcepts.datapipeline.job.Job; public class UseStreamingExcelReading { public static void main(String[] args) { ExcelDocument document = new ExcelDocument(ProviderType.POI_XSSF_SAX) .open(new File("example/data/input/call-center-inbound-call.xlsx")); DataReader reader = new ExcelReader(document) .setFieldNamesInFirstRow(true); Job.run(reader, new StreamWriter(System.out)); // prints 100 records } }
Code Walkthrough
- First, ExcelDocument is created to open
call-center-inbound-call.xlsx
. - ExcelDocument.ProviderType indicates an Excel provider which is used to open the input Excel file. For this example POI_XSSF_SAX provider is selected.
- ExcelDocument.open() opens the specified
File
orInputStream
. - ExcelReader is created to read from the input Excel file.
- Data are transferred from ExcelReader to the console via Job.run() method. See how to compile and run data pipeline jobs
ExcelDocument
The in-memory abstraction for an Excel workbook. It is not thread-safe and will throw an exception if used in multiple ExcelReaders and/or ExcelWriter concurrently. Since its data is stored in memory, reading and re-reading from it multiple times is very cheap (think millions of reads per second).
ExcelDocument.ProviderType
ExcelDocument.ProviderType is an enum class which contains a list of Excel providers.
POI_XSSF_SAX
POI_XSSF_SAX reads .xlsx files (Excel 2007+) using Apache POI's streaming API (XSSF_SAX) to reduce memory usage.
ExcelReader
Obtains records from a Microsoft Excel document. A method ExcelReader.setFieldNamesInFirstRow(true)
in this class causes the ExcelReader
to use the names specified in the first row of the
input data as field names. If this method is not invoked, the fields would be named as A1, A2. If those fields names need to be changed, a rename
transformation can be added on top of ExcelReader
or any other type
(Refer Rename a field for example).
Console output
0 - Record { 0:[event_type]:STRING=[STARTED]:String 1:[id]:INT=[1]:Integer 2:[agent_id]:INT=[7]:Integer 3:[phone_number]:STRING=[(437) 689-5268]:String 4:[start_time]:DATETIME=[Fri Mar 04 22:39:21 EAT 2016]:Date 5:[end_time]:UNDEFINED=[null] 6:[disposition]:UNDEFINED=[null] } ----------------------------------------------- 1 - Record { 0:[event_type]:STRING=[ENDED]:String 1:[id]:INT=[1]:Integer 2:[agent_id]:INT=[7]:Integer 3:[phone_number]:STRING=[(437) 689-5268]:String 4:[start_time]:DATETIME=[Fri Mar 04 22:39:21 EAT 2016]:Date 5:[end_time]:DATETIME=[Fri Mar 04 22:39:24 EAT 2016]:Date 6:[disposition]:STRING=[PRODUCT_QUESTION]:String } ----------------------------------------------- 2 - Record { 0:[event_type]:STRING=[STARTED]:String 1:[id]:INT=[2]:Integer 2:[agent_id]:INT=[19]:Integer 3:[phone_number]:STRING=[(343) 8314-0603]:String 4:[start_time]:DATETIME=[Fri Mar 04 22:39:26 EAT 2016]:Date 5:[end_time]:UNDEFINED=[null] 6:[disposition]:UNDEFINED=[null] } ----------------------------------------------- 3 - Record { 0:[event_type]:STRING=[STARTED]:String 1:[id]:INT=[3]:Integer 2:[agent_id]:INT=[2]:Integer 3:[phone_number]:STRING=[(365) 4675-1663]:String 4:[start_time]:DATETIME=[Fri Mar 04 22:39:28 EAT 2016]:Date 5:[end_time]:UNDEFINED=[null] 6:[disposition]:UNDEFINED=[null] } . . . . . 99 - Record { 0:[event_type]:STRING=[STARTED]:String 1:[id]:INT=[59]:Integer 2:[agent_id]:INT=[2]:Integer 3:[phone_number]:STRING=[(289) 2053-7379]:String 4:[start_time]:DATETIME=[Fri Mar 04 22:42:40 EAT 2016]:Date 5:[end_time]:UNDEFINED=[null] 6:[disposition]:UNDEFINED=[null] } -----------------------------------------------