Data Pipeline 4.4 Now Available

Today we’re pleased to announce the release of Data Pipeline version 4.4. This update includes integration with Amazon S3, new features to better handle real-time data and aggregation, and new XML and JSON readers to speed up your development.

Amazon S3 Streaming Uploads & Downloads

We’ve introduced a new AmazonS3FileSystem class to help you interact with remote files stored on S3. The class lets you read and write files with minimal memory overhead, without first saving them to disk. It also lets you browse your buckets, folders, and files.

Read a File from Amazon S3

 

Write a File to Amazon S3

 

JSON Record Reader

Our existing JSON reader requires you to identify, ahead of time, the fields to add to your records. But what if you want all the fields, or don’t know what they are ahead of time? The new JsonRecordReader is here to help. You no longer need to specify any fields; just give it one or more record breaks and Data Pipeline will collect everything it finds into hierarchical records.

 

XML Record Reader

XML gets a similar addition to the new JSON record reader. If you need to collect all subbranches of an XML tree into records without specifying each field, the new XmlRecordReader is exactly what you need.
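
To make the idea of a record break concrete, here is a minimal sketch in plain Java using the JDK's built-in DOM parser. It does not use Data Pipeline's XmlRecordReader, whose API isn't shown in this post; the XmlRecordSketch class and its read method are hypothetical names made up for illustration. For brevity it flattens each record into path/value pairs rather than building hierarchical records.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical illustration only; this is not Data Pipeline's XmlRecordReader.
final class XmlRecordSketch {

    // Produces one record per element named recordTag (the "record break").
    // Every leaf element below it becomes a field keyed by its path, so no
    // field names have to be declared up front. Intended for trusted input.
    static List<Map<String, String>> read(String xml, String recordTag) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        NodeList breaks = doc.getElementsByTagName(recordTag);
        List<Map<String, String>> records = new ArrayList<>();
        for (int i = 0; i < breaks.getLength(); i++) {
            Map<String, String> record = new LinkedHashMap<>();
            collect((Element) breaks.item(i), "", record);
            records.add(record);
        }
        return records;
    }

    // Walks the subtree depth-first; repeated sibling names overwrite each other.
    private static void collect(Element parent, String prefix, Map<String, String> record) {
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip text, comments, and whitespace between elements
            }
            Element element = (Element) child;
            String path = prefix.isEmpty() ? element.getTagName() : prefix + "/" + element.getTagName();
            if (hasChildElements(element)) {
                collect(element, path, record);
            } else {
                record.put(path, element.getTextContent().trim());
            }
        }
    }

    private static boolean hasChildElements(Element element) {
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                return true;
            }
        }
        return false;
    }
}
```

Calling XmlRecordSketch.read(xml, "order") on a document of order elements would return one map per order, with nested values under keys such as customer/name.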

 

Buffered Reader

When dealing with real-time data, you sometimes need to collect out-of-order records together. For example, in a shipping system you may need to process packages arriving on the same truck together, even if packages from multiple trucks are mixed together in the stream. The new BufferedReader class uses a configurable strategy to determine when to release the records held in each of its buffers downstream.
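
The buffering idea can be sketched in a few lines of plain Java. This is not Data Pipeline's BufferedReader and makes no claim about its API; KeyedBufferSketch and its methods are hypothetical, and the release strategy shown (release a buffer once it holds a fixed number of records) is just one assumed strategy among many possible ones.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only; this is not Data Pipeline's BufferedReader.
final class KeyedBufferSketch<K, R> {
    private final Function<R, K> keyOf;   // e.g. package -> truck id
    private final int releaseSize;        // assumed strategy: release when a buffer is this full
    private final Map<K, List<R>> buffers = new LinkedHashMap<>();

    KeyedBufferSketch(Function<R, K> keyOf, int releaseSize) {
        if (releaseSize < 1) {
            throw new IllegalArgumentException("releaseSize must be at least 1");
        }
        this.keyOf = keyOf;
        this.releaseSize = releaseSize;
    }

    // Buffers the record under its key. Returns the whole batch for that key
    // once it reaches releaseSize, otherwise an empty list.
    List<R> add(R record) {
        K key = keyOf.apply(record);
        List<R> buffer = buffers.computeIfAbsent(key, k -> new ArrayList<>());
        buffer.add(record);
        if (buffer.size() >= releaseSize) {
            buffers.remove(key);
            return buffer;
        }
        return List.of();
    }

    // Releases everything still buffered, one batch per key, in first-seen key order.
    List<List<R>> flush() {
        List<List<R>> remaining = new ArrayList<>(buffers.values());
        buffers.clear();
        return remaining;
    }
}
```

With a key function that extracts the truck from each package record, interleaved packages from trucks A and B come out grouped by truck, and flush() drains whatever is left when the stream ends.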

 

Group By Reader

If you need to summarize streaming data, GroupByReader can now produce output at a different rate than incoming data.  The CreateWindowStrategy and CloseWindowStrategy classes can now be used together to produce sliding windows that release summarized data even when no new records are available.  They can also be used to create complex windows based on a combination of factors including time and record counts.
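
As a rough illustration of producing output at a different rate than the input, here is a plain-Java sketch of a much simpler case than the one described above: a tumbling window that closes after a fixed number of records and emits per-key totals. It does not use GroupByReader, CreateWindowStrategy, or CloseWindowStrategy, and it does not attempt sliding or time-based windows; WindowedSumSketch is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration only; this is not Data Pipeline's GroupByReader.
final class WindowedSumSketch {
    private final int windowSize;                         // records per window (assumed close rule)
    private final Map<String, Long> totals = new LinkedHashMap<>();
    private int seen;

    WindowedSumSketch(int windowSize) {
        if (windowSize < 1) {
            throw new IllegalArgumentException("windowSize must be at least 1");
        }
        this.windowSize = windowSize;
    }

    // Adds one record to the open window. When the window has received
    // windowSize records it closes: the per-key sums are returned and a
    // fresh window starts. Otherwise nothing is emitted.
    Optional<Map<String, Long>> add(String key, long value) {
        totals.merge(key, value, Long::sum);
        seen++;
        if (seen < windowSize) {
            return Optional.empty();
        }
        Map<String, Long> summary = new LinkedHashMap<>(totals);
        totals.clear();
        seen = 0;
        return Optional.of(summary);
    }
}
```

With a window size of three, three incoming records yield a single summary record, so downstream code sees one output for every three inputs.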

 

Bigrams, Trigrams, Ngrams

Data Pipeline has a new transformer for extracting every sequence of n consecutive words in a body of text. The new Ngrams class allows you to extract bigrams, trigrams, and other n-grams from any field in a data stream. The following example finds the top 25 three-word sequences found in the headlines of several Canadian news websites.
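
Independently of the Ngrams class, whose API isn't reproduced here, the underlying technique can be sketched in plain Java. NgramSketch and its two methods are hypothetical names for illustration; the sketch splits on whitespace and lowercases words, which is an assumption and may differ from how Data Pipeline tokenizes text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration only; this is not Data Pipeline's Ngrams class.
final class NgramSketch {

    // Returns every run of n consecutive words in the text, in order of appearance.
    static List<String> ngrams(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> result = new ArrayList<>();
        String normalized = text.trim().toLowerCase(Locale.ROOT);
        if (normalized.isEmpty()) {
            return result;
        }
        String[] words = normalized.split("\\s+");
        for (int i = 0; i + n <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return result;
    }

    // Counts n-grams across all texts and returns the `limit` most frequent,
    // most frequent first. Ties keep first-seen order because the sort is stable.
    static List<Map.Entry<String, Integer>> topNgrams(List<String> texts, int n, int limit) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String text : texts) {
            for (String gram : ngrams(text, n)) {
                counts.merge(gram, 1, Integer::sum);
            }
        }
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()))
                .limit(limit)
                .collect(Collectors.toList());
    }
}
```

Given a list of headlines, NgramSketch.topNgrams(headlines, 3, 25) would return the 25 most common three-word sequences along with their counts.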

 

Multiple Field Support

The BasicFieldTransformer, FieldTransformer, and FieldFilter classes can now accept more than one field name. You no longer have to duplicate lines of code to perform the same operations or filters on multiple fields; just pass them all into the constructors.
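
The general pattern of accepting several field names in one call can be sketched with Java varargs. This is an assumption-laden illustration, not the actual constructors of the classes above: MultiFieldSketch, its transform method, and the use of a plain Map as a record are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical illustration only; not Data Pipeline's transformer or filter classes.
final class MultiFieldSketch {

    // Applies the same operation to every named field that exists in the record
    // and returns a new record; fields that are not named are left untouched.
    static Map<String, String> transform(Map<String, String> record,
                                         UnaryOperator<String> operation,
                                         String... fieldNames) {
        Map<String, String> result = new LinkedHashMap<>(record);
        for (String name : fieldNames) {
            result.computeIfPresent(name, (field, value) -> operation.apply(value));
        }
        return result;
    }
}
```

A single call such as MultiFieldSketch.transform(record, String::trim, "firstName", "lastName") replaces two otherwise identical calls, one per field.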

And More

Version 4.4 also includes improvements in the areas of:

  • Multi-threading
  • Debugging & diagnostics
  • The expression language
  • Job management and cancellation
  • BigInteger and BigDecimal support
  • Twitter data readers
  • XPath expression handling

See the changelog for the complete list of updates.

About Dele Taylor

We make Data Pipeline — a lightweight ETL framework for Java. Use it to filter, transform, and aggregate data on-the-fly in your web, mobile, and desktop apps. Learn more about it at northconcepts.com.
