With data being produced by so many sources in a variety of formats, it’s imperative for businesses to have a sane way to gain useful insight from it. Data integration is the process of transforming data from one or more sources into a form that can be loaded into a target system or used for analysis and business intelligence.
ETL is the process of performing data extraction, transformation, and loading. The process extracts data from a variety of sources and formats, transforms it into a standard structure, and loads it into a database, file, web service, or other system for analysis, visualization, machine learning, and other uses.
ETL tools come in a wide variety of shapes and sizes. Some run on your desktop or on-premises servers, while others run as SaaS in the cloud. Some are code-based, built on standard programming languages that many developers already know. Others are built on a custom DSL (domain-specific language) in an attempt to be more declarative and require less code. Still others are completely graphical, offering programming interfaces only for complex transformations.
What follows is a list of ETL tools for developers already familiar with Java and the JVM (Java Virtual Machine), to help you clean, validate, filter, and prepare your data for use.
I was reading a blog post at Java Code Geeks on how to create a Spring Batch ETL job. What struck me about the example was the amount of code the framework required for such a routine task. In this post, you’ll see how to accomplish the same task, summarizing a million stock trades to find the open, close, high, and low prices for each symbol, using our Data Pipeline framework.
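For readers who just want the shape of the computation, here’s a minimal plain-Java sketch of that open/close/high/low aggregation. The Trade record is hypothetical, and this hand-rolled loop stands in for what the post does with Data Pipeline’s streaming readers:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OhlcSummary {

    // Hypothetical input record; the real post reads trades from a file.
    record Trade(String symbol, double price) {}

    // Running open/close/high/low for one symbol.
    record Ohlc(double open, double close, double high, double low) {
        Ohlc update(double price) {
            return new Ohlc(open, price, Math.max(high, price), Math.min(low, price));
        }
    }

    public static Map<String, Ohlc> summarize(List<Trade> trades) {
        Map<String, Ohlc> bySymbol = new LinkedHashMap<>();
        for (Trade t : trades) {
            // The first trade seen for a symbol sets the open;
            // every later trade updates close, high, and low.
            bySymbol.merge(t.symbol(),
                    new Ohlc(t.price(), t.price(), t.price(), t.price()),
                    (current, fresh) -> current.update(t.price()));
        }
        return bySymbol;
    }
}
```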
One question I like to ask in interviews is: how would you speed up inserts when using JDBC?
This simple question usually shows me how knowledgeable a developer is about databases in general and JDBC in particular.
If you ever find yourself needing to insert data quickly into a SQL database (and not just answering the question in an interview), here are some options to consider.
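One common answer, sketched below: batch the inserts and commit once per batch instead of once per row. The connection URL, table, and columns here are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BatchInsertExample {

    public static void insertTrades(String[] symbols, double[] prices) throws SQLException {
        String url = "jdbc:postgresql://localhost/trades_db"; // assumed connection string
        String sql = "INSERT INTO trades (symbol, price) VALUES (?, ?)";

        try (Connection conn = DriverManager.getConnection(url);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            conn.setAutoCommit(false); // one commit for the whole batch, not one per row
            for (int i = 0; i < symbols.length; i++) {
                ps.setString(1, symbols[i]);
                ps.setDouble(2, prices[i]);
                ps.addBatch();
                if (i % 1000 == 999) {
                    ps.executeBatch(); // flush periodically to cap memory use
                }
            }
            ps.executeBatch();
            conn.commit();
        }
    }
}
```

Driver-specific settings can help further; for example, MySQL’s Connector/J can rewrite batches into multi-row inserts with rewriteBatchedStatements=true.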
We’re excited to introduce Data Pipeline version 4.1, the second update on our 2016 roadmap.
This release features MongoDB integration, expression language additions, and improved transformations and joins. We’ve also thrown in a ton of examples for all the new 4.1 and 4.0 features. Enjoy.
One feature of Data Pipeline is its ability to aggregate data without a database. This feature allows you to apply SQL “group by” operations to JSON, CSV, XML, Java beans, and other formats on the fly, in real time. This quick tutorial will show you how to use the GroupByReader class to aggregate Twitter search results.
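To make the idea concrete, here’s what a SQL-style “group by” looks like over in-memory records in plain Java, using Collectors.groupingBy as a stand-in rather than Data Pipeline’s own API; the Tweet record here is hypothetical:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupByExample {

    // Hypothetical minimal stand-in for a parsed tweet.
    record Tweet(String user, String text) {}

    // Equivalent of SQL's "SELECT user, COUNT(*) ... GROUP BY user"
    // applied to in-memory records, no database required.
    public static Map<String, Long> tweetsPerUser(List<Tweet> tweets) {
        return tweets.stream()
                .collect(Collectors.groupingBy(Tweet::user, Collectors.counting()));
    }
}
```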
Data Pipeline 3.1 is now available for download. This is a milestone release that adds native support for hierarchical data (nested records and multidimensional arrays).
Data Pipeline makes it easy to read, transform, and write XML and Excel files. This post demonstrates how to load data from an on-disk XML file, apply transformations on the fly, and save the result to an Excel file.
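As a rough approximation of the same flow using only the JDK’s XML parser plus Apache POI (a stand-in, not Data Pipeline’s API; the file names and XML structure are assumptions):

```java
import java.io.File;
import java.io.FileOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XmlToExcel {
    public static void main(String[] args) throws Exception {
        // Parse the XML input; "trades.xml" and its <trade symbol=... price=...>
        // elements are hypothetical.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("trades.xml"));
        NodeList trades = doc.getElementsByTagName("trade");

        try (XSSFWorkbook workbook = new XSSFWorkbook();
             FileOutputStream out = new FileOutputStream("trades.xlsx")) {
            var sheet = workbook.createSheet("Trades");
            for (int i = 0; i < trades.getLength(); i++) {
                Element trade = (Element) trades.item(i);
                var row = sheet.createRow(i);
                row.createCell(0).setCellValue(trade.getAttribute("symbol"));
                // On-the-fly transformation: parse the price string to a number.
                row.createCell(1).setCellValue(Double.parseDouble(trade.getAttribute("price")));
            }
            workbook.write(out);
        }
    }
}
```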
Data Pipeline lets you read, write, and convert Excel files using a very simple API. This post will show you how to create Excel files containing more than one worksheet or tab.
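For comparison, here’s how the same multi-tab idea looks in Apache POI, again as a stand-in rather than Data Pipeline’s API:

```java
import java.io.FileOutputStream;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class MultiSheetExample {
    public static void main(String[] args) throws Exception {
        try (XSSFWorkbook workbook = new XSSFWorkbook();
             FileOutputStream out = new FileOutputStream("report.xlsx")) {
            // Each createSheet call adds another tab to the same file.
            workbook.createSheet("Summary").createRow(0).createCell(0)
                    .setCellValue("Totals go here");
            workbook.createSheet("Details").createRow(0).createCell(0)
                    .setCellValue("Row-level data goes here");
            workbook.write(out);
        }
    }
}
```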
We’re hiring a new Java developer and decided to start by asking candidates to write code instead of going through the usual Q&A.
Recently we needed to add an hourly scheduler to our sliding window data aggregator and decided this would be a good test to see how people think and code.
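Here’s one way a candidate might approach the hourly-scheduling piece, sketched with ScheduledExecutorService; the window-rolling callback is a placeholder:

```java
import java.time.Duration;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HourlyScheduler {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        // Delay until the top of the next hour, then repeat every hour.
        ZonedDateTime now = ZonedDateTime.now();
        ZonedDateTime nextHour = now.truncatedTo(ChronoUnit.HOURS).plusHours(1);
        long initialDelayMillis = Duration.between(now, nextHour).toMillis();

        scheduler.scheduleAtFixedRate(
                () -> System.out.println("Rolling the sliding window at " + ZonedDateTime.now()),
                initialDelayMillis, TimeUnit.HOURS.toMillis(1), TimeUnit.MILLISECONDS);
    }
}
```

One design wrinkle worth probing in an interview: fixed-rate scheduling drifts across daylight-saving changes and clock adjustments, so a production version might recompute the delay to the next top of the hour after each run instead.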