How To Aggregate Twitter Searches Without A Database

How To Aggregate Twitter Searches Without A Database

One feature of Data Pipeline is its ability to aggregate data without a database.  This feature allows you to apply SQL “group by” operations to JSON, CSV, XML, Java beans, and other formats on-the-fly — in real-time.  This quick tutorial will show you how to use the GroupByReader class to aggregate Twitter search results.

 

How To Aggregate Twitter Searches Without A Database

Demo — Export Twitter Searches to Excel

We used the code in this post to create a Twitter search tool at https://birdiq.net/twitter-search.

Export your Twitter searches to Excel

This demo summarizes up to 1000 tweets matching your search query to find the top:

  • Retweets
  • Favorites
  • Hashtags
  • Mentions
  • URLs
  • Users

The search results page shows the top 10 items for each of the above, while the Excel spreadsheet contains the summary plus all of the original tweets.

Since this is only a demo, we run searches one-at-a-time and limit the results to 1000 tweets.

Step 1: Search for tweets

The first step is to setup a data source.  In this case, you can start a Twitter search using TwitterSearchReader.  You’ll need a consumer key and secret from https://apps.twitter.com/ to access Twitter’s APIs.  You’ll also need to pass in a search query and the maximum number of tweets to return — #Java and 1000 in this example.

API Limit Policy

Twitter sets limits on how many API calls you can make every 15 minutes.  The ApiLimitPolicy.STOP enum tells Data Pipeline to stop the DataReader normally — as if there’s no more data — if the Twitter API call limit is reached.  Other options are to FAIL by throwing an exception or WAIT up to 15 minutes for the limit to be reset.  You can also create your own policy by implementing IApiLimitPolicy.

Entity Expansion Policy

Some twitter fields like hashtags, mentions and URLs can contain multiple values.  The EntityExpansionPolicy.TO_ARRAY enum tells Data Pipeline to return these multi-valued fields as an array field.  You can also use TO_STRING to return it as a single, space-separated string or you can create your own policy by implementing IEntityExpansionPolicy.

Caching Tweets Temporarily

Normally you would stream data through your pipeline to keep memory low and process data quickly.  For this example, we’ll temporarily save the tweets in a RecordList.  This will allow you to cache them in something like Ehcache in order to handle the same search query in a short period of time.  In the above demo, we use caching to prevent calling the Twitter API a second time when the Excel download link is clicked.

Step 2: Save the tweets to Excel

Now that you have the tweets in memory, the first thing you need to do is add them to an Excel tab named “tweets”.

The setAutofitColumns() call expands the columns widths to fit the widest value.  The setAutoFilterColumns() call adds a row above the data with a drop-down to filter the spreadsheet using the values from each column.  At this point the ExcelDocument is stored in  memory and will be saved to disk later on.

Step 3: Group and count hashtags, mentions, URLs, and retweets

The next step is summarize the saved tweets to find the top hashtags, users, etc.  This will take the form of two reusable methods: top() and count().

Top Tweets

In some cases, Twitter already provides summary data with every tweet.  For example, each tweet has a field for the number of times it was retweeted and another for the number of times someone marked it as their favorite.  If a tweet gets retweeted 5 times, you might see 6 tweets in the search results — the original tweet, plus the retweets.  For these cases, you need to “group by” the text of the tweets to remove duplicates and grab the highest (max) value from the summary field.  You’ll also grab the first value for the other fields you need in the output.

Once the data is grouped and summarized, you’ll sort the data by the summary and text field, take the top n records, and select the fields you want to retain in the output, in the arrangement you want to see them.

This reusable method takes the following parameters:

  • tweets – the cached tweets
  • document – the target Excel document
  • topCount – the number of top rows to return
  • topField – the summary field containing the counts
  • firstFields – the other fields to return in the output sent to the Excel document

Count Tweets

In other cases, Twitter doesn’t provide summary counts and you’ll need to group and count the values yourself.  For example, you’ll need to group and count all the hashtags and URLs found in each tweet.

This reusable method takes the following parameters:

  • tweets – the cached tweets
  • document – the target Excel document
  • countField – the field containing the values to group-by and count
  • firstFields – the other fields to return in the output added to the Excel document

Unlike many SQL databases, the fields you’re going to use in the “group by” may contain array values.  For example, the HashtagEntities field could contain #Java, #WebDevelopment, and #SpringFramework.  When GroupByReader receives array values, it treats each value in the array separately, as if it was the only value in that record’s field.  This allows it to group and count all the #Java hashtags on their own.

Summarize Tweets

You can now call the above methods to add the summary tabs to the Excel document.

Step 4: Save the Excel spreadsheet to disk

The final step is to take the ExcelDocument and save it to disk (or HTTP response stream like in the demo’s case).

That’s all there is to it.  You now have an Excel document containing a tab with the original tweets, plus several other tabs containing summary data.

Twitter Streaming

As we promised at the start, you were able to aggregate data in your own app without setting up a database to store and query the tweets or running a separate data processing server.  Of course, you can always write the results to a database at any time.

Next time we’ll show you how to continuously aggregate data from Twitter’s streaming API in real-time — again without using a database.

Happy Coding!

 

About Dele Taylor

We make Data Pipeline — a lightweight ETL framework for Java. Use it to filter, transform, and aggregate data on-the-fly in your web, mobile, and desktop apps. Learn more about it at northconcepts.com.

One thought on “How To Aggregate Twitter Searches Without A Database

Leave a Reply

Your email address will not be published. Required fields are marked *
You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">