Extract Bigrams, Trigrams, and Ngrams
Updated: Jun 4, 2023
In this example you will learn how to use DataPipeline to extract bigrams, trigrams, and other n-grams from XML files.
The input is the data read from the RSS feed links listed in the URLS array.
Because there are multiple inputs, you will use SequenceReader to combine the individual DataReaders into a single stream. It reads from each reader until that reader is empty and then moves on to the next.
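The "read from each until empty, then move to the next" behaviour can be illustrated without the library. The following is a minimal plain-Java sketch of that idea only; ChainedSource is a hypothetical name introduced for illustration and is not part of the DataPipeline API or a description of how SequenceReader is implemented.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration only; this is NOT DataPipeline's SequenceReader.
final class ChainedSource<T> implements Iterator<T> {

    private final Iterator<Iterator<T>> sources;
    private Iterator<T> current;

    ChainedSource(List<Iterator<T>> sources) {
        this.sources = sources.iterator();
    }

    @Override
    public boolean hasNext() {
        // Skip over exhausted (or empty) sources until one has data left.
        while (current == null || !current.hasNext()) {
            if (!sources.hasNext()) {
                return false; // every source has been drained
            }
            current = sources.next();
        }
        return true;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }

    public static void main(String[] args) {
        // Stand-ins for three feeds; the second one is empty.
        List<Iterator<String>> feeds = new ArrayList<>();
        feeds.add(List.of("feed1-item1", "feed1-item2").iterator());
        feeds.add(List.<String>of().iterator());
        feeds.add(List.of("feed3-item1").iterator());

        ChainedSource<String> all = new ChainedSource<>(feeds);
        while (all.hasNext()) {
            System.out.println(all.next());
        }
    }
}
```

Downstream code sees one continuous stream and does not need to know how many sources were combined, which is why the rest of the pipeline in this example works on a single reader.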
Java Code listing
/*
 * Copyright (c) 2006-2022 North Concepts Inc.  All rights reserved.
 * Proprietary and Confidential.  Use is subject to license terms.
 *
 * https://northconcepts.com/data-pipeline/licensing/
 */
package com.northconcepts.datapipeline.examples.cookbook;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.URL;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.core.LimitReader;
import com.northconcepts.datapipeline.core.SequenceReader;
import com.northconcepts.datapipeline.core.SortingReader;
import com.northconcepts.datapipeline.csv.CSVWriter;
import com.northconcepts.datapipeline.group.GroupByReader;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.transform.BasicFieldTransformer;
import com.northconcepts.datapipeline.transform.Ngrams;
import com.northconcepts.datapipeline.transform.TransformingReader;
import com.northconcepts.datapipeline.xml.XmlRecordReader;

public class ExtractBigramsTrigramsAndNgrams {

    private static final int NGRAMS = 3; // bigram: 2; trigrams: 3; quadrigrams: 4;
    private static final int TOP_PHRASES = 25;

    private static final String[] URLS = {
        "https://rss.cbc.ca/lineup/topstories.xml",
        "https://rss.cbc.ca/lineup/world.xml",
        "https://rss.cbc.ca/lineup/canada.xml",
        "https://rss.cbc.ca/lineup/politics.xml",
        "https://rss.cbc.ca/lineup/business.xml",
        "https://rss.cbc.ca/lineup/health.xml",
        "https://rss.cbc.ca/lineup/arts.xml",
        "https://rss.cbc.ca/lineup/technology.xml",
        "https://rss.cbc.ca/lineup/offbeat.xml",
        "https://www.cbc.ca/cmlink/rss-cbcaboriginal",

        "https://globalnews.ca/feed/",
        "https://globalnews.ca/canada/feed/",
        "https://globalnews.ca/world/feed/",
        "https://globalnews.ca/politics/feed/",
        "https://globalnews.ca/money/feed/",
        "https://globalnews.ca/health/feed/",
        "https://globalnews.ca/entertainment/feed/",
        "https://globalnews.ca/environment/feed/",
        "https://globalnews.ca/tech/feed/",
        "https://globalnews.ca/sports/feed/",

        "https://www.ctvnews.ca/rss/ctvnews-ca-top-stories-public-rss-1.822009",
        "https://www.ctvnews.ca/rss/ctvnews-ca-canada-public-rss-1.822284",
        "https://www.ctvnews.ca/rss/ctvnews-ca-world-public-rss-1.822289",
        "https://www.ctvnews.ca/rss/ctvnews-ca-entertainment-public-rss-1.822292",
        "https://www.ctvnews.ca/rss/ctvnews-ca-politics-public-rss-1.822302",
        "https://www.ctvnews.ca/rss/lifestyle/ctv-news-lifestyle-1.3407722",
        "https://www.ctvnews.ca/rss/business/ctv-news-business-headlines-1.867648",
        "https://www.ctvnews.ca/rss/ctvnews-ca-sci-tech-public-rss-1.822295",
        "https://www.ctvnews.ca/rss/sports/ctv-news-sports-1.3407726",
        "https://www.ctvnews.ca/rss/ctvnews-ca-health-public-rss-1.822299",
        "https://www.ctvnews.ca/rss/autos/ctv-news-autos-1.867636",
    };

    public static void main(String[] args) throws Throwable {
        SequenceReader sequenceReader = new SequenceReader();

        for (String url : URLS) {
            BufferedReader input = new BufferedReader(new InputStreamReader(new URL(url).openStream(), "UTF-8"));
            sequenceReader.add(new XmlRecordReader(input).addRecordBreak("/rss/channel/item"));
        }

        DataReader reader = sequenceReader;

        reader = new TransformingReader(reader)
                .add(new BasicFieldTransformer("title").lowerCase())
                .add(new Ngrams("title", "phrase", NGRAMS));

        reader = new GroupByReader(reader, "phrase")
                .setExcludeNulls(true)
                .count("count", true);

        reader = new SortingReader(reader).desc("count").asc("phrase");

        reader = new LimitReader(reader, TOP_PHRASES);

        DataWriter writer = new CSVWriter(new OutputStreamWriter(System.out))
                .setFieldNamesInFirstRow(true);

        Job.run(reader, writer);
    }

}
Code Walkthrough
- Since data is collected from multiple inputs, you loop through URLS and use a BufferedReader to read each URL; UTF-8 is the character encoding used. SequenceReader then combines all the inputs into a single stream, i.e. sequenceReader.add(new XmlRecordReader(input)). .addRecordBreak("/rss/channel/item") separates the records; in this case a record break is added after each item tag.
- TransformingReader is then used to transform the records passing through. add(new BasicFieldTransformer("title").lowerCase()) converts the values in the title field to lowercase.
- In .add(new Ngrams("title", "phrase", NGRAMS)), title is the source field path, phrase is the target field path, and NGRAMS is the n-gram size (2 for bigrams, 3 for trigrams, and so on; 3 in this example).
- GroupByReader divides the records into groups and applies summary operations to each group. It takes the reader and the field to group by, i.e. phrase. .setExcludeNulls(true) excludes any null or empty data, and .count("count", true) returns the number of times each phrase occurred.
- SortingReader sorts the records. In this case desc("count") sorts the records in descending order by count, and asc("phrase") then sorts records with equal counts in ascending order by phrase.
- LimitReader limits the number of records sent downstream; in this case the limit is TOP_PHRASES, i.e. 25.
- CSVWriter writes the records as a comma-separated values (CSV) stream, which is printed to the console via the OutputStreamWriter. .setFieldNamesInFirstRow(true) writes the field names as the first row of the CSV output.
- Data is transferred from the reader to the writer via the Job.run() method.
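To make the lowercase-then-n-gram step concrete, here is a small plain-Java sketch of extracting word n-grams from a single title. It is an illustration under stated assumptions, not the library's implementation: NgramSketch is a hypothetical class, and the tokenisation rule (split on anything that is not a letter or digit) is an assumption, since the walkthrough does not say how the Ngrams transformer tokenises text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; DataPipeline's Ngrams transformer may behave differently.
final class NgramSketch {

    // Returns every run of n consecutive words in the text, in order.
    static List<String> ngrams(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }

        // Assumption: a word is a run of letters or digits; everything else separates words.
        List<String> words = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }

        // Slide a window of n words across the title.
        List<String> phrases = new ArrayList<>();
        for (int i = 0; i + n <= words.size(); i++) {
            phrases.add(String.join(" ", words.subList(i, i + n)));
        }
        return phrases;
    }

    public static void main(String[] args) {
        // A made-up title, used only to show the shape of the result.
        System.out.println(ngrams("Canada sending more artillery to Ukraine", 3));
    }
}
```

A title with fewer than n words yields no phrases at all, which is one reason a grouping step that skips null or empty values is useful afterwards.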
Output file
phrase,count
texas school shooting,6
artillery to ukraine,5
texas elementary school,5
after texas school,4
at sex assault,4
conservative leadership candidates,4
frontman jacob hoggard,4
hedley frontman jacob,4
monkeypox cases in,4
sex assault trial,4
1 year since,3
2025 invictus games,3
adopts bill 96,3
against russia anand,3
airport security due,3
allegations at sex,3
artists face surrendering,3
as part of,3
at airport security,3
at kamloops residential,3
at texas elementary,3
banning huawei from,3
beads at airport,3
canada sending more,3
certainly not over,3
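The ordering of this output (highest count first, with equal counts listed alphabetically by phrase, cut off at 25 rows) follows from the group, sort, and limit steps. The sketch below reproduces that logic in plain Java on a small in-memory list. TopPhrasesSketch and its sample phrases are hypothetical and used only for illustration; it does not use or describe the internals of GroupByReader, SortingReader, or LimitReader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the DataPipeline implementation.
final class TopPhrasesSketch {

    static List<Map.Entry<String, Integer>> top(List<String> phrases, int limit) {
        // Group by phrase and count occurrences, skipping null or empty values.
        Map<String, Integer> counts = new HashMap<>();
        for (String phrase : phrases) {
            if (phrase == null || phrase.isEmpty()) {
                continue;
            }
            counts.merge(phrase, 1, Integer::sum);
        }

        // Sort by count descending, then by phrase ascending to break ties.
        List<Map.Entry<String, Integer>> rows = new ArrayList<>(counts.entrySet());
        rows.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        // Keep only the first `limit` rows.
        return new ArrayList<>(rows.subList(0, Math.min(limit, rows.size())));
    }

    public static void main(String[] args) {
        // Made-up phrases; null and "" stand in for titles that produced no phrase.
        List<String> phrases = Arrays.asList("b c", "a b", "b c", null, "a b", "c d", "");
        System.out.println("phrase,count");
        for (Map.Entry<String, Integer> row : top(phrases, 2)) {
            System.out.println(row.getKey() + "," + row.getValue());
        }
    }
}
```

The secondary sort on phrase is what makes the output deterministic when many phrases share the same count, as most of the rows with a count of 3 do in the listing above.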