Extract Bigrams, Trigrams, and Ngrams
Updated: Aug 10, 2023
This example extracts bigrams, trigrams, and other n-grams from the item titles in XML (RSS) news feeds, so you can analyze which words tend to appear together. With the extracted n-grams you can count how often each word combination occurs, identify the most frequent phrases, and feed the results into further text analysis.
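To make the idea concrete, here is a minimal plain-Java sketch of how word n-grams can be formed with a sliding window. It is not the library's Ngrams transformer: the class name NgramSketch, the whitespace-only tokenization, and the sample headline are assumptions made for illustration, and the real transformer may handle punctuation and other details differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of DataPipeline.
class NgramSketch {

    // Returns every run of n consecutive words in the text, in order of appearance.
    // Assumption: words are separated by whitespace and punctuation is left untouched.
    static List<String> ngrams(String text, int n) {
        List<String> result = new ArrayList<>();
        String normalized = text.trim().toLowerCase(Locale.ROOT);
        if (n < 1 || normalized.isEmpty()) {
            return result;
        }
        String[] words = normalized.split("\\s+");
        // Slide a window of n words across the array, one word at a time.
        for (int start = 0; start + n <= words.length; start++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, start, start + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample headline; with n = 3 it yields four trigrams.
        for (String phrase : ngrams("Canada sending more artillery to Ukraine", 3)) {
            System.out.println(phrase);
        }
    }
}
```

A six-word title produces four trigrams, five bigrams, or three 4-grams; a title shorter than n words produces none.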
Java Code Listing
package com.northconcepts.datapipeline.examples.cookbook;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.URL;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.core.LimitReader;
import com.northconcepts.datapipeline.core.SequenceReader;
import com.northconcepts.datapipeline.core.SortingReader;
import com.northconcepts.datapipeline.csv.CSVWriter;
import com.northconcepts.datapipeline.group.GroupByReader;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.transform.BasicFieldTransformer;
import com.northconcepts.datapipeline.transform.Ngrams;
import com.northconcepts.datapipeline.transform.TransformingReader;
import com.northconcepts.datapipeline.xml.XmlRecordReader;

public class ExtractBigramsTrigramsAndNgrams {

    private static final int NGRAMS = 3; // bigram: 2; trigrams: 3; quadrigrams: 4;
    private static final int TOP_PHRASES = 25;

    private static final String[] URLS = {
        "https://rss.cbc.ca/lineup/topstories.xml",
        "https://rss.cbc.ca/lineup/world.xml",
        "https://rss.cbc.ca/lineup/canada.xml",
        "https://rss.cbc.ca/lineup/politics.xml",
        "https://rss.cbc.ca/lineup/business.xml",
        "https://rss.cbc.ca/lineup/health.xml",
        "https://rss.cbc.ca/lineup/arts.xml",
        "https://rss.cbc.ca/lineup/technology.xml",
        "https://rss.cbc.ca/lineup/offbeat.xml",
        "https://www.cbc.ca/cmlink/rss-cbcaboriginal",

        "https://globalnews.ca/feed/",
        "https://globalnews.ca/canada/feed/",
        "https://globalnews.ca/world/feed/",
        "https://globalnews.ca/politics/feed/",
        "https://globalnews.ca/money/feed/",
        "https://globalnews.ca/health/feed/",
        "https://globalnews.ca/entertainment/feed/",
        "https://globalnews.ca/environment/feed/",
        "https://globalnews.ca/tech/feed/",
        "https://globalnews.ca/sports/feed/",

        "https://www.ctvnews.ca/rss/ctvnews-ca-top-stories-public-rss-1.822009",
        "https://www.ctvnews.ca/rss/ctvnews-ca-canada-public-rss-1.822284",
        "https://www.ctvnews.ca/rss/ctvnews-ca-world-public-rss-1.822289",
        "https://www.ctvnews.ca/rss/ctvnews-ca-entertainment-public-rss-1.822292",
        "https://www.ctvnews.ca/rss/ctvnews-ca-politics-public-rss-1.822302",
        "https://www.ctvnews.ca/rss/lifestyle/ctv-news-lifestyle-1.3407722",
        "https://www.ctvnews.ca/rss/business/ctv-news-business-headlines-1.867648",
        "https://www.ctvnews.ca/rss/ctvnews-ca-sci-tech-public-rss-1.822295",
        "https://www.ctvnews.ca/rss/sports/ctv-news-sports-1.3407726",
        "https://www.ctvnews.ca/rss/ctvnews-ca-health-public-rss-1.822299",
        "https://www.ctvnews.ca/rss/autos/ctv-news-autos-1.867636",
    };

    public static void main(String[] args) throws Throwable {
        SequenceReader sequenceReader = new SequenceReader();

        for (String url : URLS) {
            BufferedReader input = new BufferedReader(new InputStreamReader(new URL(url).openStream(), "UTF-8"));
            sequenceReader.add(new XmlRecordReader(input).addRecordBreak("/rss/channel/item"));
        }

        DataReader reader = sequenceReader;

        reader = new TransformingReader(reader)
                .add(new BasicFieldTransformer("title").lowerCase())
                .add(new Ngrams("title", "phrase", NGRAMS));

        reader = new GroupByReader(reader, "phrase")
                .setExcludeNulls(true)
                .count("count", true);

        reader = new SortingReader(reader).desc("count").asc("phrase");

        reader = new LimitReader(reader, TOP_PHRASES);

        DataWriter writer = new CSVWriter(new OutputStreamWriter(System.out))
                .setFieldNamesInFirstRow(true);

        Job.run(reader, writer);
    }

}
Code Walkthrough
- Since data is collected from multiple sources, the example loops through URLS and uses a BufferedReader to read each url. UTF-8 is the encoding used here. SequenceReader then combines all the inputs into a single stream, i.e. sequenceReader.add(new XmlRecordReader(input)). The call .addRecordBreak("/rss/channel/item") tells the reader where one record ends and the next begins; in this case each item element in a feed is treated as a separate record.
- TransformingReader is used to transform the records passing through. .add(new BasicFieldTransformer("title").lowerCase()) converts the values in the title field to lowercase.
- In .add(new Ngrams("title", "phrase", NGRAMS)), title is the source field path, phrase is the target field path, and NGRAMS is the n-gram size (3 in this example, i.e. trigrams).
- GroupByReader is used to divide the records into groups and apply summary operations to each group. It takes the reader and the field to group by, i.e. phrase. .setExcludeNulls(true) excludes any null or empty data, and .count("count", true) returns the number of times each phrase has occurred.
- SortingReader is used to sort the records. In this case desc("count") sorts the records in descending order of count, and asc("phrase") then sorts them in ascending order of phrase.
- LimitReader is used to limit the number of records sent downstream; in this case it is limited to TOP_PHRASES, which is 25.
- CSVWriter is used to write the records to a Comma Separated Value (CSV) stream, which is printed to the console via the OutputStreamWriter. .setFieldNamesInFirstRow(true) writes the field names as the first row of the CSV output.
- Data is transferred from the reader to the writer via the Job.run() method.
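The grouping, sorting, and limiting steps described above can be sketched in plain Java without the library. This is only an approximation of what GroupByReader, SortingReader, and LimitReader do together; the class name TopPhrasesSketch, the method topPhrases, and the sample phrases are hypothetical and exist purely for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of DataPipeline.
class TopPhrasesSketch {

    // Counts each phrase, orders by count descending then phrase ascending,
    // and keeps at most `limit` entries.
    static List<Map.Entry<String, Integer>> topPhrases(List<String> phrases, int limit) {
        Map<String, Integer> counts = new HashMap<>();
        for (String phrase : phrases) {
            // Skip null and empty values, in the spirit of setExcludeNulls(true).
            if (phrase == null || phrase.isEmpty()) {
                continue;
            }
            counts.merge(phrase, 1, Integer::sum);
        }

        List<Map.Entry<String, Integer>> entries = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            entries.add(new AbstractMap.SimpleEntry<>(entry.getKey(), entry.getValue()));
        }

        // Same ordering as desc("count").asc("phrase") in the walkthrough.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        // Keep only the first `limit` entries, like LimitReader.
        return new ArrayList<>(entries.subList(0, Math.min(Math.max(limit, 0), entries.size())));
    }

    public static void main(String[] args) {
        // Made-up sample phrases; a real run would use the n-grams from the feeds.
        List<String> phrases = Arrays.asList("b c", "a b", "c d", "a b", "c d", "a b", null, "a a");
        System.out.println("phrase,count");
        for (Map.Entry<String, Integer> entry : topPhrases(phrases, 3)) {
            System.out.println(entry.getKey() + "," + entry.getValue());
        }
    }
}
```

The secondary sort on phrase matters because many phrases share the same count; without it, the order of tied rows (and which of them survive the limit) would not be deterministic.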
Output file
phrase,count
texas school shooting,6
artillery to ukraine,5
texas elementary school,5
after texas school,4
at sex assault,4
conservative leadership candidates,4
frontman jacob hoggard,4
hedley frontman jacob,4
monkeypox cases in,4
sex assault trial,4
1 year since,3
2025 invictus games,3
adopts bill 96,3
against russia anand,3
airport security due,3
allegations at sex,3
artists face surrendering,3
as part of,3
at airport security,3
at kamloops residential,3
at texas elementary,3
banning huawei from,3
beads at airport,3
canada sending more,3
certainly not over,3