Extract Bigrams, Trigrams, and Ngrams

Updated: Jun 4, 2023

In this example you will learn how to use DataPipeline to extract bigrams, trigrams, and other n-grams from XML (RSS) feeds.

The input is the data downloaded from the feed links listed in the URLS array.

Because there are multiple inputs, you will use SequenceReader to combine the individual DataReaders into a single stream. It reads from each reader until that reader is empty, then moves on to the next.
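The read-until-empty behaviour described above can be pictured with plain Java iterators. This is only a minimal sketch of the idea, not DataPipeline's SequenceReader; the class name SequenceSketch and the helpers concat and drain are hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration only: this is NOT DataPipeline's SequenceReader.
public class SequenceSketch {

    /** Drains each source in order, moving to the next once the current one is empty. */
    public static <T> Iterator<T> concat(List<Iterator<T>> sources) {
        return new Iterator<T>() {
            private int index = 0;

            @Override
            public boolean hasNext() {
                // Skip exhausted sources until one has data or none are left.
                while (index < sources.size() && !sources.get(index).hasNext()) {
                    index++;
                }
                return index < sources.size();
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return sources.get(index).next();
            }
        };
    }

    /** Collects the combined stream into a list so it is easy to inspect. */
    public static <T> List<T> drain(List<Iterator<T>> sources) {
        List<T> result = new ArrayList<>();
        Iterator<T> combined = concat(sources);
        while (combined.hasNext()) {
            result.add(combined.next());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Iterator<String>> feeds = new ArrayList<>();
        feeds.add(Arrays.asList("feed1-item1", "feed1-item2").iterator());
        feeds.add(new ArrayList<String>().iterator()); // an empty feed is simply skipped
        feeds.add(Arrays.asList("feed2-item1").iterator());
        System.out.println(drain(feeds)); // [feed1-item1, feed1-item2, feed2-item1]
    }
}
```

Downstream code sees one continuous stream and never needs to know how many feeds were combined, which is why the rest of the pipeline in the listing below works on a single reader.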

Java Code listing

/*
 * Copyright (c) 2006-2022 North Concepts Inc.  All rights reserved.
 * Proprietary and Confidential.  Use is subject to license terms.
 * 
 * https://northconcepts.com/data-pipeline/licensing/
 */
package com.northconcepts.datapipeline.examples.cookbook;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.URL;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.core.LimitReader;
import com.northconcepts.datapipeline.core.SequenceReader;
import com.northconcepts.datapipeline.core.SortingReader;
import com.northconcepts.datapipeline.csv.CSVWriter;
import com.northconcepts.datapipeline.group.GroupByReader;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.transform.BasicFieldTransformer;
import com.northconcepts.datapipeline.transform.Ngrams;
import com.northconcepts.datapipeline.transform.TransformingReader;
import com.northconcepts.datapipeline.xml.XmlRecordReader;

public class ExtractBigramsTrigramsAndNgrams {
    
    private static final int NGRAMS = 3;  // bigram: 2; trigrams: 3; quadrigrams: 4;
    private static final int TOP_PHRASES = 25;
    
    private static final String[] URLS = {
            "https://rss.cbc.ca/lineup/topstories.xml",
            "https://rss.cbc.ca/lineup/world.xml",
            "https://rss.cbc.ca/lineup/canada.xml",
            "https://rss.cbc.ca/lineup/politics.xml",
            "https://rss.cbc.ca/lineup/business.xml",
            "https://rss.cbc.ca/lineup/health.xml",
            "https://rss.cbc.ca/lineup/arts.xml",
            "https://rss.cbc.ca/lineup/technology.xml",
            "https://rss.cbc.ca/lineup/offbeat.xml",
            "https://www.cbc.ca/cmlink/rss-cbcaboriginal",
            
            "https://globalnews.ca/feed/",
            "https://globalnews.ca/canada/feed/",
            "https://globalnews.ca/world/feed/",
            "https://globalnews.ca/politics/feed/",
            "https://globalnews.ca/money/feed/",
            "https://globalnews.ca/health/feed/",
            "https://globalnews.ca/entertainment/feed/",
            "https://globalnews.ca/environment/feed/",
            "https://globalnews.ca/tech/feed/",
            "https://globalnews.ca/sports/feed/",
            
            "https://www.ctvnews.ca/rss/ctvnews-ca-top-stories-public-rss-1.822009",
            "https://www.ctvnews.ca/rss/ctvnews-ca-canada-public-rss-1.822284",
            "https://www.ctvnews.ca/rss/ctvnews-ca-world-public-rss-1.822289",
            "https://www.ctvnews.ca/rss/ctvnews-ca-entertainment-public-rss-1.822292",
            "https://www.ctvnews.ca/rss/ctvnews-ca-politics-public-rss-1.822302",
            "https://www.ctvnews.ca/rss/lifestyle/ctv-news-lifestyle-1.3407722",
            "https://www.ctvnews.ca/rss/business/ctv-news-business-headlines-1.867648",
            "https://www.ctvnews.ca/rss/ctvnews-ca-sci-tech-public-rss-1.822295",
            "https://www.ctvnews.ca/rss/sports/ctv-news-sports-1.3407726",
            "https://www.ctvnews.ca/rss/ctvnews-ca-health-public-rss-1.822299",
            "https://www.ctvnews.ca/rss/autos/ctv-news-autos-1.867636",
            };

    public static void main(String[] args) throws Throwable {
        
        SequenceReader sequenceReader = new SequenceReader();
        
        for (String url : URLS) {
            BufferedReader input = new BufferedReader(new InputStreamReader(new URL(url).openStream(), "UTF-8"));
            sequenceReader.add(new XmlRecordReader(input).addRecordBreak("/rss/channel/item"));
        }
        
        DataReader reader = sequenceReader;
        
        reader = new TransformingReader(reader)
                .add(new BasicFieldTransformer("title").lowerCase())
                .add(new Ngrams("title", "phrase", NGRAMS));
        
        reader = new GroupByReader(reader, "phrase")
                .setExcludeNulls(true)
                .count("count", true);
        
        reader = new SortingReader(reader).desc("count").asc("phrase");
        
        reader = new LimitReader(reader, TOP_PHRASES);
        
        DataWriter writer = new CSVWriter(new OutputStreamWriter(System.out))
                .setFieldNamesInFirstRow(true);
   
        Job.run(reader, writer);
        
    }
    
}

Code Walkthrough

  1. Since data is collected from multiple sources, the code loops through URLS and opens each URL with a BufferedReader, using UTF-8 as the character encoding.
  2. SequenceReader is then used to combine all the inputs into a single stream, i.e. sequenceReader.add(new XmlRecordReader(input)).
  3. addRecordBreak("/rss/channel/item") is used to separate the records; in this case, a record break is added after each item element.
  4. TransformingReader is then used to transform the records passing through. .add(new BasicFieldTransformer("title").lowerCase()) converts the values in the title field to lowercase.
  5. In .add(new Ngrams("title", "phrase", NGRAMS)), title is the source field path, phrase is the target field path, and NGRAMS is the n-gram size, i.e. the number of words per phrase (3 in this example).
  6. GroupByReader is used to divide the records into groups and apply summary operations to each group. It takes the reader and the field to group by, i.e. phrase.
  7. .setExcludeNulls(true) excludes any null or empty data, and .count("count", true) adds a count field holding the number of times each phrase occurred.
  8. SortingReader is used to sort the records. In this case desc("count") sorts the records in descending order by count, and asc("phrase") then sorts records with equal counts in ascending order by phrase.
  9. LimitReader is used to limit the number of records sent downstream; in this case it is limited to 25 (TOP_PHRASES).
  10. CSVWriter is used to write the records to a Comma Separated Value (CSV) stream and then print it to the console via the OutputStreamWriter.
  11. .setFieldNamesInFirstRow(true) is used to write the field names as the first row of the CSV output.
  12. Data is transferred from the reader to the writer via the Job.run() method.
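
The lowercase, n-gram, group, sort, and limit steps in the walkthrough can be approximated in plain Java to show what each stage contributes. This is a rough sketch under stated assumptions, not how DataPipeline implements Ngrams or GroupByReader: the class NgramCountSketch, its methods ngrams and topPhrases, the sample titles, and the whitespace-only tokenization are all made up for illustration. How the library itself splits words or treats punctuation is not covered here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration only: this does NOT use or reproduce the DataPipeline API.
public class NgramCountSketch {

    /** Slides a window of n words over the lower-cased text (assumes whitespace-separated words). */
    public static List<String> ngrams(String text, int n) {
        List<String> phrases = new ArrayList<>();
        if (text == null || text.trim().isEmpty() || n < 1) {
            return phrases; // nothing to extract
        }
        String[] words = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        for (int i = 0; i + n <= words.length; i++) {
            phrases.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return phrases;
    }

    /** Counts every phrase, sorts by count (descending) then phrase (ascending), keeps the first rows. */
    public static List<String> topPhrases(List<String> titles, int n, int limit) {
        // Group identical phrases and count them.
        Map<String, Long> counts = titles.stream()
                .flatMap(title -> ngrams(title, n).stream())
                .collect(Collectors.groupingBy(phrase -> phrase, Collectors.counting()));

        return counts.entrySet().stream()
                .sorted((a, b) -> {
                    int byCount = Long.compare(b.getValue(), a.getValue()); // higher counts first
                    return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey()); // ties: alphabetical
                })
                .limit(limit)
                .map(e -> e.getKey() + "," + e.getValue()) // "phrase,count" rows, CSV-like
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up titles, not taken from any feed.
        List<String> titles = Arrays.asList(
                "Bank raises rates again",
                "Central bank raises rates",
                "Bank raises fees");
        System.out.println("phrase,count");
        for (String row : topPhrases(titles, 2, 3)) {
            System.out.println(row);
        }
        // Prints: bank raises,3 / raises rates,2 / central bank,1 (one row per line)
    }
}
```

Sorting on count first and phrase second makes the result deterministic when several phrases share the same count, which matters here because the limit step would otherwise cut an arbitrary subset of the tied phrases.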

Console output

phrase,count
texas school shooting,6
artillery to ukraine,5
texas elementary school,5
after texas school,4
at sex assault,4
conservative leadership candidates,4
frontman jacob hoggard,4
hedley frontman jacob,4
monkeypox cases in,4
sex assault trial,4
1 year since,3
2025 invictus games,3
adopts bill 96,3
against russia anand,3
airport security due,3
allegations at sex,3
artists face surrendering,3
as part of,3
at airport security,3
at kamloops residential,3
at texas elementary,3
banning huawei from,3
beads at airport,3
canada sending more,3
certainly not over,3