Extract Bigrams, Trigrams, and Ngrams

Updated: Aug 10, 2023

This example extracts bigrams, trigrams, and other n-grams (sequences of n consecutive words) from XML files, in this case the item titles of RSS news feeds. Counting these n-grams shows which word combinations occur most often, which is a useful starting point for co-occurrence analysis and other text analysis tasks.
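To make the term concrete, here is a minimal plain-Java sketch of word-level n-gram extraction. It is not how the library's Ngrams transformer is implemented; the class name NgramSketch, the ngrams method, and the whitespace-only tokenization are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Data Pipeline.
class NgramSketch {

    // Returns every run of n consecutive words in text, in order of appearance.
    // Assumption: words are separated by whitespace and punctuation is left untouched.
    static List<String> ngrams(String text, int n) {
        List<String> result = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty() || n < 1) {
            return result;
        }
        String[] words = trimmed.split("\\s+");
        // Slide a window of n words across the text, one word at a time.
        for (int i = 0; i + n <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [texas school, school shooting, shooting today]
        System.out.println(ngrams("texas school shooting today", 2));
        // prints [texas school shooting, school shooting today]
        System.out.println(ngrams("texas school shooting today", 3));
    }
}
```

A title with fewer than n words yields no n-grams in this sketch, so short titles simply contribute nothing to the counts.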

 

Java Code Listing

package com.northconcepts.datapipeline.examples.cookbook;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.URL;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.core.LimitReader;
import com.northconcepts.datapipeline.core.SequenceReader;
import com.northconcepts.datapipeline.core.SortingReader;
import com.northconcepts.datapipeline.csv.CSVWriter;
import com.northconcepts.datapipeline.group.GroupByReader;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.transform.BasicFieldTransformer;
import com.northconcepts.datapipeline.transform.Ngrams;
import com.northconcepts.datapipeline.transform.TransformingReader;
import com.northconcepts.datapipeline.xml.XmlRecordReader;

public class ExtractBigramsTrigramsAndNgrams {
    
    private static final int NGRAMS = 3;  // words per phrase: 2 = bigrams, 3 = trigrams, 4 = 4-grams
    private static final int TOP_PHRASES = 25;
    
    private static final String[] URLS = {
            "https://rss.cbc.ca/lineup/topstories.xml",
            "https://rss.cbc.ca/lineup/world.xml",
            "https://rss.cbc.ca/lineup/canada.xml",
            "https://rss.cbc.ca/lineup/politics.xml",
            "https://rss.cbc.ca/lineup/business.xml",
            "https://rss.cbc.ca/lineup/health.xml",
            "https://rss.cbc.ca/lineup/arts.xml",
            "https://rss.cbc.ca/lineup/technology.xml",
            "https://rss.cbc.ca/lineup/offbeat.xml",
            "https://www.cbc.ca/cmlink/rss-cbcaboriginal",
            
            "https://globalnews.ca/feed/",
            "https://globalnews.ca/canada/feed/",
            "https://globalnews.ca/world/feed/",
            "https://globalnews.ca/politics/feed/",
            "https://globalnews.ca/money/feed/",
            "https://globalnews.ca/health/feed/",
            "https://globalnews.ca/entertainment/feed/",
            "https://globalnews.ca/environment/feed/",
            "https://globalnews.ca/tech/feed/",
            "https://globalnews.ca/sports/feed/",
            
            "https://www.ctvnews.ca/rss/ctvnews-ca-top-stories-public-rss-1.822009",
            "https://www.ctvnews.ca/rss/ctvnews-ca-canada-public-rss-1.822284",
            "https://www.ctvnews.ca/rss/ctvnews-ca-world-public-rss-1.822289",
            "https://www.ctvnews.ca/rss/ctvnews-ca-entertainment-public-rss-1.822292",
            "https://www.ctvnews.ca/rss/ctvnews-ca-politics-public-rss-1.822302",
            "https://www.ctvnews.ca/rss/lifestyle/ctv-news-lifestyle-1.3407722",
            "https://www.ctvnews.ca/rss/business/ctv-news-business-headlines-1.867648",
            "https://www.ctvnews.ca/rss/ctvnews-ca-sci-tech-public-rss-1.822295",
            "https://www.ctvnews.ca/rss/sports/ctv-news-sports-1.3407726",
            "https://www.ctvnews.ca/rss/ctvnews-ca-health-public-rss-1.822299",
            "https://www.ctvnews.ca/rss/autos/ctv-news-autos-1.867636",
            };

    public static void main(String[] args) throws Throwable {
        
        SequenceReader sequenceReader = new SequenceReader();
        
        for (String url : URLS) {
            BufferedReader input = new BufferedReader(new InputStreamReader(new URL(url).openStream(), "UTF-8"));
            sequenceReader.add(new XmlRecordReader(input).addRecordBreak("/rss/channel/item"));
        }
        
        DataReader reader = sequenceReader;
        
        reader = new TransformingReader(reader)
                .add(new BasicFieldTransformer("title").lowerCase())
                .add(new Ngrams("title", "phrase", NGRAMS));
        
        reader = new GroupByReader(reader, "phrase")
                .setExcludeNulls(true)
                .count("count", true);
        
        reader = new SortingReader(reader).desc("count").asc("phrase");
        
        reader = new LimitReader(reader, TOP_PHRASES);
        
        DataWriter writer = new CSVWriter(new OutputStreamWriter(System.out))
                .setFieldNamesInFirstRow(true);
   
        Job.run(reader, writer);
        
    }
    
}

 

Code Walkthrough

  1. Because the data comes from multiple sources, the code loops through URLS and opens each URL with a BufferedReader, decoding the response as UTF-8.
  2. SequenceReader combines all the inputs into a single stream: each feed is added with sequenceReader.add(new XmlRecordReader(input)).
  3. addRecordBreak("/rss/channel/item") tells XmlRecordReader where to separate records. Here, a record break is added after each item element, so each feed item becomes one record.
  4. TransformingReader transforms the records passing through it. .add(new BasicFieldTransformer("title").lowerCase()) converts the values in the title field to lowercase.
  5. In .add(new Ngrams("title", "phrase", NGRAMS)), title is the source field path, phrase is the target field path, and NGRAMS is the number of words per n-gram (3 in this example).
  6. GroupByReader divides the records into groups and applies summary operations to each group. It takes the upstream reader and the field to group by, in this case phrase.
  7. .setExcludeNulls(true) excludes null or empty data, and .count("count", true) adds a count field holding the number of times each phrase occurred.
  8. SortingReader sorts the records. desc("count") orders them by count in descending order, and asc("phrase") breaks ties by phrase in ascending order.
  9. LimitReader limits the number of records sent downstream, here to TOP_PHRASES (25).
  10. CSVWriter writes the records as comma-separated values (CSV). Wrapping System.out in an OutputStreamWriter prints them to the console.
  11. .setFieldNamesInFirstRow(true) writes the field names as the first row of the CSV output.
  12. Job.run() transfers the data from the reader to the writer.
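The grouping, counting, sorting, and limiting steps can be pictured with a plain-Java sketch over an in-memory list of phrases. This only mirrors the logic of those steps and is not the library's implementation; the class name TopPhrasesSketch, the topPhrases method, and the sample phrases are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not part of Data Pipeline.
class TopPhrasesSketch {

    // Counts each phrase, orders by count (descending) then phrase (ascending),
    // and keeps at most `limit` entries.
    static List<Map.Entry<String, Long>> topPhrases(List<String> phrases, int limit) {
        // Skip null and empty phrases, then group identical phrases and count them.
        Map<String, Long> counts = phrases.stream()
                .filter(p -> p != null && !p.isEmpty())
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        return counts.entrySet().stream()
                .sorted((a, b) -> {
                    // Higher counts first; ties are broken alphabetically by phrase.
                    int byCount = Long.compare(b.getValue(), a.getValue());
                    return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
                })
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample input, including a null and an empty phrase to be skipped.
        List<String> phrases = Arrays.asList(
                "sex assault trial", "texas school shooting", "as part of",
                "texas school shooting", null, "sex assault trial", "");

        System.out.println("phrase,count");
        for (Map.Entry<String, Long> entry : topPhrases(phrases, 2)) {
            System.out.println(entry.getKey() + "," + entry.getValue());
        }
    }
}
```

With the sample input, two phrases tie at a count of 2, so the alphabetical tie-break puts "sex assault trial" ahead of "texas school shooting", and the limit of 2 drops "as part of".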

 

Console Output

phrase,count
texas school shooting,6
artillery to ukraine,5
texas elementary school,5
after texas school,4
at sex assault,4
conservative leadership candidates,4
frontman jacob hoggard,4
hedley frontman jacob,4
monkeypox cases in,4
sex assault trial,4
1 year since,3
2025 invictus games,3
adopts bill 96,3
against russia anand,3
airport security due,3
allegations at sex,3
artists face surrendering,3
as part of,3
at airport security,3
at kamloops residential,3
at texas elementary,3
banning huawei from,3
beads at airport,3
canada sending more,3
certainly not over,3