Language Models

Bag-of-words Model

The bag-of-words model represents a text as the bag (multiset) of its words, disregarding grammar and word order but keeping multiplicity.

Data Processing Pipeline:

  1. Tokenize the text
  2. Remove stop words: very frequent words of a language that carry little information (e.g. "the", "of").
  3. Stem words
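  4. Compute tf-idf (term frequency-inverse document frequency): a score of how important a word is to one document in a corpus. It grows with the word's frequency in that document and shrinks with the number of documents in the corpus that contain the word.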
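
The four pipeline steps can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the stop-word list, the naive_stem helper (a crude stand-in for a real stemmer such as Porter's), and the tf * log(N / df) weighting are all assumptions chosen for brevity, and other tf-idf variants exist.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "are", "on", "of", "and", "in", "to"}


def tokenize(text):
    """Step 1: lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(word):
    """Step 3: crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text):
    """Steps 1-3: tokenize, drop stop words, stem, and count multiplicities."""
    return Counter(naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS)


def tf_idf(corpus):
    """Step 4: weight each term by tf * log(N / df), one dict per document."""
    bags = [bag_of_words(doc) for doc in corpus]
    n_docs = len(bags)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for bag in bags for term in bag)
    scores = []
    for bag in bags:
        total = sum(bag.values())
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return scores


corpus = [
    "The cat sat on the mat",
    "The dog chased the cat",
    "Dogs and cats are pets",
]
weights = tf_idf(corpus)
```

With this weighting, a term that occurs in every document (here "cat") gets a score of zero, while a term confined to a single document (here "sat") scores highest within that document.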

N-gram Language Models

An n-gram is a contiguous sequence of n items from a given sequence of text or speech.

An n-gram language model is a probabilistic model that assigns a probability to each word conditioned on the previous n-1 words.
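
As a minimal sketch of this definition, the following code estimates the conditional probabilities by relative frequency (maximum likelihood) from a toy corpus. The function names, the "<s>" and "</s>" padding symbols, and the whitespace tokenization are assumptions made for illustration; practical models also apply smoothing, which is omitted here.

```python
from collections import Counter, defaultdict


def train_ngram_model(sentences, n=2):
    """Count how often each word follows each context of n-1 words."""
    context_counts = defaultdict(Counter)
    for sentence in sentences:
        # Pad so that the first real word also has a full context.
        tokens = ["<s>"] * (n - 1) + sentence.lower().split() + ["</s>"]
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1 : i])
            context_counts[context][tokens[i]] += 1
    return context_counts


def probability(model, context, word):
    """Relative-frequency estimate of P(word | context); 0.0 if unseen."""
    counts = model.get(tuple(context))
    if not counts:
        return 0.0
    return counts[word] / sum(counts.values())


sentences = ["the cat sat", "the cat ran", "the dog sat"]
model = train_ngram_model(sentences, n=2)
p_cat = probability(model, ["the"], "cat")  # "cat" follows "the" in 2 of 3 cases
```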


Daily Applications:

  • Text Classification: Spam Filtering
  • Computer-assisted Reviewing
    • Grammar Checker
    • Plagiarism Detection
  • Machine Translation
  • Automatic Identification and Data Capture (AIDC)
    • Optical Character Recognition (OCR)
    • Speech Recognition
  • Natural Language User Interface
    • Answer Engine
    • Personal Assistant

Text Analytics:

  • Text Segmentation
    • Sentence Segmentation: a sentence detector treats the longest whitespace-trimmed character sequence between two punctuation marks as a sentence.
    • Tokenization: Tokenizers segment an input character sequence into tokens.
  • Stemming: the process of reducing words to their root form, e.g. with Porter's stemming algorithm for English.
  • Text Classification
    • Part-of-Speech (POS) Tagging: a part-of-speech tagger marks each token with its word type, based on the token itself and its context.
    • Document Classification: a document categorizer classifies text into pre-defined categories, e.g. for sentiment analysis.
  • Syntactic Analysis
    • Chunking (Shallow Parsing): divides a text into syntactically correlated groups of words, such as noun groups and verb groups, but specifies neither their internal structure nor their role in the main sentence. It can be used for Named Entity Recognition (NER): named entities are definite noun phrases that refer to specific types of individuals, such as organizations, persons, and dates.
    • Parsing (Syntactic parsing): A parser is a procedure for finding one or more trees corresponding to a grammatically well-formed sentence.
  • Semantic Analysis
    • Semantic Role Labeling (Shallow Semantic Parsing)
    • Terminology Extraction
    • Relation Extraction
    • Coreference Resolution
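
The two text segmentation steps listed above can be illustrated with a short rule-based sketch. The regular expressions and function names are assumptions chosen for simplicity; in particular, this naive sentence rule breaks on abbreviations such as "e.g.", which is why practical sentence detectors are usually trained on data.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace; trim the ends."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


text = "NLTK is written in Python. Is it open source? Yes, it is!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
```

Here the text is split into three sentences, and the last one is tokenized into "Yes", ",", "it", "is", "!".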

Automatic Summarization: Sentence Extraction, Text Simplification.


Natural Language Toolkit (NLTK)

Natural Language Toolkit (NLTK) is an open-source library that provides a platform for building natural language processing programs in Python. An introductory handbook for NLTK is Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O'Reilly Media Inc.
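
A small sketch of what working with NLTK can look like, assuming the nltk package is installed. It deliberately uses only components that need no extra corpus downloads (wordpunct_tokenize, PorterStemmer, and the ngrams helper); the sample sentence is made up for illustration.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize
from nltk.util import ngrams

text = "The cats are running."

# Regex-based tokenizer: splits into alphanumeric runs and punctuation runs.
tokens = wordpunct_tokenize(text.lower())

# Porter's stemming algorithm for English.
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

# Contiguous token pairs (2-grams) as tuples.
bigrams = list(ngrams(tokens, 2))
```

Other NLTK components, such as the trained sentence tokenizer and the POS tagger, additionally require model data to be fetched with nltk.download.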

National Centre for Text Mining (NaCTeM)

Text mining tools:

  • TerMine: automatic term recognition, giving a reader a quick way to pick out articles of potential interest from a large body of text.
  • AcroMine: recognizes acronym definitions in a text collection.
  • KLEIO: learns term variation patterns automatically.
  • ASSERT: provides text mining services that support the production of systematic reviews, especially in the social sciences, although the techniques are applicable to other domains.
