The title "computational linguistics" should be read as follows:

  1. Computational linguistics concerns the intersection of linguistics and computer science, studying language with digital age tools.
  2. Theoretical linguistics is a young discipline with fierce internal disputes, so I am better off leaving pure linguistics alone.

Methods used in computational linguistics include statistics and modeling.

Natural Language

Linguistics is the study of natural language. Computational techniques applied to natural languages are called natural language processing.

Hierarchy of language representation:

  1. Morphology: morpheme (语素), lexeme (词位), stem (词干), word forms
    1. Root (词根): the fundamental semantic morpheme
    2. Derivation (派生): variation of word class
    3. Inflection (屈折): obligatory variation for syntactic agreement and semantic grounding
  2. Syntax: sentence structure
  3. Semantics: meaning
  4. Pragmatics: intent

Lexemes are meaningful linguistic units free of inflection. A lemma, or citation form, is the canonical form of a lexeme chosen by convention; it is used in dictionaries as the headword of an entry (e.g. run). Word forms are all the inflected variants of a lexeme (e.g. run, runs, ran and running). A related concept is the stem: the part of a word that remains once inflectional affixes are removed, consisting only of the root morpheme and derivational morphemes.
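The distinction between lemma, word form, and stem can be sketched in a few lines. This is a minimal illustration, not a real morphological analyzer: the lexicon entries, the suffix list, and the function names `lemmatize` and `naive_stem` are all assumptions made for the example.

```python
# Toy lexicon: each lemma (citation form) maps to the word forms of its lexeme.
# The entries are hand-picked examples, not data from a real dictionary.
LEXICON = {
    "run": {"run", "runs", "ran", "running"},
    "walk": {"walk", "walks", "walked", "walking"},
}

# Invert the lexicon so that any word form can be looked up directly.
FORM_TO_LEMMA = {form: lemma for lemma, forms in LEXICON.items() for form in forms}


def lemmatize(word_form: str) -> str:
    """Return the lemma of a known word form, or the input unchanged."""
    return FORM_TO_LEMMA.get(word_form.lower(), word_form)


def naive_stem(word_form: str) -> str:
    """Strip one regular inflectional suffix; irregular forms are left alone."""
    for suffix in ("ing", "ed", "s"):
        if word_form.endswith(suffix) and len(word_form) > len(suffix) + 2:
            return word_form[: -len(suffix)]
    return word_form
```

Lemmatization relies on a lookup, so `lemmatize("ran")` gives `"run"`, whereas suffix stripping leaves the irregular form `"ran"` untouched and turns `"running"` into `"runn"` because consonant doubling is not handled. Practical stemmers add many more rules to cover such cases.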

Tokens (单词) of a text are the symbolic units that compose it, e.g. words, punctuation, numbers, etc.
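As a rough sketch of tokenization, a single regular expression can split English text into words, numbers, and punctuation. The pattern and the name `tokenize` are assumptions for illustration; real tokenizers handle many more cases (abbreviations, URLs, scripts without spaces).

```python
import re

# One token is either a number (with an optional decimal part), a word
# (letters, possibly with one internal apostrophe), or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\d+(?:\.\d+)?|[^\W\d_]+(?:'[^\W\d_]+)?|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Split text into word, number, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)
```

For example, `tokenize("Dr. Smith paid 3.50 dollars, didn't he?")` returns `['Dr', '.', 'Smith', 'paid', '3.50', 'dollars', ',', "didn't", 'he', '?']`. Note that the abbreviation "Dr." is split into two tokens, which shows why tokenization is not purely mechanical.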

Corpus (text corpus, 语料库): a large and structured set of texts, usually electronically stored and processed. Structured here means that the texts are organized and typically annotated, e.g. tokenized, tagged, or parsed.

Spoken corpus (speech corpus 口语语料库): a database of speech audio files and text transcriptions.

Syntax

Syntax is the form.

Syntax (语法学) refers to the way in which words are put together to form phrases, clauses, or sentences.

A grammar (语法) is a compact characterization of a potentially infinite set of sentences.

Syntax is generally distinguished into three levels:

  1. Lexical level (词法): determining how characters form tokens;
  2. Phrase level (句法): determining how tokens form a hierarchical tree;
  3. Context level (上下文): determining name reference and checking types.

Syntactic analysis correspondingly consists of three phases:

  1. In lexical analysis, a lexer turns a sequence of characters into a sequence of tokens;
  2. In phrase analysis, a parser turns the linear sequence of tokens into a hierarchical syntax tree;
    1. Parse Tree (aka Concrete Syntax Tree)
    2. Abstract Syntax Tree (AST)
  3. Contextual analysis is generally implemented manually, where name resolution and type checking are implemented via a symbol table.
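The first two phases can be sketched for a tiny arithmetic language. Everything here is an assumption chosen for illustration: the token kinds, the grammar in the docstring, and the tuple-based tree shape. The parser is a hand-written recursive-descent parser, one common approach among several.

```python
import re

TOKEN_SPEC = [("NUM", r"\d+"), ("OP", r"[+*()]"), ("SKIP", r"\s+")]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def lex(source):
    """Lexical analysis: characters -> list of (kind, text) tokens."""
    tokens, pos = [], 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        if match.lastgroup != "SKIP":  # whitespace separates tokens but is dropped
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def parse(tokens):
    """Phrase analysis: tokens -> abstract syntax tree of nested tuples.

    Hypothetical grammar:
        expr   -> term ('+' term)*
        term   -> factor ('*' factor)*
        factor -> NUM | '(' expr ')'
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else ("EOF", "")

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def expr():
        node = term()
        while peek() == ("OP", "+"):
            advance()
            node = ("+", node, term())
        return node

    def term():
        node = factor()
        while peek() == ("OP", "*"):
            advance()
            node = ("*", node, factor())
        return node

    def factor():
        kind, text = advance()
        if kind == "NUM":
            return ("num", int(text))
        if (kind, text) == ("OP", "("):
            node = expr()
            if advance() != ("OP", ")"):
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {text!r}")

    tree = expr()
    if peek()[0] != "EOF":
        raise SyntaxError(f"unexpected token {peek()[1]!r}")
    return tree
```

`parse(lex("(1+2)*3"))` yields `('*', ('+', ('num', 1), ('num', 2)), ('num', 3))`. The parentheses appear among the tokens but not in the result, which is the difference between a parse tree (which would keep them) and an abstract syntax tree.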

Semantics

Semantics is the meaning.

Semantics (语义学) is the study of the rule systems that determine the literal meaning of a sentence. The premise here is that:

  1. Abstract message is coded into text, in a chosen language.
  2. If the text is syntactically legal, it can be decoded into a unique message, whatever the original message is.

Types of semantics:

  1. Lexical Semantics: study of the meaning of words.
  2. Phraseology: fixed word combinations.

Ambiguity:

  1. Semantic ambiguity: a sentence has multiple meanings due to lexical ambiguity, even after its syntax has been resolved.
  2. Anaphoric ambiguity: A phrase or word refers to something previously mentioned, but there is more than one possibility.
  3. Non-literal speech: using figures of speech.

Lexical ambiguity: a token has multiple meanings. While most tokens in any natural language have multiple meanings, the context in which an ambiguous word is used often makes it evident which of the meanings is intended.

Word sense disambiguation: automatically associate the appropriate meaning with a word in context by algorithmic methods.
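One simple family of algorithmic methods compares the context with dictionary definitions, in the spirit of the Lesk algorithm. The sketch below is a simplified gloss-overlap variant; the sense inventory, the stopword list, and the function names are made up for the example and do not come from any real lexical resource.

```python
# Hypothetical mini sense inventory: sense label -> gloss (definition).
SENSES = {
    "bank": {
        "bank/finance": "an institution that accepts deposits and lends money",
        "bank/river": "the sloping land beside a river or lake",
    }
}

STOPWORDS = {"a", "an", "the", "of", "or", "and", "that", "to", "at", "on", "in"}


def content_words(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def simplified_lesk(word, context):
    """Pick the sense whose gloss shares the most content words with the context."""
    context_words = content_words(context)
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(content_words(gloss) & context_words)
        if overlap > best_overlap:  # ties keep the earlier sense
            best_sense, best_overlap = sense, overlap
    return best_sense
```

With this inventory, `simplified_lesk("bank", "she went to the bank to deposit money")` selects `"bank/finance"` (shared word: money), and `simplified_lesk("bank", "they fished from the bank of the river")` selects `"bank/river"` (shared word: river). The method fails whenever context and gloss happen to share no words, which is its main weakness.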

Syntactic ambiguity: a sentence has multiple parse trees (sentence structures). Only rewriting the sentence or adding appropriate punctuation can resolve a syntactic ambiguity.

Common sources of syntactic ambiguity in English:

  1. Phrase Attachment: a modifying phrase can be attached to multiple parts of a sentence.
  2. Noun Group: multiple nouns acting as a single noun.
  3. Conjunction: parallel branches.
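Phrase attachment ambiguity can be made concrete by counting parse trees. The sketch below uses a CYK-style chart over a small grammar in Chomsky normal form; the grammar, the lexicon, and the name `count_parses` are assumptions for illustration and cover only the example sentence.

```python
from collections import defaultdict

# Binary rules as (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # the PP modifies the verb phrase
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # the PP modifies the noun phrase
    ("PP", "P", "NP"),
]
LEXICON = {
    "I": ["NP"], "saw": ["V"], "the": ["Det"],
    "man": ["N"], "with": ["P"], "telescope": ["N"],
}


def count_parses(words, start="S"):
    """Count the distinct parse trees that derive the word list from `start`."""
    n = len(words)
    # chart[i][j][X] = number of trees deriving words[i:j] from category X
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for category in LEXICON.get(word, []):
            chart[i][i + 1][category] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two children
                for parent, left, right in BINARY_RULES:
                    chart[i][j][parent] += chart[i][k][left] * chart[k][j][right]
    return chart[0][n][start]
```

`count_parses("I saw the man with the telescope".split())` returns 2: "with the telescope" attaches either to the verb phrase (the seeing was done with a telescope) or to the noun phrase (the man has the telescope). Without the prepositional phrase, `count_parses("I saw the man".split())` returns 1.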

Notes on Ambiguity by Prof. Ernest Davis

Spoken language can contain many more types of ambiguities, where there is more than one way to compose a set of sounds into words. Such ambiguity is generally resolved according to the context.

Pragmatics

Pragmatics (语用学) is the study of the use of natural language in communication; more generally, the study of the relations between languages and their users.

Formal Language

In mathematics, computer science, and linguistics, a formal language is a set of strings of symbols that may be constrained by rules that are specific to it.

The alphabet of a formal language is the set of symbols, letters, or tokens from which the strings of the language may be formed; frequently it is required to be finite.

A formal grammar, aka generative grammar, is a set of production rules that prescribe how to form the strings of a formal language from its alphabet.

The Chomsky Hierarchy of Formal Grammars:

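Hierarchy | Grammar                    | Production rules
Type-0    | Unrestricted grammars      | α → β
Type-1    | Context-sensitive grammars | αAβ → αγβ
Type-2    | Context-free grammars      | A → γ
Type-3    | Regular grammars           | A → a and A → aB

Here A and B are nonterminals, a is a terminal, and α, β, γ are strings of terminals and nonterminals.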
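The gap between Type-3 and Type-2 can be illustrated with two small recognizers. The languages a*b* and aⁿbⁿ are standard textbook examples rather than anything specific to these notes, and the function names are assumptions. The first language is regular, so a regular expression suffices; the second is context-free but not regular, because matching the counts requires unbounded memory.

```python
import re

# Type-3 (regular): the language a*b* is recognized by a finite automaton,
# written here as a regular expression.
A_STAR_B_STAR = re.compile(r"a*b*")


def is_a_star_b_star(s):
    """True if s consists of any number of a's followed by any number of b's."""
    return A_STAR_B_STAR.fullmatch(s) is not None


def is_anbn(s):
    """Type-2 (context-free): S -> a S b | empty generates n a's then n b's."""
    if s == "":
        return True                 # S -> empty
    if s[0] == "a" and s[-1] == "b":
        return is_anbn(s[1:-1])     # S -> a S b
    return False
```

The string "aab" is accepted by `is_a_star_b_star` but rejected by `is_anbn`, while "aabb" is accepted by both. The recursion in `is_anbn` mirrors the self-embedding rule S → aSb, which has no counterpart among the rule shapes allowed in a regular grammar.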

Computer Language

The syntax (语法) of a computer language is the set of rules that defines the combinations of symbols that are considered to be a correctly structured document or fragment in that language.

The syntax of computer languages generally corresponds to the Chomsky hierarchy of formal grammars:

  1. Words are in a regular language, specified in the lexical grammar, which is a Type-3 grammar, generally given as regular expressions.
  2. Phrases are in a context-free language (CFL), generally a deterministic context-free language (DCFL), specified in a phrase structure grammar, which is a Type-2 grammar, generally given as production rules in Backus–Naur Form (BNF).
  3. Contextual structure can in principle be described by a context-sensitive grammar, and automatically analyzed by means such as attribute grammars.
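The contextual level is where rules such as "a variable must be declared before use" live; this is a classic example of a constraint that context-free grammars are not suited to express. A minimal sketch of checking it with a symbol table follows. The tuple-based tree shape, the statement kinds, and the name `check_names` are hypothetical.

```python
def check_names(program):
    """Return error messages for uses of undeclared variables.

    `program` is a list of statements in a hypothetical tuple-based AST:
      ("let", name, expr) declares a variable; ("print", expr) uses an expression.
    Expressions are ("num", value), ("var", name), or ("+", left, right).
    """
    symbols = set()   # the symbol table: names declared so far
    errors = []

    def visit(expr):
        kind = expr[0]
        if kind == "var" and expr[1] not in symbols:
            errors.append(f"undeclared variable: {expr[1]}")
        elif kind == "+":
            visit(expr[1])
            visit(expr[2])

    for statement in program:
        if statement[0] == "let":
            _, name, expr = statement
            visit(expr)          # checked before the name itself is declared
            symbols.add(name)
        elif statement[0] == "print":
            visit(statement[1])
    return errors
```

For the program `[("let", "x", ("num", 1)), ("print", ("+", ("var", "x"), ("var", "y")))]` the check returns `['undeclared variable: y']`. Both statements are well-formed at the phrase level; the error only becomes visible once declarations are tracked across statements.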

The semantics (语义) of a computer language is the set of rules for interpreting syntactically legal source code as its object. For programming and query languages, the object is a sequence of instructions; for markup and style sheet languages, the object is a document.

Formal semantics (形式语义) of programming languages:

  1. Operational semantics (操作语义): constructing logical statements that represent the computational steps a source code induces.
    1. Small-step Semantics
      • Structural Operational Semantics: formally describes how the individual steps of a program take place in a computing system.
      • Reduction semantics
    2. Big-step Semantics
      • Natural Operational Semantics: formally describes how the overall results of executions are obtained.
  2. Denotational semantics (指称语义): constructing the mathematical objects (aka denotations) that represent what source code does. The mathematical objects are typically continuous functions between domains.
    • Tenet: Semantics should be compositional. That is, the denotation of a program phrase should be built out of the denotations of its subphrases.
  3. Axiomatic semantics (公理语义): describing how a program section affects assertions about the program state. Assertions are predicates over the variables that define the program state. This gives only a partial description of a computation.
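The contrast between small-step and big-step operational semantics can be sketched for a toy expression language. The expression encoding (an int, or a tuple of operator and two operands), the evaluation order, and the function names are assumptions for illustration only.

```python
# Expressions: an int literal, or a tuple (op, left, right) with op in {"+", "*"}.
OPS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}


def big_step(expr):
    """Big-step: relate an expression directly to its final value."""
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    return OPS[op](big_step(left), big_step(right))


def small_step(expr):
    """Small-step: perform exactly one reduction, left operand first."""
    op, left, right = expr
    if not isinstance(left, int):
        return (op, small_step(left), right)
    if not isinstance(right, int):
        return (op, left, small_step(right))
    return OPS[op](left, right)


def run_small_steps(expr):
    """Iterate small_step until a value is reached; return the whole trace."""
    trace = [expr]
    while not isinstance(trace[-1], int):
        trace.append(small_step(trace[-1]))
    return trace
```

For `("+", ("*", 2, 3), ("+", 1, 4))`, `big_step` returns 11 in one go, while `run_small_steps` produces the trace `[('+', ('*', 2, 3), ('+', 1, 4)), ('+', 6, ('+', 1, 4)), ('+', 6, 5), 11]`. Note that `big_step` builds the value of an expression from the values of its subexpressions, which is also the compositional shape that the denotational tenet asks for.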