The term "computational linguistics" is interpreted here as follows:
Methods used in computational linguistics include statistics and modeling.
Linguistics is the study of natural language. Computational techniques applied to natural languages are called Natural Language Processing.
Hierarchy of language representation:
Lexemes are meaningful linguistic units free of inflection. A lemma, or citation form, is the canonical form of a lexeme chosen by convention; it is used in dictionaries as the headword of an entry (e.g. run). Word forms are all the inflected variations of a lexeme (e.g. run, runs, ran and running). A concept related to the lexeme is the stem: the part of a word that remains once inflectional affixes are removed, consisting only of the root morpheme and any derivational morphemes.
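As a minimal sketch of the lemma and word form relationship, the snippet below maps word forms back to their lemma by table lookup. The `LEXICON` table and the `lemmatize` helper are hypothetical names made up for this illustration; real lemmatizers rely on much larger dictionaries and morphological rules.

```python
# Toy lexicon (an assumption for illustration): lemma -> its word forms.
LEXICON = {
    "run": {"run", "runs", "ran", "running"},
    "be": {"be", "am", "is", "are", "was", "were", "been", "being"},
}

# Invert the lexicon so that each word form points back to its lemma.
FORM_TO_LEMMA = {
    form: lemma for lemma, forms in LEXICON.items() for form in forms
}


def lemmatize(word_form):
    """Return the lemma of a known word form, or the input itself if unknown."""
    return FORM_TO_LEMMA.get(word_form.lower(), word_form)


print(lemmatize("ran"))      # irregular form resolved by lookup
print(lemmatize("Running"))  # lookup is case-insensitive
```

A lookup table handles irregular forms such as "ran", which simple suffix stripping would miss.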
Tokens (单词) of a text are the symbolic units that compose it, e.g. words, punctuation, numbers, etc.
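Tokenization can be sketched with a single regular expression. The pattern below is only one plausible choice (an assumption, not a standard): it treats decimal numbers, runs of word characters, and individual punctuation marks as tokens.

```python
import re

# Order matters: try numbers first, then words, then single punctuation marks.
TOKEN_PATTERN = re.compile(r"\d+(?:\.\d+)?|\w+|[^\w\s]")


def tokenize(text):
    """Split text into word, number, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Pi is about 3.14, right?"))
```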
Corpus (text corpus, 语料库): a large and structured set of texts, usually electronically stored and processed. "Structured" here means that the corpus has been parsed.
Spoken corpus (speech corpus 口语语料库): a database of speech audio files and text transcriptions.
Syntax is the form.
Syntax (语法学) refers to the way in which words are put together to form phrases, clauses, or sentences.
A grammar (语法) is a compact characterization of a potentially infinite set of sentences.
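To illustrate how a handful of rules can characterize an unbounded set of sentences, here is a small sketch. The toy grammar, its vocabulary, and the `expand` helper are all assumptions made up for this example. Because the NP and VP rules refer to each other, every extra level of allowed nesting yields more sentences, without limit.

```python
from itertools import product

# Toy grammar (hypothetical): each nonterminal maps to its alternative expansions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["the", "N"], ["the", "N", "that", "VP"]],
    "VP": [["V"], ["V", "NP"]],
    "N": [["dog"], ["cat"]],
    "V": [["sleeps"], ["sees"]],
}


def expand(symbol, depth):
    """Return all word sequences derivable from `symbol` when derivation
    trees are limited to `depth` levels of rule applications."""
    if symbol not in GRAMMAR:  # terminal symbol: it stands for itself
        return [[symbol]]
    if depth == 0:             # no rule applications left
        return []
    results = []
    for rhs in GRAMMAR[symbol]:
        parts = [expand(s, depth - 1) for s in rhs]
        for combo in product(*parts):
            results.append([word for part in combo for word in part])
    return results


for depth in (3, 4, 5):
    print(depth, len(expand("S", depth)))
```

Five rule groups are enough here; the number of sentences is bounded only by the depth limit passed to `expand`.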
Syntax is generally distinguished into three levels:
Syntactic analysis correspondingly consists of three phases:
Semantics is the meaning.
Semantics (语义学) is the study of the rule systems that determine the literal meaning of a sentence. The premise here is that:
Types of semantics:
Ambiguity:
Lexical ambiguity: a token has multiple meanings. While most tokens in any natural language have multiple meanings, the context in which an ambiguous word is used often makes it evident which meaning is intended.
Word sense disambiguation: automatically associate the appropriate meaning with a word in context by algorithmic methods.
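One classic family of methods compares the context with dictionary definitions of each sense. The sketch below follows the idea of the simplified Lesk algorithm: pick the sense whose gloss shares the most words with the context. The sense inventory, glosses, and stopword list are invented for this illustration and are not taken from any real dictionary.

```python
# Hypothetical sense inventory: word -> {sense label: gloss}.
SENSES = {
    "bank": {
        "financial_institution": "an institution that accepts deposits and lends money",
        "river_side": "the sloping land beside a river or stream of water",
    }
}

# Small, hand-picked stopword list (an assumption for this example).
STOPWORDS = {"a", "an", "the", "of", "or", "and", "that", "to", "in", "on", "at", "by"}


def content_words(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {word for word in text.lower().split() if word not in STOPWORDS}


def disambiguate(word, context):
    """Return the sense label whose gloss overlaps most with the context.
    Ties are resolved in favour of the sense listed first."""
    context_words = content_words(context)
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context_words & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(disambiguate("bank", "she sat on the bank of the river and watched the water"))
print(disambiguate("bank", "he deposits money at the bank every week"))
```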
Syntactic ambiguity: a sentence has multiple parse trees (sentence structures). A syntactic ambiguity can be resolved only by rewriting the sentence or by adding appropriate punctuation.
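A sentence's parse trees can be counted mechanically. The sketch below uses a CYK-style chart over a tiny grammar in Chomsky normal form; the grammar and lexicon are assumptions written for this example. For "I saw the man with the telescope" it finds two trees, one attaching "with the telescope" to the verb phrase and one attaching it to "the man".

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (hypothetical): (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),
    ("PP", "P", "NP"),
]

# Toy lexicon (hypothetical): word -> categories it can belong to.
LEXICAL_RULES = {
    "I": ["NP"],
    "saw": ["V"],
    "the": ["Det"],
    "man": ["N"],
    "telescope": ["N"],
    "with": ["P"],
}


def count_parses(words, start="S"):
    """Count distinct parse trees that derive `words` from `start`."""
    n = len(words)
    # chart[i][j][A] = number of trees deriving words[i:j] from category A
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for category in LEXICAL_RULES.get(word, []):
            chart[i][i + 1][category] += 1
    for span in range(2, n + 1):          # width of the substring
        for i in range(n - span + 1):     # start of the substring
            j = i + span
            for k in range(i + 1, j):     # split point between the two children
                for parent, left, right in BINARY_RULES:
                    chart[i][j][parent] += chart[i][k][left] * chart[k][j][right]
    return chart[0][n][start]


print(count_parses("I saw the man with the telescope".split()))
print(count_parses("I saw the man".split()))
```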
Common sources of syntactic ambiguity in English:
Notes on Ambiguity by Prof. Ernest Davis
Spoken language can contain many more types of ambiguities, where there is more than one way to compose a set of sounds into words. Such ambiguity is generally resolved according to the context.
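The segmentation problem can be sketched by enumerating every way to split an unbroken sequence into known words. For simplicity the example uses letters as a stand-in for sounds, and the `VOCABULARY` set is a made-up assumption; a real speech system would work on phonemes and rank the candidates by context.

```python
# Hypothetical vocabulary; letters stand in for speech sounds in this sketch.
VOCABULARY = {"no", "now", "here", "where", "nowhere"}


def segmentations(sounds):
    """Return every way to split `sounds` into a sequence of vocabulary words."""
    if not sounds:
        return [[]]  # exactly one way to segment nothing: the empty sequence
    results = []
    for end in range(1, len(sounds) + 1):
        prefix = sounds[:end]
        if prefix in VOCABULARY:
            for rest in segmentations(sounds[end:]):
                results.append([prefix] + rest)
    return results


for words in segmentations("nowhere"):
    print(" ".join(words))
```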
Pragmatics (语用学) is the study of the use of natural language in communication; more generally, the study of the relations between languages and their users.
In mathematics, computer science, and linguistics, a formal language is a set of strings of symbols that may be constrained by rules that are specific to it.
The alphabet of a formal language is the set of symbols, letters, or tokens from which the strings of the language may be formed; frequently it is required to be finite.
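As a small illustration, the sketch below defines a two-symbol alphabet, enumerates the strings over it, and keeps those belonging to one example language. The language chosen here (strings with an even number of `a` symbols) and the helper names are assumptions for the example only.

```python
from itertools import product

ALPHABET = ("a", "b")  # a finite alphabet


def strings_up_to(max_length):
    """Yield every string over ALPHABET of length <= max_length, shortest first."""
    for length in range(max_length + 1):
        for symbols in product(ALPHABET, repeat=length):
            yield "".join(symbols)


def in_language(string):
    """Membership rule of the example language: only alphabet symbols,
    and an even number of 'a' symbols."""
    return set(string) <= set(ALPHABET) and string.count("a") % 2 == 0


members = [s for s in strings_up_to(2) if in_language(s)]
print(members)
```

The language itself is infinite; the length bound only limits how much of it is enumerated.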
A formal grammar, also known as a generative grammar, is a set of rules for a formal language that prescribes how to form strings from the language's alphabet.
The Chomsky Hierarchy of Formal Grammars:
Hierarchy | Grammar | Production rules |
---|---|---|
Type-0 | Unrestricted Grammars | α→β |
Type-1 | Context-sensitive Grammars | αAβ→αγβ |
Type-2 | Context-free Grammars | A→γ |
Type-3 | Regular Grammars | A→a and A→aB |
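The gap between Type-3 and Type-2 can be seen on two small example grammars, both chosen for this sketch rather than taken from the table. The regular grammar S→aS, S→b generates the strings matched by the regular expression `a*b`. The context-free grammar S→aSb, S→ab generates aⁿbⁿ (n ≥ 1), a language that no regular grammar can generate because the two halves must be counted against each other.

```python
import re


def matches_regular(string):
    """Recognize the language of the regular grammar S -> aS | b, i.e. a*b."""
    return re.fullmatch(r"a*b", string) is not None


def matches_context_free(string):
    """Recognize the language of the context-free grammar S -> aSb | ab."""
    if string == "ab":  # base rule S -> ab
        return True
    # recursive rule S -> aSb: peel one 'a' from the front and one 'b' from the back
    return (
        len(string) > 2
        and string[0] == "a"
        and string[-1] == "b"
        and matches_context_free(string[1:-1])
    )


print(matches_regular("aaab"), matches_regular("aabb"))
print(matches_context_free("aabb"), matches_context_free("aab"))
```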
The syntax (语法) of a computer language is the set of rules that defines the combinations of symbols that are considered to be a correctly structured document or fragment in that language.
The syntax of computer languages generally corresponds to the Chomsky hierarchy of formal grammars:
The semantics (语义) of a computer language is the set of rules for interpreting syntactically legal source code into its object. In the case of programming and query languages, the object is a sequence of instructions; in the case of markup and style sheet languages, the object is a document.
Formal semantics (形式语义) of programming languages: