Formal language: Difference between revisions

Content deleted Content added

Inline

Revision as of 22:52, 7 January 2008

This article is about the term formal language as it is used in mathematics, logic and computer science. For grammar formalism for natural languages, see grammar framework. For information about a mode of expression that is more disciplined or precise than everyday speech, see Register (linguistics).

A formal language is an organized set of symbols the essential feature of which is that it can be precisely defined in terms of just the shapes and locations of those symbols. Such a language can be defined, then, without any reference to any meanings of any of its expressions; it can exist before any interpretation is assigned to it --that is, before it has any meaning. It is a tool in the field of logic. In mathematics, and computer science, a formal language is defined by precise mathematical or machine processable formulas.

Like languages in linguistics, formal languages generally have two aspects: syntax and semantics. The syntax of a language is what the language looks like. It is defined by the set of possible expressions that are valid utterances in the language. The semantics of a language are what the utterances of the language mean. This is formalized in various ways, depending on the type of language in question.

The branch of mathematics and computer science which studies exclusively the theory of language syntax is known as formal language theory. In formal language theory, a language is nothing more than its syntax; questions of semantics are not addressed in this specialty.

A formal language can be identified with the set of its well formed formulas. If the set of all wffs of a formal language L is exactly the same as the set of all wffs of a formal language L', then L is the same formal language as L'. If they do not have the same set of wffs, then they are not the same language.^[1]

Elementary example

In mathematics, a formal language consists of two parts, an alphabet and rules of syntax. The alphabet is any set of symbols; the rules of syntax define what strings (concatenations of elements of the alphabet) are considered part of the formal language.

As a simple example, consider the alphabet {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +, =} together with the following rules of syntax:

Every nonempty string that does not contain + or = and does not start with 0 is in the language.
The string "0" is in the language.
A string containing = is in the language if and only if there is exactly one =, and it separates two strings in the language.
A string containing + is in the language if and only if every + in the string separates two valid strings in the language.
No string is in the language other than those implied by the previous rules.

Under these rules, the string "23+4=555" is in the language, the string "=234=+" is not. This formal language expresses whole numbers, well-formed addition statements, and well-formed addition equalities, but it expresses only what they look like (their syntax), not what they mean (semantics). For instance, nowhere in these rules is it defined that 0 means the number zero, or that + means addition.

The rest of this article is devoted to the basic notions of formal language theory.

Advanced examples

As an example of a formal language, an alphabet might be $\left\{a,b\right\}$ , and a string over that alphabet might be $ababba\,$ .

A typical language over that alphabet, containing that string, would be the set of all strings which contain the same number of symbols $a\,$ and $b\,$ .

The empty word (that is, length-zero string) is allowed and is often denoted by $e\,$ , $\epsilon \,$ or $\Lambda \,$ . While the alphabet is a finite set and every string has finite length, a language may very well have infinitely many member strings (because the length of words belonging to it may be unbounded).

Some more examples:

the set of all words over ${a,b}\,$
the set $\left\{a^{n}\right\}$ , where $n\,$ is a natural number and $a^{n}\,$ means $a\,$ repeated $n\,$ times
Finite languages, such as $\{a,b,ab,cba\}\,$ or $\{ac,aca,bba\}\,$
the set of syntactically correct programs in a given programming language; or
the set of inputs upon which a certain Turing machine halts; or,
the set of the longest strings of alphanumeric ASCII characters on this line, (i.e., the set {the, set, of, longest, strings, alphanumeric, ASCII, characters, on, this, line, i, e})
the set of all strings belonging to languages provably undefinable by formal language theory

Language specification formalisms

Formal language theory rarely concerns itself with particular languages (except as examples), but is mainly concerned with the study of various types of formalisms to describe languages. For instance, a language can be given as

The set of strings that can be generated by some grammar (see Chomsky hierarchy);
The set of strings described or matched by a particular regular expression;
The set of strings accepted by some automaton, such as a Turing machine or finite state automaton;
The set of strings for which some decision procedure (an algorithm that asks a sequence of related YES/NO questions) produces the answer YES.

Typical questions asked about such formalisms include:

What is their expressive power? (Can formalism X describe every language that formalism Y can describe? Can it describe other languages?)
What is their recognizability? (How difficult is it to decide whether a given word belongs to a language described by formalism X?)
What is their comparability? (How difficult is it to decide whether two languages, one described in formalism X and one in formalism Y, or in X again, are actually the same language?).

Surprisingly often, the answer to these decision problems is "it cannot be done at all", or "it is extremely expensive" (with a precise characterization of how expensive exactly). Therefore, formal language theory is a major application area of computability theory and complexity theory.

Operations on languages

Certain operations on languages are common. This includes the standard set operations, such as union, intersection, and complementation. Another class of operation is the elementwise application of string operations.

Examples: suppose ${L}_{1}$ and ${L}_{2}$ are languages over some common alphabet.

The concatenation ${L}_{1}{L}_{2}\,$ consists of all strings of the form $vw\,$ where $v\,$ is a string from ${L}_{1}\,$ and $w\,$ is a string from ${L}_{2}\,$ .
The intersection ${L}_{1}\cap {L}_{2}$ of ${L}_{1}\,$ and ${L}_{2}\,$ consists of all strings which are contained in both languages
The complement $\neg L$ of a language w.r.t. a given alphabet consists of all strings over the alphabet that are not in the language.

Such operations are used to investigate closure properties of classes of languages. A class of languages is closed under a particular operation when the operation, applied to languages in the class, always produces a language in the same class again. For instance, the context-free languages are known to be closed under union, concatenation, and intersection with regular languages, but not closed under intersection or complementation.

Other operations on languages

Some other operations frequently used in the study of formal languages are the following:

The Kleene star.
Reversal:
- Let $e$ be the empty word, then $e^{R}=e$ , and
- for each non-empty word $w=x_{1}...x_{n}$ over some alphabet, let $w^{R}=x_{n}...x_{1}$ ,
- then for a formal language $L$ , $L^{R}=\{w^{R}|w\in L\}$ .
Homomorphism and e-free homomorphism.

References

A. G. Hamilton, Logic for Mathematicians, Cambridge University Press, 1978, ISBN 0 521 21838 1.
Seymour Ginsburg, Algebraic and automata theoretic properties of formal languages, North-Holland, 1975, ISBN 0 7204 2506 9.
Michael A. Harrison: Introduction to Formal Language Theory, Addison-Wesley, 1978.
Grzegorz Rozenberg, Arto Salomaa, Handbook of Formal Languages: Volume I-III, Springer, 1997, ISBN 3 540 61486 9.
Patrick Suppes, Introduction to Logic, D. Van Nostrand, 1957, ISBN 0 442 08072 7.

External links

Formal Language Definitions website 1/24/04
James Power, Notes on Formal Language Theory and Parsing, 29 November 2002.

Drafts of some chapters in the "Handbook of Formal Language Theory", Vol. 1-3, G. Rozenberg and A. Salomaa (eds.), Springer Verlag, (1997):t

Alexandru Mateescu and Arto Salomaa, "Preface" in Vol.1, pp. v-viii, and "Formal Languages: An Introduction and a Synopsis", Chapter 1 in Vol. 1, pp.1-39
Sheng Yu, "Regular Languages", Chapter 2 in Vol. 1
Jean-Michel Autebert, Jean Berstel, Luc Boasson, "Context-Free Languages and Push-Down Automata", Chapter 3 in Vol. 1
Christian Choffrut and Juhani Karhumäki, "Combinatorics of Words", Chapter 6 in Vol. 1
Tero Harju and Juhani Karhumäki, "Morphisms", Chapter 7 in Vol. 1, pp. 439 - 510
Jean-Eric Pin, "Syntactic semigroups", Chapter 10 in Vol. 1, pp. 679-746
M. Crochemore and C. Hancart, "Automata for matching patterns", Chapter 9 in Vol. 2
Dora Giammarresi, Antonio Restivo, "Two-dimensional Languages", Chapter 4 in Vol. 3, pp. 215 - 267

^ Hunter, Geoffrey, Metalogic: An Introduction to the Metatheory of Standard First-Order Logic

[1] Hunter, Geoffrey, Metalogic: An Introduction to the Metatheory of Standard First-Order Logic

[1]

@@ Line 7: / Line 7: @@
 The branch of mathematics and computer science which studies exclusively the theory of language syntax is known as formal language theory. In formal language theory, a language is nothing more than its syntax; questions of semantics are not addressed in this specialty.
+A formal language can be identified with the set of its [[well formed formula]]s. If the set of all wffs of a formal language L is exactly the same as the set of all wffs of a formal language L', then L is the same formal language as L'. If they do not have the same set of wffs, then they are not the same language.<ref>Hunter, Geoffrey, Metalogic: An Introduction to the Metatheory of Standard First-Order Logic</ref>
 == Elementary example ==