Linguistics 570

Regular Expressions and Finite-State Automata

Regular Languages: Formal and Fussy

Definition
Kleene closure: For any set of strings S, we write S* for the set of all strings formed by concatenating zero or more members of S. S* always includes the empty string.
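
The definition can be made concrete with a small sketch. S* is infinite whenever S contains a non-empty string, so the code below only enumerates the strings built from at most max_parts members of S. The function name kleene_closure and the cutoff parameter are illustrative assumptions, not part of the definition.

```python
from itertools import product

def kleene_closure(S, max_parts):
    """Return the members of S* built from at most max_parts members of S."""
    result = set()
    for k in range(max_parts + 1):
        # k = 0 contributes the empty string
        for parts in product(sorted(S), repeat=k):
            result.add("".join(parts))
    return result

closure = kleene_closure({"ab", "c"}, 2)
print(sorted(closure))
```

With S = { ab, c } and a cutoff of 2, the result contains the empty string, the two members of S, and the four two-part concatenations.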

Definition
The concatenation product of the sets of strings A and B (written AB) is the set of strings that can be constructed by concatenating an element of A with an element of B. (Similar to Cartesian product but we're defining a set of strings not a set of pairs).
Example: If A = { a, b } and B = { cc, d }, then AB = { acc, ad, bcc, bd }.
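
As a quick check of the example, the concatenation product can be written as a set comprehension. The function name concat_product is an illustrative choice.

```python
def concat_product(A, B):
    """Concatenate every element of A with every element of B."""
    return {a + b for a in A for b in B}

A = {"a", "b"}
B = {"cc", "d"}
AB = concat_product(A, B)
BA = concat_product(B, A)
print(sorted(AB))
print(sorted(BA))
```

Computing BA as well shows that the operation is not commutative: the order of the two sets matters.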

A language is a set of strings.

The regular languages are a particular set of languages.

Which set? Given an alphabet Sigma:
Definition
  1. The empty set is a regular language.
  2. For any string x in Sigma*, { x } is a regular language.
  3. If A and B are regular languages then the concatenation product of A and B is a regular language.
  4. If A and B are regular languages then the union of A and B (written A U B) is a regular language.
  5. If A is a regular language, so is A*.
  6. Nothing else is a regular language unless its being so follows from 1-5.
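
The clauses can be read as a recipe for building languages out of smaller ones. The sketch below represents languages as Python sets of strings and builds a(ba)*c from singletons. Because A* is usually infinite, star takes a length cutoff max_len; that cutoff and all function names are assumptions made for illustration.

```python
def concat(A, B):
    """Clause 3: concatenation product of two sets of strings."""
    return {a + b for a in A for b in B}

def union(A, B):
    """Clause 4: union of two sets of strings."""
    return A | B

def star(A, max_len):
    """Clause 5, truncated: members of A* of length at most max_len."""
    result = {""}
    frontier = {""}
    while frontier:
        # extend the newest strings by one more member of A
        frontier = {s + a for s in frontier for a in A
                    if len(s + a) <= max_len} - result
        result |= frontier
    return result

EMPTY = set()                      # Clause 1: the empty set
a, ba, c = {"a"}, {"ba"}, {"c"}    # Clause 2: singleton languages

language = concat(concat(a, star(ba, 4)), c)
print(sorted(language))
```

The result is only a finite slice of the language, namely the strings whose (ba) part has length at most 4.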

Regular Expressions: Informal

Searching for patterns with regular expressions:

  1. Any character is a regular expression matching itself (Clause 2)
  2. Sequence
      /woodchucks/: woodchucks
    (Clause 3, more below)
  3. [ ], |: disjunction (Clause 4)
    • /[wW]oodchucks/: woodchucks or Woodchucks
      Figure 2.1
    • /[abc]/: a or b or c
    • Ranges (Figure 2.2):
        /[A-Z]/: [ABCDEFGHIJKLMNOPQRSTUVWXYZ]
        /[0-9]/: [0123456789]
    • /cat|mouse/: cat or mouse
  4. Wildcards (Figure 2.5):
    • /./: Matches any single character
    (Clause 4)
  5. Optionality (Figure 2.4):
    • /xy?/: xy or x
    (Clause 4)
  6. Kleene star, Kleene plus
    • /w*oodchuck/: oodchuck or woodchuck or wwoodchuck or wwwoodchuck ...
    • /w+oodchuck/: woodchuck or wwoodchuck or wwwoodchuck ...
    • Other expression counters: (Figure 2.7)
    (Clause 5)
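
The patterns above can be tried out directly. The sketch below uses Python's re module, which is an assumption: the notes write patterns in /.../ form and do not name a tool. The test strings are made up for illustration.

```python
import re

# Disjunction with brackets, plus optionality: woodchuck(s) or Woodchuck(s)
woodchucks = re.findall(r"[wW]oodchucks?", "Woodchucks and a woodchuck")

# Disjunction with |
animals = re.findall(r"cat|mouse", "a cat chased a mouse")

# Range: any single digit
digits = re.findall(r"[0-9]", "room 42")

# Wildcard: . matches any single character (here including a space)
wild = re.findall(r"b.g", "bag big b g bg")

# Kleene star allows zero w's; Kleene plus requires at least one
star_ok = re.fullmatch(r"w*oodchuck", "oodchuck") is not None
plus_ok = re.fullmatch(r"w+oodchuck", "oodchuck") is not None

print(woodchucks, animals, digits, wild, star_ok, plus_ok)
```

re.findall searches for the pattern anywhere in the text, while re.fullmatch asks whether the whole string belongs to the language the pattern defines.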

To achieve the full power of the definition of regular languages, you need the following:

Complex expressions can be combined using the operators:

  1. [abc]* Kleene closure of the regular language { a, b, c }
      aa, abb, bba, cba, abbac, abbbccccca, ε (the empty string), ....
  2. ([abc][def]) Concatenation product of the regular language { a, b, c } with the regular language { d, e, f }
      ad, be, af, cf, ...
  3. ([abc]*[def]*) Concatenation product of the Kleene closure of { a, b, c } with the Kleene closure of { d, e, f }
      abbbcccccafeeeed, cbbbadddddef, aaad, cba, edf, ...
    In item 3, strings such as cba and edf are included because the Kleene closure includes the empty string, which is the identity element for the concatenation operation.
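
The three expressions can be checked against their example strings. This is a sketch using Python's re module (an assumption, since the notes do not name a tool); re.fullmatch tests whether an entire string is in the language. The empty string is written "" in the code.

```python
import re

star = re.compile(r"[abc]*")
pair = re.compile(r"([abc][def])")
both = re.compile(r"([abc]*[def]*)")

star_examples = ["aa", "abb", "bba", "cba", "abbac", "abbbccccca", ""]
pair_examples = ["ad", "be", "af", "cf"]
both_examples = ["abbbcccccafeeeed", "cbbbadddddef", "aaad", "cba", "edf"]

star_all = all(star.fullmatch(s) for s in star_examples)
pair_all = all(pair.fullmatch(s) for s in pair_examples)
both_all = all(both.fullmatch(s) for s in both_examples)

# Order matters in a concatenation product: d then a is not in [abc]*[def]*
wrong_order = both.fullmatch("da") is not None

print(star_all, pair_all, both_all, wrong_order)
```

The strings cba and edf match the third pattern only because one of the two starred parts is allowed to contribute the empty string.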

Two important regular language theorems

Theorem
The regular languages are closed under intersection.

Theorem
The regular languages are closed under complementation.
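
One common way to see why both theorems hold goes through finite-state automata, relying on the fact that regular languages and FSA languages coincide. The sketch below assumes complete deterministic automata over a shared alphabet, represented as tuples (states, alphabet, delta, start, accepting). The representation, the function names, and the two example machines are illustrative assumptions.

```python
from itertools import product

def accepts(dfa, s):
    """Run a complete DFA on s and report whether it ends in an accepting state."""
    states, alphabet, delta, start, accepting = dfa
    q = start
    for ch in s:
        q = delta[(q, ch)]
    return q in accepting

def complement(dfa):
    """Swap accepting and non-accepting states."""
    states, alphabet, delta, start, accepting = dfa
    return (states, alphabet, delta, start, states - accepting)

def intersection(d1, d2):
    """Product construction: run both machines in parallel."""
    s1, alphabet, t1, q1, f1 = d1
    s2, _, t2, q2, f2 = d2
    states = set(product(s1, s2))
    delta = {((p, q), ch): (t1[(p, ch)], t2[(q, ch)])
             for (p, q) in states for ch in alphabet}
    accepting = {(p, q) for (p, q) in states if p in f1 and q in f2}
    return (states, alphabet, delta, (q1, q2), accepting)

# Strings over {a, b} with an even number of a's
EVEN_A = ({0, 1}, {"a", "b"},
          {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1},
          0, {0})

# Strings over {a, b} that end in b
ENDS_B = ({"n", "y"}, {"a", "b"},
          {("n", "a"): "n", ("n", "b"): "y", ("y", "a"): "n", ("y", "b"): "y"},
          "n", {"y"})

BOTH = intersection(EVEN_A, ENDS_B)
ODD_A = complement(EVEN_A)
print(accepts(BOTH, "aab"), accepts(BOTH, "ab"), accepts(ODD_A, "a"))
```

Complementation by swapping accepting states only works when the machine is deterministic and has a transition for every state and symbol, which is why the sketch assumes complete DFAs.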

Also in the usual battery of regular language tricks:

    Negation (Figure 2.3): /[^a]/ matches any single character other than "a". Notice that the caret comes inside the "disjunction" brackets. Think of this as a long disjunction of all the things that are not "a".

Question: Are the regular languages closed under union?




Relation of Regular languages and Finite-State Automaton Languages

Theorem
Every regular language is accepted by some FSA (Not too hard to see).

Theorem
Every language accepted by an FSA is a regular language (Kind of tricky, other lecture illustrates this).

Summary point
FSAs and regular expressions define the same set of languages.
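
The summary point can be illustrated on one language. The sketch below hand-builds an FSA for a(ba)*c and compares it with the regular expression on every string over {a, b, c} up to a length bound. The state numbering, the function names, and the use of Python's re module are assumptions; agreement on a finite sample is an illustration, not a proof.

```python
import re
from itertools import product

# Transitions of a hand-built FSA for a(ba)*c; missing entries mean "reject"
DELTA = {(0, "a"): 1, (1, "b"): 2, (2, "a"): 1, (1, "c"): 3}
ACCEPTING = {3}

def fsa_accepts(s):
    """Follow the transitions from state 0; accept if we end in state 3."""
    state = 0
    for ch in s:
        if (state, ch) not in DELTA:
            return False
        state = DELTA[(state, ch)]
    return state in ACCEPTING

PATTERN = re.compile(r"a(ba)*c")

def agree_up_to(max_len):
    """Check that the FSA and the regular expression agree on all short strings."""
    for n in range(max_len + 1):
        for chars in product("abc", repeat=n):
            s = "".join(chars)
            if fsa_accepts(s) != (PATTERN.fullmatch(s) is not None):
                return False
    return True

print(agree_up_to(6))
```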


Properties of Regular Languages

Observation
Many infinite sets of strings are regular languages.
Example: Sigma*

Observation
All finite sets of strings are regular languages.
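
This follows from clauses 1, 2 and 4 of the definition: each string is a singleton regular language, and a finite set is a finite union of singletons. A minimal sketch using Python's re module (an assumption) builds such a union as a disjunction. The function name is illustrative, and the never-matching pattern (?!) stands in for the empty set.

```python
import re

def finite_language_regex(strings):
    """Build a pattern that matches exactly the given finite set of strings."""
    if not strings:
        return r"(?!)"  # negative lookahead that always fails: the empty language
    # re.escape keeps characters such as . or * literal
    return "|".join(re.escape(s) for s in sorted(strings))

pattern = finite_language_regex({"cat", "mouse", "a.b"})
in_language = [s for s in ["cat", "mouse", "a.b", "axb", "ca", ""]
               if re.fullmatch(pattern, s)]
print(in_language)
```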

Observation
An FSA accepting an infinite language must have a loop.
Reason: A finite set of states must serve to accept an infinite set of strings.

Formalization of this intuition:

Theorem: The Pumping Lemma
If L is an infinite language accepted by a finite automaton over alphabet Sigma, then there are strings x, y, z in Sigma*, with y non-empty, such that xy^nz (x, then n copies of y, then z) is in L for all n ≥ 0.

    Example

    a(ba)*c is a regular language.

    Pumping String:

    1. x=a
    2. y=ba
    3. z=c
Sketch of proof: We have an infinite language but a finite number of states. Let k be the number of states. Because L is infinite, it contains a string s of length at least k. In admitting s, the machine passes through at least k + 1 states, so some state Si must have been visited twice. Let s = xyz, where y is the non-empty substring admitted by the state sequence connecting Si to Si. That loop can be skipped or repeated any number of times, so for any n ≥ 0, xy^nz is admitted by the machine.
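
The proof idea can be traced on the example. The sketch below runs a hand-built FSA for a(ba)*c on an accepted string, splits the string at the first repeated state, and then pumps the middle part. The transition table, state numbering, and function name are illustrative assumptions, and Python's re module is used only to check the pumped strings.

```python
import re

# Hand-built FSA for a(ba)*c: start state 0, accepting state 3
DELTA = {(0, "a"): 1, (1, "b"): 2, (2, "a"): 1, (1, "c"): 3}

def pumping_split(s):
    """Split an accepted string s into (x, y, z) at the first repeated state.

    Returns None if no state repeats. Assumes s is accepted by the FSA.
    """
    state = 0
    first_visit = {0: 0}  # state -> number of symbols read when first reached
    for i, ch in enumerate(s, start=1):
        state = DELTA[(state, ch)]
        if state in first_visit:
            j = first_visit[state]
            return s[:j], s[j:i], s[i:]
        first_visit[state] = i
    return None

split = pumping_split("abac")
x, y, z = split
pumped = [x + y * n + z for n in range(5)]
all_accepted = all(re.fullmatch(r"a(ba)*c", p) for p in pumped)
print(split, pumped, all_accepted)
```

On the input abac the split is x = a, y = ba, z = c, the same pumping string as in the example.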

The theorem is often used in its contrapositive form:

Contrapositive Pumping Lemma

If there are no strings x, y, z in Sigma*, with y non-empty, such that xy^nz is in L for all n ≥ 0, then L is not an infinite language accepted by a finite automaton.

Example

It can be shown, using the Pumping Lemma, that a^nb^n (n a's followed by n b's) is not a finite automaton language: the language is infinite, so if it has no pumping string, no FSA accepts it.

Proof by enumeration of cases. Since y cannot be empty, there are 3 possibilities for a pumping string for a^nb^n.

  1. y consists entirely of a's
  2. y consists of some non-empty sequence of a's followed by a non-empty sequence of b's.
  3. y consists entirely of b's

In each of these cases, using y as the repeating part of a pumping string generates strings that are not in the language:

  1. y consists entirely of a's: if xyz is in the language, xy^2z is not, because xy^2z has more a's than b's.
  2. y consists of some non-empty sequence of a's followed by a non-empty sequence of b's: if xyz is in the language, xy^2z is not, because xy^2z has a's following b's.
  3. y consists entirely of b's: if xyz is in the language, xy^2z is not, because xy^2z has more b's than a's.
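
The case analysis can be checked by brute force on small members of the language. The sketch below tries every split of a string into x, y, z with y non-empty and tests whether pumping stays inside a^nb^n. The function names and the bound max_n are assumptions, and the language is taken to include the empty string (n = 0). A finite search like this illustrates the argument but does not replace it.

```python
def in_anbn(s):
    """True if s is n a's followed by n b's for some n >= 0."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n

def pumpable_splits(s, max_n=3):
    """Return every split (x, y, z) of s, y non-empty, whose pumped
    strings x + y*n + z stay in the language for n = 0..max_n."""
    good = []
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            x, y, z = s[:i], s[i:j], s[j:]
            if all(in_anbn(x + y * n + z) for n in range(max_n + 1)):
                good.append((x, y, z))
    return good

results = {k: pumpable_splits("a" * k + "b" * k) for k in range(1, 6)}
print(results)
```

Every list in results is empty: no split of these strings survives pumping, which matches the three cases above.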