Linguistics 581

Regular Expressions and Finite-State Automata

Preliminaries:

Regular Languages: Informal

Searching for patterns with regular expressions:

  1. Sequence
      /woodchucks/: woodchucks
  2. [ ], |: disjunction
    • /[wW]oodchucks/: woodchucks or Woodchucks
      Figure 2.1
    • /[abc]/: a or b or c
    • Ranges ( Figure 2.2):
        /[A-Z]/:[ABCDEFGHIJKLMNOPQRSTUVWXYZ]
        /[0-9]/:[0123456789]
    • / cat | mouse /
  3. Wildcards( Figure 2.5)::
    • /./: Matches any character
  4. Kleene star, Kleene plus
    • /w*oodchuck/: oodchuck or woodchuck or wwoodchuck or wwwoodchuck ...
    • /w+oodchuck/: woodchuck or wwoodchuck or wwwoodchuck ...
    • Other expression counters: ( Figure 2.7):
  5. Negation ( Figure 2.3):
      [^a]: matches anything that is not lowercase "a".
    Notice this comes in "disjunction" brackets. Think of this as a long disjunction of all the things that are not "a".
  6. Optionality ( Figure 2.4):
    • /xy?/: xy or x
  7. Anchors
    • ^: start of a line
        /^The/: The at the beginning of a line
    • $: end of a line
        /The$/: The at the end of a line
    • \b: a word boundary
        /\bthe\b/ matches the but not other
  8. Characters that need to be backslashed ( Figure 2.8):
      /The dog\./ matches The dog followed by .
  9. Regular expressions match strings. Then we need to do something with them. For each match, grep, a UNIX commands, retrieves the line containing the matching string:
      grep aardvark.*aardvark corpus
      Retrieves any line containing at least 2 occurrences of aardvark
    Notice that the slashes aren't used in the command.

More complex examples

  1. Operator precedence
      /angry|ily/ vs. /angr(y|ily)/
      • /angry|ily/ matches angry OR ily
      • /angr(y|ily)/ angry OR angrily
      /the*/ vs /(the)*/
      • /the*/ matches th OR the OR thee OR theee
      • /(the)*/ matches empty string OR the OR thethe OR thethethe...
      Operator Precedence table
      • Parentheses
      • Counters: * + ? { }
      • Sequences and anchors
      • Disjunction
  2. Abbreviations ( Figure 2.6):
      \d matches [0-9]

Regular Languages: Formal and Fussy

Definition
Kleene-Closure: For any set of strings S, we write that set of all possible strings composed using only members of S (including the empty string). as S*.

Definition
The concatenation product of the sets of strings A and B (written AB) is the set of strings that can be constructed by concatenating an element of A with an element of B. (Similar to Cartesian product but we're defining a set of strings not a set of pairs).
Example: If A = { a, b } and B = { cc, d }, then AB = { acc, ad, bcc, bd } and

A language is a set of strings.

The regular languages is a particular set of languages.

Which set? Given an alphabet Sigma:
Definition
  1. The empty set is a regular language.
  2. For any string x in Sigma*, { x } is a regular language.
  3. If A and B are regular languages then the the concatenation product of A and B is a regular language.
  4. If A and B are regular languages then the union of A and B (written A U B) is a regular language.
  5. IF A is a regular language, so is A*.
  6. Nothing else is a regular language unless its being so follows from 1-5.

Theorem
The regular languages are closed under intersection.

Theorem
The regular languages are closed under complementation.


Finite-State automata

The machine for this language

  1. baa!: machine succeeds
  2. baa...a!: machine succeeds
  3. /baa+!/: regular expression for the language accepted by this machine

Failure examples:

  1. aba! (Fig. 2.11): machine fails (q0)
  2. bba! (Fig. 2.11): machine fails (q1)
  3. bab! (Fig. 2.11): machine fails (q2)
  4. ba! (Fig. 2.11): machine fails (q2)
  5. baab! (Fig. 2.11): machine fails (q3)
  6. baaaaaa!!,baaaaaa!a,: machine fails (q4)
A property of deterministic Finite State automata: a unique failure state for any string.

Machine for accepting spoken numbers 1-99

Represent FSA as a state-transition table State-transition table for sheep language

More formally (p. 36):


Relation of Regular languages and Finite-State Automaton Languages

Theorem
Every regular language is accepted by some FSA (Not too hard to see).

Theorem
Every language accepted by an FSA is a regular language (Kind of tricky, homework exercise illustrates this).

Summary point
FSAs and regular expressions define the same set of languages.


Properties of Regular Languages

Observation
Many infinite sets of strings are regular languages.
Example: Sigma*

Observation
All finite sets of strings are regular languages.

Observation
An FSA accepting an infinite language must have a loop.
Reason: A finite set of states must serve to accept an infinite set of strings.

Formalization of this intuition:

Theorem: The Pumping Lemma
If L is an infinite finite automaton language over alphabet Sigma, then there are strings x,y,z in Sigma*, y non empty, such that xynz is in L for all n > 0 or n = 0.

    Example

    a(ba)*c is a regular language.

    Pumping String:

    1. x=a
    2. y=ba
    3. z=c
Sketch of proof: We have an infinite language but a finite number of states. Let n be the number of states and consider a string s of length n. In admitting s, some state Si must have been visited twice. Let s=xyz, where y is the substring admitted by the state sequence connecting Si to Si. Then for any n > 0, xynz is admitted by the machine.

The theorem is often used in its contrapositive form:

Contrapositive Pumping Lemma

If there are no strings x,y,z in Sigma*, y non empty, such that xynz is in L for all n > 0 or n = 0, then L is not an infinite finite automaton language.

Example

It can be shown, using The Pumping Lemma, that anbn is not an infinite finite automaton language.

Proof by enumeration of cases. Each of the following exhaustive listing of possibilities is impossible:

  1. x is the empty string, z is the empty string.
  2. x is the empty string, z is non empty.
    1. z consists entirely of a's.
    2. z consists entirely of b's.
    3. z consists of a string of a's followed by a string of b's.
  3. x is non empty, z is the empty string.
    1. x consists entirely of a's.
    2. x consists entirely of b's.
    3. x consists of a string of a's followed by a string of b's.
  4. x consists of a nonempty string of a's. y ends with a non empty string of b's
    1. y is all a's
    2. y is all b's.
    3. y is a sequence of a's followed by a sequence of b's.

Implementing FSAs

Deterministic Finite-State Automaton (DFSA) as a recognizer or string accepter:

  1. Input tape with one symbol on each cell of the tape. Machine set to read beginning of tape, set in starting state. Standard input tape set up Another property of input tapes, at least the way we're going to think about them, is that we cxan think of them as being indexed. Each cell on the tape has a number. Usually the starting position is the tape is numbered 0 and we move from one symbol to the right on the tape by adding 1.
  2. Advance to the next cell (next index) on the tape. If the symbol there matches the label on an arc leaving the current state of the machine, follow that arc to the next state. If not, fail.
  3. If there are no more symbols on the tape, we must be in a final state. If not, fail.

Implementing a DFSA

  1. Data Types and data structures
    1. Index: an integer (0, 1, 2, 3 ...)
    2. Tape: a one-column table (one-argument function). Each index is associated with exactly one symbol on the input tape.
        0 b
        1 a
        2 a
        3 a
        4 !
        5 eoi
      tape(index) returns the symbol for that index.
        tape(3)=a.
        tape(5)=eoi
      This stands for "end of input".
    3. Transition-table is a table with as many columns as there are symbols in the input alphabet ({a,b,!} Fig. 2.12)
        transition-table(current-state, symbol)
      means look in the row for the current state, the column for symbol, and return the new state indicated by the table.
        transition-table(2,a) = 3
      for the table in 2.12. In general we have to look on the tape for a symbol. So:
        transition-table(2,tape(3)) = 3
      Notice how everything is getting turned into operations on numbers.
  2. Pseudo code for DFSA
  3. Trace of a DFSA

Non-Deterministic Finite-State Automata (NFSA)

NFSAs as recognizers:

  1. Machine set to read beginning of tape, set in starting state.
  2. Advance to the next cell (next index) on the tape. If the symbol there matches the label on an arc leaving the current state of the machine, follow that arc to the next state. If not, try another path through the machine that matches the input thus far. If there are no more paths through the machine that match the input, fail.
  3. If there are no more symbols on the tape, we must be in a final state. If not, try another path through the machine that matches the input. If there are no more symbols on the tape and no other paths through the machine that match the input, fail.
This is a search problem, because of the steps that include "try another ..."

Implementing a NFSA

  1. Pseudo code for NFSA
  2. Trace of an NFSA