Linguistics 581

Regular Expressions and Finite-State Automata

Preliminaries:

Regular Languages: Informal

Searching for patterns with regular expressions:

  1. Sequence
      /woodchucks/: woodchucks
  2. [ ], |: disjunction
    • /[wW]oodchucks/: woodchucks or Woodchucks
      Figure 2.1
    • /[abc]/: a or b or c
    • Ranges ( Figure 2.2):
        /[A-Z]/:[ABCDEFGHIJKLMNOPQRSTUVWXYZ]
        /[0-9]/:[0123456789]
    • / cat | mouse /
  3. Wildcards( Figure 2.5)::
    • /./: Matches any character
  4. Kleene star, Kleene plus
    • /w*oodchuck/: oodchuck or woodchuck or wwoodchuck or wwwoodchuck ...
    • /w+oodchuck/: woodchuck or wwoodchuck or wwwoodchuck ...
    • Other expression counters: ( Figure 2.7):
  5. Negation ( Figure 2.3):
      [^a]: matches anything that is not lowercase "a".
    Notice this comes in "disjunction" brackets. Think of this as a long disjunction of all the things that are not "a".
  6. Optionality ( Figure 2.4):
    • /xy?/: xy or x
  7. Anchors
    • ^: start of a line
        /^The/: The at the beginning of a line
    • $: end of a line
        /The$/: The at the end of a line
    • \b: a word boundary
        /\bthe\b/ matches the but not other
  8. Characters that need to be backslashed ( Figure 2.8):
      /The dog\./ matches The dog followed by .
  9. Regular expressions match strings. Then we need to do something with them. For each match, grep, a UNIX commands, retrieves the line containing the matching string:
      grep aardvark.*aardvark corpus
      Retrieves any line containing at least 2 occurrences of aardvark
    Notice that the slashes aren't used in the command.

More complex examples

  1. Operator precedence
      /angry|ily/ vs. /angr(y|ily)/
      • /angry|ily/ matches angry OR ily
      • /angr(y|ily)/ angry OR angrily
      /the*/ vs /(the)*/
      • /the*/ matches th OR the OR thee OR theee
      • /(the)*/ matches empty string OR the OR thethe OR thethethe...
      Operator Precedence table
      • Parentheses
      • Counters: * + ? { }
      • Sequences and anchors
      • Disjunction
  2. Abbreviations ( Figure 2.6):
      \d matches [0-9]

Regular Languages: Formal and Fussy

Definition
Kleene-Closure: For any set of strings S, we write that set of all possible strings composed using only members of S (including the empty string). as S*.

Definition
The concatenation product of the sets of strings A and B (written AB) is the set of strings that can be constructed by concatenating an element of A with an element of B. (Similar to Cartesian product but we're defining a set of strings not a set of pairs).
Example: If A = { a, b } and B = { cc, d }, then AB = { acc, ad, bcc, bd } and

A language is a set of strings.

The regular languages is a particular set of languages.

Which set? Given an alphabet Σ:

    Definition
    1. The empty set is a regular language.
    2. For any string x in Σ*, { x } is a regular language.
    3. If A and B are regular languages then the the concatenation product of A and B is a regular language.
    4. If A and B are regular languages then the union of A and B (written A U B) is a regular language.
    5. IF A is a regular language, so is A*.
    6. Nothing else is a regular language unless its being so follows from 1-5.

    Theorem
    The regular languages are closed under intersection.

    Theorem
    The regular languages are closed under complementation.

Finite-State automata

The machine for this language

  1. baa!: machine succeeds
  2. baa...a!: machine succeeds
  3. /baa+!/: regular expression for the language accepted by this machine

Failure examples:

  1. aba! (Fig. 2.11): machine fails (q0)
  2. bba! (Fig. 2.11): machine fails (q1)
  3. bab! (Fig. 2.11): machine fails (q2)
  4. ba! (Fig. 2.11): machine fails (q2)
  5. baab! (Fig. 2.11): machine fails (q3)
  6. baaaaaa!!,baaaaaa!a,: machine fails (q4)
A property of deterministic Finite State automata: unique failure & acceptance states for any string.

Deterministic machine for accepting spoken numbers 1-99

Represent FSA as a state-transition table State-transition table for sheep language

More formally (p. 36):


Relation of Regular languages and Finite-State Automaton Languages

    Theorem
    Every regular language is accepted by some FSA (Not too hard to see).

    Theorem
    Every language accepted by an FSA is a regular language (Kind of tricky, homework exercise illustrates this).

    Bottom Line
    FSAs and regular expressions define the same set of languages.

Properties of Regular Languages

    Observation
    Many infinite sets of strings are regular languages.
    Example: Σ*

    Observation
    All finite sets of strings are regular languages.

    Observation
    An FSA M that accepts an infinite language must have a loop.
    M has a finite set of states (of cardinality n): To accept any string w of length m (m > n), M must visit at least one state twice.

This leads at once to an important theorem called The Pumping Lemma:

    Theorem: The Pumping Lemma
    If L is an infinite finite automaton language over alphabet Σ, then there are strings x, y, z in Σ*, y non empty, such that xynz is in L for all n > 0 or n = 0.

    Example

    a(ba)*c is a regular language.

    Pumping String= xyz, where

    1. x=a
    2. y=ba
    3. z=c
      a(ba)nc ∈ a(ba)*c,  n = 1, 2, 3, ....

    Sketch of proof: We have an infinite language but a machine M with a finite set of states, S. Let n be the cardinality of S and consider a string σ of length n. Consider the shortest sequence of states

    s1, s2, ... , sk

    such that σ is admitted by M using this sequence. Since each transition admits at least one symbol (possibly none),

    k > n

    That is, some state Si must have been visited twice. Let σ=xyz, where y is the substring admitted by the state sequence connecting Si to Si.

    Call the state sequence that admits y:

    Sk, ..., Sj

    y cannot be the empty string, because then there would be a shorter state sequence admitting σ, namely the one skipping directly from Sk to Sj+1.

    Then for any n > 0, xynz is admitted by the machine, and y is not the empty string.

The theorem is often used in its contrapositive form:

    Contrapositive Pumping Lemma

    If L is an infinite language and there are no strings x, y, z ∈ Σ*, y non empty, such that xynz is in L for all n > 0 or n = 0, then L is not a finite state automaton language. xynz is called a pumping string. y is called the pumpable portion of the pumping string (the portion accepted by the loop in the machine that accepts L).

    Example

    It can be shown, using The Pumping Lemma, that anbn is not a finite state automaton language.

    Clearly anbn IS an infinite language. So if anbn IS a finite state automaton language, the PL applies. We proceed by assuming that anbn IS a finite automaton language and deriving a contradiction.

    According to the PL, if L = anbn IS a finite automaton language then there is some value of n, say i, such that

      aibi = x y z
    and for all n > 0, x yn z ∈ L.

    Given the form of L, the pumpable portion y either consists entirely of a's, entirely of b's, or of a sequence of a's followed a sequence of b's. We investigate all 3 possibilities.

      1. y consists entirely of a's; but then
          xy2z has more a's than b's
        Therefore xy2z can't be in L. Therefore the pumpable portion of the pumping string can't consist entirely of a's.
      2. By exactly the same reasoning, the y portion can't consist entirely of b's.
      3. So the pumpable part must consist of a's followed by b's. But this too can't be right.
          a... a ... b ... b
          x y z
        But then x y2 z is:
          a... a ... b a ... b ... b
          x y y z
        And this has a's following b's and therefore cannot be in L. So the pumpable part of a pumping string for L can't consist of a's and b's either.

    But now we've exhausted all the possibilities. Therefore L has no pumping strings. Therefore, by the PL, L is not a finite-state automaton language.


Implementing FSAs

Deterministic Finite-State Automaton (DFSA) as a recognizer or string accepter:

  1. Input tape with one symbol on each cell of the tape. Machine set to read beginning of tape, set in starting state. Standard input tape set up Another property of input tapes, at least the way we're going to think about them, is that we can think of them as being indexed. Each cell on the tape has a number. Usually the starting position is the tape is numbered 0 and we move from one symbol to the right on the tape by adding 1.
  2. Advance to the next cell (next index) on the tape. If the symbol there matches the label on an arc leaving the current state of the machine, follow that arc to the next state. If not, fail.
  3. If there are no more symbols on the tape, we must be in a final state. If not, fail.

Implementing a DFSA

  1. Data Types and data structures
    1. Index: an integer (0, 1, 2, 3 ...)
    2. Tape: a one-column table (one-argument function). Each index is associated with exactly one symbol on the input tape.
        0 b
        1 a
        2 a
        3 a
        4 !
        5 eoi
      tape[index] returns the symbol for that index.
        tape[3]=a.
        tape[5]=eoi
      This stands for "end of input".
    3. Transition-table is a table with as many columns as there are symbols in the input alphabet ({a,b,!} Transition table)
        transition-table[current-state, symbol]
      means look in the row for the current state, the column for symbol, and return the new state indicated by the table.
        transition-table[2,a] = 3
      for the table in 2.12. In general we have to look on the tape for a symbol. So:
        transition-table[2,tape[3]] = 3
  2. Pseudo code for DFSA
  3. Trace of a DFSA

Non-Deterministic Finite-State Automata (NFSA)

NFSAs as recognizers:

  1. Machine set to read beginning of tape, set in starting state.
  2. Advance to the next cell (next index) on the tape. If the symbol there matches the label on an arc leaving the current state of the machine, follow that arc to the next state. If not, try another path through the machine that matches the input thus far. If there are no more paths through the machine that match the input, fail.
  3. If there are no more symbols on the tape, we must be in a final state. If not, try another path through the machine that matches the input. If there are no more symbols on the tape and no other paths through the machine that match the input, fail.
This is a search problem, because of the steps that include "try another ..."

Implementing a NFSA

  1. Pseudo code for NFSA
  2. Trace of an NFSA