# Regular Expression Assignment Practice Lab

## How to construct and debug regular expressions

The next cell defines a large HTML string that we will use to test some regular expressions.  When writing a program that is going to depend on accurately extracting instances of certain patterns from text or HTML, you need to create the regular expressions first, testing them on realistic example strings.  You need your expressions to do two things:

1. Match the strings you are trying to extract, and possibly some context around them, to guarantee you
   are extracting the right information;
2. If your expression matches context as well as the information you are trying to extract
   (and often it will have to), you need to identify the target part of the expression.  This is done by placing the target part of 
   the pattern in parentheses (illustrated below).
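As a minimal sketch of the two points above, the following cell matches some context and uses parentheses to mark the target. The sample string and the phrase `total due:` are made up for illustration and have nothing to do with the homework file.

```python
import re

# Hypothetical sample text: two numbers, only one of which is the amount we want.
sample = 'Order 1234 shipped; total due: $87 by Friday.'

# The context "total due: $" pins down the right number;
# the parentheses mark the part we actually want to retrieve.
match = re.search(r'total due: \$(\d+)', sample)

print(match.group())    # the whole match, context included
print(match.group(1))   # just the parenthesized target
```

Here `match.group()` is `'total due: $87'` while `match.group(1)` is `'87'`.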
   
The homework assignment asks you to extract the baby name year in the HTML file.  The line containing the relevant information looks like this:
     
     <h3 align="center">Popularity in 1990</h3>
     
One regular expression that will match the year is the following:

       '\d\d\d\d'

The code below tries out this idea.  Evaluate it and report on the  success of the idea in the markdown cell below the code cell.  

In [1]:
import re

html_string = """
<head><title>Popular Baby Names</title>
<meta name="dc.language" scheme="ISO639-2" content="eng">
<meta name="dc.creator" content="OACT">
<meta name="lead_content_manager" content="JeffK">
<meta name="coder" content="JeffK">
<meta name="dc.date.reviewed" scheme="ISO8601" content="2005-12-30">
<link rel="stylesheet" href="../OACT/templatefiles/master.css" type="text/css" media="screen">
<link rel="stylesheet" href="../OACT/templatefiles/custom.css" type="text/css" media="screen">
<link rel="stylesheet" href="../OACT/templatefiles/print.css" type="text/css" media="print">
</head>
<body bgcolor="#ffffff" text="#000000" topmargin="1" leftmargin="0">
<table width="100%" border="0" cellspacing="0" cellpadding="4">
  <tbody>
  <tr><td class="sstop" valign="bottom" align="left" width="25%">
      Social Security Online
    </td><td valign="bottom" class="titletext">
      <!-- sitetitle -->Popular Baby Names
    </td>
  </tr>
  <tr bgcolor="#333366"><td colspan="2" height="2"></td></tr>
  <tr><td class="graystars" width="25%" valign="top">
       <a href="../OACT/babynames/">Popular Baby Names</a></td><td valign="top"> 
      <a href="http://www.ssa.gov/"><img src="/templateimages/tinylogo.gif"
      width="52" height="47" align="left"
      alt="SSA logo: link to Social Security home page" border="0"></a><a name="content"></a>
      <h1>Popular Names by Birth Year</h1>September 12, 2007</td>
  </tr>
  <tr bgcolor="#333366"><td colspan="2" height="1"></td></tr>
</tbody></table>
<table width="100%" border="0" cellspacing="0" cellpadding="4" summary="formatting">
  <tr valign="top"><td width="25%" class="greycell">
      <a href="../OACT/babynames/background.html">Background information</a>
      <p><br />
      &nbsp; Select another <label for="yob">year of birth</label>?<br />      
      <form method="post" action="/cgi-bin/popularnames.cgi">
      &nbsp; <input type="text" name="year" id="yob" size="4" value="1990">
      <input type="hidden" name="top" value="1000">
      <input type="hidden" name="number" value="">
      &nbsp; <input type="submit" value="   Go  "></form>
    </td><td>
<h3 align="center">Popularity in 1990</h3>
<p align="center">
"""
re1 = r'\d\d\d\d'
re1_revised = r'[12]\d\d\d'
match = re.search(re1,html_string)
match_two = re.search(re1_revised,html_string)
# match object tells you positions in string where match begins and ends (match.start() and match.end()).  
# Let's look at  this span

#match = None
#match_two = None
if match:
   print(html_string[match.start():match.end()])
if match_two:
   print(html_string[match_two.start():match_two.end()])
   print(match_two.group())

8601
2005
2005


Discuss how well this regular expression worked at extracting the year. If it failed, explain why.
You may edit this cell.

This exercise should have convinced you that you need to amend the regular expression to provide some context; four digits in a row, even if the first is required to be 1 or 2, won't do it.  In the next cell, define and test a new regular expression that does
the job. You may want to try some of the exercises in the following sections first, to get some practice with regular expressions.

For the next html string, you want to find ALL the triples of the form RANK, MALE NAME, FEMALE NAME.
Your output should look like this:

   [('1', 'Jacob', 'Emma'), ('2', 'Michael', 'Isabella'), ('3', 'Ethan', 'Emily')]
   
You can get this using `re.findall`.  The next cell gives you a pretty helpful example of how to use it.

In [7]:
import re
html_str2 = """<tr align="center" valign="bottom">
  <th scope="col" width="12%" bgcolor="#efefef">Rank</th>
  <th scope="col" width="41%" bgcolor="#99ccff">Male name</th>
<th scope="col" bgcolor="pink" width="41%">Female name</th></tr>
<tr align="right"><td>1</td><td>Jacob</td><td>Emma</td>
</tr>
<tr align="right"><td>2</td><td>Michael</td><td>Isabella</td>
</tr>
<tr align="right"><td>3</td><td>Ethan</td><td>Emily</td>
</tr>"""
res1 = re.findall(r'<tr\s+.+><td>\d+</td>',html_str2)
res2 = re.findall(r'<tr\s+.+><td>(\d+)</td>',html_str2)

(res1, res2)

(['<tr align="right"><td>1</td>',
  '<tr align="right"><td>2</td>',
  '<tr align="right"><td>3</td>'],
 ['1', '2', '3'])

Notice the very different results you get with very similar `findall` requests.  The function `findall` is written so as to retrieve the **groups** in your regular expression. The groups in your regular expression are defined by parentheses.  If there are no groups (no parentheses), `findall` returns a list of complete matches.  So the first result above is what you get for a regular expression with no groups, and the second is what you get for a regular expression with one group.  If your regular expression contains multiple groups, you get a list of tuples.  Each tuple member corresponds to one group in the pattern.  Since you're being asked for a result that is a list of triples, you want a regular expression with 3 groups.
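A small sketch of the multiple-group case, using a made-up string of `name=number` pairs rather than the homework HTML:

```python
import re

# Hypothetical data, not from the baby-names page.
log = "alice=3; bob=12; carol=7"

# Two groups in the pattern, so findall returns a list of 2-tuples.
pairs = re.findall(r'(\w+)=(\d+)', log)
print(pairs)
```

This prints `[('alice', '3'), ('bob', '12'), ('carol', '7')]`. Every tuple member is a string, even when it looks like a number.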

## Solving crosswords (requires NLTK)

The following example is adapted from [the NLTK Book, Ch. 3.](http://www.nltk.org/book/ch03.html)

Let's say we're in the midst of doing a crossword puzzle and we need an 8-letter word
meaning *sad* whose third letter is *j* and whose sixth letter is *t*.
We want
words that match the following regular expression pattern:

   '^..j..t..$'

Notice that this specifies a string of exactly 8 characters because
of the `^` and the `$`, which mark the beginning
and ending of the string, respectively.  Each `.` is a wildcard
which matches exactly one character but will match any character.
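To see what the anchors contribute, here is a small sketch that runs the pattern with and without `^` and `$` on a hand-picked word list (the list is an assumption for illustration and does not need NLTK):

```python
import re

# A tiny hand-picked word list for illustration.
sample_words = ['dejected', 'unjustly', 'majestically', 'jet']

# Anchored: the whole word must be exactly 8 characters long.
anchored = [w for w in sample_words if re.search(r'^..j..t..$', w)]

# Unanchored: any 8-character stretch inside the word is enough.
unanchored = [w for w in sample_words if re.search(r'..j..t..', w)]

print(anchored)
print(unanchored)
```

The anchored search keeps only `dejected` and `unjustly`. The unanchored one also accepts `majestically`, because the substring `majestic` fits the pattern.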


In [4]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [1]:
import nltk
print(nltk.__version__)
nltk.download('words')

3.3
[nltk_data] Error loading words: <urlopen error [Errno 8] nodename nor
[nltk_data]     servname provided, or not known>


False

In [3]:
import re
from nltk.corpus import words
wds = words.words()
print(len(wds))
#235786
cands = [w for w in wds if re.search('^..j..t..$',w)]
cands

236736


['abjectly',
 'adjuster',
 'dejected',
 'dejectly',
 'injector',
 'majestic',
 'objectee',
 'objector',
 'rejecter',
 'rejector',
 'unjilted',
 'unjolted',
 'unjustly']

And now we check our list and there it is: *dejected*.
Will you ever be stumped by a crossword puzzle again?

## Textonyms

The [NLTK Book, Ch. 3](http://www.nltk.org/book/ch03.html) introduces the
concept of a **textonym** with this definition:

   The T9 system is used for entering text on mobile phones: Two or more words that are 
   entered with the same sequence of keystrokes are known as textonyms. For example, 
   both *hole* and *golf* are entered by pressing the sequence `4653`. What other words could 
   be produced with the same sequence? 

   Here we  could use the regular expression `'^[ghi][mno][jlk][def]$'`.  

    >>> [w for w in wds if re.search('^[ghi][mno][jlk][def]$', w)]
    ['gold', 'golf', 'hold', 'hole']

Try the following.  Find all words that can be spelled out with the sequence
`3456`.

In [7]:
[w for w in wds if re.search('^[def][ghi][jkl][mno]$', w)]

[u'dilo', u'film', u'filo']

In [8]:
[w for w in wds if re.search('^[ghi][mno][jlk][def]$', w)]

[u'gold', u'golf', u'hold', u'hole', u'gold', u'hole']
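Writing the character classes by hand gets tedious. One possible sketch builds the pattern from the digit sequence instead. The names `KEYPAD`, `t9_pattern`, and `textonyms` are hypothetical helpers, and the word list is a small stand-in for `wds`.

```python
import re

# Letters on a standard phone keypad (digits 2-9 only).
KEYPAD = {'2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
          '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz'}

def t9_pattern(digits):
    """Build an anchored regex with one character class per digit."""
    return '^' + ''.join('[%s]' % KEYPAD[d] for d in digits) + '$'

def textonyms(digits, wordlist):
    """Return the words in wordlist that are typed with this digit sequence."""
    pat = re.compile(t9_pattern(digits))
    return [w for w in wordlist if pat.search(w)]

# A small stand-in for the NLTK word list.
sample_words = ['gold', 'golf', 'hold', 'hole', 'film', 'help']

print(t9_pattern('4653'))
print(textonyms('4653', sample_words))
print(textonyms('3456', sample_words))
```

With the real list you would call `textonyms('3456', wds)` instead of passing `sample_words`.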

## Regular expression practice

In [10]:
import re
pat = r'a|b|c'
pat2 = r'[abc]'
pat3 = r'\w\w\w'
print(pat3)
pat4 = '\\w\\w\\w'
print(pat4)
print(re.match(pat3,'bcd'))
print(re.match(pat3,'1bd'))
print(re.match(pat3,'b1d'))
print(re.match(pat3,'b-d'))
print(re.match(pat3,'b?d'))
print(re.match(pat3,'b d'))
print(re.match(pat3,'bda '))

\w\w\w
\w\w\w
<_sre.SRE_Match object; span=(0, 3), match='bcd'>
<_sre.SRE_Match object; span=(0, 3), match='1bd'>
<_sre.SRE_Match object; span=(0, 3), match='b1d'>
None
None
None
<_sre.SRE_Match object; span=(0, 3), match='bda'>


Edit this cell and after each regular expression, describe the class of strings it matches.  Check your answer by examining the output of the code cell that follows.

1.  [a-zA-Z]+
2.  [A-Z][a-z]+
3.  \d+(\.\d+)?
4.  ([bcdfghjklmnpqrstvwxyz][aeiou][bcdfghjklmnpqrstvwxyz])*
5.  \w+|[^\w\s]+ 

In [3]:
########################################
###     Some regular expressions     ###
########################################

re2 = r'[a-zA-Z]+'   #Any string consisting of letters of the alphabet, upper or lower case
re3 = r'[A-Z][a-z]+'  
re4 = r'\d+(\.\d+)?'
re5 = r'([bcdfghjklmnpqrstvwxyz][aeiou][bcdfghjklmnpqrstvwxyz])*'
re6 = r'\w+|[^\w\s]+'

res = [re2,re3,re4,re5,re6]

########################################
###     Some example strings         ###
########################################

example1 = 'abracadabra'
example2 = '1billygoat'
example3 = 'billygoat1'
example4 = '43.1789'
example4a = '43x1789'
example5 = '43.'
example6 = '43'
example7 = 'road_runner'
example8 = ' road_runner'
example9 = 'bathos'
example10 = "The little dog laughed to see such a sight."
example11 = 'socrates'
example12 = 'Socrates'
example13 = '*&%#!?'
example14 = 'IBM'

examples = [example1,example2,example3,example4,example4a,example5,example6,
            example7,example8,example9,example10,example11,example12,example13,
            example14]

########################################
###     Trying some matches          ###
########################################

for i,re_pat in enumerate(res):
    banner = 're%d %s' % (i+2,re_pat)
    print() 
    print(banner)
    print('=' * len(banner))
    print()
    for (j,ex) in enumerate(examples):
        match = re.match(re_pat,ex)
        if match:
            print('  %2d. %-45s  %s' % (j+1,ex,ex[match.start():match.end()]))
        else:
            print('  %2d. %-45s  %s' %(j+1,ex,None))


re2 [a-zA-Z]+

   1. abracadabra                                    abracadabra
   2. 1billygoat                                     None
   3. billygoat1                                     billygoat
   4. 43.1789                                        None
   5. 43x1789                                        None
   6. 43.                                            None
   7. 43                                             None
   8. road_runner                                    road
   9.  road_runner                                   None
  10. bathos                                         bathos
  11. The little dog laughed to see such a sight.    The
  12. socrates                                       socrates
  13. Socrates                                       Socrates
  14. *&%#!?                                         None
  15. IBM                                            IBM

re3 [A-Z][a-z]+

   1. abracadabra                                    None
   2. 1billygoat  

Make sure you can answer the following questions about the results of testing these regular expressions on the examples:

1. Why does `re2` fail on `example8`?
1. Why does `re3` only succeed on `example10` and `example12`?  Be sure to explain why it fails
   on `example14`.
1. When `re4` matches `example5`, why isn't the decimal point part of the match?
1. All of the regular expressions except `re5` report a `None` with at least
   one of the examples.  Why doesn't `re5` report any `None`s?
1. Why does `re6` match all the characters in `example13`?
1. Why doesn't `re6` match `example8`?

## An example that requires NLTK to be installed

   To run the code for this example, you will use a **balanced corpus**
   of English texts, a corpus collected with the purpose of representing
   a balanced variety of English text types: fiction, poetry, speech,
   non fiction, and so on.  One relatively well-established, free,
   and easy-to-get example of such a corpus is the **Brown Corpus.**
   Brown is about 1.2 M words. 
   
   You can import the corpus as follows:

     >>> from nltk.corpus import brown


   If this does not work, it is because you have nltk installed without the accompanying
   corpora. You can download any nltk corpus you need through the `nltk.download` function.  For example,
   to get the Brown corpus, do the following in Python:
      
      >>> import nltk
      >>> nltk.download()

   This brings up a window you can interact with.  There are some tabs
   at the top.  Choose the tab labeled *Corpora*,
   select **Brown**, and click the **download** button
   at the bottom of the window.   You will then have
   Brown on your machine and you can import the corpus as follows:

     >>> from nltk.corpus import brown

   The following returns a list of all 1.2 M word tokens in Brown:

     >>> brown.words()
     ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]



In [4]:
# From http://www.nltk.org/book/ch03.html
#  Find the most common vowel sequences in English.  Note: be patient.  Evaluating this may take a while.
import re
from nltk.corpus import brown
from collections import Counter
bw = sorted(set(brown.words()))
# Find every instance of two or more consecutive vowels, and count tokens of each.
ctr = Counter(vs  for word in bw for vs in re.findall('[aeiou]{2,}',word)
              )
ctr.most_common(25)

[('io', 2787),
 ('ea', 2249),
 ('ou', 1855),
 ('ie', 1799),
 ('ia', 1400),
 ('ee', 1289),
 ('oo', 1174),
 ('ai', 1145),
 ('ue', 541),
 ('au', 540),
 ('ua', 502),
 ('ei', 485),
 ('ui', 483),
 ('oa', 466),
 ('oi', 412),
 ('eo', 250),
 ('iou', 225),
 ('eu', 187),
 ('oe', 181),
 ('iu', 128),
 ('ae', 85),
 ('eau', 54),
 ('uo', 53),
 ('eou', 52),
 ('uou', 37)]

## Poker examples

Suppose you are writing a poker program where a player’s hand is represented as a 5-character string with each character representing a card, “a” for ace, “k” for king, “q” for queen, “j” for jack, “t” for 10, and “2” through “9” representing the card with that value.

To see if a given string is a valid hand, one could run the code in the following cell:

In [16]:
import re
def displaymatch(regex,text):
    match = regex.match(text)
    if match is None:
        matchstring = None
    else:
        matchstring = '%s[%s]%s' % (text[:match.start()],text[match.start():match.end()],text[match.end():])
    print('%-10s %s' % (text,matchstring))

valid = re.compile(r"^[a2-9tjqk]{5}$")

## Some examples
displaymatch(valid, "akt5q")  # Valid.
displaymatch(valid, "akt5e")  # Invalid.
displaymatch(valid, "akt")    # Invalid.
displaymatch(valid, "727ak")  # Valid.
displaymatch(valid, "727aka")  # Invalid.
displaymatch(valid, "aaaaa")  # Not a legal hand (five aces), but the pattern accepts it.

akt5q      [akt5q]
akt5e      None
akt        None
727ak      [727ak]
727aka     None
aaaaa      [aaaaa]


The hand "727ak" contains a pair, and we would like to recognize such hands as special, so that we can go all in.  We can do this using regular expression groups and backreferences.  The match for each parenthesized part of a regular expression is called a **group**.  We can refer back to the text matched by a group with `\N`, where `N` is the group's number: `\1` refers to the first group, `\2` to the second, and so on (Python allows numbers from 1 through 99).  So to match poker hands with pairs, we do the following.
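Before turning to poker hands, here is a minimal sketch of a backreference on ordinary text: a pattern that finds an accidentally doubled word. The name `doubled` and the sample sentences are made up for illustration.

```python
import re

# \b(\w+) captures a word; \s+\1\b then requires the same word again.
doubled = re.compile(r'\b(\w+)\s+\1\b')

hit = doubled.search('Paris in the the spring')
miss = doubled.search('Paris in the spring')

print(hit.group())    # the whole doubled stretch
print(hit.group(1))   # the word that was repeated
print(miss)
```

The first search finds `'the the'` with group 1 equal to `'the'`; the second returns `None`.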

In [17]:
pair = re.compile(r".*(.).*\1.*")
displaymatch(pair,"727ak")
displaymatch(pair,"723ak")
pair.match("727ak").groups()[0]

727ak      [727ak]
723ak      None


'7'

In [18]:
displaymatch(pair,"a2aak")
pair.match("aa2ak").groups()[0]

a2aak      [a2aak]


'a'

Of course, the regex `pair` does not require the text string to be a poker hand.  We could revise it to do that, but if you think about it a little, that would make the regex **a lot** more complicated.  What we could do instead is first apply `valid` to guarantee we've got a valid poker hand and then apply `pair` to find out if it contains a pair.  This keeps both regexes simple and easy to understand while still enforcing all the constraints we want.  Often a good strategy in applying regexes to enforce some complicated constraints is to divide the constraints up into separate categories and apply them **in succession**.
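One way to sketch the "in succession" strategy is a small function that applies `valid` first and `pair` second. The function name `pair_card` is hypothetical; the two patterns are the ones used in the cells above.

```python
import re

valid = re.compile(r"^[a2-9tjqk]{5}$")   # five legal card characters
pair = re.compile(r".*(.).*\1")          # some character occurs twice

def pair_card(hand):
    """Return the repeated card if hand is valid and contains a pair, else None."""
    if not valid.match(hand):
        return None                      # fails the first constraint
    m = pair.match(hand)
    return m.group(1) if m else None     # second constraint

print(pair_card('727ak'))   # valid hand with a pair of sevens
print(pair_card('723ak'))   # valid hand, no pair
print(pair_card('xx345'))   # repeated character, but not a valid hand
```

Only the first call returns a card (`'7'`); the other two return `None`, each for a different reason.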

A problem with `pair` is that it doesn't tell us what we've got a pair of.  Actually, the match object contains this information.  It has a method called `groups` which returns a tuple of all portions of the string that matched a group.  We can use a revised version of `displaymatch` to print this, when requested:

In [19]:
import re
def displaymatch(regex,text, print_groups=False):
    match = regex.match(text)
    if match is None:
        matchstring = None
    else:
        matchstring = '%s[%s]%s' % (text[:match.start()],text[match.start():match.end()],text[match.end():])
    if print_groups and match:
        print('%-10s %s %s' % (text,matchstring,match.groups()))
    else:
        print('%-10s %s' % (text,matchstring))

# Re for recognizing pair hands
pair = re.compile(r".*(.).*\1")
print("pair")
displaymatch(pair,"723ak",print_groups=True)
displaymatch(pair,"7a3ak",print_groups=True)
print()
## Write your regex for recognizing two pair below and test it.
## The version given here is not adequate. Look at the examples to see why.
print("two pair")
two_pair = re.compile(r".*(.).*(.).*\1.*\2.*")
displaymatch(two_pair,"7a272",print_groups=True)
displaymatch(two_pair,"722a7",print_groups=True)  # should succeed, does not
displaymatch(two_pair,"7722a",print_groups=True)  # should succeed, does not
#displaymatch(two_pair,"7a722",print_groups=True)
#displaymatch(two_pair,"727a2",print_groups=True)
#displaymatch(two_pair,"aaaa2",print_groups=True)  # Will succeed on this one, but that's ok

pair
723ak      None
7a3ak      [7a3a]k ('a',)

two pair
7a272      [7a272] ('7', '2')
722a7      None
7722a      None


## Questions

1.  Write regexes that match three-of-a-kind hands and four-of-a-kind hands.  Follow the model of `pair` and don't bother to
    guarantee that it's a valid poker hand.
2.  It's quite complex to write a regular expression that checks to see if you've got a straight, but you can try the 
    following strategy.  First, verify you've got a valid poker hand; then verify you haven't got a pair, three-of-a-kind, or
    four-of-a-kind.  So you have a valid poker hand with no repetitions and you don't need the regex that checks for straights
    to rule those out.
    
    Now write a regex that will check to see if a valid poker hand 
    with no repetitions is a straight beginning with '2'.  It should succeed on `23456` and `25643` and `32654` and it should fail on
    `24357`.  To deal with all possible straights in this way, how many cases are there to take care of?  Write a single regular
    expression that will identify any straight, given that it is a valid poker hand with no repetitions.  Test it on the 
    straights above and on straights like `akqjt` and on the non-straight `24357`.
3.  Write a regex that matches a two pair hand. This is tricky and the most natural answer will also match four-of-a-kind. 
    Assume we've eliminated that possibility by failing to match the four-of-a-kind pattern from 1.  You should 
    test `722a7`, `7a722` and `727a2`.  You will need a pattern that is a big disjunction using `|`, and you will need to
    enclose the disjuncts of this big disjunction in parentheses, but for that purpose you will need parentheses that don't
    count as defining a retrievable group.  The notation for that is `(?:` instead of `(` [the same right paren is used 
    in both cases]. See [Python regex docs.](http://docs.python.org/2/library/re.html)

## How to do extraction

The following example is from [the Weather Underground page for San Diego](http://www.wunderground.com/weather-forecast/US/CA/San_Diego.html).  The temperature is regularly given in a page division (HTML tag `div`) with ID (HTML attribute `id`) `nowTemp`, inside a nested division with ID `tempActual`.  If we can find that division and the temperature inside it, we have what we want.  The pattern needs to be compiled with flags that allow it to match across multiple lines (`re.DOTALL` lets `.` match newline characters), because the context that identifies the temperature is spread over several lines.  Compiling regular expressions also makes them more efficient when reused.  A key point is that we place the actual temperature we want inside parentheses, the `(\d{1,3}\.\d)` part of the pattern.  Portions of a pattern that occur in parentheses and are matched are returned by the `groups` method of the match object, as a tuple of all the matched strings in parentheses in the pattern.

In [5]:
import re
html_string = """
<div class="br10" id="stationSelect">
		<a class="br10" id="stationselector_button" href="javascript:void(0);" onclick="_gaq.push(['_trackEvent', 'Station Select', 'Opened']);"><span>Station Select</span></a>
		</div>
		</div>
		<div id="conds_dashboard">
		<div id="hour00">
		<div id="nowCond">
		<div class="titleSubtle">Now</div>
		<div id="curIcon"><a href="" class="iconSwitchBig"><img src="http://icons-ak.wxug.com/i/c/k/nt_partlycloudy.gif" width="44" height="44" alt="Scattered Clouds" class="condIcon" /></a></div>
		<div id="curCond">Scattered Clouds</div>
		</div>
		<div id="nowTemp">
		<div class="titleSubtle">Temperature</div>
		<div id="tempActual"><span id="rapidtemp" class="pwsrt" pwsid="KCASANDI123" pwsunit="english" pwsvariable="tempf" english="&deg;F" metric="&deg;C" value="55.8">
  <span class="nobr"><span class="b">55.8</span>&nbsp;&deg;F</span>
</span></div>
		<div id="tempFeel">Feels Like
  <span class="nobr"><span class="b">55.1</span>&nbsp;&deg;F</span>
</div>
		</div>
"""
pattern = r'<div\s+id\s*=\s*\"tempActual\"\s*>.*?(\d{1,3}\.\d).*?</div>'
pattern_re = re.compile(pattern,re.MULTILINE | re.DOTALL)
#m = re.search(pattern_re,html_string)
#m.groups()
pattern_re.findall(html_string)

['55.8']

The pattern in the example above was built up piece by piece.  First we built a regular expression matching the `<div id="tempActual">` part of the pattern.  That piece looked like this:

     subpattern = r'<div\s+id\s*=\s*\"tempActual\"\s*>'

The `\s*` aren't needed for this particular string, but there is considerable variation in how actual HTML is generated, and since
white space in the `\s*` positions wouldn't be meaningful, it is allowed.  Next we tested the core part of the pattern on its own:

     corepattern = r'(\d{1,3}\.\d)'

Finally we tested the last part:

     lastpattern = r'</div>'
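The piece-by-piece approach can be sketched as follows. The HTML fragment is a shortened, hypothetical stand-in for the page above, the opening piece uses the `tempActual` tag from the code cell, and the variable names are chosen for illustration.

```python
import re

# Hypothetical, shortened fragment in the same shape as the page above.
snippet = '''<div id="tempActual"><span value="55.8">
  <span class="b">55.8</span>&nbsp;&deg;F</span>
</span></div>'''

opening = r'<div\s+id\s*=\s*"tempActual"\s*>'
core = r'(\d{1,3}\.\d)'
closing = r'</div>'

# Test each piece on its own before gluing the pieces together.
for name, piece in [('opening', opening), ('core', core), ('closing', closing)]:
    print(name, bool(re.search(piece, snippet)))

# Join the pieces with lazy fillers; DOTALL lets '.' cross line breaks.
full = re.compile(opening + r'.*?' + core + r'.*?' + closing, re.DOTALL)
temps = full.findall(snippet)
print(temps)
```

If one of the pieces prints `False`, you know which part of the pattern to fix before looking at the combined expression.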

## Tokenization  (NLTK assumed)

Tokenization is the process of breaking up a text into words.  We have in some cases used `split()` for this purpose, uniformly splitting a text up into words on the spaces, but this doesn't always yield the right results, as the next examples show.

There are three tokenizations of the string `text` defined in the cell
below, `try1`, `try2`, and `try3`; `try1` shows what happens
when we just use the Python `split`; `try2` and `try3` use a regular
expression that defines different cases of a proper word,
such as 

1. an abbreviation with periods
2. an ordinary alphabetic word, with an optional hyphen 
3. a string of digits, possibly with a decimal, a dollar sign,
   or a percent
 
and so on.  We apply this pattern to the example string `text`, 
using the `re` module function `findall` to find
all substrings of `text` that match the pattern.

In [2]:
# From http://www.nltk.org/book/ch03.html
import re

text = """
"That," said  Fred, "is what
you ... get in the U.S.A. for $5.29."
"""
try1 = text.split()

# Notice the use of special NONCAPTURING parens (?:...)
# All parens in the regexp must be non capturing.
pattern = r""" 
   (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
  |\w+(?:-\w+)*        # words with optional internal hyphens
  |\$?\d+(?:\.\d+)?%?  # numbers, money and percents, e.g. 3.14, $12.40, 82%
  |\.\.\.            # ellipsis
  |[][.,;"'?():_`-]  # keep punctuation, delimiters as separate word tokens
"""
re_flags = re.UNICODE | re.MULTILINE | re.DOTALL | re.X
pattern_re = re.compile(pattern,re_flags)
try2 = pattern_re.findall(text)
# Or equivalently, let nltk do some of the work.
import nltk
try3 = nltk.regexp_tokenize(text,pattern,flags=re_flags)

In [85]:
try1

[u'"That,"',
 u'said',
 u'Fred,',
 u'"is',
 u'what',
 u'you',
 u'...',
 u'get',
 u'in',
 u'the',
 u'U.S.A.',
 u'for',
 u'$5.29."']

The `split` tokenized sentence has some very strange words, for example the 7-character string `"That,"`, the 5-character string `Fred,`, and the 3-character string `"is`. What's being missed here is that certain characters (like comma and quotation-mark) unambiguously mark a word boundary.  Regular expressions are very good at enforcing this sort of generalization, as we can see by comparing the results of tokenizing the same sentence with a regexp that does not allow words to continue past boundary markers.

In the next two tries, we use such a regular expression (defined
as `pattern`), compiling it using `re.compile` (for efficiency) and using some compiling flags.  See `re` module docs for a complete description.
Here, we'll discuss just the most frequently used one, the `X` flag:

>  This flag allows you to write regular expressions 
>  that look nicer and are more readable by allowing 
>  you to visually separate logical sections of the pattern
>  and add comments. Whitespace within the pattern is ignored, 
>  except when in a character class, or when preceded by an 
>  unescaped backslash, or in groups or inside a few special operators.  

>  When a line contains a comment character (#) that is not 
>  in a character class and is not preceded by an unescaped 
>  backslash, all characters from the leftmost such # through 
>  the end of the line are ignored.
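A quick way to convince yourself that the `X` flag changes only the layout and not the meaning is to compile the same expression both ways and compare. This is a sketch with a made-up sample sentence; the names `compact` and `verbose` are chosen for illustration.

```python
import re

# The same expression, once on one line and once spread out with comments.
compact = re.compile(r'\$?\d+(?:\.\d+)?')
verbose = re.compile(r'''
    \$?            # optional dollar sign
    \d+            # integer part
    (?:\.\d+)?     # optional decimal part
''', re.X)

sample = 'Lunch was $5.29 and coffee was 3 dollars.'
print(compact.findall(sample))
print(verbose.findall(sample))
```

Both calls return `['$5.29', '3']`.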

Next, we call `pattern_re.findall(text)`, which returns a list of all the substrings of `text` that match `pattern`. 
Since each part (line) of `pattern` is written so as to match
a different case of a proper word, the call
returns a list of the proper words in `text`.
Note that all parentheses in `pattern` are what the `re`-module docs call "non-capturing".  This means no **groups** are defined by these parens, the
matches against expressions in such parens are not put into
a register, and they are not returned as separate components
in a `findall`.   This is what we want  since the parentheses
in `pattern` wrap around parts of words, and we don't want the
tokenizer returning word parts, just complete words.

The results of using `findall` and the `nltk` tokenizer are equivalent.
Basically what the `nltk` tokenizer does is compile the regexp using
flags and use `findall`.  The `nltk` tokenizer also offers another
option, that of writing a tokenizer that matches all word
**boundaries** and then using the `re` module method `split`.  That approach
has some advantages in some situations, but it is not shown here.

In [108]:
try3

[u'"',
 u'That',
 u',',
 u'"',
 u'said',
 u'Fred',
 u',',
 u'"',
 u'is',
 u'what',
 u'you',
 u'...',
 u'get',
 u'in',
 u'the',
 u'U.S.A.',
 u'for',
 u'$5.29',
 u'.',
 u'"']

In [3]:
try2 == try3

True

Python regular expressions use parentheses for two different things: defining retrievable groups, which, as we saw, is useful for extraction, and defining the scope of some regular expression operator (like `*` or `+`).  Sometimes these two roles get in each other's way.  This is what would happen in `pattern` above: Python's `findall` handles groups specially, so if the parenthesized elements were ordinary groups it would return only those parts instead of whole tokens.  We therefore use the regular expression convention of changing `(` to `(?:`.  The `(?:` functions unambiguously to scope an operator and does not define a retrievable group.  Some older versions of NLTK provided a function, `convert_regexp_to_nongrouping`, to make this change automatically; in the code above the `(?:` is written by hand.  We then compile the regular expression using various regular expression compiling flags.  `re.MULTILINE` and `re.DOTALL` change how `^`, `$`, and `.` behave on multi-line text, while `re.UNICODE` allows our definition of word, which depends on the interpretation of `\w`, to apply to Unicode characters.  Finally, `re.X` is the most directly relevant to this example.  This allows regular expressions that intersperse comments, which makes them much more readable.  See [Python.org re docs](http://docs.python.org/2/library/re.html) for more details.
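The difference is easy to see on a small example. The sketch below runs the hyphenated-word part of the pattern twice, once with ordinary parentheses and once with `(?:`; the sample string is made up for illustration.

```python
import re

sample = 'state-of-the-art well-known plain'

# Ordinary parentheses: findall returns only what the group matched
# (the last repetition, or '' if the group never matched).
capturing = re.findall(r'\w+(-\w+)*', sample)

# Non-capturing parentheses: findall returns the complete matches.
non_capturing = re.findall(r'\w+(?:-\w+)*', sample)

print(capturing)
print(non_capturing)
```

The first call returns `['-art', '-known', '']`, which is not much use to a tokenizer; the second returns the three whole words.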


In [21]:
text = """
"That," said  Fred, "is what
you ... get in the U.S.A. for $5.29."
"""
# This is illegal, do you know why?
#patx = r'\b\B+\b'
patx = r'\w+'
re_flags = re.UNICODE | re.MULTILINE | re.DOTALL | re.X
patx_re = re.compile(patx,re_flags)
try4 = patx_re.findall(text)

Here is what you get.  Is this a good result?

In [22]:
try4

['That',
 'said',
 'Fred',
 'is',
 'what',
 'you',
 'get',
 'in',
 'the',
 'U',
 'S',
 'A',
 'for',
 '5',
 '29']

## Sentence boundary detection

In [23]:
import re
text = """
The king rarely saw Marie 
on Tuesdays, but
he did see her  on Wednesdays.  He liked
to take long walks
in the garden, gazing longingly at the
rhododendrons.  She
thought this
odd.  Me, too.
"""
lines = re.split(r'\s*[!?.]\s*', text)

In [24]:
lines

['\nThe king rarely saw Marie \non Tuesdays, but\nhe did see her  on Wednesdays',
 'He liked\nto take long walks\nin the garden, gazing longingly at the\nrhododendrons',
 'She\nthought this\nodd',
 'Me, too',
 '']

Now let's clean this up by removing unnecessary line breaks and white space.
For each element in `lines`, we split it, then put the pieces back together separated 
by single spaces.  Finally, we remove empty strings.

In [25]:
sentences0 = [' '.join(line.split()) for line in lines]
sentences = [exp for exp in sentences0 if exp]
sentences

['The king rarely saw Marie on Tuesdays, but he did see her on Wednesdays',
 'He liked to take long walks in the garden, gazing longingly at the rhododendrons',
 'She thought this odd',
 'Me, too']

We wrap it all up in a function, supplying the above pattern
as a default if the user doesn't specify one.

In [14]:
def sent_tokenize (text, pat=r'\s*[!?.]\s*'):
    lines = re.split(pat, text)
    sentences0 = [' '.join(line.split()) for line in lines]
    return [exp for exp in sentences0 if exp]
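Here is a short usage sketch. The function is repeated so the cell runs on its own, and both sample strings are made up. The second call shows a weakness of the default pattern: it also splits at the period inside a number.

```python
import re

def sent_tokenize(text, pat=r'\s*[!?.]\s*'):
    lines = re.split(pat, text)
    sentences0 = [' '.join(line.split()) for line in lines]
    return [exp for exp in sentences0 if exp]

# Sentence enders followed by irregular white space and a line break.
first = sent_tokenize("Is it late?  Yes!   Go home\nnow.")
print(first)

# The period in "$5.29" is treated as a sentence ender too.
second = sent_tokenize("It cost $5.29 today.")
print(second)
```

The first call gives `['Is it late', 'Yes', 'Go home now']`; the second gives `['It cost $5', '29 today']`, which is why the pattern used in the next section requires white space after the sentence ender.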

## Putting it all  together

In the next cell we use **negative lookahead**, which allows us 
to match an instance of one pattern as long as it is not immediately
followed by an instance of another.  For example, using `r"Isaac(?!\s*Asimov)"`
to define a pattern that matches "Isaac" when it is not immediately followed by
"Asimov" (possibly after some white space), we get:

In [252]:
text = 'Isaac Asimov patted Isaac Stern on the back'
print(re.findall(r"Isaac",text))
print(re.findall(r"Isaac(?!\s*Asimov)",text))

['Isaac', 'Isaac']
['Isaac']


We input a raw text string and first
tokenize sentences, then words within sentences,
returning a list of tokenized sentences.
Each tokenized sentence is  a list of words.

In [2]:
import nltk
import re

pattern = r""" 
   (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
  |\$?\d+(?:\.\d+)?%?  # numbers, money and percents, e.g. 3.14, $12.40, 82% 
  |\$?\.\d+%?         # numbers, money and percents, e.g. .14, $.40, .8%
  |\w+(?:-\w+)*        # words with optional internal hyphens. NB \w includes \d
  |\.\.\.            # ellipsis
  |[][./,;"'!?():_`-]  # keep punctuation, delimiters as separate word tokens
"""

re_flags = re.UNICODE | re.MULTILINE | re.DOTALL | re.X
# Add to our sentence boundary pattern the requirement
# that the character following the sentence ender
# must NOT be a lower case letter (a-z).
back_pat = r'\s*[!?.]\s+(?![a-z])'
def sent_tokenize (text, pat = r'\s*[!?.]\s+'):
    lines = re.split(pat, text)
    sentences0 = [' '.join(line.split()) for line in lines]
    return [exp for exp in sentences0 if exp]

text = """
The king rarely saw Marie 
on Tuesdays, but
he did see her  on Wednesdays.  He liked
to take long walks
in the garden, gazing longingly at the
rhododendrons.  She
thought this
odd.  Me, too.
"That," said  Fred, "is what
you (Texans!) get in 1/2 the U.S.A. for $5.29, .23% of nothing."
"""
sents = sent_tokenize(text,pat = back_pat)
tokenized_sents = [nltk.regexp_tokenize(sent, pattern, flags=re_flags)
                   for sent in sents]
tokenized_sents

[['The',
  'king',
  'rarely',
  'saw',
  'Marie',
  'on',
  'Tuesdays',
  ',',
  'but',
  'he',
  'did',
  'see',
  'her',
  'on',
  'Wednesdays'],
 ['He',
  'liked',
  'to',
  'take',
  'long',
  'walks',
  'in',
  'the',
  'garden',
  ',',
  'gazing',
  'longingly',
  'at',
  'the',
  'rhododendrons'],
 ['She', 'thought', 'this', 'odd'],
 ['Me', ',', 'too'],
 ['"',
  'That',
  ',',
  '"',
  'said',
  'Fred',
  ',',
  '"',
  'is',
  'what',
  'you',
  '(',
  'Texans',
  '!',
  ')',
  'get',
  'in',
  '1',
  '/',
  '2',
  'the',
  'U.S.A.',
  'for',
  '$5.29',
  ',',
  '.23%',
  'of',
  'nothing',
  '.',
  '"']]