Lex and Yacc Annotated Examples


Example 0

First, the simplest lex program:
%%
.|\n ECHO;
%%

This program doesn't do much. There is no definition section, and there are no user subroutines either, so all that's left is the rules section. The rules section contains just one rule.

Recall that the form of a rule is pattern action ;. The pattern in example 0 is .|\n. The "." matches any character other than end-of-line, the "|" indicates "or", and "\n" matches end-of-line, so this pattern matches every character. Note that the pattern begins in column zero -- adding whitespace to the left of the pattern is a Bad Thing(TM), because lex will not treat an indented line as a rule.

The action is ECHO. Note that it is separated from the pattern by whitespace. Recall that ECHO is a predefined action that simply copies the matched string to standard output.

And the line is terminated with a semicolon, which ends the C statement that makes up the action. It is very easy to forget this, and it leads to rather odd errors. It is worth deliberately mangling a working lex file to see the typical error messages; then, when you unintentionally make an error in your lex file, you'll stand a better chance of recognizing it and of knowing what may have caused it.
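
For comparison, here is a rough hand-written C equivalent of what this one-rule lexer does. It is only a sketch, not the code lex generates, and the function name echo_all is made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: copy every character from in to out, the way the
 * single rule ".|\n ECHO;" does, and return how many were copied. */
long echo_all(FILE *in, FILE *out)
{
    long copied = 0;
    int c;

    assert(in != NULL && out != NULL);

    /* "." covers everything except newline and "\n" covers newline,
     * so no character is ever left unmatched. */
    while ((c = getc(in)) != EOF) {
        putc(c, out);
        copied++;
    }
    return copied;
}
```

Calling echo_all(stdin, stdout) from a main function should behave much like the compiled ex0 program.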

To compile this simple example, you must first run lex on the lex file, and then compile the generated C program. Recall that the "lex" program takes a lex file and produces a file named "lex.yy.c", and that when you compile the "lex.yy.c" file, you must tell the linker to use the lex library (by providing the "-ll" argument to the C compiler).

$ lex ex0.lex
$ ls
ex0.lex       lex.yy.c
$ cc -o ex0 lex.yy.c -ll
$ ls
ex0           ex0.lex          lex.yy.c
$



Example 1

Example one is a slightly more complicated example of a lexer. Note that it has a non-empty definition section and defines main in the user-subroutines section.

Of course, the definition section's content is trivial -- it's just a comment block. However, it does demonstrate the %{ and %} delimiters for including literal C code in the definition section.

The rules section begins with a mechanism for ignoring whitespace: it matches all tabs and spaces and applies the empty action to them. The comment is there for clarity, always a good thing.

The list of literals (forms of the verb "to be") has as the action for all except the last ("being") the pipe "|" character. This means "for this pattern, use the action that the following pattern uses". The space between the literal and the pipe is required; without it, lex would read the pipe as part of the pattern (the "or" operator).

The [a-zA-Z]+ pattern matches all other words, and the final rule echoes any remaining characters.


%{

/*
 * Lex Example #1
 *
 * Identify uses of the verb "to be".
 */ 

%}
%%

[\t ]+ 		/* ignore whitespace */ ;

is |
am |
are |
were |
was |
be |
being       	{ printf("'%s' is a form of the verb 'to be'.\n", yytext); }

[a-zA-Z]+   	{ printf("'%s' is not a form of the verb 'to be'.\n", yytext); }

.|\n		ECHO;

%%

int main( void ) {
   yylex();
}
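
Two matching rules that lex applies are worth spelling out for this example: it always prefers the longest match, and when two patterns match the same length it picks the rule listed first. That is why "is" is reported as a form of "to be" while "island" is not. The following is a rough C sketch of that decision, not the code lex generates. The function names are made up for illustration, and isalpha is assumed to run in the default C locale so that it agrees with [a-zA-Z].

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of the leading run of letters, i.e. what
 * the pattern [a-zA-Z]+ would match at the start of text. */
size_t word_length(const char *text)
{
    size_t n = 0;

    assert(text != NULL);
    while (isalpha((unsigned char)text[n]))
        n++;
    return n;
}

/* Returns 1 if the rule chosen at the start of text is one of the
 * literal "to be" patterns, 0 otherwise. */
int starts_with_to_be(const char *text)
{
    static const char *const forms[] = {
        "is", "am", "are", "were", "was", "be", "being"
    };
    size_t len = word_length(text);
    size_t i;

    for (i = 0; i < sizeof forms / sizeof forms[0]; i++) {
        /* A literal only wins when it covers the whole word: a longer
         * match by [a-zA-Z]+ beats it, and on a tie the literal wins
         * because its rule appears first in the file. */
        if (strlen(forms[i]) == len && strncmp(text, forms[i], len) == 0)
            return 1;
    }
    return 0;
}
```

With these definitions, starts_with_to_be("is here") yields 1, while starts_with_to_be("island") yields 0 because the six-letter match by the general word pattern is longer than the two-letter literal.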


Example 2

This example takes a simplified version of the grammar we worked on in class for recognizing a street address. It is very restrictive about what it will accept, but it should work for:

John Doe
1234 First Street
San Diego CA, 98765-4321
The Yacc file format is basically the same as the Lex file format: there is a definition section, a rules section, and a user-subroutines section. In the definition section we not only have a comment as literal C code, but we also include the standard I/O library, because we use a FILE pointer in the user-subroutines section. Also, note the %token line -- this declares the tokens we expect our lexer to identify.
%{
/*
 * Example 2 - Yacc
 *
 * Recognize a street address.
 */

#include <stdio.h>
%}

%token CAPSTRING CAPLETTER NUMBER STATE ZIPPLUSFOUR COMMA HASH DOT NEWLINE

%%

sentence: firstline secondline thirdline { printf("Have a valid address.\n"); }
        ;

firstline: firstname surname NEWLINE
         | firstname middlename surname NEWLINE
         ;

secondline: NUMBER street NEWLINE
          | NUMBER street HASH NUMBER NEWLINE
          ;

thirdline: city STATE COMMA zip NEWLINE
         ;

firstname: CAPSTRING ;

middlename: CAPLETTER DOT ;

surname: CAPSTRING ; 

street: CAPSTRING
       | CAPSTRING street
       ;

city: CAPSTRING
    | CAPSTRING city
    ;

zip: ZIPPLUSFOUR
   ;

%%

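extern FILE *yyin;

int main( void ) {

   /* Some lex implementations (flex, for one) leave yyin unset until
      yylex() first runs, so default it to stdin before testing it. */
   if ( yyin == NULL ) {
      yyin = stdin;
   }

   while( !feof( yyin ) ) {
      yyparse();
   }
   return 0;
}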

To generate the C code and necessary header file from this yacc file, we do the following:
$ yacc -d ex2.yacc
$ ls
ex2.yacc	y.tab.c		y.tab.h
$
The "-d" is optional, but if we do not include it, yacc will not generate the y.tab.h file. If we take a look at the y.tab.h file, we see:
# define CAPSTRING 257
# define CAPLETTER 258
# define NUMBER 259
# define STATE 260
# define ZIPPLUSFOUR 261
# define COMMA 262
# define HASH 263
# define DOT 264
# define NEWLINE 265
We don't need to look at the y.tab.h file; we just need to know that it is there and how to use it. We use it in our lexer, so now we can write our lex file:
%{
/*
 * Example 2 Lexer
 */

#include "y.tab.h"

%}

%%

[\t ]+ 		/* ignore whitespace */;

[A-Z][a-z]+	{return CAPSTRING; }

[A-Z][A-Z]	{return STATE; }

[A-Z] 		{return CAPLETTER; }

[0-9]+ 		{return NUMBER; }

[0-9][0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9] 	{return ZIPPLUSFOUR; }

\, 		{return COMMA; }

#		{return HASH; }

\. 		{return DOT; }

\n		{return NEWLINE; }

%%

#ifdef STANDALONE
int main( void ) {
   int token;

   token = yylex();
   while ( token > 0 ) {
      switch ( token ) {
         case CAPSTRING: printf("CapString\n"); break;
         case STATE: printf("State\n"); break;
         case CAPLETTER: printf("CapLetter\n"); break;
         case NUMBER: printf("Number\n"); break;
         case ZIPPLUSFOUR: printf("ZipPlusFour\n"); break;
         case COMMA: printf("Comma\n"); break;
         case HASH: printf("Hash\n"); break;
         case DOT: printf("Dot\n"); break;
         case NEWLINE: printf("Newline\n"); break;
         default: printf("Unknown\n");
      }
      token = yylex();
   }
}
#endif

Note that the Definition section has a '#include "y.tab.h"'. This is how we can use the list of tokens declared in the yacc file in our lex file.

Also of interest is the \. -- since dot means "match any character except end-of-line", we need to escape it if we want to match a period.

Also, the lexer has "#ifdef STANDALONE" protecting the main function. If you want to run the lexer by itself, pass "-DSTANDALONE" to the compiler when you compile the lex.yy.c file. Otherwise, the main defined in the yacc file is used.

If you don't see the need for this protection, remove it, and then try compiling the whole system.
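
One detail this lexer quietly relies on: the input 98765-4321 matches both [0-9]+ (its first five characters) and the ZIP+4 pattern (all ten characters). Lex takes the longest match, so the token is ZIPPLUSFOUR even though the NUMBER rule is listed first. The sketch below imitates that choice in plain C. It is not what lex generates, and the enum and function names are hypothetical stand-ins rather than the definitions from y.tab.h.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Stand-in token kinds for this sketch only. */
enum token_kind { TOK_NONE, TOK_NUMBER, TOK_ZIPPLUSFOUR };

/* Length matched by [0-9]+ at the start of s (0 if there is no match). */
size_t match_number(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

/* Length matched by five digits, a '-', and four digits:
 * 10 on success, 0 otherwise. */
size_t match_zip_plus_four(const char *s)
{
    size_t i;

    assert(s != NULL);
    for (i = 0; i < 10; i++) {
        if (i == 5) {
            if (s[i] != '-')
                return 0;
        } else if (!isdigit((unsigned char)s[i])) {
            /* Also stops safely at the terminating '\0'. */
            return 0;
        }
    }
    return 10;
}

/* Pick the token the way lex would: the longest match wins. */
enum token_kind classify_digits(const char *s)
{
    size_t number_len = match_number(s);
    size_t zip_len = match_zip_plus_four(s);

    if (number_len == 0 && zip_len == 0)
        return TOK_NONE;
    return zip_len > number_len ? TOK_ZIPPLUSFOUR : TOK_NUMBER;
}
```

A street number such as 1234 never matches the ZIP+4 pattern, so it falls through to NUMBER regardless of rule order.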


Example 3

Yacc cannot handle more than one token of lookahead. The following Yacc file defines a grammar that requires two tokens of lookahead to distinguish between its phrases. Can you see why?
%{
/*
 * Example 3 - Yacc
 *
 * A broken grammar.
 */

#include <stdio.h>
%}

%token HORSE GOAT OX CART PLOW AND

%%

phrase: cart_animal AND CART
      | work_animal AND PLOW      { printf("Got phrase.\n"); }
      ;

cart_animal: HORSE
           | GOAT
           ;

work_animal: HORSE
           | OX
           ;

%%

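extern FILE *yyin;

int main( void ) {
   /* Some lex implementations (flex, for one) leave yyin unset until
      yylex() first runs, so default it to stdin before testing it. */
   if ( yyin == NULL ) {
      yyin = stdin;
   }
   while ( !feof( yyin ) ) {
      yyparse();
   }
   return 0;
}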
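
One way to see the problem: after reading HORSE, the parser must decide whether it is a cart_animal or a work_animal, but the next token is AND in both phrases. Only the token after AND settles it. The C sketch below makes that dependency explicit. The enums and the function name are hypothetical stand-ins for illustration, not the definitions yacc would put in y.tab.h.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the tokens declared by the %token line. */
enum token { HORSE, GOAT, OX, CART, PLOW, AND };

enum animal_kind { KIND_UNKNOWN, KIND_CART_ANIMAL, KIND_WORK_ANIMAL };

/* Decide how the first token of a three-token phrase has to be reduced. */
enum animal_kind classify_first_animal(const enum token *tokens, size_t count)
{
    assert(tokens != NULL);

    if (count < 3 || tokens[1] != AND)
        return KIND_UNKNOWN;

    switch (tokens[0]) {
    case GOAT:
        return tokens[2] == CART ? KIND_CART_ANIMAL : KIND_UNKNOWN;
    case OX:
        return tokens[2] == PLOW ? KIND_WORK_ANIMAL : KIND_UNKNOWN;
    case HORSE:
        /* tokens[1] is AND in both phrases, so only tokens[2], the
         * second token of lookahead, tells the two cases apart. */
        if (tokens[2] == CART)
            return KIND_CART_ANIMAL;
        if (tokens[2] == PLOW)
            return KIND_WORK_ANIMAL;
        return KIND_UNKNOWN;
    default:
        return KIND_UNKNOWN;
    }
}
```

A yacc-generated parser has to choose a reduction for HORSE while it can see only AND, which is why it cannot make the decision this function makes by reading tokens[2].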


Example 4

Here is an example of an ambiguous grammar. Note that Yacc doesn't like it: it reports a shift/reduce conflict when it processes the file.
%{

/*
 * Example 4 - another broken yacc grammar
 */

#include <stdio.h>

%}

%token A B C

%%

s: a
 | b
 ;

a: A
 | a A
 | C a
 ;

b: B
 ;

%%

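extern FILE *yyin;

int main( void ) {
   /* Some lex implementations (flex, for one) leave yyin unset until
      yylex() first runs, so default it to stdin before testing it. */
   if ( yyin == NULL ) {
      yyin = stdin;
   }
   while ( !feof( yyin ) ) {
      yyparse();
   }
   return 0;
}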

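Here the nonterminals are N = { s, a, b } and the terminals are T = { A, B, C }. The ambiguity arises in the productions for a: the input C A A, for example, can be grouped either as C (A A) or as (C A) A.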

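We can fix the grammar by making it unambiguous. Replacing the left-recursive production a A with the right-recursive A a means every production for a now consumes its token from the left, so each input has only one parse. (The fixed grammar is slightly more permissive than the original: it also accepts inputs such as A C A.)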


%{

/*
 * Example 4 - another broken yacc grammar, fixed.
 */

#include <stdio.h>

%}

%token A B C

%%

s: a
 | b
 ;

a: A
 | A a
 | C a
 ;

b: B
 ;

%%

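extern FILE *yyin;

int main( void ) {
   /* Some lex implementations (flex, for one) leave yyin unset until
      yylex() first runs, so default it to stdin before testing it. */
   if ( yyin == NULL ) {
      yyin = stdin;
   }
   while ( !feof( yyin ) ) {
      yyparse();
   }
   return 0;
}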
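
To make the difference between the two versions concrete, the C sketch below counts how many distinct parse trees derive a token string from the nonterminal a under each grammar. It is a brute-force illustration, not anything yacc does internally. The function names are made up, and tokens are written as a plain string of the letters 'A' and 'C'.

```c
#include <assert.h>
#include <stddef.h>

/* Parse trees for tokens[first..last) under the original productions
 *    a: A | a A | C a                                               */
unsigned long count_original(const char *tokens, size_t first, size_t last)
{
    unsigned long trees = 0;

    assert(tokens != NULL && first <= last);
    if (last - first == 0)
        return 0;
    if (last - first == 1)
        return tokens[first] == 'A';      /* a: A */

    if (tokens[last - 1] == 'A')          /* a: a A */
        trees += count_original(tokens, first, last - 1);
    if (tokens[first] == 'C')             /* a: C a */
        trees += count_original(tokens, first + 1, last);
    return trees;
}

/* Parse trees for tokens[first..last) under the fixed productions
 *    a: A | A a | C a                                               */
unsigned long count_fixed(const char *tokens, size_t first, size_t last)
{
    assert(tokens != NULL && first <= last);
    if (last - first == 0)
        return 0;
    if (last - first == 1)
        return tokens[first] == 'A';      /* a: A */

    /* Both recursive productions peel one token off the left, and at
     * most one of them can apply to any given first token. */
    if (tokens[first] == 'A' || tokens[first] == 'C')
        return count_fixed(tokens, first + 1, last);
    return 0;
}
```

For the input C A A, count_original("CAA", 0, 3) returns 2 while count_fixed("CAA", 0, 3) returns 1, matching the claim that only the original grammar is ambiguous.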

And once we have our grammar fixed, of course, we can write our lexer.


%{

/*
 * Example 4 (fixed) Lexer.
 */

#include "y.tab.h"

%}

%%

[\t ]+		/* ignore */ ;

a |
A		{ return A; }

b |
B		{ return B; }

c |
C		{ return C; }

%%



$Id: LexAndYacc.html,v 1.4 2003/04/26 06:51:02 stremler Exp stremler $