%%
.|\n ECHO;
%%
This program doesn't do much. There is no definition section, and there are no user subroutines either, so all that's left is the rules section. The rules section contains just one rule.
Recall that the form of a rule is pattern action ;. The pattern
in example 0 is .|\n. The "." matches any character other
than end-of-line, the "|" indicates "or", and "\n" matches end-of-line.
So this pattern will match everything. Note that the pattern begins in
column zero -- adding whitespace to the left of the pattern is a
BAD THING, because lex will no longer read the line as a rule.
The action is ECHO. Note that it is separated from
the pattern by whitespace. Recall that ECHO is a predefined action
that simply copies the matched string to standard output.
The line is terminated with a semicolon. It is very easy to forget this, and forgetting it leads to rather odd errors. It is worth deliberately mangling working lex files to see the typical error messages: when you later make a mistake unintentionally, you'll stand a better chance of recognizing the error and guessing what caused it.
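For comparison, here is a minimal hand-written C sketch of the visible effect of this one-rule lexer: every character, newline or not, is copied from input to output. The name echo_all is hypothetical and not part of lex, and the scanner that lex generates is built quite differently inside.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper, not part of lex: copy every character from
 * `in` to `out`, which is the visible effect of the rule  .|\n ECHO;
 * Returns the number of characters copied.
 */
long echo_all(FILE *in, FILE *out)
{
    long copied = 0;
    int c;

    assert(in != NULL && out != NULL);

    /* "." covers every character except newline; "\n" covers the rest. */
    while ((c = getc(in)) != EOF) {
        putc(c, out);
        copied++;
    }
    return copied;
}
```

Calling echo_all(stdin, stdout) from a main function should behave the way the compiled example 0 does when it reads standard input.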
To compile this simple example, you must first run lex on the lex file, and then compile the generated C program. Recall that the "lex" program takes a lex file and produces a file named "lex.yy.c", and that when you compile the "lex.yy.c" file, you must tell the linker to use the lex library (by providing the "-ll" argument to the C compiler).
$ lex ex0.lex
$ ls
ex0.lex lex.yy.c
$ cc -o ex0 lex.yy.c -ll
$ ls
ex0 ex0.lex lex.yy.c
$
Example one is a slightly more complicated example of a lexer. Note that it has a non-empty definition section and defines main in the user-subroutines section.
Of course, the definition section's content is trivial -- it's just a comment block. However, it does demonstrate the %{ and %} delimiters for including literal C code in the definition section.
The rules section begins with a mechanism for ignoring whitespace: it matches all runs of tabs and spaces and applies the empty action to them. The comment is there for clarity, always a good thing.
In the list of literals (forms of the verb "to be"), every entry except the last ("being") has the pipe character "|" as its action. This means "for this pattern, use the action that the following pattern uses". The space between the literal and the pipe is required; without it, lex would read the pipe as part of the pattern (the "or" operator) rather than as the action.
The [a-zA-Z]+ will match all other words. And then we also echo the remaining characters.
%{
/*
* Lex Example #1
*
* Identify uses of the verb "to be".
*/
%}
%%
[\t ]+ /* ignore whitespace */ ;
is |
am |
are |
were |
was |
be |
being { printf("'%s' is a form of the verb 'to be'.\n", yytext); }
[a-zA-Z]+ { printf("'%s' is not a form of the verb 'to be'.\n", yytext); }
.|\n ECHO;
%%
int main( void ) {
yylex();
}
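The shared action in example 1 can be pictured in plain C as one list of literals that all lead to the same code. This is only a sketch under that reading: is_form_of_be is a hypothetical name, and it looks at a whole word that has already been split out, whereas the lex-generated scanner finds the words itself.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper mirroring the rule list: every literal shares one
 * action, just as "|" makes a pattern reuse the next pattern's action.
 * Returns 1 if `word` is one of the listed forms of "to be", else 0.
 */
int is_form_of_be(const char *word)
{
    static const char *const forms[] = {
        "is", "am", "are", "were", "was", "be", "being"
    };
    size_t i;

    assert(word != NULL);
    for (i = 0; i < sizeof forms / sizeof forms[0]; i++) {
        if (strcmp(word, forms[i]) == 0)
            return 1;   /* the shared printf action would run here */
    }
    return 0;           /* the [a-zA-Z]+ rule would handle this word */
}
```

As in the lex rules, the comparison is case-sensitive, so "Is" is not reported as a form of "to be", and a longer word such as "island" is not mistaken for "is".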
This example takes a simplified version of the grammar we worked on in class for recognizing a street address. It is very restrictive about what it will accept, but it should work for:
John Doe
1234 First Street
San Diego CA, 98765-4321
The Yacc file format is basically the same as the Lex file format: you have a Definition section, a Rules section, and a User-Subroutine section. In the definition section we not only have a comment as literal C code, but we are also including the standard I/O library. This is because we are using a FILE pointer in the user-subroutines section. Also, note the %token line -- this defines the tokens we expect our lexer to identify.
%{
/*
* Example 2 - Yacc
*
* Recognize a street address.
*/
#include <stdio.h>
%}
%token CAPSTRING CAPLETTER NUMBER STATE ZIPPLUSFOUR COMMA HASH DOT NEWLINE
%%
sentence: firstline secondline thirdline { printf("Have a valid address.\n"); }
;
firstline: firstname surname NEWLINE
| firstname middlename surname NEWLINE
;
secondline: NUMBER street NEWLINE
| NUMBER street HASH NUMBER NEWLINE
;
thirdline: city STATE COMMA zip NEWLINE
;
firstname: CAPSTRING ;
middlename: CAPLETTER DOT ;
surname: CAPSTRING ;
street: CAPSTRING
| CAPSTRING street
;
city: CAPSTRING
| CAPSTRING city
;
zip: ZIPPLUSFOUR
;
%%
extern FILE *yyin;
int main( void ) {
while( !feof( yyin ) ) {
yyparse();
}
}
$ yacc -d ex2.yacc
$ ls
ex2.yacc y.tab.c y.tab.h
$
The "-d" is optional, but if we do not include it, yacc will not generate the y.tab.h file. If we take a look at the y.tab.h file, we see:
# define CAPSTRING 257
# define CAPLETTER 258
# define NUMBER 259
# define STATE 260
# define ZIPPLUSFOUR 261
# define COMMA 262
# define HASH 263
# define DOT 264
# define NEWLINE 265
We don't need to look at the y.tab.h file; we just need to know it is there and know how to use it. And we use it in our lexer. So now we can write our lex file:
%{
/*
* Example 2 Lexer
*/
#include "y.tab.h"
%}
%%
[\t ]+ /* ignore whitespace */;
[A-Z][a-z]+ { return CAPSTRING; }
[A-Z][A-Z] {return STATE; }
[A-Z] {return CAPLETTER; }
[0-9]+ {return NUMBER; }
[0-9][0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9] {return ZIPPLUSFOUR; }
\, {return COMMA; }
# {return HASH; }
\. {return DOT; }
\n {return NEWLINE; }
%%
#ifdef STANDALONE
int main( void ) {
int token;
token = yylex();
while ( token > 0 ) {
switch ( token ) {
case CAPSTRING: printf("CapString\n"); break;
case STATE: printf("State\n"); break;
case CAPLETTER: printf("CapLetter\n"); break;
case NUMBER: printf("Number\n"); break;
case ZIPPLUSFOUR: printf("ZipPlusFour\n"); break;
case COMMA: printf("Comma\n"); break;
case HASH: printf("Hash\n"); break;
case DOT: printf("Dot\n"); break;
case NEWLINE: printf("Newline\n"); break;
default: printf("Unknown\n");
}
token = yylex();
}
}
#endif
Note that the Definition section has a '#include "y.tab.h"'. This is how we can use the list of tokens declared in the yacc file in our lex file.
Also of interest is the \. -- since dot means "match any character except end-of-line", we need to escape it if we want to match a period.
Also, the lexer has "#ifdef STANDALONE" protecting the main function. If you want to run the lexer by itself, you can pass "-DSTANDALONE" to the compiler when you are compiling the lex.yy.c file, and you can then run just the lexer. Otherwise, you use the main defined in the yacc file.
If you don't see the need for this protection, remove it, and then try compiling the whole system.
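One detail of the example 2 lexer is worth a sketch: the NUMBER rule appears before the ZIPPLUSFOUR rule, yet a zip code is still recognized, because lex prefers the rule that matches the most input. The hand-written C below imitates just those two patterns. The names match_number and match_zip_plus_four are hypothetical, and this is not how the generated scanner is built.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: length of the leading run of digits, like [0-9]+
 * (0 if the text does not start with a digit). */
size_t match_number(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

/* Hypothetical helper: 10 if the text starts with five digits, a dash,
 * and four digits (the ZIPPLUSFOUR pattern), otherwise 0. */
size_t match_zip_plus_four(const char *s)
{
    size_t i;

    assert(s != NULL);
    for (i = 0; i < 10; i++) {
        if (i == 5) {
            if (s[i] != '-')
                return 0;
        } else if (!isdigit((unsigned char)s[i])) {
            return 0;   /* also stops safely at the end of a short string */
        }
    }
    return 10;
}
```

For the text "98765-4321", match_number reports 5 characters and match_zip_plus_four reports 10, so the longer ZIPPLUSFOUR match is the one that wins. For "1234 First", only the NUMBER pattern matches.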
%{
/*
* Example 3 - Yacc
*
* A broken grammar.
*/
#include <stdio.h>
%}
%token HORSE GOAT OX CART PLOW AND
%%
phrase: cart_animal AND CART
| work_animal AND PLOW { printf("Got phrase.\n"); }
;
cart_animal: HORSE
| GOAT
;
work_animal: HORSE
| OX
;
%%
extern FILE *yyin;
int main( void ) {
while ( !feof( yyin ) ) {
yyparse();
}
}
%{
/*
* Example 4 - another broken yacc grammar
*/
#include <stdio.h>
%}
%token A B C
%%
s: a
| b
;
a: A
| a A
| C a
;
b: B
;
%%
extern FILE *yyin;
int main( void ) {
while ( !feof( yyin ) ) {
yyparse();
}
}
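Basically, the nonterminals are N = { s, a, b } and the terminals are T = { A, B, C }. The ambiguity arises in the productions for a.
We can fix the grammar by making it unambiguous: the left-recursive production a: a A is replaced by the right-recursive a: A a.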
%{
/*
* Example 4 - another broken yacc grammar, fixed.
*/
#include <stdio.h>
%}
%token A B C
%%
s: a
| b
;
a: A
| A a
| C a
;
b: B
;
%%
extern FILE *yyin;
int main( void ) {
while ( !feof( yyin ) ) {
yyparse();
}
}
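To see the difference between the two versions of the productions for a, it helps to count parse trees. With the original productions, the input C A A can be grouped either as C followed by (A A) or as (C A) followed by A. The C sketch below counts the trees by brute force, writing each token as the character 'A' or 'C'. The function names are hypothetical, and the approach is only practical for short inputs.

```c
#include <assert.h>
#include <string.h>

/* Parse trees for  a: A | a A | C a  (original) over tokens s[i..j). */
static int count_broken(const char *s, size_t i, size_t j)
{
    size_t len = j - i;
    int n = 0;

    if (len == 0) return 0;
    if (len == 1) return s[i] == 'A';                   /* a: A   */
    if (s[j - 1] == 'A') n += count_broken(s, i, j - 1); /* a: a A */
    if (s[i] == 'C')     n += count_broken(s, i + 1, j); /* a: C a */
    return n;
}

/* Parse trees for  a: A | A a | C a  (fixed) over tokens s[i..j). */
static int count_fixed(const char *s, size_t i, size_t j)
{
    size_t len = j - i;

    if (len == 0) return 0;
    if (len == 1) return s[i] == 'A';                   /* a: A          */
    if (s[i] == 'A' || s[i] == 'C')
        return count_fixed(s, i + 1, j);                /* a: A a | C a  */
    return 0;
}

/* Hypothetical entry points: number of parse trees for a whole string. */
int trees_broken(const char *s)
{
    assert(s != NULL);
    return count_broken(s, 0, strlen(s));
}

int trees_fixed(const char *s)
{
    assert(s != NULL);
    return count_fixed(s, 0, strlen(s));
}
```

For "CAA" the original productions give two trees and the fixed ones give exactly one. The same functions also show that the two versions do not accept quite the same inputs: "ACA" has no tree under the original productions but one under the fixed ones.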
And once we have our grammar fixed, of course, we can write our lexer.
%{
/*
* Example 4 (fixed) Lexer.
*/
#include "y.tab.h"
%}
%%
[\t ]+ /* ignore */ ;
a |
A { return A; }
b |
B { return B; }
c |
C { return C; }
%%