If you are a Java developer interested in Domain Specific Languages (DSLs) and code generation, sooner or later you are going to play a bit with ANTLR. And if you are that kind of person, you probably know Martin Fowler's bliki. Now something personal: in general I dislike working with graphical tools when I can do the same thing by coding and/or from the command line (who knows, maybe in one of my next posts I will explain why). I also dislike storing in a database things that sit much more comfortably in the file system. All these reasons drove me to implement my own bliki.
I have also given WordPress a chance (indeed, a Spanish blog that I translate into Italian is maintained with WordPress), but let's talk about the static part of this site.
Because I wanted to experiment with ANTLR, I decided to write a small language to define a blog post. Once you have such a language and can parse your posts, the parsed data is available for further processing. A post has this shape:

    title:
    url:
    date:
    tags: antlr,java, ..
    content: the HTML part of the post
If you are new to ANTLR, your first grammar will probably be:

    post: title url date tags content;
    title: 'title:' LINE;
    url: 'url:' LINE;
    date: 'date:' DIGIT DIGIT DIGIT DIGIT '-' DIGIT DIGIT '-' DIGIT DIGIT NL;
    tags: 'tags:' WORDS? (',' WORDS)? '\n';
    content: 'content:' .*;
    DIGIT: [0-9];
    WORDS: ([a-zA-Z0-9] | ' ')+;
    LINE: ~[\r\n]* NL;
    NL: '\r'? '\n';

If you try this grammar with:

    $ antlr4 BadBlog.g4
    $ javac *.java
    $ grun BadBlog post -tokens test1.post

you will get errors like this one:

    line 1:0 missing 'title:' at 'title: something\n'

Why does ANTLR say that title: is missing when it is actually in the file?
The reason is stated on page 15 of The Definitive ANTLR 4 Reference: "Note that lexers try to match the longest string possible".
Our lexer consumes title: while matching the LINE rule, and this is visible in the output of the preceding command:

    [@0,0:18='title: something\n',<11>,1:0]

Token type 11 is LINE.
A first solution is to implement everything at the lexer level (I introduce "..." to end the content rule):

    post: TITLE URL DATE TAGS CONTENT;
    TITLE: 'title:' .*? NL;
    URL: 'url:' .*? NL;
    DATE: 'date:' .*? NL;
    TAGS: 'tags:' .*? NL;
    CONTENT: 'content:' .*? NL '...' NL;
    NL : '\r'? '\n';

If you test this, you will see that the grammar successfully parses the file, at the price of also getting the starting and ending strings when you access the parse tree, e.g. TITLE().getText() will also contain title:.
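If you go with the lexer-only grammar, you have to strip the keyword (and, for the content, the "..." line) yourself after parsing. Here is a minimal sketch of how that could look in plain Java. The FieldText class and its methods are hypothetical helpers of mine, not something ANTLR generates, and I assume the token text has exactly the shape matched by the lexer rules above.

```java
// Hypothetical helper: not part of the ANTLR runtime or of the generated parser.
final class FieldText {

    /** Drops the leading keyword (e.g. "title:") and the surrounding whitespace. */
    static String value(String tokenText, String keyword) {
        if (!tokenText.startsWith(keyword)) {
            throw new IllegalArgumentException("expected text starting with " + keyword);
        }
        return tokenText.substring(keyword.length()).trim();
    }

    /** A CONTENT token also carries the closing "..." line, so drop that as well. */
    static String content(String tokenText) {
        String body = value(tokenText, "content:");
        // After trim() the text ends with the "..." terminator required by the CONTENT rule.
        if (body.endsWith("...")) {
            body = body.substring(0, body.length() - 3).trim();
        }
        return body;
    }

    public static void main(String[] args) {
        System.out.println(value("title: something\n", "title:"));
        System.out.println(content("content: <p>Hello</p>\n...\n"));
    }
}
```

In a real visitor or listener you would pass in something like ctx.TITLE().getText() instead of the string literals used in main.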
Remember that with our grammar we basically want to parse the post format shown at the beginning. Going back to the first grammar: the lexer does respect rule precedence, but the problem there is that the LINE rule has no start condition, and once it starts it will always match more characters than, for instance, WORDS.
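To make the "longest match first, precedence only on ties" behaviour concrete, here is a small simulation in plain Java. It is only a sketch: it uses java.util.regex instead of ANTLR, the LongestMatchDemo class is a hypothetical name, and I translated three rules of the first grammar into regular expressions by hand. The literal 'title:' is listed first because ANTLR places the literals found in parser rules before the explicit lexer rules.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical simulation of the lexer's choice; this is not ANTLR code.
final class LongestMatchDemo {

    // Rules in definition order (a LinkedHashMap keeps insertion order).
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("'title:'", Pattern.compile("title:"));
        RULES.put("WORDS", Pattern.compile("[a-zA-Z0-9 ]+"));
        RULES.put("LINE", Pattern.compile("[^\\r\\n]*\\r?\\n"));
    }

    /** Returns the name of the rule that wins at the start of the input, or null. */
    static String winner(String input) {
        String best = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // lookingAt() anchors the match at position 0, as a lexer would.
            // The strict '>' means an earlier rule wins when lengths are equal.
            if (m.lookingAt() && m.end() > bestLength) {
                best = rule.getKey();
                bestLength = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(winner("title: something\n")); // LINE swallows the whole line
        System.out.println(winner("title:"));             // no newline, so the literal wins
    }
}
```

As long as the line ends with a newline, LINE matches more characters than the 'title:' literal, so the parser never sees a 'title:' token at all.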
The cleaner solution is lexer modes, but to use them you have to split your grammar into a lexer grammar and a parser grammar,
see BlogLexer.g4 and BlogParser.g4 . You need a sequence that starts a mode and a sequence that switches back to the
default mode. Inside a mode you have different lexer rules: for instance, after title: we match characters until a
new line, while after content: the new line character alone is nothing special and we match a longer sequence,
as you can see by reading the grammar.
The only thing worth remarking is how we match a long sequence of characters: the lexer matches them with the CH rule, and the parser joins the resulting tokens together into a chars object.
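If lexer modes are new to you, the idea can be sketched with a tiny hand-written tokenizer in plain Java. This is not the content of BlogLexer.g4: the ModeLexerSketch class, the mode names and the token labels are all hypothetical, and I assume that the content still ends with a "..." line as in the lexer-only grammar. I also collect each value in one step instead of emitting CH tokens for the parser to join.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of mode switching; this is not ANTLR code.
final class ModeLexerSketch {

    enum Mode { KEYS, LINE_VALUE, CONTENT }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Mode mode = Mode.KEYS; // plays the role of the default mode
        int pos = 0;
        while (pos < input.length()) {
            switch (mode) {
                case KEYS: {
                    // Only keywords such as "title:" are recognised here.
                    int colon = input.indexOf(':', pos);
                    if (colon < 0) {
                        throw new IllegalStateException("keyword expected at " + pos);
                    }
                    String keyword = input.substring(pos, colon + 1);
                    tokens.add("KEY(" + keyword + ")");
                    pos = colon + 1;
                    // The keyword is the sequence that starts a mode.
                    mode = keyword.equals("content:") ? Mode.CONTENT : Mode.LINE_VALUE;
                    break;
                }
                case LINE_VALUE: {
                    // Here a new line ends the value and switches back.
                    int nl = input.indexOf('\n', pos);
                    if (nl < 0) nl = input.length();
                    tokens.add("VALUE(" + input.substring(pos, nl).trim() + ")");
                    pos = Math.min(nl + 1, input.length());
                    mode = Mode.KEYS;
                    break;
                }
                case CONTENT: {
                    // Here a new line alone is nothing special: only "\n...\n" ends the mode.
                    int end = input.indexOf("\n...\n", pos);
                    if (end < 0) end = input.length();
                    tokens.add("CONTENT(" + input.substring(pos, end).trim() + ")");
                    pos = Math.min(end + 5, input.length());
                    mode = Mode.KEYS;
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String post = "title: Hello\ntags: antlr,java\ncontent: <p>one\ntwo</p>\n...\n";
        tokenize(post).forEach(System.out::println);
    }
}
```

The same character, the new line, means something different depending on the current mode, and that is exactly what a single set of lexer rules cannot express.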
I created an Eclipse project for this blog, so you can play with my grammar: run compile-lexer.launch to compile the lexer, then compile-parser.launch to compile the parser, then update-web-gen.launch. There is also grun.launch, which gives you in Eclipse the same output as the grun command, but while developing a new grammar, at least when it's a small one, it's easier to work from the command line.