htmlparser - JavaCC HTML Parser

Derivation

The original of this work went off-line in 2007. Do get in touch if, like me, you value this work. Tim Pizey.

JavaCC HTML Parser

This is a JavaCC grammar for parsing HTML documents. It does not enforce the DTD, but instead builds a simple parse tree which can be used to validate, reformat, display, analyze, or edit the HTML document. The goal was to produce a parse tree which threw away very little of the information contained in the source file, so that dumping the parse tree would produce an almost identical copy of the input document. The only source information discarded by the parser is whitespace inside tags (i.e., the spaces or newlines between the attributes of a tag). The parser is not confused by things that look like tags inside quoted strings.

The generated parse tree supports the commonly used "Visitor" design pattern. Several visitor classes are provided, which do things like dump the parse tree, restructure the parse tree, etc. Common tasks such as formatting, validation, or analysis are easily performed as Visitors.

Downloading

The code is contained in several files, which are part of the packages com.quiotix.html.example and com.quiotix.html.parser.

You can download the source tree from the Source Repository.

You can download all the files as a source archive or the class files in a pre-built jar file from the project repository.

Compiling and testing

The parser was written with JavaCC, the Java Compiler Compiler. You do not need JavaCC unless you plan to modify the core parser. The parser package is built with Ant. To compile the package with Ant, simply execute:

ant

If you have installed JavaCC and want to recompile the parser from the JavaCC source, also set the $JAVACC_HOME environment variable.

To compile with Maven:

mvn package

The main() method of HtmlParser will read an HTML file from System.in, parse it, and visit it with a Visitor which reconstructs the original file and dumps it to System.out. While this is not terribly useful for anything other than testing, it does test all the elements of the parser. To test it, put the parser files on your classpath and execute

	java com.quiotix.html.parser.HtmlParser < some-html-file.html

Using the parser

The parser transforms an input stream into a parse tree; the elements of the parse tree are defined in HtmlDocument. You can then traverse the tree using the Visitor pattern; the base visitor is defined in HtmlVisitor, and there are several visitors which are part of the HtmlParser package. HtmlDumper is a simple visitor which traverses the parse tree, reconstructs the original document, and writes it to System.out; it is a useful starting point for building your own custom visitors. HtmlCollector and HtmlScrubber are more sophisticated visitors, which transform the parse tree. HtmlCollector imparts a tree structure to the otherwise flat parse tree, matching begin and end tags with each other. HtmlScrubber cleans up the document, converting tags and attributes to upper or lower case, removing unnecessary quotes and white space, etc.
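As a rough illustration of the Visitor pattern described above, the following self-contained sketch models a flat document with three element types and a dumper that rebuilds the source text. The names VisitorSketch, Element, Visitor, Tag, EndTag, Text and Dumper are hypothetical stand-ins written for this example; they are not the classes in com.quiotix.html.parser, whose actual signatures should be checked in HtmlDocument and HtmlVisitor.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration only; NOT the library's own classes.
final class VisitorSketch {

    /** One callback per element type. */
    interface Visitor {
        void visitTag(Tag t);
        void visitEndTag(EndTag t);
        void visitText(Text t);
    }

    /** Each element dispatches itself to the matching visitor method. */
    interface Element {
        void accept(Visitor v);
    }

    static final class Tag implements Element {
        final String name;
        Tag(String name) { this.name = name; }
        public void accept(Visitor v) { v.visitTag(this); }
    }

    static final class EndTag implements Element {
        final String name;
        EndTag(String name) { this.name = name; }
        public void accept(Visitor v) { v.visitEndTag(this); }
    }

    static final class Text implements Element {
        final String text;
        Text(String text) { this.text = text; }
        public void accept(Visitor v) { v.visitText(this); }
    }

    /** A dumper in the spirit of HtmlDumper: rebuilds the source text. */
    static final class Dumper implements Visitor {
        final StringBuilder out = new StringBuilder();
        public void visitTag(Tag t) { out.append('<').append(t.name).append('>'); }
        public void visitEndTag(EndTag t) { out.append("</").append(t.name).append('>'); }
        public void visitText(Text t) { out.append(t.text); }
    }

    /** Visits every element of a flat document in order and returns the dump. */
    static String dump(List<Element> doc) {
        Dumper dumper = new Dumper();
        for (Element e : doc) {
            e.accept(dumper);
        }
        return dumper.out.toString();
    }

    public static void main(String[] args) {
        List<Element> doc = Arrays.<Element>asList(
                new Tag("p"), new Text("Hello"), new EndTag("p"));
        System.out.println(dump(doc)); // prints <p>Hello</p>
    }
}
```

A custom visitor in this style only needs to implement the callbacks it cares about and ignore the rest, which is why a dumper makes a convenient starting point.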

To parse a document, invoke the HtmlDocument() method on the parser. This produces an HtmlDocument, which is a sequence of HtmlElement objects. Defined subclasses of HtmlElement include Tag, EndTag, Comment, Text, Newline, and TagBlock (a TagBlock is a composite object consisting of a tag, a matching end tag, and a sequence of the intervening elements).

So that the parser does not reject documents which violate the DTD but which most browsers would still accept (e.g., documents with missing end tags, overlapped tags, or tags in a context where the DTD doesn't allow them), it does not attempt to match start tags with end tags; the result of the parsing process therefore contains no TagBlock elements. Matching is more easily done after parsing by a bottom-up mechanism anyway, and the HtmlCollector class (a subclass of HtmlVisitor) does this: it walks the document and attempts to match up tags and impart more structure to the document.
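The idea of matching tags after parsing can be sketched as follows. This is a minimal illustration, not HtmlCollector's actual algorithm: it assumes a flat list of string tokens of the form "<name>", "</name>" or plain text, and the names CollectorSketch, Block and collect are hypothetical. When an end tag arrives, the sketch searches backwards for the nearest still-unmatched start tag of the same name and folds everything from that point on into one block; end tags with no partner are simply kept as they are.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of bottom-up tag matching; NOT the HtmlCollector source.
final class CollectorSketch {

    /** Composite element: a matched start/end tag pair and what lay between. */
    static final class Block {
        final String name;
        final List<Object> body;
        Block(String name, List<Object> body) {
            this.name = name;
            this.body = body;
        }
        @Override
        public String toString() { return name + body; }
    }

    /** Folds matched tag pairs of a flat token list into nested Blocks. */
    static List<Object> collect(List<String> flat) {
        List<Object> out = new ArrayList<>();
        for (String tok : flat) {
            if (tok.startsWith("</") && tok.endsWith(">")) {
                String name = tok.substring(2, tok.length() - 1);
                // Start tags already folded into a Block are no longer plain
                // strings, so this finds only unmatched start tags.
                int open = out.lastIndexOf("<" + name + ">");
                if (open >= 0) {
                    List<Object> tail = out.subList(open, out.size());
                    List<Object> body = new ArrayList<>(tail.subList(1, tail.size()));
                    tail.clear(); // removes the start tag and its body from out
                    out.add(new Block(name, body));
                    continue;
                }
            }
            // Start tags, text, and end tags without a partner stay in place.
            out.add(tok);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collect(Arrays.asList("<p>", "<b>", "x", "</b>", "</p>")));
        // prints [p[b[x]]]
    }
}
```

Because nothing is rejected, overlapped input such as <b><i>x</b></i> still yields a result: the b pair becomes a block containing the unmatched <i>, and the stray </i> is left as an ordinary element.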

Examples

HtmlParse simply invokes the parser and dumps the resulting parse tree with HtmlDumper.

HtmlFormat is an example program which uses the parser, the scrubber, and the collector to parse and pretty-print an HTML page and dump it to System.out.

DumpLinks is a simple example which parses a document from System.in and uses a visitor to write a list of the links in the document to System.out.

Copyright

Copyright (C) 1999-2002 Quiotix Corporation. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License, version 2 as published by the Free Software Foundation.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

If you use it, like it, dislike it, or find it useful, we would appreciate your letting us know by dropping an e-mail to html-parser@quiotix.com. Also, if you find any errors or have suggestions for improving it, or you find HTML files which it will not parse, let us know and we can post corrections and improvements.

Commercial support contracts are also available for this package. If you are interested in a commercial support contract, please contact us.

Version History

Version	Date	Comments
1.3	26-Jul-2006	Move default version of HTML up to 4, i.e. close P and LI tags and add quotes to unquoted attributes. Allow equal signs at end of line as skip space (continuation character) to cope with the Microsoft web archive format (.mht). Regenerate with current JavaCC builds to reduce warnings in Eclipse.
1.2	12-Jul-2006	Convert to Maven. Ant build still supported. Examples moved into main tree.
1.1.1	08-Jul-2005	Do best-effort-only quote matching in HTML declarations, to deal with broken DOCTYPE declaration generated by some versions of JavaDoc.
1.1	04-May-2005	Remove matching of quoted strings in text, which can confuse the parser in some documents; add attribute-fetching methods in AttributeList and Tag. Convert JDK 1.1 collections to Collections framework.
1.03	07-Sep-2004	Support identifiers with namespaces
1.02	29-May-2002	Repackage, add Ant script, add more examples, update documentation
1.01	06-Nov-2000	Treat STYLE blocks just like SCRIPT blocks; don't interpret markup inside of STYLE blocks. Accept "-- >" and "->" as valid end-comment markers (the former is part of the spec; the latter is a not uncommon HTML coding error). Accept XML-style "empty" tags (e.g., <tag />). Archive reorganized somewhat, more examples added.
1.00	03-Nov-1999	Fixed bug in error recovery code which caused us to drop whitespace inside things that looked like tags but turned out not to be (or which were mal-formed). Thanks to Doug Reed for reporting this bug.
0.92a	21-Jul-1999	Fixed bug in HtmlScrubber.java, submitted by Thorsten Weber. No change to HtmlParser.jj.
0.92	01-Jul-1999	Added extensive error handling and recovery; now handles mal-formed tags as ordinary text. Also handles <SCRIPT> blocks properly.
0.91	14-Jun-1999	Fixed bugs in quote processing; now handles unmatched quotes (including apostrophes) in text and in comments
0.90	10-Jun-1999	Initial public release

Known Issues

Quote processing: The parser tries to deal with quoted strings properly (so that things that look like tags or end-of-comment markers inside of quotes are not mistaken for tags), but sometimes quoted strings are improperly terminated, or an apostrophe will be interpreted as the start of a quoted string. (Changed in 1.1; quoted strings in body text are no longer treated as anything but text. This was a problem in very early versions of JavaScript, prior to the SCRIPT tag.)
Things that look like tags: (Resolved in 0.92.) The presence of a < character in the text can confuse the parser into thinking it has found a tag. While this is improper HTML (you should use the &lt; sequence instead), many HTML pages take advantage of the browsers' forgiving nature here.
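The "treat it as ordinary text" fallback can be illustrated with a small stand-alone sketch. It deliberately swaps in a regular expression for the JavaCC grammar and its error recovery, so it shows the idea rather than the parser's implementation; the names LenientTokenizerSketch and tokenize are hypothetical, and the pattern is a crude approximation that ignores quoted attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex-based sketch; the real parser uses a JavaCC grammar.
final class LenientTokenizerSketch {

    // Crude tag shape: "<", optional "/", a letter, then anything up to ">".
    private static final Pattern TAG = Pattern.compile("</?[A-Za-z][^<>]*>");

    /** Splits input into TAG: and TEXT: tokens; a lone "<" stays in the text. */
    static List<String> tokenize(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            if (m.start() > pos) {
                out.add("TEXT:" + html.substring(pos, m.start()));
            }
            out.add("TAG:" + m.group());
            pos = m.end();
        }
        if (pos < html.length()) {
            out.add("TEXT:" + html.substring(pos));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a < b <i>x</i>"));
        // prints [TEXT:a < b , TAG:<i>, TEXT:x, TAG:</i>]
    }
}
```

A sketch this simple still misreads input such as "a<b and c>d" as a tag, which is one reason a grammar with proper error recovery is preferable to pattern matching.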

Project Documentation