The original of this work went off-line in 2007. Do get in touch if, like me, you value this work. Tim Pizey.
This is a JavaCC grammar for parsing HTML documents. It does not enforce the DTD, but instead builds a simple parse tree which can be used to validate, reformat, display, analyze, or edit the HTML document. The goal was to produce a parse tree which throws away very little of the information contained in the source file, so that dumping the parse tree yields an almost identical copy of the input document. The only source information discarded by the parser is whitespace inside tags (i.e., the spaces or newlines between the attributes of a tag). The parser is not confused by things that look like tags inside quoted strings.
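To illustrate what "not confused by things that look like tags inside of quoted strings" means in practice, here is a minimal sketch of quote-aware scanning for the end of a tag. It is a hand-written illustration only, not the JavaCC grammar itself; the class name TagScanSketch and the method findTagEnd are hypothetical.

```java
// Hypothetical illustration; the real parser is generated from a JavaCC grammar.
class TagScanSketch {

    /**
     * Returns the index of the '>' that closes the tag starting at `start`,
     * ignoring any '>' inside a single- or double-quoted attribute value.
     * Returns -1 if the tag is never closed.
     */
    static int findTagEnd(String html, int start) {
        char quote = 0; // 0 means "not inside a quoted value"
        for (int i = start + 1; i < html.length(); i++) {
            char c = html.charAt(i);
            if (quote != 0) {
                if (c == quote) {
                    quote = 0;          // closing quote found
                }
            } else if (c == '"' || c == '\'') {
                quote = c;              // opening quote found
            } else if (c == '>') {
                return i;               // unquoted '>' ends the tag
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String html = "<a title=\"x > y\" href='<b>'>link</a>";
        int end = findTagEnd(html, 0);
        // Prints the complete start tag, quoted '>' and '<b>' included.
        System.out.println(html.substring(0, end + 1));
    }
}
```

The scanner only tracks whether it is inside a quoted value, which is enough to skip over tag-like text in attributes.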
The generated parse tree supports the commonly used "Visitor" design pattern. Several visitor classes are provided, which do things like dump the parse tree, restructure the parse tree, etc. Common tasks such as formatting, validation, or analysis are easily performed as Visitors.
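For readers unfamiliar with the pattern, the following is a minimal, self-contained sketch of how a visitor walks a flat sequence of elements. All class names here (VisitorSketch, Element, Tag, Text, Dumper, TagCounter) are simplified stand-ins invented for the example; they are not the classes in com.quiotix.html.parser, whose signatures may differ.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for parse-tree elements; not the real parser classes.
class VisitorSketch {

    interface Visitor {
        void visitTag(Tag tag);
        void visitText(Text text);
    }

    interface Element {
        void accept(Visitor visitor);
    }

    static final class Tag implements Element {
        final String name;
        Tag(String name) { this.name = name; }
        public void accept(Visitor visitor) { visitor.visitTag(this); }
    }

    static final class Text implements Element {
        final String content;
        Text(String content) { this.content = content; }
        public void accept(Visitor visitor) { visitor.visitText(this); }
    }

    /** Rebuilds the source text from the elements it visits. */
    static final class Dumper implements Visitor {
        final StringBuilder out = new StringBuilder();
        public void visitTag(Tag tag) { out.append('<').append(tag.name).append('>'); }
        public void visitText(Text text) { out.append(text.content); }
    }

    /** An analysis visitor: counts tags and ignores text. */
    static final class TagCounter implements Visitor {
        int count;
        public void visitTag(Tag tag) { count++; }
        public void visitText(Text text) { }
    }

    static String dump(List<Element> doc) {
        Dumper dumper = new Dumper();
        for (Element e : doc) e.accept(dumper);
        return dumper.out.toString();
    }

    static int countTags(List<Element> doc) {
        TagCounter counter = new TagCounter();
        for (Element e : doc) e.accept(counter);
        return counter.count;
    }

    static List<Element> sample() {
        return Arrays.<Element>asList(
            new Tag("p"), new Text("Hello, "), new Tag("b"), new Text("world"));
    }

    public static void main(String[] args) {
        System.out.println(dump(sample()));      // reconstructs the markup
        System.out.println(countTags(sample())); // number of tags visited
    }
}
```

Adding a new task means writing another Visitor implementation; the element classes stay unchanged.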
The code is contained in several files, which are part of the packages com.quiotix.html.example and com.quiotix.html.parser.
You can download the source tree from the Source Repository. You can also download all the files as a source archive, or the class files as a pre-built jar file, from the project repository.
The parser was written with JavaCC, the Java Compiler Compiler. You do not need JavaCC unless you plan to modify the core parser. The parser package is built with Ant. To compile the package with Ant, run:
ant
If you have installed JavaCC and want to recompile the parser from the JavaCC source, also set the $JAVACC_HOME environment variable.
To compile with Maven:
mvn package
The main() method of HtmlParser reads an HTML file from System.in, parses it, and visits it with a Visitor which reconstructs the original file and dumps it to System.out. While this is not terribly useful for anything other than testing, it does exercise all the elements of the parser. To try it, put the parser files on your classpath and execute:
java com.quiotix.html.parser.HtmlParser < some-html-file.html
The parser transforms an input stream into a parse tree; the elements of the parse tree are defined in HtmlDocument. You can then traverse the tree using the Visitor pattern; the base visitor is defined in HtmlVisitor, and several visitors are included in the parser package. HtmlDumper is a simple visitor which traverses the parse tree, reconstructs the original document, and writes it to System.out; it is a useful starting point for building your own custom visitors. HtmlCollector and HtmlScrubber are more sophisticated visitors which transform the parse tree. HtmlCollector imparts a tree structure to the otherwise flat parse tree, matching begin and end tags with each other. HtmlScrubber cleans up the document: it converts tags and attributes to upper or lower case, removes unnecessary quotes and white space, etc.
To parse a document, invoke the HtmlDocument() method on the parser. This produces an HtmlDocument, which is a sequence of HtmlElement objects. Defined subclasses of HtmlElement include Tag, EndTag, Comment, Text, Newline, and TagBlock. (A TagBlock is a composite object consisting of a tag, a matching end tag, and a sequence of the intervening elements.)
The parser does not attempt to match start tags with end tags, so the result of the parsing process will not contain TagBlock elements. This keeps it from rejecting documents which do not meet the DTD requirements but which would still be accepted by most browsers (e.g., documents with missing end tags, overlapped tags, or tags in a context where the DTD doesn't allow them). Matching is more easily done after parsing by a bottom-up mechanism anyway, and the HtmlCollector class (a subclass of HtmlVisitor) does this: it walks the document and attempts to match up tags and impart more structure to the document.
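The idea of matching tags after parsing, while tolerating missing or stray end tags, can be sketched as follows. This is an illustrative stand-in and not HtmlCollector's actual algorithm; the names CollectorSketch, Node, Kind, collect, and lastOpenTag are hypothetical. Each end tag is paired with the nearest preceding unmatched start tag of the same name, and everything in between is folded into a block. Tags that never find a partner are left in place.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of post-parse tag matching; not the real HtmlCollector.
class CollectorSketch {

    enum Kind { TAG, END_TAG, TEXT, BLOCK }

    static final class Node {
        final Kind kind;
        final String value;                            // tag name or text content
        final List<Node> body = new ArrayList<Node>(); // used by BLOCK nodes only
        Node(Kind kind, String value) { this.kind = kind; this.value = value; }

        @Override public String toString() {
            switch (kind) {
                case TAG:     return "<" + value + ">";
                case END_TAG: return "</" + value + ">";
                case BLOCK:   return value + body;     // e.g. b[bold]
                default:      return value;
            }
        }
    }

    /** Index of the last still-unmatched start tag called `name`, or -1. */
    static int lastOpenTag(List<Node> out, String name) {
        for (int i = out.size() - 1; i >= 0; i--) {
            Node n = out.get(i);
            if (n.kind == Kind.TAG && n.value.equalsIgnoreCase(name)) return i;
        }
        return -1;
    }

    /** Folds matched start/end tag pairs in a flat list into BLOCK nodes. */
    static List<Node> collect(List<Node> flat) {
        List<Node> out = new ArrayList<Node>();
        for (Node node : flat) {
            if (node.kind != Kind.END_TAG) { out.add(node); continue; }
            int start = lastOpenTag(out, node.value);
            if (start < 0) { out.add(node); continue; } // stray end tag: keep as-is
            Node block = new Node(Kind.BLOCK, node.value);
            block.body.addAll(out.subList(start + 1, out.size()));
            out.subList(start, out.size()).clear();     // drop start tag and body
            out.add(block);
        }
        return out;
    }

    static List<Node> sample() {
        // <p>Hi<b>bold</b></i> : <p> is never closed, </i> was never opened.
        return Arrays.asList(
            new Node(Kind.TAG, "p"), new Node(Kind.TEXT, "Hi"),
            new Node(Kind.TAG, "b"), new Node(Kind.TEXT, "bold"),
            new Node(Kind.END_TAG, "b"), new Node(Kind.END_TAG, "i"));
    }

    public static void main(String[] args) {
        System.out.println(collect(sample()));
    }
}
```

Because unmatched tags are simply left where they are, malformed input never causes a failure; it only yields less structure.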
HtmlParse simply invokes the parser and dumps the resulting parse tree with HtmlDumper.
HtmlFormat is an example program which uses the parser, the scrubber, and the collector to parse and pretty-print an HTML page and dump it to System.out.
DumpLinks is a simple visitor which parses a document from System.in and writes out a list of links in the document to System.out.
Copyright (C) 1999-2002 Quiotix Corporation. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License, version 2 as published by the Free Software Foundation.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
If you use it, like it, dislike it, or find it useful, we would appreciate your letting us know by dropping an e-mail to html-parser@quiotix.com. Also, if you find any errors or have suggestions for improving it, or you find HTML files which it will not parse, let us know and we can post corrections and improvements.
Commercial support contracts are also available for this package. If you are interested in a commercial support contract, please contact us.
Version | Date | Comments |
---|---|---|
1.3 | 26-Jul-2006 | |
1.2 | 12-Jul-2006 | Convert to Maven. Ant build still supported. Examples moved into main tree. |
1.1.1 | 08-Jul-2005 | Do best-effort-only quote matching in HTML declarations, to deal with broken DOCTYPE declaration generated by some versions of JavaDoc. |
1.1 | 04-May-2005 | Remove matching of quoted strings in text, which can confuse the parser in some documents; add attribute-fetching methods in AttributeList and Tag. Convert JDK 1.1 collections to Collections framework. |
1.03 | 07-Sep-2004 | Support identifiers with namespaces |
1.02 | 29-May-2002 | Repackage, add ANT script, add more examples, update documentation |
1.01 | 06-Nov-2000 | Treat STYLE blocks just like SCRIPT blocks; don't interpret markup inside of STYLE blocks. Accept "-- >" and "->" as valid end-comment markers (the former is part of the spec; the latter is a not uncommon HTML coding error.) Accept XML-style "empty" tags (e.g., <tag />). Archive reorganized somewhat, more examples added. (source archive, jar file) |
1.00 | 03-Nov-1999 | Fixed bug in error recovery code which caused us to drop whitespace inside things that looked like tags but turned out not to be (or which were mal-formed.) Thanks to Doug Reed for reporting this bug. (tar file) |
0.92a | 21-Jul-1999 | Fixed bug in HtmlScrubber.java, submitted by Thorsten Weber. No change to HtmlParser.jj. |
0.92 | 01-Jul-1999 | Added extensive error handling and recovery; now handles mal-formed tags as ordinary text. Also handles <SCRIPT> blocks properly. |
0.91 | 14-Jun-1999 | Fixed bugs in quote processing; now handles unmatched quotes (including apostrophes) in text and in comments |
0.90 | 10-Jun-1999 | Initial public release |