com.quiotix.html.parser
Class HtmlParser

java.lang.Object
  extended by com.quiotix.html.parser.HtmlParser
All Implemented Interfaces:
HtmlParserConstants

public class HtmlParser
extends Object
implements HtmlParserConstants

This grammar parses an HTML document and produces a (flat) parse "tree" representing the document. It preserves almost all information in the source document, including carriage control and spacing (except inside tags). See the HtmlDocument and HtmlDocument.* classes for a description of the parse tree. The parse tree supports traversal using the "Visitor" pattern; the HtmlDumper class is a visitor that dumps the tree to an output stream.

The parser does not require begin tags to be matched with end tags, and it does not validate the names or contents of tags. Both can easily be done after parsing; see the HtmlCollector class, which matches begin tags with end tags, for an example.

Notable edge cases include:
- Quoted string processing. Quoted strings are matched inside comments and as tag attribute values. In normal text, quoted strings are matched only to the extent that they do not span line breaks.

Please direct comments, questions, gripes or praise to html-parser@quiotix.com. If you like it, hate it, or use it, please let us know!
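
The class description mentions a flat parse "tree", visitor-based traversal, and matching begin tags with end tags after parsing. The sketch below illustrates those three ideas with stand-in types only. FlatTreeSketch, Element, Visitor, dump and unmatchedTags are hypothetical names invented for this example; they are not the HtmlDocument, HtmlDumper or HtmlCollector classes, whose actual APIs are documented on their own pages.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Stand-in types only: these are NOT the com.quiotix.html.parser classes. */
public class FlatTreeSketch {

    /** Hypothetical visitor with one callback per kind of element. */
    public interface Visitor {
        void visitTag(String name);
        void visitEndTag(String name);
        void visitText(String text);
    }

    /** Hypothetical element of the flat sequence; it dispatches to the visitor. */
    public interface Element {
        void accept(Visitor v);
    }

    public static Element tag(String name)    { return v -> v.visitTag(name); }
    public static Element endTag(String name) { return v -> v.visitEndTag(name); }
    public static Element text(String text)   { return v -> v.visitText(text); }

    /** Visitor that writes the elements back out as markup, as a dumper might. */
    public static String dump(List<Element> elements) {
        StringBuilder out = new StringBuilder();
        Visitor dumper = new Visitor() {
            public void visitTag(String name)    { out.append('<').append(name).append('>'); }
            public void visitEndTag(String name) { out.append("</").append(name).append('>'); }
            public void visitText(String text)   { out.append(text); }
        };
        for (Element e : elements) {
            e.accept(dumper);
        }
        return out.toString();
    }

    /** Post-parse pass: names of begin tags that never found an end tag. */
    public static List<String> unmatchedTags(List<Element> elements) {
        Deque<String> open = new ArrayDeque<>();
        List<String> unmatched = new ArrayList<>();
        Visitor matcher = new Visitor() {
            public void visitTag(String name) { open.push(name); }
            public void visitEndTag(String name) {
                if (!open.contains(name)) {
                    return; // stray end tag: ignored in this sketch
                }
                // Tags opened after the matching begin tag were never closed.
                while (!open.peek().equals(name)) {
                    unmatched.add(open.pop());
                }
                open.pop(); // the matching begin tag itself
            }
            public void visitText(String text) { /* text does not affect matching */ }
        };
        for (Element e : elements) {
            e.accept(matcher);
        }
        unmatched.addAll(open); // anything still open at end of input
        return unmatched;
    }

    public static void main(String[] args) {
        // The sequence is flat: <b> is not nested "inside" <p> in any data structure.
        List<Element> doc = Arrays.asList(tag("p"), tag("b"), text("hi"), endTag("p"));
        System.out.println(dump(doc));
        System.out.println(unmatchedTags(doc));
    }
}
```

Because the sequence is flat, nesting is not enforced while the elements are built; a separate pass such as unmatchedTags can recover it afterwards, which is the division of labour the description attributes to the parser and HtmlCollector.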


Field Summary
 Token jj_nt
          Next token.
 boolean lookingAhead
          Whether we are looking ahead.
 Token token
          Current token.
 HtmlParserTokenManager token_source
          Generated Token Manager.
 
Fields inherited from interface com.quiotix.html.parser.HtmlParserConstants
ALPHA_CHAR, ALPHANUM_CHAR, ATTR_EQ, ATTR_NAME, ATTR_VAL, BLOCK_EOL, BLOCK_LBR, BLOCK_WORD, COMMENT_END, COMMENT_EOL, COMMENT_START, COMMENT_WORD, DASH, DECL_ANY, DECL_END, DECL_START, DEFAULT, ENDTAG_START, EOF, EOL, IDENTIFIER, IDENTIFIER_CHAR, IMPLICIT_TAG_END, LAV_ERROR, LexAttrVal, LexComment, LexDecl, LexInTag, LexScript, LexStartTag, LexStyle, LIT_ERROR, LST_ERROR, NEWLINE, NUM_CHAR, PCDATA, QUOTE, QUOTED_STRING, QUOTED_STRING_NB, SCRIPT_END, STYLE_END, TAG_END, TAG_NAME, TAG_SCRIPT, TAG_SLASHEND, TAG_START, TAG_STYLE, tokenImage, WHITESPACE
 
Constructor Summary
HtmlParser(HtmlParserTokenManager tm)
          Constructor with generated Token Manager.
HtmlParser(InputStream stream)
          Constructor with InputStream.
HtmlParser(InputStream stream, String encoding)
          Constructor with InputStream and supplied encoding.
HtmlParser(Reader stream)
          Constructor with Reader.
 
Method Summary
 HtmlDocument.Attribute Attribute()
           
 HtmlDocument.AttributeList AttributeList()
           
 HtmlDocument.ElementSequence BlockContents()
           
 HtmlDocument.Comment CommentTag()
           
 HtmlDocument.Comment DeclTag()
           
 void disable_tracing()
          Disable tracing.
 HtmlDocument.HtmlElement Element()
           
 HtmlDocument.ElementSequence ElementSequence()
           
 void enable_tracing()
          Enable tracing.
 HtmlDocument.HtmlElement EndTag()
           
 ParseException generateParseException()
          Generate ParseException.
 Token getNextToken()
          Get the next Token.
 Token getToken(int index)
          Get the specific Token.
 HtmlDocument HtmlDocument()
          Parses the input and returns the document's parse tree.
static void main(String[] args)
          Command-line entry point.
 void ReInit(HtmlParserTokenManager tm)
          Reinitialise.
 void ReInit(InputStream stream)
          Reinitialise.
 void ReInit(InputStream stream, String encoding)
          Reinitialise.
 void ReInit(Reader stream)
          Reinitialise.
 HtmlDocument.HtmlElement ScriptBlock()
           
 HtmlDocument.HtmlElement StyleBlock()
           
 HtmlDocument.HtmlElement Tag()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

token_source

public HtmlParserTokenManager token_source
Generated Token Manager.


token

public Token token
Current token.


jj_nt

public Token jj_nt
Next token.


lookingAhead

public boolean lookingAhead
Whether we are looking ahead.

Constructor Detail

HtmlParser

public HtmlParser(InputStream stream)
Constructor with InputStream.


HtmlParser

public HtmlParser(InputStream stream,
                  String encoding)
Constructor with InputStream and supplied encoding.


HtmlParser

public HtmlParser(Reader stream)
Constructor with Reader.


HtmlParser

public HtmlParser(HtmlParserTokenManager tm)
Constructor with generated Token Manager.
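
When the input's character encoding is known, one option is to decode the bytes yourself and hand the parser a Reader. The sketch below uses only the Java standard library to build such a Reader; ReaderInputSketch, utf8Reader and readAll are hypothetical helper names, and UTF-8 is an assumption about the input. The library call appears only in a comment so that the example compiles without the parser on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ReaderInputSketch {

    /** Wraps a byte stream in a Reader that decodes it as UTF-8 (assumed encoding). */
    public static Reader utf8Reader(InputStream in) {
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }

    /** Drains a Reader into a String, only to show what a parser would receive. */
    public static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "<p>caf\u00e9</p>".getBytes(StandardCharsets.UTF_8);
        try (Reader reader = utf8Reader(new ByteArrayInputStream(bytes))) {
            // With the library available, this reader could instead be passed to
            // the documented constructor: new HtmlParser(reader)
            System.out.println(readAll(reader));
        }
    }
}
```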

Method Detail

main

public static void main(String[] args)
                 throws ParseException
Command-line entry point.

Throws:
ParseException

HtmlDocument

public final HtmlDocument HtmlDocument()
                                throws ParseException
Parses the input and returns the document's parse tree.

Throws:
ParseException

ElementSequence

public final HtmlDocument.ElementSequence ElementSequence()
                                                   throws ParseException
Returns:
a sequence of elements
Throws:
ParseException

Element

public final HtmlDocument.HtmlElement Element()
                                       throws ParseException
Returns:
an element
Throws:
ParseException

Attribute

public final HtmlDocument.Attribute Attribute()
                                       throws ParseException
Returns:
an attribute
Throws:
ParseException

AttributeList

public final HtmlDocument.AttributeList AttributeList()
                                               throws ParseException
Returns:
an AttributeList
Throws:
ParseException

Tag

public final HtmlDocument.HtmlElement Tag()
                                   throws ParseException
Returns:
a tag
Throws:
ParseException

BlockContents

public final HtmlDocument.ElementSequence BlockContents()
                                                 throws ParseException
Returns:
the contents of a block
Throws:
ParseException

ScriptBlock

public final HtmlDocument.HtmlElement ScriptBlock()
                                           throws ParseException
Returns:
the contents of a script block
Throws:
ParseException

StyleBlock

public final HtmlDocument.HtmlElement StyleBlock()
                                          throws ParseException
Returns:
the contents of a style block
Throws:
ParseException

EndTag

public final HtmlDocument.HtmlElement EndTag()
                                      throws ParseException
Returns:
an end tag
Throws:
ParseException

CommentTag

public final HtmlDocument.Comment CommentTag()
                                      throws ParseException
Returns:
a comment
Throws:
ParseException

DeclTag

public final HtmlDocument.Comment DeclTag()
                                   throws ParseException
Returns:
the start of a declaration
Throws:
ParseException

ReInit

public void ReInit(InputStream stream)
Reinitialise.


ReInit

public void ReInit(InputStream stream,
                   String encoding)
Reinitialise.


ReInit

public void ReInit(Reader stream)
Reinitialise.


ReInit

public void ReInit(HtmlParserTokenManager tm)
Reinitialise.


getNextToken

public final Token getNextToken()
Get the next Token.


getToken

public final Token getToken(int index)
Get the specific Token.


generateParseException

public ParseException generateParseException()
Generate ParseException.


enable_tracing

public final void enable_tracing()
Enable tracing.


disable_tracing

public final void disable_tracing()
Disable tracing.



Copyright © 1999-2011 Quiotix. All Rights Reserved.