NAME

Whatpm::HTML::Parser - An HTML parser


SYNOPSIS

  use Whatpm::HTML::Parser;
  use Message::DOM::DOMImplementation;
  $parser = Whatpm::HTML::Parser->new;
  $dom = Message::DOM::DOMImplementation->new;
  $doc = $dom->create_document;
  
  $parser->parse_char_string ($chars => $doc);
  $parser->parse_byte_string ($encoding, $bytes => $doc);
  ## Or, just use the DOM attributes:
  $doc->manakai_is_html (1);
  $doc->inner_html ($chars);


DESCRIPTION

The Whatpm::HTML::Parser module is an implementation of the HTML parser. It implements the HTML parsing algorithm as defined by the HTML Living Standard. Its parsing behavior is therefore intended to be compatible with Web browsers that have an HTML5 parser enabled.


METHODS

Where possible, it is recommended to parse an HTML string through the standard DOM interface, such as the inner_html method of the Document object. The Whatpm::HTML::Parser module, which is in fact used to implement the inner_html method, offers more control over how the parser behaves. That control is unlikely to be useful unless you are writing a complex user agent such as a browser or a validator.

The Whatpm::HTML::Parser module provides the following methods:

$parser = Whatpm::HTML::Parser->new

Create a new parser.

$parser->parse_char_string ($chars => $doc)

Parse a string of characters (i.e. a possibly utf8-flagged string) as HTML and construct the DOM tree.

The first argument to the method must be a string to parse. It may or may not be a valid HTML document.

The second argument to the method must be a DOM Document object (the Message::DOM::Document manpage). Any existing child nodes of the document are removed by the parser first.
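
As an illustration, the call can be wrapped in a small helper. This is only a sketch: parse_html_chars is a hypothetical name that is not part of Whatpm, and the modules are loaded at run time so that the helper can be defined before they are available.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Whatpm): parse a character string
# into a freshly created Document and return that Document.
sub parse_html_chars {
    my ($chars) = @_;

    # Load the modules only when the helper is actually called.
    require Whatpm::HTML::Parser;
    require Message::DOM::DOMImplementation;

    # Create an empty Document to receive the parsed tree.
    my $dom = Message::DOM::DOMImplementation->new;
    my $doc = $dom->create_document;

    my $parser = Whatpm::HTML::Parser->new;
    $parser->parse_char_string ($chars => $doc);

    return $doc;
}
```

A call such as parse_html_chars ("<!DOCTYPE html><p>Hello") would then return the populated Document. The input does not need to be a valid HTML document.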

$parser->parse_byte_string ($encoding, $bytes => $doc)

Parse a string of bytes as HTML and construct the DOM tree.

The first argument to the method must be the label of a (character) encoding, as specified by the Encoding Standard. The undef value can be specified if the encoding is not known.

The second argument to the method must be a string to parse. It may or may not be a valid HTML document.

The third argument to the method must be a DOM Document object (the Message::DOM::Document manpage). Any existing child nodes of the document are removed by the parser first.
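
The byte-oriented variant can be sketched in the same way. The helper name parse_html_bytes is hypothetical and not part of Whatpm; its second argument is optional, so that undef is passed to the parser when the encoding is not known.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Whatpm): parse a byte string into a
# freshly created Document.  $encoding is an encoding label such as
# 'utf-8', or undef if the encoding is not known.
sub parse_html_bytes {
    my ($bytes, $encoding) = @_;

    # Load the modules only when the helper is actually called.
    require Whatpm::HTML::Parser;
    require Message::DOM::DOMImplementation;

    my $dom = Message::DOM::DOMImplementation->new;
    my $doc = $dom->create_document;

    my $parser = Whatpm::HTML::Parser->new;
    # Argument order: encoding label, bytes, then the Document.
    $parser->parse_byte_string ($encoding, $bytes => $doc);

    return $doc;
}
```

For example, parse_html_bytes ("<p>caf\xC3\xA9", 'utf-8') passes a known label, while parse_html_bytes ($bytes) leaves the encoding unspecified.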

$parser->set_inner_html ($node, $chars)

Parse a string of characters in the context of a node. If the node is a Document, this is equivalent to the parse_char_string method. If the node is an Element, parsing is performed in fragment mode, with the element as the context.

The first argument to the method must be a DOM Node object (the Message::DOM::Node manpage) that is also a Document (the Message::DOM::Document manpage) or an Element (the Message::DOM::Element manpage). The node is used to give the context to the parser and to receive the parsed subtree. Any existing child node of the node is removed first.

The second argument to the method must be a string of characters.
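
A minimal sketch of the call follows. The helper name replace_children_with_html is hypothetical and not part of Whatpm. The node is supplied by the caller, so the sketch makes no assumption about how the Document or Element was created.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Whatpm): replace the children of an
# existing Document or Element node with the result of parsing $chars.
sub replace_children_with_html {
    my ($node, $chars) = @_;

    # Load the module only when the helper is actually called.
    require Whatpm::HTML::Parser;

    my $parser = Whatpm::HTML::Parser->new;
    # The node comes first here, unlike in the parse_* methods, where
    # the Document is the last argument.
    $parser->set_inner_html ($node, $chars);

    return $node;
}
```

Note the argument order: set_inner_html takes the node before the string, whereas parse_char_string takes the string before the Document.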

$code = $parser->onerror
$parser->onerror ($new_code)

Get or set the error handler for the parser. Parse errors, as well as warnings and informational messages, are reported to the handler. See the Whatpm::Errors manpage for more information.
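
One way to use the handler is to collect everything that is reported during a parse. The following is only a sketch: parse_html_collecting_errors is a hypothetical name that is not part of Whatpm. The arguments passed to the handler are stored unchanged, because their structure is defined by the Whatpm::Errors manpage and is not assumed here.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Whatpm): parse a character string
# and return the Document together with a reference to the list of
# reports passed to the error handler.
sub parse_html_collecting_errors {
    my ($chars) = @_;

    # Load the modules only when the helper is actually called.
    require Whatpm::HTML::Parser;
    require Message::DOM::DOMImplementation;

    my $dom = Message::DOM::DOMImplementation->new;
    my $doc = $dom->create_document;

    my @reports;
    my $parser = Whatpm::HTML::Parser->new;

    # Each call to the handler is recorded as one array reference that
    # holds the handler's arguments exactly as they were received.
    $parser->onerror (sub { push @reports, [@_] });

    $parser->parse_char_string ($chars => $doc);

    return ($doc, \@reports);
}
```

The caller can then inspect the collected reports after parsing, for example by counting them with scalar @$reports.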

The parsed document structure is reflected in the Document object passed as an argument to the parse methods. The character encoding used to parse the document can be retrieved with the input_encoding method of the Document.
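
For instance, the encoding that was used for a byte string can be read back after parsing. This is a sketch only: input_encoding_of is a hypothetical helper name that is not part of Whatpm, and the encoding is deliberately left unspecified.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Whatpm): parse a byte string without
# specifying an encoding and return the encoding recorded on the
# Document after parsing.
sub input_encoding_of {
    my ($bytes) = @_;

    # Load the modules only when the helper is actually called.
    require Whatpm::HTML::Parser;
    require Message::DOM::DOMImplementation;

    my $dom = Message::DOM::DOMImplementation->new;
    my $doc = $dom->create_document;

    my $parser = Whatpm::HTML::Parser->new;
    # undef means that the encoding is not known to the caller.
    $parser->parse_byte_string (undef, $bytes => $doc);

    return $doc->input_encoding;
}
```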

Although the parser is intended to be fully conformant to the HTML Living Standard, it might not yet implement the latest changes to the specification. See the list of bugs on the HTML parser <http://manakai.g.hatena.ne.jp/task/2/> for the current implementation status.


SEE ALSO

the Message::DOM::Document manpage, the Message::DOM::Element manpage.

the Whatpm::HTML::Serializer manpage.

the Whatpm::ContentChecker manpage.

the Whatpm::XML::Parser manpage.


SPECIFICATIONS

[HTML]

HTML Living Standard - Parsing HTML documents <http://www.whatwg.org/specs/web-apps/current-work/#parsing>.

HTML Living Standard - Parsing HTML fragments <http://www.whatwg.org/specs/web-apps/current-work/#parsing-html-fragments>.

[ENCODING]

Encoding Standard <http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html>.


AUTHOR

Wakaba <w@suika.fam.cx>.


LICENSE

Copyright 2007-2012 Wakaba <w@suika.fam.cx>.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.