NAME
    YAPE::HTML - Yet Another Parser/Extractor for HTML

SYNOPSIS
      use YAPE::HTML;
      use strict;
      
      my $content = "<html>...</html>";
      my $parser = YAPE::HTML->new($content);
      my ($extor,@fonts,@urls,@headings,@comments);
      
      # here is the tokenizing part
      while (my $chunk = $parser->next) {
        if ($chunk->type eq 'tag' and $chunk->tag eq 'font') {
          if (my $face = $chunk->get_attr('face')) {
            push @fonts, $face;
          }
        }
      }
      
      # here we catch any errors
      unless ($parser->done) {
        die sprintf "bad HTML: %s (%s)",
          $parser->error, $parser->chunk;
      }
      
      # here is the extracting part
      
      # <A> tags with HREF attributes
      # <IMG> tags with SRC attributes
      $extor = $parser->extract(a => ['href'], img => ['src']);
      while (my $chunk = $extor->()) {
        push @urls, $chunk->get_attr(
          $chunk->tag eq 'a' ? 'href' : 'src'
        );
      }
      
      # <H1>, <H2>, ..., <H6> tags
      $extor = $parser->extract(qr/^h[1-6]$/ => []);
      while (my $chunk = $extor->()) {
        push @headings, $chunk;
      }
      
      # all comments
      $extor = $parser->extract(-COMMENT);
      while (my $chunk = $extor->()) {
        push @comments, $chunk;
      }

`YAPE' MODULES
    The `YAPE' hierarchy of modules is an attempt at a unified means
    of parsing and extracting content. It attempts to maintain a
    generic interface, to promote simplicity and reusability. The
    API is powerful, yet simple. The modules do tokenization (which
    can be intercepted) and build trees, so that extraction of
    specific nodes is doable.

DESCRIPTION
    This module is yet another parser and tree-builder for HTML
    documents. It is designed to make extraction and modification of
    HTML documents simplistic. The API allows for easy custom
    additions to the document being parsed, and allows very specific
    tag, text, and comment extraction.

USAGE
    In addition to the base class, `YAPE::HTML', there is the
    auxiliary class `YAPE::HTML::Element' (common to all `YAPE' base
    classes) that holds the individual nodes' classes. There is
    documentation for the node classes in that module's
    documentation.

    HTML elements and their attributes are stored internally as
    lowercase strings. For clarification, that means that the tag
    `<A HREF="FooBar.html">' is stored as

      {
        TAG => 'a',
        ATTR => {
          href => 'FooBar.html',
        }
      }

    This means that tags will be output in lowercase. There will be
    a feature in a future version to switch output case to capital
    letters.

  Functions

    * `YAPE::HTML::EMPTY(@tags)'
        Adds to the internal hash of tags which never contain any
        out-of-tag content. This hash is `%YAPE::HTML::EMPTY', and
        contains the following tag names: `area', `base', `br',
        `hr', `img', `input', `link', `meta', and `param'. Deletion
        from this hashmust be done manually. Adding to this hash
        automatically adds to the `%OPEN' hash, described next.

    * `YAPE::HTML::OPEN(@tags)'
        Adds to the internal hash of tags which do not require a
        closing tag. This hash is `%YAPE::HTML::OPEN', and contains
        the following tag names: `area', `base', `br', `dd', `dt',
        `hr', `img', `input', `li', `link', `meta', `p', and
        `param'. Deletion from this hash must be done manually.

        There is a subtle difference between "empty" and "open"
        tags. For example, the `<AREA>' tag contains a few
        attributes, but there is no text associated with it (nor any
        other tags), and therefore, is "empty"; the `<LI>', on the
        other hand,

        It is strongly suggested that for ease in parsing, any tags
        that you do not explicitly close have a `/' at the end of
        the tag:

          Here's my cat: <img src="cat.jpg" />

  Methods for `YAPE::HTML'

    * `use YAPE::HTML;'
    * `use YAPE::HTML qw( MyExt::Mod );'
        If supplied no arguments, the module is loaded normally, and
        the node classes are given the proper inheritence (from
        `YAPE::HTML::Element'). If you supply a module (or list of
        modules), `import' will automatically include them (if
        needed) and set up *their* node classes with the proper
        inheritence -- that is, it will append `MyExt::Mod::Element'
        and `YAPE::HTML::Element' to each node class's `@ISA' (where
        `MyExt::Mod' is the name of the module being used).

    * `my $p = YAPE::HTML->new($HTML, $strict);'
        Creates a `YAPE::HTML' object, using the contents of the
        `$HTML' string as its HTML to parse. The optional second
        argument determines whether this parser instance will demand
        strict comment parsing and require all tags to be closed
        with a closing tag or a `/' at the end of the tag (`<HR
        />'). Any true value (except for the special string `-
        NO_STRICT') will turn strict parsing on. This is off by
        default. (This could be considered a bug.)

    * `my $text = $p->chunk($len);'
        Returns the next `$len' characters in the input string;
        `$len' defaults to 30 characters. This is useful for
        figuring out why a parsing error occurs.

    * `my $done = $p->done;'
        Returns true if the parser is done with the input string,
        and false otherwise.

    * `my $errstr = $p->error;'
        Returns the parser error message.

    * `my $coderef = $p->extract(...);'
        Returns a code reference that returns the next object that
        matches the criteria given in the arguments. The arguments
        are various; all text:

          $p->extract(-TEXT);

        all comments:

          $p->extract(-COMMENT);

        all tags:

          $p->extract(-TAG);

        specific tags:

          $p->extract(b => [], i => [], u => []);

        specific tags with specific attributes:

          $p->extract(a => ['href','target']);

        regex object to match tags:

          $p->extract(qr/^h[1-6]$/ => []);

        regex object with specific attributes:

          $p->extract(qr/^h[1-6]$/ => ['align']);

        or any combination of these -- the exception being that the
        three constants must appear before any of the tag-attribute
        pairs:

          $p->extract(
            -COMMENT,            # all comments
            div => ['align'],    # <DIV> with ALIGN attr
            qr/^t[drh]$/ => [],  # <TD> <TR> <TH> tags
          );

    * `my $node = $p->display(...);'
        Returns a string representation of the entire content. It
        calls the `parse' method in case there is more data that has
        not yet been parsed. This calls the `fullstring' method on
        the root nodes. Check the `YAPE::HTML::Element' docs on the
        arguments to `fullstring'.

    * `my $node = $p->next;'
        Returns the next token, or `undef' if there is no valid
        token. There will be an error message (accessible with the
        `error' method) if there was a problem in the parsing.

    * `my $node = $p->parse;'
        Calls `next' until all the data has been parsed.

    * `my $attr = $p->quote($string);'
        Returns a quoted string, suitable for using as an attribute.
        It turns any embedded `"' characters into `&quot;'. This can
        also be called as a raw function:

          my $quoted = YAPE::HTML::quote($string);

    * `my $root = $p->root;'
        Returns an array reference holding the root of the tree
        structure -- for documents that contain multiple top-level
        tags, this will have more than one element.

    * `my $state = $p->state;'
        Returns the current state of the parser. It is one of the
        following values: `close(TAG)', `comment', `done', `error',
        `open(TAG)', `text', `text(script)', or `text(xmp)'. The
        `open' and `close' states contain the name of the element in
        parentheses (ex. `open(img)'). Tag names, as well as the
        names of attributes, are converted to lowercase. The state
        of `text(script)' refers to text found inside an `<SCRIPT>'
        element, and likewise for `text(xmp)'.

    * `my $HTMLnode = $p->top;'
        Returns the first `<HTML>' node it finds in the tree
        structure.

FEATURES
    This is a list of special features of `YAPE::HTML'.

    * On-the-fly cleaning of HTML
        If you aren't enforcing strict HTML syntax, then in the act
        of parsing HTML, if a tag that *should* be closed is not
        closed, it will be flagged for closing. That means that
        input like:

          <b>Foo<i>bar</b>

        will appear as:

          <b>Foo<i>bar</i></b>

        upon request for output. In addition, tags that are left
        dangling open at the end of an HTML document get closed.
        That means:

          <b>Foo<i>bar

        will appear as:

          <b>Foo<i>bar</i></b>

    * Syntax-checking
        If strict checking is off, the only error you'll receive
        from mismatched HTML tags is a closing tag out-of-place.

        On the other hand, if you do enforce strict HTML syntax,
        you'll be informed of tags that do not get closed as well
        (that should be closed).

TO DO
    This is a listing of things to add to future versions of this
    module.

  API

    * Proper DTD support
        Add an object, `YAPE::HTML::dtd', for handling a
        `<!DOCTYPE>' tag.

    * HTML entity translation (via `HTML::Entities' no doubt)
        Add a flag to the `fullstring' method of objects, `-EXPAND',
        which will display `&...;' HTML escapes as the character
        representing them.

    * SSI parsing support
        Add an object, `YAPE::HTML::ssi', for handling `<!--#command
        >' tags.

    * Toggle case of output (lower/upper case)
        Add a flag to the `fullstring' method of objects, `-UPPER',
        which will display tag and attribute names in uppercase.

    * Super-strict syntax checking
        DTD-like strictness in regards to nesting of elements --
        `<LI>' is not allowed to be outside an `<OL>' or `<UL>'
        element.

  Internals

    * Make it faster, of course
        There's probably some inherent slowness to this method, but
        it works. And it supports the robust `extract' method.

    * Combine `CLOSED' and `IMPLICIT'
        Make three constants, `CLOSED_NO', `CLOSED_YES', and
        `CLOSED_IMPL'.

BUGS
    Following is a list of known or reported bugs.

    * The above features aren't in here yet.  `;)'
    * Strict syntax-checking is not on by default.
    * This documentation might be incomplete.
    * Probably need some more test cases.
SUPPORT
    Visit `YAPE''s web site at http://www.pobox.com/~japhy/YAPE/.

SEE ALSO
    The `YAPE::HTML::Element' documentation, for information on the
    node classes.

AUTHOR
      Jeff "japhy" Pinyan
      CPAN ID: PINYAN
      japhy@pobox.com
      http://www.pobox.com/~japhy/

