NAME
    Text::Scan - Fast search for very large numbers of keys in a body of
    text.

SYNOPSIS
            use Text::Scan;

            $dict = Text::Scan->new;

            %terms = ( dog  => 'canine',
                       bear => 'ursine',
                       pig  => 'porcine' );

            # load the dictionary with keys and values
            # (values can be any scalar, keys must be strings)
            while( ($key, $val) = each %terms ){
                    $dict->insert( $key, $val );
            }

            # Scan a document for matches
            %found = $dict->scan( $document );

            # Or, if you need to count number of occurrences of any given 
            # key, use an array. This will give you a countable flat list
            # of key => value pairs.
            @found = $dict->scan( $document );

            # Check for membership ($val is true)
            $val = $dict->has('pig');

            # Retrieve all keys. This returns all inserted keys in ascending 
            # char value, substrings first.
            @keys = $dict->keys();

            # Retrieve all values (in same order as corresponding keys) 
            # (new in v0.10)
            @vals = $dict->values();
        
            # Like perl's index() but with multiple patterns (new in v0.07)
            # Scan for the starting positions of terms.
            @indices = $dict->mindex( $document );

            # The hash version of mindex() records the position of the first
            # occurrence of each key
            %indices = $dict->mindex( $document ); 

            # Turn on wildcard scanning. (New in v0.09) 
            # This can be done anytime. Works for scan() and mindex()
            $dict->usewild();
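
    The flat list that scan() returns in array context can be tallied with
    plain Perl. The following is a minimal sketch that assumes the list has
    the key => value pair layout described above. The helper name
    count_keys and the sample data are illustrative and are not part of
    Text::Scan.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Scan): tally how often each
# key appears in a flat (key, value, key, value, ...) list.
sub count_keys {
    my @pairs = @_;
    my %count;
    # Consume the list two elements at a time; the loop ends when
    # splice() returns an empty list.
    while ( my ($key, $val) = splice( @pairs, 0, 2 ) ) {
        $count{$key}++;
    }
    return %count;
}

# Sample data standing in for the result of $dict->scan( $document )
my @found = ( dog => 'canine', pig => 'porcine', dog => 'canine' );
my %count = count_keys(@found);

printf "%s: %d\n", $_, $count{$_} for sort keys %count;
```

    Assigning the same list to a hash instead would collapse repeated keys,
    which is why the array form is the one to use for counting.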
                
DESCRIPTION
    This module provides facilities for fast searching on arbitrarily long
    texts with very many search keys. The basic object behaves somewhat like
    a perl hash, except that you can retrieve based on a superstring of any
    keys stored. Simply scan a string as shown above and you will get back a
    perl hash (or list) of all keys found in the string, along with their
    associated values (or their positions, if you use mindex() instead of
    scan(); see the examples above). All keys present in the text are
    returned, except where one or more keys are prefixes of another, longer
    key that is also present. In these cases only the longest key is
    returned.

    NOTE: This is a behavioral change from previous versions where keys
    could never overlap. Now they may overlap and still be detected.

    IMPORTANT: A single space is used as a delimiter for purposes of
    recognizing key boundaries. That's right, there is a bias in favor of
    processing natural language! In other words, if 'my dog' is a key and
    'my dogs bite' is the text, 'my dog' will not be recognized. I plan to
    make this more configurable in the future, to have a different delimiter
    or none at all. For now, recognize that the key 'drunk' will not be
    found in the text 'gedrunk' or 'drunken' (or 'drunk.' for that matter).
    Properly tokenizing your corpus is essential. I know there is probably a
    better solution to the problem of substrings, and if anyone has
    suggestions, by all means contact me.
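
    Because keys are only recognized at single-space boundaries, the target
    text usually needs normalizing before it is scanned. The following is a
    minimal sketch of one way to do that in plain Perl. The helper name
    tokenize and its rules (lowercasing, splitting punctuation away from
    words, collapsing whitespace) are assumptions made for illustration and
    are not part of Text::Scan.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Scan): normalize text so that
# every word is separated from its neighbors by exactly one space.
sub tokenize {
    my ($text) = @_;
    $text = lc $text;
    # Surround each punctuation character with spaces, so that
    # "drunk." becomes "drunk . "
    $text =~ s/([[:punct:]])/ $1 /g;
    # Collapse runs of whitespace to a single space, then trim the ends
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}

print tokenize("He was Drunk.  My dogs bite!"), "\n";
```

    Keys would presumably need the same normalization before being passed
    to insert(), so that keys and text agree on case and spacing.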

COMMENTARY
    What I am leaning toward is simply having no implicit delimiter at all,
    and relying on the programmer to use a chosen delimiter when inserting
    keys, then tokenizing the target text properly so that the delimiter is
    present at boundaries as defined by your application. This would leave
    you free to have no delimiter if you really want "drunk" to match
    "gedrunk", "drunken", "drunk." etc. The chore of tokenizing the target
    would be mitigated by pattern matching capabilities (hmm..)

NEW
    In v 0.13: A more-or-less complete rewrite of Text::Scan uses a more
    traditional finite-state machine rather than a ternary trie for the
    basic data structure. This results in an average 20% savings in memory
    and 10% savings in runtime, besides being much simpler to implement,
    thus less prone to bugs.

    In v 0.09: Wildcards! A limited wildcard functionality is available.
    Call usewild() to turn it on. Thereafter any asterisk (*) followed by a
    space (' ') will be treated as "zero or more non-space characters". Once
    this function is turned on, the scan will be approximately 50% slower
    than with literal strings. If you include '*' in any key without calling
    usewild(), the '*' will be treated literally.
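
    The wildcard rule can be approximated with an ordinary Perl regex, which
    may help when predicting what a wildcard key will match. This is only a
    sketch of the rule as stated above, not the module's implementation. The
    helper name wild_to_regex is hypothetical, and treating an asterisk that
    is not followed by a space as a literal is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Scan): build a regex that
# approximates the documented wildcard rule. An asterisk followed by a
# space stands for zero or more non-space characters; every other
# character, including any other asterisk, is taken literally.
sub wild_to_regex {
    my ($key) = @_;
    # Split on '*' only where a space follows; the space itself is kept
    my @parts = split /\*(?= )/, $key, -1;
    my $body  = join '\S*', map { quotemeta($_) } @parts;
    # Require a space or the edge of the text on both sides of the key
    return qr/(?<!\S)${body}(?!\S)/;
}

my $re = wild_to_regex('walk* the dog');
for my $text ( 'i walked the dog', 'i walk the dog', 'i sidewalk the dog' ) {
    printf "%-20s %s\n", $text, ( $text =~ $re ? 'match' : 'no match' );
}
```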

TO DO
    Some obvious things have not been implemented: deletion of key/value
    pairs, patterns as keys (kind of a big one), the above-mentioned
    elimination of the default boundary marker ' ', and the possibility of
    calling scan() with a filehandle instead of a string scalar. There is
    also an optimization I've been thinking about, call it "continuation
    reentrancy", that would greatly speed up matches on literal strings by
    never examining the same input char more than once.

    Another optimization that might help is a transition reordering scheme
    for the sequential searches within states. This was shown by Sleator to
    reduce the strict number of comparisons over time.
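
    The text does not name a particular reordering scheme. One common rule
    for self-adjusting sequential search is move-to-front, sketched below in
    plain Perl as an assumption about what such a scheme could look like.
    The transition list layout and the name find_transition are
    hypothetical and do not reflect the module's C data structures.

```perl
use strict;
use warnings;

# Hypothetical sketch: look up $char in a state's transition list (an
# array of [char, next_state] pairs) by sequential search, then move
# the matching transition to the front of the list.
# Returns (next_state, comparisons), or an empty list if not found.
sub find_transition {
    my ($transitions, $char) = @_;
    for my $i ( 0 .. $#$transitions ) {
        next unless $transitions->[$i][0] eq $char;
        # Remove the hit from its current slot and put it first
        my ($hit) = splice( @$transitions, $i, 1 );
        unshift @$transitions, $hit;
        return ( $hit->[1], $i + 1 );
    }
    return;
}

my @state = ( [ a => 1 ], [ b => 2 ], [ c => 3 ] );
for my $char (qw( c c a c )) {
    my ($next, $cost) = find_transition( \@state, $char );
    print "char $char -> state $next after $cost comparison(s)\n";
}
```

    Frequently used transitions drift toward the front of the list, so
    repeated lookups of the same character need fewer comparisons.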

CREDITS
    Chad, Tara, Dan, Kim, love ya sweethearts.

    Many test scripts come directly from Rogaski's "Tree::Ternary" module.

    The C code interface was created using Ingerson's "Inline".

OLD CREDITS (versions prior to 0.13)
    The basic data structure used to be a ternary trie, but I changed it
    starting with version 0.13 to a finite state machine, for the sake of
    performance and simplicity. However, it was a lot of fun working with
    these ideas, so I'm including the old credits here.

    The basic framework for this code is borrowed from both Bentley &
    Sedgewick, and Leon Brocard's additions to it for "Tree::Ternary_XS". The
    differences are in the modified search algorithm to allow for scanning,
    the storage of keys/values, and an extra node-rotation for gradual
    self-adjusting optimization to the statistical characteristics of the
    target text.

    Many test scripts come directly from Rogaski's "Tree::Ternary" module.

    The C code interface was created using Ingerson's "Inline".

SEE ALSO
    Bentley & Sedgewick, "Fast Algorithms for Sorting and Searching
    Strings", Proceedings ACM-SIAM (1997)

    Bentley & Sedgewick, "Ternary Search Trees", Dr. Dobb's Journal (1998)

    Sleator & Tarjan, "Self-Adjusting Binary Search Trees", Journal of the
    ACM (1985)

    "Tree::Ternary"

    "Tree::Ternary_XS"

    "Inline"

COPYRIGHT
    Copyright 2001, 2002 Ira Woodhead. All rights reserved.

    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

AUTHOR
    Ira Woodhead, ira@foobox.com

