NAME
    Unicode::LineBreak - UAX #14 Unicode Line Breaking Algorithm

SYNOPSIS
        use Unicode::LineBreak;
        $lb = Unicode::LineBreak->new();
        $broken = $lb->break($string);

DESCRIPTION
    Unicode::LineBreak performs the Line Breaking Algorithm described in
    Unicode Standard Annex #14 [UAX #14]. The East_Asian_Width informative
    properties defined by Annex #11 [UAX #11] are also taken into account
    when determining breaking positions.

    NOTE: This is an alpha release, intended only as a proof of concept.

  Public Interface
    new ([KEY => VALUE, ...])
        *Constructor*. For the KEY => VALUE pairs, see "Options".

    $self->config (KEY)
    $self->config (KEY => VALUE, ...)
        *Instance method*. Get or update the configuration. For the
        KEY => VALUE pairs, see "Options".

    $self->break (STRING)
        *Instance method*. Break the Unicode string STRING into lines and
        return the result.

    getcontext([Charset => CHARSET], [Language => LANGUAGE])
        *Function*. Get the language/region context used by the character
        set CHARSET or the language LANGUAGE.

  Options
    The new and config methods accept the following pairs.

    Context => CONTEXT
        Specify the language/region context. The currently available
        contexts are "EASTASIAN" and "NONEASTASIAN". The default context is
        "NONEASTASIAN".

    Format => METHOD
        Specify the method to format broken lines.

        "DEFAULT"
            Default method. Only inserts a newline at each arbitrary
            breaking position.

        "NEWLINE"
            Insert newline sequences, or replace existing ones, using the
            sequence specified by the Newline option, and remove SPACEs
            preceding newline sequences or the end of text. Then append a
            newline at the end of the text if there is none.

        "TRIM"
            Insert a newline at each arbitrary breaking position. Remove
            SPACEs preceding newline sequences.

        Subroutine reference
            See "Customizing Line Breaking Behavior".

        See also the Newline option.

    HangulAsAL => "YES" | "NO"
        Treat hangul syllables and conjoining jamos as alphabetic characters
        (AL). Default is "NO".

    LegacyCM => "YES" | "NO"
        Treat a combining character preceded by a SPACE as an isolated
        combining character. As of Unicode 5.0, such use of SPACE is not
        recommended. The default is "YES".

    MaxColumns => NUMBER
        Maximum number of columns a line may contain, not counting trailing
        spaces and the newline sequence. In other words, the maximum length
        of a line. The default is 76.

    Newline => STRING
        Unicode string to be used as the newline sequence. The default is
        "\n".

    NSKanaAsID => "CLASS..."
        Treat some non-starters (NS) as normal ideographic characters (ID)
        based on the classification specified by CLASS. CLASS may include
        the following substrings.

        "ALL"
            All of the characters below. "YES" is a synonym.

        "ITERATION MARKS"
            Ideographic iteration marks.

            N.B. Some of them are neither hiragana nor katakana.

        "KANA SMALL LETTERS", "PROLONGED SOUND MARKS"
            Hiragana or katakana small letters and prolonged sound marks.

            N.B. These letters may optionally be treated either as
            non-starters or as normal ideographic characters. See [JIS X
            4051] 6.1.1.

        "MASU MARK"
            U+303C MASU MARK.

            N.B. Although this character is not kana, it is usually regarded
            as an abbreviation of the hiragana sequence "ます" or the
            katakana sequence "マス", MA and SU.

            N.B. This character is classified as Non-starter (NS) by [UAX
            #14] and as Class 13 (corresponding to ID) by [JIS X 4051].

        "NO"
            Default. None of the above are treated as ID characters.

    SizingMethod => METHOD
        Specify the method used to calculate the size of a string. The
        following options are available.

        "DEFAULT"
            Default method.

        "NARROWAL"
            Certain letters of the Latin, Greek and Cyrillic scripts have
            the ambiguous (A) East_Asian_Width property, and are therefore
            treated as wide in the "EASTASIAN" context. With this option,
            those characters are treated as narrow.

        Subroutine reference
            See "Customizing Line Breaking Behavior".
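
    As a rough illustration of how the options above might be combined,
    the following sketch passes a few of them to new, updates one with
    config, and breaks a sample string. The option values and the sample
    text are arbitrary choices made for this example. The eval guard is
    only there so that the script degrades gracefully where the module is
    not installed or rejects an option.

```perl
use strict;
use warnings;

# Option names come from the "Options" list; the values are illustrative.
my %options = (
    Context    => 'EASTASIAN',
    Format     => 'NEWLINE',
    MaxColumns => 40,
    Newline    => "\r\n",
);

# Arbitrary sample text, long enough to need several breaks.
my $text = 'The quick brown fox jumps over the lazy dog. ' x 3;

# Load the module and break the text inside eval, so that a missing
# module or a rejected option produces a warning instead of a crash.
my $broken = eval {
    require Unicode::LineBreak;
    my $lb = Unicode::LineBreak->new(%options);
    $lb->config(MaxColumns => 60);    # update an option after construction
    $lb->break($text);
};

if (defined $broken) {
    print $broken;
}
else {
    warn "Line breaking skipped: $@";
}
```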

  Customizing Line Breaking Behavior
   Formatting Lines
    If you specify a subroutine reference as the value of the "Format"
    option, it should accept three arguments: the LineBreak object instance,
    the type of event and a string. The type of event is a string
    identifying the context in which the subroutine is called. The string is
    a fragment of the Unicode string preceding or following a breaking
    position.

        EVENT |When Fired           |Value of STRING
        -----------------------------------------------------------------
        "sot" |Beginning of text    |Fragment of first line
        "sop" |After mandatory break|Fragment of next line
        "sol" |After arbitrary break|Fragment of continued line
        ""    |Just before any break|Complete line without trailing
              |                     |SPACEs
        "eol" |Arbitrary break      |SPACEs preceding breaking position
        "eop" |Mandatory break      |Newline and its preceding SPACEs
        "eot" |End of text          |SPACEs (and newline) at end of
              |                     |text
        -----------------------------------------------------------------

    The subroutine should return the modified text fragment, or may return
    "undef" to indicate that no modification occurred. Note that a
    modification in the context of "sot", "sop" or "sol" may affect the
    choice of subsequent breaking positions, while modifications in the
    other contexts will not.
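
    The calling convention above can be sketched with a small callback
    that roughly imitates the "TRIM" method. The name trim_format is
    hypothetical, and the sketch always emits "\n" rather than consulting
    the Newline option. It is an approximation written from the event
    table, not the module's own implementation.

```perl
use strict;
use warnings;

# Hypothetical Format callback approximating the built-in "TRIM" method.
# Arguments: the LineBreak object, the event name and the text fragment.
# Returning undef means "leave the fragment unmodified".
sub trim_format {
    my ($lb, $event, $str) = @_;

    # Arbitrary break: drop the SPACEs before the break, emit a newline.
    return "\n" if $event eq 'eol';

    # Mandatory break or end of text: strip the SPACEs, keep any newline.
    if ($event eq 'eop' or $event eq 'eot') {
        (my $trimmed = $str) =~ s/\A +//;
        return $trimmed;
    }

    # "sot", "sop", "sol" and "": no modification.
    return undef;
}
```

    Such a subroutine would be supplied as Format => \&trim_format to new
    or config.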

   Calculating String Size
    If you specify a subroutine reference as the value of the "SizingMethod"
    option, it should accept five arguments: the LineBreak object instance,
    the original size of the string (say LEN), the original Unicode string
    (PRE), additional SPACEs (SPC) and a Unicode string (STR).

    The subroutine should return the calculated size of "PRE.SPC.STR".
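
    A minimal sketch of such a callback follows. It counts every character
    as one column, ignoring East_Asian_Width. The name one_column_per_char
    is hypothetical. The sketch also assumes that LEN is the already
    calculated size of PRE, so that only SPC and STR need measuring; the
    text above does not state this explicitly.

```perl
use strict;
use warnings;

# Hypothetical SizingMethod callback: one column per character.
# Arguments: the LineBreak object, LEN, PRE, SPC and STR.
# Assumption: LEN is the size of PRE, so PRE itself is not re-measured.
sub one_column_per_char {
    my ($lb, $len, $pre, $spc, $str) = @_;

    # Size of PRE.SPC.STR under the one-column-per-character rule.
    return $len + length($spc) + length($str);
}
```

    Such a subroutine would be supplied as
    SizingMethod => \&one_column_per_char to new or config.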

   Character Classifications and Core Line Breaking Rules
    Character classifications and core line breaking rules are defined by
    Unicode::LineBreak::Data and Unicode::LineBreak::Rules. If you wish to
    customize them, see the data directory of the source package.

  Configuration Files
    The built-in defaults of the option parameters for the "new" method can
    be overridden by a configuration file, Unicode/LineBreak/Defaults.pm.
    For more details, read Unicode/LineBreak/Defaults.pm.sample.

  Conformance to Standards
    The character properties this module is based on are defined by the
    Unicode Standard, version 5.1.0.

    This module is intended to implement UAX14-C2.

    *   Some ideographic characters may be treated either as NS or as ID by
        choice.

    *   Hangul syllables and conjoining jamos may be treated as either ID or
        AL by choice.

    *   Characters assigned to AI may be resolved to either AL or ID by
        choice.

    *   Character(s) assigned to CB are not resolved.

    *   Characters assigned to SA are resolved to AL, except that characters
        that have General_Category Mn or Mc are resolved to CM.

    *   Characters assigned to SG or XX are resolved to AL.

CAVEAT
    *To be written*.

BUGS
    Please report bugs or buggy behavior to the developer. See "AUTHOR".

VERSION
    Consult the $VERSION variable.

    Development versions of this module may be found at
    <http://hatuka.nezumi.nu/repos/Unicode-LineBreak/>.

REFERENCES
    [JIS X 4051]
        JIS X 4051:2004 *日本語文書の組版方法* (*Formatting Rules
        for Japanese Documents*), published by Japanese Standards
        Association, 2004.

    [UAX #11]
        A. Freytag (2008). *Unicode Standard Annex #11: East Asian Width*,
        Revision 17. <http://unicode.org/reports/tr11/>.

    [UAX #14]
        A. Freytag and A. Heninger (2008). *Unicode Standard Annex #14:
        Unicode Line Breaking Algorithm*, Revision 22.
        <http://unicode.org/reports/tr14/>.

SEE ALSO
    Text::Wrap.

AUTHOR
    Copyright (C) 2009 Hatuka*nezumi - IKEDA Soji <hatuka(at)nezumi.nu>.

    This program is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

