NAME
    Unicode::Truncate - Unicode-aware efficient string truncation

SYNOPSIS
        use Unicode::Truncate;

        truncate_utf8("hello world", 7);
        ## "hell…";

        truncate_utf8("hello world", 7, '');
        ## "hello w"

        truncate_utf8('深圳', 7);
        ## "深…"

DESCRIPTION
    This module is for truncating UTF-8 encoded Unicode text to particular
    byte lengths while inflicting the least amount of data corruption
    possible. The resulting truncated string will be no longer than your
    specified number of bytes (after UTF-8 encoding).

    With this module's "truncate_utf8", all truncated strings will continue
    to be valid UTF-8: it won't cut in the middle of a UTF-8 encoded
    code-point. Furthermore, if your text contains combining diacritical
    marks, this module will not cut in between a diacritical mark and the
    base character.

RATIONALE
    Why not just use "substr" on a string before UTF-8 encoding it? The main
    problem is that the number of bytes that an encoded string will consume
    is not known until after you encode it. It depends on how many "high"
    code-points are in the string, how "high" those code-points are, the
    normalisation form chosen, and (relatedly) how many combining marks are
    used. Even before encoding, "substr" also may cut in front of combining
    marks.

    Truncating post-encoding may result in invalid UTF-8 partials at the end
    of your string, as well as cutting in front of combining marks.

    I knew I had to write this module after I asked Tom Christiansen about
    the best way to truncate unicode to fit in fixed-byte fields and he got
    angry and told me to never do that. :)

    Of course in a perfect world we would only need to worry about the
    amount of space some text takes up on the screen, in the real world we
    often have to or want to make sure things fit within certain byte size
    capacity limits. Many data-bases, network protocols, and file-formats
    require honouring byte-length restrictions. Even if they automatically
    truncate for you, are they doing it properly and consistently? On many
    file-systems, file and directory names are subject to byte-size limits.
    Many APIs that use C structs have fixed limits as well. You may even
    wish to do things like guarantee that a collection of news headlines
    will fit in a single ethernet packet.

    One interesting aspect of unicode's combining marks is that there is no
    specified limit to the number of combining marks that can be applied. So
    in some interpretations a single decomposed unicode character can take
    up an arbitrarily large number of bytes in its UTF-8 encoding. However,
    there are various recommendations such as the unicode standard UAX15-D3
    <http://www.unicode.org/reports/tr15/#UAX15-D3> "stream-safe" limit of
    30. Reportedly the largest known "legitimate" use is a 1 base + 8
    combining marks grapheme used in a Tibetan script.

ELLIPSIS
    When a string is truncated, "truncate_utf8" indicates this by appending
    an ellipsis. By default this is the character U+2026 (…) however you can
    use any other string by passing it in as the third argument. Note that
    in UTF-8 encoding the default ellipsis consumes 3 bytes (the same as 3
    periods in a row). The length of the truncated content *including* the
    ellipsis is guaranteed to be no greater than the byte size limit you
    specified.

IMPLEMENTATION
    This module uses the ragel <http://www.colm.net/open-source/ragel/>
    state machine compiler to parse/validate UTF-8 and to determine the
    presence of combining characters. Ragel is nice because we can determine
    the truncation location with a single pass through the data in an
    optimised C loop.

    One feature of this design is that it will not scan further than it
    needs to in order to determine the truncation location. So creating
    short truncations of really long strings doesn't even require traversing
    the long strings.

    Another purpose of this module is to be a "proof of concept" for the
    Inline::Filters::Ragel source filter as well as a demonstration of the
    really cool Inline::Module system.

SEE ALSO
    Unicode-Truncate github repo
    <https://github.com/hoytech/Unicode-Truncate>

    Although very efficient, as discussed above, "substr" will not be able
    to give you a guaranteed byte-length output (if done pre-encoding)
    and/or will potentially corrupt text (if done post-encoding).

    There are several similar modules such as Text::Truncate,
    String::Truncate, Text::Elide but they are all essentially wrappers
    around "substr" and are subject to its limitations.

    A reasonable "99%" solution is to encode your string as UTF-8, truncate
    at the byte-level with "substr", decode with "Encode::FB_QUIET", and
    then re-encode it to UTF-8. This will ensure that the output is always
    valid UTF-8, but will still risk corrupting unicode text that contains
    combining marks.

    Ricardo Signes suggested an algorithm using Unicode::GCString which
    would be very correct but likely less efficient.

    It may be possible to use the regexp engine's "\X" combined with "(?{})
    in some way but I haven't been able to figure that out."

BUGS
    This module currently only implements a sub-set of unicode's grapheme
    cluster boundary rules
    <http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>.
    Eventually I plan to extend this so the module "does the right thing" in
    more cases. Of course I can't test this on all the writing systems of
    the world so I don't know the severity of the corruption in all
    situations. It's possible that the corruption can be minimised in
    additional ways without sacrificing the simplicity or efficiency of the
    algorithm. If you have any ideas please let me know and I'll try to
    incorporate them.

    One obvious enhancement for languages that use white-space is to chop
    off the last (potentially partial) word up to the next whitespace block:
    "s/\S+$//" (note you'll have to worry about the ellipsis yourself in
    this case).

    Currently building this module requires Inline::Filters::Ragel to be
    installed. I'd like to add an option to Inline::Module that has ragel
    run at dist time instead.

    Perl internally supports characters outside what is officially unicode.
    This module only works with the official UTF-8 range so if you are using
    this perl extension (perhaps for some sort of non-unicode sentinel
    value) this module will throw an exception indicating invalid UTF-8
    encoding.

AUTHOR
    Doug Hoyte, "<doug@hcsw.org>"

COPYRIGHT & LICENSE
    Copyright 2014 Doug Hoyte.

    This module is licensed under the same terms as perl itself.

POD ERRORS
    Hey! The above document had some coding errors, which are explained
    below:

    Around line 201:
        Unterminated C<...> sequence

