package Text::Categorize::Textrank::En;

use strict;
use warnings;
use Log::Log4perl;
use Text::StemTagPOS;
use Text::Categorize::Textrank;
use Data::Dump qw(dump);

# TODO: need parameter for maximum phrase length.

BEGIN {
    use Exporter ();
    use vars qw($VERSION @ISA @EXPORT @EXPORT_OK %EXPORT_TAGS);
    $VERSION     = '0.50';
    @ISA         = qw(Exporter);
    @EXPORT      = qw(getTextrankInfoOfText);
    @EXPORT_OK   = qw(getTextrankInfoOfText);
    %EXPORT_TAGS = ();
}

#12345678901234567890123456789012345678901234
#Find potential keywords in English text.

=head1 NAME

C<Text::Categorize::Textrank::En> - Find potential keywords in English text.

=head1 SYNOPSIS

	use strict;
	use warnings;
	use Text::Categorize::Textrank::En;
	use Data::Dump qw(dump);
	my $textrankerEn = Text::Categorize::Textrank::En->new();
	my $text         = 'This is the first sentence. Here is the second sentence.';
	my $results      = $textrankerEn->getTextrankInfoOfText(listOfText => [$text]);
	dump $results->{hashOfTextrankValues};

=head1 DESCRIPTION

C<Text::Categorize::Textrank::En> provides methods for ranking the words in English
text as potential keywords. It implements a version of the textrank algorithm
from the report I<TextRank: Bringing Order into Texts> by R. Mihalcea and P. Tarau.

All text should be encoded in Perl's internal format; see L<Text::Iconv> or L<Encode> for
converting text from various encodings.
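
As a minimal sketch of preparing input, assuming the raw bytes are UTF-8
encoded (the variable names are illustrative only), the core L<Encode> module
can decode them into Perl's internal character format before they are passed
in C<listOfText>:

```perl

  use strict;
  use warnings;
  use Encode qw(decode);

  # raw UTF-8 octets, e.g. as read from a file opened without an encoding layer.
  my $octets = "na\xC3\xAFve caf\xC3\xA9";

  # decode the octets into a Perl character string.
  my $text = decode('UTF-8', $octets);

  # $text now holds characters rather than bytes and could be passed
  # as an element of listOfText.
  printf "%d bytes decoded to %d characters\n", length $octets, length $text;

```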

=head1 CONSTRUCTOR

=head2 C<new>

The method C<new> creates an instance of the C<Text::Categorize::Textrank::En>
class with the following parameters:

=over

=item C<endingSentenceTag>

 endingSentenceTag => 'PP'

C<endingSentenceTag> is the part-of-speech tag that should be used to indicate
the end of a sentence. The default is 'PP'. The value of this tag must be
a tag generated by the module L<Lingua::EN::Tagger>.

=item C<listOfPOSTypesToKeep>

 listOfPOSTypesToKeep => [qw(TEXTRANK_WORDS)]

The textrank algorithm preprocesses the text so that only certain parts-of-speech (POS) are retained
and used to build the graph representing the text. The module L<Lingua::EN::Tagger> is used
to tag the parts-of-speech of the text. The parts-of-speech retained can be specified by
word types, where the type is a combination of 'ALL', 'ADJECTIVES', 'ADVERBS', 'CONTENT_WORDS', 'NOUNS', 'PUNCTUATION',
'TEXTRANK_WORDS', or 'VERBS'. The default is C<[qw(TEXTRANK_WORDS)]>, which equates to
C<[qw(ADJECTIVES NOUNS)]>.

=item C<listOfPOSTagsToKeep>

 listOfPOSTagsToKeep => [...]

C<listOfPOSTagsToKeep> provides finer control over the
parts-of-speech to be retained when filtering the tagged text. For a list
of all the possible tags, call C<getListOfPartOfSpeechTags()>.

=back

=cut

sub new
{
  my ($Class, %Parameters) = @_;
  my $Self = bless ({}, ref ($Class) || $Class);

  # get the POS/stemmer engine.
  $Self->{posTaggerStemmerEngine} = Text::StemTagPOS->new (%Parameters);

  return $Self;
}

=head1 METHODS

=head2 C<getTextrankInfoOfText>

 getTextrankInfoOfText (...)

The method C<getTextrankInfoOfText> returns a data structure (hash-reference)
containing all the stemmed
words partitioned into their sentences (L<listOfStemmedTaggedSentences|Text::StemTagPOS/getStemmedAndTaggedText>), the subset of words
used to compute the textranks (L<listOfFilteredSentences|Text::StemTagPOS/getTaggedTextToKeep>), and the textrank
of the tokens (L<hashOfTextrankValues|Text::Categorize::Textrank/getTextrankOfListOfTokens>) that occur in C<listOfFilteredSentences>. The sum of all the textrank
values is one.

More precisely, if C<$results> is the returned hash, then C<$results-E<gt>{listOfStemmedTaggedSentences}>
contains the array reference generated by the L<getStemmedAndTaggedText|Text::StemTagPOS/getStemmedAndTaggedText>
method of L<Text::StemTagPOS>, C<$results-E<gt>{listOfFilteredSentences}>
contains the array reference generated by L<getTaggedTextToKeep|Text::StemTagPOS/getTaggedTextToKeep>
of L<Text::StemTagPOS>, and C<$results-E<gt>{hashOfTextrankValues}>
holds the hash of the textrank values computed by L<getTextrankOfListOfTokens|Text::Categorize::Textrank/getTextrankOfListOfTokens>.
C<$results-E<gt>{useStemmedWords}> is also set to the value of C<useStemmedWords>.

=over

=item C<listOfStemmedTaggedSentences>

 listOfStemmedTaggedSentences => [...]

C<listOfStemmedTaggedSentences> is the array reference containing the list of stemmed and part-of-speech
tagged sentences from L<Text::StemTagPOS>. If C<listOfStemmedTaggedSentences> is not defined, then the
text to be processed should be provided via C<listOfText>.

=item C<listOfText>

 listOfText => [...]

C<listOfText> is an array reference containing the strings of text to be categorized. C<listOfText> is
only used if C<listOfStemmedTaggedSentences> is undefined.

=item C<edgeCreationSpan>

  edgeCreationSpan => 1

For each word in the text, C<edgeCreationSpan> is the number of successive
words used to make an edge in the textrank token graph. For example, if
C<edgeCreationSpan> is two, then given the word sequence C<"apple orange pear">
the edges C<[apple, orange]> and C<[apple, pear]> will be added to the text
graph for the word C<apple>. The default is one.

Note that loop edges are ignored. For example,
if C<edgeCreationSpan> is two, then given the word sequence C<"daba daba doo">
the edge C<[daba, daba]> is discarded but the edge C<[daba, doo]> is
added to the token graph.

=item C<directedGraph>

  directedGraph => 0

If C<directedGraph> is true, the textranks
are computed from the directed token graph; if false, they
are computed from the undirected version of the graph. The default is false.

=item C<pageRankDampeningFactor>

  pageRankDampeningFactor => 0.85

When computing the textranks of the token graph, the dampening factor
specified by C<pageRankDampeningFactor> will
be used; it should range from zero to one. The default is 0.85.

=begin html

The Wikipedia article on <a href="http://en.wikipedia.org/wiki/PageRank">pagerank</a> has a good explanation of the
<a href="http://en.wikipedia.org/wiki/PageRank#Damping_factor">dampening factor</a>.<br>&nbsp;

=end html

=item C<addEdgesSpanningSentences>

  addEdgesSpanningSentences => 1

If C<addEdgesSpanningSentences> is true, then when building the token graph, links
between the tokens at the end of a sentence and the beginning of the next sentence
will be made. For example, for the sentences C<[[qw(This is the first list)], [qw(Here is the second list)]]>
the edge C<[list, Here]> will be added to the token graph. The default is true.

=item C<useStemmedWords>

  useStemmedWords => 1

If C<useStemmedWords> is true, then when building the token graph, the stemmed
words are used as the id of each node, otherwise the original words
are used; in both cases the stemmed or original words are converted to
lowercase. The default is true.

=back
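
The effect of C<edgeCreationSpan>, C<addEdgesSpanningSentences>, and
C<pageRankDampeningFactor> can be illustrated with a small self-contained
sketch. It is not the code used by L<Text::Categorize::Textrank>: the helpers
C<buildEdges> and C<rankTokens> are hypothetical, joining all sentences into
one sequence is only an assumption about how spanning edges are made, and the
ranking is a plain pagerank power iteration on the undirected graph.

```perl

  use strict;
  use warnings;

  # hypothetical helper: list the edges created from sentences of tokens.
  sub buildEdges
  {
    my ($sentences, $span, $spanSentences) = @_;

    # assumption: if edges may span sentences, treat all tokens as one sequence.
    my @lists = $spanSentences ? ([map { @$_ } @$sentences]) : @$sentences;

    my @edges;
    foreach my $list (@lists)
    {
      for my $i (0 .. $#$list)
      {
        for my $j ($i + 1 .. $i + $span)
        {
          last if $j > $#$list;

          # loop edges such as [daba, daba] are ignored.
          next if $list->[$i] eq $list->[$j];
          push @edges, [$list->[$i], $list->[$j]];
        }
      }
    }
    return \@edges;
  }

  # hypothetical helper: pagerank by power iteration on the undirected graph.
  sub rankTokens
  {
    my ($edges, $damping, $iterations) = @_;

    # build undirected adjacency sets; duplicate edges collapse into one.
    my %neighbors;
    foreach my $edge (@$edges)
    {
      my ($from, $to) = @$edge;
      $neighbors{$from}{$to} = 1;
      $neighbors{$to}{$from} = 1;
    }

    my @nodes = sort keys %neighbors;
    my $total = scalar @nodes;
    return {} unless $total;

    # start from the uniform distribution so the ranks sum to one.
    my %rank = map { ($_ => 1 / $total) } @nodes;
    for (1 .. $iterations)
    {
      my %next = map { ($_ => (1 - $damping) / $total) } @nodes;
      foreach my $node (@nodes)
      {
        my @adjacent = keys %{$neighbors{$node}};
        my $share = $damping * $rank{$node} / scalar (@adjacent);
        $next{$_} += $share foreach @adjacent;
      }
      %rank = %next;
    }
    return \%rank;
  }

  # edges for "apple orange pear" with a span of two, then their ranks.
  my $edges = buildEdges ([[qw(apple orange pear)]], 2, 0);
  my $ranks = rankTokens ($edges, 0.85, 50);
  printf "%-6s %.4f\n", $_, $ranks->{$_} foreach sort keys %$ranks;

```

Because loop edges are dropped and the graph is undirected, every node in the
sketch has at least one neighbor, so the ranks keep summing to one at each
iteration.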

=cut

sub getTextrankInfoOfText
{
  my ($Self, %Parameters) = @_;

  # get the text to process.
  my $listOfStemmedTaggedSentences;
  if (exists ($Parameters{listOfStemmedTaggedSentences}))
  {
    $listOfStemmedTaggedSentences = $Parameters{listOfStemmedTaggedSentences};
  }
  elsif (exists($Parameters{listOfText}))
  {
    $listOfStemmedTaggedSentences = $Self->{posTaggerStemmerEngine}->getStemmedAndTaggedText ($Parameters{listOfText});
  }
  else
  {
    my $logger = Log::Log4perl->get_logger();
    $logger->logdie("error: one of the parameters 'listOfStemmedTaggedSentences' or 'listOfText' must be defined.");
  }

  # set the parameter to use the original or stemmed word.
  my $useStemmedWords = 1;
  $useStemmedWords = $Parameters{useStemmedWords} if exists $Parameters{useStemmedWords};
  my $tokenIndex;
  if ($useStemmedWords) { $tokenIndex = Text::StemTagPOS::WORD_STEMMED; }
  else { $tokenIndex = Text::StemTagPOS::WORD_ORIGINAL; }

  # set the addEdgesSpanningLists option via the addEdgesSpanningSentences flag.
  $Parameters{addEdgesSpanningLists} = 1;
  $Parameters{addEdgesSpanningLists} = $Parameters{addEdgesSpanningSentences} if exists $Parameters{addEdgesSpanningSentences};

  # filter the tagged text down to only the parts-of-speech that are to be kept.
  my $listOfFilteredSentences = $Self->{posTaggerStemmerEngine}->getTaggedTextToKeep (listOfStemmedTaggedSentences => $listOfStemmedTaggedSentences);

  # build the list of sentences containing only the stemmed words kept.
  my @listOfTokens;
  foreach my $sentence (@$listOfFilteredSentences)
  {
    # skip empty sentences.
    next unless @$sentence;

    # use the lowercased stemmed or original word (per $tokenIndex) as the token.
    push @listOfTokens, [map {lc $_->[$tokenIndex]} @$sentence];
  }

  # get the textrank of the tokens.
  my $hashOfTextrankValues = getTextrankOfListOfTokens (%Parameters, listOfTokens => \@listOfTokens);

  # store the tagged text, filtered text, and textrank values in a hash.
  # all this info is needed to build the keywords and phrases.
  my %textrankInfo;
  $textrankInfo{listOfStemmedTaggedSentences} = $listOfStemmedTaggedSentences;
  $textrankInfo{listOfFilteredSentences} = $listOfFilteredSentences;
  $textrankInfo{hashOfTextrankValues} = $hashOfTextrankValues;
  $textrankInfo{useStemmedWords} = $useStemmedWords;

  return \%textrankInfo;
}


=head1 INSTALLATION

To install the module run the following commands:

  perl Makefile.PL
  make
  make test
  make install

If you are on a Windows box you should use 'nmake' rather than 'make'.

=head1 BUGS

Please email bug reports or feature requests to C<bug-text-categorize-textrank-en@rt.cpan.org>, or submit them through
the web interface at L<http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Text-Categorize-Textrank-En>.  The author
will be notified and you can be automatically notified of progress on the bug fix or feature request.

=head1 AUTHOR

 Jeff Kubina <jeff.kubina@gmail.com>

=head1 COPYRIGHT

Copyright (c) 2009 Jeff Kubina. All rights reserved.
This program is free software; you can redistribute
it and/or modify it under the same terms as Perl itself.

The full text of the license can be found in the
LICENSE file included with this module.

=head1 KEYWORDS

categorize, english, keywords, keyphrases, nlp, pagerank, textrank

=head1 SEE ALSO

=begin html

This package implements the Textrank algorithm from the report
<a href="http://bit.ly/akSJok">TextRank: Bringing Order into Texts</a>
by <a href="http://www.cse.unt.edu/~rada/">Rada Mihalcea</a> and <a href="http://www.cse.unt.edu/~tarau/">Paul Tarau</a>,
which is related to <a href="http://en.wikipedia.org/wiki/PageRank">pagerank</a>.

See the Lingua::EN::Tagger <a href="http://cpansearch.perl.org/src/ACOBURN/Lingua-EN-Tagger-0.15/README">README</a>
file for a list of the part-of-speech tags.

=end html

L<Lingua::EN::Tagger>, L<Lingua::Stem::Snowball>,  L<Log::Log4perl>, L<Text::Categorize::Textrank>, L<Text::StemTagPOS>

=cut

1;
# The preceding line will help the module return a true value
