#!/usr/bin/perl -w

#-------------------------------------------------------------------------

# Make sure we pre-declare everything
# Use the necessary Perl modules

use strict;
use NexTrieve qw(PDF);

# Output warning if it is the wrong version

warn <<EOD if $NexTrieve::VERSION ne '0.37';
$0 is not using the right version of the associated Perl modules.
Operation of this script may be in error because of this.
EOD

# Create a NexTrieve object

my $ntv = NexTrieve->new( {PrintError => 1} );

# If there are no arguments specified whatsoever
#  Make sure we can execute external programs when in taint mode
#  Show the POD documentation
#  And exit

if (@ARGV==0 and -t STDIN) {
  $ntv->untaint( $ENV{'PATH'} );
  exec( 'perldoc',$0 );
  exit;
}

# Initialize the bare XML flag
# Initialize the default encoding
# Initialize the no processing instruction flag
# Initialize the maximum title length
# Initialize the list of files
# Define the flag to be used

my $bare = '';
my $defaultencoding = 'iso-8859-1';
my $nopi = '';
my $titlemax = '';
my @file = ();
my $flag;

# For all of the parameters
#  If it is a (new) flag
#   Obtain the flag
#   If forcing bare XML, set flag and reset the global flag
#   If forcing no processing instruction, set flag and reset the global flag
#   Reloop
#  Warn user if parameter without known flag and exit

foreach (@ARGV) {
  if (m#^-(\w)#) {
    $flag = $1;
    $bare = 1, $flag = '' if $flag eq 'b';
    $nopi = 1, $flag = '' if $flag eq 'n';
    next;
  }
  die "Must specify type of parameter first\n" unless $flag;

#  Add filename if we're collecting filenames

  push( @file,$_ ) if $flag eq 'f';

# If a default encoding is specified
#  Obtain that encoding

  if ($flag eq 'E') {
    $defaultencoding = $_; $flag = '';

#  Elseif it is a maximum length of a title
#   Die now if strange value
#   Set maximum length and reset flag (no more values for this)

  } elsif ($flag eq 't') {
    die "Cannot specify '$_' as maximum length for title attributes\n"
     unless m#^\d+$#;
    $titlemax = $_; $flag = '';
  }
}

# Add any files that are being piped in
# Make sure any newlines are removed from files

push( @file,<STDIN> ) unless -t STDIN;
chomp( @file );

# If there are files to be processed
#  Make sure they are all valid
# Else (no files)
#  Warn the user and quit

if (@file) {
  @file = grep {m#[ <>`]# ? !warn( "Skipping $_\n" ) : 1} @file;
} else {
  die "Nothing to be done!\n";
}

# Set the default encoding if there is one specified
# Create the converter object
# If the maximum length of the title as an attribute is limited
#  Set the limitation with an attribute processor

$ntv->DefaultInputEncoding( $defaultencoding ) if $defaultencoding;
my $converter = $ntv->PDF->pdfsimple;
if ($titlemax) {
  $converter->attribute_processor( ['title',sub {substr($_[0],0,$titlemax)}] );
}

# Create the docseq object
# Set the bare XML flag if so specified
# Set the no processing instruction flag if so specified
# Make sure we'll be streaming to STDOUT to reduce memory requirements
# Do the conversion and finish up the stream

my $docseq = $ntv->Docseq;
$docseq->bare( $bare ) if $bare;
$docseq->nopi( $nopi ) if $nopi;
$docseq->stream;
$converter->Docseq( $docseq,@file )->done;

#-------------------------------------------------------------------------

__END__

=head1 pdf2ntvml [-E defaultinputencoding] [-b] [-n] [-t 256] [-f files]

PDF to XML converter for use with NexTrieve.

=head2 Attributes

XML <filename> attribute contains the filename of the PDF file.

XML <title> attribute contains the title of the PDF file (if any).

=head2 Text-types

XML <title> text-type contains the title of the PDF file (if any).

=head2 Example Output

 <document>
  <attributes>
   <filename>example.pdf</filename>
   <title>Example</title>
  </attributes>
  <text>
   <title>Example</title>
   Text found in the PDF-file
  </text>
 </document>

=head2 Why <title> both an attribute as well as a text-type?

A match for a query word in the <title> text-type can be far more significant
than a match for that word in the body text.  However, if you want to display
the information of a hit, it is handy to have the title (and the filename) of
the document available as well.  That is why the title is also added as an
attribute, even though it cannot be used for constraining a query (at least,
not yet).

=head1 Usage

 pdf2ntvml -f file1 file2 file3 > xml

 pdf2ntvml <files.list > xml

=head2 Example

Convert all .pdf files in the "doc_root" directory to XML and store that XML
in the file "xml".

 pdf2ntvml -f doc_root/*.pdf >xml

=head2 Example

Index all of the files located by the find command.

 find / -iname '*.pdf' | pdf2ntvml | docseq | ntvindex -

=head1 Requirements

Requires the availability of the NexTrieve.pm module and associated modules
as found on CPAN (http://www.cpan.org/).  Also requires the availability of
the "pdfinfo" and "pdftotext" programs of the xpdf package, located at
http://www.foolabs.com/xpdf/ .
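
Before starting a large run, it can be handy to confirm that both xpdf
programs are reachable.  The following is a minimal sketch and not part of
pdf2ntvml itself: the helper name "find_in_path" is hypothetical, and it is
assumed that the programs are located through the PATH environment variable.

```perl
 use strict;
 use warnings;
 use File::Spec;

 # Hypothetical helper: return the full path of the first executable named
 # $program found in the directories of $ENV{PATH}, or undef if there is none
 sub find_in_path {
   my ($program) = @_;
   foreach my $dir (File::Spec->path) {
     my $candidate = File::Spec->catfile( $dir,$program );
     return $candidate if -f $candidate and -x _;
   }
   return;
 }

 # Report on the two programs named in the requirements above
 foreach my $program (qw(pdfinfo pdftotext)) {
   my $found = find_in_path( $program );
   print defined $found ? "$program: $found\n" : "$program: not found\n";
 }
```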

=head1 Parameter settings

=head2 -f file1 file2 file3: read filenames from command line

If you want to specify the filenames from the command line rather than pipe
them from STDIN, that is possible by specifying the -f parameter, followed
by the list of files you want to process.  This is additional to any filenames
piped through STDIN.
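
The script is written to skip, with a warning, any filename that contains a
space, a less-than sign, a greater-than sign or a backtick.  The sketch below
illustrates that filter on its own; the helper name "safe_files" is
hypothetical and only serves this example.

```perl
 use strict;
 use warnings;

 # Hypothetical helper: keep only names without a space, <, > or backtick
 sub safe_files {
   return grep {
     my $ok = !m#[ <>`]#;
     warn "Skipping $_\n" unless $ok;
     $ok;
   } @_;
 }

 # Two of these four names are rejected and reported on STDERR
 my @file = safe_files( 'a.pdf','my file.pdf','b.pdf','x`y.pdf' );
 print "$_\n" foreach @file;
```

Filenames with spaces therefore need to be renamed (or linked to under another
name) before they can be converted.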

=head2 -E defaultinputencoding

Many older PDF files do not contain the information needed to determine which
character encoding is being used.  With the -E parameter, you can specify
which encoding should be assumed if no encoding information is found.  The
default is "iso-8859-1".
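
To illustrate what assuming an encoding means for the text itself, the sketch
below interprets a few bytes as iso-8859-1 with the core Encode module and
re-encodes them as UTF-8.  This only illustrates the concept; it is not a
description of how the NexTrieve modules handle encodings internally, and the
choice of UTF-8 as the target is an assumption made for the example.

```perl
 use strict;
 use warnings;
 use Encode qw(decode encode);

 # The single byte 0xE9 is "e with acute accent" in iso-8859-1
 my $bytes = "caf\xE9";

 # Interpret the bytes using the assumed default encoding
 my $text = decode( 'iso-8859-1',$bytes );

 # Re-encode as UTF-8, where the accented character takes two bytes
 my $utf8 = encode( 'UTF-8',$text );

 printf "%d byte(s) in, %d byte(s) out\n",length $bytes,length $utf8;
```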

=head2 -n

If you are merging multiple runs of this script into the same file, you do
not need the <?xml...?> processing instruction to be repeated.  By specifying
this flag, the processing instruction will B<not> be emitted to the XML stream.

=head2 -b

If you are merging results of multiple runs, you may also want the <ntv:docseq>
container not to be emitted.  Specifying the -b (for "bare XML") flag does just
that.

=head2 -t 256: maximum length for title attribute

Some PDF files contain B<very> long titles.  Some people do this to get higher
rankings, as many search engines value text in a title more than text in the
body (or search only in the title).

Experience has shown that titles of more than 10K are not uncommon.  This
however causes all sorts of problems in the display of the hitlists (where
the title is one of the attributes returned) and in general it brings down
the performance.

The -t parameter allows you to set a maximum length for the title as stored
as an attribute (and therefore returned in the hitlist).  It does B<not> alter
the length of the title stored as a text-type.

The default for -t is B<0>, indicating no limit on the length of the title.
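
The script applies the limit by handing the converter a subroutine that keeps
only the first part of the title.  The sketch below shows that truncation in
isolation, without the NexTrieve modules; the sample titles are made up for the
example.

```perl
 use strict;
 use warnings;

 # Value as it would be given with "-t 256"
 my $titlemax = 256;

 # Same kind of closure as the script passes for the "title" attribute
 my $truncate = sub { substr( $_[0],0,$titlemax ) };

 # A long title is cut down, a short one is returned unchanged
 my $long  = 'x' x 10_000;
 my $short = 'Example';
 print length( $truncate->( $long ) ),"\n";
 print $truncate->( $short ),"\n";
```

Note that the limit counts characters of the title as the subroutine receives
it, so a title shorter than the limit is never padded or otherwise altered.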

=head1 AUTHOR

Elizabeth Mattijsen, <liz@dijkmat.nl>.

Please report bugs to <perlbugs@dijkmat.nl>.

=head1 COPYRIGHT

Copyright (c) 1995-2002 Elizabeth Mattijsen <liz@dijkmat.nl>. All rights
reserved.  This program is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

=head1 SEE ALSO

http://www.nextrieve.com and the NexTrieve::xxx modules.

=cut
