=head1 NAME

SenseClusters

=head1 SYNOPSIS

SenseClusters is a suite of Perl programs that supports unsupervised 
clustering of similar contexts. It relies on its own native methodology, 
and also provides support for Latent Semantic Analysis.

SenseClusters is a complete system that takes users from preprocessing of  
raw text to providing clustered output. It supports the selection of  
features, the creation of various kinds of context representations,  
dimensionality reduction by singular value decomposition, clustering, 
and analysis of results. 

SenseClusters integrates specialized tools such as the Ngram Statistics 
Package (NSP), SVDPACK, the Perl Data Language (PDL) and CLUTO to provide 
a variety of choices and high efficiency at each step in its processing.

=head1 OVERVIEW

SenseClusters supports several different methods of clustering contexts. 
These include the native SenseClusters methodology, which is based on the 
use of first and second order representations of contexts. It also 
includes support for clustering lexical features using the native  
SenseClusters methodology or Latent Semantic Analysis. 

SenseClusters is based strictly on lexical features and does not rely on  
any manually created training data or external knowledge sources, and as  
such is language independent. The only requirement is that text in the 
language can be tokenized with Perl regular expressions, which the user 
can specify. In fact, tokenization is flexible enough that features 
could consist of single characters, pairs of characters, and so on. 
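
As a rough illustration of this flexibility, the sketch below shows how 
the choice of regular expression changes what counts as a token. It is 
not SenseClusters code and does not use the package's own token 
definition format; the helper name tokenize and the sample text are 
made up for this example.

```perl

    use strict;
    use warnings;

    # Hypothetical helper (not part of SenseClusters): return every
    # non-overlapping match of $token_regex in $text, left to right.
    sub tokenize {
        my ($text, $token_regex) = @_;
        my @tokens;
        while ($text =~ /($token_regex)/g) {
            push @tokens, $1;    # $1 is always the whole match
        }
        return @tokens;
    }

    my $text = "Fine wine";

    # Tokens as whole words.
    print join(" ", tokenize($text, qr/\w+/)), "\n";    # Fine wine

    # Tokens as single characters.
    print join(" ", tokenize($text, qr/\w/)), "\n";     # F i n e w i n e

    # Tokens as non-overlapping pairs of characters.
    print join(" ", tokenize($text, qr/\w\w/)), "\n";   # Fi ne wi ne

```

The same text yields word, character, or character-pair tokens 
depending only on the regular expression supplied, which is why no 
language-specific tokenizer is needed.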

SenseClusters can be applied to the problem of discriminating word  
meanings or ambiguous names, using the target or head word representation.  
This is sometimes also called "headed" data, where each context is 
centered around the given target whose meanings are to be discovered. In 
this case the contexts that contain the given target word are clustered, 
and each cluster is assumed to correspond to a different meaning of that 
word. 

SenseClusters can also be applied to the problem of grouping short units 
of text that have no target or head (which is sometimes referred to as a 
"headless" representation). In this case there is no head or center to the  
context, so the entire context is being clustered to determine the 
meaning or topic of the context as a whole. Email categorization or news  
article clustering are examples of problems that could be approached 
using headless data. 

SenseClusters can automatically determine the number of clusters in the 
data using any of several automatic stopping measures we have 
developed, three of which are based on clustering criterion functions 
and one of which is an adaptation of the well-known Gap Statistic. 

SenseClusters can also be applied to the problem of clustering words or 
lexical features, in hopes of discovering synonyms, antonyms, or other 
classes of words. 

Broadly speaking, SenseClusters can be used for any task that requires the 
recognition of contextually similar units of text, or words that occur 
in similar contexts.

=head1 DOCUMENTATION

SenseClusters' documentation is available ONLINE at:
http://senseclusters.sourceforge.net/SenseClusters-Code-README.html

For OFFLINE browsing, the directory Docs/HTML is provided in SenseClusters' 
main package directory. The SenseClusters-Code-README.html file can be 
found there and browsed locally.

All programs have inline source code documentation written in POD format, 
which can be browsed from the command line as a man page or with 
the 'perldoc' command. For example, 'man bitsimat.pl' or 'perldoc 
bitsimat.pl' will display the documentation for the bitsimat.pl program.
Each program also has a --help option to provide information about program 
options. 

=head1 GETTING STARTED

You might first like to run the demo scripts in the Demos/ directory to 
get an idea of SenseClusters' usage and functionality, or try the web 
interface that is provided at http://senseclusters.sourceforge.net.

Demos/ contains scripts that utilize the wrapper program discriminate.pl 
that calls various other programs from the package to run a complete  
experiment. It also contains examples where specialized experiments are  
constructed directly from the programs provided in the package.  In 
general it would be useful to consult the flowcharts in Docs/Flows to 
understand the overall structure of the package. 

The web interface provides an intuitive means of formulating and running  
discriminate.pl commands, so using the web interface can certainly  
be instructive in learning how to formulate discriminate.pl commands.

The contexts that you wish to cluster must be in Senseval-2 format. This 
is a simple XML markup that indicates the beginning and end of each 
context, and allows you to specify a target word and a "correct" 
categorization of the context, if you know that information. There is a
pre-processing program text2sval.pl in Toolkit/preprocess/plain/ that  
converts plain text data (with a single context on each line) into 
Senseval-2 format. There is also a large amount of sample data 
that is already in Senseval-2 format available at 
http://senseclusters.sourceforge.net
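
To give a feel for what such a conversion involves, here is a minimal 
sketch that wraps plain text contexts, one per line, in 
Senseval-2 style markup. It is not the text2sval.pl program. The helper 
name lines_to_sval2, the element and attribute names, the 
lang="english" attribute, and the NOTAG placeholder are all assumptions 
based on the usual Senseval-2 lexical sample layout, so check them 
against the actual output of text2sval.pl and the sample data before 
relying on them.

```perl

    use strict;
    use warnings;

    # Hypothetical helper (not part of SenseClusters): wrap each plain
    # text context in assumed Senseval-2 style instance markup. No XML
    # escaping is performed, and no target word is marked.
    sub lines_to_sval2 {
        my ($lexelt, @contexts) = @_;
        my @out = (qq{<corpus lang="english">}, qq{<lexelt item="$lexelt">});
        my $id = 0;
        for my $context (@contexts) {
            push @out,
                qq{<instance id="$id">},
                qq{<answer instance="$id" senseid="NOTAG"/>},    # "correct" class unknown
                "<context>", $context, "</context>",
                "</instance>";
            $id++;
        }
        push @out, "</lexelt>", "</corpus>";
        return join("\n", @out) . "\n";
    }

    # Two made-up contexts, as they might appear on two lines of a file.
    print lines_to_sval2("LEXELT",
        "he drew a line in the sand",
        "the phone line was busy");

```

Each input line becomes one instance whose context is delimited 
explicitly, which is the property the clustering programs depend on.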

You can also (optionally) provide a separate training file in plain text 
format to be used as the feature selection data. If you don't do this, 
then the features will be selected from the contexts to be clustered.

=head1 PACKAGE ORGANIZATION

After downloading and unpacking SenseClusters, you should find the following 
files and directories within the SenseClusters directory.

=over 4

=item * README.SC.pod

This file.

=item * INSTALL

The installation guide, which lists all package dependencies.

=item * discriminate.pl

A wrapper program that acts as a driver for many other programs in 
the package. It clusters the given text instances based on their  
contextual similarities.

=item * Demos/

A directory of scripts that demonstrate SenseClusters' usage and 
functionality. 

=item * Toolkit/

A directory of Perl programs implemented and used by SenseClusters. Users
who are interested in using SenseClusters' tools individually, without 
the wrapper programs, are encouraged to browse through the 
Toolkit and Toolkit.pod.

=item * Docs/

A directory of SenseClusters' documentation in HTML format. 

Directory Docs/Flows/ contains flow diagrams that illustrate how to put 
together the programs provided in SenseClusters' Toolkit with other packages 
like NSP, SVDPACK and CLUTO to run experiments without wrappers.

=item * Testing/ 

A directory of test cases, written as C-shell scripts, that check whether 
the package is installed properly. 

=item * Web/

A directory containing a web interface for SenseClusters that is easy to 
install and use. 

=item * Changes/

A directory of changelogs that document the changes and improvements made 
in each version.

=item * Makefile.PL 

Generates a Makefile when you run 'perl Makefile.PL'.

=item * GPL.txt

A copy of the GNU General Public License, the terms under which SenseClusters
is distributed.

=item * FDL.txt

A copy of the GNU Free Documentation License, the terms under which the
documentation of SenseClusters is distributed.

=back

=head1 CONTACT US

SenseClusters was originally developed and maintained by Amruta Purandare  
and Ted Pedersen from September 2002 until August 2004. Since that time 
it has been developed and maintained by Anagha Kulkarni and Ted Pedersen. 

Please join our mailing lists to participate in discussions about the 
package, to post questions or bug reports, and to suggest 
enhancements to the package's functionality.

To subscribe to the users' mailing list, visit: 
http://lists.sourceforge.net/lists/listinfo/senseclusters-users

To subscribe to a low-volume news mailing list, visit: 
http://lists.sourceforge.net/lists/listinfo/senseclusters-news

To subscribe to the developers' mailing list, visit: 
http://lists.sourceforge.net/lists/listinfo/senseclusters-developers

The most recent version of SenseClusters can be downloaded from:
http://senseclusters.sourceforge.net/

=head1 SEE ALSO

SenseClusters' ONLINE Documentation at 
http://senseclusters.sourceforge.net/SenseClusters-Code-README.html

=head1 AUTHORS
 
 Ted Pedersen
 University of Minnesota, Duluth
 tpederse@d.umn.edu
 http://www.d.umn.edu/~tpederse/

 Amruta Purandare
 University of Pittsburgh
 amruta@cs.pitt.edu
 http://www.cs.pitt.edu/~amruta/

 Anagha Kulkarni
 University of Minnesota, Duluth
 kulka020@d.umn.edu
 http://www.d.umn.edu/~kulka020/

 Mahesh Joshi
 University of Minnesota, Duluth
 joshi031@d.umn.edu
 http://www.d.umn.edu/~joshi031/

=head1 ACKNOWLEDGMENTS

This work has been partially supported by a National Science Foundation 
Faculty Early CAREER Development award (Grant #0092784). 

We would also like to express our special thanks to: 

Dr. George Karypis and his research group for developing CLUTO, 
Dr. Michael Berry and the co-developers of SVDPACK and SVDPACKC, 
Christian Soeller and the PDL developers' team, 
and Satanjeev Banerjee for developing the Ngram Statistics Package. 

=head1 COPYRIGHT

Copyright (c) 2003-2006,  Ted Pedersen,  Amruta Purandare,  Anagha Kulkarni, and Mahesh Joshi 

This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program; if not, write to

The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA  02111-1307, USA.

=cut
