


Simple(3)      User Contributed Perl Documentation      Simple(3)


NNNNAAAAMMMMEEEE
       CompBio::Simple.pm  - Simple OO interface to some basic
       methods useful in bioinformatics.

SSSSYYYYNNNNOOOOPPPPSSSSIIIISSSS
       use CompBio::Simple;

       `my $cbs = CompBio::Simple-'new;>

       `my $fa_seq = $cbs-'tbl_to_fa($tblseq,%params);>

       $tblseq could be either scalar, array by reference, 2d
       array by reference, or a hash by reference; return data
       type is same as submitted (see RETURN_TYPE option under
       general method description). Scalar may be either sequence
       records (one record per line), or a filename. Array's
       should contain an entire sequence record (loci\tsequence)
       in each indexed position. 2D arrays should be ids in the
       first column, sequences in the second, such that
       $array[2][1] would be the aa sequence for the third
       record. Hash's should have the loci as the key, and only
       the sequence as the value. Newlines at the ends of
       sequences in array and hash types are unneccisarry and
       will be stripped off

DDDDEEEESSSSCCCCRRRRIIIIPPPPTTTTIIIIOOOONNNN
       Originally developed at the BioMolecular Engineering
       Research Center (http://bmerc-www.bu.edu), this module is
       intended to take a number of small commonly used methods,
       and make them into a single package.

       The early versions of this module assumed installation on
       our local system.  Although I have tried to correct this
       in the current version, you may find this package requires
       a litle twidling to get working. I'll try to leave
       comments where I think it is most likely, but hopefully
       use of a relational database and local setting changes in
       the base module CompBio.pm will have taken care of it. If
       not _please_ email me at seanq@darwin.bu.edu with the
       details.

       Thanks!

MMMMeeeetttthhhhooooddddssss
       A couple of important general notes. All methods will
       describe the key required arguments and will also indicate
       which arg is the optional %params hash. %params can be
       passed as a hash or reference. Also, most arguments that
       handle sequences or ids accept input as scalar (as is or
       by reference), array (by reference), 2D array (by
       reference), or hash (by reference). scalar may be a
       filename.  If I miss documenting it, try it just in case.
       Unless a parameter option is passed to signify otherwise,
       I assume 'doing the right thing' is to return the same
       type of data construct as recieved. The prefered method
       for submitting sequence data is as a reference to an
       array, each index containing one sequence record (and that
       preferably in table format).

       All methods in Simple.pm accept a %params hash as the
       final argument. Options available for a specific method
       are described in that methods section. Options that can be
       defined for any (well, more than one at least) method are:

       RETURN_TYPE <type>
           Value from: (SCALAR,REFSCALAR,ARRAY,2D,HASH)

           This option overrides the default behavior of
           returning data in the same format submitted and return
           data in the manner specified.

       OUTFILE <filename>
           Opens up <filename> for writing. Results of operation
           are written to file and 0 is returned.

       DEBUG <value>
           A numeric value supplied with this option sets the
           debug level for the specific method call only. Does
           not affect the 'global' debug level set when creating
           a new CompBio::Simple object.

       Also note that the params hash is shared by all methods
       called by the method invoked, so params for those internal
       methods could also be added. The most notable case of
       where this might be used is that check_type is actually
       called by almost every other method (Simple tries to never
       assume or restrict what sequence format is going to be
       used). So check_type's CONFIDENCE option could be included
       to affect it's behavior.

       nnnneeeewwww

       Construct an object for invoking methods in
       CompBio::Simple.

       Options:

       DEBUG: Sets default debug level for all method calls using
       this object. Can be overridden for a specific operation by
       defining a DEBUG option with the method call. Higher
       integer values indicate more output. 1 and 2 are good
       enough generally, 3 and higher can produce overwhelming
       amounts of output, 4 or higher will also cause the program
       to die where it would otherwise warn.

nnnnooootttteeeessss
       OK, so the idea of handeling the output/return method with
       dynamically designed code is good, but current state is
       ugly and re-introduces the problem of converting fa back
       to table on tbl_to_fa type calls. So the kludge below will
       no longer work. Somehow we need to pass the method into
       _munge_input so the right RET_CODE can be designed
       (hopefully more gracefully). i.e. if in tbl_to_fa, just
       send to chosen output stream (how to let user define an
       output stream, such as a network socket?) without
       converting back to tbl. However, current edit state has
       moved return to _munge_input, which only sets it for
       file/stdout. It needs to be set whenever needed, either
       sending to filehandle or into array as nec.

       OK, this whole mess badly needs to be cleaned up. Likely
       we need to either give up on using autoload, or be a lot
       more explicit in the breakdown.




       cccchhhheeeecccckkkk____ttttyyyyppppeeee

       Checks a given sequence or set of sequences for it's type.
       Currently groks fasta(.fa), table(.tbl), raw genome(.raw),
       intelligenics[?](.ig) and coding dna sequence(.cdna)
       types. Each index of the referenced array should be an
       entire sequence record. * It is however nnnnooootttt recomended
       that you load up an entire raw genome into memory to do
       this test - see the perlfunc read entry elsewhere in this
       document *

       `$type = check_type(\@seqs,%parameters);'

       Possible return types are CDNA, TBL, FA, IG, RAW and
       UNKNOWN.

       OPTION:

       CONFIDENCE: Set this parameter to an positive integer
       representing the number of checks against your sequence
       set you would like performed. The default value is 3,
       however if only one or two sequences are in your set only
       one check will be performed.  If CONFIDENCE is set to a
       value equal to or greater than the number of sequences in
       your set, it will be reset to a value approx. 66% of the
       number of sequences in your set (a _v_e_r_y high level of
       confidence).

       Check type will then make CONFIDENCE # of tests on youe
       sequence set, randomly selecting 2 members into a subset
       and checking the format. If more than one format type
       guesses check_type will throw an exception reporting all
       the types found.

       ttttbbbbllll____ttttoooo____ffffaaaa

       Converts sequence records in table (tab delimited, usually
       .tbl file extension) format to fasta format.

       `$fa_seq = $cbs-'tbl_to_fa($tblseq,%params);>

       Extra data fields in the table format will be added to the
       annotation line (>) in the fasta output.

       ttttbbbbllll____ttttoooo____iiiigggg

       Converts sequence records in table (tab delimited) format
       to .ig format.

       `$aref_igseqs = $cbc-'tbl_to_ig(\@tbl_seqs,%params);>

       Extra data fields in the table format will be placed on a
       single comment line (;) in the ig output.

       ffffaaaa____ttttoooo____ttttbbbbllll

       Converts sequence records in fasta format to table format.

       `$aref_faseqs = $cbc-'fa_to_tbl(\@fa_seq);>

       Extra data in the fasta format will be placed in an extra
       tab delimited field after the sequence record.

       Options:

       CLEAN: Reduces signifier to the first strech of non white
       space characters found at least 4 characters long with
       none of the characters (| \ ? / ! *)

       iiiigggg____ttttoooo____ttttbbbbllll

       Accepts ig format sequences in a referenced array, one
       complete sequence record per index. This method returns
       the _s_e_q_u_e_n_c_e(s) in table(.tbl) format contained in a
       referenced array.

       `$aref_igseqs = $cbc-'ig_to_tbl(\@fa_seq);>

       Extra comment lines in the ig format will be placed as
       extra tab delimited values after the sequence record.

       ddddnnnnaaaa____ttttoooo____aaaaaaaa

       Convert dna sequences, containing no whitespace, to the
       amino acid residues as coded by the standard 'universal'
       genetic code. dna sequence may contain standard special
       characters (i.e. R,S,B,N, ect.).

       `$aa = dna_to_aa(\$dna_seq,%params);'

       Options:

       C: Set to a true value to indicate dna should be converted
       to it's complement before translation.

       ALTCODE: A reference to a hash containing alternate coding
       keys where the value is the new aa to code for. Stop
       codons are represented by ".".

       SEQFIX: Set to true to alter first position, making V or L
       an M, and removing stop in last position.

       ccccoooommmmpppplllleeeemmmmeeeennnntttt

       Converts dna to it's complementary strand. DNA sequence is
       submitted as scalar by reference. There is no return as
       sequence is modified. To maintain original sequence, send
       a reference to a copy.

       `complement(\$dna);'

       ssssiiiixxxx____ffffrrrraaaammmmeeee

       `$result = six_frame($raw_file,$id,$seq_len,$out_file);'

       Converts a submitted dna sequence into all 6 frame
       translations to aa.  Output id's have start and stop
       positions encoded; if first value is larger, translated
       from anti-sense.

       Options:

       ALTCODE: A reference to a hash containing alternate coding
       keys where the value is the new aa to code for. Stop
       codons are represented by ".". Multiple allowable values
       for the final dna in the codon may be provided in the
       format AT[TCA].

       ID: Prefix for translated segment identifiers, default is
       SixFrame.

       SEQLEN: Minimum length of aa sequence to return, default
       is 10.

       OUTPUT: A filename to pipe results to. This is recomended
       for large dna sequences such as large contigs or whole
       chromosomes, otherwise results are stored in memory until
       process complete ** this includes the use of the general
       method option of OUTFILE ** . If the value of OUTFILE is
       STDOUT results will be sent directly to standard out.

       AAAAAAAA____HHHHAAAASSSSHHHH

       Creates a hash table with amino acid lookups by codon.
       Includes all cases where even an alternate na code (such
       as M for A or C) would return an unambiguous aa.  Also
       consistent with the complement method in this package, ie,
       lower cases in some contexts, for ease of use with
       six_frame.

        C<%aa_hash = aa_hash;>


EEEEXXXXPPPPOOOORRRRTTTT
       None by default.

HHHHIIIISSSSTTTTOOOORRRRYYYY
       0.01    Original version; created by h2xs 1.20 with
               options

                 -AXC -n CompBio::Simple


       0.42    Got initial set of methods linking to CompBio in.
               Developed the _munge stuff.

       0.43    Brought tests up to same point as in CompBio.
               Added six_frame. Moved the data handeling for
               dna_to_aa to own _munge type internal method.
               Added RETURN_TYPE parameter to _munge input so
               user can pick. Created the
               CompBio::Simple::ArrayFile module to TIE a
               sequence file to an array. Added six_frame using
               same array processing _munge as dna_to_aa.

       0.44    Made seriouse attempt to get the docs caught up.
               Added random sampling and CONFIDENCE to
               check_type.

       0.46    Lots of changes to this version, I'm a bit behind
               in making version notes I'm afraid, and due to
               time constraints, these will be too brief, so I'll
               just try to get up to date.

               Simple seems to now handle using autoload and not
               only passes the built in tests in test.pl, but the
               more brutal usage of a few of the utils.
               Unfortunately this has left a bit of sloppy code
               as I fixed bugs without real rewrites. The design
               of using munges in and out seems to be holding up,
               the major complications comming from IO to and
               from files and the tied array. While that seems to
               have been adressed functionally, I look foward to
               making the code cleaner, and hopefully more
               eleagant, for the next major version.  I also
               think _munge_array_to_scalar has been largely made
               needless by the new layout (CompBio handles loops
               internally in most functions now), and the
               RET_CODE parameter opens up some exiting
               possibilities.

TTTTOOOO DDDDOOOO
       _munge_seq_imput needs to also detect when it just has a
       list of loci and hand request to DB to attempy to fulfil,
       and/or accept a DB param.

       Look at using a tie type method similar to ArrayFile.pm
       for fetching data from a database - simply throwing an
       exception if no db module provided (eval use?).

       rewrite (<sigh>) _munges to detect submited format (fasta,
       etc) and return same format whenever possible by default,
       and adding RETURN_FORMAT parameter, negating the need for
       user to call converter after dna_to_aa for example. *This
       is probably over the top since most methods _are_ the
       converters, but why make myself write if/elsif block in
       all the utils that offer different format returns?*

CCCCOOOOPPPPYYYYRRRRIIIIGGGGHHHHTTTT
       Developed at the BioMolecular Engineering Research Center
       at Boston University under the NHLBI's Programs for
       Genomic Applications grant.

       Copyright Sean Quinlan, Trustees of Boston University
       2000-2001.

       All rights reserved. This program is free software; you
       can redistribute it and/or modify it under the same terms
       as Perl itself.

AAAAUUUUTTTTHHHHOOOORRRR
       Sean Quinlan, seanq@darwin.bu.edu

       Please email me with any changes you make or suggestions
       for changes/additions.  Latest version is available under
       ftp://mcclintock.bu.edu/BMERC/perl/.

       Thank you!

SSSSEEEEEEEE AAAALLLLSSSSOOOO
       _p_e_r_l(1), _C_o_m_p_B_i_o(3).



2001-11-13                 perl v5.6.0                          1


















