******************************************************************************** RefSeq-release92.txt ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-notes/ NCBI Reference Sequence (RefSeq) Database Release 92 January 4, 2019 Distribution Release Notes Release Size: 86867 organisms 1487640446350 nucleotide bases 50022196212 amino acids 185738687 records ****************************************************************************** This document describes the format and content of the flat files that comprise releases of the NCBI Reference Sequence (RefSeq) database. Additional information about RefSeq is available at: 1. NCBI Bookshelf: a) NCBI Handbook: https://www.ncbi.nlm.nih.gov/books/NBK21091/ b) RefSeq Help (FAQ) https://www.ncbi.nlm.nih.gov/books/NBK50680/ 2. RefSeq Web Sites: RefSeq Home: https://www.ncbi.nlm.nih.gov/refSeq/ RefSeqGene Home: https://www.ncbi.nlm.nih.gov/refseq/rsg/ If you have any questions or comments about RefSeq, the RefSeq release files or this document, please contact NCBI by email at: info@ncbi.nlm.nih.gov. To receive announcements of future RefSeq releases and large updates please subscribe to NCBI's refseq-announce mail list: send email to refseq-announce-subscribe@ncbi.nlm.nih.gov with "subscribe" in the subject line (without quotes) and nothing in the email body OR subscribe using the web interface at: https://www.ncbi.nlm.nih.gov/mailman/listinfo/refseq-announce ============================================================================= TABLE OF CONTENTS ============================================================================= 1. INTRODUCTION 1.1 This release 1.2 Cutoff date 1.3 RefSeq Project Background 1.3.1 Sequence accessions, validation, and annotations 1.3.2 Data assembly, curation, and collaboration 1.3.3 Biologically non-redundant data set 1.3.4 RefSeq and DDBJ/EMBL/GenBank comparison 1.4 Uses and applications of the RefSeq database 2. 
CONTENT 2.1 Organisms included 2.2 Molecule Types included 2.3 Known Problems, Redundancies, and Inconsistencies 2.4 Release Catalog 2.5 Changes since the previous release 3. ORGANIZATION OF DATA FILES 3.1 FTP Site Organization 3.2 Release Contents 3.3 File Names and Formats 3.4 File Sizes 3.5 Statistics 3.6 Release Catalog 3.7 Removed Records 3.8 Accession Format 3.9 Growth of RefSeq 4. FLAT FILE ANNOTATION 4.1 Main features of RefSeq Flat File 4.1.1 LOCUS, DEFLINE, ACCESSION, KEYWORDS, SOURCE, ORGANISM 4.1.2 REFERENCE, DIRECT SUBMISSION, COMMENT, PRIMARY 4.1.3 NUCLEOTIDE FEATURE ANNOTATION (Gene, mRNA, CDS) 4.1.4 PROTEIN FEATURE ANNOTATION 4.2 Tracking Identifiers 4.2.1 GeneID 4.2.2 Transcript ID 4.2.3 Protein ID 4.2.4 Conserved Domain Database (CDD) ID 5. REFSEQ ADMINISTRATION 5.1 Citing RefSeq 5.2 RefSeq Distribution Formats 5.3 Other Methods of Accessing RefSeq Data 5.4 Request for Corrections and Comments 5.5 Credits and Acknowledgements 5.6 Disclaimer ============================================================================= 1. INTRODUCTION ============================================================================= The NCBI Reference Sequence Project (RefSeq) is an effort to provide the best single collection of naturally occurring biomolecules, representative of the central dogma, for each major organism. Ideally this would include one sequence record for each chromosome, organelle, or plasmid linked on a residue by residue basis to the expressed transcripts, to the translated proteins, and to each mature peptide product. Depending on the organism, we may have some, but not all, of this information at any given time. We pragmatically include the best view we can from available data. 
Additional information about the RefSeq project is available from:

  a) RefSeq Web site
     https://www.ncbi.nlm.nih.gov/refseq/
  b) Entrez Books, NCBI Handbook, RefSeq chapter
     https://www.ncbi.nlm.nih.gov/books/NBK21091/

1.1 This Release
----------------
The National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), National Institutes of Health (NIH) is responsible for producing and distributing the RefSeq Sequence Database. Records are provided through a combination of collaboration and in-house processing, including some curation by NCBI staff who are expert biologists.

This is a full release of all NCBI RefSeq records. The RefSeq project is an ongoing effort to provide a curated, non-redundant collection of sequences. This release includes all of the sequence data that we have collected at this time. Although the RefSeq collection is not yet complete, its value as a non-redundant dataset has reached a level that justifies providing full releases.

1.2 Cutoff date
---------------
This full release, Release 92, incorporates data available as of January 4, 2019.

For more recent data, users are advised to:

  1. Download the RefSeq daily update files from the RefSeq FTP site
     ftp://ftp.ncbi.nlm.nih.gov/refseq/daily/
  2. Use NCBI's Entrez Programming Utilities to download records based on queries or lists of accessions
     https://www.ncbi.nlm.nih.gov/books/NBK25500/
  3. Use the interactive web query system to query based on date.
     https://www.ncbi.nlm.nih.gov/nucleotide/
     https://www.ncbi.nlm.nih.gov/protein/

1.3 RefSeq Project Background
-----------------------------

1.3.1 Sequence accessions, validation, and annotation
-----------------------------------------------------
Every sequence is assigned a stable accession, version, and gi, and all older versions remain available over time. RefSeq accessions have a distinct format (see section 3.8); the underscore ("_") is the primary distinguishing feature of a RefSeq accession.
DDBJ/EMBL/GenBank accessions never include an underscore.

Sequences are validated in several ways. For example, checks confirm that the genomic sequence from the region of the mRNA feature really does match the mRNA sequence itself, and that the annotated coding region features really can be translated into the protein sequences they refer to. Validation also checks for valid ASN.1 format and ensures that consistency is maintained in descriptive information (symbols, gene and protein names) between RefSeq and Gene records.

Each molecule is annotated as accurately as possible with the correct organism name, the correct gene symbol for that organism, and reasonable names for proteins where possible. When available, nomenclature provided by official nomenclature groups is used. Note that gene symbols are not required or expected to be unique either across species or within a species.

1.3.2 Data assembly, curation, and collaboration
------------------------------------------------
We welcome collaborations with authoritative groups outside NCBI who are willing to provide the sequences, annotations, or links to phenotypic or organism-specific resources. Where such collaborations have not yet developed, NCBI staff have assembled the best view of the organism that we can put together ourselves. In some cases, as with the human genome, NCBI is an active participant in generating the genome assembly and in providing reference sequences to represent the annotated genome. For other genomes, we may compile the data ourselves from DDBJ/EMBL/GenBank or other public sources. For instance, we may simply select the "best" DDBJ/EMBL/GenBank record by automatic means, validate the data format (and correct if needed), and add an essentially unchanged copy to the RefSeq collection, attributed to the original DDBJ/EMBL/GenBank record.
In other cases we may provide a record that is very similar to the DDBJ/EMBL/GenBank record, but to which experts at NCBI have added corrected or additional annotation. This latter process can range from minor technical repairs to a manually curated re-annotation of the sequence, often in collaboration with experts outside NCBI. Each record that has been curated, or that is in the pool for future curation, is labeled with the level of curation it has received. Curation status information is provided primarily for transcript and protein records. Curation is carried out on the whole genome level for some smaller genomes such as viral, organelle, and some microbial genomes. Curation status codes are defined in the section 3.2 below. 1.3.3 Biologically non-redundant data set ----------------------------------------- RefSeq provides a biologically non-redundant set of sequences for database searching and gene characterization. It has the advantage of providing an objective and experimentally verifiable definition of "non-redundant" in supplying one example of each natural biomolecule per organism or sample. The small amount of sequence redundancy introduced from close paralogs, alternate splicing products, and genome assembly intermediates is compensated for by the clarity of the model. RefSeq provides the substrate for a variety of conclusions about non-redundancy based on clustering identical sequences, or families of related sequences, without confounding the database itself with these more subjective assessments. 1.3.4 RefSeq and DDBJ/EMBL/GenBank comparison --------------------------------------------- RefSeq is unique in providing a large curated database across many organisms, which precisely and explicitly links genetic (chromosome), expression (mRNA), and functional (protein) sequence data into an integrated whole. 
DDBJ/EMBL/GenBank also integrates DNA and protein information, and RefSeq is substantially based on sequence records contributed to DDBJ/EMBL/GenBank. However, RefSeq is similar to a review article in that it represents a synthesis and summary of information by a particular group (NCBI or other RefSeq contributors) that is based on the primary data gathered by many others and made part of the scientific record. Also, like a review article, it has the advantage of organizing a large body of diverse data into a single consistent framework with a uniform set of conventions and standards.

Note that while based on DDBJ/EMBL/GenBank, RefSeq is distinct from DDBJ/EMBL/GenBank. DDBJ/EMBL/GenBank represents the sequence and annotations supplied by the original authors and is never changed by NCBI or RefSeq staff. DDBJ/EMBL/GenBank remains the primary sequence archive while RefSeq is a summary and synthesis based on that essential primary data.

1.4 Uses and applications of the RefSeq database
------------------------------------------------
A stable, consistent, comprehensive, non-redundant database of genomes and their products provides a valuable sequence resource for similarity searching, gene identification, protein classification, comparative genomics, and selection of probes for gene expression. It also acts as molecular "white pages" by providing a single, uniform point of access for searching at the sequence level, and by connecting the results with a diversity of organism-specific databases or resources unique to that organism or field.

=============================================================================
2. CONTENT
=============================================================================

2.1 Organisms included
----------------------
The number of organisms reported for the release (section 3.5 below) is determined by counting the number of distinct tax_ids included in the release. Tax_ids are provided by the NCBI Taxonomy group.
Tax_ids were historically provided for all species and strains having any amount of sequence data. In 2014 NCBI stopped assigning strain-level tax_ids. Strains are now being tracked by the BioSample database.

The release includes species ranging from viral to microbial to eukaryotic and includes organisms for which complete and incomplete genomic sequence data is available. The release does not include all species for which some sequence data is available in DDBJ/EMBL/GenBank. The decision to generate RefSeq data for a species or strain depends in part on the amount of sequence data available. Additional species will be represented in the RefSeq collection as more sequence data becomes available.

2.2 Molecule Types Included
---------------------------
The RefSeq release includes genomic, transcript, and protein sequence data; however, these molecule types are not provided for all organisms and the sequences provided may not be complete or comprehensive for some species.

Transcript RefSeq records may represent protein-coding transcripts or non-coding RNA products; these records are currently only provided for eukaryotic species.

Genomic RefSeq records are provided when a sufficient quantity of genomic sequence data is available in DDBJ/EMBL/GenBank. Transcript and protein records may be provided for a species before genomic sequence data is available.

2.3 Known Problems, Redundancies, and Inconsistencies
------------------------------------------------------

Known Problems with RefSeq release 92:
======================================
There are no known problems with RefSeq release 92.

Known Redundancies and Inconsistencies:
=======================================
The RefSeq collection is an ongoing project that is expected to grow in scope and content over time.
Thus it is important to recognize that it is not complete in that some genomes are not yet completely sequenced, some incompletely sequenced genomes may not be included, or some gene products may not yet be represented. RefSeq records may be added, removed, or updated in future releases as new information becomes available and as a result of curation. Known Data inconsistencies: [1] RefSeq status codes are not consistently provided for some species. The goal is to consistently provide a status code for all RefSeq records. The release catalog indicates "UNKNOWN" if a status code was expected but not detected and "na" if a status code is not expected based on the original project plan for provision of this type of information. Status codes will be more consistently applied to all records in the future. [2] The genomic, transcript, and protein collection is known to be incomplete for many species. This is particularly true for those genomes for which a complete genome assembly is not yet available, such as Sus scrofa (pig). As additional sequence data becomes available, the RefSeq representation for this, and other, organisms will increase. [3] Whole genome shotgun (WGS) assemblies of organelle, plastid, or viral genomes are included in the complete node and in the taxonomic group that the whole genome WGS project is reported in (e.g., fungi etc.). Our process flow for WGS data provides a data extraction per WGS project with no distinction by molecule (such as mitochondrial). Therefore, some nodes do not include WGS data or may include WGS data for different taxa. For instance, NZ_ACSJ01000000 includes contigs representing two tax_ids - a bacterium and a phage. The entire WGS project has been processed for the complete node and the microbial node in this release. Therefore, the microbial node includes a small amount of viral sequence and the viral node omits this data. 
NZ_ACSJ01000001 to NZ_ACSJ01000011 microbial contigs NZ_ACSJ01000012 to NZ_ACSJ01000019 viral contigs [4] Although the goal is to provide a non-redundant collection, some redundancy is included in this release as follows. Redundant Protein records: Alternate Splicing When additional transcripts are provided to represent alternate splicing products, and the alternate splice site occurs in the UTR, then the protein is redundantly provided. Paralogs (eukaryotes) The goal is to provide a RefSeq record for each naturally occurring molecule. Therefore, records are provided for all genes identified including those produced by more recent gene duplication events in which the genes are nearly identical. Redundant Genomic records: Intermediate records For some species, intermediate genomic records are provided to support the assembly and/or annotation of the genome. For example, for human, a chromosome may be represented by a chromosome RefSeq record with a NC_ accession prefix. The chromosome record may consist of many contigs, each represented as a separate record with a NT_ accession prefix. In addition, some curated gene region records, with NG_ accession prefix, may also be provided to support annotation of complex regions. Alternate assemblies Genomic records are provided to represent alternate assemblies of genomic sequence derived from different populations. These records will have varying levels of redundancy and represent polymorphic and haplotype differences in terms of the sequence and annotation. For example, alternate assemblies are provided for different mouse strains and for regions of the human major histocompatibility complex (MHC). The MHC is a highly variable region of chromosome 6 which exhibits variation at the level of both sequence polymorphism and gene content. The alternate assemblies make it possible to represent this alternate gene content. 
Prokaryotic strains

Prokaryotic genome sequence data derived from different strains may be represented as additional RefSeq records. This introduces redundancy but may also add representation for some proteins that are unique to a strain. RefSeq records for a specific strain can be identified by the unique taxonomic ID for that strain. The protein complement is non-redundant.

[5] Note that for some organisms, most notably vertebrates, processing to update individual transcript and protein records may occur on a daily basis. Transcript and protein updates may include changes to descriptive information such as publications, names, or feature annotations. Updates can also include changes to the sequence or the addition of new sequence records. Thus information available on transcript and protein records may be more current than the annotated genome.

2.4 Release Catalog
-------------------
The Release Catalog documents the full contents of the RefSeq Release. The catalog can be used to identify data of interest. See the format description in section 3.6 for additional information.

The release catalog is available at:
  ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/RefSeq-release#.catalog

The catalog for previous releases is available in the archive directory:
  ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/archive/

2.5 Changes since the previous release
--------------------------------------

[1] The dbSNP annotation summary has not been updated since the release in May 2018 as NCBI is making a transition in the SNP-retrieval system.
The most current report available summarizes the post-Build 151 updates of human SNPs: ftp://ftp.ncbi.nlm.nih.gov/snp/pre_build152/release-notes/RefSeq/refseq88.snp.rpt [2] Matched Annotation by NCBI and EMBL-EBI (MANE) project NCBI/RefSeq and Ensembl/GENCODE are collaborating on the Matched Annotation by NCBI and EMBL-EBI (MANE) project to provide a matched set of well-supported transcripts for human protein-coding genes and define one representative transcript for each gene. Both RefSeq and Ensembl will continue to provide a rich set of alternate transcripts per gene. Further details are available at: https://ncbiinsights.ncbi.nlm.nih.gov/2018/10/11/matched-annotation-by-ncbi-and-embl-ebi-mane-a-new-joint-venture-to-define-a-set-of-representative-transcripts-for-human-protein-coding-genes/ As part of this project, the current release includes updates to approximately 10,000 human RefSeq NM_ and NR_ transcripts. The updates will continue in batches through 2019. Please stay tuned to NCBI Insights for further details as this project progresses. Previous Announcement: ---------------------- [1] The dbSNP annotation summary has not been updated since the release in May 2018 as NCBI is making a transition in the SNP-retrieval system. The most current report available summarizes the post-Build 151 updates of human SNPs: ftp://ftp.ncbi.nlm.nih.gov/snp/re/refseq88.snp.rpt [2] Matched Annotation by NCBI and EMBL-EBI (MANE) project NCBI/RefSeq and Ensembl/GENCODE are now collaborating on the Matched Annotation by NCBI and EMBL-EBI (MANE) project to provide a matched set of well-supported transcripts for human protein-coding genes and define one representative transcript for each gene. Both RefSeq and Ensembl will continue to provide a rich set of alternate transcripts per gene. 
Further details are available at: https://ncbiinsights.ncbi.nlm.nih.gov/2018/10/11/matched-annotation-by-ncbi-and-embl-ebi-mane-a-new-joint-venture-to-define-a-set-of-representative-transcripts-for-human-protein-coding-genes/ As part of this project, we anticipate updates to a large number of human RefSeq NM_ and NR_ transcripts beginning in the next month and continuing in batches through 2019. Please stay tuned to NCBI Insights for further details as this project progresses. [3] Protein ID mapping file added to ftp site In 2014 and 2015, NCBI re-annotated all prokaryotic genomes, except a small set of Reference Genomes, using NCBI's Prokaryotic Genome Annotation Pipeline based on a new protein data model. This new RefSeq non-redundant protein model is identified by a "WP_" accession prefix, which is different from the traditional RefSeq prokaryotic protein "NP_" or "YP_" accession. This re-annotation resulted in the removal of nearly 7 million NP_ and YP_ accessions as prokaryotic genomes were updated to directly cross-reference the new non-redundant WP_ accessions. For conserved proteins, the same WP accession may appear on thousands of genomes. However, we are aware that the NP_ and YP_ accessions have been used in many publications and biomedical projects, which may refer scientists to NCBI protein pages, which currently provide the new non-redundant proteins with WP_ accessions. The file "NP_YP_WP.txt" is a protein ID mapping file that provides the association of traditional NP_ and YP_ proteins with new WP_ proteins of identical sequences. 
The ID mapping file consists of five columns:

  IPG           - the IPG ID (https://www.ncbi.nlm.nih.gov/ipg/)
  NP_YP_AccVer  - the NP/YP accession and version
  WP_AccVer     - the associated WP accession
  NP_YP_Taxid   - Taxonomy ID
  NP_YP_Status  - the status of NP/YP protein
      - live:       the NP/YP protein is still annotated on Reference Genomes
      - replaced:   the NP/YP protein was replaced by a WP protein
      - suppressed: the NP/YP protein was first replaced by WP protein, which was subsequently suppressed because it is no longer annotated on any genome
      - withdrawn:  the NP/YP protein is no longer annotated on any genome

Additional information:
  https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/#reference_genomes
  https://www.ncbi.nlm.nih.gov/genome/annotation_prok/
  https://www.ncbi.nlm.nih.gov/refseq/about/nonredundantproteins/
  https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/reannotation/

Announcing Future Changes:
--------------------------

[1] Future change: SNP data to be removed from genome assembly records

We currently expect to make the following change in the March 2019 RefSeq FTP Release: SNP variation features will no longer be in RefSeq genome assembly records - chromosome and contig records with NC_, NT_, NW_ and AC_ accession prefixes. This change affects both the ASN.1 and flatfile records. Because the number of variants is already enormous and still growing, removing SNP features from these large genomic records will significantly reduce the size of RefSeq FTP files and make downloading and processing easier.

We will continue to include SNPs on NG_-prefixed genomic records, and transcript (NM_, NR_, XM_, XR_) and protein (NP_, XP_, YP_) sequences.
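For users who want to anticipate the effect of this change on their own pipelines, the prefix lists above can be encoded directly. The following is only a sketch: the function name snp_features_expected is hypothetical, the two tuples simply restate the prefixes named in this announcement, and any other prefix is reported as unknown rather than guessed.

```python
# Prefixes named in the announcement above; nothing else is assumed.
SNP_FEATURES_REMOVED = ("NC_", "NT_", "NW_", "AC_")
SNP_FEATURES_KEPT = ("NG_", "NM_", "NR_", "XM_", "XR_", "NP_", "XP_", "YP_")


def snp_features_expected(accession):
    """Return True if SNP features are expected to remain on the record,
    False if they are expected to be removed, and None if the accession
    prefix is not mentioned in the announcement."""
    if accession.startswith(SNP_FEATURES_REMOVED):
        return False
    if accession.startswith(SNP_FEATURES_KEPT):
        return True
    return None


# The accessions below are used only as example strings.
print(snp_features_expected("NC_000001.11"))     # chromosome record
print(snp_features_expected("NM_000546.6"))      # transcript record
print(snp_features_expected("NZ_ACSJ01000001"))  # prefix not covered above
```

Returning None for unlisted prefixes (such as NZ_ or WP_) keeps the helper from asserting anything the announcement does not state.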
In addition, the ASN.1 format will be changed to: - remove the bitfield - remove the 'extra' flags More information is available here: https://ncbiinsights.ncbi.nlm.nih.gov/2017/05/09/phasing-out-support-for-non-human-genome-organism-data-in-dbsnp-and-dbvar/ [2] Future change: New accession formats and flatfile parsers In September, GenBank announced that a new accession format is being introduced to accommodate the growth of WGS sequences, with a maximum length for INSDC accessions of 15-17 characters (for example, AZZZAA02123456789). Corresponding RefSeq accessions used for prokaryote genomic records, which add an NZ_ prefix, will be at least 18 characters (e.g., NZ_AZZZAA021234567), and could be as long as 20 characters (e.g., NZ_AZZZAA02123456789). This may require adjustments to code and databases to accommodate the longer length. In particular, the first line of the flatfile format, referred to as the LOCUS line, includes the "Locus Name", which is usually identical to the accession number. Historically, the Locus Name has had a maximum length of 16 characters and was found at positions 13-28, but that may now grow to as long as 20 characters. This may require modifications to flatfile parsers if they rely solely on position. 
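A parser that splits the LOCUS line on whitespace, rather than slicing fixed column ranges, is unaffected by a longer Locus Name. The following is a minimal sketch of that approach, not NCBI code: the function name parse_locus_line is hypothetical, and it assumes a nucleotide LOCUS line with exactly eight whitespace-separated tokens, as in the examples in this section. Lines of any other shape are rejected rather than guessed at.

```python
def parse_locus_line(line):
    """Parse a nucleotide LOCUS line by whitespace-separated tokens.

    Assumes the eight-token layout used in the examples of this section:
    LOCUS, locus name, length, 'bp', molecule type, topology, division, date.
    """
    tokens = line.split()
    if len(tokens) != 8 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError("unrecognized LOCUS line: %r" % line)
    _, name, length, _, molecule, topology, division, date = tokens
    return {
        "name": name,
        "length": int(length),  # Python integers are not limited to 32 bits
        "molecule": molecule,
        "topology": topology,
        "division": division,
        "date": date,
    }


# Column positions do not matter, only token order.
record = parse_locus_line(
    "LOCUS       NZ_AZZZAA021234567   9999999 bp    DNA     linear   PRI 15-OCT-2018"
)
print(record["name"], record["length"])
```

Because only token order is used, the same function also accepts a line in which a long Locus Name or a long sequence length pushes the later fields to the right.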
Consider this LOCUS line for a typical RefSeq bacterial genome:

LOCUS       NZ_ABCD02123456      5868661 bp    DNA     linear   PRI 15-OCT-2018
------------+--------------+-+---------+---------+---------+---------+---------
1           13            28 30        40        50        60        70       79

With the new format, the longer Locus Name extends into space originally reserved for the sequence length:

LOCUS       NZ_AZZZAA021234567   9999999 bp    DNA     linear   PRI 15-OCT-2018
------------+--------------+-+---------+---------+---------+---------+---------
1           13            28 30        40        50        60        70       79

The longest Locus Name that may occur with the coming change would span positions 13-32, but still allow sequence lengths of up to 9,999,999 bp without further shifts:

LOCUS       NZ_AZZZAA02123456789 9999999 bp    DNA     linear   PRI 15-OCT-2018
------------+--------------+-+---------+---------+---------+---------+---------
1           13            28 30        40        50        60        70       79

Theoretically, a sequence could have a long Locus Name and longer length, which would result in shifting the subsequent elements of the LOCUS line to the right, past 80 characters:

GenBank:

LOCUS       AZZZAA02123456789 10000000000 bp    DNA     linear   PRI 15-OCT-2018
------------+--------------+-+---------+---------+---------+---------+---------
1           13            28 30        40        50        60        70       79

RefSeq:

LOCUS       NZ_AZZZAA02123456789 10000000000 bp    DNA     linear   PRI 15-OCT-2018
------------+--------------+-+---------+---------+---------+---------+---------
1           13            28 30        40        50        60        70       79

Note the theoretical GenBank example requires an individual sequence length of >= 10 Gbp and a large WGS project with >= 100 million sequences before shifting the sequence length and other elements to the right, which is not an expected combination in the foreseeable future. The theoretical RefSeq example is also excluded by current RefSeq policies and accession formats.
In particular, NZ_ prefix accessions are only used for prokaryote sequences, which rarely exceed 10 Mbp, and single, large WGS projects requiring RefSeq accessions >18 characters are currently excluded from the RefSeq prokaryote genome collection. Since 2003, the GenBank release notes have recommended that flatfile parsers use a whitespace-separated tokens approach in order to accommodate changes like the one above. However, some parsers may need revision if they use a pure position-based approach. From our internal testing, it appears BioPython and BioPerl properly handle most of the examples above, and only have issues with the last theoretical examples where the sequence length no longer ends at position 40. We do recommend adjusting code to accommodate those theoretical examples for future-proofing. We also recommend reviewing any database schemas to make sure they can accommodate the longer accession format. Further information about the revised accession format and its effects on the LOCUS line are available at: https://ncbiinsights.ncbi.nlm.nih.gov/2018/09/19/genbank-expanded-accession-formats/ https://ftp.ncbi.nlm.nih.gov/genbank/gbrel.txt [3] Future change: RefSeq BioProjects We are considering dropping RefSeq BioProjects from organelle records. ============================================================================= 3. 
ORGANIZATION OF DATA FILES
=============================================================================

3.1 FTP Site Organization
-------------------------
RefSeq releases are available on the NCBI FTP site at:

  ftp://ftp.ncbi.nlm.nih.gov/refseq/release/

Documentation Directories and Files:
------------------------------------
release-catalog/
  archive/
      --subdirectory, archive of previous catalogs
  RefSeq-release#.catalog
      --file, comprehensive list of sequence records included in the current release
  release#.files.installed
      --file, list of sequence data files installed
  release#.removed-records
      --file, list of removed records that were included in the previous release
  release#.taxon.new
      --file, list of organisms that have been added to the release since the previous release
  release#.taxon.update
      --file, list of organisms for which there has been a change in either the NCBI Tax ID or the organism name
  release#.AutonomousProtein2Genomic.gz
      --file, list of genomic accessions that non-redundant WP protein accessions are annotated on
  release#.MultispeciesAutonomousProtein2taxname.gz
      --file, list of NCBI TaxID and species name for the subset of non-redundant WP protein accessions that are annotated on genomic records from more than one species
  release#.accession2geneid.gz
      --file, list of GeneIDs included in the current release

release-notes/
  archive/
      --subdirectory, archive of previous documentation
  RefSeq-release#.txt
      --file, this Release notes document

release-statistics/
  archive/
      --subdirectory, archive of previous documentation
  RefSeq-release#.MMDDYYYY.stats.txt
      --file, detailed release statistics
  *.acc_taxid_growth.txt
      --growth file, where '*' is archaea, bacteria etc.
first row identifies column content RefSeq.taxid_growth.txt --organism growth file, release nodes are columns first row identifies column content Sequence Data Directories and Files: ------------------------------------ The RefSeq collection is provided in a redundant fashion to best meet the needs of those who want the full collection as well as those who want a specific sub-set of the collection. Therefore the collection is provided as: 1) the complete collection, and 2) sections as defined by major taxonomic or other logical groupings. A subdirectory exists for each sub-section as follows: archaea bacteria fungi invertebrate mitochondrion other plant plasmid plastid protozoa vertebrate_mammalian vertebrate_other viral In addition, the complete collection is available without these sub-groupings in the subdirectory: complete Note that this directory structure intentionally provides the release data in a redundant fashion. We gave considerable thought to how to package the release to meet the needs of different user groups. For instance, some groups may be interested in retrieving the complete protein set, while other groups may be interested in retrieving data for a more limited number of organisms. We decided to provide logical groupings based on general taxonomic node (viral, mammalian etc.) as well as logical molecule type compartmentalization (e.g., plastid). Thus, all records are provided at least twice, once in the "complete" directory, and a second time in one of the other directories. Some sequences may be provided three times when it is logical to include the record in more than one additional directory. For example, a sequence may be provided in the "complete", "mitochondrion", and "vertebrate_mammalian" directories. We are interested in hearing if you find this structure useful or if you would like information grouped in a different manner. 
Send suggestions or comments to the NCBI Help Desk at: info@ncbi.nlm.nih.gov

3.2 Release Contents
--------------------
A comprehensive list of sequence files provided for the current release is available in:
  ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/release#.files.installed

A comprehensive list of sequence records included in the current release is available in:
  ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/RefSeq-release#.catalog

The file name format indicates the directory node, molecule type, and format type.

Name format:

   complete.10.1.bna.gz
  |--------|--|-|---|--|
      1     2  3  4  5

  1. directory location
  2. numerical increment
     -to provide a set of unique file names
  3. optional: sub-part number
     -to provide a unique file name for genomic FASTA files which may be split based on size
  4. format type
  5. compression

Multiple files may be provided for any given molecule and format type, indicated by a numerical increment in the file names. Files of the same molecule type and increment are related by content. Files of different molecule type and the same increment may or may not have related content.

For example:

  complete.1006.bna.gz
  complete.1006.1.genomic.fna.gz
  complete.1006.2.genomic.fna.gz
      -- genomic FASTA split into two sub-parts due to size
  complete.1006.genomic.gbff.gz
      -- content related to the two 1006.#.genomic.fna.gz files
  complete.1006.protein.faa.gz
  complete.1006.protein.gpff.gz
      -- contains proteins found in either genomic or rna files of this increment
  complete.1006.rna.fna.gz
  complete.1006.rna.gbff.gz
      -- unrelated to the contents of the genomic files of this increment

If you are interested in a complete set of genomic, protein, and rna files for a given tax_id, you must scan all files from the directory. You may also want to consider using the per-assembly files provided at ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/ instead.
More information is available at: https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/ Note that for some molecule and format types, a number increment is skipped. This is not an error. It is also not an error if a filename provided with one release is not provided with a different release. For example: complete.281.genomic.gbff.gz complete.282.genomic.gbff.gz complete.284.genomic.gbff.gz complete.285.genomic.gbff.gz complete.287.genomic.gbff.gz --release 70 did not include files named as complete.283.genomic or complete.286.genomic because complete.283.bna & complete.286.bna did not include genomic data. The RefSeq release processing first produces a comprehensive set of ASN.1 files, ordered by tax_id, and limited by a size constraint. These initial files are further processed to export the records by molecule and format type. If the initial ASN.1 file does not include any records for a given molecule type, such as genomic sequence data, then the corresponding 'genomic' fasta and flatfile records will not be found. The installed release includes a comprehensive report of all files installed for a given release. Please refer to /release-catalog/release#.files.installed (where # is the release number). 3.3 File Names and Formats -------------------------- File names are informative, and indicate the content, molecule type, and file format of each RefSeq release data file. Most filenames utilize this structure: directory.filenumber.subpart.molecule.format.gz 1 2 3 4 5 File Name Key: 1. directory directory level the file is provided in (e.g.,complete, viral etc) 2. file number: large data sets are provided as incrementally numbered files 3. sub-part number: large genomic fasta files may be split to facilitate transfer 4. molecule type of molecule (genomic, rna, or protein); not relevant for ASN.1 format files provided in the "complete" sub-directory 5. 
format       the data format provided in the file; see below

For example:

   complete1.genomic.bna.gz
   vertebrate_mammalian2.protein.gpff.gz

RefSeq Whole Genome Shotgun (WGS) data are provided in separate files for
each WGS project. Their filenames use a slightly different structure:

   directoryWGSproject.molecule.format.gz

For example:

   completeNZ_AAAU.bna.gz
   microbialNZ_AAAV.genomic.fna.gz

The filenames for RefSeq non-redundant proteins also use a slightly
different structure:

   directory.nonredundant_protein.filenumber.molecule.format.gz

For example:

   complete.nonredundant_protein.20.protein.faa.gz
   bacteria.nonredundant_protein.105.protein.gpff.gz

The term "non-redundant protein" refers to the representation of identical
proteins in the prokaryotic RefSeq protein dataset using a single
non-redundant protein accession number (with the prefix 'WP_').
Non-redundant RefSeq protein records, which are currently provided for
archaeal and bacterial RefSeq genomes, may be found in RefSeq genomes from
multiple species. More information about this type of RefSeq protein record
can be found here:

   https://www.ncbi.nlm.nih.gov/refseq/about/nonredundantproteins/
   ftp://ftp.ncbi.nlm.nih.gov/refseq/release/announcements/WP-proteins-06.10.2013.pdf

All RefSeq release files have been compressed with the gzip utility;
therefore, an invariant ".gz" suffix is present for all release files.

The data that comprise a RefSeq release are available in several file
formats, as indicated by the format component in the file name:

   bna    binary ASN.1 format; includes nucleotide and protein
   gbff   GenBank flat file format; nucleotide records
   gpff   GenPept flat file format; protein records
   fna    FASTA format; nucleotide records
   faa    FASTA format; protein records

The comprehensive full release is deposited in the "complete" directory and
is available in all file types. Binary ASN.1 format is only provided in the
complete directory. The remaining directories provide all of the other file
types.
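
The naming scheme described above can be sketched as a small parser. This is
an illustrative sketch only, not an NCBI tool: the function and class names
are hypothetical, and the pattern assumes the dot-separated structure
(directory.filenumber.subpart.molecule.format.gz) plus the
nonredundant_protein variant. Per-project WGS file names, and any name that
does not follow the dotted structure, are deliberately not matched and
return None.

```python
import re
from typing import NamedTuple, Optional


class ReleaseFile(NamedTuple):
    """Components of a RefSeq release file name (hypothetical helper type)."""
    directory: str
    nonredundant: bool
    file_number: int
    subpart: Optional[int]
    molecule: Optional[str]
    fmt: str


# Assumed pattern, derived from the File Name Key in these notes:
# directory[.nonredundant_protein].filenumber[.subpart][.molecule].format.gz
_PATTERN = re.compile(
    r"^(?P<directory>[a-z_]+)"
    r"(?P<nr>\.nonredundant_protein)?"
    r"\.(?P<number>\d+)"
    r"(?:\.(?P<subpart>\d+))?"
    r"(?:\.(?P<molecule>genomic|rna|protein))?"
    r"\.(?P<fmt>bna|gbff|gpff|fna|faa)\.gz$"
)


def parse_release_filename(name: str) -> Optional[ReleaseFile]:
    """Return the parsed components, or None if the name does not match."""
    match = _PATTERN.match(name)
    if match is None:
        return None
    subpart = match.group("subpart")
    return ReleaseFile(
        directory=match.group("directory"),
        nonredundant=match.group("nr") is not None,
        file_number=int(match.group("number")),
        subpart=int(subpart) if subpart is not None else None,
        molecule=match.group("molecule"),  # None for ASN.1 (bna) files
        fmt=match.group("fmt"),
    )


if __name__ == "__main__":
    for example in (
        "complete.1006.1.genomic.fna.gz",
        "complete.1006.bna.gz",
        "bacteria.nonredundant_protein.105.protein.gpff.gz",
        "completeNZ_AAAU.bna.gz",  # WGS per-project name: not handled here
    ):
        print(example, "->", parse_release_filename(example))
```

Grouping the parsed results by file_number is one way to collect the
related genomic, rna, and protein files of a single increment.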
The DDBJ/EMBL/GenBank and GenPept flat file format provided in this release
matches that seen when accessing the records using the NCBI web site.
Notably, some RefSeq records are in the CON division and do not instantiate
the sequence in the flat file display; instead, a 'join' statement is
provided to indicate the assembly instructions. The FASTA files do include
the assembled sequences for these CON division RefSeq records. For example,
see NC_000022.11.

Suggestions regarding the structure of the RefSeq release product and the
available formats may be sent to the NCBI Help Desk:

   info@ncbi.nlm.nih.gov

3.4 File Sizes
--------------
RefSeq release files are provided in a range of sizes. Most are limited to
several hundred megabytes (MB), and uncompressed ASN.1 files will not exceed
500 MB. Nucleotide FASTA files are split when they reach 1 gigabyte (GB).
Files are compressed to reduce file size and facilitate FTP retrieval.

The total size of release 92 is as follows:

   Extension   Size (GB)   Type
   -----------------------------------------------------------
   bna           1550.19   ASN.1
   gbff          2243.44   GenBank flat file
   gpff           603.56   GenPept flat file
   fna           3027.39   FASTA, nucleotide
   faa            121.49   FASTA, protein

Notes:
[A] The complete directory provides all file types. The ASN.1 format is
    only available in the complete directory; the file sizes reported for
    the remaining file formats represent the redundant total found in the
    complete plus other directories.

3.5 Statistics
---------------
RefSeq release 92 includes sequences from 86867 different organisms.
The number of species represented in each Release sub-directory, determined
by counting distinct tax IDs, is as follows:

   archaea                  1141
   bacteria                53823
   complete                86867
   fungi                   11374
   invertebrate             3458
   mitochondrion            9289
   other                       2
   plant                    3188
   plasmid                  3910
   plastid                  3333
   protozoa                  522
   vertebrate_mammalian     1132
   vertebrate_other         4008
   viral                    8196

Counts of accessions and basepairs/residues per molecule type:

                 Accessions   Basepairs/Residues
   Genomic:        30154973        1426893077881
   RNA:            25088890          60747368469
   Protein:       130366644          50022196212
   Wgs master:       128180                    0

Complete RefSeq release statistics for each directory are provided in a
separate document. Please see:

   ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-statistics/

   file: RefSeq-release#.MMDDYYYY.stats.txt
         #:        indicates release number
         MMDDYYYY: indicates release date as month, day, year

Statistics for previous releases are available in the archive subdirectory:

   ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-statistics/archive/

3.6 Release Catalog Format
--------------------------
The full non-redundant contents of the release are documented in the
release catalog.

Available at: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/

The catalog includes the following columns:

   1. tax_id
   2. Taxon name
   3. RefSeq accession.version
   4. gi
   5. FTP directories data is provided in, '|' separated
   6. RefSeq status code
   7. sequence length

Note: the molecule type for each catalog entry can be inferred from the
accession prefix (see below).

RefSeq Status Codes are documented on the RefSeq web site. The catalog
includes the following terms:

   na           Not Applicable; status codes are not provided for some
                records.

   UNKNOWN      The status code has not yet been applied or status is not
                applicable to the type of record.

   REVIEWED     The RefSeq record has been reviewed by NCBI staff or by a
                collaborator. Some RefSeq records may incorporate expanded
                sequence and annotation information including additional
                publications and features. This indicates a curated record.

   VALIDATED    The RefSeq record has undergone an initial review to provide
                the preferred sequence standard. The record has not yet been
                subject to final review, at which time additional functional
                information may be provided. This indicates a curated
                record.

   PROVISIONAL  The RefSeq record has not yet been subject to individual
                review and is thought to be well supported and to represent
                a valid transcript and protein. This record is not curated.

   PREDICTED    The RefSeq transcript may represent an ab initio prediction
                or may be weakly supported by transcripts or protein
                homology. This record is not curated.

   INFERRED     The RefSeq record is inferred by genome sequence analysis.
                This record is not curated.

   MODEL        The RefSeq record is provided via automated processing and
                is not subject to individual review or revision between
                builds. This record is not curated.

3.7 Removed Records
-------------------
This is a report of accessions that were included in the previous release
but are no longer included in the current release.

Available at:
   ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/
   release#.removed-records

File format

The file includes the following columns:

   1. tax_id
   2. species name
   3. RefSeq accession.version
   4. gi
   5. FTP directories data was provided in, in last release
   6. RefSeq status code
   7. sequence length
   8. type of removal

Type options include:

   dead protein
   replaced by accession    [original accession is not secondary]
   permanently suppressed
   temporarily suppressed   [record may become available again in the
                             future]

3.8 RefSeq Accession Format
---------------------------
RefSeq accessions are formatted as a two-letter prefix, followed by an
underscore, followed by six or nine digits, or four letters plus eight
digits. For example, NM_020236, NP_001107345, and NZ_AABC02000001. The
underscore ("_") is the primary distinguishing feature of a RefSeq
accession; DDBJ/EMBL/GenBank accessions never include an underscore.
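
As an illustration of the accession format described above, the following
sketch checks the three stated shapes (six digits, nine digits, or four
letters plus eight digits, each with an optional ".version" suffix) and
infers the molecule type from the prefix. It is a hedged sketch, not an
official validator: the function names are hypothetical, the prefix map is
transcribed from the accession prefix table in these notes, and NZ_
accessions built on other INSDC accession shapes are intentionally not
matched by this simplified pattern.

```python
import re
from typing import Optional, Tuple

# Assumed pattern covering only the three shapes named in section 3.8.
_ACCESSION = re.compile(
    r"^(?P<prefix>[A-Z]{2})_"
    r"(?P<body>\d{6}|\d{9}|[A-Z]{4}\d{8})"
    r"(?:\.(?P<version>\d+))?$"
)

# Molecule type per prefix, as listed in the accession prefix table.
MOLECULE_BY_PREFIX = {
    "NC_": "DNA", "AC_": "DNA", "NZ_": "DNA",
    "NT_": "DNA", "NW_": "DNA", "NG_": "DNA",
    "NM_": "mRNA", "XM_": "mRNA",
    "NR_": "RNA", "XR_": "RNA",
    "NP_": "protein", "AP_": "protein", "XP_": "protein",
    "YP_": "protein", "WP_": "protein",
}


def parse_accession(text: str) -> Optional[Tuple[str, str, Optional[int]]]:
    """Split an accession into (prefix, body, version); None if no match."""
    match = _ACCESSION.match(text)
    if match is None:
        return None
    version = match.group("version")
    return (
        match.group("prefix") + "_",
        match.group("body"),
        int(version) if version is not None else None,
    )


def molecule_type(text: str) -> Optional[str]:
    """Infer the molecule type from the prefix; None if unknown or invalid."""
    parsed = parse_accession(text)
    if parsed is None:
        return None
    return MOLECULE_BY_PREFIX.get(parsed[0])


if __name__ == "__main__":
    for acc in ("NM_020236", "NP_001107345", "NZ_AABC02000001",
                "NC_000022.11", "U49845"):
        print(acc, parse_accession(acc), molecule_type(acc))
```

The same prefix lookup can be applied to column 3 of the release catalog to
recover the molecule type of each entry.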
RefSeq accession prefixes

Prefix  Molecule  Use context and complete accession format
        type
-----------------------------------------------------------------------------
NC_     DNA       Use context: Chromosomes; Linkage Groups
                  Format: Prefix followed by 6 numbers, followed by the
                  sequence version number

AC_     DNA       Use context: Chromosomes; Linkage Groups
                  Format: Prefix followed by 6 numbers, followed by the
                  sequence version number

NZ_     DNA       Use context: Chromosomes; Scaffolds; used predominantly
                  for prokaryotic genomes
                  Format: Prefix followed by the INSDC accession number that
                  the RefSeq record is based on, followed by the RefSeq
                  sequence version number

NT_     DNA       Use context: Scaffolds
                  Format: Prefix followed by 6 or 9 numbers, followed by the
                  sequence version number

NW_     DNA       Use context: Scaffolds
                  Format: Prefix followed by 6 or 9 numbers, followed by the
                  sequence version number

NG_     DNA       Use context: Genomic regions; a genomic region record may
                  represent a single or multiple genetic loci (e.g., rRNA
                  targeted locus, RefSeqGene, non-transcribed pseudogene)
                  Format: Prefix followed by 6 numbers, followed by the
                  sequence version number

NM_     mRNA      Use context: protein-coding transcripts
                  Format: Prefix followed by 6 or 9 numbers, followed by the
                  sequence version number; curated by NCBI staff or a model
                  organism database; these records are referred to as the
                  'known' RefSeq dataset

XM_     mRNA      Use context: protein-coding transcripts
                  Format: Prefix followed by 6 or 9 numbers, followed by the
                  sequence version number; generated through either the
                  eukaryotic genome annotation pipeline, or the small
                  eukaryotic genome annotation pipeline; records generated
                  via the first method are referred to as the 'model' RefSeq
                  dataset.

NR_     RNA       Use context: non-protein-coding transcripts including
                  lncRNAs, structural RNAs, transcribed pseudogenes, and
                  transcripts with unlikely protein-coding potential from
                  protein-coding genes
                  Format: Prefix followed by 6 or 9 numbers, followed by the
                  sequence version number; curated by NCBI staff or a model
                  organism database; these records are referred to as the
                  'known' RefSeq dataset

XR_     RNA       Use context: non-protein-coding transcripts, as above
                  Format: Prefix followed by 6 or 9 numbers, followed by the
                  sequence version number; generated through either the
                  eukaryotic genome annotation pipeline, or the small
                  eukaryotic genome annotation pipeline; records generated
                  via the first method are referred to as the 'model' RefSeq
                  dataset.

NP_     protein   Use context: Proteins annotated on NM_ transcript
                  accessions or annotated on genomic molecules without an
                  instantiated transcript (e.g. some mitochondrial genomes,
                  viral genomes, and reference bacterial genomes)
                  Format: Prefix followed by 6 or 9 numbers, followed by the
                  sequence version number; curated by NCBI staff or a model
                  organism database; these records are referred to as the
                  'known' RefSeq dataset

AP_     protein   Use context: Proteins annotated on AC_ genomic accessions
                  or annotated on genomic molecules without an instantiated
                  transcript record
                  Format: Prefix followed by 6 or 9 numbers, followed by the
                  sequence version number

XP_     protein   Use context: Proteins annotated on XM_ transcript
                  accessions or annotated on genomic molecules without an
                  instantiated transcript record
                  Format: Prefix followed by 6 or 9 numbers, followed by the
                  sequence version number; generated through either the
                  eukaryotic genome annotation pipeline, or the small
                  eukaryotic genome annotation pipeline; records generated
                  via the first method are referred to as the 'model' RefSeq
                  dataset.

YP_     protein   Use context: Proteins annotated on genomic molecules
                  without an instantiated transcript record
                  Format: Prefix followed by 6 or 9 numbers, followed by the
                  sequence version number

WP_     protein   Use context: Proteins that are non-redundant across
                  multiple strains and species. A single protein of this
                  type may be annotated on more than one prokaryotic genome
                  Format: Prefix followed by 9 numbers, followed by the
                  version number, which is always '.1' as these records are
                  not subject to update

See online documentation for additional information on WP_ accessions:
https://www.ncbi.nlm.nih.gov/refseq/about/nonredundantproteins/

As needed, accession series will be expanded by adding 3 digits, with
existing accessions remaining stable.

3.9 Growth of RefSeq
--------------------

Release Date          Taxons    Nucleotides  Amino Acids    Records
1       Jun 30, 2003    2005     4672871949    263588685    1061675
2       Oct 21, 2003    2124     7745398573    286957682    1097404
3       Jan 13, 2004    2218     7992741222    294647847    1101244
4       Mar 24, 2004    2358     8175128887    318253841    1193457
5       May 3, 2004     2395     8325515623    337229387    1255613
6       Jul 5, 2004     2467     8696371716    365446682    1367206
7       Sep 10, 2004    2558    21072808460    405233619    1579579
8       Oct 31, 2004    2645    26814386658    430300369    1709723
9       Jan 9, 2005     2780    36786975473    470534907    1843944
10      Mar 6, 2005     2827    36893741150    482862858    1893478
11      May 8, 2005     2928    39731702362    507980644    2477893
12      Jul 10, 2005    2969    43043256058    608493108    2869675
13      Sep 11, 2005    3060    44727484853    686768902    3400773
14      Nov 20, 2005    3198    47364955367    763761075    3272776
15      Jan 1, 2006     3244    52645441913    810009733    3436263
16      Mar 11, 2006    3397    56175443059    887509001    3715260
17      May 1, 2006     3497    62130037371    927587669    3999859
18      Jul 11, 2006    3695    70474041999    974374765    4186692
19      Sep 10, 2006    3774    70694879544   1012985077    4311543
20      Nov 5, 2006     3919    72679681505   1061797276    4567569
21      Jan 6, 2007     4079    73864990566   1144795927    4742335
22      Mar 5, 2007     4187    82441128546   1215085694    5207865
23      May 8, 2007     4300    83148327110   1291050995    5503385
24      Jul 10, 2007    4511    89856995521   1365916222    6073814
25      Sep 11, 2007    4646    91265840843   1470475398    6515132
26      Nov 4, 2007     4737    99105705485   1495032507    6698250
27      Jan 6, 2008     4926   101059552113   1556356987    7025715
28      Mar 9, 2008     5059   102051350525   1770627427    7914560
29      May 4, 2008     5168   104671101150   1870214220    8376141
30      Jul 7, 2008     5395   105074486709   1913447691    8572852
31      Aug 30, 2008    5513   109214348591   2026768719    9145702
32      Nov 10, 2008    5726   111122203221   2089596746    9501764
33      Jan 16, 2009    7773   116001583818   2204073443   10325282
34      Mar 6, 2009     8054   111792574830   2299682138   10021870
35      May 4, 2009     8393   113210655336   2565199170   10993891
36      Jul 2, 2009     8665   117013741530   2756884219   12141825
37      Sep 3, 2009     9005   119151229820   2965450333   12941750
38      Nov 7, 2009     9166   119196622435   3115246540   13436447
39      Jan 23, 2010   10171   118502856500   3221054793   13656433
40      Mar 7, 2010    10291   118645985035   3280528951   13853798
41      May 9, 2010    10567   125500880884   3427514220   14472060
42      Jul 13, 2010   10728   143311839055   3553178673   15038858
43      Sep 5, 2010    10854   148706971456   3761205880   15934055
44      Nov 7, 2010    11354   152241490865   3899827321   16421261
45      Jan 7, 2011    11536   152787094873   3989526325   16748646
46      Mar 8, 2011    11734   153220856222   4064052954   16998463
47      May 7, 2011    12000   162001966044   4226432170   17631876
48      Jul 10, 2011   12235   163771272903   4381572480   18162534
49      Sep 7, 2011    16248   162286146420   4401462131   18236994
50      Nov 8, 2011    16392   168702162406   4529303978   18815153
51      Jan 9, 2012    16609   172751347778   4727472575   19580946
52      Mar 5, 2012    16923   173705194347   4929467422   20235247
53      May 7, 2012    17339   175345433862   5247723883   21286080
54      Jul 9, 2012    17605   176492228688   5456992181   21889466
55      Sep 17, 2012   17994   194971374545   5803694332   23207572
56      Nov 8, 2012    18512   207200464965   6003283860   23892460
57      Jan 8, 2013    21415   227639108990   8895153979   34158511
58      Mar 11, 2013   22460   233247214400   9699076220   36938203
59      Apr 29, 2013   24656   256547643663  10081118607   39040745
60      Jul 19, 2013   28560   304686151670  10968281809   40913699
61      Sep 9, 2013    29414   319551394177  11248966865   41958567
62      Nov 10, 2013   31646   361097812819  12364402476   45971929
63      Jan 12, 2014   33485   380736496721  12898823816   48358066
64      Mar 10, 2014   33693   407131829420  13126329523   49538213
65      May 12, 2014   36335   430613954268  13544443640   51770174
66      Jul 7, 2014    41263   464958653006  15380643722   58334707
67      Sep 8, 2014    41913   490800792583  15984799771   61277203
68      Nov 3, 2014    49312   551290496427  16790850066   66078114
69      Jan 2, 2015    51661   594452675642  18690872100   74127019
70      Apr 30, 2015   54118   643051675415  18556381492   74720563
71      Jul 6, 2015    55267   669786114584  19394398061   77730891
72      Aug 27, 2015   54937   705514040682  19748515407   79189847
73      Nov 2, 2015    55966   738575306673  20847187904   83881439
74      Jan 11, 2016   57993   780562546593  22359312327   89458499
75      Mar 7, 2016    58776   807349580822  23386816845   92936289
76      May 9, 2016    59995   859358759387  24586044092   97792976
77      Jun 29, 2016   60892   872938972710  25449517637  100678438
78      Sep 6, 2016    62739   904423741786  27105909174  107045797
79      Oct 31, 2016   64277   941153466527  28214340731  111024999
80      Jan 9, 2017    66224   988758901224  30073388355  118059547
81      Mar 6, 2017    68165  1022393849190  31208765769  121954847
82      May 8, 2017    69035  1066355456886  32674281195  127098389
83      Jul 17, 2017   71356  1121562831367  34113050666  132052465
84      Sep 11, 2017   72965  1158748173657  36673975257  140627690
85      Nov 6, 2017    73996  1204502588476  38371950939  146710309
86      Jan 8, 2018    75218  1224147155468  39198368659  149493466
87      Mar 5, 2018    77225  1266924789413  40799318419  155118991
88      May 14, 2018   79448  1281457514351  42356891903  160224355
89      Jul 9, 2018    81345  1310406641373  43546263891  163859625
90      Sep 10, 2018   84276  1391082745897  46448327052  173956003
91      Nov 5, 2018    85308  1430969078377  48133151229  179672083
92      Jan 4, 2019    86867  1487640446350  50022196212  185738687

Note: Date refers to the data cut-off date, i.e., the release incorporates
data available as of the listed date.

=============================================================================
4.
FLAT FILE ANNOTATION
=============================================================================

4.1 Main features of RefSeq Flat File
-------------------------------------
Also see the RefSeq web site and the NCBI Handbook, RefSeq chapter.

   https://www.ncbi.nlm.nih.gov/refseq/
   https://www.ncbi.nlm.nih.gov/books/NBK21091/

4.1.1 LOCUS, DEFLINE, ACCESSION, KEYWORDS, SOURCE, ORGANISM
--------------------------------------------------------------------
The beginning of each RefSeq record provides information about the
accession, length, molecule type, division, and last update date. This is
followed by the descriptive DEFINITION line, then by the Accession, version,
and GI data, followed by detailed information about the organism and
taxonomic lineage.

//
LOCUS       NC_004916     384502 bp    DNA     linear   INV 05-JUN-2012
DEFINITION  Leishmania major strain Friedlin complete genome, chromosome 3.
ACCESSION   NC_004916
VERSION     NC_004916.2  GI:389592668
DBLINK      Project: 15564
            BioProject: PRJNA15564
KEYWORDS    RefSeq; complete genome.
SOURCE      Leishmania major strain Friedlin
  ORGANISM  Leishmania major strain Friedlin
            Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae;
            Leishmaniinae; Leishmania.
//

Note: Both the GI and VERSION number increment when a sequence is updated,
while the ACCESSION remains the same. The GI and "ACCESSION.VERSION"
identifiers provide the finest resolution reference to a sequence.

4.1.2 REFERENCE, DIRECT SUBMISSION, COMMENT, PRIMARY
----------------------------------------------------
REFERENCE:
While the majority of RefSeq records do include REFERENCE data, this data is
not required and some records do not include any citations. Publications are
propagated from the GenBank record(s) from which the RefSeq is derived,
provided by collaborating groups and NCBI staff during the curation process,
and provided by the National Library of Medicine (NLM) PubMed MeSH indexing
staff as they add new articles to PubMed.
Functionally relevant citations are added by individual scientists using the
Entrez Gene GeneRIF submission form, and a significant volume of citation
connections is supplied by the NLM MeSH indexing staff for human, mouse,
rat, zebrafish, and cow. This functionality is expected to expand in the
future to cover all organisms represented in the RefSeq collection.

Citations supplied by the MeSH indexers and individual scientists can be
identified by the presence of a REMARK beginning with the text string
"GeneRIF". This represents a significant method to keep sequence connections
to the literature up-to-date; GeneRIFs add considerable value to the RefSeq
collection. For more information on GeneRIFs please see:

   https://www.ncbi.nlm.nih.gov/gene/about-generif

For example, several GeneRIFs have been added to NM_000173.1 including:

//
REFERENCE   13 (bases 1 to 2480)
  AUTHORS   Poujol,C., Ware,J., Nieswandt,B., Nurden,A.T. and Nurden,P.
  TITLE     Absence of GPIbalpha is responsible for aberrant membrane
            development during megakaryocyte maturation: ultrastructural
            study using a transgenic model
  JOURNAL   Exp. Hematol. 30 (4), 352-360 (2002)
  MEDLINE   21935100
   PUBMED   11937271
  REMARK    GeneRIF: Absence of GPIbalpha is responsible for aberrant
            membrane development during megakaryocyte maturation; leads to
            abnormal partitioning of the membrane systems and abnormal
            proplatelet production.
//

DIRECT SUBMISSION:
A Direct Submission field is provided on some RefSeq records but not all. It
is propagated from the underlying GenBank record from which the RefSeq is
derived or provided on submissions from collaborating groups. Transcript and
protein RefSeqs for human, mouse, rat, zebrafish, and cow do not provide
this field as records often include additional data and are not necessarily
direct copies of the GenBank submission.

COMMENT:
A COMMENT is provided for the majority of RefSeq records. We are working to
supply a COMMENT more comprehensively in the future.
A COMMENT is always provided if the version number and GI have changed.

COMMENT sections may include information on:

   RefSeq Status (PROVISIONAL, INFERRED, VALIDATED, REVIEWED, etc.)
   Information on collaborating groups (e.g. RefSeqGene project)
   GenBank record(s) from which the RefSeq is derived
   Version/GI changes
   A summary about sequence function
   Description of transcript variants
   Sequence note to describe the components of the RefSeq transcript
   Evidence data describing transcript and RNA-Seq support for the RefSeq
      transcript
   5' and/or 3' completeness of the RefSeq transcript
   Attributes: examples - 'non-AUG initiation codon', 'Protein has
      antimicrobial activity'

Example: COMMENT section of NM_004323.5

//
COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
            reference sequence was derived from BG723775.1, BC001936.1,
            AL161445.10 and CN478628.1.
            This sequence is a reference standard in the RefSeqGene project.
            On Feb 17, 2010 this sequence version replaced gi:124494250.

            Summary: The oncogene BCL2 is a membrane protein that blocks a
            step in a pathway leading to apoptosis or programmed cell death.
            The protein encoded by this gene binds to BCL2 and is referred
            to as BCL2-associated athanogene. It enhances the anti-apoptotic
            effects of BCL2 and represents a link between growth factor
            receptors and anti-apoptotic mechanisms. Multiple protein
            isoforms are encoded by this mRNA through the use of a non-AUG
            (CUG) initiation codon, and three alternative downstream AUG
            initiation codons. A related pseudogene has been defined on
            chromosome X. [provided by RefSeq, Feb 2010].

            Transcript Variant: This transcript (1) encodes multiple
            isoforms due to the use of alternative translation initiation
            codons. The longest isoform (BAG-1L or p50) is derived from an
            upstream non-AUG (CUG) start codon, while three shorter isoforms
            are derived from downstream AUG start codons. The longest
            isoform (BAG-1L) is represented in this RefSeq.
Sequence Note: This RefSeq record was created from transcript and genomic sequence data to make the sequence consistent with the reference genome assembly. The genomic coordinates used for the transcript record were based on transcript alignments. CCDS Note: This CCDS ID represents the longest human BAG1 isoform, known as BAG-1L or p50, as described in the literature, including PMIDs 9396724, 9679980, 9747877 and 17662274. This isoform initiates translation at a non-AUG (CUG) start codon that is well-conserved and present in a strong Kozak signal context. Alternative translation initiation at downstream AUG start codons produces three additional isoforms with shorter N-termini, known as BAG-1M or p46, BAG-1S or p36 (also known as p33), and p29. The most abundant of the shorter isoforms, BAG-1S, is represented by CCDS 55301.1. Evidence in PMIDs 9747877 and 17662274 indicates that these isoforms have distinct subcellular distributions, which may contribute to the multifunctionality of the protein. Publication Note: This RefSeq record includes a subset of the publications that are available for this gene. Please see the Gene record to access additional publications. ##Evidence-Data-START## Transcript exon combination :: Z35491.1, BC001936.1 [ECO:0000332] RNAseq introns :: single sample supports all introns SAMEA2147975, SAMEA2149876 [ECO:0000348] ##Evidence-Data-END## ##RefSeq-Attributes-START## non-AUG initiation codon :: PMID: 9679980, 9396724 ##RefSeq-Attributes-END## COMPLETENESS: complete on the 3' end. // PRIMARY: This section contains the coordinates of the transcript and/or genomic components of the RefSeq. The 'c' in the COMP column indicates that the coordinates are on the complementary strand. 
Example: NM_004006.2

PRIMARY     REFSEQ_SPAN         PRIMARY_IDENTIFIER PRIMARY_SPAN        COMP
            1-44                AL031643.1         20726-20769         c
            45-4649             M18533.1           9-4613
            4650-4650           AL109609.5         79506-79506         c
            4651-5773           M18533.1           4615-5737
            5774-5774           AL109609.5         35892-35892         c
            5775-12748          M18533.1           5739-12712
            12749-13993         BC028720.1         3398-4642

4.1.3 NUCLEOTIDE FEATURE ANNOTATION
-----------------------------------
Gene, mRNA, CDS:
Every effort is made to consistently provide the Gene and coding sequence
(CDS) feature (when relevant). If a RefSeq is based on a GenBank record that
is only annotated with the CDS, then a Gene feature is created. mRNA
features are provided for most eukaryotic records; this is not yet
comprehensively provided and will improve in future releases.

Gene Names:
Gene symbols and names are provided by external official nomenclature groups
for some organisms. If official nomenclature is not available we may use a
systematic name provided by the data submitter or apply a more functional
name during curation. When official nomenclature is available we may provide
additional alternate names for some organisms.

Variation:
Variation is computed by the dbSNP database staff and added via
post-processing to RefSeq records.

Miscellaneous:
For some records, additional annotation may be provided when identified by
the curation staff or provided by a collaborating group. For example, the
location of polyA signal and sites may be included.

4.1.4 PROTEIN FEATURE ANNOTATION
--------------------------------
Protein Names:
Protein names may be provided by a collaborating group, may be based on the
Gene Name, or for some records, the curation process may identify the
preferred protein name based on that associated with a specific EC number or
based on the literature.

Protein Products:
Signal peptide and mature peptide annotation is provided by propagation from
the GenBank submission that the RefSeq is based on, when provided by a
collaborating group, or when determined by the curation process.
Domains:
Domains are computed by alignment to the NCBI Conserved Domain Database for
human, mouse, rat, zebrafish, nematode, and cow. The best hits are annotated
on the RefSeq. For some records, additional functionally significant regions
of the protein may be annotated by the curation staff. Domain annotation is
not provided comprehensively at this time.

4.2 Tracking Identifiers
------------------------
Several identifiers are provided on RefSeq records that can be used to track
relationships between annotated features, relationships between RefSeq
records, and changes to RefSeq records over time.

The GeneID identifies the related Gene, mRNA, and CDS features. Transcript
IDs (RefSeq accessions) provide an explicit connection between a transcript
feature annotated on a genomic RefSeq record, and the RefSeq transcript
record itself. Likewise, the Protein ID (RefSeq accession) provides the
association between the annotated CDS feature on a genomic or transcript
RefSeq record, and the protein record itself.

Changes to a RefSeq sequence over time can be identified by changes to the
GI and version number.

4.2.1 GeneID
------------
A gene feature database cross-reference qualifier (dbxref), the GeneID, is
provided on many RefSeq records to support access to the Entrez Gene
database. Entrez Gene provides gene-oriented information for a sub-set of
the RefSeq collection. Gene includes data for all eukaryotic genomes, viral
genomes, and a representative set of prokaryotic genomes.

The GeneID provides a distinct tracking identifier for a gene or locus and
is provided on the gene, mRNA, and CDS features. The GeneID can be used to
identify a set of related features; this is especially useful when multiple
products are provided to represent alternate splicing events.

For example:

NC_000003.12 Homo sapiens chromosome 3, GRCh38.p7 Primary Assembly.

//
     gene            38038595..38122741
                     /gene="DLEC1"
                     /gene_synonym="CFAP81; DLC-1; DLC1; F56"
                     /note="deleted in lung and esophageal cancer 1; Derived
                     by automated computational analysis using gene
                     prediction method: BestRefSeq,Gnomon."
                     /db_xref="GeneID:9940"           <<<--- GeneID
                     /db_xref="HGNC:HGNC:2899"
                     /db_xref="MIM:604050"
//

When viewing RefSeq records via the internet, the GeneID is hot-linked to
Entrez Gene.

4.2.2 Transcript ID
-------------------
The transcript_id qualifier found on an mRNA or other RNA feature annotation
provides an explicit correspondence between a feature annotation on a
genomic record and the RefSeq transcript record.

For example:

NC_000022.11 Homo sapiens chromosome 22, GRCh38.p7 Primary Assembly.

//
     mRNA            complement(46255663..46263322)
                     /gene="PKDREJ"
                     /product="polycystin (PKD) family receptor for egg
                     jelly"
                     /note="Derived by automated computational analysis
                     using gene prediction method: BestRefSeq."
                     /transcript_id="NM_006071.1"  <<<--- linked RefSeq transcript
                     /db_xref="GI:5174632"
                     /db_xref="GeneID:10343"
                     /db_xref="HGNC:HGNC:9015"
                     /db_xref="MIM:604670"
//

4.2.3 Protein ID
----------------
The protein_id qualifier found on a coding region (CDS) feature provides an
explicit correspondence between feature annotation on a genomic or
transcript RefSeq record and the RefSeq protein record.

For example:

NC_001144.5 Saccharomyces cerevisiae chromosome XII, complete sequence.
// CDS complement(16639..17613) /gene="MHT1" /locus_tag="YLL062C" /EC_number="2.1.1.10" /note="S-methylmethionine-homocysteine methyltransferase; functions along with Sam4p in the conversion of S-adenosylmethionine (AdoMet) to methionine to control the methionine/AdoMet ratio" /codon_start=1 /product="S-adenosylmethionine-homocysteine S-methyltransferase MHT1" /protein_id="NP_013038.1" <<<--- linked RefSeq protein /db_xref="GI:6322966" /db_xref="SGD:S000003985" /db_xref="GeneID:850664" // 4.2.4 Conserved Domain Database (CDD) ID ---------------------------------------- Protein domain annotation is calculated by the Conserved Domain Database and is included in RefSeq protein records processed for the FTP site. Domain annotation appears as a Region feature on protein records and is propagated to associated transcript features (if available) as a misc_feat. The feature annotation includes a dbxref cross-reference to the CDD database that is the equivalent of a gi identifier in that it may change over time. The dbxref retrieves a domain model as calculated at a point in time; recalculation of domains by the CDD group may result in a new CDD identifier value. The CDD dbxref values that are available in the RefSeq release, although not stable, will continue to retrieve data from the CDD database where a newer identifier value may be found. For example: VERSION NP_000550.2 GI:28302131 DEFINITION A-gamma globin [Homo sapiens]. // Region 5..142 /region_name="globin" /note="Globins are heme proteins, which bind and transport oxygen; cd01040" /db_xref="CDD:29979" <<--- CDD identifier // ============================================================================= 5. REFSEQ ADMINISTRATION ============================================================================= The National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, is responsible for the production and distribution of the NIH RefSeq Sequence Database. 
NCBI distributes RefSeq sequence data by anonymous FTP. For more information,
you may contact NCBI by email at info@ncbi.nlm.nih.gov or by phone at
301-496-2475.

5.1 Citing RefSeq
-----------------
When citing data in RefSeq, it is appropriate to give the sequence name and
the primary accession and version number (or GI). Note that the most accurate
citation of the sequence includes the combined accession plus version number
or the GI number.

It is also appropriate to list a reference for the RefSeq project. Please
refer to the RefSeq web site for the most recent publication.

    https://www.ncbi.nlm.nih.gov/refseq/publications/

5.2 RefSeq Distribution Formats
-------------------------------
Complete flat file releases of the RefSeq database are available via NCBI's
anonymous ftp server:

    ftp://ftp.ncbi.nlm.nih.gov/refseq/release/

Each release is cumulative, incorporating previous data plus new data.
Records that have been suppressed are not included in the release.

Incremental updates made between RefSeq releases are available at:

    ftp://ftp.ncbi.nlm.nih.gov/refseq/daily
    ftp://ftp.ncbi.nlm.nih.gov/refseq/cumulative

Please refer to the README for additional information:

    ftp://ftp.ncbi.nlm.nih.gov/refseq/README

5.3 Other Methods of Accessing RefSeq Data
------------------------------------------
Entrez is a molecular biology database system that presents an integrated
view of DNA and protein sequence data, structure data, genome data,
publications, and other data fields. The Entrez query and retrieval system is
produced by the National Center for Biotechnology Information (NCBI) and is
available only via the internet.

Entrez is accessed at:

    https://www.ncbi.nlm.nih.gov/Entrez/

RefSeq entries are indexed for retrieval in the Entrez system. The web-based
filter restrictions can be used to restrict your query to RefSeq data or to
specific subsets of the RefSeq database.
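
As a rough illustration of such a filter restriction, the Python sketch below
appends a RefSeq property filter to a query and builds a search URL without
sending it. The E-utilities esearch endpoint, the srcdb_refseq[prop] filter
term, and the helper names refseq_term and esearch_url are assumptions made
for illustration; none of them is defined in these release notes.

```python
from urllib.parse import urlencode

# Assumed E-utilities search endpoint; not named in these release notes.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Assumed Entrez property filter that limits results to RefSeq records.
REFSEQ_FILTER = "srcdb_refseq[prop]"


def refseq_term(term):
    """Return the user query ANDed with the RefSeq property filter."""
    return f"({term}) AND {REFSEQ_FILTER}"


def esearch_url(db, term, retmax=20):
    """Build (but do not send) a search URL restricted to RefSeq records."""
    params = {"db": db, "term": refseq_term(term), "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical usage: nucleotide records for the gene symbol DLEC1.
    print(esearch_url("nuccore", "DLEC1[gene]"))
```

The URL is only constructed here, so the sketch can be checked offline;
fetching it is left to whatever HTTP client is already in use.
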
Additional specific property restrictions are provided to support querying
for RefSeq records with specific STATUS codes. Queries are defined on the
RefSeq web site at:

    https://www.ncbi.nlm.nih.gov/RefSeq/

5.4 Request for Corrections and Comments
----------------------------------------
We welcome your suggestions to improve the RefSeq collection. We also invite
groups interested in contributing toward the collection and curation of the
RefSeq database, to improve the representation of single genes, gene
families, or complete genomes, to contact us.

Please refer to RefSeq accession and version numbers (or GI) and the RefSeq
Release number to which your comments apply. It is useful if you indicate the
source of data that you found to be problematic (e.g., data on the FTP site,
data retrieved on the web site), the entry DEFLINE, and the specific
annotation field for which you are suggesting a change.

Suggestions and corrections can be sent to:

    info@ncbi.nlm.nih.gov

5.5 Credits and Acknowledgements
--------------------------------
This RefSeq release would not be possible without the support of numerous
collaborators and the primary sequence data that are submitted by thousands
of laboratories and available in GenBank. The RefSeq project is ambitious in
scope and we actively welcome opportunities to work with other groups to
provide this collection. We value all of our collaborators; their
contributions vary widely in scope and volume and include completely
annotated genomes, advice to improve the sequence or annotation of individual
RefSeq records, information about official nomenclature, and information
about function.

In addition to the significant information collected by collaboration,
numerous NCBI staff are involved in infrastructure support, programmatic
support, and curation. RefSeq is supported by three primary work groups that
are associated with Entrez Gene, Entrez Genomes, and the Genome Annotation
Pipeline.

5.6 Disclaimer
--------------
The United States Government makes no representations or warranties regarding
the content or accuracy of the information. The United States Government also
makes no representations or warranties of merchantability or fitness for a
particular purpose or that the use of the sequences will not infringe any
patent, copyright, trademark, or other rights. The United States Government
accepts no responsibility for any consequence of the receipt or use of the
information.

For additional information about RefSeq releases, please contact NCBI by
e-mail at info@ncbi.nlm.nih.gov or by phone at (301) 496-2475.