The Chromosome 7 Annotation Project
Web News and Updates:

August 18, 2004

The August 18th updates and additions include:
 -331 new full length mRNA sequences added
 -ENCODE (Encyclopedia of DNA Elements) regions added
 -376 new clinical cases of defined phenotype
 -Updated HUGO gene nomenclature
 -New FISH data added
 -New BAC end data added
 -Variation data added (large scale genomic, gene expression, and hominoid lineage-specific gene copy number variations)

  • ENCODE project overview

  • In April 2003, the sequence of the human genome was completed. Although this is a significant achievement, much remains to be done. Before the best use of the information contained in the sequence can be made, the identity and precise location of all of the protein-encoding and non-protein-encoding genes will have to be determined. The identity of other functional elements encoded in the DNA sequence, such as promoters and other transcriptional regulatory sequences, along with determinants of chromosome structure and function, such as origins of replication, also remain largely unknown. A comprehensive encyclopedia of all of these features is needed to fully utilize the sequence to better understand human biology, to predict potential disease risks, and to stimulate the development of new therapies to prevent and treat these diseases.

    To encourage discussion and comparison of existing computational and experimental approaches, and to stimulate the development of new ones, the NHGRI proposed to create a highly interactive public research consortium to carry out a pilot project for testing and comparing existing and new methods to identify functional sequences in DNA. Working together in a highly cooperative effort to rigorously analyze a defined portion of the human genome sequence, investigators with diverse backgrounds and expertise will be able to evaluate the relative merits of each of a diverse set of techniques, technologies and strategies in identifying all the functional elements in human genomic sequence, to identify gaps in our ability to annotate genomic sequence, and to consider the abilities of such methods to be scaled up for an effort to analyze the entire human genome.

    On July 23-24, 2002, the NHGRI organized a workshop, the Comprehensive Extraction of Biological Information from Genomic Sequence, to discuss this proposal. The workshop participants resoundingly supported the concept of a pilot project and made a number of recommendations about the project's goals, organization and implementation, which have now been incorporated into NHGRI's plan. The ultimate goal of this project is to improve access to information, resources, ideas, expertise, and technology beyond the scope of any single group, and to affect the entire community of researchers interested in mining genomic sequence. The hoped-for outcome will be a clear path to determining all of the functional elements in the entire human genome sequence and integrating the information in a manner that will guide future basic and clinical research. For more information, visit the ENCODE project website.

  • Variation data overview

  • Three new types of DNA variations have been added to the database.

    Large-scale copy-number variations (LCVs) involve gains or losses of several kilobases to hundreds of kilobases of genomic DNA among phenotypically normal individuals. These LCVs were investigated using array CGH for 55 unrelated individuals. A total of 255 loci across the human genome that contained genomic imbalances were identified. Twenty-four variants were present in greater than 10% of the individuals examined. Half of these regions overlap with genes, and many coincide with segmental duplications or gaps in the human genome assembly. This previously unappreciated heterogeneity may underlie certain human phenotypic variation and susceptibility to disease and argues for a more dynamic human genome structure. For additional information, visit the Database of Genomic Variants.

    Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C.
    Detection of large-scale variation in the human genome.
    Nat Genet. 2004 Aug 1

    A genetic analysis of genome-wide variation in human gene expression identified natural variation in gene expression and confirmed a heritable component to the baseline expression level of many genes. Microarrays were used to measure the baseline expression levels of genes in immortalized B cells from members of Centre d'Etude du Polymorphisme Humain (CEPH) Utah pedigrees. For each of the ~8,500 genes on the array, the authors estimated the variance of expression level among unrelated individuals (94 CEPH grandparents) and the mean of the variance among array replicates. The analysis highlighted those genes that had greater expression variation between individuals than between replicates.

    Morley M, Molony CM, Weber TM, Devlin JL, Ewens KG, Spielman RS, Cheung VG.
    Genetic analysis of genome-wide variation in human gene expression.
    Nature. 2004 Aug 12;430(7001):743-7. Epub 2004 Jul 21.
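The comparison described above, variation between individuals versus variation between array replicates, can be sketched for a single gene as follows. This is only an illustration under stated assumptions: the function name `variance_ratio`, the input layout (one list of replicate measurements per individual), and the use of a simple ratio are hypothetical choices and are not taken from the publication.

```python
from statistics import mean, variance


def variance_ratio(replicates_by_individual):
    """Ratio of between-individual to between-replicate variance for one gene.

    replicates_by_individual: list of lists; each inner list holds the
    replicate expression measurements (at least two) for one individual.
    Hypothetical layout, chosen for illustration only.
    """
    # Variance of the per-individual mean expression levels.
    individual_means = [mean(reps) for reps in replicates_by_individual]
    between_individuals = variance(individual_means)

    # Mean of the per-individual replicate variances.
    between_replicates = mean([variance(reps) for reps in replicates_by_individual])

    if between_replicates == 0:
        return float("inf")
    return between_individuals / between_replicates


# A gene whose expression differs far more between people than between replicates.
variable_gene = [[1.0, 1.2], [3.0, 3.2], [5.0, 5.2]]
# A gene whose differences are no larger than replicate noise.
flat_gene = [[2.0, 3.0], [2.4, 2.6], [2.1, 2.9]]

ratio_variable = variance_ratio(variable_gene)
ratio_flat = variance_ratio(flat_gene)
```

A ratio well above 1 would flag a gene as varying more between individuals than between replicates; any threshold applied on top of that is left open here.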

    A study of lineage-specific gene duplication and loss in human and great ape evolution identified those genes that have undergone lineage-specific duplications or contractions among several hominoid lineages. Interspecies cDNA array-based comparative genomic hybridization was used to individually compare copy number variation for 39,711 cDNAs, representing 29,619 human genes, across five hominoid species, including human. A total of 1,005 genes, either as isolated genes or in clusters positionally biased toward rearrangement-prone genomic regions, produced relative hybridization signals unique to one or more of the hominoid lineages.

    Fortna A, Kim Y, MacLaren E, Marshall K, Hahn G, Meltesen L, Brenton M, Hink R, Burgers S, Hernandez-Boussard T, Karimpour-Fard A, Glueck D, McGavran L, Berry R, Pollack J, Sikela JM.
    Lineage-specific gene duplication and loss in human and great ape evolution.
    PLoS Biol. 2004 Jul;2(7):E207. Epub 2004 Jul 13.

    June 30, 2004

    The June 30th updates and additions include:
     -New TCAG gene annotation release 4
     -3,389 UniGene clusters added
     -25,149 HapMap project SNPs added with genotype data
     -95 new clinical cases of defined phenotype
     -Updated HUGO gene nomenclature
  • TCAG Annotations

  • TCAG annotations are manually curated entries using all available data sources to obtain the most complete and accurate description of all chromosome 7 transcripts. This includes incorporating all data from GenBank (EST, mRNA), RefSeq, Ensembl, LocusLink, UCSC, TIGR Human Gene Index, HORDE (Human Olfactory Receptor Genes), Cytochrome P450 DB, ncRNA Databases, DBTSS (Database of Transcription Start Sites), Celera Genomics, TCAG, RIKEN DIGIT gene predictions, Pseudogene databases, H-Invitational database and all relevant publications. Additional information and links to gene expression data, OMIM, GO Annotations, etc. provide a complete centralized resource for the research community. The TCAG Annotation Release 4 includes an additional 153 gene structures, including new known, novel, putative and predicted gene models as well as a number of new pseudogenes and non-coding RNAs (ncRNAs).

  • UniGene

  • UniGene is an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location.

  • HapMap Data

  • The goal of the International HapMap Project is to develop a haplotype map of the human genome, the HapMap, which will describe the common patterns of human DNA sequence variation. The HapMap is expected to be a key resource for researchers to use to find genes affecting health, disease, and responses to drugs and environmental factors. The information produced by the Project will be made freely available.

    The Project is a collaboration among scientists in Japan, the U.K., Canada, China, Nigeria, and the U.S. [see participating groups]. The Project officially started with a meeting on October 27-29, 2002 and is expected to take about three years.

  • Genetic variation and use of the HapMap

  • Most common diseases, such as diabetes, cancer, stroke, heart disease, depression, and asthma, are affected by many genes and environmental factors. Although any two unrelated people are the same at about 99.9% of their DNA sequences, the remaining 0.1% is important because it contains the genetic variants that influence how people differ in their risk of disease or their response to drugs. Discovering the DNA sequence variants that contribute to common disease risk offers one of the best opportunities for understanding the complex causes of disease in humans.

    Currently only genotypes for one population (90 individuals from 30 CEPH family trios) are available, but samples from two other populations (Africa & Asia) will also be genotyped in the project. See project website for further information on samples and overall scientific goals.

    May 7, 2004

    The May 7th updates and additions include:
     -2075 H-Invitational cDNAs added with links to the H-INV db
     -247 pseudogenes from
     -17856 non-human mRNAs aligned to Chr7
     -31 chromosome 7 ultra-conserved elements added to structural features.
     -400 new clinical cases of defined phenotype
  • H-Invitational cDNA

  • H-Invitational Database (H-InvDB) is a human gene database, with integrative annotation of 41,118 full-length cDNA clones currently available from six high throughput cDNA sequencing projects. This database represents 21,037 cDNA clusters describing their gene structures, functions, novel alternative splicing isoforms, non-coding functional RNAs, functional domains, sub-cellular localizations, mapping of SNPs and microsatellite repeat motifs in relation with orphan diseases, gene expression profiling, and comparative results with mouse full-length cDNAs in the context of molecular evolution.

    For more information, visit the H-Invitational website and see the relevant publication listed below.

    Integrative Annotation of 21,037 Human Genes Validated by Full-Length cDNA Clones.
    PLoS Biol. 2004 Apr 20 [Epub ahead of print]
    PMID: 15103394

  • Pseudogene (Gerstein Lab)

  • Pseudogenes are sequences of genomic DNA with such similarity to normal genes that they are regarded as non-functional copies or close relatives of genes. This data track was obtained from the Gerstein Lab at Yale University. For more information, visit the Gerstein Lab website and see the relevant publication listed below.

    Zhang Z, Harrison PM, Liu Y, Gerstein M.
    Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome.
    Genome Res. 13:2541-58 (2003).

  • Non-human mRNA

  • Non-human vertebrate and invertebrate mRNAs were taken from GenBank and then mapped to chromosome 7 using translated BLAT. Cutoffs of 50% identity and 30% coverage were used.
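A filter of this kind can be sketched as below. The page gives only the two cutoff values, so everything else is an assumption made for illustration: the `Alignment` record, its field names, and the definitions of identity (matching bases over aligned bases) and coverage (aligned bases over query length) are hypothetical and may differ from how the track was actually built.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical summary of one mRNA-to-chromosome alignment."""
    query_id: str
    query_length: int   # length of the mRNA query
    aligned_bases: int  # query positions covered by the alignment
    matches: int        # aligned positions that are identical


def passes_cutoffs(aln, min_identity=0.50, min_coverage=0.30):
    """Return True if the alignment meets both cutoffs (assumed definitions)."""
    if aln.query_length <= 0 or aln.aligned_bases <= 0:
        return False
    identity = aln.matches / aln.aligned_bases
    coverage = aln.aligned_bases / aln.query_length
    return identity >= min_identity and coverage >= min_coverage


alignments = [
    Alignment("mrna_a", 1000, 400, 300),  # identity 0.75, coverage 0.40
    Alignment("mrna_b", 1000, 200, 190),  # identity 0.95, coverage 0.20
    Alignment("mrna_c", 1000, 500, 200),  # identity 0.40, coverage 0.50
]
kept = [aln.query_id for aln in alignments if passes_cutoffs(aln)]
```

Only `mrna_a` survives here: the other two each fail one of the two thresholds, which is why both conditions are checked together.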

  • Ultra-conserved elements in the human genome

  • There are 481 segments longer than 200 bp that are absolutely conserved (100% identity with no insertions or deletions) between orthologous regions of the human, rat and mouse genomes. Nearly all of these segments are also conserved in the chicken and dog genomes, with an average of 95% and 99% identity, respectively. Many are also significantly conserved in fish. These ultraconserved elements of the human genome are most often located either overlapping exons in genes involved in RNA processing or in introns or nearby genes involved in regulation of transcription and development. Along with more than 5,000 sequences of over 100 bp that are absolutely conserved among the three sequenced mammals, these represent a class of genetic elements whose functions and evolutionary origins are yet to be determined, but which are more highly conserved between these species than proteins, and appear to be essential for the ontogeny of mammals and other vertebrates.

    Ultraconserved elements in the human genome
    Gill Bejerano, Michael Pheasant, Igor Makunin, Stuart Stephen, W. James Kent, John S. Mattick, and David Haussler
    Published online May 6 2004; 10.1126/science.1098119 (Science Express Reports)
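The definition of an absolutely conserved segment given above (100% identity with no insertions or deletions over a minimum length) can be illustrated on a toy alignment. This is a minimal sketch, not the method used in the publication: the function name `conserved_runs`, the representation of the alignment as equal-length strings with `-` for gaps, and the half-open coordinate convention are all assumptions.

```python
def conserved_runs(aligned_seqs, min_length=200):
    """Find gap-free runs of identical columns in a multiple alignment.

    aligned_seqs: equal-length strings, with '-' marking alignment gaps.
    Returns (start, end) pairs in alignment coordinates, end exclusive.
    """
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("aligned sequences must have equal length")

    runs = []
    start = None
    # Iterate one position past the end so a run reaching the end is closed.
    for i in range(length + 1):
        conserved = False
        if i < length:
            column = {seq[i].upper() for seq in aligned_seqs}
            conserved = len(column) == 1 and "-" not in column
        if conserved:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_length:
                runs.append((start, i))
            start = None
    return runs


# Toy example with a short threshold; column 4 has a mismatch and a gap.
human = "ACGTACGTAC"
mouse = "ACGTTCGTAC"
rat = "ACGT-CGTAC"
runs_min4 = conserved_runs([human, mouse, rat], min_length=4)
runs_min5 = conserved_runs([human, mouse, rat], min_length=5)
```

With the default `min_length=200` the same function would apply the length threshold quoted in the text, assuming the three genomes had already been aligned over orthologous regions.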

    April 19, 2004

    A second-version DNA sequence assembly of human chromosome 7 (named CRA_TCAGchr7v2) has been generated and is publicly available. The assembly, which encompasses 158,329,839 nucleotides (nt) of DNA, represents the most complete description of this chromosome available (download). It was generated utilizing Celera whole genome shotgun information, clone-based sequence from GenBank, and targeted sequence from The Centre for Applied Genomics (TCAG). In comparison to the first version release (Science, 2003) an additional 376,050 nt is present and 23 additional gaps have been filled. The CRA_TCAGchr7v2 assembly, amongst other variations, contains 704,297 nt of verified sequence not found in NCBI's Build 34 of chromosome 7.

    To fulfill the objective of having every biologically and medically relevant feature annotated along the sequence assembly, all annotation records have been updated and new ones added. For example, the new assembly displays information from the most current RefSeq, Ensembl, Celera published, UCSC, TIGR Human Gene Index, dbEST and mRNA datasets. Unique TCAG annotations have also been added including a new 'experimentally-confirmed' gene track, all known structural features, as well as 619 new clinical cases of defined phenotype. A new proprietary segmental duplication browser has also been developed for the chromosome 7 browser.

    February 5, 2004

    The February 5th updates and additions include:
     -78 new clinical cases added
     -Affymetrix 10K Xba 131 SNP data track added
     -Predicted microRNA targets added (Lewis BP, Shih IH, Jones-Rhoades MW, Bartel DP, Burge CB. Cell. 115(7): 787-98.)
     -Gene expression data from GNF gene expression atlas added for 381 genes
     -New SNP data added
     -Updated HUGO gene nomenclature

    December 19, 2003

    The December 19th updates and additions include:
     -TCAG gene annotations updated
     -FOSMID data track added
     -721 new mRNA sequences added to database
     -110 new clinical cases
     -Spectral genomics clones added
     -Updated HUGO gene nomenclature
     -Kazusa Institute gene expression (FLJ, KIAA) information added in gene summary page
     -Affymetrix gene chip information added in gene summary page
     -Chimpanzee genomic clones added to database

    November 6, 2003

    The November 6th updates and additions include:
     -Fugu comparative sequence conservation
     -RIKEN mouse imprinted gene data added (PubMed 12819139)
     -140 new clinical cases
     -New gene predictions added
     -New segmental duplication analysis
     -Updated HUGO gene nomenclature
     -ABI Gene Expression Assay on Demand information added in gene summary page
     -Bay Genomics GeneTrap ES cell line information added in gene summary page
     -New STS markers added

    October 15, 2003

    TCAG chromosome 7 sequence and annotations have been made available at NCBI, and can be viewed using Map View.

    September 18, 2003

    The September 18th update includes mouse, rat and Tetraodon comparative sequence data, along with updated RefSeq and Ensembl annotations. Promoter data from the Eukaryotic Promoter Database has been added along with new EST and mRNA sequences.
    The updates and additions include:
     -Updated RefSeq annotations
     -Updated Ensembl annotations
     -Mouse comparative sequence conservation
     -Rat comparative sequence conservation
     -Tetraodon comparative sequence conservation
     -Promoters from Eukaryotic Promoter Database added
     -Promoters added from "Identification and functional analysis of human transcriptional promoters" Genome Res. 2003 13: 308-312 (website)
     -1951 new EST sequences
     -471 new mRNA sequences
     -138 new BAC end sequences

    August 21, 2003

    The August 21st updates include 243 new clinical cases, and updated gene annotations.
    The updates and additions include:
     -Updated TCAG annotations
     -243 new clinical cases
     -2 new mouse imprinted genes (Pon2, Pon3)
     -SHH regulatory element (ZRS)
     -New 5' sequence for DLX6

    July 10, 2003

    The July 10th updates include the updated TIGR Human Gene Index release 12.0 (June 2003). The latest June dbEST dataset has been added, along with a number of new clinical data entries. Additional 5' sequence for 158 genes has also been added to the current annotations from the Database of Transcription Start Sites (DBTSS).
    The updates and additions include:
     - 4650 new EST sequences (June 30th dbEST updates)
     - TIGR HGI release 12.0 displayed June 2003 (29212 entries)
     - Additional 5' sequence for 158 genes added from Database of Transcription Start Sites (DBTSS)
     - 232 clinical cases added

    July 9, 2003

    WashU publishes chromosome 7 paper in Nature.

    June 15, 2003

    The June 15th updates include new mRNA and BAC end sequences and new STS and genetic markers. DLX5 has been added as an imprinted gene in the structural features track.
    The updates and additions include:
     - 207 new mRNA sequences added
     - STS and genetic marker datasets updated
     - BAC end data updated (153 new entries)
     - New imprinted gene DLX5 added

    May 27, 2003

    Multiple tracks in GBrowse and the associated databases have been updated:
     - 183 new gene structures (272 with variants)
     - New track: RIKEN DIGIT gene predictions (211 entries)
     - 66 new breakpoints
     - 72 new STS markers
     - 21 new FISH clones
     - 117 new BAC ends
     - 4 new genetic markers
     - 132 sequence variations and 2 mouse imprinted genes added to the structural features
     - updated ab initio gene predictions (Genscan and HMMgene)

    May 8, 2003

    799 new mRNAs and 1212 new ESTs added to GBrowse.

    April 29, 2003

    BLAT alignment tool implemented for sequence search.

    April 14, 2003

    The International Human Genome Sequencing Consortium announced the completion of the human DNA sequence.

    April 10, 2003

    TCAG publishes chromosome 7 paper in Science. Launch of this website.