Glossary
This page provides explanations about terms and acronyms often used within ERGA and in the context of Biodiversity Genomics.
(Genome) annotation
The process of identifying the functions of different pieces of a genome. This includes genes that code for proteins and non-coding features (e.g. intron-exon structure of protein-coding genes, promoters, transposable elements). Typically performed using computational methods, followed by manual curation.
(Genome) assembly
A genome assembly is a representation of an organism’s genome that is made using computer programs to turn (assemble) raw sequence data into longer, continuous sequences.
(Genome) completeness
An estimate of how well a reference genome represents the complete sequence of the target organism. The total length of a complete assembly should equal the haploid genome size of the target. More strictly, a genome may be defined as complete when ‘all chromosomes are gapless and have no runs of 10 or more ambiguous bases, there are no unplaced or unlocalized scaffolds, and all expected chromosomes are present’ (https://www.ncbi.nlm.nih.gov/assembly/). Completeness can be estimated with different approaches, such as BUSCO or K-mer analysis.
ABS
Access & Benefit Sharing
BGE
Biodiversity Genomics Europe. The BGE Project has received funding through a Horizon Europe call on Biodiversity and Ecosystem Services. The overarching BGE project includes two streams of genomic research, reference genomes and barcoding, in an effort to establish ERGA and BIOSCAN as the European nodes of the Earth BioGenome Project and of the International Barcode of Life (iBOL), respectively.
BUSCO
A bioinformatic method (Benchmarking Universal Single-Copy Orthologues) used to estimate the completeness of the coding fraction of an organism’s genome, based on the proportion of (lineage-specific) single-copy orthologous genes that are found in a genome assembly.
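As a rough illustration of how such a proportion is reported, the sketch below turns counts of expected orthologues into percentages. The four categories (complete single-copy, complete duplicated, fragmented, missing) follow the usual BUSCO summary; the function name `busco_summary` and the example counts are hypothetical and are not taken from any real assembly or from the BUSCO software itself.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Convert orthologue counts into percentages of the searched gene set.

    All four arguments are counts of expected single-copy orthologues
    falling into each category (hypothetical inputs for illustration).
    """
    total = single + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("at least one orthologue must have been searched")

    def pct(count):
        return 100.0 * count / total

    return {
        "complete": pct(single + duplicated),  # complete = single + duplicated
        "single": pct(single),
        "duplicated": pct(duplicated),
        "fragmented": pct(fragmented),
        "missing": pct(missing),
    }


# Made-up counts: 100 orthologues searched in an assembly.
summary = busco_summary(single=90, duplicated=4, fragmented=3, missing=3)
print(summary["complete"], summary["duplicated"])
```

A high "complete" percentage with a low "duplicated" percentage suggests that the coding fraction is well represented without spurious duplication.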
Biodiversity genomics
The application of genomic methods to research biodiversity.
CARE Principles
The CARE Principles for Indigenous Data Governance (https://www.gida-global.org/care) provide a governance framework that supports the recognition of Indigenous Peoples’ rights and interests in their physical and digital data as well as their Indigenous Knowledges.
COPO
The Collaborative OPen Omics (COPO) platform is for researchers to publish their research assets, providing metadata annotation and deposition capability. It allows researchers to describe their datasets according to community standards and broker the submission of such data to appropriate repositories whilst tracking the resulting accessions/identifiers. Learn more about COPO in this article by the Earlham Institute.
Chromosome-level assembly
The process of generating a contiguous sequence of all chromosomes of a genome, often aided by genetic maps or proximity ligation techniques (3C-seq, Hi-C). The term also refers to the resulting genome sequence.
Council meetings
During the monthly ERGA council meetings, the representatives of countries and other genome projects associated with ERGA meet to discuss and vote on important matters related to ERGA’s governance and actions. The council is the main decision-making body of the consortium. Learn more about ERGA's structure in our Governance Document.
DAC
Data Analysis Committee
DSI
Digital Sequence Information - learn more: https://www.cbd.int/dsi-gr/
DToL
The Darwin Tree of Life Project aims to sequence the genomes of 70,000 species of eukaryotic organisms in Britain and Ireland.
EBP Genome assembly quality standard 6.C.Q40
Minimum reference standard of 6.C.Q40, i.e. megabase contig N50 and chromosome-scale scaffold N50, with an error rate of less than 1/10,000. For species with chromosome N50 smaller than a megabase this will be C.C.Q40. Additional recommendations include K-mer completeness >90%, BUSCO complete single-copy >90%, BUSCO complete duplicated <5%, and Gaps/Gbp <1000.
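The two quantities behind this notation can be sketched in a few lines. The example below computes an N50 from a list of sequence lengths and converts a Phred-scaled quality value (Q) into an error rate using the standard relation Q = -10 log10(p). The function names and the toy lengths are hypothetical, and reading the leading "6" as the order of magnitude of the contig N50 is an assumption made for illustration.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembled length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must not be empty")


def phred_to_error_rate(q):
    """Convert a Phred-scaled quality value into an error probability."""
    return 10 ** (-q / 10)


def order_of_magnitude(length_bp):
    """Number of digits minus one, e.g. 1,000,000 bp -> 6 (assumed reading
    of the first element of the x.y.Q notation)."""
    return len(str(int(length_bp))) - 1


# Toy contig lengths in bp, not from a real assembly.
contigs = [2, 3, 4, 5, 6]
print(n50(contigs))                   # half of 20 bp is reached at the 5 bp contig
print(phred_to_error_rate(40))        # Q40 is about 1 error in 10,000 bases
print(order_of_magnitude(1_500_000))  # a 1.5 Mb contig N50 gives 6
```

Under these assumptions, an assembly with a contig N50 of at least 1 Mb and a consensus quality of Q40 or better would satisfy the first and last elements of 6.C.Q40.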
ENA
The European Nucleotide Archive (https://www.ebi.ac.uk/ena) is a global repository for sequence data and provides resources that support management and access to sequence data.
ERGA
European Reference Genome Atlas
ERGA Plenary
Our plenary meetings are open to all registered ERGA members and generally include short updates given by committee chairs and one invited talk on various themes connected to biodiversity genomics (watch the previous ones here).
ERGANews
ERGA’s monthly newsletter includes important updates about the consortium, each of the committees and associated projects. Our newsletters are usually published on the first Tuesday of each month. All editions of the newsletter are stored here.
Equity Deserving
According to the Canada Council for the Arts (https://canadacouncil.ca/glossary/equity-seeking-groups), equity deserving groups are those individual researchers, communities, Peoples, regions or countries that have identified barriers to equal access, opportunities, and resources due to disadvantage and/or discrimination, and that are actively seeking, and deserving of, social justice and reparation. The discrimination experienced could be caused by attitudinal, historic, social, and environmental barriers based on a range of characteristics including (but not limited to) sex, age, ethnicity, disability, economic status, gender, gender expression, nationality, race, sexual orientation, and creed.
FAIR Principles
A set of principles to guide appropriate management and curation of scientific data (https://www.go-fair.org/fair-principles/) that emphasise data accessibility and use by ensuring that data are Findable, Accessible, Interoperable, and Reusable. Due to the increasing amount of scientific data being deposited, FAIR guidelines promote a data format that is amenable to automated computational access of data by stakeholders.
HE
Horizon Europe. The abbreviation sometimes refers to the BGE project, which is funded under HE.
HSM
Hierarchical Storage Management is both a data management and data storage technique which transparently manages the movement of data between the different layers of a tiered storage based on file size thresholds, usage and I/O pressure. Usually, a tiered storage is composed of one or more layers of disk arrays, ordered by capacity, latency, redundancy and storage cost. A slow but economically effective archival layer is at the bottom, composed of magnetic tape libraries and automated tape robots, with the highest capacity and latency. The movement between layers is automatically triggered.
Haplotype
A haplotype refers to the collection of genetic material within an organism that is inherited together. Haplotype may be used to describe a few loci or any number of chromosomes (a chromosome-scale haplotype).
Hi-C
Sequencing-based method used to study three-dimensional interactions among chromatin regions by measuring the frequency of contact between pairs of loci. Since contact frequency is related to the distance between a pair of loci, Hi-C linking information is used to help with scaffolding stages during a genome assembly process.
Hi-C map / graph production
The occurrence and frequency of Hi-C contacts are analysed and used in assembly scaffolding. They are typically visualised in Hi-C 2D heatmaps with the full genome sequence on the X and Y axis and a markup for each observed contact.
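A minimal sketch of how such a heatmap matrix can be built, assuming contacts are given as pairs of 0-based positions along a single concatenated genome sequence. The function name `contact_matrix`, the bin size and the toy contacts are hypothetical; real pipelines work from aligned read pairs and use dedicated tools.

```python
def contact_matrix(contacts, genome_length, bin_size):
    """Count Hi-C contacts per pair of fixed-size bins.

    contacts: iterable of (pos_a, pos_b) 0-based positions (assumed format).
    Returns a symmetric square matrix (list of lists) of contact counts.
    """
    n_bins = -(-genome_length // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in contacts:
        i, j = pos_a // bin_size, pos_b // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # mirror the count so X and Y axes match
    return matrix


# Toy data: three contacts on a 30 bp "genome", 10 bp bins.
toy_contacts = [(5, 15), (12, 18), (3, 27)]
for row in contact_matrix(toy_contacts, genome_length=30, bin_size=10):
    print(row)
```

Plotting this matrix as a heatmap gives the familiar picture in which nearby loci, close to the diagonal, tend to show the most contacts.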
HiFi reads
HiFi (High Fidelity) PacBio reads are produced by sequencing the same molecule multiple times to generate a consensus sequence that is usually 12-20 kbp long and has a low error rate (consensus accuracy >99.9%).
INSDC
The International Nucleotide Sequence Database Collaboration (https://www.insdc.org/) is an initiative between the DDBJ, EMBL-EBI and NCBI, which together act as a global repository of sequence data and associated metadata and provide tools and services that allow access to genomic resources.
IsoSeq
A sequencing protocol developed by PacBio that aims to sequence full-length transcripts using the accurate, long-read capabilities of PacBio HiFi technology. IsoSeq data facilitate analysis of transcriptomes and genome annotation by identifying full-length isoforms of transcripts.
JEDI / DEIJ
Justice, Equity, Diversity, and Inclusion Subcommittee
K-mer
A K-mer is a DNA sequence of length k; for example, the sequence AGCT contains the 3-mers (K-mers of length 3) AGC and GCT.
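The example above can be reproduced with a short sliding-window sketch. The function name `kmers` is hypothetical; counting the resulting K-mers with `collections.Counter` is one simple way to get the kind of K-mer frequencies used in completeness estimates.

```python
from collections import Counter


def kmers(sequence, k):
    """Return every K-mer of length k in sequence, in order of occurrence."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # A sequence of length n contains n - k + 1 overlapping K-mers.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(kmers("AGCT", 3))  # ['AGC', 'GCT']

# Counting K-mers in a slightly longer toy sequence.
counts = Counter(kmers("AGCTAGC", 3))
print(counts["AGC"])  # 2
```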
Library
DNA, cDNA, or RNA that has been prepared for NGS within (usually) a specific size range and containing adapters, which are designed to be appropriate for (a) specific sequencing platform(s).
Metadata
A collection of data that provides contextual information about multiple characteristics of other, corresponding original data.
ONT
Oxford Nanopore Technologies (ONT; https://nanoporetech.com/) provides a next-generation sequencing technology in which sequence data are generated from the changes in current that occur as single-stranded DNA or RNA molecules pass through nanoscale protein pores (nanopores). ONT produces long-read data (up to several megabases) that facilitate genome assembly.
Omni-C
Modified version of Hi-C that uses a sequence-independent endonuclease in its protocol to produce more even sequence coverage, increasing overall resolution.
Open data
Open data are freely accessible and unrestricted data that can be accessed, used, reused and shared with third parties for any purpose.
PUID
A permanent unique identifier is a unique label for an object that does not change, such as the Digital Object Identifier (DOI) attached to a scientific publication.
PacBio
Pacific Biosciences (PacBio; https://www.pacb.com/) provides a single-molecule, real-time (SMRT) next-generation sequencing technology in which sequence data are generated by fluorescent light emission that occurs when a DNA polymerase adds nucleotides. PacBio produces long-read data (tens of kilobases) that facilitate genome assembly.
RNA-Seq
RNA-Seq is a technique that determines the complete or partial RNA sequence using NGS. The RNA expression profiles vary in different tissues of the same organism and can be influenced by physiopathological circumstances. RNA-Seq data facilitate genome annotation by providing empirical evidence for transcribed regions.
Reference genome
An accepted standard representation of an organism’s DNA sequence. High-quality reference genomes typically have high completeness (chromosome-level with few gaps in sequence), few errors, and are annotated and accessible. A reference genome serves as a tool for alignment-based analyses, such as variant calling or RNAseq, and has many other applications, for example, phylogenetics and evolutionary relationships, identification of genes and variants, functional analysis and comparative genomics. Reference genomes referred to as “drafts” are those that are under active construction and refinement, and not yet finalised through manual curation.
SOP
A standard operating procedure (SOP) is a document that provides detailed instructions on how to perform an activity, outlining the step-by-step process required for its execution.
Transcriptome
A transcriptome is a set of aligned RNA-Seq reads representing RNA collected from a sample or collection of samples. This includes both protein-coding and non-coding transcripts.
Voucher
A voucher specimen is a permanently preserved object (either whole or in part, and/or physical or digital) of an identified organism (verified by a recognised expert) and which is deposited in an accessible facility or database. A voucher provides physical evidence about any specimen’s taxonomic identity. Voucher deposition is a best practice for conducting biodiversity genomics research.
References
- The European Reference Genome Atlas: piloting a decentralised approach to equitable biodiversity genomics (Glossary); bioRxiv 2023.09.25.559365; doi: https://doi.org/10.1101/2023.09.25.559365
- How genomics can help biodiversity conservation; doi: https://doi.org/10.1016/j.tig.2023.01.005