Glossary

(Genome) annotation

The process of identifying the functions of different pieces of a genome. This includes genes that code for proteins and non-coding features (e.g. intron-exon structure of protein-coding genes, promoters, transposable elements). Annotation is typically performed using computational methods, followed by manual curation.

(Genome) assembly

A genome assembly is a representation of an organism’s genome that is made using computer programs to turn (assemble) raw sequence data into longer, continuous sequences.

(Genome) completeness

An estimate of how well a reference genome represents the complete sequence of the target organism. The length of a complete genome should equal the haploid genome size of the target, but a genome may also be defined as complete when ‘all chromosomes are gapless and have no runs of 10 or more ambiguous bases, there are no unplaced or unlocalized scaffolds, and all expected chromosomes are present.’ (https://www.ncbi.nlm.nih.gov/assembly/). Completeness can be estimated in several ways, for example with BUSCO or K-mer analysis.

ABS

Access & Benefit Sharing

BGE

Biodiversity Genomics Europe. The BGE Project has received funding through a Horizon Europe call on Biodiversity and Ecosystem Services. The overarching BGE project includes two streams of genomic research: reference genomes and barcoding, in an effort to establish ERGA and BIOSCAN as the European nodes of the Earth Biogenome Project and of the International Barcode of Life (IBOL), respectively.

BUSCO

A bioinformatic method (Benchmarking Universal Single-Copy Orthologues) used to estimate the completeness of the coding fraction of an organism’s genome based on the proportion of (lineage specific) single copy orthologous genes that are found in a genome assembly.
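
As a rough illustration of the "proportion of orthologues found" idea, the sketch below turns category counts into percentages. It is not the BUSCO software itself: the function name and the example counts are hypothetical, and the four categories (single-copy, duplicated, fragmented, missing) are assumed to follow the usual BUSCO summary.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Convert orthologue counts into percentages of the lineage set searched."""
    total = single + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("at least one orthologue must have been searched")

    def pct(count):
        # Percentage of the whole lineage set, rounded to one decimal place.
        return round(100 * count / total, 1)

    return {
        "complete": pct(single + duplicated),  # complete = single-copy + duplicated
        "single_copy": pct(single),
        "duplicated": pct(duplicated),
        "fragmented": pct(fragmented),
        "missing": pct(missing),
    }


# Hypothetical counts for a lineage set of 1,000 orthologues.
summary = busco_summary(single=900, duplicated=30, fragmented=20, missing=50)
print(summary)
```

A high "complete" percentage with a low "duplicated" percentage suggests that the coding fraction is well represented without excess redundancy in the assembly.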

Biodiversity genomics

The application of genomic methods to research biodiversity.

CARE Principles

The CARE principles for Indigenous data governance (https://www.gida-global.org/care) provide a governance framework that supports the recognition of Indigenous Peoples’ rights and interests in their physical and digital data as well as their Indigenous Knowledges.

CBD

Convention on Biological Diversity

COPO

The Collaborative OPen Omics (COPO) platform is for researchers to publish their research assets, providing metadata annotation and deposition capability. It allows researchers to describe their datasets according to community standards and broker the submission of such data to appropriate repositories whilst tracking the resulting accessions/identifiers. Learn more about COPO in this article by the Earlham Institute.

CS

Citizen Science Committee

Chromosome-level assembly

The process of generating a contiguous sequence of all chromosomes of a genome, often aided by genetic maps or proximity ligation techniques (3C-seq, Hi-C); the term is also used to refer to the resulting genome sequence.

Council meetings

During the monthly ERGA council meetings, the representatives of countries and other genome projects associated with ERGA meet to discuss and vote on important matters related to ERGA’s governance and actions. The council is the main decision-making body of the consortium. Learn more about ERGA's structure in our Governance Document.

DAC

Data Analysis Committee

DSI

Digital Sequence Information - learn more: https://www.cbd.int/dsi-gr/

DToL

The Darwin Tree of Life Project aims to sequence the genomes of 70,000 species of eukaryotic organisms in Britain and Ireland.

EBP

The Earth BioGenome Project

EBP Genome assembly quality standard 6.C.Q40

Minimum reference standard of 6.C.Q40, i.e. megabase N50 contig continuity and chromosomal scale N50 scaffolding, with an error rate of less than 1/10,000 (Q40). For species with chromosome N50 smaller than a megabase this will be C.C.Q40. Additional recommendations include K-mer completeness >90%, BUSCO complete single-copy >90%, BUSCO complete duplicated <5%, and Gaps/Gbp <1000.
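
The two numeric parts of this standard can be checked with a few lines of code. The following is a minimal sketch, assuming the common definition of N50 (the length at which sequences of that length or longer cover at least half of the assembly) and the Phred scale for quality (Q = -10 log10 of the error rate). The function names and contig lengths are hypothetical.

```python
import math


def n50(lengths):
    """Return the length L such that sequences >= L cover at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


def phred_quality(error_rate):
    """Convert a per-base error rate into a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


# Hypothetical contig lengths in base pairs.
contigs = [4_000_000, 2_500_000, 1_200_000, 800_000, 300_000]
contig_n50 = n50(contigs)

# The leading "6" is read here as a contig N50 of at least 10**6 bp (one megabase).
print(contig_n50, contig_n50 >= 10**6)

# An error rate of 1 in 10,000 corresponds to Q40.
print(round(phred_quality(1 / 10_000), 6))
```

The "C" for chromosomal scale scaffolding cannot be checked from lengths alone, since it depends on the expected chromosome number and sizes of the species.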

EC

European Commission

ELSI

Ethical, Legal, and Social Issues (Committee)

ENA

The European Nucleotide Archive (https://www.ebi.ac.uk/ena) is a global repository for sequence data and provides resources that support management and access to sequence data.

ERGA

European Reference Genome Atlas

ERGA Plenary

Our plenary meetings are open to all registered ERGA members and generally include short updates given by committee chairs and one invited talk on various themes connected to biodiversity genomics (watch the previous ones here).

ERGANews

ERGA’s monthly newsletter includes important updates about the consortium, each of the committees and associated projects. Our newsletters are usually published on the first Tuesday of each month. All editions of the newsletter are stored here.

Equity Deserving

According to the Canada Council for the Arts (https://canadacouncil.ca/glossary/equity-seeking-groups), equity deserving groups are those individual researchers, communities, Peoples, regions or countries that have identified barriers to equal access, opportunities, and resources due to disadvantage and/or discrimination and that are actively seeking, and deserving of, social justice and reparation. The discrimination experienced could be caused by attitudinal, historic, social, and environmental barriers based on a range of characteristics including (but not limited to) sex, age, ethnicity, disability, economic status, gender, gender expression, nationality, race, sexual orientation, and creed.

FAIR Principles

A set of principles to guide appropriate management and curation of scientific data (https://www.go-fair.org/fair-principles/) that emphasise data accessibility and use by ensuring that data are Findable, Accessible, Interoperable, and Reusable. Due to the increasing amount of scientific data being deposited, FAIR guidelines promote a data format that is amenable to automated computational access of data by stakeholders.

Galaxy

Galaxy is an open source, web-based platform for data intensive biomedical research.

Genome Report

A genome report is a technical publication that describes all the steps taken to produce a reference genome: sampling, sequencing, assembling, annotating. They often have a standardised format and structure that allows readers to quickly and easily understand the quality of the genome and how it was generated.

GoaT

Genomes On A Tree

HE

Horizon Europe; sometimes used to refer to the BGE project funded under HE

HSM

Hierarchical Storage Management is both a data management and data storage technique which transparently manages the movement of data between the different layers of a tiered storage system based on file size thresholds, usage and I/O pressure. Usually, tiered storage is composed of one or more layers of disk arrays, ordered by capacity, latency, redundancy and storage cost. A slow but cost-effective archival layer is at the bottom, composed of magnetic tape libraries and automated tape robots, with the highest capacity and latency. The movement between layers is automatically triggered.

Haplotype

A haplotype refers to the collection of genetic material within an organism that is inherited together. Haplotype may be used to describe a few loci or any number of chromosomes (a chromosome-scale haplotype).

Hi-C

Sequencing-based method used to study three-dimensional interactions among chromatin regions by measuring the frequency of contact between pairs of loci. Since contact frequency is related to the distance between a pair of loci, Hi-C linking information is used to help with scaffolding stages during a genome assembly process.

Hi-C map / graph production

The occurrence and frequency of Hi-C contacts are analysed and used in assembly scaffolding. They are typically visualised in Hi-C 2D heatmaps with the full genome sequence on the X and Y axis and a markup for each observed contact.
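
The heatmap described above is essentially a symmetric matrix of contact counts between genomic bins. The sketch below builds such a matrix for a single toy sequence. The function name, the 0-based coordinates and the example contacts are all hypothetical, and real Hi-C pipelines add read mapping, filtering and normalisation steps that are not shown here.

```python
def contact_matrix(contacts, genome_length, bin_size):
    """Count Hi-C contacts per pair of fixed-size bins along one sequence.

    contacts: iterable of (pos_a, pos_b) pairs, 0-based positions < genome_length.
    """
    n_bins = -(-genome_length // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in contacts:
        i, j = pos_a // bin_size, pos_b // bin_size
        matrix[i][j] += 1
        if i != j:
            # Contacts are unordered, so mirror them across the diagonal.
            matrix[j][i] += 1
    return matrix


# Toy example: a 30 bp "genome" split into three 10 bp bins.
toy_contacts = [(5, 15), (12, 18), (3, 27)]
for row in contact_matrix(toy_contacts, genome_length=30, bin_size=10):
    print(row)
```

Plotting this matrix with the same bin order on both axes gives the 2D heatmap, in which nearby loci tend to show more contacts than distant ones.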

HiFi reads

HiFi (High Fidelity) PacBio reads are produced by sequencing the same molecule multiple times to provide a consensus sequence that is usually 12-20 kbp long and has a low error rate (>99.9% consensus accuracy).

INSDC

International Nucleotide Sequence Database Collaboration (https://www.insdc.org/) is an initiative between the DDBJ, EMBL-EBI and NCBI that together act as a global repository of sequence data and associated metadata, and provide tools and services that allow access to genomic resources.

ITIC

IT & Infrastructure Committee

IsoSeq

This is a sequencing protocol developed by PacBio that aims to sequence full-length transcripts using the accurate, long read capabilities of PacBio HiFi technology. IsoSeq data facilitate analysis of transcriptomes and genome annotation by identifying full-length isoforms of transcripts.

JEDI / DEIJ

Justice, Equity, Diversity, and Inclusion Subcommittee

K-mer

A K-mer is a DNA sequence of length k; for example, the sequence AGCT contains the 3-mers (K-mers of length 3) AGC and GCT.
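
The example in the definition can be reproduced with a sliding window of width k. This is a minimal sketch, and the function name is chosen here for illustration only.

```python
def kmers(sequence, k):
    """Return every K-mer of `sequence` in order, using a sliding window of width k."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # A sequence of length n contains n - k + 1 K-mers (none if n < k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(kmers("AGCT", 3))  # ['AGC', 'GCT']
```

Counting how often each K-mer occurs in the raw reads and in the assembly is the basis of the K-mer completeness estimates mentioned elsewhere in this glossary.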

Library

DNA, cDNA, or RNA that has been prepared for NGS within (usually) a specific size range and containing adapters, which are designed to be appropriate for (a) specific sequencing platform(s).

M&C

Media & Communications Committee

Metadata

A collection of data that provides contextual information about multiple characteristics of other, corresponding original data.

ONT

Oxford Nanopore Technologies (ONT; https://nanoporetech.com/) is a next generation sequencing technology whereby sequence data are generated from the changes in current that occur as single-stranded DNA or RNA molecules pass through nanoscale protein pores (nanopores). ONT provides long read data (up to several megabases) that facilitate genome assembly.

Omni-C

Modified version of Hi-C that uses a sequence-independent endonuclease during its protocol to produce more even sequence coverage, increasing overall resolution.

Open data

Open data are freely accessible and unrestricted data that can be accessed, used, reused and shared with third parties for any purpose.

PUID

A permanent unique identifier is a unique label for an object that does not change, such as the Digital Object Identifier (DOI) attached to a scientific publication.

PacBio

Pacific Biosciences (PacBio; https://www.pacb.com/) is a single-molecule, real time (SMRT) next generation sequencing technology in which sequence data are generated by fluorescent light emission that occurs when a DNA polymerase adds nucleotides. PacBio produces long read data (tens of kilobases) that facilitate genome assembly.

RNA-Seq

RNA-Seq is a technique that determines the complete or partial RNA sequence using NGS. The RNA expression profiles vary in different tissues of the same organism and can be influenced by physiopathological circumstances. RNA-Seq data facilitate genome annotation by providing empirical evidence for transcribed regions.

Reference genome

An accepted standard representation of an organism’s DNA sequence. High-quality reference genomes typically have high completeness (chromosome-level with few gaps in sequence), few errors, and are annotated and accessible. A reference genome serves as a tool for alignment-based analyses, such as variant calling or RNA-Seq, and has many other applications, for example, phylogenetics and evolutionary relationships, identification of genes and variants, functional analysis and comparative genomics. Reference genomes referred to as “drafts” are those that are under active construction and refinement, and not yet finalised through manual curation.

SAC

Sequencing and Assembly Committee

SOP

A standard operating procedure (SOP) is a document that provides detailed instructions on how to perform an activity, outlining the step-by-step process required for its execution.

References

The European Reference Genome Atlas: piloting a decentralised approach to equitable biodiversity genomics (Glossary); bioRxiv 2023.09.25.559365; doi: https://doi.org/10.1101/2023.09.25.559365
How genomics can help biodiversity conservation; doi: https://doi.org/10.1016/j.tig.2023.01.005
