Global Biodiversity Information Facility

Dataset . 2025

License: CC BY

Data sources: Datacite


INSDC Sequences

Name: INSDC Sequences
Creator: European Bioinformatics Institute (EMBL-EBI)
Keywords: Metadata

Research data: Dataset. 01 Jan 2025. English. Publisher: European Nucleotide Archive (EMBL-EBI)

Authors: European Bioinformatics Institute (EMBL-EBI)

doi: 10.15468/sbmztx


Abstract

This dataset contains INSDC sequence records not associated with environmental sample identifiers or host organisms. The dataset is prepared periodically using the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) by querying data with the search parameters `environmental_sample=False & host=""`.

EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).

The data was then processed as follows:

1. Human sequences were excluded.
2. For non-CONTIG records, the sample accession number (when available) along with the scientific name was used to identify sequence records corresponding to the same individuals (or group of organisms of the same species in the same sample). Only one record was kept for each scientific name/sample accession number.
3. Contigs and whole genome shotgun (WGS) records were added individually.
4. Records that were missing some information were excluded. Only records associated with a specimen voucher or records containing both a location AND a date were kept.
5. Records associated with the same vouchers were aggregated together.
6. Many of the remaining records corresponded to individual sequences or reads from the same organisms. In practice, these were "duplicate" occurrence records that weren't filtered out in step 2 because the sample accession number was missing. To identify those potential duplicates, we grouped all the remaining records by `scientific_name`, `collection_date`, `location`, `country`, `identified_by`, `collected_by` and `sample_accession` (when available). Then we excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: https://github.com/gbif/embl-adapter/issues/10#issuecomment-855757978
7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, phylum, class, order, family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip

More information is available here: https://github.com/gbif/embl-adapter#readme

You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md
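The filtering described in steps 1, 4 and 6 of the abstract can be illustrated with a minimal Python sketch. This is not the embl-adapter code: the function names, the `specimen_voucher` key, and the use of the scientific name `Homo sapiens` to detect human sequences are assumptions made for illustration, and records are modelled as plain dictionaries. Steps 2, 3, 5 and 7 are omitted, so the grouping here is applied to every record that survives step 4.

```python
from collections import defaultdict

# Threshold stated in the abstract: groups with more than 50 records are excluded.
MAX_GROUP_SIZE = 50

# Grouping fields listed in step 6 of the abstract.
GROUP_KEYS = (
    "scientific_name",
    "collection_date",
    "location",
    "country",
    "identified_by",
    "collected_by",
    "sample_accession",
)


def has_minimum_information(record):
    """Step 4: keep records with a voucher, or with both a location and a date.

    The "specimen_voucher" key is an assumed field name.
    """
    if record.get("specimen_voucher"):
        return True
    return bool(record.get("location")) and bool(record.get("collection_date"))


def drop_oversized_groups(records, max_size=MAX_GROUP_SIZE):
    """Step 6: group records by GROUP_KEYS and drop groups larger than max_size."""
    groups = defaultdict(list)
    for record in records:
        # Missing or empty fields are treated as "" so they group together.
        key = tuple(record.get(name) or "" for name in GROUP_KEYS)
        groups[key].append(record)
    kept = []
    for members in groups.values():
        if len(members) <= max_size:
            kept.extend(members)
    return kept


def process(records):
    """Apply steps 1, 4 and 6 in order (hypothetical helper)."""
    # Step 1: exclude human sequences (assumed to be identified by name).
    non_human = [r for r in records if r.get("scientific_name") != "Homo sapiens"]
    # Step 4: exclude records that lack the minimum information.
    complete = [r for r in non_human if has_minimum_information(r)]
    # Step 6: exclude groups of likely duplicates.
    return drop_oversized_groups(complete)
```

Grouping on a tuple of field values keeps the sketch free of dependencies; a real pipeline working on the full ENA export would more likely run the same logic in a dataframe or database query.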


199 Research products

- Occurrence Download, 2024, IsSourceOf
- Occurrence Download, 2023, IsSourceOf
- Occurrence Download, 2023, IsSourceOf
- Occurrence Download, 2024, IsSourceOf
- Occurrence Download, 2022, IsSourceOf
- Occurrence Download, 2024, IsSourceOf
- Occurrence Download, 2025, IsSourceOf
- Occurrence Download, 2025, IsSourceOf
- Occurrence Download, 2022, IsSourceOf
- Occurrence Download, 2022, IsSourceOf


Impact by BIP!

- citations: 1. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.