Biotechnology — Data publication — Preliminary considerations and concepts

This document reviews best practices that: a) respect the existing standardization efforts of life sciences research communities; b) normalize key aspects of data description particularly at the level of the biology being studied (and shared) across the life sciences communities; c) ensure that data are “findable” and useable by other researchers; and d) provide guidance and metrics for assessing the applicability of a particular data sharing plan. This document is applicable to domains in life sciences including biotechnology, genomics (including massively parallel nucleotide sequencing, metagenomics, epigenomics and functional genomics), transcriptomics, translatomics, proteomics, metabolomics, lipidomics, glycomics, enzymology, immunochemistry, life science imaging, synthetic biology, systems biology, systems medicine and related fields.

Biotechnologie — Publication de données — Considérations et concepts préliminaires

General Information

Status: Published
Publication Date: 17-May-2021
Current Stage: 6060 - International Standard published
Start Date: 18-May-2021
Due Date: 10-Oct-2022
Completion Date: 18-May-2021

Technical report
ISO/TR 3985:2021 - Biotechnology -- Data publication -- Preliminary considerations and concepts
English language
19 pages
Draft
ISO/PRF TR 3985:Version 27-mar-2021 - Biotechnology -- Data publication -- Preliminary considerations and concepts
English language
19 pages

Standards Content (Sample)

TECHNICAL REPORT ISO/TR 3985
First edition 2021-05
Biotechnology — Data publication — Preliminary considerations and concepts
Biotechnologie — Publication de données — Considérations et concepts préliminaires
Reference number ISO/TR 3985:2021(E)
© ISO 2021

COPYRIGHT PROTECTED DOCUMENT
© ISO 2021
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may
be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting
on the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address
below or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland

Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Abbreviated terms
5 Principles
5.1 General
5.2 Current technologies, approaches and their flaws
5.3 Standards and best practices to facilitate data sharing and reuse
5.3.1 Maximizing value to the payer
5.3.2 Data findability
5.3.3 Data machine and human interpretability
5.3.4 Using accepted controlled vocabularies and naming conventions
5.3.5 Biological annotation technology domain independence
5.3.6 Data locatability using multiple queries
5.4 Additional desirable attributes
5.4.1 Data linkage to a published and openly accessible document describing the experimental system
5.4.2 Data format linkage to a published and openly accessible document describing the format
5.4.3 Existing information technology
5.4.4 Development of tools and best practices for creating web friendly and search engine crawlable data documents
5.5 Essential considerations
5.5.1 Common annotation across multiple data sources
5.5.2 Keyword template
5.5.3 Embedding ontological descriptions
5.5.4 Pseudo-documents
6 Major challenges
6.1 General
6.2 Domain
6.3 Regionalization
6.4 Proprietary data
6.5 Large number of existing bio-ontologies, controlled vocabularies and terminologies
6.6 Large number of existing data repositories and corresponding domain specific data formats
6.7 Large number of funding agencies (e.g. national, educational, philanthropic, commercial)
7 Examples of existing national and regional standards or requirements for data sharing or publication
7.1 General
7.2 USA
7.3 Canada
7.4 European Union
7.5 Germany
7.6 China
7.7 United Kingdom
7.8 India
7.9 Japan
8 Existing legal requirements for data protection
8.1 USA
8.2 European Union
9 Timing of data publication
10 Costs of data publication
11 Archival data
12 Validation and verification of compliance
13 Affected stakeholder categories
Annex A (informative) Searchability of scientific content on the web
Annex B (informative) Example enhanced annotation of text documents
Bibliography

Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards
bodies (ISO member bodies). The work of preparing International Standards is normally carried out
through ISO technical committees. Each member body interested in a subject for which a technical
committee has been established has the right to be represented on that committee. International
organizations, governmental and non-governmental, in liaison with ISO, also take part in the work.
ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of
electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are
described in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the
different types of ISO documents should be noted. This document was drafted in accordance with the
editorial rules of the ISO/IEC Directives, Part 2 (see https://www.iso.org/directives-and-policies.html).
Attention is drawn to the possibility that some of the elements of this document may be the subject of
patent rights. ISO shall not be held responsible for identifying any or all such patent rights. Details of
any patent rights identified during the development of the document will be in the Introduction and/or
on the ISO list of patent declarations received (see www.iso.org/patents).
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and
expressions related to conformity assessment, as well as information about ISO's adherence to the
World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT), see
www.iso.org/iso/foreword.html.
This document was prepared by Technical Committee ISO/TC 276, Biotechnology.
Any feedback or questions on this document should be directed to the user’s national standards body. A
complete listing of these bodies can be found at www.iso.org/members.html.

Introduction
The explosion of life sciences data (big data) has created a need to digitally locate data from diverse
biological assays, obtained in a wide range of laboratories, and from a wide range of experimental
protocols. To be able to extract value from big data, it is necessary that the data are “findable”, and
that the biology measured in the assay is described in a way that it can be located and interpreted.
Data producers’ use of a consistent method to describe the biology that their data represent can
greatly improve the use of big data. This single, unified description of biological data facilitates locating
and extracting value from an abundance of biological data and returns increased value to funding
organizations.
Many biotech communities have already developed standard data representations specific to their
domain [1]. Examples include MIAME [2] in the microarray community, OME/OMERO [3] in the imaging and
microscopy communities, SBML [4] in the systems biology and reaction kinetics community, and MIABIS
in the biobanking domain [5]. What is lacking is a consistent method of describing the represented
biological information so that the same search, analysis and mining tools can locate data across
the entire range of life science domains. Consensus and guidance are required and provided in this
document for the biotech domain-independent annotation of biological data.
The importance of data sharing as an integral part of biological research is recognized in the research
community. As a result, a diverse set of stakeholders has developed the FAIR (Findable, Accessible,
Interoperable and Reusable) data sharing principles [7]. The intent of FAIR is to act as a guideline for
sharing and enhancing the reusability of data holdings. Many life science funding organizations also
place increased emphasis on the importance of data sharing. Some require that data sharing plans
are included in grant applications and research contracts, i.e. “data must be made as widely and freely
available as possible while safeguarding the privacy of participants and protecting confidential and
proprietary data” [8]. Data sharing is equally critical for various national and international research and
biobank networks. Data sharing is known to encourage diversity of analysis and opinion, the testing
of alternative hypotheses and enabling of explorations not envisioned by the original investigators,
resulting in increased value to the funding organization.
This document lays out concepts, challenges, issues and benefits that are relevant to developing
International Standards for data sharing in life science research and provides an overview for specifying
standards and best practices that enable data sharing.

TECHNICAL REPORT ISO/TR 3985:2021(E)
Biotechnology — Data publication — Preliminary
considerations and concepts
1 Scope
This document reviews best practices that:
a) respect the existing standardization efforts of life sciences research communities;
b) normalize key aspects of data description particularly at the level of the biology being studied (and
shared) across the life sciences communities;
c) ensure that data are “findable” and useable by other researchers; and
d) provide guidance and metrics for assessing the applicability of a particular data sharing plan.
This document is applicable to domains in life sciences including biotechnology, genomics (including
massively parallel nucleotide sequencing, metagenomics, epigenomics and functional genomics),
transcriptomics, translatomics, proteomics, metabolomics, lipidomics, glycomics, enzymology,
immunochemistry, life science imaging, synthetic biology, systems biology, systems medicine and
related fields.
2 Normative references
There are no normative references in this document.
3 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
ISO and IEC maintain terminological databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at https://www.iso.org/obp
— IEC Electropedia: available at http://www.electropedia.org/
3.1
big data
bigdata
extensive datasets (3.7) — primarily in the data (3.2) characteristics of volume, variety, velocity, and/
or variability — that require a scalable technology for efficient storage, manipulation, management,
and analysis
Note 1 to entry: The term “big data” is commonly used in many ways, for example as the name of the scalable
technology used to handle extensive data sets.
Note 2 to entry: Big data includes any data that are aggregated into a repository of much larger size than the
component data parts. For example, the collection of abstracts of biological publications represents a big data set
with more than 20 million entries.
[SOURCE: ISO/IEC 20546:2019, 3.1.2, modified — “bigdata” was given as an alternative term and Note 2
to entry was added.]
3.2
data
reinterpretable representation of information in a formalized manner suitable for communication,
interpretation or processing
[SOURCE: ISO/IEC 2382:2015, 2121272, modified — All three notes were removed.]
3.3
data archiver
archiver
individual or organization responsible for the long-term persistence of data and the access to that data
Note 1 to entry: An archiver receives data from a producer and can be funded by the same or a different payer.
3.4
data consumer
consumer
user
individual or organization that uses data as a starting point
Note 1 to entry: In the research domain, a data consumer is a scientist or research group.
Note 2 to entry: In the medical domain, a data consumer can be a physician or patient. In some cases, the consumer
can also be the payer.
3.5
data producer
producer
organization or individual that carries out an experiment or measurement, funded by a payer (3.11),
and producing a data set
Note 1 to entry: In the research domain, the producer is typically a researcher; in the commercial domain, the
producer can be a contract laboratory.
3.6
data publication
publication
any of several forms in which data are made available to a wider community
Note 1 to entry: This includes traditional scientific publications in journals as well as the sharing of data via a
public repository such as GENBANK. Data publication is typically, though not always, carried out by an entity
dedicated to the collection and dissemination of data, e.g. a data archiver (3.3).
Note 2 to entry: The “wider community” refers to data consumers, other than the individuals or organization
that obtained the data.
3.7
data set
dataset
identifiable collection of data
[SOURCE: ISO 19115-1:2014, 4.3, modified — “dataset” was given as an alternative term and Note 1 to
entry was deleted.]
3.8
data sharing
sharing
making data (e.g. numerical, textual, images) available to, and findable by, others
Note 1 to entry: Data are not truly shared, if they cannot be found.
3.9
data sharing plan
formalized description of how a data producer (3.5) will accomplish the task of data sharing (3.8)
3.10
metadata
meta-data
data that define and describe other data
[SOURCE: ISO/IEC 11179-1:2015, 3.2.16, modified — “meta-data” was added as an alternative term.]
3.11
payer
organization responsible for funding research
Note 1 to entry: This can be a government organization such as a national research institute, a philanthropic
organization, a private research organization or, in the medical case a national or private insurance organization.
3.12
proprietary data
data stored in such a way that by design and implementation they are not accessible to everyone
Note 1 to entry: Proprietary data include, but are not limited to, data proprietary to an organization such as a
company, or data proprietary to an individual such as health records.
Note 2 to entry: Proprietary data are the opposite of public data (3.13).
3.13
public data
data stored in such a way that by design and implementation they are accessible to everyone
Note 1 to entry: Public data are the opposite of proprietary data (3.12).
3.14
regionalization
process of expressing a text or data in a particular human language
Note 1 to entry: This includes not only the textual part of the document but also the date formats and varying
usages and meanings of commas (,) and periods (.) in numeric formats.
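As an illustration of Note 1 to entry, the same numeric string can denote different values depending on the regional convention. The following Python sketch is not part of this document; the function name parse_regional_number and its decimal_sep parameter are hypothetical, and it covers only the comma/period case mentioned in the note.

```python
def parse_regional_number(text: str, decimal_sep: str) -> float:
    """Parse a number written with the given decimal separator.

    decimal_sep is "." or ","; the other of the two characters is
    treated as a digit-grouping separator and discarded.
    (Hypothetical helper, for illustration only.)
    """
    if decimal_sep not in (".", ","):
        raise ValueError("decimal_sep must be '.' or ','")
    group_sep = "," if decimal_sep == "." else "."
    # Drop grouping separators first, then normalize the decimal mark.
    cleaned = text.strip().replace(group_sep, "").replace(decimal_sep, ".")
    return float(cleaned)


# The string "1.234" is ambiguous until the convention is known.
print(parse_regional_number("1.234", decimal_sep=","))    # 1234.0
print(parse_regional_number("1.234", decimal_sep="."))    # 1.234
print(parse_regional_number("1.234,5", decimal_sep=","))  # 1234.5
```

A data document that records which convention it uses removes this ambiguity for the data consumer.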
3.15
reification
expression of data or knowledge in a specific language or syntax
Note 1 to entry: Examples include expressing or converting structured data from one format to another, such as
from JSON to XML.
Note 2 to entry: Reification also means making a topic represent the subject of another topic map construct in the
same topic map according to ISO/IEC 13250-2:2006, 3.11.
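The JSON to XML conversion named in Note 1 to entry can be sketched as follows. This is a minimal illustration and not part of this document: the function json_to_xml, the root tag "record" and the example keys and values are assumptions, and only flat JSON objects whose keys are valid XML element names are handled.

```python
import json
import xml.etree.ElementTree as ET


def json_to_xml(json_text: str, root_tag: str = "record") -> str:
    """Re-express a flat JSON object as an XML element with one child per key.

    Hypothetical helper: nested objects and arrays are not handled, and
    every key is assumed to be a valid XML element name.
    """
    data = json.loads(json_text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    root = ET.Element(root_tag)
    for key, value in data.items():
        child = ET.SubElement(root, key)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")


# Illustrative values only.
example = '{"technology": "imaging", "process": "somitogenesis"}'
print(json_to_xml(example))
# <record><technology>imaging</technology><process>somitogenesis</process></record>
```

The information content is unchanged; only the syntax in which it is expressed differs.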
3.16
repurposing
practice of using data in a manner other than that for which they were originally collected
Note 1 to entry: For example, microscope images originally collected for cell counting purposes might be
repurposed and used to measure cell morphology.
4 Abbreviated terms
BBSRC Biotechnology and Biological Sciences Research Council
ChEBI Chemical Entities of Biological Interest
DNA Deoxyribonucleic Acid
EOSC European Open Science Cloud
EU European Union
FAIR Findable, Accessible, Interoperable, Reusable
CASRN Chemical Abstracts Service Registry Number
HTML Hypertext Markup Language
MIABIS Minimum Information about Biobank Information Sharing
MIAME Minimum Information about a Microarray Experiment
NCBI National Center for Biotechnology Information
NIH United States Department of Health and Human Services, National Institutes of Health
OME Open Microscopy Environment
OMERO Open Microscopy Environment Remote objects
OSPP Open Science Policy Platform of the European Union
OWL W3C Web Ontology Language
PDF Portable Document Format
PID Persistent Identifier
POD Plain Old Documentation
RDF Resource Description Framework
UCSD University of California San Diego
URI Uniform Resource Identifier
URL Uniform Resource Locator
USA United States of America
SBML Systems Biology Markup Language
VEGFa Vascular Endothelial Growth Factor a
XML Extensible Markup Language
5 Principles
5.1 General
Data sharing by definition is more than simply the publication of summary statistics in tables. It also
includes sharing of raw data from which the summaries are generated [8].
The challenge to both researchers and funding agencies is determining which data are shared, how they
are shared, and what metrics might be used to judge the suitability of a sharing plan. For example, the
breadth and variety of science supported by the US National Institutes of Health (NIH) prevents the precise
content for documentation, its presentation or its transport from being stipulated, i.e. one size does not
fit all. As a result, the NIH encourages discussion of data sharing standards and practices between
disciplines and professional societies to create a supportive data sharing environment [8].
This view, however, leaves the researcher, reviewer and funding agency without enough guidance and
metrics to judge a plan. In addition, it lacks any attempt at standardizing any of the aspects of the data
across technology domains, leaving open the potential for ineffective data sharing.
FOUNDATIONAL CONCEPT: At the level of biological description, differences between life science
technologies vanish, suggesting that a unifying standard spanning all the individual life science data
communities can be used for data sharing (see Figure 1).
NOTE In the case shown here four technologies have been applied to the study of somitogenesis, a phase
of early embryonic development. Each technology domain (highlighted as - - - -) has its own data and metadata
specification. There is a critical need for a common, high level annotation scheme that describes the biology
(highlighted as — ·· — ·· —) included in an experiment or model in a (bio)technology-independent fashion.
Figure 1 — Multiple (bio)technologies can be applied to study a biological or biomedical
problem.
Consistent annotation of the biological content of data aims at:
a) technology domain independence (i.e. not bound to a certain method or technology);
b) findability of the data;
c) data interoperability (facilitation of data integration);
d) facilitation of data reuse and repurposing.
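The aims listed above can be made concrete with a small sketch in which each data set keeps its technology domain but is also described by technology-independent biology terms. This Python example is illustrative only and not part of this document; the class DataSetRecord, the function find_by_biology, the identifiers and the term values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSetRecord:
    identifier: str           # hypothetical local identifier
    technology_domain: str    # each domain keeps its own metadata specification
    biology_terms: frozenset  # technology-independent description of the biology


def find_by_biology(records, term):
    """Return every record annotated with the given biology term,
    regardless of the technology that produced the data."""
    wanted = term.strip().lower()
    return [r for r in records if wanted in r.biology_terms]


records = [
    DataSetRecord("ds-001", "imaging", frozenset({"somitogenesis"})),
    DataSetRecord("ds-002", "transcriptomics", frozenset({"somitogenesis"})),
    DataSetRecord("ds-003", "proteomics", frozenset({"angiogenesis"})),
]

hits = find_by_biology(records, "Somitogenesis")
print([(r.identifier, r.technology_domain) for r in hits])
# [('ds-001', 'imaging'), ('ds-002', 'transcriptomics')]
```

A single query on the biology locates data from two different technology domains, which is the kind of domain independence and findability the list describes.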
5.2 Current technologies, approaches and their flaws
Factors that can contribute to the lack of effective sharing and reuse of biological data include:
a) Many communities and their data formats were established before the internet and search engines
were available.
b) The data are not “published” or only partially published (e.g. only available in the form of a summary
such as group averages).
c) The data are published in a form, format or location that is not easily interpreted or located.
d) The data are published in a suitable format and at a suitable location, but the terms used in the
document are not standard nomenclature, making the data difficult to find.
5.3 Standards and best practices to facilitate data sharing and reuse
5.3.1 Maximizing value to the payer
Any standard aiming to facilitate data sharing and reuse maximizes value to the payer by maximizing
the number of users and uses of data. Uses of data often extend beyond what was envisioned by the
payer and data producer.
5.3.2 Data findability
Data findability by authorized users is essential. In the best case, it will not be necessary to first locate
the URI associated with the data source. Current data search applications use artificial intelligence
to learn ontological terms that are missing from controlled vocabularies and to identify relationships
among these terms, terms already in use and query terms, in order to recommend synonyms. Database
communities, both commercial and public, have already begun to use this type of application. However, data
producers can proactively consult ontological databases to determine controlled vocabularies that
best render their data findable. Furthermore, “findability” is independent of the biotech domain, and
that domain's standard repository, since the life sciences data are described in biological terms and are
indexed by web search engines. Searchability of scientific content on the web is covered in Annex A.
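One way to picture the role of controlled vocabularies in findability is a lookup that maps the variants a producer or consumer might type onto one preferred label. The Python sketch below is an assumption-laden illustration, not a method given in this document: the table PREFERRED_LABELS is hand-written for this example rather than taken from any real ontology, and the function names are hypothetical.

```python
# Hand-written synonym table for illustration; a real application would
# consult an ontology or controlled vocabulary service instead.
PREFERRED_LABELS = {
    "vegfa": "vascular endothelial growth factor a",
    "vegf-a": "vascular endothelial growth factor a",
    "vascular endothelial growth factor a": "vascular endothelial growth factor a",
}


def normalize_term(term: str) -> str:
    """Map a free-text term to its preferred label, if one is known."""
    key = " ".join(term.lower().split())  # fold case and collapse whitespace
    return PREFERRED_LABELS.get(key, key)


def matches(query: str, annotations: list) -> bool:
    """True if the query and any annotation normalize to the same label."""
    target = normalize_term(query)
    return any(normalize_term(a) == target for a in annotations)


print(normalize_term("VEGFa"))
# vascular endothelial growth factor a
print(matches("VEGF-A", ["Vascular Endothelial  Growth Factor a"]))
# True
```

Data annotated with the preferred label can then be found by a query that uses any listed synonym.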
5.3.3 Data machine and human interpretability
Life science data can be both machine and human interpretable. The development of “reification
technologies” that can present a data document in multiple ways can greatly improve the ability to
create documents that are both machine and human interpretable. An example of this technology
partially exists in the systems biology domain, especially when it comes to modelling biological
processes. Systems Biology Markup Language (SBML) [4] is a free and open interchange format for
computer models of biological processes. An SBML format file describes a set of mathemat
...

TECHNICAL ISO/TR
REPORT 3985
First edition
Biotechnology — Data publication
— Preliminary considerations and
concepts
PROOF/ÉPREUVE
Reference number
ISO/TR 3985:2021(E)
©
ISO 2021

---------------------- Page: 1 ----------------------
ISO/TR 3985:2021(E)

COPYRIGHT PROTECTED DOCUMENT
© ISO 2021
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may
be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting
on the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address
below or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
ii PROOF/ÉPREUVE © ISO 2021 – All rights reserved

---------------------- Page: 2 ----------------------
ISO/TR 3985:2021(E)

Contents Page
Foreword .v
Introduction .vi
1 Scope . 1
2 Normative references . 1
3 Terms and definitions . 1
4 Abbreviated terms . 4
5 Principles . 4
5.1 General . 4
5.2 Current technologies, approaches and their flaws . 5
5.3 Standards and best practices to facilitate data sharing and reuse . 6
5.3.1 Maximizing value to the payer . 6
5.3.2 Data findability . 6
5.3.3 Data machine and human interpretability . 6
5.3.4 Using accepted controlled vocabularies and naming conventions. 6
5.3.5 Biological annotation technology domain independence . 6
5.3.6 Data locatability using multiple queries. 7
5.4 Additional desirable attributes . 7
5.4.1 Data linkage to a published and openly accessible document describing
the experimental system . 7
5.4.2 Data format linkage to a published and openly accessible document
describing the format . 7
5.4.3 Existing information technology . 7
5.4.4 Development of tools and best practices for creating web friendly and
search engine crawlable data documents . 7
5.5 Essential considerations . 7
5.5.1 Common annotation across multiple data sources . 7
5.5.2 Keyword template . 8
5.5.3 Embedding ontological descriptions . 9
5.5.4 Pseudo-documents . 9
6 Major challenges .10
6.1 General .10
6.2 Domain .10
6.3 Regionalization .10
6.4 Proprietary data .10
6.5 Large number of existing bio-ontologies, controlled vocabularies and terminologies .10
6.6 Large number of existing data repositories and corresponding domain specific
data formats .10
6.7 Large number of funding agencies (e.g. national, educational, philanthropic,
commercial) .11
7 Examples of existing national and regional standards or requirements for data
sharing or publication .11
7.1 General .11
7.2 USA .11
7.3 Canada .11
7.4 European Union .11
7.5 Germany .11
7.6 China .12
7.7 United Kingdom .12
7.8 India .12
7.9 Japan .12
8 Existing legal requirements for data protection .12
© ISO 2021 – All rights reserved PROOF/ÉPREUVE iii

---------------------- Page: 3 ----------------------
ISO/TR 3985:2021(E)

8.1 USA .12
8.2 European Union .12
9 Timing of data publication .13
10 Costs of data publication .13
11 Archival data .13
12 Validation and verification of compliance .13
13 Affected stakeholder categories .13
Annex A (informative) Searchability of scientific content on the web .14
Annex B (informative) Example enhanced annotation of text documents .16
Bibliography .17
iv PROOF/ÉPREUVE © ISO 2021 – All rights reserved

---------------------- Page: 4 ----------------------
ISO/TR 3985:2021(E)

Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards
bodies (ISO member bodies). The work of preparing International Standards is normally carried out
through ISO technical committees. Each member body interested in a subject for whom a technical
committee has been established has the right to be represented on that committee. International
organizations, governmental and non-governmental, in liaison with ISO, also take part in the work.
ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of
electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are
described in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the
different types of ISO documents should be noted. This document was drafted in accordance with the
editorial rules of the ISO/IEC Directives, Part 2 (see https:// www .iso .org/ directives -and -policies .html).
Attention is drawn to the possibility that some of the elements of this document may be the subject of
patent rights. ISO shall not be held responsible for identifying any or all such patent rights. Details of
any patent rights identified during the development of the document will be in the Introduction and/or
on the ISO list of patent declarations received (see www .iso .org/ patents).
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and
expressions related to conformity assessment, as well as information about ISO's adherence to the
World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT), see www .iso .org/
iso/ foreword .html.
This document was prepared by Technical Committee ISO/TC 276, Biotechnology.
Any feedback or questions on this document should be directed to the user’s national standards body. A
complete listing of these bodies can be found at www .iso .org/ members .html.
© ISO 2021 – All rights reserved PROOF/ÉPREUVE v

---------------------- Page: 5 ----------------------
ISO/TR 3985:2021(E)

Introduction
The explosion of life sciences data (bigdata) has created a need to digitally locate data from diverse
biological assays, obtained in a wide range of laboratories, and from a wide range of experimental
protocols. To be able to extract value from bigdata, it is necessary that the data be “findable”, and that
the biology measured in the assay is described in a way that it can be located and interpreted. Data
producer’s use of a consistent method to describe the biology that their data represents can greatly
improve the use of big data. This single, unified description of biological data facilitates locating
and extracting value from an abundance of biological data and return increased value to funding
organizations.
Many biotech communities have already developed standard data representations specific to their
[1] [2] [3]
domain . For example, MIAME in the microarray community, OME/OMERO in the imaging and
[4]
microscopy communities, SBML in the systems biology and reaction kinetics community, and MIABIS
[5]
in the biobanking domain . What is lacking is a consistent method of describing the represented
biological information so that the same search, analysis and mining tools can locate data across
the entire range of life science domains. Consensus and guidance are required and provided in this
document for the biotech domain-independent annotation of biological data.
The importance of data sharing as an integral part of biological research is recognized in the research
community. As a result, a diverse set of stakeholders have developed the FAIR (Findable, Accessible,
[7]
Interoperable and Reusable) data sharing principles . The intent of FAIR is to act as a guideline for
sharing and enhancing the reusability of data holdings. Many life science funding organizations also
place increased emphasis on the importance of data sharing. Some require that data sharing plans
are included in grant applications and research contracts, i.e. “data must be made as widely and freely
available as possible while safeguarding the privacy of participants and protecting confidential and
[8]
proprietary data .” Data sharing is equally critical for various national and international research and
biobank networks. Data sharing is known to encourage diversity of analysis and opinion, the testing
of alternative hypotheses and enabling of explorations not envisioned by the original investigators,
resulting in increased value to the funding organization.
This document lays out concepts, challenges, issues and benefits that are relevant to developing
International Standards for data sharing in life science research and provides an overview for specifying
standards and best practices that enable data sharing.
vi PROOF/ÉPREUVE © ISO 2021 – All rights reserved

---------------------- Page: 6 ----------------------
TECHNICAL REPORT ISO/TR 3985:2021(E)
Biotechnology — Data publication — Preliminary
considerations and concepts
1 Scope
This document reviews best practices that:
a) respect the existing standardization efforts of life sciences research communities;
b) normalize key aspects of data description particularly at the level of the biology being studied (and
shared) across the life sciences communities;
c) ensure that data are “findable” and useable by other researchers; and
d) provide guidance and metrics for assessing the applicability of a particular data sharing plan.
This document is applicable to domains in life sciences including biotechnology, genomics (including
massively parallel nucleotide sequencing, metagenomics, epigenomics and functional genomics),
transcriptomics, translatomics, proteomics, metabolomics, lipidomics, glycomics, enzymology,
immunochemistry, life science imaging, synthetic biology, systems biology, systems medicine and
related fields.
2 Normative references
There are no normative references in this document.
3 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
ISO and IEC maintain terminological databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at https://www.iso.org/obp
— IEC Electropedia: available at http://www.electropedia.org/
3.1
big data
bigdata
extensive datasets (3.7) — primarily in the data (3.2) characteristics of volume, variety, velocity, and/
or variability — that require a scalable technology for efficient storage, manipulation, management,
and analysis
Note 1 to entry: The term “big data” is commonly used in many ways, for example as the name of the scalable
technology used to handle extensive data sets.
Note 2 to entry: Big data includes any data that are aggregated into a repository of much larger size than the
component data parts. For example, the collection of abstracts of biological publications represents a big data set
with more than 20 million entries.
[SOURCE: ISO/IEC 20546:2019, 3.1.2, modified — “bigdata” was given as an alternative term and Note 2
to entry was added.]
3.2
data
reinterpretable representation of information in a formalized manner suitable for communication,
interpretation or processing
[SOURCE: ISO/IEC 2382:2015, 2121272, modified — All three notes were removed.]
3.3
data archiver
archiver
individual or organization responsible for the long-term persistence of data and the access to that data
Note 1 to entry: An archiver receives data from a producer and can be funded by the same or a different payer.
3.4
data consumer
consumer
user
individual or organization that uses data as a starting point
Note 1 to entry: In the research domain, a data consumer is a scientist or research group.
Note 2 to entry: In the medical domain, a data consumer can be a physician or patient. In some cases, the
consumer can also be the payer.
3.5
data producer
producer
organization or individual that carries out an experiment or measurement, funded by a payer (3.11),
and producing a data set
Note 1 to entry: In the research domain, the producer is typically a researcher; in the commercial domain, the
producer can be a contract laboratory.
3.6
data publication
publication
any of several forms in which data are made available to a wider community
Note 1 to entry: This includes traditional scientific publications in journals as well as the sharing of data via a
public repository such as GenBank. Data publication is typically, though not always, carried out by an entity
dedicated to the collection and dissemination of data, e.g. a data archiver (3.3).
Note 2 to entry: The “wider community” refers to data consumers, other than the individuals or organization
that obtained the data.
3.7
data set
dataset
identifiable collection of data
[SOURCE: ISO 19115-1:2014, 4.3, modified — “dataset” was given as an alternative term and Note 1 to
entry was deleted.]
3.8
data sharing
sharing
making data (e.g. numerical, textual, images) available to, and findable by, others
Note 1 to entry: Data are not truly shared if they cannot be found.
3.9
data sharing plan
formalized description of how a data producer (3.5) will accomplish the task of data sharing (3.8)
3.10
metadata
meta-data
data that define and describe other data
[SOURCE: ISO/IEC 11179-1:2015, 3.2.16, modified — “meta-data” was added as an alternative term.]
3.11
payer
organization responsible for funding research
Note 1 to entry: This can be a government organization such as a national research institute, a philanthropic
organization, a private research organization or, in the medical case, a national or private insurance organization.
3.12
proprietary data
data stored in such a way that, by design and implementation, they are not accessible to everyone
Note 1 to entry: Proprietary data include, but are not limited to, data proprietary to an organization such as a
company, or data proprietary to an individual such as health records.
Note 2 to entry: Proprietary data are the opposite of public data (3.13).
3.13
public data
data stored in such a way that, by design and implementation, they are accessible to everyone
Note 1 to entry: Public data are the opposite of proprietary data (3.12).
3.14
regionalization
process of expressing a text or data in a particular human language
Note 1 to entry: This includes not only the textual part of the document but also the date formats and varying
usages and meanings of commas (,) and periods (.) in numeric formats.
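As an informal illustration only (not part of the definition), the effect of regionalization on numeric and
date formats can be sketched in Python. The convention table, the region labels and the function names below
are assumptions made for this sketch; a production system would normally rely on an established locale library.

```python
from datetime import date

# Hypothetical convention table for this sketch:
# region label -> (decimal mark, grouping mark, date pattern)
CONVENTIONS = {
    "en-US": (".", ",", "{m:02d}/{d:02d}/{y:04d}"),
    "de-DE": (",", ".", "{d:02d}.{m:02d}.{y:04d}"),
}


def regionalize_number(value: float, region: str, digits: int = 2) -> str:
    """Render a number using the decimal and grouping marks of a region."""
    decimal_mark, group_mark, _ = CONVENTIONS[region]
    # Format with "," grouping and "." decimal first, then swap the marks
    # through a placeholder so the two replacements do not collide.
    text = f"{value:,.{digits}f}"
    return text.replace(",", "\0").replace(".", decimal_mark).replace("\0", group_mark)


def regionalize_date(value: date, region: str) -> str:
    """Render a date using the day/month/year ordering of a region."""
    pattern = CONVENTIONS[region][2]
    return pattern.format(d=value.day, m=value.month, y=value.year)


print(regionalize_number(1234567.891, "en-US"), regionalize_date(date(2021, 5, 17), "en-US"))
print(regionalize_number(1234567.891, "de-DE"), regionalize_date(date(2021, 5, 17), "de-DE"))
```

The same value is rendered with swapped meanings of the comma and the period, which is why shared data
benefit from stating the convention used or from storing values in a region-neutral form.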
3.15
reification
expression of data or knowledge in a specific language or syntax
Note 1 to entry: Examples include expressing or converting structured data from one format to another, such as
from JSON to XML.
Note 2 to entry: Reification also means making a topic represent the subject of another topic map construct in
the same topic map according to ISO/IEC 13250-2:2006, 3.11.
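As an informal illustration of Note 1 to entry (not part of the definition), the following Python sketch
re-expresses a flat JSON object as XML using only the standard library. The function name, the root tag
"record" and the example keys and values are assumptions made for this sketch. It also assumes that every
JSON key is a valid XML element name and that the object is not nested.

```python
import json
import xml.etree.ElementTree as ET


def json_to_xml(json_text: str, root_tag: str = "record") -> str:
    """Re-express a flat JSON object as an XML element with one child per key."""
    obj = json.loads(json_text)
    root = ET.Element(root_tag)
    for key, value in obj.items():
        # Assumption: each key is a valid XML element name.
        child = ET.SubElement(root, key)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical example content, for illustration only.
example = '{"organism": "Danio rerio", "process": "somitogenesis"}'
print(json_to_xml(example))
```

The information content is unchanged; only the syntax in which it is expressed differs.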
3.16
repurposing
practice of using data for a purpose other than that for which they were originally collected
Note 1 to entry: For example, microscope images originally collected for cell counting purposes might be
repurposed and used to measure cell morphology.
4 Abbreviated terms
BBSRC Biotechnology and Biological Sciences Research Council
ChEBI Chemical Entities of Biological Interest
DNA Deoxyribonucleic Acid
EOSC European Open Science Cloud
EU European Union
FAIR Findable, Accessible, Interoperable, Reusable
CASRN Chemical Abstracts Service Registry Number
HTML Hypertext Markup Language
MIABIS Minimum Information about Biobank Information Sharing
MIAME Minimum Information about a Microarray Experiment
NCBI National Center for Biotechnology Information
NIH United States Department of Health and Human Services, National Institutes of Health
OME Open Microscopy Environment
OMERO Open Microscopy Environment Remote Objects
OSPP Open Science Policy Platform of the European Union
OWL W3C Web Ontology Language
PDF Portable Document Format
PID Persistent Identifier
POD Plain Old Documentation
RDF Resource Description Framework
UCSD University of California San Diego
URI Uniform Resource Identifier
URL Uniform Resource Locator
USA United States of America
SBML Systems Biology Markup Language
VEGFa Vascular Endothelial Growth Factor a
XML Extensible Markup Language
5 Principles
5.1 General
Data sharing by definition is more than simply publication of summary statistics in tables. It also
includes sharing of raw data from which the summaries are generated [8].
The challenge to both researchers and funding agencies is determining what data are shared, how they
are shared and what metrics might be used to judge the suitability of a sharing plan. For example, the
breadth and variety of science supported by the US National Institutes of Health (NIH) prevents the
precise content for documentation, its presentation or its transport from being stipulated, i.e. one size
does not fit all. As a result, the NIH encourages discussion of data sharing standards and practices
between disciplines and professional societies to create a supportive data sharing environment [8].
This view, however, leaves the researcher, reviewer and funding agency without enough guidance and
metrics to judge a plan. In addition, it lacks any attempt at standardizing any of the aspects of the data
across technology domains, leaving open the potential for ineffective data sharing.
FOUNDATIONAL CONCEPT: At the level of biological description, differences between life science
technologies vanish, suggesting that a unifying standard spanning all the individual life science data
communities can be used for data sharing (see Figure 1).
NOTE In the case shown here four technologies have been applied to the study of somitogenesis, a phase
of early embryonic development. Each technology domain (highlighted as - - - -) has its own data and metadata
specification. There is a critical need for a common, high level annotation scheme that describes the biology
(highlighted as — ·· — ·· —) included in an experiment or model in a (bio)technology-independent fashion.
Figure 1 — Multiple (bio)technologies can be applied to study a biological or biomedical
problem.
Consistent annotation of the biological content of data aims at:
a) technology domain independence (i.e. not bound to a certain method or technology);
b) findability of the data;
c) data interoperability (facilitation of data integration);
d) facilitation of data reuse and repurposing.
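The aims listed above can be sketched informally in Python. The record structure, field names, identifiers
and metadata values below are hypothetical and serve only to show how a shared, technology-independent
biological annotation can sit alongside domain-specific metadata. This is a minimal sketch under those
assumptions, not a proposed schema.

```python
from dataclasses import dataclass, field


@dataclass
class DataSetRecord:
    identifier: str          # hypothetical local identifier
    technology: str          # technology domain, e.g. "transcriptomics"
    domain_metadata: dict    # domain-specific metadata, left to each community
    biology_terms: set = field(default_factory=set)  # shared biological annotation


def find_by_biology(records, term):
    """Return records annotated with a biological term, regardless of technology."""
    wanted = term.strip().lower()
    return [r for r in records if wanted in {t.lower() for t in r.biology_terms}]


# Placeholder records, for illustration only.
records = [
    DataSetRecord("ds-001", "transcriptomics", {"platform": "example"}, {"somitogenesis"}),
    DataSetRecord("ds-002", "life science imaging", {"modality": "example"}, {"Somitogenesis"}),
    DataSetRecord("ds-003", "proteomics", {"instrument": "example"}, {"angiogenesis"}),
]

hits = find_by_biology(records, "somitogenesis")
print([r.identifier for r in hits])
```

The query is phrased purely in biological terms and returns data sets from two different technology
domains, without the searcher needing to know either domain's metadata specification.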
5.2 Current technologies, approaches and their flaws
Factors that can contribute to the lack of effective sharing and reuse of biological data include:
a) Many communities and their data formats were established before the internet and search engines
were available.
b) The data are not “published” or only partially published (e.g. only available in the form of a summary
such as group averages).
c) The data are published in a form, format or location that is not easily interpreted or located.
d) The data are published in a suitable format and at a suitable location, but the terms used in the
document do not follow standard nomenclature, which makes the data difficult to find.
5.3 Standards and best practices to facilitate data sharing and reuse
5.3.1 Maximizing value to the payer
Any standard aiming to facilitate data sharing and reuse maximizes value to the payer by maximizing
the number of users and uses of data. Uses of data often extend beyond what was envisioned by the
payer and data producer.
5.3.2 Data findability
Data findability by authorized users is essential. In the best case, it will not be necessary to first locate
the URI associated with the data source. Current data search applications use artificial intelligence
to learn ontological terms that are missing from controlled vocabularies and identify relationships
between them, those already in use and query terms to make recommendations for synonyms. Database
communities, commercial and public, have already begun to use this type of application. However, data
producers can proactively consult ontological databases to determine controlled vocabularies that
best render their data findable. Furthermore, “findability” is independent of the biotech domain, and
that domain's standard repository, since the life sciences data are described in biological terms and are
indexed by web search engines. Searchability of scientific content on the web is covered in Annex A.
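The proactive use of controlled vocabularies by data producers can be sketched informally in Python. The
vocabulary below is a hypothetical stand-in for an ontological database: the identifiers prefixed with
"EX:" are placeholders and not real accessions, and the function names are assumptions made for this sketch.

```python
# Hypothetical controlled vocabulary: identifier -> preferred term and synonyms.
VOCABULARY = {
    "EX:0001": {
        "preferred": "vascular endothelial growth factor A",
        "synonyms": {"VEGFa", "VEGF-A"},
    },
    "EX:0002": {
        "preferred": "deoxyribonucleic acid",
        "synonyms": {"DNA"},
    },
}


def build_index(vocabulary):
    """Map every preferred term and synonym (lower-cased) to its identifier."""
    index = {}
    for identifier, entry in vocabulary.items():
        for label in {entry["preferred"], *entry["synonyms"]}:
            index[label.lower()] = identifier
    return index


def annotate(free_text_terms, index):
    """Map each free-text term to a vocabulary identifier, or None if unknown."""
    return {term: index.get(term.strip().lower()) for term in free_text_terms}


INDEX = build_index(VOCABULARY)
print(annotate(["VEGFa", "DNA", "unknown factor"], INDEX))
```

Terms that map to None indicate where a producer's wording falls outside the vocabulary and can
therefore make the data harder to find.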
5.3.3 Data machine and human interpretability
Life science data can be both machine and human interpretable. The development of “reification
technologies” that can present a data document in multiple ways can greatly improve the ability to
create documents that are both machine and human interpretable. An example of this technology
partially exists in the systems biology domain, especially when it comes to modelling biological
processes. Systems Biology Markup Language (SBML) [4] is a free and open interchange format for
computer models of biological processes. An SBML fo
...
