Biotechnology -- Data publication -- Preliminary considerations and concepts

This document reviews best practices that: a) respect the existing standardization efforts of life sciences research communities; b) normalize key aspects of data description particularly at the level of the biology being studied (and shared) across the life sciences communities; c) ensure that data are “findable” and useable by other researchers; and d) provide guidance and metrics for assessing the applicability of a particular data sharing plan. This document is applicable to domains in life sciences including biotechnology, genomics (including massively parallel nucleotide sequencing, metagenomics, epigenomics and functional genomics), transcriptomics, translatomics, proteomics, metabolomics, lipidomics, glycomics, enzymology, immunochemistry, life science imaging, synthetic biology, systems biology, systems medicine and related fields.

Biotechnologie -- Publication de données -- Considérations et concepts préliminaires

General Information

Status: Published
Publication Date: 17-May-2021
Current Stage: 5060 - Close of voting Proof returned by Secretariat
Start Date: 14-Apr-2021
Completion Date: 14-Apr-2021

Technical report
ISO/TR 3985:2021 - Biotechnology -- Data publication -- Preliminary considerations and concepts
English language
19 pages
Draft
ISO/PRF TR 3985:Version 27-mar-2021 - Biotechnology -- Data publication -- Preliminary considerations and concepts
English language
19 pages

Standards Content (sample)

TECHNICAL ISO/TR
REPORT 3985
First edition
2021-05
Biotechnology — Data publication
— Preliminary considerations and
concepts
Biotechnologie — Publication de données — Considérations et
concepts préliminaires
Reference number
ISO/TR 3985:2021(E)
ISO 2021
COPYRIGHT PROTECTED DOCUMENT
© ISO 2021

All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may

be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting

on the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address

below or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
Contents

Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Abbreviated terms
5 Principles
  5.1 General
  5.2 Current technologies, approaches and their flaws
  5.3 Standards and best practices to facilitate data sharing and reuse
    5.3.1 Maximizing value to the payer
    5.3.2 Data findability
    5.3.3 Data machine and human interpretability
    5.3.4 Using accepted controlled vocabularies and naming conventions
    5.3.5 Biological annotation technology domain independence
    5.3.6 Data locatability using multiple queries
  5.4 Additional desirable attributes
    5.4.1 Data linkage to a published and openly accessible document describing the experimental system
    5.4.2 Data format linkage to a published and openly accessible document describing the format
    5.4.3 Existing information technology
    5.4.4 Development of tools and best practices for creating web friendly and search engine crawlable data documents
  5.5 Essential considerations
    5.5.1 Common annotation across multiple data sources
    5.5.2 Keyword template
    5.5.3 Embedding ontological descriptions
    5.5.4 Pseudo-documents
6 Major challenges
  6.1 General
  6.2 Domain
  6.3 Regionalization
  6.4 Proprietary data
  6.5 Large number of existing bio-ontologies, controlled vocabularies and terminologies
  6.6 Large number of existing data repositories and corresponding domain specific data formats
  6.7 Large number of funding agencies (e.g. national, educational, philanthropic, commercial)
7 Examples of existing national and regional standards or requirements for data sharing or publication
  7.1 General
  7.2 USA
  7.3 Canada
  7.4 European Union
  7.5 Germany
  7.6 China
  7.7 United Kingdom
  7.8 India
  7.9 Japan
8 Existing legal requirements for data protection
  8.1 USA
  8.2 European Union
9 Timing of data publication
10 Costs of data publication
11 Archival data
12 Validation and verification of compliance
13 Affected stakeholder categories
Annex A (informative) Searchability of scientific content on the web
Annex B (informative) Example enhanced annotation of text documents
Bibliography

Foreword

ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been established has the right to be represented on that committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.

The procedures used to develop this document and those intended for its further maintenance are described in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types of ISO documents should be noted. This document was drafted in accordance with the editorial rules of the ISO/IEC Directives, Part 2 (see https://www.iso.org/directives-and-policies.html).

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. ISO shall not be held responsible for identifying any or all such patent rights. Details of any patent rights identified during the development of the document will be in the Introduction and/or on the ISO list of patent declarations received (see www.iso.org/patents).

Any trade name used in this document is information given for the convenience of users and does not constitute an endorsement.

For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions related to conformity assessment, as well as information about ISO's adherence to the World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT), see www.iso.org/iso/foreword.html.

This document was prepared by Technical Committee ISO/TC 276, Biotechnology.

Any feedback or questions on this document should be directed to the user’s national standards body. A complete listing of these bodies can be found at www.iso.org/members.html.
Introduction

The explosion of life sciences data (big data) has created a need to digitally locate data from diverse biological assays, obtained in a wide range of laboratories, and from a wide range of experimental protocols. To be able to extract value from big data, it is necessary that the data are “findable”, and that the biology measured in the assay is described in a way that it can be located and interpreted. Data producers’ use of a consistent method to describe the biology that their data represent can greatly improve the use of big data. This single, unified description of biological data facilitates locating and extracting value from an abundance of biological data and returns increased value to funding organizations.

Many biotech communities have already developed standard data representations specific to their domain, for example MIAME [1] in the microarray community, OME/OMERO [2] [3] in the imaging and microscopy communities, SBML [4] in the systems biology and reaction kinetics community, and MIABIS [5] in the biobanking domain. What is lacking is a consistent method of describing the represented biological information so that the same search, analysis and mining tools can locate data across the entire range of life science domains. Consensus and guidance are required and provided in this document for the biotech domain-independent annotation of biological data.

The importance of data sharing as an integral part of biological research is recognized in the research community. As a result, a diverse set of stakeholders has developed the FAIR (Findable, Accessible, Interoperable and Reusable) data sharing principles [7]. The intent of FAIR is to act as a guideline for sharing and enhancing the reusability of data holdings. Many life science funding organizations also place increased emphasis on the importance of data sharing. Some require that data sharing plans are included in grant applications and research contracts, i.e. “data must be made as widely and freely available as possible while safeguarding the privacy of participants and protecting confidential and proprietary data [8].” Data sharing is equally critical for various national and international research and biobank networks. Data sharing is known to encourage diversity of analysis and opinion, the testing of alternative hypotheses and enabling of explorations not envisioned by the original investigators, resulting in increased value to the funding organization.

This document lays out concepts, challenges, issues and benefits that are relevant to developing International Standards for data sharing in life science research and provides an overview for specifying standards and best practices that enable data sharing.
TECHNICAL REPORT ISO/TR 3985:2021(E)
Biotechnology — Data publication — Preliminary
considerations and concepts
1 Scope
This document reviews best practices that:

a) respect the existing standardization efforts of life sciences research communities;
b) normalize key aspects of data description particularly at the level of the biology being studied (and shared) across the life sciences communities;
c) ensure that data are “findable” and useable by other researchers; and
d) provide guidance and metrics for assessing the applicability of a particular data sharing plan.

This document is applicable to domains in life sciences including biotechnology, genomics (including massively parallel nucleotide sequencing, metagenomics, epigenomics and functional genomics), transcriptomics, translatomics, proteomics, metabolomics, lipidomics, glycomics, enzymology, immunochemistry, life science imaging, synthetic biology, systems biology, systems medicine and related fields.
2 Normative references
There are no normative references in this document.
3 Terms and definitions
For the purposes of this document, the following terms and definitions apply.

ISO and IEC maintain terminological databases for use in standardization at the following addresses:

— ISO Online browsing platform: available at https://www.iso.org/obp
— IEC Electropedia: available at http://www.electropedia.org/
3.1
big data
bigdata

extensive datasets (3.7) — primarily in the data (3.2) characteristics of volume, variety, velocity, and/

or variability — that require a scalable technology for efficient storage, manipulation, management,

and analysis

Note 1 to entry: Big data is commonly used in many ways, for example as the name of the scalable technology used to handle big data extensive data sets.

Note 2 to entry: Big data includes any data that are aggregated into a repository of much larger size than the component data parts. For example, the collection of abstracts of biological publications represents a big data set with more than 20 million entries.

[SOURCE: ISO/IEC 20546:2019, 3.1.2, modified — “bigdata” was given as an alternative term and Note 2

to entry was added.]
3.2
data

reinterpretable representation of information in a formalized manner suitable for communication,

interpretation or processing
[SOURCE: ISO/IEC 2382:2015, 2121272, modified — All three notes were removed.]
3.3
data archiver
archiver

individual or organization responsible for the long-term persistence of data and the access to that data

Note 1 to entry: An archiver receives data from a producer and can be funded by the same or different payer.

3.4
data consumer
consumer
user
individual or organization that uses data as a starting point

Note 1 to entry: In the research domain, a data consumer is a scientist or research group.

Note 2 to entry: In the medical domain, a data consumer can be a physician or patient. In some cases, the consumer can also be the payer.
3.5
data producer
producer

organization or individual that carries out an experiment or measurement, funded by a payer (3.11), and producing a data set

Note 1 to entry: In the research domain, the producer is typically a researcher; in the commercial domain, the producer can be a contract laboratory.
3.6
data publication
publication
any of several forms in which data are made available to a wider community

Note 1 to entry: This includes traditional scientific publications in journals as well as the sharing of data via a public repository such as GENBANK. Data publication is typically, though not always, carried out by an entity dedicated to the collection and dissemination of data, e.g. a data archiver (3.3).

Note 2 to entry: The “wider community” refers to data consumers, other than the individuals or organization that obtained the data.
3.7
data set
dataset
identifiable collection of data

[SOURCE: ISO 19115-1:2014, 4.3, modified — “dataset” was given as an alternative term and Note 1 to

entry was deleted.]
3.8
data sharing
sharing

making data (e.g. numerical, textual, images) available to, and findable by, others

Note 1 to entry: Data are not truly shared, if they cannot be found.
3.9
data sharing plan

formalized description of how a data producer (3.5) will accomplish the task of data sharing (3.8)

3.10
metadata
meta-data
data that define and describe other data

[SOURCE: ISO/IEC 11179-1:2015, 3.2.16, modified — “meta-data” was added as an alternative term.]

3.11
payer
organization responsible for funding research

Note 1 to entry: This can be a government organization such as a national research institute, a philanthropic organization, a private research organization or, in the medical case, a national or private insurance organization.

3.12
proprietary data

data stored in such a way that by design and implementation they are not accessible to everyone

Note 1 to entry: Proprietary data include, but are not limited to, data proprietary to an organization such as a company, or data proprietary to an individual such as health records.
Note 2 to entry: Proprietary data are the opposite of public data (3.13).
3.13
public data

data stored in such a way that by design and implementation they are accessible to everyone

Note 1 to entry: Public data are the opposite of proprietary data (3.12).
3.14
regionalization
process of expressing a text or data in a particular human language

Note 1 to entry: This includes not only the textual part of the document but also the date formats and varying usages and meanings of commas (,) and periods (.) in numeric formats.
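
The effect of regional number and date conventions can be illustrated with a minimal Python sketch. The function names `parse_regional_number` and `parse_regional_date`, their parameters and the sample values are assumptions made for illustration only; they are not taken from this document or from any standard library API. The sketch assumes the caller already knows which convention the source text follows.

```python
from datetime import date, datetime
from decimal import Decimal


def parse_regional_number(text: str, decimal_sep: str, group_sep: str) -> Decimal:
    """Convert a regionally formatted number to a Decimal.

    decimal_sep and group_sep are hypothetical parameters chosen for this
    sketch; the caller must know which convention the source text uses.
    """
    if decimal_sep == group_sep:
        raise ValueError("decimal and grouping separators must differ")
    # Drop grouping marks first, then map the decimal mark to ".".
    cleaned = text.strip().replace(group_sep, "").replace(decimal_sep, ".")
    return Decimal(cleaned)


def parse_regional_date(text: str, pattern: str) -> date:
    """Parse a date written in a known regional pattern (strptime syntax)."""
    return datetime.strptime(text.strip(), pattern).date()


# The same quantity written with two conventions yields one value.
period_decimal = parse_regional_number("1,234.56", decimal_sep=".", group_sep=",")
comma_decimal = parse_regional_number("1.234,56", decimal_sep=",", group_sep=".")

# The same characters read as two different dates depending on the pattern.
month_first = parse_regional_date("04/05/2021", "%m/%d/%Y")
day_first = parse_regional_date("04/05/2021", "%d/%m/%Y")
```

Storing the parsed values in a region-neutral form (for example `Decimal` values and ISO 8601 dates via `date.isoformat()`) is one possible way to avoid ambiguity when data are shared across regions.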
3.15
reification
expression of data or knowledge in a specific language or syntax

Note 1 to entry: Examples include expressing or converting structured data from one format to another, such as from JSON to XML.

Note 2 to entry: Reification also means making a topic represent the subject of another topic map construct in the same topic map according to ISO/IEC 13250-2:2006, 3.11.
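
The JSON to XML conversion mentioned in Note 1 can be sketched in Python using only the standard library. This is an illustrative sketch, not a prescribed mapping: the function `json_to_xml`, the helper `_append`, the root tag `record`, the `item` tag used for list members and the sample record are all assumptions. The sketch also assumes that every JSON key is a valid XML element name.

```python
import json
import xml.etree.ElementTree as ET


def _append(parent: ET.Element, value) -> None:
    """Recursively attach a decoded JSON value to an XML element."""
    if isinstance(value, dict):
        for key, child_value in value.items():
            # Assumption: each key is usable as an XML element name.
            _append(ET.SubElement(parent, key), child_value)
    elif isinstance(value, list):
        for member in value:
            # Hypothetical convention: list members become <item> elements.
            _append(ET.SubElement(parent, "item"), member)
    else:
        parent.text = "" if value is None else str(value)


def json_to_xml(json_text: str, root_tag: str = "record") -> str:
    """Express the same data in XML syntax instead of JSON syntax."""
    root = ET.Element(root_tag)
    _append(root, json.loads(json_text))
    return ET.tostring(root, encoding="unicode")


# Illustrative record; the field names and values are not from this document.
sample = (
    '{"organism": "Danio rerio", "process": "somitogenesis", '
    '"assays": ["imaging", "transcriptomics"]}'
)
xml_text = json_to_xml(sample)
```

The information content is unchanged by the conversion; only the syntax in which it is expressed differs.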
3.16
repurposing
practice of using data for a purpose other than that for which they were originally collected

Note 1 to entry: For example, microscope images originally collected for cell counting purposes might be repurposed and used to measure cell morphology.
4 Abbreviated terms
BBSRC Biotechnology and Biological Sciences Research Council
ChEBI Chemical Entities of Biological Interest
DNA Deoxyribonucleic Acid
EOSC European Open Science Cloud
EU European Union
FAIR Findable, Accessible, Interoperable, Reusable
CASRN Chemical Abstracts Service Registry Number
HTML Hypertext Markup Language
MIABIS Minimum Information about Biobank Information Sharing
MIAME Minimum Information about a Microarray Experiment
NCBI National Center for Biotechnology Information

NIH United States Department of Health and Human Services, National Institutes of Health

OME Open Microscopy Environment
OMERO Open Microscopy Environment Remote objects
OSPP Open Science Policy Platform of the European Union
OWL W3C Web Ontology Language
PDF Portable Document Format
PID Persistent Identifier
POD Plain Old Documentation
RDF Resource Description Framework
UCSD University of California San Diego
URI Uniform Resource Identifier
URL Uniform Resource Locator
USA United States of America
SBML Systems Biology Markup Language
VEGFa Vascular Endothelial Growth Factor a
XML Extensible Markup Language
5 Principles
5.1 General

Data sharing by definition is more than simply the publication of summary statistics in tables. It also includes sharing of raw data from which the summaries are generated [8].

The challenge to both researchers and funding agencies is determining what data are shared and how, and what metrics might be used to judge the suitability of a sharing plan. For example, the breadth and variety of science supported by the US National Institutes of Health (NIH) prevents the precise content of documentation, its presentation or its transport from being stipulated, i.e. one size does not fit all. As a result, the NIH encourages discussion of data sharing standards and practices between disciplines and professional societies to create a supportive data sharing environment [8].

This view, however, leaves the researcher, reviewer and funding agency without enough guidance and metrics to judge a plan. In addition, it lacks any attempt at standardizing any of the aspects of the data across technology domains, leaving open the potential for ineffective data sharing.

FOUNDATIONAL CONCEPT: At the level of biological description, differences between life science technologies vanish, suggesting that a unifying standard spanning all the individual life science data communities can be used for data sharing (see Figure 1).

NOTE In the case shown here four technologies have been applied to the study of somitogenesis, a phase

of early embryonic development. Each technology domain (highlighted as - - - -) has its own data and metadata

specification. There is a critical need for a common, high level annotation scheme that describes the biology

(highlighted as — ·· — ·· —) included in an experiment or model in a (bio)technology-independent fashion.

Figure 1 — Multiple (bio)technologies can be applied to study a biological or biomedical problem.
Consistent annotation of the biological content of data aims at:

a) technology domain independence (i.e. not bound to a certain method or technology);

b) findability of the data;
c) data interoperability (facilitation of data integration);
d) facilitation of data reuse and repurposing.
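
The idea of a technology-independent biological annotation can be sketched as a small Python data model. This is a hypothetical illustration only: the class names `BiologyAnnotation` and `DataSet`, their fields, the function `find_by_process`, and all identifiers, technologies and formats in the sample records are assumptions, not structures defined by this document.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BiologyAnnotation:
    """Describes the biology studied, independent of the technology used."""
    organism: str
    process: str
    entities: frozenset = field(default_factory=frozenset)


@dataclass(frozen=True)
class DataSet:
    """A data set keeps its domain-specific format plus a shared annotation."""
    identifier: str
    technology: str   # domain-specific, e.g. imaging or proteomics
    data_format: str  # domain-specific format name
    biology: BiologyAnnotation


def find_by_process(datasets, process: str):
    """Locate data sets by biological process, whatever technology produced them."""
    wanted = process.casefold()
    return [d for d in datasets if d.biology.process.casefold() == wanted]


# Hypothetical records: three technologies applied to the same biology.
somito = BiologyAnnotation("Danio rerio", "somitogenesis")
other = BiologyAnnotation("Danio rerio", "angiogenesis", frozenset({"VEGFa"}))
DATASETS = [
    DataSet("ds-001", "imaging", "format-A", somito),
    DataSet("ds-002", "transcriptomics", "format-B", somito),
    DataSet("ds-003", "computational model", "format-C", somito),
    DataSet("ds-004", "proteomics", "format-D", other),
]

hits = find_by_process(DATASETS, "Somitogenesis")
technologies = sorted(d.technology for d in hits)
```

In this sketch a single query on the biological process returns data sets from several technology domains, which is the kind of findability and interoperability that aims a) to d) describe.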
5.2 Current technologies, approaches and their flaws

Factors that can contribute to the lack of effective sharing and reuse of biological data include:

a) Many communities and their data formats were established before the internet and search engines were available.
b) The data are not “published” or only partially published (e.g. only available in the form of a summary such as group averages).
c) The data are published in a form, format or location that is not easily interpreted or located.
d) The data are published in a suitable format and at a suitable location, but the terms used in the document are not standard nomenclature, making the data difficult to find.
5.3 Standards and best practices to facilitate data sharing and reuse
5.3.1 Maximizing value to the payer

Any standard aiming to facilitate data sharing and reuse maximizes value to the payer by maximizing the number of users and uses of data. Uses of data often extend beyond what was envisioned by the payer and data producer.
5.3.2 Data findability

Data findability by authorized users is essential. In the best case, it will not be necessary to first locate the URI associated with the data source. Current data search applications use artificial intelligence to learn ontological terms that are missing from controlled vocabularies, to identify relationships between those terms, the terms already in use and the query terms, and to recommend synonyms. Database communities, commercial and public, have already begun to use this type of application. However, data producers can proactively consult ontological databases to determine the controlled vocabularies that best render their data findable. Furthermore, “findability” is independent of the biotech domain, and of that domain's standard repository, since the life sciences data are described in biological terms and are indexed by web search engines. Searchability of scientific content on the web is covered in Annex A.
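
How a data producer might map free-text terms onto a controlled vocabulary before publication can be sketched as follows. This is a minimal illustration under stated assumptions: the vocabulary content, the preferred terms, the synonym lists and the function names `normalize`, `build_synonym_index` and `to_preferred` are all hypothetical and do not come from any real ontology or from this document.

```python
def normalize(term: str) -> str:
    """Fold case, hyphens and repeated spaces so spelling variants compare equal."""
    return " ".join(term.casefold().replace("-", " ").split())


def build_synonym_index(vocabulary: dict) -> dict:
    """Map every normalized preferred term and synonym to its preferred term."""
    index = {}
    for preferred, synonyms in vocabulary.items():
        for term in (preferred, *synonyms):
            key = normalize(term)
            if index.get(key, preferred) != preferred:
                raise ValueError(f"ambiguous synonym: {term!r}")
            index[key] = preferred
    return index


def to_preferred(terms, index: dict):
    """Split terms into controlled-vocabulary matches and unknown terms."""
    mapped, unknown = [], []
    for term in terms:
        preferred = index.get(normalize(term))
        if preferred is None:
            unknown.append(term)
        else:
            mapped.append(preferred)
    return mapped, unknown


# Hypothetical vocabulary; a real one would come from an ontology database.
VOCABULARY = {
    "VEGFA": ["VEGF-A", "VEGFa", "vascular endothelial growth factor A"],
    "somitogenesis": ["somite formation"],
}
INDEX = build_synonym_index(VOCABULARY)

mapped, unknown = to_preferred(["vegf-a", "Somite  formation", "notochord"], INDEX)
```

Terms returned in `unknown` are the ones a producer would need to look up in an ontological database, or replace, before the data description is likely to be found by a query using standard nomenclature.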

5.3.3 Data machine and human interpretability

Life science data can be both machine and human interpretable. The development of “reification technologies” that can present a data document in multiple ways can greatly improve the ability to create documents that are both machine and human interpretable. An example of this technology partially exists in the systems biology domain, especially when it comes to modelling biological processes. Systems Biology Markup Language (SBML) [4] is a free and open interchange format for computer models of biological processes. An SBML format file describes a set of mathemat

...

Contents

Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Abbreviated terms
5 Principles
5.1 General
5.2 Current technologies, approaches and their flaws
5.3 Standards and best practices to facilitate data sharing and reuse
5.3.1 Maximizing value to the payer
5.3.2 Data findability
5.3.3 Data machine and human interpretability
5.3.4 Using accepted controlled vocabularies and naming conventions
5.3.5 Biological annotation technology domain independence
5.3.6 Data locatability using multiple queries
5.4 Additional desirable attributes
5.4.1 Data linkage to a published and openly accessible document describing the experimental system
5.4.2 Data format linkage to a published and openly accessible document describing the format
5.4.3 Existing information technology
5.4.4 Development of tools and best practices for creating web friendly and search engine crawlable data documents
5.5 Essential considerations
5.5.1 Common annotation across multiple data sources
5.5.2 Keyword template
5.5.3 Embedding ontological descriptions
5.5.4 Pseudo-documents
6 Major challenges
6.1 General
6.2 Domain
6.3 Regionalization
6.4 Proprietary data
6.5 Large number of existing bio-ontologies, controlled vocabularies and terminologies
6.6 Large number of existing data repositories and corresponding domain specific data formats
6.7 Large number of funding agencies (e.g. national, educational, philanthropic, commercial)
7 Examples of existing national and regional standards or requirements for data sharing or publication
7.1 General
7.2 USA
7.3 Canada
7.4 European Union
7.5 Germany
7.6 China
7.7 United Kingdom
7.8 India
7.9 Japan
8 Existing legal requirements for data protection
8.1 USA
8.2 European Union
9 Timing of data publication
10 Costs of data publication
11 Archival data
12 Validation and verification of compliance
13 Affected stakeholder categories
Annex A (informative) Searchability of scientific content on the web
Annex B (informative) Example enhanced annotation of text documents
Bibliography

Foreword

ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been established has the right to be represented on that committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.

The procedures used to develop this document and those intended for its further maintenance are described in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types of ISO documents should be noted. This document was drafted in accordance with the editorial rules of the ISO/IEC Directives, Part 2 (see https://www.iso.org/directives-and-policies.html).

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. ISO shall not be held responsible for identifying any or all such patent rights. Details of any patent rights identified during the development of the document will be in the Introduction and/or on the ISO list of patent declarations received (see www.iso.org/patents).

Any trade name used in this document is information given for the convenience of users and does not constitute an endorsement.

For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions related to conformity assessment, as well as information about ISO's adherence to the World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT), see www.iso.org/iso/foreword.html.

This document was prepared by Technical Committee ISO/TC 276, Biotechnology.

Any feedback or questions on this document should be directed to the user’s national standards body. A complete listing of these bodies can be found at www.iso.org/members.html.

Introduction

The explosion of life sciences data (bigdata) has created a need to digitally locate data from diverse biological assays, obtained in a wide range of laboratories, and from a wide range of experimental protocols. To be able to extract value from bigdata, it is necessary that the data be “findable”, and that the biology measured in the assay is described in a way that it can be located and interpreted. Data producers' use of a consistent method to describe the biology that their data represent can greatly improve the use of big data. Such a single, unified description of biological data facilitates locating and extracting value from an abundance of biological data and returns increased value to funding organizations.

Many biotech communities have already developed standard data representations specific to their domain. Examples include MIAME [1] in the microarray community, OME [2]/OMERO [3] in the imaging and microscopy communities, SBML [4] in the systems biology and reaction kinetics community, and MIABIS [5] in the biobanking domain. What is lacking is a consistent method of describing the represented biological information so that the same search, analysis and mining tools can locate data across the entire range of life science domains. Consensus and guidance are required and provided in this document for the biotech domain-independent annotation of biological data.

The importance of data sharing as an integral part of biological research is recognized in the research community. As a result, a diverse set of stakeholders have developed the FAIR (Findable, Accessible, Interoperable and Reusable) data sharing principles [7]. The intent of FAIR is to act as a guideline for sharing and enhancing the reusability of data holdings. Many life science funding organizations also place increased emphasis on the importance of data sharing. Some require that data sharing plans are included in grant applications and research contracts, i.e. “data must be made as widely and freely available as possible while safeguarding the privacy of participants and protecting confidential and proprietary data [8].” Data sharing is equally critical for various national and international research and biobank networks. Data sharing is known to encourage diversity of analysis and opinion, to support the testing of alternative hypotheses and to enable explorations not envisioned by the original investigators, resulting in increased value to the funding organization.

This document lays out concepts, challenges, issues and benefits that are relevant to developing International Standards for data sharing in life science research and provides an overview for specifying standards and best practices that enable data sharing.
1 Scope
This document reviews best practices that:

a) respect the existing standardization efforts of life sciences research communities;
b) normalize key aspects of data description particularly at the level of the biology being studied (and shared) across the life sciences communities;
c) ensure that data are “findable” and useable by other researchers; and
d) provide guidance and metrics for assessing the applicability of a particular data sharing plan.

This document is applicable to domains in life sciences including biotechnology, genomics (including massively parallel nucleotide sequencing, metagenomics, epigenomics and functional genomics), transcriptomics, translatomics, proteomics, metabolomics, lipidomics, glycomics, enzymology, immunochemistry, life science imaging, synthetic biology, systems biology, systems medicine and related fields.
2 Normative references
There are no normative references in this document.
3 Terms and definitions
For the purposes of this document, the following terms and definitions apply.

ISO and IEC maintain terminological databases for use in standardization at the following addresses:

— ISO Online browsing platform: available at https://www.iso.org/obp
— IEC Electropedia: available at http://www.electropedia.org/
3.1
big data
bigdata

extensive datasets (3.7) — primarily in the data (3.2) characteristics of volume, variety, velocity, and/or variability — that require a scalable technology for efficient storage, manipulation, management, and analysis

Note 1 to entry: Big data is commonly used in many ways, for example as the name of the scalable technology used to handle big data extensive data sets.

Note 2 to entry: Big data includes any data that are aggregated into a repository of much larger size than the component data parts. For example, the collection of abstracts of biological publications represents a big data set with more than 20 million entries.

[SOURCE: ISO/IEC 20546:2019, 3.1.2, modified — “bigdata” was given as an alternative term and Note 2 to entry was added.]
3.2
data

reinterpretable representation of information in a formalized manner suitable for communication, interpretation or processing
[SOURCE: ISO/IEC 2382:2015, 2121272, modified — All three notes were removed.]
3.3
data archiver
archiver

individual or organization responsible for the long-term persistence of data and the access to that data

Note 1 to entry: An archiver receives data from a producer and can be funded by the same payer or a different one.

3.4
data consumer
consumer
user
individual or organization that uses data as a starting point

Note 1 to entry: In the research domain, a data consumer is a scientist or research group.

Note 2 to entry: In the medical domain, a data consumer can be a physician or patient. In some cases, the consumer can also be the payer.
3.5
data producer
producer

organization or individual that carries out an experiment or measurement, funded by a payer (3.11), and producing a data set

Note 1 to entry: In the research domain, the producer is typically a researcher; in the commercial domain, the producer can be a contract laboratory.
3.6
data publication
publication
any of several forms in which data are made available to a wider community

Note 1 to entry: This includes traditional scientific publications in journals as well as the sharing of data via a public repository such as GenBank. Data publication is typically, though not always, carried out by an entity dedicated to the collection and dissemination of data, e.g. a data archiver (3.3).

Note 2 to entry: The “wider community” refers to data consumers other than the individuals or organization that obtained the data.
3.7
data set
dataset
identifiable collection of data

[SOURCE: ISO 19115-1:2014, 4.3, modified — “dataset” was given as an alternative term and Note 1 to entry was deleted.]
3.8
data sharing
sharing

making data (e.g. numerical, textual, images) available to, and findable by, others

Note 1 to entry: Data are not truly shared if they cannot be found.
3.9
data sharing plan

formalized description of how a data producer (3.5) will accomplish the task of data sharing (3.8)

3.10
metadata
meta-data
data that define and describe other data

[SOURCE: ISO/IEC 11179-1:2015, 3.2.16, modified — “meta-data” was added as an alternative term.]

3.11
payer
organization responsible for funding research

Note 1 to entry: This can be a government organization such as a national research institute, a philanthropic organization, a private research organization or, in the medical case, a national or private insurance organization.

3.12
proprietary data

data stored in such a way that, by design and implementation, they are not accessible to everyone

Note 1 to entry: Proprietary data include, but are not limited to, data proprietary to an organization such as a company, or data proprietary to an individual such as health records.
Note 2 to entry: Proprietary data are the opposite of public data (3.13).
3.13
public data

data stored in such a way that, by design and implementation, they are accessible to everyone

Note 1 to entry: Public data are the opposite of proprietary data (3.12).
3.14
regionalization
process of expressing a text or data in a particular human language

Note 1 to entry: This includes not only the textual part of the document but also the date formats and varying usages and meanings of commas (,) and periods (.) in numeric formats.
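The separator and date-format differences mentioned in this note can be illustrated with a short sketch. The function names and sample values are hypothetical and chosen only for illustration; the sketch assumes the regional convention of each value is known in advance rather than detected automatically.

```python
from datetime import datetime


def parse_regional_number(text, decimal_sep=".", group_sep=","):
    """Convert a regionally formatted numeric string to a float."""
    if decimal_sep == group_sep:
        raise ValueError("decimal and grouping separators must differ")
    # Drop grouping separators first, then normalize the decimal separator.
    cleaned = text.strip().replace(group_sep, "").replace(decimal_sep, ".")
    return float(cleaned)


def parse_regional_date(text, pattern):
    """Parse a date string using an explicit, region-specific strptime pattern."""
    return datetime.strptime(text, pattern).date()


# The same quantity and the same date written under two regional conventions.
value_a = parse_regional_number("1,234.5")
value_b = parse_regional_number("1.234,5", decimal_sep=",", group_sep=".")
day_a = parse_regional_date("05/17/2021", "%m/%d/%Y")
day_b = parse_regional_date("17.05.2021", "%d.%m.%Y")
print(value_a == value_b, day_a == day_b)
```

Passing the convention explicitly avoids guessing: a string such as "1,234" is ambiguous without knowing which region produced it.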
3.15
reification
expression of data or knowledge in a specific language or syntax

Note 1 to entry: Examples include expressing or converting structured data from one format to another, such as from JSON to XML.

Note 2 to entry: Reification also means making a topic represent the subject of another topic map construct in the same topic map according to ISO/IEC 13250-2:2006, 3.11.
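The JSON-to-XML case in Note 1 can be sketched with the Python standard library. The function name, the root tag and the sample record are hypothetical; the sketch assumes a flat JSON object whose keys are valid XML element names.

```python
import json
import xml.etree.ElementTree as ET


def json_to_xml(json_text, root_tag="record"):
    """Re-express a flat JSON object as an XML element with one child per key."""
    data = json.loads(json_text)
    root = ET.Element(root_tag)
    for key, value in data.items():
        child = ET.SubElement(root, key)  # assumes the key is a valid element name
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")


# Illustrative record only; the values are not taken from this document's data.
xml_text = json_to_xml('{"organism": "Danio rerio", "process": "somitogenesis"}')
print(xml_text)
```

The information content is unchanged; only the syntax in which it is expressed differs, which is the sense of reification used in this definition.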
3.16
repurposing
practice of using data for a purpose other than that for which they were originally collected

Note 1 to entry: For example, microscope images originally collected for cell counting purposes might be repurposed and used to measure cell morphology.
4 Abbreviated terms
BBSRC Biotechnology and Biological Sciences Research Council
CASRN Chemical Abstracts Service Registry Number
ChEBI Chemical Entities of Biological Interest
DNA Deoxyribonucleic Acid
EOSC European Open Science Cloud
EU European Union
FAIR Findable, Accessible, Interoperable, Reusable
HTML Hypertext Markup Language
MIABIS Minimum Information about Biobank Information Sharing
MIAME Minimum Information about a Microarray Experiment
NCBI National Center for Biotechnology Information
NIH United States Department of Health and Human Services, National Institutes of Health
OME Open Microscopy Environment
OMERO Open Microscopy Environment Remote Objects
OSPP Open Science Policy Platform of the European Union
OWL W3C Web Ontology Language
PDF Portable Document Format
PID Persistent Identifier
POD Plain Old Documentation
RDF Resource Description Framework
SBML Systems Biology Markup Language
UCSD University of California San Diego
URI Uniform Resource Identifier
URL Uniform Resource Locator
USA United States of America
VEGFa Vascular Endothelial Growth Factor a
XML Extensible Markup Language
5 Principles
5.1 General

Data sharing by definition is more than simply publication of summary statistics in tables. It also includes sharing of raw data from which the summaries are generated [8].

The challenge to both researchers and funding agencies is determining what data are shared and how, and what metrics might be used to judge the suitability of a sharing plan. For example, the breadth and variety of science supported by the US National Institutes of Health (NIH) prevents the precise content of documentation, its presentation or its transport from being stipulated, i.e. one size does not fit all. As a result, the NIH encourages discussion of data sharing standards and practices between disciplines and professional societies to create a supportive data sharing environment [8].

This view, however, leaves the researcher, reviewer and funding agency without enough guidance and metrics to judge a plan. In addition, it lacks any attempt at standardizing any of the aspects of the data across technology domains, leaving open the potential for ineffective data sharing.

FOUNDATIONAL CONCEPT: At the level of biological description, differences between life science technologies vanish, suggesting that a unifying standard spanning all the individual life science data communities can be used for data sharing (see Figure 1).

NOTE In the case shown here, four technologies have been applied to the study of somitogenesis, a phase of early embryonic development. Each technology domain (highlighted with a dashed line) has its own data and metadata specification. There is a critical need for a common, high level annotation scheme that describes the biology (highlighted with a dash-dot-dot line) included in an experiment or model in a (bio)technology-independent fashion.

Figure 1 — Multiple (bio)technologies can be applied to study a biological or biomedical problem.

Consistent annotation of the biological content of data aims at:

a) technology domain independence (i.e. not bound to a certain method or technology);

b) findability of the data;
c) data interoperability (facilitation of data integration);
d) facilitation of data reuse and repurposing.
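
The aims listed above can be illustrated with a minimal sketch in which data sets produced by different technologies carry a shared biological annotation and are located through it. The class, field names, identifiers and catalogue entries are hypothetical and only stand in for whatever annotation scheme is eventually adopted.

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    """A published data set with a technology-independent biological annotation."""
    identifier: str   # hypothetical local identifier
    technology: str   # technology domain that produced the data
    biology_terms: set = field(default_factory=set)  # shared biological annotation


def find_by_biology(data_sets, term):
    """Return data sets annotated with the given biological term, whatever the technology."""
    wanted = term.lower()
    return [ds for ds in data_sets if wanted in {t.lower() for t in ds.biology_terms}]


catalogue = [
    DataSet("ds-1", "microarray", {"somitogenesis"}),
    DataSet("ds-2", "imaging", {"Somitogenesis", "cell morphology"}),
    DataSet("ds-3", "proteomics", {"angiogenesis"}),
]
hits = find_by_biology(catalogue, "somitogenesis")
print([(ds.identifier, ds.technology) for ds in hits])
```

Because the query addresses only the biological terms, the same search reaches microarray and imaging data alike, which is the technology domain independence named in item a).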
5.2 Current technologies, approaches and their flaws

Factors that can contribute to the lack of effective sharing and reuse of biological data include:

a) Many communities and their data formats were established before the internet and search engines were available.
b) The data are not “published” or only partially published (e.g. only available in the form of a summary such as group averages).
c) The data are published in a form, format or location that is not easily interpreted or located.
d) The data are published in a suitable format and at a suitable location, but the terms used in the document do not follow standard nomenclature, making the data difficult to find.
5.3 Standards and best practices to facilitate data sharing and reuse
5.3.1 Maximizing value to the payer

Any standard aiming to facilitate data sharing and reuse maximizes value to the payer by maximizing the number of users and uses of data. Uses of data often extend beyond what was envisioned by the payer and data producer.
