Information technology -- Genomic information representation

General Information

Status: Published
Current Stage: 5020 - FDIS ballot initiated: 2 months. Proof sent to secretariat
Start Date: 24-Jul-2020
Completion Date: 24-Jul-2020

Standard
ISO/IEC DIS 23092-4 - Information technology -- Genomic information representation
English language
6 pages

Standards Content (sample)

DRAFT INTERNATIONAL STANDARD
ISO/IEC DIS 23092-4
ISO/IEC JTC 1/SC 29 Secretariat: JISC
Voting begins on: 2020-01-22
Voting terminates on: 2020-04-15
Information technology — Genomic information
representation —
Part 4:
Reference software
ICS: 35.040.99
THIS DOCUMENT IS A DRAFT CIRCULATED FOR COMMENT AND APPROVAL. IT IS THEREFORE SUBJECT TO CHANGE AND MAY NOT BE REFERRED TO AS AN INTERNATIONAL STANDARD UNTIL PUBLISHED AS SUCH.

IN ADDITION TO THEIR EVALUATION AS BEING ACCEPTABLE FOR INDUSTRIAL, TECHNOLOGICAL, COMMERCIAL AND USER PURPOSES, DRAFT INTERNATIONAL STANDARDS MAY ON OCCASION HAVE TO BE CONSIDERED IN THE LIGHT OF THEIR POTENTIAL TO BECOME STANDARDS TO WHICH REFERENCE MAY BE MADE IN NATIONAL REGULATIONS.

RECIPIENTS OF THIS DRAFT ARE INVITED TO SUBMIT, WITH THEIR COMMENTS, NOTIFICATION OF ANY RELEVANT PATENT RIGHTS OF WHICH THEY ARE AWARE AND TO PROVIDE SUPPORTING DOCUMENTATION.

This document is circulated as received from the committee secretariat.

Reference number: ISO/IEC DIS 23092-4:2020(E)
© ISO/IEC 2020
COPYRIGHT PROTECTED DOCUMENT
© ISO/IEC 2020

All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting on the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address below or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Fax: +41 22 749 09 47
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
Contents

Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Abbreviated terms
5 Copyright Disclaimer for Software Modules
6 Genomic Model
6.1 Genomic Model availability
6.2 Compilation and usage of the Genomic Model
6.3 Decoding Software
6.3.1 Decoding Software modules
6.3.2 Feature availability

Foreword

ISO (the International Organization for Standardization) and IEC (the International Electrotechnical Commission) form the specialized system for worldwide standardization. National bodies that are members of ISO or IEC participate in the development of International Standards through technical committees established by the respective organization to deal with particular fields of technical activity. ISO and IEC technical committees collaborate in fields of mutual interest. Other international organizations, governmental and non-governmental, in liaison with ISO and IEC, also take part in the work.

The procedures used to develop this document and those intended for its further maintenance are described in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types of document should be noted. This document was drafted in accordance with the editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives).

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. ISO and IEC shall not be held responsible for identifying any or all such patent rights. Details of any patent rights identified during the development of the document will be in the Introduction and/or on the ISO list of patent declarations received (see www.iso.org/patents) or the IEC list of patent declarations received (see http://patents.iec.ch).

Any trade name used in this document is information given for the convenience of users and does not constitute an endorsement.

For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions related to conformity assessment, as well as information about ISO's adherence to the World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT), see www.iso.org/iso/foreword.html.

This document was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology, Subcommittee SC 29, Coding of audio, picture, multimedia and hypermedia information.

A list of all parts in the ISO/IEC 23092 series can be found on the ISO website.

Any feedback or questions on this document should be directed to the user’s national standards body. A complete listing of these bodies can be found at www.iso.org/members.html.
Introduction

The advent of High-Throughput Sequencing (HTS) technologies has the potential to boost the adoption of genomic information in everyday practice, ranging from biological research to personalized genomic medicine in the clinic. As a consequence, the volume of generated data has grown extraordinarily over the last few years, and even more pronounced growth is expected in the near future.

At the moment, genomic information is mostly exchanged through a variety of data formats, such as FASTA/FASTQ for unaligned sequencing reads and SAM/BAM/CRAM for aligned reads. With respect to such formats, ISO/IEC 23092 provides a new solution for the representation and compression of genome sequencing information by:

— specifying an abstract representation of the sequencing data rather than a specific format with its direct implementation.

— being designed at a time point when technologies and use cases are more mature. This makes it possible to address one limitation of the textual SAM format, to which features were added incrementally and ad hoc over the years, resulting in a redundant and suboptimal format that is at the same time not general and unnecessarily complicated.

— normatively separating free-field user-defined information with no clear semantics from the normative genomic data representation. This allows a fully interoperable and automatic exchange of information between different data producers.

— allowing multiplexing of relevant metadata information with the data, since data and metadata are partitioned at different conceptual levels.

— following a strict and supervised development process which has proven successful in the last 30 years in the domain of digital media for the transport format, the file format, the compressed representation and the application program interfaces.

This document provides the enabling technology that will allow the community to create an ecosystem of novel, interoperable solutions in the field of genomic information processing. In particular, it offers:

— Consistent, general and properly designed format definitions and data structures to store sequencing and alignment information. A robust framework which can be used as a foundation to implement different compression algorithms.

— Speed and flexibility in the selective access to coded data, by means of newly designed data clustering and optimized storage methodologies.

— Low latency in data transmission and consequent fast availability at remote locations, based on transmission protocols inspired by real-time application domains.

— Built-in privacy and protection of sensitive information, thanks to a flexible framework which allows customizable secured access at all layers of the data hierarchy.

— Reliability of the technology and interoperability among tools and systems, owing to the provision of a normative procedure to assess conformance to the standard on an exhaustive dataset.

— Support for the implementation of a complete ecosystem of compliant devices and applications, through the availability of a normative reference implementation covering the entire specification.

The fundamental structure of the ISO/IEC 23092 series data representation is the genomic record. The genomic record is a data structure consisting of either a single sequence read, or a paired sequence read, and its associated sequencing and alignment information; it may contain detailed mapping and alignment data, a single or paired read identifier (read name) and quality values.
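
As an informal illustration only, the genomic record described above can be pictured as a small data structure. The sketch below is written in Python under stated assumptions: the class and field names (GenomicRecord, Alignment, read_name, sequences, quality_values, alignments) are hypothetical and are not taken from ISO/IEC 23092 or from its reference software.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Alignment:
    """Hypothetical mapping information for one read (names are assumptions)."""
    reference_name: str
    position: int  # leftmost mapping position; 0-based here by assumption


@dataclass(frozen=True)
class GenomicRecord:
    """Illustrative record: a single read or a read pair plus optional data."""
    read_name: str
    sequences: Tuple[str, ...]  # one entry for a single read, two for a pair
    quality_values: Optional[Tuple[str, ...]] = None
    alignments: Optional[Tuple[Alignment, ...]] = None

    def __post_init__(self) -> None:
        # A record holds exactly one read or one read pair.
        if len(self.sequences) not in (1, 2):
            raise ValueError("a record holds a single read or a read pair")
        # When present, quality values come one string per read.
        if (self.quality_values is not None
                and len(self.quality_values) != len(self.sequences)):
            raise ValueError("expected one quality string per read")

    @property
    def is_paired(self) -> bool:
        return len(self.sequences) == 2


# Usage: an unaligned single read, and an aligned pair with quality values.
single = GenomicRecord("read_1", ("ACGTAC",))
pair = GenomicRecord(
    "read_2",
    ("ACGT", "TTGA"),
    quality_values=("IIII", "IIII"),
    alignments=(Alignment("chr1", 100), Alignment("chr1", 350)),
)
```

The optional fields mirror the statement that a record "may contain" mapping data, a read name and quality values; the actual field set and encoding are defined by the standard, not by this sketch.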

Without breaking traditional approaches, the genomic record introduced in the ISO/IEC 23092 series provides a more compact, simpler and more manageable data structure grouping all the information related to a single DNA template, from simple sequencing data to sophisticated alignment information.

The genomic record, although it is an appropriate logical data structure for interaction and manipulation of coded information, is not a suitable atomic data structure for compression. To achieve high compression ratios, it is necessary to group genomic records into clusters and to transform the information of the same type into sets of descriptors structured into homogeneous blocks. Furthermore, when dealing with selective data access, the genomic record is too small a unit to allow effective and fast information retrieval.
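
The grouping idea in the paragraph above can be sketched as follows. This is a minimal Python illustration, not the clustering or descriptor scheme defined by the standard: the Record type, the fixed-size position window used for clustering, the block names and the delta coding of positions are all assumptions chosen to show why same-type values stored together form more homogeneous blocks.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Record(NamedTuple):
    """Hypothetical flattened record used only for this illustration."""
    name: str
    reference: str
    position: int
    sequence: str


def build_clusters(records: Iterable[Record],
                   window: int = 1000) -> Dict[Tuple[str, int], List[Record]]:
    """Group records by reference and by a fixed-size position window."""
    clusters: Dict[Tuple[str, int], List[Record]] = defaultdict(list)
    for rec in records:
        clusters[(rec.reference, rec.position // window)].append(rec)
    return dict(clusters)


def to_descriptor_blocks(cluster: List[Record]) -> Dict[str, list]:
    """Turn a cluster of records into one homogeneous block per field."""
    ordered = sorted(cluster, key=lambda rec: rec.position)
    positions = [rec.position for rec in ordered]
    # Delta coding (an assumption): the first position is kept as-is,
    # later ones are stored as differences from the previous position.
    deltas = positions[:1] + [b - a for a, b in zip(positions, positions[1:])]
    return {
        "names": [rec.name for rec in ordered],
        "position_deltas": deltas,
        "sequences": [rec.sequence for rec in ordered],
    }


# Usage: four records fall into three clusters; one cluster is transposed.
records = [
    Record("a", "chr1", 1500, "ACGT"),
    Record("b", "chr1", 1200, "TTGA"),
    Record("c", "chr1", 2100, "GGCC"),
    Record("d", "chr2", 10, "AAAA"),
]
clusters = build_clusters(records, window=1000)
blocks = to_descriptor_blocks(clusters[("chr1", 1)])
```

Each block holds values of a single type, which a general-purpose or specialized entropy coder can usually compress better than interleaved per-record fields. A cluster is also a larger unit than a single record, which matters for selective access.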

For these reasons, this document introduces the concept of access unit, which is the fundamental structure for coding and access to information in the compressed domain.

The access unit is the smallest data structure that can be decoded by a decoder compliant with Part 2 of this Specification. An access unit is
...
