SIST ISO 24619:2014
Language resource management - Persistent identification and sustainable access (PISA)
ISO 24619:2011 specifies requirements for the persistent identifier (PID) framework and for using PIDs as references and citations of language resources in documents as well as in language resources themselves. In this context, examples of language resources include such works as digital dictionaries, language-purposed terminological resources, machine-translation lexica, annotated multimedia/multimodal corpora, text corpora that have been annotated with, for example, morpho-syntactic information, and the like. Computational and applied linguists and information specialists create such resources.
ISO 24619:2011 also addresses issues of persistence and granularity of references to resources, first by requiring that persistent references be implemented by using a PID framework and further by imposing requirements on any PID frameworks used for this purpose.
PID frameworks also allow the association of general metadata with the identifier, which can also contain citation information. ISO 24619:2011 specifies minimum requirements for effective use of PIDs in language resources and cites the use of several possible existing standards and de-facto standards.
Gestion des ressources langagières - Identification et accès pérennes
Upravljanje z jezikovnimi viri - Stalna identifikacija in trajen dostop (PISA)
This International Standard specifies requirements for the persistent identifier (PID) framework and for using PIDs as references and citations of language resources in documents as well as in language resources themselves. In this context, examples of language resources include works such as digital dictionaries, language-purposed terminological resources, machine-translation lexica, annotated multimedia/multimodal corpora, text corpora annotated with, for example, morpho-syntactic information, and the like. Such resources are created by linguists working in computational and applied linguistics. This International Standard also addresses the persistence and granularity of references to resources, first by requiring that persistent references be implemented using a PID framework and then by imposing requirements on any PID framework used for this purpose. PID frameworks also allow general metadata, which can also contain citation information, to be associated with identifiers. This International Standard specifies minimum requirements for the effective use of PIDs in language resources and cites several possible existing standards and de facto standards, such as: ISO 690 [16], APA [3], MLA [9] for citation information; ISO/IEC 21000-17, IETF RFC 5147, Annotea [2], temporal-fragment [22] and XPointer for part identifier syntax; and PURL [23], ARK [18], Handle System [24] and DOI [14].
SLOVENIAN STANDARD
1 September 2014
Language resource management - Persistent identification and sustainable access (PISA)
This Slovenian standard is identical to: ISO 24619:2011
ICS:
01.020 Terminology (principles and coordination)
01.140.20 Information sciences
35.240.30 IT applications in information, documentation and publishing
2003-01. Slovenian Institute for Standardization. Reproduction of this standard, in whole or in part, is not permitted.
INTERNATIONAL STANDARD ISO 24619
First edition
2011-05-15
Language resource management - Persistent identification and sustainable access (PISA)
Gestion des ressources langagières - Identification et accès pérennes
© ISO 2011
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 • CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
3.1 Resources
3.2 Identifiers
3.3 Roles, institutions and services
3.4 Actions
4 Background
5 Requirements for PID frameworks and PID use
5.1 General
5.2 PID framework requirements
5.3 PID usage
5.4 Citation information and persistent identifiers
5.5 Referencing resource parts
5.6 Collections
6 Complementary requirements
6.1 Granularity of identifiers
6.2 Recommendations
Annex A (informative) Independent resources, aggregated resources, and parts of resources
Annex B (informative) Persistent identifier system implementations
Annex C (informative) Abbreviated terms
Bibliography
Alphabetical Index
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies
(ISO member bodies). The work of preparing International Standards is normally carried out through ISO
technical committees. Each member body interested in a subject for which a technical committee has been
established has the right to be represented on that committee. International organizations, governmental and
non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the
International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of technical committees is to prepare International Standards. Draft International Standards
adopted by the technical committees are circulated to the member bodies for voting. Publication as an
International Standard requires approval by at least 75 % of the member bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO shall not be held responsible for identifying any or all such patent rights.
ISO 24619 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content
resources, Subcommittee SC 4, Language resource management.
Introduction
References and citations are an important part of documents and papers. Traditionally authors use them to
provide proper acknowledgment to the author(s) of other papers as a source for their work or use them to
support their argumentation. Citations usually contain information that enables a reader to establish the
possible relevance of the cited paper and to identify it unambiguously. Any librarian or knowledgeable person
is able to retrieve the document using well-established procedures based on the information in the citation.
The availability of directly accessible documents on the web has inspired the practice of adding a web location
(URI [4]) to the citation information. This practice has made it possible to access referenced documents
directly in web browsers as well as in other document viewers. This practice is already recommended in
standards like ISO 690, although the emphasis there is more on identifying published resources and parts
than on providing sustainable access to them. Increasingly often, such references need to be exploited by
machines and software applications as well as by people, requiring reliable availability of the referenced
resources. Problems with access that occur when resources are relocated have led to the use of persistent
identifier (PID) frameworks [23], [24]. Current approaches [18], [19], [24] address the resource relocation problem
by introducing resolver services that translate a resource identifier to its actual current location. These resolver
services have an added advantage of permitting the association of additional metadata with the identifier.
Elaborate frameworks such as the Digital Object Identifier (DOI) [14] use this feature to manage extra
services, for instance copyright information.
The practice of using persistent identifiers to cite and reference scientific data, along with individual resources
as well as data sets, is less well developed. It is no less powerful, however, in that it allows readers of a paper,
or users of a knowledge resource, direct access to the primary scientific data to which the resource refers.
When using references to access scientific data, including language resources, it becomes important to be
able also to refer to and access parts of resources. This is especially true in the domain of language
resources, where several layers of granularity are usually superimposed on the same data set or resource
collection. Therefore, discussions in this International Standard concerning the use and requirements for PID
frameworks extensively explore how these frameworks can deal efficiently with identifying and accessing parts
of resources. Special recommendations indicate how to approach the granularity issue when issuing PIDs for
resources and resource collections.
The need to apply PID frameworks for identifying resources contained in scientific data sets has also
increased since modern archives and repositories have begun to weave a network of related complex
resources that may be distributed over several locations. In these cases, permanent linkage is a prerequisite.
In a multimedia lexicon for instance, a lexical item can refer to images not necessarily physically in the lexicon,
or that are even referenced at a different site under control of a different organization. However, the link
between the lexicon item and the image must remain valid, even if some servers or files are subject to
relocation over time. Emerging e-Science scenarios, which make use of distributed services processing
distributed resources, are also completely dependent on having transparent access from any processing
service, irrespective of where it is located or what organization may operate it. This implies that resolving
resource references should not be hampered in any way by unnecessary dependencies involving reliance on
unsustainable or unpredictable services, whether they are technical or organizational.
The requirement that services like PID frameworks be accessible to the whole community of language
resource and technology providers is further complicated by the need to provide resolvable PIDs without
imposing commercial dependencies on resource providers other than the fundamental and well-established
requirements for maintaining resources on the Internet.
INTERNATIONAL STANDARD ISO 24619:2011(E)
Language resource management — Persistent identification
and sustainable access (PISA)
1 Scope
This International Standard specifies requirements for the persistent identifier (PID) framework and for using
PIDs as references and citations of language resources in documents as well as in language resources
themselves. In this context, examples of language resources include such works as digital dictionaries,
language-purposed terminological resources, machine-translation lexica, annotated multimedia/multimodal
corpora, text corpora that have been annotated with, for example, morpho-syntactic information, and the like.
Computational and applied linguists and information specialists create such resources.
This International Standard also addresses issues of persistence and granularity of references to resources,
first by requiring that persistent references be implemented by using a PID framework and further by imposing
requirements on any PID frameworks used for this purpose.
PID frameworks also allow the association of general metadata with the identifier, which can also contain
citation information. This International Standard specifies minimum requirements for effective use of PIDs in
language resources and cites the use of several possible existing standards and de-facto standards, such as:
ISO 690 [16], APA [3], MLA [9] for citation information, ISO/IEC 21000-17, IETF RFC 5147, Annotea [2],
temporal-fragment [22], XPointer for part identifier syntax and PURL [23], ARK [18], Handle System [24] and
DOI [14].
2 Normative references
The following referenced documents are indispensable for the application of this document. For dated
references, only the edition cited applies. For undated references, the latest edition of the referenced
document (including any amendments) applies.
ISO 12620:2009, Terminology and other language and content resources — Specification of data categories
and management of a Data Category Registry for language resources
ISO/IEC 21000-17:2006, Information technology — Multimedia framework (MPEG-21) — Part 17: Fragment
Identification of MPEG Resources
W3C 2003, XPointer Framework: [online] W3C Recommendation 25 March 2003 [viewed 2010-08-04].
Available from: http://www.w3.org/TR/xptr-framework/
WILDE, E. and DUERST, M. URI Fragment Identifiers for the text/plain Media Type, IETF RFC 5147, April 2008
[viewed 2010-12-22]. Available from: http://www.rfc-editor.org/rfc/rfc5147.txt
3 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
3.1 Resources
3.1.1
resource
digital object on the web with a specific identity that can be addressed with a URI (3.2.2)
NOTE 1 Adapted from IETF RFC 3986.
NOTE 2 In the context of this International Standard, a resource can also be a language resource that has an online
representation.
NOTE 3 A resource can have several representations. Depending on the PID framework (3.2.5), identification of a
specific representation can be encoded in the identifier (ARK, see B.3) or be left to the content negotiating process [8]
between the web client (3.3.8) that uses the resolved PID to fetch the resource (3.1.1) and the resource server (3.3.6).
3.1.2
language resource
digital resource that provides information about one or more languages
NOTE Language resources cover lexicographical, terminological, morpho-syntactical, corpus-related, or semantic
resources or digital resources used to study linguistic phenomena like texts and multimedia/multimodal recordings. They
are created and used by linguists, information specialists, lexicographers and terminologists, among others. They
frequently comprise many small records compiled within a larger work, and are often authoritative in nature, such as
standardized terminologies and glossaries issued by standards bodies such as ISO, IETF, W3C, etc.
3.1.3
complex resource
resource (3.1.1) consisting of multiple constituent parts, each of which can be accessed individually
NOTE A complex resource can be a federated resource if its constituent parts are distributed over different
repositories (3.1.6).
3.1.4
collection
grouping of any number of resources (3.1.1) that need to be referenced as a whole
3.1.5
published collection
purposefully built collection of resources that is maintained as an independent entity by an archive (3.1.7) or
repository (3.1.6) and for which adequate citation (3.1.16) information is available
3.1.6
digital repository
repository
facility that provides reliable access to managed digital resources (3.1.1)
3.1.7
archive
digital archive
repository (3.1.6) dedicated to the long-term preservation of its associated data
NOTE Often the data in digital archives are also available online, which highlights the need for reliable persistent
identifiers (3.2.4).
3.1.8
resource collection incarnation
incarnation
virtual embodiment of a disparate, otherwise non-aggregated collection (3.1.4) assembled for a specific
purpose that is referenced by a single PID (3.2.4) concatenated with a part identifier (3.2.7) in order to
access the components of the collection
NOTE A bibliography or index can use a single PID together with extensions to provide access to components in a
set of resources (3.1.1) used in the production of a monograph or project without actually collecting the physical files in
one location, which is to say that the individual items remain in their original locations, but are referenced as parts of a
virtual whole.
3.1.9
version
particular form or variation of a resource (3.1.1) that differs from other instantiations of the resource in at least
one aspect or item of information
NOTE Versions are often identified in sequential order (e.g. Version 1, 2, etc.), but version identification of dynamic
resources subject to frequent change is often achieved by assigning a date-time stamp.
3.1.10
snapshot
instantaneous copy of a resource (3.1.1) representing the status of the resource or collection at a single point
in time
3.1.11
abstract resource
non-network-retrievable resource identified by a URI (3.2.2), usually a concept such as a class or property
NOTE It is practice, for example in RDFS (RDF Schema) or OWL (web ontology language) ontologies, to identify
abstract resources using URIs. Web architecture does not require any information resource to be retrievable with this kind
of URI. If an identifier for an abstract resource is not meant to be dereferenced (3.4.1), such as can be the case with an
XML namespace URI, it is not meaningful to issue a PID (3.2.4) for this resource.
3.1.12
resource part
part
identifiable, accessible entity embedded in an independent resource (3.1.1) or in a larger part thereof
NOTE Parts can be embedded in other parts. In dynamic web environments, subsetting into parts is subject to
change and interpretation, which requires a certain level of user decision-making to designate and identify such sub-
entities.
3.1.13
fragment
some portion or subset of a primary resource (3.1.1), some view on representations of the primary resource,
or some other resource defined or described as a component of the resource defined or described by those
representations
NOTE 1 Adapted from IETF RFC 3986.
NOTE 2 In this International Standard, the term fragment is used only in the IETF RFC 3986 sense, when in a web
context a client application (3.3.5) retrieves the fragment from a containing resource.
3.1.14
terminal part
part (3.1.12) of a resource (3.1.1) that is not subdivided into smaller parts
3.1.15
internal part
part (3.1.12) of a resource (3.1.1) that is both embedded in the resource and subdivided into smaller parts
3.1.16
citation
information object containing information that directs a reader's or user's attention from one resource (3.1.1)
to another
3.1.17
reference
digital object that links to data stored elsewhere
NOTE Although citation (3.1.16) and reference are commonly used as near-synonyms, for purposes of this
International Standard, citations provide information for human readers and users, while references include the precise
location where the referenced resource (3.1.1) can be found. References can be machine-readable, and can be
configured as actionable given the required criteria.
3.1.18
annotation tier
separate information layer containing comments, notes, explanations, or other types of external remarks that
can be attached to a resource (3.1.1)
NOTE For instance, maps or images can be annotated with supplemental information, or text corpora can be
annotated in either in-line or standoff mode.
3.1.19
standoff annotation
annotations held outside the document that is being annotated
3.2 Identifiers
3.2.1
identifier
digital identifier
sequence of characters associated with digital, non-digital, or abstract entities, such as books, images,
reports, metadata records or events
3.2.2
URI
Uniform Resource Identifier
string of characters used to identify or name a resource (3.1.1) with a syntax as defined in IETF RFC 3986
3.2.3
URI naming scheme
top level of the URI naming structure
NOTE 1 Every scheme specifies its own syntax conventions for URIs (3.2.2).
NOTE 2 Typical URI schemes include http, https, ftp, mailto, etc. and are registered with IANA.
3.2.4
PID
persistent identifier
unique identifier (3.2.1) that ensures permanent access for a digital object by providing access to it
independently of its physical location or current ownership
NOTE Unique in this context means that the PID will not be issued again for other resources. However, the same PID
can reference different representations or incarnations (3.1.8) of the resource at the discretion of the resource provider.
3.2.5
PID framework
scheme for specifying identifier strings [PID (3.2.4) scheme] for web-accessible digital objects together with a
mechanism that enables the resolution of these identifiers into the object's current URI (3.2.2)
NOTE 1 A PID framework in the sense of this International Standard facilitates access to both individual objects and to
parts (3.1.12) and fragments (3.1.13) contained in such objects. A PID framework can be solely dependent on existing
web resolution protocols or it can entail the interaction of proxy-based resolvers.
NOTE 2 A PID framework in the sense of this International Standard also allows resolution of other information
associated with the PID.
3.2.6
actionable identifier
URI (3.2.2) that has a resource-associated identifier (3.2.1) that is suitably encoded, such that when the URI
is embedded in a web document and “clicked” on, the browser will be redirected to the resource (3.1.1), and
possibly supplementary services related to the resource
NOTE 1 This functionality implies that the URI points to a suitable resolver proxy (3.3.7).
NOTE 2 In some PID frameworks (3.2.5), the PIDs (3.2.4) are URIs and are automatically actionable.
3.2.7
resource part identifier
part identifier
string of characters that refers to a resource part (3.1.12) that can be identified by some means within a given
resource type (time in media, area in an image, record in a data stream, etc.)
NOTE Part identifiers in the sense of this International Standard are intended for server-side resolution in contrast to
client-side resolution, which is characteristic of fragment identifiers (3.2.8).
3.2.8
fragment identifier
identifier (3.2.1) used to reference a part (3.1.12) of a resource (3.1.1) in a web context
NOTE 1 Adapted from IETF RFC 3986.
NOTE 2 A fragment identifier component as defined in IETF RFC 3986 is indicated by the presence of a number sign
(“#”) character and terminated by the end of the URI (3.2.2). Fragments (3.1.13) in the sense of this RFC are resolved
and retrieved from the resource by the local client application (3.3.5).
NOTE 3 There is a W3C draft proposal to change this handling of fragments [27].
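EXAMPLE As an informal illustration of NOTE 2, and not as part of the requirements of this International Standard, the following Python sketch separates the fragment identifier from the rest of a URI at the number sign, using the standard library function urllib.parse.urldefrag. The function name split_fragment and the example URI are assumptions made for illustration only.

```python
from urllib.parse import urldefrag


def split_fragment(uri: str) -> tuple[str, str]:
    """Return (URI without fragment, fragment identifier).

    The fragment is everything after the first '#' up to the end of
    the URI; it is an empty string when no '#' is present.
    """
    base, fragment = urldefrag(uri)
    return base, fragment


# Hypothetical URI; the fragment is only there to show the split.
base, fragment = split_fragment("http://example.org/corpus/text1.txt#line=10,20")
```

A client application would request only the base URI from the resource server and then locate the fragment locally, which is the client-side behaviour that distinguishes fragment identifiers from part identifiers (3.2.7).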
3.3 Roles, institutions and services
3.3.1
archiving institution
institution responsible for maintaining a digital archive (3.1.7)
3.3.2
resource provider
organization that makes a resource (3.1.1) available online
NOTE A resource can also be a service.
3.3.3
resolver
PID resolver
software application that translates an identifier (3.2.1) into another more suitable identifier, specifically that
translates a resource PID (3.2.4) into its URI (3.2.2) and in this way points a client application to the location of
the resource (3.1.1)
3.3.4
resolution system
system designed to support the submission of a persistent identifier (3.2.4) to a network service in order to
receive in return one or more pieces of current information related to the identified object, e.g. a location (URI)
(3.2.2) of the object or metadata
NOTE The complete resolution system can be viewed as “the PID resolver” (3.3.3) but is often implemented as
different resolvers or resolver services.
3.3.5
client application
software application that accesses a remote service usually on another computer system
3.3.6
resource server
computer that ultimately provides access to the object referenced by a specific client application request
3.3.7
resolver proxy
HTTP resolver proxy
application that implements a service supporting the use of urlified (3.4.3) PIDs (3.2.4) to access resources or
other PID-related information, or both
3.3.8
web client
client application capable of accessing resources on the web using the HTTP protocol
3.4 Actions
3.4.1
dereference
to access the value referred to by a reference (3.1.17)
NOTE When used within the context of dereferencing a URI (3.2.2), it means obtaining a representation of the
resource to which the URI points.
3.4.2
resolve
to translate an identifier (3.2.1) into another name or address suitable for accessing a resource
NOTE The resolution process may require multiple steps in order to obtain a suitable address for a resource.
3.4.3
urlify an identifier
to encode an identifier (3.2.1) as a suitable URI (3.2.2)
NOTE For example, this might be done with the purpose of creating an actionable identifier (3.2.6).
4 Background
PIDs can exist in all kinds of electronic resources and this International Standard does not make explicit
statements about them, but the type of resource targeted by a PID has consequences for the requirements
imposed on the individual PID. Resources can be characterized into three major types:
⎯ independent resources as shown in Figure 1;
⎯ any part of such an individual resource that requires further specification;
⎯ a collection of resources that is referred to as a whole.
[Figure: a source electronic resource points to a target electronic resource by means of a unique and persistent identifier, which has associated metadata]
Figure 1 — Using unique PIDs to point from a source resource to a target resource
This International Standard concerns how to uniquely reference an electronic resource in a machine-readable
way. In Figure 1, a unique and persistent identifier (PID) included in a source resource points to a target
resource. The PID can be associated with metadata of different sorts.
The nature of a resource in this context is very broad and the means of referring to it is subject to context. An
image, for instance, either can be an independent resource associated with its own unique PID and can be
referenced as such, or can be embedded in a document where it lacks an identity of its own, in which case it is
a part of that document. In addition, a reference can point to a part of this image. An individual resource can
stand alone in one environment and be treated as part of a complex resource in another environment. An
internal part of a resource may be viewed as a terminal part, but further processing in a dynamic environment
may result in an entity that itself comes to contain accessible sub-parts. This International Standard is
designed to support all these cases.
In the case of complex language resources, some resources should be assigned their own individual
persistent identifiers. Other resources act as containing resources that have many constituent parts, in which
case the containing resource should be assigned a PID, while its parts can be referenced by appending part
identifiers to this PID. This International Standard provides guidelines for determining the appropriate
approach to take with respect to any given resource.
This International Standard utilizes existing standards and practices for resource part and fragment identifier
formats, where available, and provides guidelines for situations where current standards are inadequate or do
not apply. A further discussion of resource types targeted by this International Standard may be found in
Annex A.
With respect to collections of language resources, the standard takes two types of collections into account:
⎯ Collections of resources that are maintained as complex resources in a more or less published static form
so that the definition of the collection as such is maintained as an independent entity by an archive or
repository, which then also provides a persistent identifier for such a collection. The archiving institution is
responsible for maintaining the connection between the PID and the collection represented as a metadata
entry in a catalogue, for example.
⎯ A different type of collection that was not preconceived as a collection by its creators or the archiving
institution(s) but achieves its status as a complex resource based on some research or other work that
needs to be verifiable, such as the preparation of a monograph or the conduct of a scholarly or scientific
project. Such collections, although purposefully constructed by the creator, may not have any significance
outside the context of the original work for which they were created. Referring from the research
documents to the collection may become tedious when the collection contains hundreds of individual
resources. As a consequence, there is a need to refer to these types of collections with a PID that is
associated with all its constituent resources and appropriate metadata. Of course this kind of reference is
only possible if there is an incarnation of the collection.
5 Requirements for PID frameworks and PID use
5.1 General
Current standards and practices for using references and citations, especially in the domain of language
resources, can be found in Annex A. This section focuses initially on requirements for the PID framework itself
and thereafter on requirements for using PIDs as references and citations of language resources.
5.2 PID framework requirements
5.2.1 General
A PID framework in the sense of this International Standard shall support the following:
a) resolution of a single PID to multiple URIs or services;
b) association and access to related metadata;
c) adequate security to prevent malicious or accidental modification of PID/URI mappings and PID/metadata
associations;
d) addressing of parts of a resource (part or fragment identifiers, or both);
e) encoding of the PID as a URI to render identifiers actionable in web documents without requiring client
modifications.
5.2.2 Accommodating duplicate resources
It is common to provide duplicate or mirror resources or copies residing on different resource servers for data
preservation purposes and to provide high-speed access. The PID framework should support this kind of
duplication by allowing multiple URIs to be associated with a single PID.
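EXAMPLE One way to picture the association of several URIs with a single PID is sketched below in Python. This is a minimal illustration under stated assumptions, not a description of any existing PID framework: the class name PidRecord, its methods, the availability callback and the example URIs are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PidRecord:
    """A single PID together with the URIs of all copies of the resource."""
    pid: str
    uris: list[str] = field(default_factory=list)

    def add_location(self, uri: str) -> None:
        # Register a duplicate or mirror copy; ignore exact repeats.
        if uri not in self.uris:
            self.uris.append(uri)

    def resolve(self, is_available: Callable[[str], bool] = lambda uri: True) -> str:
        # Return the first registered location that is reported available.
        for uri in self.uris:
            if is_available(uri):
                return uri
        raise LookupError(f"no available location for PID {self.pid}")


record = PidRecord("1839/A")
record.add_location("http://oserver/objectA")           # primary copy (hypothetical)
record.add_location("http://mirror.example/objectA")    # mirror copy (hypothetical)
```

The availability check is passed in as a function so that the sketch stays independent of any particular network library; a real resolver could equally return all URIs and leave the choice to the client.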
5.2.3 Accessing resource metadata
Next to providing a reliable URI to the resource, PID frameworks are also used to associate metadata with the
resource in a secure and reliable way. Although this International Standard does not require metadata of any
particular type to be available, it does require the possibility of resolving the PID to an associated metadata
record encoded in XML format, so other services may be built on this feature.
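EXAMPLE The following Python sketch illustrates resolving a PID to an associated metadata record serialized as XML, using the standard library module xml.etree.ElementTree. This International Standard does not prescribe a metadata schema, so the element names, the in-memory table METADATA and the sample values are assumptions chosen only to make the sketch runnable.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata store: PID -> field name -> value.
METADATA = {
    "1839/A": {
        "title": "Example corpus",
        "citation": "Doe, J. (2011). Example corpus.",
    },
}


def resolve_metadata(pid: str) -> str:
    """Return the metadata record for a PID as an XML string."""
    fields = METADATA[pid]                      # KeyError if the PID is unknown
    root = ET.Element("metadata", {"pid": pid})
    for name, value in fields.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


xml_record = resolve_metadata("1839/A")
```

Because the record is plain XML, other services can parse it without knowledge of the PID framework that delivered it.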
5.2.4 Secure and reliable administration
The PID framework shall provide adequate security so that only the owner or caretaker of the resource can
change the PID/URI mapping or the associated metadata.
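EXAMPLE The rule that only the owner or caretaker can change a mapping can be sketched as follows in Python. The class PidRegistry and its methods are hypothetical, and the sketch covers only the authorization check; how a requester is authenticated is outside its scope and would be handled by the actual PID framework.

```python
class PidRegistry:
    """Toy registry that enforces owner-only changes to PID/URI mappings."""

    def __init__(self) -> None:
        self._records: dict[str, dict[str, str]] = {}

    def register(self, pid: str, owner: str, uri: str) -> None:
        # A PID is never issued a second time for another resource.
        if pid in self._records:
            raise ValueError(f"PID {pid} has already been issued")
        self._records[pid] = {"owner": owner, "uri": uri}

    def update_uri(self, pid: str, requester: str, new_uri: str) -> None:
        record = self._records[pid]
        if requester != record["owner"]:
            raise PermissionError(f"{requester} may not modify PID {pid}")
        record["uri"] = new_uri

    def resolve(self, pid: str) -> str:
        return self._records[pid]["uri"]


registry = PidRegistry()
registry.register("1839/A", owner="archive-admin", uri="http://oserver/objectA")
```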
5.2.5 Resource part identifiers
It is impossible to provide a PID for every identifiable part of a resource, or even to identify globally all possible
options for segmenting resources into parts. Consequently, PID frameworks shall provide a system for
assigning part or fragment identifiers in combination with the resource PID. Since the objective is to use a
single string for identifying the resource part, PID syntax should support the concatenation of the PID and the
part identifier. For example, the PID resolving system should be configurable such that a PID part identifier
combination will resolve into a URI that can be correctly interpreted by the resource server in order to deliver
the requested resource part [see A.2 a)].
A complex resource “A” with constituents x, y and z is identified by PID “1839/A”.
The PID resolver translates the identifier “1839/A” into the URI http://oserver/objectA, which can be
understood by the resource server to deliver object “A”. The part “z” of resource “A” is identified by
PID “1839/A#z”. To enable easy dereferencing by the resource server, the PID resolver should be able to
translate the identifier “1839/A#z” into the URI http://oserver/objectA?part=z or some similar query string that
can be understood by the resource server for object A to deliver part “z”.
Figure 2 — Processing part identifiers by the PID resolver
(using the handle server as an implementation)
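The translation described for Figure 2 can be sketched as follows. This is only an illustration under stated assumptions: the lookup table `PID_TABLE`, the function name `resolve` and the query parameter name `part` follow the example in the text, but a real resolver would be configured per resource server.

```python
from urllib.parse import quote

# Hypothetical resolver table: PID -> base URI on the resource server.
PID_TABLE = {"1839/A": "http://oserver/objectA"}


def resolve(pid_with_part: str, table=PID_TABLE) -> str:
    """Translate 'PID' or 'PID#part' into a URI the resource server can interpret."""
    pid, sep, part = pid_with_part.partition("#")
    base = table[pid]  # raises KeyError for an unknown PID
    if not sep or not part:
        return base  # no part identifier: deliver the whole resource
    # Pass the part identifier as a query so that it reaches the server.
    return f"{base}?part={quote(part, safe='')}"
```

Here `resolve("1839/A")` yields the URI of the whole object, while `resolve("1839/A#z")` yields a query URI that asks the resource server for part "z".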
5.2.6 Urlified PIDs
The PID framework should provide a proxy resolver implementation that is able to resolve urlified PIDs. This
allows web clients to resolve such an identifier using the HTTP scheme without needing special browser plug-
ins or other special software. In some PID frameworks, the PID is already an HTTP URI, in which case a
separate proxy resolver is superfluous.
5.3 PID usage
Citations of web-accessible digital language resources should be accompanied by a PID that resolves either
to a URI for the resource itself or to a metadata record describing the resource. The latter option can be used
if the resource itself cannot be made (immediately) available or in case of collections, in which case the
metadata record should contain the identifiers for its constituents if available.
If the PID resolves to the URI of the resource, a metadata record in XML format and compliant with a declared
schema pertaining to the resource should be associated separately with the identifier and made available via
the resolving system. Citation information should be included in the metadata record. No other requirements
are specified for the metadata record associated with the resource identifier.
In web documents the identifier (for instance, a handle) embedded in citations should in addition to a PID
conforming to the PID scheme syntax also be presented in a urlified form that is encoded as a URI so that it
becomes actionable in a web browser or other document viewer application. For example, using the Handle
System (HS) as an implementation:
1839/00-0000-0000-0000-4 -> http://hdl.handle.net/1839/00-0000-0000-0000-0000-4
The resolver address http://hdl.handle.net identifies an HTTP resolver proxy, which is an application that can
receive HTTP requests in the form of urlified PIDs and uses the “real” PID resolver to redirect the client to the
resource. With the Handle System as an example, the HTTP resolver proxy is either the central resolver proxy
of the Handle System or a Handle System resolver proxy guaranteed to be available by the resource provider.
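Urlifying a PID amounts to prefixing it with the address of an HTTP resolver proxy. A minimal sketch, assuming the Handle System proxy address used in the example above; the function name `urlify` is hypothetical.

```python
from urllib.parse import quote


def urlify(pid: str, proxy: str = "http://hdl.handle.net") -> str:
    """Return an actionable HTTP URI for a PID by prefixing a resolver proxy address."""
    # Keep '/' unescaped because it separates the handle prefix from the suffix.
    return f"{proxy.rstrip('/')}/{quote(pid, safe='/')}"
```

A citation can then present both forms: the bare PID for readers and software that know the PID scheme, and the result of `urlify(pid)` as a link that works in any web browser.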
Resource part specifications, which are analogous to page or chapter specifications, should be appended
where relevant to the PID using an appropriate delimiter character, for instance using a scheduled extension
of the HS:
1839/00-0000-0000-0000-4@time(100s,200s)
This extension refers to a segment of an audio file (the part identifier is non-standard).
If a compatible fragment identifier exists for the resource type, this element can be added to the encoded URI
in order to create a composite actionable identifier. An HS example is:
http://handle.net/1839/00-0000-0000-0000-0000-4?urlappend=#ffp(track_ID=101)*mp(/~time(’npt’,’50’))
This example uses the special function “urlappend” of the handle proxy resolver to append a fragment
identifier to the resolved handle.
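A composite actionable identifier of this kind can be assembled as sketched below. The function name `with_urlappend` is hypothetical, and percent-encoding the appended value is an assumption of this sketch: a literal "#" in a URI is normally treated by the web client as the start of a fragment and is not sent to the proxy, so the sketch escapes it.

```python
from urllib.parse import quote


def with_urlappend(urlified_pid: str, fragment: str) -> str:
    """Ask the proxy, via its 'urlappend' parameter, to append '#fragment' to the resolved URL."""
    # Escape the whole appended value so that '#' travels inside the query string.
    return f"{urlified_pid}?urlappend={quote('#' + fragment, safe='')}"
```

For instance, `with_urlappend("http://hdl.handle.net/1839/00-0000-0000-0000-0000-4", "line=10,20")` produces a URI whose query carries the escaped fragment `%23line%3D10%2C20`.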
5.4 Citation information and persistent identifiers
Existing practices can be maintained provided that URIs are replaced by PIDs and document part
specifications are converted into part identifiers or fragment identifiers. References from web documents shall
also provide a urlified PID if PID syntax does not comply with an IANA-registered URI scheme [13].
5.5 Referencing resource parts
5.5.1 General
Any applicable existing ISO or IETF standard for identifying a part of a resource can be used as a part
identifier, unless otherwise specified. For resources where no such standard exists, it is permissible to use
human-readable text in the citation, for instance: 10 s to 120 s for a time segment. This approach will not,
however, work for software clients.
When using part or fragment identifiers for retrieving or dereferencing part of a resource, it is important to
clarify the difference between using fragment identifiers such as those defined in IETF RFC 3986 and using
the functionality of a suitable resource server. If the PID resolving process delivers a URI including a fragment
identifier, for instance, using the following URI (with fragment specification according to the IETF RFC 5147
proposed standard for plain text media):
http://myserver/myresource#line=10,20
this URI will cause a web browser to fetch the whole document, and the browser itself will then isolate the
document part from lines 10 to 20 and present it to the user. When the document is small, this is acceptable
behaviour. However, when there is a need to present a fragment from a large 2 GB media file, it is necessary
to use a special resource server and a URI with a part specification, such as one from Annodex [22]:
http://videoserver.com/videoA.anx?t=15.0/30.0
This notation will cause the resource server to transfer only the video segment from 15,0 s to 30,0 s.
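On the server side, such a part specification has to be read from the query component of the URI. The following sketch shows one way a resource server might do this; the function name `time_interval` is hypothetical, and only the plain "start/end in seconds" form shown above is handled, not the full Annodex syntax.

```python
from urllib.parse import parse_qs, urlsplit


def time_interval(uri: str):
    """Return (start, end) in seconds from a 't=start/end' query, or None if absent."""
    values = parse_qs(urlsplit(uri).query).get("t")
    if not values:
        return None
    start, _, end = values[0].partition("/")
    return float(start), float(end)
```

Because the interval is part of the query, it is transmitted with the request, which is what allows the server to send only the requested segment.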
Updates to this International Standard are expected to enlarge the list of applicable resource part/fragment
identifier formats.
5.5.2 Media (time series)
ISO/IEC 21000-17 for MPEG-21 resources shall apply for media and time series. Annodex syntax [22] can be
used for specifying time intervals in URI queries and fragments for all applicable media. For other formats,
the part identifier will depend on the format used.
5.5.3 Textual resources
For XML-encoded textual resources, XPointer shall apply. IETF RFC 5147 shall apply for plain text
documents. For other formats, the part identifier will depend on the format used.
5.5.4 Metadata registries, terminologies, ontologies
In accordance with ISO 12620:2009, each data category specification in the ISO/TC 37 Data Category
Registry (DCR) shall have its own PID, which is formed as a concatenation of the PID for the DCR as a whole.
These PIDs are configured as “cool” URIs [30] as per current practice of the World Wide Web Consortium.
Specific parts of an RDF graph can be addressed using one of the proposed RDF query languages, but no
definitive specification is currently available. For other formats, the part identifier will depend on the format
used.
5.6 Collections
Existing “published” collections shall be assigned an associated PID that is maintained by the collection
sponsor. The PID shall refer to a description representing the collection, which can be a catalogue entry or an
individual metadata description. For virtual collections, the PID should refer to a machine-readable metadata
description which provides access to the relevant information, in particular to the resources that are included,
referenced by their own individual PIDs.
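A machine-readable description of a virtual collection could, for example, be an XML record that lists the PIDs of the member resources. The sketch below is purely illustrative: the element and attribute names (`collection`, `title`, `members`, `resource`, `pid`) are hypothetical and do not follow any schema named in this International Standard.

```python
import xml.etree.ElementTree as ET


def collection_record(collection_pid: str, title: str, member_pids) -> str:
    """Build a small XML description of a collection that lists its members by PID."""
    root = ET.Element("collection", {"pid": collection_pid})
    ET.SubElement(root, "title").text = title
    members = ET.SubElement(root, "members")
    for pid in member_pids:
        # Each member is referenced by its own individual PID.
        ET.SubElement(members, "resource", {"pid": pid})
    return ET.tostring(root, encoding="unicode")
```

The PID assigned to the collection would then resolve to such a record, and a client could follow the listed member PIDs to reach the individual resources.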
6 Complementary requirements
6.1 Granularity of identifiers
With respect to granularity, this International Standard distinguishes between the identification of parts and of
fragments, as indicated in 5.5. A fragment identifier is defined by IETF RFC 3986 as an optional component of
a URI reference. In conformance with IETF RFC 3986, a URI can be assigned an optional fragment identifier,
whereby the identifier is separated from the rest of the URI reference by a # (number sign) character. The
separator is not considered part of the fragment identifier.
The URI with the fragment identifier can be used by an application to identify and usually to access a specific
resource that is part of or embedded in a primary resource. The format of the fragment identifier depends on
the resource type.
Interpreting and dereferencing the fragment identifier is a web client function, and therefore requires the
complete primary resource to be downloaded, after which a client application can extract the required
fragment. Furthermore, since the fragment identifier is not passed to other systems during the process of
retrieval, some intermediaries in the web architecture (such as proxies) have no interaction with fragment
identifiers, and HTTP redirection does not account for fragments. As an exception, fragment identifiers for
RDF documents do not refer to parts of the document, but rather to the object in that document being
described as having that fragment identifier. As a consequence, the use of fragment identifiers in combination
with a URI allows the client application to isolate part of a resource based on knowledge specific to the client
application.
In contrast to fragment resolution procedures, using a part identifier such as “z” in
http://myserver/myObjectService?part=z
relies on the remote server to isolate the part and send it to the client application 1).
1) This strict separation of roles between a web client and server when dereferencing fragments may change. The media
fragments working group has produced a W3C draft where it is proposed that web clients negotiate the transportation of
only part of the resource from the server [27].
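The difference between the two mechanisms can be made concrete with the standard URI-splitting functions of Python. In this sketch the function name `split_for_request` is hypothetical; it separates the portion of a URI that is sent to the server from the fragment that stays with the client.

```python
from urllib.parse import urldefrag


def split_for_request(uri: str):
    """Return (URI sent to the server, fragment kept by the client)."""
    request_uri, fragment = urldefrag(uri)
    return request_uri, fragment


# Fragment identifier: the server never sees 'line=10,20'.
fragment_case = split_for_request("http://myserver/myresource#line=10,20")
# Part identifier in the query: the server receives 'part=z'.
part_case = split_for_request("http://myserver/myObjectService?part=z")
```

In the first case the client has to download the complete resource and isolate the fragment itself; in the second case the server receives the part identifier and can deliver only the requested part.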
6.2 Recommendations
This International Standard supports different levels of granularity. The following recommendations are
designed to encourage efficiency and promote interoperability with other naming schemes.
⎯ If there is an existing identifier scheme for a type of resources, for instance, ISBN, this level of granularity
should be retained, which is to say that no new PIDs should be issued without very good reasons, such
as for chapters. Chapters would preferably be addressed using part identifiers in conjunction with the PID
of the book.
⎯ If the resource is associated with the complete content of a digital file, an individual PID should probably
be assigned for this resource.
⎯ If the resource is autonomous and exists outside a larger context, an individual PID should probably be
assigned for this resource.
⎯ If a resource should be citable apart from any containing resource, an individual PID should probably be
assigned for this resource.
These recommendations are, however, subject to the needs of resource creators with respect to the level of
...
SLOVENIAN STANDARD SIST ISO 24619:2014
1 September 2014
Language resource management - Persistent identification and sustainable access (PISA)
Gestion des ressources langagières - Identification et accès pérennes
Upravljanje z jezikovnimi viri - Stalna identifikacija in trajen dostop (PISA)
This Slovenian standard is identical to: ISO 24619:2011
ICS: 01.140.20 Information sciences
en, fr, de
2003-01. Slovenian Institute for Standardization. Reproduction of this standard in whole or in part is not permitted.
[Figure 1 labels: resource, electronic resource, unique and persistent identifier, associated metadata]
Figure 1 — Using unique PIDs to point from a source resource to a target resource
This International Standard concerns how to uniquely reference an electronic resource in a machine-readable
way. In Figure 1, a unique and persistent identifier (PID) included in a source resource points to a target
resource. The PID can be associated with metadata of different sorts.
The nature of a resource in this context is very broad and the means of referring to it is subject to context. An
image, for instance, either can be an independent resource associated with its own unique PID and can be
referenced as such, or can be embedded in a document where it lacks an identity of its own, in which case it
is a part of that document. In addition, a reference can point to a part of this image. An individual resource
can stand alone in one environment and be treated as part of a complex resource in another environment. An
internal part of a resource may be viewed as a terminal part, but further processing in a dynamic environment
may result in an entity that itself comes to contain accessible sub-parts. This International Standard is
designed to support all these cases.
In the case of complex language resources, some resources should be assigned their own individual
persistent identifiers. Other resources act as containing resources that have many constituent parts, in which
case the containing resource should be assigned a PID, while its parts can be referenced by appending part
identifiers to this PID. This International Standard provides guidelines for determining the appropriate
approach to take with respect to any given resource.
This International Standard utilizes existing standards and practices for resource part and fragment identifier
formats, where available, and provides guidelines for situations where current standards are inadequate or do
not apply. A further discussion of resource types targeted by this International Standard may be found in
Annex A.
With respect to collections of language resources, the standard takes two types of collections into account:
⎯ Collections of resources that are maintained as complex resources in a more or less published static form
so that the definition of the collection as such is maintained as an independent entity by an archive or
repository, which then also provides a persistent identifier for such a collection. The archiving institution is
responsible for maintaining the connection between the PID and the collection represented as a metadata
entry in a catalogue, for example.
⎯ A different type of collection that was not preconceived as a collection by its creators or the archiving
institution(s) but achieves its status as a complex resource based on some research or other work that
needs to be verifiable, such as the preparation of a monograph or the conduct of a scholarly or scientific
project. Such collections, although purposefully constructed by the creator, may not have any significance
outside the context of the original work for which they were created.
Referring from the research documents to the collection may become tedious when the collection contains
hundreds of individual resources. As a consequence, there is a need to refer to these types of collections with
a PID that is associated with all its constituent resources and appropriate metadata. Of course this kind of
reference is only possible if there is an incarnation of the collection.
Independent resources, aggregated resources, and parts of resources
A.1 Overview
A.1.1 General
There is increasing demand in science and industry for options to reference digital language resources,
resource parts or collections of language resources in an unambiguous persistent way. Not only is it desirable
to be able to retrieve and validate references from scientific papers, but it is also increasingly necessary to
maintain various types of references between language resources or parts of such resources.
A.1.2 Resources
A resource is anything that has an identity [7]. Despite the indefinite nature of this characterization, it is
important for researchers to be able to identify linguistically meaningful units as coherent objects in a
repository, both for human and, increasingly often, for machine readability. Such objects will be subject to
separate manipulations and be used in different scenarios, which means that they can have an autonomous
existence in larger contexts. Frequently such an object can be identified as a “single file” in a file system, but it
can also be retrievable as a contained object (for instance, a data record) from a database system. Resources
like this can have very different types or formats, such as
⎯ a digitized video recording of an interview,
⎯ a sound recording of a song,
⎯ a complex annotation of a communication act,
⎯ a photo documenting a speech event,
⎯ a lexicon for a certain language,
⎯ a grammar description,
⎯ an eye tracking recording during a reading study,
⎯ a metadata description of a resource or a resource collection,
⎯ an integrated document containing texts and photos, etc.
The scope of such a resource is left to the discretion of its creator. Some resources are composed of a group
of annotation tiers, while other originators may create a separate unit
...
INTERNATIONAL STANDARD
ISO 24619
First edition
2011-05-15
Language resource management —
Persistent identification and sustainable
access (PISA)
Gestion des ressources langagières — Identification et accès pérennes
Reference number ISO 24619:2011(E)
© ISO 2011
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 • CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
3.1 Resources
3.2 Identifiers
3.3 Roles, institutions and services
3.4 Actions
4 Background
5 Requirements for PID frameworks and PID use
5.1 General
5.2 PID framework requirements
5.3 PID usage
5.4 Citation information and persistent identifiers
5.5 Referencing resource parts
5.6 Collections
6 Complementary requirements
6.1 Granularity of identifiers
6.2 Recommendations
Annex A (informative) Independent resources, aggregated resources, and parts of resources
Annex B (informative) Persistent identifier system implementations
Annex C (informative) Abbreviated terms
Bibliography
Alphabetical Index
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies
(ISO member bodies). The work of preparing International Standards is normally carried out through ISO
technical committees. Each member body interested in a subject for which a technical committee has been
established has the right to be represented on that committee. International organizations, governmental and
non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the
International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of technical committees is to prepare International Standards. Draft International Standards
adopted by the technical committees are circulated to the member bodies for voting. Publication as an
International Standard requires approval by at least 75 % of the member bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO shall not be held responsible for identifying any or all such patent rights.
ISO 24619 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content
resources, Subcommittee SC 4, Language resource management.
Introduction
References and citations are an important part of documents and papers. Traditionally authors use them to
provide proper acknowledgment to the author(s) of other papers as a source for their work or use them to
support their argumentation. Citations usually contain information that enables a reader to establish the
possible relevance of the cited paper and to identify it unambiguously. Any librarian or knowledgeable person
is able to retrieve the document using well-established procedures based on the information in the citation.
The availability of directly accessible documents on the web has inspired the practice of adding a web location
(URI [4]) to the citation information. This practice has made it possible to access referenced documents
directly in web browsers as well as in other document viewers. This practice is already recommended in
standards like ISO 690, although the emphasis there is more on identifying published resources and parts
than on providing sustainable access to them. Increasingly often, such references need to be exploited by
machines and software applications as well as by people, requiring reliable availability of the referenced
resources. Problems with access that occur when resources are relocated have led to the use of persistent
identifier (PID) frameworks [23], [24]. Current approaches [18], [19], [24] address the resource relocation problem
by introducing resolver services that translate a resource identifier to its actual current location. These resolver
services have an added advantage of permitting the association of additional metadata with the identifier.
Elaborate frameworks such as the Digital Object Identifier (DOI) [14] use this feature to manage extra
services, for instance copyright information.
The practice of using persistent identifiers to cite and reference scientific data, along with individual resources
as well as data sets, is less well developed. It is no less powerful, however, in that it allows readers of a paper,
or users of a knowledge resource, direct access to the primary scientific data to which the resource refers.
When using references to access scientific data, including language resources, it becomes important to be
able also to refer to and access parts of resources. This is especially true in the domain of language
resources, where several layers of granularity are usually superimposed on the same data set or resource
collection. Therefore, discussions in this International Standard concerning the use and requirements for PID
frameworks extensively explore how these frameworks can deal efficiently with identifying and accessing parts
of resources. Special recommendations indicate how to approach the granularity issue when issuing PIDs for
resources and resource collections.
The need to apply PID frameworks for identifying resources contained in scientific data sets has also
increased since modern archives and repositories have begun to weave a network of related complex
resources that may be distributed over several locations. In these cases, permanent linkage is a prerequisite.
In a multimedia lexicon for instance, a lexical item can refer to images not necessarily physically in the lexicon,
or that are even referenced at a different site under control of a different organization. However, the link
between the lexicon item and the image must remain valid, even if some servers or files are subject to
relocation over time. Emerging e-Science scenarios, which make use of distributed services processing
distributed resources, are also completely dependent on having transparent access from any processing
service, irrespective of where it is located or what organization may operate it. This implies that resolving
resource references should not be hampered in any way by unnecessary dependencies involving reliance on
unsustainable or unpredictable services, whether they are technical or organizational.
The requirement that services like PID frameworks be accessible to the whole community of language
resource and technology providers is further complicated by the need to provide resolvable PIDs without
imposing commercial dependencies on resource providers other than the fundamental and well-established
requirements for maintaining resources on the Internet.
INTERNATIONAL STANDARD ISO 24619:2011(E)
Language resource management — Persistent identification
and sustainable access (PISA)
1 Scope
This International Standard specifies requirements for the persistent identifier (PID) framework and for using
PIDs as references and citations of language resources in documents as well as in language resources
themselves. In this context, examples of language resources include such works as digital dictionaries,
language-purposed terminological resources, machine-translation lexica, annotated multimedia/multimodal
corpora, text corpora that have been annotated with, for example, morpho-syntactic information, and the like.
Computational and applied linguists and information specialists create such resources.
This International Standard also addresses issues of persistence and granularity of references to resources,
first by requiring that persistent references be implemented by using a PID framework and further by imposing
requirements on any PID frameworks used for this purpose.
PID frameworks also allow the association of general metadata with the identifier, which can also contain
citation information. This International Standard specifies minimum requirements for effective use of PIDs in
language resources and cites the use of several possible existing standards and de-facto standards, such as:
ISO 690 [16], APA [3], MLA [9] for citation information, ISO/IEC 21000-17, IETF RFC 5147, Annotea [2],
temporal-fragment [22], XPointer for part identifier syntax and PURL [23], ARK [18], Handle System [24] and
DOI [14].
2 Normative references
The following referenced documents are indispensable for the application of this document. For dated
references, only the edition cited applies. For undated references, the latest edition of the referenced
document (including any amendments) applies.
ISO 12620:2009, Terminology and other language and content resources — Specification of data categories
and management of a Data Category Registry for language resources
ISO/IEC 21000-17:2006, Information technology — Multimedia framework (MPEG-21) — Part 17: Fragment
Identification of MPEG Resources
W3C 2003, XPointer Framework: [online] W3C Recommendation 25 March 2003 [viewed 2010-08-04].
Available from: http://www.w3.org/TR/xptr-framework/
WILDE, E. and DUERST, M. URI Fragment Identifiers for the text/plain Media Type, IETF RFC 5147, April 2008
[viewed 2010-12-22]. Available from: http://www.rfc-editor.org/rfc/rfc5147.txt
3 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
3.1 Resources
3.1.1
resource
digital object on the web with a specific identity that can be addressed with a URI (3.2.2)
NOTE 1 Adapted from IETF RFC 3986.
NOTE 2 In the context of this International Standard, a resource can also be a language resource that has an online
representation.
NOTE 3 A resource can have several representations. Depending on the PID framework (3.2.5), identification of a
specific representation can be encoded in the identifier (ARK, see B.3) or be left to the content negotiating process [8]
between the web client (3.3.8) that uses the resolved PID to fetch the resource (3.1.1) and the resource server (3.3.6).
3.1.2
language resource
digital resource that provides information about one or more languages
NOTE Language resources cover lexicographical, terminological, morpho-syntactical, corpus-related, or semantic
resources or digital resources used to study linguistic phenomena like texts and multimedia/multimodal recordings. They
are created and used by linguists, information specialists, lexicographers and terminologists, among others. They
frequently comprise many small records compiled within a larger work, and are often authoritative in nature, such as
standardized terminologies and glossaries issued by standards bodies such as ISO, IETF, W3C, etc.
3.1.3
complex resource
resource (3.1.1) consisting of multiple constituent parts, each of which can be accessed individually
NOTE A complex resource can be a federated resource if its constituent parts are distributed over different
repositories (3.1.6).
3.1.4
collection
grouping of any number of resources (3.1.1) that need to be referenced as a whole
3.1.5
published collection
purposefully built collection of resources that is maintained as an independent entity by an archive (3.1.7) or
repository (3.1.6) and for which adequate citation (3.1.16) information is available
3.1.6
digital repository
repository
facility that provides reliable access to managed digital resources (3.1.1)
3.1.7
archive
digital archive
repository (3.1.6) dedicated to the long-term preservation of its associated data
NOTE Often the data in digital archives are also available online, which highlights the need for reliable persistent
identifiers (3.2.4).
2 © ISO 2011 – All rights reserved
3.1.8
resource collection incarnation
incarnation
virtual embodiment of a disparate, otherwise non-aggregated collection (3.1.4) assembled for a specific
purpose that is referenced by a single PID (3.2.4) concatenated with a part identifier (3.2.7) in order to
access the components of the collection
NOTE A bibliography or index can use a single PID together with extensions to provide access to components in a
set of resources (3.1.1) used in the production of a monograph or project without actually collecting the physical files in
one location, which is to say that the individual items remain in their original locations, but are referenced as parts of a
virtual whole.
3.1.9
version
particular form or variation of a resource (3.1.1) that differs from other instantiations of the resource in at least
one aspect or item of information
NOTE Versions are often identified in sequential order (e.g. Version 1, 2, etc.), but version identification of dynamic
resources subject to frequent change is often achieved by assigning a date-time stamp.
3.1.10
snapshot
instantaneous copy of a resource (3.1.1) representing the status of the resource or collection at a single point
in time
3.1.11
abstract resource
non-network-retrievable resource identified by a URI (3.2.2), usually a concept such as a class or property
NOTE It is common practice, for example in RDFS (RDF Schema) or OWL (web ontology language) ontologies, to identify
abstract resources using URIs. Web architecture does not require any information resource to be retrievable with this kind
of URI. If an identifier for an abstract resource is not meant to be dereferenced (3.4.1), such as can be the case with an
XML namespace URI, it is not meaningful to issue a PID (3.2.4) for this resource.
3.1.12
resource part
part
identifiable, accessible entity embedded in an independent resource (3.1.1) or in a larger part thereof
NOTE Parts can be embedded in other parts. In dynamic web environments, subsetting into parts is subject to
change and interpretation, which requires a certain level of user decision-making to designate and identify such sub-
entities.
3.1.13
fragment
some portion or subset of a primary resource (3.1.1), some view on representations of the primary resource,
or some other resource defined or described as a component of the resource defined or described by those
representations
NOTE 1 Adapted from IETF RFC 3986.
NOTE 2 In this International Standard, the term fragment is used only in the IETF RFC 3986 sense, when in a web
context a client application (3.3.5) retrieves the fragment from a containing resource.
3.1.14
terminal part
part (3.1.12) of a resource (3.1.1) that is not subdivided into smaller parts
3.1.15
internal part
part (3.1.12) of a resource (3.1.1) that is both embedded in the resource and subdivided into smaller parts
3.1.16
citation
information object containing information that directs a reader's or user's attention from one resource (3.1.1)
to another
3.1.17
reference
digital object that links to data stored elsewhere
NOTE Although citation (3.1.16) and reference are commonly used as near-synonyms, for purposes of this
International Standard, citations provide information for human readers and users, while references include the precise
location where the referenced resource (3.1.1) can be found. References can be machine-readable, and can be
configured as actionable given the required criteria.
3.1.18
annotation tier
separate information layer containing comments, notes, explanations, or other types of external remarks that
can be attached to a resource (3.1.1)
NOTE For instance, maps or images can be annotated with supplemental information, or text corpora can be
annotated in either in-line or standoff mode.
3.1.19
standoff annotation
annotations held outside the document that is being annotated
3.2 Identifiers
3.2.1
identifier
digital identifier
sequence of characters associated with digital, non-digital, or abstract entities, such as books, images,
reports, metadata records or events
3.2.2
URI
Uniform Resource Identifier
string of characters used to identify or name a resource (3.1.1) with a syntax as defined in IETF RFC 3986
3.2.3
URI naming scheme
top level of the URI naming structure
NOTE 1 Every scheme specifies its own syntax conventions for URIs (3.2.2).
NOTE 2 Typical URI schemes include http, https, ftp, mailto, etc. and are registered with IANA.
3.2.4
PID
persistent identifier
unique identifier (3.2.1) that ensures permanent access for a digital object by providing access to it
independently of its physical location or current ownership
NOTE Unique in this context means that the PID will not be issued again for other resources. However, the same PID
can reference different representations or incarnations (3.1.8) of the resource at the discretion of the resource provider.
3.2.5
PID framework
scheme for specifying identifier strings [PID (3.2.4) scheme] for web-accessible digital objects together with a
mechanism that enables the resolution of these identifiers into the object's current URI (3.2.2)
NOTE 1 A PID framework in the sense of this International Standard facilitates access to both individual objects and to
parts (3.1.12) and fragments (3.1.13) contained in such objects. A PID framework can be solely dependent on existing
web resolution protocols or it can entail the interaction of proxy-based resolvers.
NOTE 2 A PID framework in the sense of this International Standard also allows resolution of other information
associated with the PID.
3.2.6
actionable identifier
URI (3.2.2) that has a resource-associated identifier (3.2.1) that is suitably encoded, such that when the URI
is embedded in a web document and “clicked” on, the browser will be redirected to the resource (3.1.1), and
possibly supplementary services related to the resource
NOTE 1 This functionality implies that the URI points to a suitable resolver proxy (3.3.7).
NOTE 2 In some PID frameworks (3.2.5), the PIDs (3.2.4) are URIs and are automatically actionable.
3.2.7
resource part identifier
part identifier
string of characters that refers to a resource part (3.1.12) that can be identified by some means within a given
resource type (time in media, area in an image, record in a data stream, etc.)
NOTE Part identifiers in the sense of this International Standard are intended for server-side resolution in contrast to
client-side resolution, which is characteristic of fragment identifiers (3.2.8).
3.2.8
fragment identifier
identifier (3.2.1) used to reference a part (3.1.12) of a resource (3.1.1) in a web context
NOTE 1 Adapted from IETF RFC 3986.
NOTE 2 A fragment identifier component as defined in IETF RFC 3986 is indicated by the presence of a number sign
(“#”) character and terminated by the end of the URI (3.2.2). Fragments (3.1.13) in the sense of this RFC are resolved
and retrieved from the resource by the local client application (3.3.5).
NOTE 3 There is a W3C draft proposal to change this handling of fragments [27].
3.3 Roles, institutions and services
3.3.1
archiving institution
institution responsible for maintaining a digital archive (3.1.7)
3.3.2
resource provider
organization that makes a resource (3.1.1) available online
NOTE A resource can also be a service.
3.3.3
resolver
PID resolver
software application that translates an identifier (3.2.1) into another more suitable identifier, specifically that
translates a resource PID (3.2.4) into its URI (3.2.2) and in this way points a client application to the location of
the resource (3.1.1)
3.3.4
resolution system
system designed to support the submission of a persistent identifier (3.2.4) to a network service in order to
receive in return one or more pieces of current information related to the identified object, e.g. a location (URI)
(3.2.2) of the object or metadata
NOTE The complete resolution system can be viewed as “the PID resolver” (3.3.3) but is often implemented as
different resolvers or resolver services.
3.3.5
client application
software application that accesses a remote service usually on another computer system
3.3.6
resource server
computer that ultimately provides access to the object referenced by a specific client application request
3.3.7
resolver proxy
HTTP resolver proxy
application that implements a service supporting the use of urlified (3.4.3) PIDs (3.2.4) to access resources or
other PID-related information, or both
3.3.8
web client
client application capable of accessing resources on the web using the HTTP protocol
3.4 Actions
3.4.1
dereference
to access the value referred to by a reference (3.1.17)
NOTE When used within the context of dereferencing a URI (3.2.2), it means obtaining a representation of the
resource to which the URI points.
3.4.2
resolve
to translate an identifier (3.2.1) into another name or address suitable for accessing a resource
NOTE The resolution process may require multiple steps in order to obtain a suitable address for a resource.
3.4.3
urlify an identifier
to encode an identifier (3.2.1) as a suitable URI (3.2.2)
NOTE For example, this might be done with the purpose of creating an actionable identifier (3.2.6).
4 Background
PIDs can exist in all kinds of electronic resources and this International Standard does not make explicit
statements about them, but the type of resource targeted by a PID has consequences for the requirements
imposed on the individual PID. Resources can be characterized into three major types:
⎯ independent resources as shown in Figure 1;
⎯ any part of such an individual resource that requires further specification;
⎯ a collection of resources that is referred to as a whole.
[Figure 1: a source electronic resource contains a unique and persistent identifier that points to a target electronic resource; the identifier has associated metadata.]
Figure 1 — Using unique PIDs to point from a source resource to a target resource
This International Standard concerns how to uniquely reference an electronic resource in a machine-readable
way. In Figure 1, a unique and persistent identifier (PID) included in a source resource points to a target
resource. The PID can be associated with metadata of different sorts.
The nature of a resource in this context is very broad and the means of referring to it is subject to context. An
image, for instance, either can be an independent resource associated with its own unique PID and can be
referenced as such, or can be embedded in a document where it lacks an identity of its own, in which case it is
a part of that document. In addition, a reference can point to a part of this image. An individual resource can
stand alone in one environment and be treated as part of a complex resource in another environment. An
internal part of a resource may be viewed as a terminal part, but further processing in a dynamic environment
may result in an entity that itself comes to contain accessible sub-parts. This International Standard is
designed to support all these cases.
In the case of complex language resources, some resources should be assigned their own individual
persistent identifiers. Other resources act as containing resources that have many constituent parts, in which
case the containing resource should be assigned a PID, while its parts can be referenced by appending part
identifiers to this PID. This International Standard provides guidelines for determining the appropriate
approach to take with respect to any given resource.
This International Standard utilizes existing standards and practices for resource part and fragment identifier
formats, where available, and provides guidelines for situations where current standards are inadequate or do
not apply. A further discussion of resource types targeted by this International Standard may be found in
Annex A.
With respect to collections of language resources, the standard takes two types of collections into account:
⎯ Collections of resources that are maintained as complex resources in a more or less published static form
so that the definition of the collection as such is maintained as an independent entity by an archive or
repository, which then also provides a persistent identifier for such a collection. The archiving institution is
responsible for maintaining the connection between the PID and the collection represented as a metadata
entry in a catalogue, for example.
⎯ A different type of collection that was not preconceived as a collection by its creators or the archiving
institution(s) but achieves its status as a complex resource based on some research or other work that
needs to be verifiable, such as the preparation of a monograph or the conduct of a scholarly or scientific
project. Such collections, although purposefully constructed by the creator, may not have any significance
outside the context of the original work for which they were created. Referring from the research
documents to the collection may become tedious when the collection contains hundreds of individual
resources. As a consequence, there is a need to refer to these types of collections with a PID that is
associated with all its constituent resources and appropriate metadata. Of course this kind of reference is
only possible if there is an incarnation of the collection.
5 Requirements for PID frameworks and PID use
5.1 General
Current standards and practices for using references and citations, especially in the domain of language
resources, can be found in Annex A. This section focuses initially on requirements for the PID framework itself
and thereafter on requirements for using PIDs as references and citations of language resources.
5.2 PID framework requirements
5.2.1 General
A PID framework in the sense of this International Standard shall support the following:
a) resolution of a single PID to multiple URIs or services;
b) association and access to related metadata;
c) adequate security to prevent malicious or accidental modification of PID/URI mappings and PID/metadata
associations;
d) addressing of parts of a resource (part or fragment identifiers, or both);
e) encoding of the PID as a URI to render identifiers actionable in web documents without requiring client
modifications.
5.2.2 Accommodating duplicate resources
It is common to provide duplicate or mirror resources or copies residing on different resource servers for data
preservation purposes and to provide high-speed access. The PID framework should support this kind of
duplication by allowing multiple URIs to be associated with a single PID.
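The one-to-many association described above can be pictured as a small lookup table. The following Python sketch is illustrative only: the mirror URI, the table name MIRRORS and the function names are assumptions made for this example and are not defined by this International Standard.

```python
# Hypothetical PID-to-URI table: one PID is associated with several copies.
MIRRORS = {
    "1839/A": [
        "http://oserver/objectA",             # URI used in Figure 2
        "http://mirror.example.org/objectA",  # assumed mirror location
    ],
}


def resolve_all(pid: str) -> list[str]:
    """Return every URI currently registered for the PID (empty if unknown)."""
    return list(MIRRORS.get(pid, []))


def resolve_first(pid: str) -> str:
    """Return the first registered URI; raise KeyError for an unknown PID."""
    uris = resolve_all(pid)
    if not uris:
        raise KeyError(f"unknown PID: {pid}")
    return uris[0]
```

A real framework would typically let the client or the resolver choose among the copies, for instance by availability; the sketch simply returns them in registration order.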
5.2.3 Accessing resource metadata
In addition to providing a reliable URI to the resource, PID frameworks are also used to associate metadata with the
resource in a secure and reliable way. Although this International Standard does not require metadata of any
particular type to be available, it does require the possibility of resolving the PID to an associated metadata
record encoded in XML format, so other services may be built on this feature.
5.2.4 Secure and reliable administration
The PID framework shall provide adequate security so that only the owner or caretaker of the resource can
change the PID/URI mapping or the associated metadata.
5.2.5 Resource part identifiers
It is impossible to provide a PID for every identifiable part of a resource, or even to identify globally all possible
options for segmenting resources into parts. Consequently, PID frameworks shall provide a system for
assigning part or fragment identifiers in combination with the resource PID. Since the objective is to use a
single string for identifying the resource part, PID syntax should support the concatenation of the PID and the
part identifier. For example, the PID resolving system should be configurable such that a PID part identifier
combination will resolve into a URI that can be correctly interpreted by the resource server in order to deliver
the requested resource part [see A.2 a)].
A complex resource “A” with constituents x, y and z is identified by PID “1839/A”.
The PID resolver translates the identifier “1839/A” into the URI http://oserver/objectA, which can be
understood by the resource server to deliver object “A”. The part “z” of resource “A” is identified by
PID “1839/A#z”. To enable easy dereferencing by the resource server, the PID resolver should be able to
translate the identifier “1839/A#z” into the URI http://oserver/objectA?part=z or some similar query string that
can be understood by the resource server for object A to deliver part “z”.
Figure 2 — Processing part identifiers by the PID resolver
(using the handle server as an implementation)
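The translation described in Figure 2 can be sketched in a few lines of Python. This is a minimal illustration, not the behaviour of any particular resolver: the table PID_TO_URI, the function name resolve and the choice of "part" as the query parameter name follow the example in the figure and are otherwise assumptions.

```python
from urllib.parse import quote

# Hypothetical resolver table: PID of the whole resource -> its current URI.
PID_TO_URI = {"1839/A": "http://oserver/objectA"}


def resolve(pid_with_part: str) -> str:
    """Resolve 'PID' or 'PID#part' into a URI the resource server can interpret."""
    pid, separator, part = pid_with_part.partition("#")
    base = PID_TO_URI[pid]  # a KeyError signals an unknown PID
    if not separator or not part:
        return base
    # Hand the part to the server as a query parameter, so that the server
    # (not the client) isolates the requested part.
    return f"{base}?part={quote(part, safe='')}"
```

With this table, resolve("1839/A") returns the URI of the whole object and resolve("1839/A#z") returns http://oserver/objectA?part=z, matching the two cases in the figure.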
5.2.6 Urlified PIDs
The PID framework should provide a proxy resolver implementation that is able to resolve urlified PIDs. This
allows web clients to resolve such an identifier using the HTTP scheme without needing special browser plug-
ins or other special software. In some PID frameworks, the PID is already an HTTP URI, in which case a
separate proxy resolver is superfluous.
5.3 PID usage
Citations of web-accessible digital language resources should be accompanied by a PID that resolves either
to a URI for the resource itself or to a metadata record describing the resource. The latter option can be used
if the resource itself cannot be made (immediately) available or in case of collections, in which case the
metadata record should contain the identifiers for its constituents if available.
If the PID resolves to the URI of the resource, a metadata record in XML format and compliant with a declared
schema pertaining to the resource should be associated separately with the identifier and made available via
the resolving system. Citation information should be included in the metadata record. No other requirements
are specified for the metadata record associated with the resource identifier.
In web documents, the identifier (for instance, a handle) embedded in citations should be presented not only as a PID
conforming to the PID scheme syntax but also in a urlified form, encoded as a URI, so that it
becomes actionable in a web browser or other document viewer application. For example, using the Handle
System (HS) as an implementation:
1839/00-0000-0000-0000-0000-4 -> http://hdl.handle.net/1839/00-0000-0000-0000-0000-4
The resolver address http://hdl.handle.net identifies an HTTP resolver proxy, which is an application that can
receive HTTP requests in the form of urlified PIDs and uses the “real” PID resolver to redirect the client to the
resource. With the Handle System as an example, the HTTP resolver proxy is either the central resolver proxy
of the Handle System or a Handle System resolver proxy guaranteed to be available by the resource provider.
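Urlifying a PID amounts to prefixing it with the address of a resolver proxy. The following Python sketch assumes the proxy address http://hdl.handle.net given above; the function name urlify and the decision to percent-encode characters other than "/" are assumptions of this example.

```python
from urllib.parse import quote

HANDLE_PROXY = "http://hdl.handle.net"  # resolver proxy named in the text


def urlify(pid: str, proxy: str = HANDLE_PROXY) -> str:
    """Encode a PID as an HTTP URI that a web client can follow directly."""
    # Keep "/" (it separates prefix and suffix in the example PID) and
    # percent-encode any other character that is not safe in a URI path.
    return f"{proxy.rstrip('/')}/{quote(pid, safe='/')}"
```

Applied to the PID 1839/00-0000-0000-0000-0000-4, the function yields http://hdl.handle.net/1839/00-0000-0000-0000-0000-4.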
Resource part specifications, which are analogous to page or chapter specifications, should be appended
where relevant to the PID using an appropriate delimiter character, for instance using a scheduled extension
of the HS:
1839/00-0000-0000-0000-0000-4@time(100s,200s)
This extension refers to a segment of an audio file (the part identifier is non-standard).
If a compatible fragment identifier exists for the resource type, this element can be added to the encoded URI
in order to create a composite actionable identifier. An HS example is:
http://hdl.handle.net/1839/00-0000-0000-0000-0000-4?urlappend=#ffp(track_ID=101)*mp(/~time('npt','50'))
This example uses the special function “urlappend” of the handle proxy resolver to append a fragment
identifier to the resolved handle.
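A composite actionable identifier of this kind can be assembled programmatically. The Python sketch below is an illustration under stated assumptions: the function name is hypothetical, and percent-encoding the appended string is a choice made here so that the number sign reaches the proxy as part of the query instead of being treated by the web client as the start of a fragment, which is never transmitted to the server.

```python
from urllib.parse import quote


def urlify_with_append(pid: str, suffix: str,
                       proxy: str = "http://hdl.handle.net") -> str:
    """Build a urlified PID whose 'urlappend' value carries a fragment identifier."""
    # Encode every reserved character of the suffix, including "#".
    encoded_suffix = quote(suffix, safe="")
    return f"{proxy}/{quote(pid, safe='/')}?urlappend={encoded_suffix}"
```

For example, urlify_with_append("1839/00-0000-0000-0000-0000-4", "#ffp(track_ID=101)*mp(/~time('npt','50'))") produces a URI with no fragment component of its own; the proxy is then expected to append the decoded string to the resolved URI.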
5.4 Citation information and persistent identifiers
Existing practices can be maintained provided that URIs are replaced by PIDs and document part
specifications are converted into part identifiers or fragment identifiers. References from web documents shall
also provide a urlified PID if PID syntax does not comply with an IANA-registered URI scheme [13].
5.5 Referencing resource parts
5.5.1 General
Any applicable existing ISO or IETF standard for identifying a part of a resource can be used as a part
identifier, unless otherwise specified. For resources where no such standard exists, it is permissible to use
human-readable text in the citation, for instance: 10 s to 120 s for a time segment. This approach will not,
however, work for software clients.
When using part or fragment identifiers for retrieving or dereferencing part of a resource, it is important to
clarify the difference between using fragment identifiers such as those defined in IETF RFC 3986 and using
the functionality of a suitable resource server. If the PID resolving process delivers a URI including a fragment
identifier, for instance, using the following URI (with fragment specification according to the IETF RFC 5147
proposed standard for plain text media):
http://myserver/myresource#line=10,20
this URI will cause a web browser to fetch the whole document, and the browser itself will then isolate the
document part from lines 10 to 20 and present it to the user. When the document is small, this is acceptable
behaviour. However, when there is a need to present a fragment from a large 2 GB media file, it is necessary
to use a special resource server and a URI with a part specification, such as one from Annodex [22]:
http://videoserver.com/videoA.anx?t=15.0/30.0
This notation will cause the resource server to transfer only the video segment from 15,0 s to 30,0 s.
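The client-side behaviour described for the plain-text case can be sketched as follows. The Python function below is a simplified illustration and not a complete IETF RFC 5147 implementation: it handles only the two-number form "line=start,end", and it assumes the reading of IETF RFC 5147 in which positions count the gaps between lines, so that line=10,20 selects the eleventh through twentieth lines. The function name is hypothetical.

```python
import re


def extract_line_fragment(text: str, fragment: str) -> str:
    """Apply a 'line=start,end' fragment to text that is already fully downloaded."""
    match = re.fullmatch(r"line=(\d+),(\d+)", fragment)
    if match is None:
        raise ValueError(f"unsupported fragment: {fragment!r}")
    start, end = int(match.group(1)), int(match.group(2))
    lines = text.splitlines(keepends=True)
    # Position 0 lies before the first line, so the slice below covers
    # lines start+1 through end when counting lines from 1.
    return "".join(lines[start:end])
```

The function needs the entire text as input, which is exactly why this approach is unsuitable for very large resources: the whole document is transferred before the fragment is isolated.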
Updates to this International Standard are expected to enlarge the list of applicable resource part/fragment
identifier formats.
5.5.2 Media (time series)
ISO/IEC 21000-17 for MPEG-21 resources shall apply for media and time series. Annodex syntax can be
used for specifying time intervals in URI queries and fragments [22] for all applicable media. For other formats,
the part identifier will depend on the format used.
5.5.3 Textual resources
For XML-encoded textual resources, XPointer shall apply. IETF RFC 5147 shall apply for plain text
documents. For other formats, the part identifier will depend on the format used.
5.5.4 Metadata registries, terminologies, ontologies
In accordance with ISO 12620:2009, each data category specification in the ISO/TC 37 Data Category
Registry (DCR) shall have its own PID, which is formed as a concatenation of the PID for the DCR as a whole.
These PIDs are configured as “cool” URIs [30] as per current practice of the World Wide Web Consortium.
Specific parts of an RDF graph can be addressed using one of the proposed RDF query languages, but no
definitive specification is currently available. For other formats, the part identifier will depend on the format
used.
5.6 Collections
Existing “published” collections shall be assigned an associated PID that is maintained by the collection
sponsor. The PID shall refer to a description representing the collection, which can be a catalogue entry or an
individual metadata description. For virtual collections, the PID should refer to a machine-readable metadata
description which provides access to the relevant information, in particular to the resources that are included,
referenced by their own individual PIDs.
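A machine-readable description of a virtual collection could, for instance, be a small XML record listing the PIDs of the included resources. The Python sketch below is purely illustrative: this International Standard does not prescribe a metadata schema, and the element and attribute names (collection, title, members, member, pid) as well as the example PIDs are hypothetical.

```python
import xml.etree.ElementTree as ET


def collection_record(collection_pid: str, title: str, member_pids: list[str]) -> str:
    """Serialize a minimal collection description that lists member PIDs."""
    root = ET.Element("collection", {"pid": collection_pid})
    ET.SubElement(root, "title").text = title
    members = ET.SubElement(root, "members")
    for pid in member_pids:
        # Each constituent resource is referenced by its own individual PID.
        ET.SubElement(members, "member", {"pid": pid})
    return ET.tostring(root, encoding="unicode")
```

A client that resolves the collection PID to such a record can then resolve each member PID in turn to reach the individual resources.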
6 Complementary requirements
6.1 Granularity of identifiers
With respect to granularity, this International Standard distinguishes between the identification of parts and of
fragments, as indicated in 5.5. A fragment identifier is defined by IETF RFC 3986 as an optional component of
a URI reference. In conformance with IETF RFC 3986, a URI can be assigned an optional fragment identifier,
whereby the identifier is separated from the rest of the URI reference by a # (number sign) character. The
separator is not considered part of the fragment identifier.
The URI with the fragment identifier can be used by an application to identify and usually to access a specific
resource that is part of or embedded in a primary resource. The format of the fragment identifier depends on
the resource type.
Interpreting and dereferencing the fragment identifier is a web client function, and therefore requires the
complete primary resource to be downloaded, after which a client application can extract the required
fragment. Furthermore, since the fragment identifier is not passed to other systems during the process of
retrieval, some intermediaries in the web architecture (such as proxies) have no interaction with fragment
identifiers, and HTTP redirection does not account for fragments. As an exception, fragment identifiers for
RDF documents do not refer to parts of the document, but rather to the object in that document being
described as having that fragment identifier. As a consequence, the use of fragment identifiers in combination
with a URI allows the client application to isolate part of a resource based on knowledge specific to the client
application.
In contrast to fragment resolution procedures, using a part identifier such as “z” in
http://myserver/myObjectService?part=z
relies on the remote server to isolate the part and send it to the client application 1).
1) This strict separation of roles between a web client and server when dereferencing fragments may change. The media
fragments working group has produced a W3C draft where it is proposed that web clients negotiate the transportation of
only part of the resource from the server [27].
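The difference between the two mechanisms shows up when a URI is split into its components. The Python sketch below uses the standard urllib.parse module; the helper names request_target and local_fragment are assumptions of this example. It illustrates that a query such as "part=z" belongs to what is sent to the server, whereas a fragment stays with the client.

```python
from urllib.parse import urlsplit


def request_target(uri: str) -> str:
    """Return the path and query that a web client sends to the server."""
    parts = urlsplit(uri)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


def local_fragment(uri: str) -> str:
    """Return the fragment identifier, which the client interprets by itself."""
    return urlsplit(uri).fragment
```

For http://myserver/myresource#line=10,20 the request target is /myresource and the fragment "line=10,20" is kept locally; for http://myserver/myObjectService?part=z the request target is /myObjectService?part=z and there is no fragment.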
6.2 Recommendations
This International Standard supports different levels of granularity. The following recommendations are
designed to encourage efficiency and promote interoperability with other naming schemes.
⎯ If there is an existing identifier scheme for a type of resources, for instance, ISBN, this level of granularity
should be retained, which is to say that no new PIDs should be issued without very good reasons, such
as for chapters. Chapters would preferably be addressed using part identifiers in conjunction with the PID
of the book.
⎯ If the resource is associated with the complete content of a digital file, an individual PID should probably
be assigned for this resource.
⎯ If the resource is autonomous and exists outside a larger context, an individual PID should probably be
assigned for this resource.
⎯ If a resource should be citable apart from any containing resource, an individual PID should probably be
assigned for this resource.
These recommendations are, however, subject to the needs of resource creators with respect to the level of
granularity they deem suitable to the specific resource environment.
Annex A
(informative)
Independent resources, aggregated resources, and parts of resources
A.1 Overview
A.1.1 General
There is increasing demand in science and industry for options to reference digital language resources,
resource parts or collections of language resources in an unambiguous persistent way. Not only is it desirable
to be able to retrieve and validate references from scientific papers, but it is also increasingly necessary to
maintain various types of references between language resources or parts of such resources.
A.1.2 Resources
A resource is anything that has an identity [7]. Despite the indefinite nature of this characterization, it is
important for researchers to be able to identify linguistically meaningful units as coherent objects in a
repository – both for human and, increasingly often, for machine readability. Such objects will be subject to
separate manipulations and be used in different scenarios, which means that they can have an autonomous
existence in larger contexts. Frequently such an object can be identified as a “single file” in a file system, but it
can also be retrievable as
...











