Information and documentation -- Statistics and quality issues for web archiving

This Technical Report defines statistics, terms and quality criteria for Web archiving. It considers the
needs and practices across a wide range of organisations such as libraries, archives, museums, research
centres and heritage foundations. The examples mentioned are taken from the library sector, because
libraries, especially national libraries, have taken up the new task of Web archiving in the context of
legal deposit. This should in no way be taken to undermine the important contributions of institutions
which are not libraries. Neither does it reduce the principal applicability of this Technical Report for
heritage institutions and archiving professionals.
This Technical Report is intended for professionals directly involved in Web archiving, often in mixed
teams consisting of library or archive curators, engineers and managerial staff. It is also useful for Web
archiving institutions’ funding authorities and external stakeholders. The terminology used in this
Technical Report attempts to reflect the wide range of interests and expertise of the audiences, striking
a balance between computer science, management and librarianship.
This Technical Report does not consider the management of academic and commercial electronic
resources, such as e-journals, e-newspapers or e-books, which are usually stored and processed
separately using different management systems. They are regarded as Internet resources and are not
addressed in this Technical Report as distinct streams of content of Web archives. Some organisations
also collect electronic documents, which may be delivered through the Web, through publisher-based
electronic deposits and repository systems. These too are out of scope for this Technical Report. The
principles and techniques used for this kind of collecting are indeed very different from those of Web
archiving; statistics and quality indicators relevant for one kind of method are not necessarily relevant
for the other.
Finally, this Technical Report essentially focuses on Web archiving principles and methods, and does
not encompass alternative ways of collecting Internet resources. As a matter of fact, some Internet
resources, especially those that are not distributed on the Web (e.g. newsletters distributed as e-mails)
are not harvested by Web archiving techniques and are collected by other means that are not described
nor analysed in this Technical Report.

Information et documentation -- Statistiques et indicateurs de qualité pour l'archivage du web

Informatika in dokumentacija - Statistika in vprašanja glede kakovosti za spletno arhiviranje


General Information

Status: Published
Public Enquiry End Date: 30-Dec-2016
Publication Date: 10-Jan-2017
Current Stage: 6060 - National Implementation/Publication (Adopted Project)
Start Date: 27-Dec-2016
Due Date: 03-Mar-2017
Completion Date: 11-Jan-2017

Buy Standard

Technical report
ISO/TR 14873:2013 - Information and documentation -- Statistics and quality issues for web archiving
English language
54 pages
Technical report
TP ISO/TR 14873:2017
English language
59 pages

Standards Content (Sample)

TECHNICAL REPORT ISO/TR 14873
First edition 2013-12-01
Information and documentation -- Statistics and quality issues for web archiving
Information et documentation -- Statistiques et indicateurs de qualité pour l’archivage du web
Reference number: ISO/TR 14873:2013(E)
© ISO 2013

COPYRIGHT PROTECTED DOCUMENT
© ISO 2013
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized otherwise in any form
or by any means, electronic or mechanical, including photocopying, or posting on the internet or an intranet, without prior
written permission. Permission can be requested from either ISO at the address below or ISO’s member body in the country of
the requester.
ISO copyright office
Case postale 56 • CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland

Contents
Foreword
Introduction
1 Scope
2 Terms and definitions
3 Methods and purposes of Web archiving
3.1 Collecting methods
3.2 Access and description methods
3.3 Preservation methods
3.4 Legal basis for Web archiving
3.5 Additional reasons for Web archiving
4 Statistics
4.1 General
4.2 Statistics for collection development
4.3 Collection characterization
4.4 Collection usage
4.5 Web archive preservation
4.6 Measuring the costs of Web archiving
5 Quality indicators
5.1 General
5.2 Limitations
5.3 Description
6 Usage and benefits
6.1 General
6.2 Intended usage and readers
6.3 Benefits for user groups
6.4 Use of proposed statistics by user groups
6.5 Web archiving process with related performance indicators
Bibliography

Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards
bodies (ISO member bodies). The work of preparing International Standards is normally carried out
through ISO technical committees. Each member body interested in a subject for which a technical
committee has been established has the right to be represented on that committee. International
organizations, governmental and non-governmental, in liaison with ISO, also take part in the work.
ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of
electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are
described in the ISO/IEC Directives, Part 1. In particular the different approval criteria needed for the
different types of ISO documents should be noted. This document was drafted in accordance with the
editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives).
Attention is drawn to the possibility that some of the elements of this document may be the subject of
patent rights. ISO shall not be held responsible for identifying any or all such patent rights. Details of
any patent rights identified during the development of the document will be in the Introduction and/or
on the ISO list of patent declarations received (see www.iso.org/patents).
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the meaning of ISO-specific terms and expressions related to conformity
assessment, as well as information about ISO’s adherence to the WTO principles in the Technical Barriers
to Trade (TBT), see the following URL: Foreword - Supplementary information
The committee responsible for this document is ISO/TC 46, Information and documentation, Subcommittee
SC 8, Quality - Statistics and performance evaluation.

Introduction
This Technical Report was developed in response to a worldwide demand for guidelines on the
management and evaluation of Web archiving activities and products.
Web archiving refers to the activities of selecting, capturing, storing, preserving and managing access
to snapshots of Internet resources over time. It started at the end of the 1990s, based on the vision that
an archive of Internet resources would become a vital record for research, commerce and government
in the future. Internet resources are regarded as part of the cultural heritage and therefore preserved
like printed heritage publications. Many institutions involved in Web archiving see this as an extension
of their long-standing mission of preserving their national heritage, and this is endorsed and enabled in
many countries by legislative frameworks such as legal deposit.
There is a wide range of resources available on the Internet, including text, image, film, sound and other
multimedia formats. In addition to interlinked Web pages, there are newsgroups, newsletters, blogs and
interactive services such as games, made available using various transfer and communication protocols.
Web archives bring together copies of Internet resources, collected automatically by harvesting software,
usually at regular intervals. The intention is to replay the resources including the inherent relations, for
example by means of hypertext links, as much as possible as they were in their original environment.
The primary goal of Web archiving is to preserve a record of the Web in perpetuity, as closely as possible
to its original form, for various academic, professional and private purposes.
Web archiving is a recent but expanding activity which continuously requires new approaches and tools
in order to stay in sync with rapidly evolving Web technology. Determined by the strategic importance
perceived by the archiving institution, means available and sometimes legal requirements, diverse
approaches have been taken to archive Internet resources, ranging from capturing individual Web
pages to entire top-level domains. From an organisational perspective, Web archiving is also at different
levels of maturity. While it has become a business-as-usual activity in some organisations, others have
just initiated experimental programmes to explore the challenge.
Depending on the scale and purpose of collection, a distinction can be made between two broad categories
of Web archiving strategy: bulk harvesting and selective harvesting. Large scale bulk harvesting, such
as national domain harvesting, is intended to capture a snapshot of an entire domain (or a subset of
it). Selective harvesting is performed on a much smaller scale, is more focused and undertaken more
frequently, often based on criteria such as theme, event, format (e.g. audio or video files) or agreement
with content owners. A key difference between the two strategies lies in the level of quality control,
the evaluation of harvested Websites to determine whether pre-defined quality standards are being
attained. The scale of domain harvesting makes it impossible to carry out any manual visual comparison
between the harvested and the live version of the resource, which is a common quality assurance method
in selective harvesting.
This Technical Report aims to demonstrate how Web archives, as part of a wider heritage collection, can
be measured and managed in a similar and compliant manner based on traditional library workflows.
The report addresses collection development, characterization, description, preservation, usage and
organisational structure, showing that most aspects of the traditional collection management workflow
remain valid in principle for Web archiving, although adjustment is required in practice.
While this Technical Report provides an overview of the current status of Web archiving, its focus is on
the definition and use of Web archive statistics and quality indicators. The production of some statistics
relies on the use of harvesting, indexing or browsing software, and a different choice of software may
lead to variance in the results. This Technical Report, however, neither endorses nor recommends any
software in particular. It provides a set of indicators to help assess the performance and quality of Web
archives in general.
This Technical Report should be considered as a work in progress. Some of its contents are expected to
be incorporated in the future into ISO 2789 and ISO 11620.

1 Scope
This Technical Report defines statistics, terms and quality criteria for Web archiving. It considers the
needs and practices across a wide range of organisations such as libraries, archives, museums, research
centres and heritage foundations. The examples mentioned are taken from the library sector, because
libraries, especially national libraries, have taken up the new task of Web archiving in the context of
legal deposit. This should in no way be taken to undermine the important contributions of institutions
which are not libraries. Neither does it reduce the principal applicability of this Technical Report for
heritage institutions and archiving professionals.
This Technical Report is intended for professionals directly involved in Web archiving, often in mixed
teams consisting of library or archive curators, engineers and managerial staff. It is also useful for Web
archiving institutions’ funding authorities and external stakeholders. The terminology used in this
Technical Report attempts to reflect the wide range of interests and expertise of the audiences, striking
a balance between computer science, management and librarianship.
This Technical Report does not consider the management of academic and commercial electronic
resources, such as e-journals, e-newspapers or e-books, which are usually stored and processed
separately using different management systems. They are regarded as Internet resources and are not
addressed in this Technical Report as distinct streams of content of Web archives. Some organisations
also collect electronic documents, which may be delivered through the Web, through publisher-based
electronic deposits and repository systems. These too are out of scope for this Technical Report. The
principles and techniques used for this kind of collecting are indeed very different from those of Web
archiving; statistics and quality indicators relevant for one kind of method are not necessarily relevant
for the other.
Finally, this Technical Report essentially focuses on Web archiving principles and methods, and does
not encompass alternative ways of collecting Internet resources. As a matter of fact, some Internet
resources, especially those that are not distributed on the Web (e.g. newsletters distributed as e-mails)
are not harvested by Web archiving techniques and are collected by other means that are not described
nor analysed in this Technical Report.
2 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
2.1
access
successful request of a library-provided online service
Note 1 to entry: An access is one cycle of user activities that typically starts when a user connects to a library-
provided online service and ends by a terminating activity that is either explicit (by leaving the database through
log-out or exit) or implicit (timeout due to user inactivity).
Note 2 to entry: Accesses to the library website are counted as virtual visits.
Note 3 to entry: Requests of a general entrance or gateway page are excluded.
Note 4 to entry: If possible, requests by search engines are excluded.
[SOURCE: ISO 2789:2013, definition 2.2.1]
2.2
access tool
specialist software used to find, retrieve and replay archived Internet resources
Note 1 to entry: This may be implemented by a number of separate software packages working together.
2.3
administrative metadata
information necessary to allow the proper management of the digital objects in a repository
Note 1 to entry: Administrative metadata can be divided into the following categories:
— context or provenance metadata: describe the lifecycle of a resource to a point, including the related entities
and processes, e.g. configuration and log files;
— technical metadata: describe the technical characteristics of a digital object, e.g. its format;
— rights metadata: define the ownership and the legally permitted usage of an object.
2.4
archive
Web archive
entire set of resources crawled from the Web over time, comprising one or more collections
2.5
bit stream
series of 0 and 1 digits that constitutes a digital file
2.6
budget (crawl)
limitation associated with a crawl or individual seeds, which can be expressed in e.g. number of files,
volume of data, or the time to be spent per crawl as defined in the crawler settings
2.7
bulk crawl
bulk harvest
crawl aimed at collecting the entirety of a single or multiple top level domain(s) or a subset(s)
Note 1 to entry: In comparison with selective crawls, bulk crawls have a wider scope and are typically performed
less frequently.
Note 2 to entry: Bulk crawls generally result in large-scale Web archives, making it impossible to conduct detailed
quality assurance on every resource. Quality assurance is therefore often done through sampling.
2.8
capture
instance
copy of a resource crawled at a certain point in time
Note 1 to entry: If a resource has been crawled three times on different dates, there will be three captures.
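The counting rule in Note 1 can be illustrated with a minimal sketch. The record layout (URL and crawl date pairs), the example URLs and the function name `captures_by_resource` are assumptions made for illustration; they are not defined by this Technical Report.

```python
from collections import defaultdict
from datetime import date

# Hypothetical crawl records: (URL, crawl date) pairs.
records = [
    ("http://example.org/", date(2013, 1, 5)),
    ("http://example.org/", date(2013, 6, 5)),
    ("http://example.org/", date(2013, 12, 5)),
    ("http://example.org/about", date(2013, 6, 5)),
]


def captures_by_resource(records):
    """Group crawl dates by URL; each distinct date counts as one capture."""
    grouped = defaultdict(set)
    for url, crawled_on in records:
        grouped[url].add(crawled_on)
    return {url: sorted(dates) for url, dates in grouped.items()}


captures = captures_by_resource(records)
# The resource crawled on three different dates has three captures.
print(len(captures["http://example.org/"]))
```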
2.9
collection
Web archive collection
cohesive resources presented as a group
Note 1 to entry: A collection can either be selected specifically prior to harvesting (e.g. an event, a topic) or pulled
together retrospectively from available resources in the archive.
Note 2 to entry: A Web archive may consist of one or more collections.
2.10
crawl
harvest
process of browsing and copying resources using a crawler
Note 1 to entry: Crawls can be categorised as bulk or selective crawls.
2.11
crawl settings
crawl parameters
definition of which resources should be collected and the frequency and depth required for each set of seeds
Note 1 to entry: Crawl settings also include crawler politeness (number of requests per second or minute sent to
the server hosting the resource), compliance with robots.txt and filters to exclude crawler traps.
2.12
crawler
harvester
archiving crawler
DEPRECATED: spider
software that will successively request URLs and parse the resulting resource for further URLs
Note 1 to entry: Resources may be stored and URLs discarded in accordance with a predefined set of rules [see
crawl settings (2.11) and scope (crawl) (2.40)].
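The request-and-parse loop described in this entry can be sketched as follows. This is a simplified illustration under stated assumptions, not the behaviour of any particular harvester: the in-memory `FAKE_WEB` dictionary stands in for real HTTP requests, and the names `crawl`, `in_scope` and `max_hops` are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href values of <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical in-memory "Web" used instead of real HTTP requests.
FAKE_WEB = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.org/a": '<a href="/b">B</a> <a href="http://other.example/">X</a>',
    "http://example.org/b": '<a href="/">home</a>',
}


def crawl(seeds, fetch, in_scope, max_hops=2):
    """Request URLs in turn, store each resource and parse it for further URLs."""
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    stored = {}
    while queue:
        url, hops = queue.popleft()
        body = fetch(url)
        if body is None:  # nothing returned for this URL
            continue
        stored[url] = body
        if hops >= max_hops:  # hop limit reached: store but do not follow links
            continue
        parser = LinkParser()
        parser.feed(body)
        for href in parser.links:
            target = urljoin(url, href)
            # URLs outside the scope are discarded, as are already seen ones.
            if target not in seen and in_scope(target):
                seen.add(target)
                queue.append((target, hops + 1))
    return stored


archive = crawl(
    seeds=["http://example.org/"],
    fetch=FAKE_WEB.get,
    in_scope=lambda u: u.startswith("http://example.org/"),
)
print(sorted(archive))
```

Passing `fetch` and `in_scope` as parameters keeps the sketch independent of any network library and makes the role of the scope rule explicit.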
2.13
crawler trap
Web page (or series thereof) which will cause a crawler to either crash or endlessly follow references to
other resources deemed to be of little or no value
Note 1 to entry: Crawler traps could be put in place intentionally to prevent crawlers from harvesting resources.
This could also occur inadvertently for example when a crawler follows dates of a calendar endlessly.
2.14
curator tool
application that runs on top of a Web crawler and supports the harvesting processes
Note 1 to entry: A core function is the management of targets and the associated descriptive and administrative
metadata. It may also include components for scheduling and quality control.
2.15
data mining
computational process that extracts patterns by analysing quantitative data from different perspectives
and dimensions, categorizing it, and summarizing potential relationships and impacts
[SOURCE: ISO 16439:—, definition 3.13]
2.16
deep Web
DEPRECATED: hidden Web
DEPRECATED: invisible Web
part of the Web which cannot be crawled and indexed by search engines, notably consisting of resources
which are dynamically generated or password protected
2.17
descriptive metadata
information describing the intellectual content of a digital object
2.18
domain name
identification string that defines a realm of administrative autonomy, authority, or control on the
Internet, defined by the rules and procedures of the domain name system (DNS)
2.19
domain name system
DNS
hierarchical, distributed global naming system used to identify entities connected to the Internet
Note 1 to entry: The Top Level Domains (TLDs) are the highest in the hierarchy.
2.20
emulation
recreation of the functionality and behaviour of an obsolete system, using software (called emulator) on
current computer systems
Note 1 to entry: Emulation is a key digital preservation strategy.
2.21
host
portion of a URI that names the network source of the content
Note 1 to entry: A host is typically a domain name such as www.archive.org, or a subdomain such as web.archive.org.
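As an illustration, the host portion of a URI can be extracted with the Python standard library. The helper name `host_of` and the sample URLs are assumptions for this sketch; counting resources per host is shown only as one possible use.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(uri):
    """Return the host portion of a URI, lower-cased and without the port."""
    return urlsplit(uri).hostname


# Illustrative URLs only.
urls = [
    "http://www.archive.org/",
    "http://www.archive.org:80/about",
    "http://web.archive.org/web/",
]

per_host = Counter(host_of(u) for u in urls)
print(per_host)
```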
2.22
HTML
Hypertext Markup Language
the main mark-up language for Web pages, consisting of elements which are used to add structural and
semantic information to raw text
2.23
HTTP
Hypertext Transfer Protocol
client/server communication protocol used to transfer information on the Web
2.24
hyperlink
link
relationship structure used to link information on the Internet
2.25
junk
spam
unsolicited contents which are deemed to be of no relevance or long-term value
Note 1 to entry: Intentional spam is commonly used to manipulate search engine indexes. Junk can also be
generated inadvertently when a crawler falls in a crawler trap.
Note 2 to entry: Collecting institutions in general try to avoid collecting junk and spam so that resources can be
used to harvest “good” resources. Some, however, keep a small sample of this as a part of the record of the Web.
2.26
link mining
processing and analysis that focus on extracting patterns and heuristics from hyperlinks, e.g. to draw
network graphs
2.27
live Web leakage
common problem in rendering archived resources, which occurs when links in an archived resource
resolve to the current copy on the live site, instead of to the archival version within a Web archive
Note 1 to entry: Live Web leakage also occurs when scripts on archived Web pages continue to reference, and
successfully request, live Web resources within the archival rendering. This may cause live Web social media
feeds or streaming videos, for example, to appear in the archived webpage.
2.28
log file
file automatically created by a server that maintains a record of its activities
2.29
metadata
data describing context, content and structure of digital objects and their management through time
[SOURCE: ISO 15489-1:2001, definition 2.12]
Note 1 to entry: Metadata can be categorised as descriptive, structural and administrative metadata.
2.30
migration
conversion of older or obsolete file formats to newer or current ones for the purpose of maintaining the
accessibility of a digital object
Note 1 to entry: Migration is a key preservation strategy.
[SOURCE: ISO 15489-1:2001, definition 3.13]
2.31
MIME type
Internet media type
content type
two-part identifier for file formats on the Internet
Note 1 to entry: MIME (Multipurpose Internet Mail Extensions) uses the content-type header, consisting of a type
and a subtype, to indicate the format of a resource, e.g. image/jpeg.
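The two-part structure can be shown with a short sketch that splits a content-type value into its type and subtype. The function name `split_mime` and the sample header values are illustrative assumptions; parameters such as `charset` are simply dropped here.

```python
from collections import Counter


def split_mime(content_type):
    """Split a content-type value into (type, subtype), ignoring parameters."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    type_, _, subtype = media_type.partition("/")
    return type_, subtype


# Illustrative header values as they might appear in crawl logs.
headers = ["image/jpeg", "text/html; charset=UTF-8", "TEXT/HTML", "image/png"]

by_type = Counter(split_mime(h)[0] for h in headers)
print(split_mime("image/jpeg"))
print(by_type)
```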
2.32
nomination
candidate resource to be considered for inclusion in a Web archive
2.33
page
Web page
structured resource, which in addition to any human-readable content, contains zero or more
relationships with other resources and is identified by a URL
2.34
permission
authorization to crawl a live website and/or to publicly display its content on a Web archive
Note 1 to entry: Permission can be expressed by a formal licence from the rights holder or exempted by the virtue
of legal deposit.
2.35
registered user
person or organization registered with a library in order to use its collection and/or services within or
away from the library
Note 1 to entry: Users can be registered upon their request or automatically when enrolling in the institution.
Note 2 to entry: The registration is monitored at regular intervals, at least every three years, so that inactive
users can be removed from the register.
[SOURCE: ISO 2789:2013, definition 2.2.28]
2.36
request
HTTP-formatted message sent by a requesting system (e.g. a browser or a crawler) to a remote server
for a particular resource identified by a URL
2.37
response
answer by a remote server to an HTTP request for a resource, containing either the requested resource,
a redirection to another URL or a negative (error) response, indicating why the requested resource
could not be returned
2.38
response code
status code
three-digit number indicating to the requesting server the status of the requested resource
Note 1 to entry: Codes starting with a 4 (4xx), for example, indicate that the requested resource is not available.
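A minimal sketch of grouping response codes by their first digit is given below. The class labels follow the usual HTTP status classes; the function name `status_class` and the sample list of codes are assumptions for illustration.

```python
from collections import Counter
from http import HTTPStatus

# The first digit of the three-digit code identifies the status class.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code):
    """Map a three-digit response code to its status class."""
    return STATUS_CLASSES.get(code // 100, "unknown")


# Illustrative codes as they might be recorded during a crawl.
codes = [200, 200, 301, 404, 404, 503]
summary = Counter(status_class(c) for c in codes)

print(status_class(404), "-", HTTPStatus(404).phrase)
print(summary)
```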
2.39
robots.txt
robots exclusion standard
protocol used to prevent Web crawlers from accessing all or part of a website
Note 1 to entry: robots.txt is not legally binding.
Note 2 to entry: It may also be used to request a minimum delay between consecutive requests or even to provide
a link to a site map to facilitate better crawling of the site.
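The uses mentioned in this entry (exclusion rules, a delay between requests and a site map link) can be illustrated with the `urllib.robotparser` module of the Python standard library. The robots.txt content, the URLs and the user agent name `example-archiver` are made up for this sketch; `site_maps()` requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory rather than fetched.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
Sitemap: http://example.org/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

agent = "example-archiver"  # assumed crawler name
allowed = parser.can_fetch(agent, "http://example.org/index.html")
blocked = parser.can_fetch(agent, "http://example.org/private/report.html")
delay = parser.crawl_delay(agent)  # requested delay between requests, in seconds
sitemaps = parser.site_maps()      # list of site map URLs, or None

print(allowed, blocked, delay, sitemaps)
```

Whether a crawler honours these rules is a crawl setting; as Note 1 states, robots.txt itself is not legally binding.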
2.40
scope (crawl)
set of parameters which defines the extent of a crawl, e.g. the maximum number of hops or the maximum
path depth the crawler should follow
Note 1 to entry: The scope of a crawl can be as broad as a whole top level domain (e.g. .de) or as narrow as a single file.
2.41
scope (Web archive)
extent of a Web archive or collection, as determined by the institutional legal mandate or collection policy
2.42
second level domain
subdivisions within the top level domains for specific categories of organisations or areas of interest
(e.g. .gov.uk for governmental websites, .asso.fr for associations’ websites)
2.43
seed
targeted URL
URL corresponding to the location of a particular resource to be crawled, used as a starting point by
a Web crawler
2.44
selection
curatorial decision-making process which determines whether a meaningful set of resources is in scope
for a Web archive, judged against its collection development policy
2.45
selective crawl
selective harvest
crawl aimed at collecting resources selected according to certain criteria
Note 1 to entry: In comparison with bulk crawls, selective crawls have a narrower scope and are typically
performed more frequently.
Note 2 to entry: Selective continuous crawls are crawls aimed at collecting resources selected according to certain
criteria, such as scholarly importance, relevance to a subject or continuous update frequency of the resource.
Note 3 to entry: Selective event crawls are time-bound crawls, which end at a certain date, aimed at collecting
resources related to unique events, such as elections, sport events and disasters.
2.46
structural metadata
information that describes how compound objects are constructed together to make up logical units
2.47
target
meaningful set of resources to be collected as defined by one or more seeds and the associated crawl settings
2.48
top level domain
TLD
highest level of domains in the Domain Name System (DNS), including country-code top-level domains
(e.g. .fr, .de), which are based on the two-character country codes of ISO 3166,
and generic top-level domains (e.g. .com, .net, .org, .paris)
Note 1 to entry: Unless specifically stated, this term is used to mean country-code TLDs in the report.
2.49
Uniform Resource Identifier
URI
extensible string of characters used to identify or name a resource on the Internet
2.50
Uniform Resource Locator
URL
subset of the Uniform Resource Identifier (URI) that specifies the location of a resource and the protocol
for retrieving it
2.51
WARC format
file format that specifies a method for combining multiple digital resources into an aggregate archival
file together with related information
Note 1 to entry: The WARC (Web ARChive) format has been an ISO standard since 2009 (ISO 28500:2009).
2.52
website
set of legally and/or editorially interconnected Web pages
Note 1 to entry: Usually websites represent official institutions, organizations, private firms and private homepages.
2.53
Web
main publishing application of the Internet, enabled by three key standards: URI, HTTP and HTML
3 Methods and purposes of We
...

SLOVENIAN STANDARD
SIST-TP ISO/TR 14873:2017
1 February 2017
Informatika in dokumentacija - Statistika in vprašanja glede kakovosti za spletno arhiviranje
Information and documentation -- Statistics and quality issues for web archiving
Information et documentation -- Statistiques et indicateurs de qualité pour l'archivage du web
This Slovenian standard is identical to: ISO/TR 14873:2013
ICS:
01.140.20 Information sciences
03.120.99 Other standards related to quality
SIST-TP ISO/TR 14873:2017 en
2003-01. Slovenian Institute for Standardization. Reproduction of this standard in whole or in part is not permitted.

---------------------- Page: 1 ----------------------

SIST-TP ISO/TR 14873:2017

---------------------- Page: 2 ----------------------

SIST-TP ISO/TR 14873:2017
TECHNICAL ISO/TR
REPORT 14873
First edition
2013-12-01
Information and documentation —
Statistics and quality issues for web
archiving
Information et documentation — Statistiques et indicateurs de
qualité pour l’archivage du web
Reference number
ISO/TR 14873:2013(E)
©
ISO 2013

---------------------- Page: 3 ----------------------

SIST-TP ISO/TR 14873:2017
ISO/TR 14873:2013(E)

COPYRIGHT PROTECTED DOCUMENT
© ISO 2013
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized otherwise in any form
or by any means, electronic or mechanical, including photocopying, or posting on the internet or an intranet, without prior
written permission. Permission can be requested from either ISO at the address below or ISO’s member body in the country of
the requester.
ISO copyright office
Case postale 56 • CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland

Contents
Foreword
Introduction
1 Scope
2 Terms and definitions
3 Methods and purposes of Web archiving
3.1 Collecting methods
3.2 Access and description methods
3.3 Preservation methods
3.4 Legal basis for Web archiving
3.5 Additional reasons for Web archiving
4 Statistics
4.1 General
4.2 Statistics for collection development
4.3 Collection characterization
4.4 Collection usage
4.5 Web archive preservation
4.6 Measuring the costs of Web archiving
5 Quality indicators
5.1 General
5.2 Limitations
5.3 Description
6 Usage and benefits
6.1 General
6.2 Intended usage and readers
6.3 Benefits for user groups
6.4 Use of proposed statistics by user groups
6.5 Web archiving process with related performance indicators
Bibliography

Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards
bodies (ISO member bodies). The work of preparing International Standards is normally carried out
through ISO technical committees. Each member body interested in a subject for which a technical
committee has been established has the right to be represented on that committee. International
organizations, governmental and non-governmental, in liaison with ISO, also take part in the work.
ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of
electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are
described in the ISO/IEC Directives, Part 1. In particular the different approval criteria needed for the
different types of ISO documents should be noted. This document was drafted in accordance with the
editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives).
Attention is drawn to the possibility that some of the elements of this document may be the subject of
patent rights. ISO shall not be held responsible for identifying any or all such patent rights. Details of
any patent rights identified during the development of the document will be in the Introduction and/or
on the ISO list of patent declarations received (see www.iso.org/patents).
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation on the meaning of ISO specific terms and expressions related to conformity
assessment, as well as information about ISO’s adherence to the WTO principles in the Technical Barriers
to Trade (TBT) see the following URL: Foreword - Supplementary information
The committee responsible for this document is ISO/TC 46, Information and documentation, Subcommittee
SC 8, Quality - Statistics and performance evaluation.

Introduction
This Technical Report was developed in response to a worldwide demand for guidelines on the
management and evaluation of Web archiving activities and products.
Web archiving refers to the activities of selecting, capturing, storing, preserving and managing access
to snapshots of Internet resources over time. It started at the end of the 1990s, based on the vision that
an archive of Internet resources would become a vital record for research, commerce and government
in the future. Internet resources are regarded as part of the cultural heritage and therefore preserved
like printed heritage publications. Many institutions involved in Web archiving see this as an extension
of their long standing mission of preserving their national heritage, and this is endorsed and enabled in
many countries by legislative frameworks such as legal deposit.
There is a wide range of resources available on the Internet, including text, image, film, sound and other
multimedia formats. In addition to interlinked Web pages, there are newsgroups, newsletters, blogs and
interactive services such as games, made available using various transfer and communication protocols.
Web archives bring together copies of Internet resources, collected automatically by harvesting software,
usually at regular intervals. The intention is to replay the resources including the inherent relations, for
example by means of hypertext links, as much as possible as they were in their original environment.
The primary goal of Web archiving is to preserve a record of the Web in perpetuity, as closely as possible
to its original form, for various academic, professional and private purposes.
Web archiving is a recent but expanding activity which continuously requires new approaches and tools
in order to stay in sync with rapidly evolving Web technology. Determined by the strategic importance
perceived by the archiving institution, means available and sometimes legal requirements, diverse
approaches have been taken to archive Internet resources, ranging from capturing individual Web
pages to entire top-level domains. From an organisational perspective, Web archiving is also at different
levels of maturity. While it has become a business as usual activity in some organisations, others have
just initiated experimental programmes to explore the challenge.
Depending on the scale and purpose of collection, a distinction can be made between two broad categories
of Web archiving strategy: bulk harvesting and selective harvesting. Large scale bulk harvesting, such
as national domain harvesting, is intended to capture a snapshot of an entire domain (or a subset of
it). Selective harvesting is performed on a much smaller scale, is more focused and undertaken more
frequently, often based on criteria such as theme, event, format (e.g. audio or video files) or agreement
with content owners. A key difference between the two strategies lies in the level of quality control,
the evaluation of harvested Websites to determine whether pre-defined quality standards are being
attained. The scale of domain harvesting makes it impossible to carry out any manual visual comparison
between the harvested and the live version of the resource, which is a common quality assurance method
in selective harvesting.
This Technical Report aims to demonstrate how Web archives, as part of a wider heritage collection, can
be measured and managed in a similar and compliant manner based on traditional library workflows.
The report addresses collection development, characterization, description, preservation, usage and
organisational structure, showing that most aspects of the traditional collection management workflow
remain valid in principle for Web archiving, although adjustment is required in practice.
While this Technical Report provides an overview of the current status of Web archiving, its focus is on
the definition and use of Web archive statistics and quality indicators. The production of some statistics
relies on the use of harvesting, indexing or browsing software, and a different choice of software may
lead to variance in the results. This Technical Report however does not endorse nor recommend any
software in particular. It provides a set of indicators to help assess the performance and quality of Web
archives in general.
This Technical Report should be considered as a work in progress. Some of its contents are expected to
be incorporated in the future into ISO 2789 and ISO 11620.

TECHNICAL REPORT ISO/TR 14873:2013(E)
Information and documentation — Statistics and quality
issues for web archiving
1 Scope
This Technical Report defines statistics, terms and quality criteria for Web archiving. It considers the
needs and practices across a wide range of organisations such as libraries, archives, museums, research
centres and heritage foundations. The examples mentioned are taken from the library sector, because
libraries, especially national libraries, have taken up the new task of Web archiving in the context of
legal deposit. This should in no way be taken to undermine the important contributions of institutions
which are not libraries. Neither does it reduce the principal applicability of this Technical Report for
heritage institutions and archiving professionals.
This Technical Report is intended for professionals directly involved in Web archiving, often in mixed
teams consisting of library or archive curators, engineers and managerial staff. It is also useful for Web
archiving institutions’ funding authorities and external stakeholders. The terminology used in this
Technical Report attempts to reflect the wide range of interests and expertise of the audiences, striking
a balance between computer science, management and librarianship.
This Technical Report does not consider the management of academic and commercial electronic
resources, such as e-journals, e-newspapers or e-books, which are usually stored and processed
separately using different management systems. They are regarded as Internet resources and are not
addressed in this Technical Report as distinct streams of content of Web archives. Some organisations
also collect electronic documents, which may be delivered through the Web, through publisher-based
electronic deposits and repository systems. These too are out of scope for this Technical Report. The
principles and techniques used for this kind of collecting are indeed very different from those of Web
archiving; statistics and quality indicators relevant for one kind of method are not necessarily relevant
for the other.
Finally, this Technical Report essentially focuses on Web archiving principles and methods, and does
not encompass alternative ways of collecting Internet resources. As a matter of fact, some Internet
resources, especially those that are not distributed on the Web (e.g. newsletters distributed as e-mails)
are not harvested by Web archiving techniques and are collected by other means that are not described
nor analysed in this Technical Report.
2 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
2.1
access
successful request of a library-provided online service
Note 1 to entry: An access is one cycle of user activities that typically starts when a user connects to a library-
provided online service and ends by a terminating activity that is either explicit (by leaving the database through
log-out or exit) or implicit (timeout due to user inactivity).
Note 2 to entry: Accesses to the library website are counted as virtual visits.
Note 3 to entry: Requests of a general entrance or gateway page are excluded.
Note 4 to entry: If possible, requests by search engines are excluded.
[SOURCE: ISO 2789:2013, definition 2.2.1]
2.2
access tool
specialist software used to find, retrieve and replay archived Internet resources
Note 1 to entry: This may be implemented by a number of separate software packages working together.
2.3
administrative metadata
information necessary to allow the proper management of the digital objects in a repository
Note 1 to entry: Administrative metadata can be divided into the following categories:
— context or provenance metadata: describe the lifecycle of a resource to a point, including the related entities
and processes, e.g. configuration and log files;
— technical metadata: describe the technical characteristics of a digital object, e.g. its format;
— rights metadata: define the ownership and the legally permitted usage of an object.
2.4
archive
Web archive
entire set of resources crawled from the Web over time, comprising one or more collections
2.5
bit stream
series of 0 and 1 digits that constitutes a digital file
2.6
budget (crawl)
limitation associated with a crawl or individual seeds, which can be expressed in e.g. number of files,
volume of data, or the time to be spent per crawl as defined in the crawler settings
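EXAMPLE A minimal Python sketch of how a crawl budget might be checked, assuming three optional limits (number of files, volume of data, time). The class name `CrawlBudget` and its fields are hypothetical and do not correspond to the settings of any particular crawler.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrawlBudget:
    """Hypothetical crawl budget; None means 'no limit' for that dimension."""
    max_files: Optional[int] = None
    max_bytes: Optional[int] = None
    max_seconds: Optional[float] = None

    def exhausted(self, files: int, bytes_: int, seconds: float) -> bool:
        """Return True as soon as any configured limit has been reached."""
        checks = (
            (self.max_files, files),
            (self.max_bytes, bytes_),
            (self.max_seconds, seconds),
        )
        return any(limit is not None and used >= limit for limit, used in checks)


# Assumed limits for illustration: 1000 files or 50 MiB, no time limit.
budget = CrawlBudget(max_files=1000, max_bytes=50 * 1024 * 1024)
below_limit = budget.exhausted(files=999, bytes_=10_000, seconds=3600.0)
at_limit = budget.exhausted(files=1000, bytes_=10_000, seconds=3600.0)
```

A budget of this kind could be attached either to a whole crawl or to an individual seed, as the definition allows.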
2.7
bulk crawl
bulk harvest
crawl aimed at collecting the entirety of a single or multiple top level domain(s) or a subset(s)
Note 1 to entry: In comparison with selective crawls, bulk crawls have a wider scope and are typically performed
less frequently.
Note 2 to entry: Bulk crawls generally result in large scale Web archives, making it impossible to conduct detailed
quality assurance on every resource. Quality assurance is therefore often done through sampling.
2.8
capture
instance
copy of a resource crawled at a certain point in time
Note 1 to entry: If a resource has been crawled three times on different dates, there will be three captures.
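EXAMPLE A short Python sketch of counting captures per resource, assuming crawl records are available as (URL, crawl date) pairs. The URLs and dates are invented for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical crawl records: (URL, date on which the copy was crawled).
crawled = [
    ("http://example.org/", date(2013, 1, 5)),
    ("http://example.org/", date(2013, 6, 5)),
    ("http://example.org/about", date(2013, 6, 5)),
    ("http://example.org/", date(2013, 12, 1)),
]

# Group the crawl dates by URL; each date is one capture of that resource.
captures = defaultdict(list)
for url, crawl_date in crawled:
    captures[url].append(crawl_date)

capture_counts = {url: len(dates) for url, dates in captures.items()}
```

Here `http://example.org/` was crawled on three different dates and therefore has three captures.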
2.9
collection
Web archive collection
cohesive resources presented as a group
Note 1 to entry: A collection can either be selected specifically prior to harvesting (e. g. an event, a topic) or pulled
together retrospectively from available resources in the archive.
Note 2 to entry: A Web archive may consist of one or more collections.
2.10
crawl
harvest
process of browsing and copying resources using a crawler
Note 1 to entry: Crawls can be categorised as bulk or selective crawls.
2.11
crawl settings
crawl parameters
definition of which resources should be collected and the frequency and depth required for each set of seeds
Note 1 to entry: Crawl settings also include crawler politeness (number of requests per second or minute sent to
the server hosting the resource), compliance with robots.txt and filters to exclude crawler traps.
2.12
crawler
harvester
archiving crawler
DEPRECATED: spider
software that will successively request URLs and parse the resulting resource for further URLs
Note 1 to entry: Resources may be stored and URLs discarded in accordance with a predefined set of rules [see
crawl settings (2.11) and scope (crawl) (2.40)].
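EXAMPLE A minimal Python sketch of the request-and-parse loop described in this entry. It is an illustration under stated assumptions, not the behaviour of any specific harvesting software: the functions `fetch` and `in_scope` are hypothetical parameters, and a small in-memory dictionary (`FAKE_WEB`) stands in for real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, in_scope):
    """Breadth-first crawl: request a URL, store it, parse it for further URLs."""
    stored = {}
    queue = deque([seed])
    seen = {seed}
    while queue:
        url = queue.popleft()
        body = fetch(url)  # returns HTML text, or None if nothing was returned
        if body is None:
            continue
        stored[url] = body
        parser = LinkExtractor()
        parser.feed(body)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative references
            if target not in seen and in_scope(target):
                seen.add(target)
                queue.append(target)
    return stored


# A toy in-memory "Web" replaces network access in this sketch.
FAKE_WEB = {
    "http://example.org/": '<a href="/a">A</a> <a href="http://other.example/">X</a>',
    "http://example.org/a": '<a href="/">home</a> <a href="b">B</a>',
    "http://example.org/b": "no links here",
}

archive = crawl(
    "http://example.org/",
    fetch=FAKE_WEB.get,
    in_scope=lambda u: u.startswith("http://example.org/"),
)
```

The `in_scope` rule is what causes the URL on `other.example` to be discarded rather than requested.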
2.13
crawler trap
Web page (or series thereof) which will cause a crawler to either crash or endlessly follow references to
other resources deemed to be of little or no value
Note 1 to entry: Crawler traps could be put in place intentionally to prevent crawlers from harvesting resources.
This could also occur inadvertently for example when a crawler follows dates of a calendar endlessly.
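EXAMPLE A Python sketch of one possible filter against crawler traps, assuming that unusually deep paths or repeated path segments are signs of a trap. The function name `looks_like_trap` and both thresholds are hypothetical; real crawl settings may use quite different rules.

```python
from urllib.parse import urlsplit


def looks_like_trap(url, max_depth=8, max_repeats=2):
    """Heuristic only: flag very deep paths or paths that repeat a segment."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    return any(segments.count(s) > max_repeats for s in set(segments))


# An endlessly "next"-linked calendar is flagged; an ordinary dated page is not.
trap = looks_like_trap("http://example.org/cal/next/next/next/next")
normal = looks_like_trap("http://example.org/news/2013/12/")
```

A heuristic of this kind can produce false positives, so excluded URLs are worth sampling during quality control.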
2.14
curator tool
application that runs on top of a Web crawler and supports the harvesting processes
Note 1 to entry: A core function is the management of targets and the associated descriptive and administrative
metadata. It may also include components for scheduling and quality control.
2.15
data mining
computational process that extracts patterns by analysing quantitative data from different perspectives
and dimensions, categorizing it, and summarizing potential relationships and impacts
[SOURCE: ISO 16439:—, definition 3.13]
2.16
deep Web
DEPRECATED: hidden Web
DEPRECATED: invisible Web
part of the Web which cannot be crawled and indexed by search engines, notably consisting of resources
which are dynamically generated or password protected
2.17
descriptive metadata
information describing the intellectual content of a digital object
2.18
domain name
identification string that defines a realm of administrative autonomy, authority, or control on the
Internet, defined by the rules and procedures of the domain name system (DNS)
2.19
domain name system
DNS
hierarchical, distributed global naming system used to identify entities connected to the Internet
Note 1 to entry: The Top Level Domains (TLDs) are the highest in the hierarchy.
2.20
emulation
recreation of the functionality and behaviour of an obsolete system, using software (called emulator) on
current computer systems
Note 1 to entry: Emulation is a key digital preservation strategy.
2.21
host
portion of a URI that names the network source of the content
Note 1 to entry: A host is typically a domain name such as www.archive.org, or a subdomain such as web.archive.org.
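EXAMPLE A Python sketch of extracting the host from a URI with the standard library and counting resources per host. The list of URIs is invented for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(uri):
    """Return the host portion of a URI in lower case, or None if absent."""
    return urlsplit(uri).hostname


# Hypothetical URIs of harvested resources.
uris = [
    "http://www.archive.org/index.php",
    "http://web.archive.org/web/2013/http://www.iso.org/",
    "http://WWW.ARCHIVE.ORG/about/",
]

resources_per_host = Counter(host_of(u) for u in uris)
```

Because `hostname` is returned in lower case, the first and third URIs are counted under the same host.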
2.22
HTML
Hypertext Markup Language
the main mark-up language for Web pages, consisting of elements which are used to add structural and
semantic information to raw text
2.23
HTTP
Hypertext Transfer Protocol
client/server communication protocol used to transfer information on the Web
2.24
hyperlink
link
relationship structure used to link information on the Internet
2.25
junk
spam
unsolicited contents which are deemed to be of no relevance or long-term value
Note 1 to entry: Intentional spam is commonly used to manipulate search engine indexes. Junk can also be
generated inadvertently when a crawler falls in a crawler trap.
Note 2 to entry: Collecting institutions in general try to avoid collecting junk and spam so that resources can be
used to harvest “good” resources. Some, however, keep a small sample of this as a part of the record of the Web.
2.26
link mining
processing and analysis that focus on extracting patterns and heuristics from hyperlinks, e. g. to draw
network graphs
2.27
live Web leakage
common problem in rendering archived resources, which occurs when links in an archived resource
resolve to the current copy on the live site, instead of to the archival version within a Web archive
Note 1 to entry: Live Web leakage also occurs when scripts on archived Web pages continue to reference, and
successfully request, live Web resources within the archival rendering. This may cause live Web social media
feeds or streaming videos, for example, to appear in the archived webpage.
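EXAMPLE A Python sketch of detecting possible live Web leakage in an archived page, assuming that the access tool rewrites every archived reference to begin with a known replay prefix. The prefix `ARCHIVE_PREFIX` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical prefix under which the access tool serves archived copies.
ARCHIVE_PREFIX = "https://archive.example/web/"


class RefCollector(HTMLParser):
    """Collects href and src attribute values from all elements."""

    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.refs.append(value)


def leaking_refs(archived_html):
    """Return absolute references that point outside the archive."""
    collector = RefCollector()
    collector.feed(archived_html)
    return [
        ref for ref in collector.refs
        if ref.startswith(("http://", "https://"))
        and not ref.startswith(ARCHIVE_PREFIX)
    ]


page = (
    '<a href="https://archive.example/web/2013/http://example.org/a">ok</a>'
    '<script src="http://live.example/feed.js"></script>'
)
leaks = leaking_refs(page)
```

This static check only covers references present in the markup; requests generated by scripts at rendering time would need to be observed in a browser.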
2.28
log file
file automatically created by a server that maintains a record of its activities
2.29
metadata
data describing context, content and structure of digital objects and their management through time
[SOURCE: ISO 15489-1:2001, definition 2.12]
Note 1 to entry: Metadata can be categorised as descriptive, structural and administrative metadata.
2.30
migration
conversion of older or obsolete file formats to newer or current ones for the purpose of maintaining the
accessibility of a digital object
Note 1 to entry: Migration is a key preservation strategy.
[SOURCE: ISO 15489-1:2001, definition 3.13]
2.31
MIME type
Internet media type
content type
two-part identifier for file formats on the Internet
Note 1 to entry: MIME (Multipurpose Internet Mail Extensions) uses the content-type header, consisting of a type
and a subtype, to indicate the format of a resource, e. g. image/jpeg.
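EXAMPLE A Python sketch of splitting a content-type header value into its type and subtype and tallying resources by format. The function name `split_mime` and the sample header values are assumptions made for illustration.

```python
from collections import Counter


def split_mime(content_type):
    """Split a content-type value into (type, subtype), ignoring parameters."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    type_, _, subtype = media_type.partition("/")
    return type_, subtype


# Hypothetical content-type headers recorded during a crawl.
headers = ["image/jpeg", "text/html; charset=UTF-8", "TEXT/HTML", "application/pdf"]

resources_by_format = Counter(split_mime(h) for h in headers)
```

Lower-casing the value before counting means `TEXT/HTML` and `text/html` are treated as the same format.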
2.32
nomination
candidate resource to be considered for inclusion in a Web archive
2.33
page
Web page
structured resource, which in addition to any human-readable content, contains zero or more
relationships with other resources and is identified by a URL
2.34
permission
authorization to crawl a live website and/or to publicly display its content on a Web archive
Note 1 to entry: Permission can be expressed by a formal licence from the rights holder or exempted by the virtue
of legal deposit.
2.35
registered user
person or organization registered with a library in order to use its collection and/or services within or
away from the library
Note 1 to entry: Users can be registered upon their request or automatically when enrolling in the institution.
Note 2 to entry: The registration is monitored at regular intervals, at least every three years, so that inactive
users can be removed from the register.
[SOURCE: ISO 2789:2013, definition 2.2.28]
2.36
request
HTTP-formatted message sent by a requesting system (e.g. a browser or a crawler) to a remote server
for a particular resource identified by a URL
2.37
response
answer by a remote server to an HTTP request for a resource, containing either the requested resource,
a redirection to another URL or a negative (error) response, indicating why the requested resource
could not be returned
2.38
response code
status code
three-digit number indicating to the requesting system the status of the requested resource
Note 1 to entry: Codes starting with a 4 (4xx), for example, indicate that the requested resource is not available.
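EXAMPLE A Python sketch that groups response codes by their first digit, using the standard library's `http.HTTPStatus` for the reason phrases. The class labels in `STATUS_CLASSES` and the function name `status_class` are chosen here for illustration.

```python
from http import HTTPStatus

# Labels for the five classes of response code, keyed by first digit.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code):
    """Return the class of a three-digit response code."""
    return STATUS_CLASSES.get(code // 100, "unknown")


for code in (200, 301, 404, 503):
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```

Counting captures per class gives a quick overview of how many requests in a crawl returned the resource, a redirection or an error.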
2.39
robots.txt
robots exclusion standard
protocol used to prevent Web crawlers from accessing all or part of a website
Note 1 to entry: robots.txt is not legally binding.
Note 2 to entry: It may also be used to request a minimum delay between consecutive requests or even to provide
a link to a site map to facilitate better crawling of the site.
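EXAMPLE A Python sketch of evaluating a robots.txt file with the standard library module `urllib.robotparser`. The file content, the URLs and the user agent name `example-archiver` are invented for illustration; `site_maps()` requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a crawler would normally download it
# from the root of the website before requesting other resources.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
Sitemap: http://example.org/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

agent = "example-archiver"
allowed = parser.can_fetch(agent, "http://example.org/index.html")
blocked = parser.can_fetch(agent, "http://example.org/private/report.html")
delay = parser.crawl_delay(agent)   # requested minimum delay in seconds
sitemaps = parser.site_maps()       # list of site map URLs, or None
```

Whether a crawler honours these rules is a crawl setting; institutions harvesting under legal deposit sometimes choose to ignore them.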
2.40
scope (crawl)
set of parameters which defines the extent of a crawl, e. g. the maximum number of hops or the maximum
path depth the crawler should follow
Note 1 to entry: The scope of a crawl can be as broad as a whole top level domain (e. g. .de) or as narrow as a single file.
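EXAMPLE A Python sketch of a scope test that combines a domain restriction with a maximum number of hops and a maximum path depth. The function name `in_scope` and its parameters are hypothetical and are not taken from any particular crawler.

```python
from urllib.parse import urlsplit


def in_scope(url, hops, domain, max_hops, max_path_depth):
    """Return True if the URL lies within the domain and both limits."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    depth = len([s for s in parts.path.split("/") if s])
    in_domain = host == domain or host.endswith("." + domain)
    return in_domain and hops <= max_hops and depth <= max_path_depth


# Scope assumed for illustration: the whole .de top level domain,
# at most 3 hops from a seed and at most 5 path segments deep.
inside = in_scope("http://www.example.de/a/b", 2, "de", 3, 5)
too_far = in_scope("http://www.example.de/a/b", 4, "de", 3, 5)
other_tld = in_scope("http://www.example.com/", 1, "de", 3, 5)
```

Passing a full host name as `domain` and small limits narrows the same test to a single website or even a single file.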
2.41
scope (Web archive)
extent of a Web archive or collection, as determined by the institutional legal mandate or collection policy
2.42
second level domain
subdivisions within the top level domains for specific categories of organisations or areas of interest
(e. g. .gov.uk for governmental websites, .asso.fr for associations’ websites)
2.43
seed
targeted URL
URL corresponding to the location of a particular resource to be crawled, used as a starting point by
a Web crawler
2.44
selection
curatorial decision-making process which determines whether a meaningful set of resources is in scope
for a Web archive, judged against its collection development policy
2.45
selective crawl
selective harvest
crawl aimed at collecting resources selected according to certain criteria
Note 1 to entry: In comparison with bulk crawls, selective crawls have a narrower scope and are typically
performed more frequently.
Note 2 to entry: Selective continuous crawls are crawls aimed at collecting resources selected according to certain
criteria, such as scholarly importance, relevance to a subject or continuous update frequency of the resource.
Note 3 to entry: Selective event crawls are time-bound crawls, which end at a certain date, aimed at collecting
resources related to unique events, such as elections, sport events and disasters.
2.46
structural metadata
information that describes how compound objects are constructed together to make up logical units
2.47
target
meaningful set of resources to be collected as defined by one or more seeds and the associated crawl settings
2.48
top level domain
TLD
highest
...
