ISO 28500:2017
Information and documentation — WARC file format
ISO 28500:2017 specifies the WARC file format:
- to store both the payload content and control information from mainstream Internet application layer protocols, such as the HTTP, DNS, and FTP;
- to store arbitrary metadata linked to other stored data (e.g. subject classifier, discovered language, encoding);
- to support data compression and maintain data record integrity;
- to store all control information from the harvesting protocol (e.g. request headers), not just response information;
- to store the results of data transformations linked to other stored data;
- to store a duplicate detection event linked to other stored data (to reduce storage in the presence of identical or substantially similar resources);
- to be extended without disruption to existing functionality;
- to support handling of overly long records by truncation or segmentation, where desired.
Information et documentation — Format de fichier WARC
SLOVENIAN STANDARD
01-September-2018
Replaces: SIST ISO 28500:2009
Information and documentation -- WARC file format
Information et documentation -- Format de fichier WARC
This Slovenian standard is identical to: ISO 28500:2017
ICS:
35.240.30 IT applications in information, documentation and publishing
2003-01. Slovenian Institute for Standardization. Reproduction of this standard, in whole or in part, is not permitted.
INTERNATIONAL STANDARD ISO 28500
Second edition
2017-08
Information and documentation —
WARC file format
Information et documentation — Format de fichier WARC
© ISO 2017, Published in Switzerland
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized otherwise in any form
or by any means, electronic or mechanical, including photocopying, or posting on the internet or an intranet, without prior
written permission. Permission can be requested from either ISO at the address below or ISO’s member body in the country of
the requester.
ISO copyright office
Ch. de Blandonnet 8 • CP 401
CH-1214 Vernier, Geneva, Switzerland
Tel. +41 22 749 01 11
Fax +41 22 749 09 47
copyright@iso.org
www.iso.org
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms, definitions and abbreviated terms
4 File and record model
5 Named fields
5.1 General
5.2 WARC-Record-ID (mandatory)
5.3 Content-Length (mandatory)
5.4 WARC-Date (mandatory)
5.5 WARC-Type (mandatory)
5.6 Content-Type
5.7 WARC-Concurrent-To
5.8 WARC-Block-Digest
5.9 WARC-Payload-Digest
5.10 WARC-IP-Address
5.11 WARC-Refers-To
5.12 WARC-Refers-To-Target-URI
5.13 WARC-Refers-To-Date
5.14 WARC-Target-URI
5.15 WARC-Truncated
5.16 WARC-Warcinfo-ID
5.17 WARC-Filename
5.18 WARC-Profile
5.19 WARC-Identified-Payload-Type
5.20 WARC-Segment-Number
5.21 WARC-Segment-Origin-ID
5.22 WARC-Segment-Total-Length
6 WARC record types
6.1 General
6.2 ‘warcinfo’
6.3 ‘response’
6.3.1 General
6.3.2 ‘http’ and ‘https’ schemes
6.3.3 Other URI schemes
6.4 ‘resource’
6.4.1 General
6.4.2 ‘http’ and ‘https’ schemes
6.4.3 ‘ftp’ scheme
6.4.4 ‘dns’ scheme
6.4.5 Other URI schemes
6.5 ‘request’
6.5.1 General
6.5.2 ‘http’ and ‘https’ schemes
6.5.3 Other URI schemes
6.6 ‘metadata’
6.7 ‘revisit’
6.7.1 General
6.7.2 Profile: Identical Payload Digest
6.7.3 Profile: Server Not Modified
6.7.4 Other profiles
6.8 ‘conversion’
6.9 ‘continuation’
7 Record segmentation
8 WARC file name, size and compression
Annex A (informative) Use cases for writing WARC records
Annex B (informative) Examples of WARC records
Annex C (informative) WARC file size and name recommendations
Annex D (informative) Compression recommendations
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards
bodies (ISO member bodies). The work of preparing International Standards is normally carried out
through ISO technical committees. Each member body interested in a subject for which a technical
committee has been established has the right to be represented on that committee. International
organizations, governmental and non-governmental, in liaison with ISO, also take part in the work.
ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of
electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are
described in the ISO/IEC Directives, Part 1. In particular the different approval criteria needed for the
different types of ISO documents should be noted. This document was drafted in accordance with the
editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives).
Attention is drawn to the possibility that some of the elements of this document may be the subject of
patent rights. ISO shall not be held responsible for identifying any or all such patent rights. Details of
any patent rights identified during the development of the document will be in the Introduction and/or
on the ISO list of patent declarations received (see www.iso.org/patents).
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation on the voluntary nature of standards, the meaning of ISO specific terms and
expressions related to conformity assessment, as well as information about ISO’s adherence to the
World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT) see the following
URL: www.iso.org/iso/foreword.html.
This document was prepared by Technical Committee ISO/TC 46, Information and documentation,
Subcommittee 4, Technical interoperability.
This second edition cancels and replaces the first edition (ISO 28500:2009), which has been technically
revised.
Introduction
Websites and web pages emerge and disappear from the World Wide Web every day. For the past 10
years, memory storage organizations have tried to find the most appropriate ways to collect and keep
track of this vast quantity of important material using web-scale tools such as web crawlers. A web
crawler is a program that browses the web in an automated manner according to a set of policies;
starting with a list of URLs, it saves each page identified by a URL, finds all the hyperlinks in the page
(e.g. links to other pages, images, videos, scripting or style instructions, etc.), and adds them to the list
of URLs to visit recursively. Storing and managing the billions of saved web page objects itself presents
a challenge.
At the same time, those same organizations have a rising need to archive large numbers of digital
files not necessarily captured from the web (e.g. entire series of electronic journals, or data generated
by environmental sensing equipment). A general requirement that appears to be emerging is for a
container format that permits one file simply and safely to carry a very large number of constituent
data objects for the purpose of storage, management, and exchange. Those data objects (or resources)
need to be of unrestricted type (including many binary types for audio, CAD, compressed files, etc.), but
fortunately the container needs only minimal knowledge of the nature of the objects.
The WARC (Web ARChive) file format offers a convention for concatenating multiple resource records
(data objects), each consisting of a set of simple text headers and an arbitrary data block into one long
file. The WARC format is an extension of the ARC file format (ARC) that has traditionally been used to
store “web crawls” as sequences of content blocks harvested from the World Wide Web. Each capture
in an ARC file is preceded by a one-line header that very briefly describes the harvested content and its
length. This is directly followed by the retrieval protocol response messages and content. The original
ARC format file has been used by the Internet Archive (IA) since 1996 for managing billions of objects,
and by several national libraries.
The motivation to extend the ARC format arose from the discussion and experiences of the International
Internet Preservation Consortium (IIPC), whose members include the national libraries of Australia,
Canada, Denmark, Finland, France, Iceland, Italy, Norway, Sweden, The British Library (UK), The
Library of Congress (USA), and the Internet Archive (IA). The California Digital Library and the Los
Alamos National Laboratory also provided input on extending and generalizing the format.
The WARC format offers a standard way to structure, manage and store billions of resources collected
from the web and elsewhere. It is used to build applications for harvesting, managing, accessing, mining
and exchanging content. While it represents the unique standard format for web archives, it has been
adopted beyond the web archiving community to store born-digital or digitized materials. The way
WARC files will be created and resources stored and rendered will depend on software and applications
implementations.
Besides the primary content recorded in ARCs, the extended WARC format accommodates related
secondary content, such as assigned metadata, abbreviated duplicate detection events, later-date
transformations, and segmentation of large resources. The extension may also be useful for more
general applications than web archiving. To aid the development of tools that are backwards compatible,
WARC content is clearly distinguishable from pre-revision ARC content.
The WARC file format is made sufficiently different from the legacy ARC format files so that software
tools can unambiguously detect and correctly process both WARC and ARC records; given the large
amount of existing archival data in the previous ARC format, it is important that access and use of this
legacy not be interrupted when transitioning to the WARC format.
INTERNATIONAL STANDARD ISO 28500:2017(E)
Information and documentation — WARC file format
1 Scope
This document specifies the WARC file format:
— to store both the payload content and control information from mainstream Internet application
layer protocols, such as the HTTP, DNS, and FTP;
— to store arbitrary metadata linked to other stored data (e.g. subject classifier, discovered language,
encoding);
— to support data compression and maintain data record integrity;
— to store all control information from the harvesting protocol (e.g. request headers), not just response
information;
— to store the results of data transformations linked to other stored data;
— to store a duplicate detection event linked to other stored data (to reduce storage in the presence of
identical or substantially similar resources);
— to be extended without disruption to existing functionality;
— to support handling of overly long records by truncation or segmentation, where desired.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content
constitutes requirements of this document. For dated references, only the edition cited applies. For
undated references, the latest edition of the referenced document (including any amendments) applies.
RFC1035 Mockapetris, P. Domain names - Implementation and specification, STD 13, November 1987.
Available at: https://www.ietf.org/rfc/rfc1035.txt
RFC2045 Freed, N. and Borenstein, N. Multipurpose Internet Mail Extensions (MIME) Part One: Format
of Internet Message Bodies, November 1996. Available at: https://www.ietf.org/rfc/rfc2045.txt
RFC2540 Eastlake, D. Detached Domain Name System (DNS) Information, March 1999.
Available at: https://tools.ietf.org/html/rfc2540
RFC2616 Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P. and Berners-Lee, T.
Hypertext Transfer Protocol -- HTTP/1.1, June 1999. Available at: https://www.ietf.org/rfc/rfc2616.txt
RFC3629 Yergeau, F. UTF-8, a transformation format of ISO 10646, STD 63, November 2003.
Available at: https://tools.ietf.org/html/rfc3629
RFC3986 Berners-Lee, T., Fielding, R., Masinter, L. Uniform Resource Identifier (URI): Generic Syntax,
STD 66, January 2005. Available at: https://www.ietf.org/rfc/rfc3986.txt
RFC4027 Josefsson, S. Domain Name System Media Types, April 2005.
Available at: https://tools.ietf.org/html/rfc4027
RFC4291 Hinden, R. and Deering, S. IP Version 6 Addressing Architecture, February 2006.
Available at: https://tools.ietf.org/html/rfc4291
RFC5322 Resnick, P. (ed.) Internet Message Format, October 2008.
Available at: https://tools.ietf.org/html/rfc5322
3 Terms, definitions and abbreviated terms
3.1 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
ISO and IEC maintain terminological databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at http://www.iso.org/obp
— IEC Electropedia: available at http://www.electropedia.org/
3.1.1
WARC record
basic constituent of a WARC file; a WARC file consists of a sequence of WARC records
3.1.2
WARC record content block
part (zero or more octets) of a WARC record that follows the header and that forms the main body of a
WARC record
3.1.3
WARC record payload
data object referred to, or contained by a WARC record as a meaningful subset of the content block
3.1.4
WARC record header
beginning of a WARC record, consisting of one first line declaring the record to be in the WARC format
with a given version number, followed by lines of named fields up to a blank line
3.1.5
WARC named fields
set of elements consisting of a name, a colon, and a value, with long values continued on indented lines
3.1.6
WARC logical record
record composed of multiple segments, each represented by a WARC record
3.2 Abbreviated terms
ABNF        augmented Backus-Naur form
ARC         archive
CRLF        carriage return line feed
DNS         domain name system
FTP         file transfer protocol
HTTP        hypertext transfer protocol
IANA        Internet Assigned Numbers Authority
IESG        Internet Engineering Steering Group
RFC         request for comments
UR(I/L/N)   uniform resource (identifier/locator/name)
WARC        web archive
4 File and record model
A WARC format file is the simple concatenation of one or more WARC records. The first record usually
describes the records to follow. In general, record content is either the direct result of a retrieval
attempt (web pages, inline images, URL redirection information, DNS hostname lookup results, stand-
alone files, etc.) or is synthesized material (e.g. metadata, transformed content) that provides additional
information about archived content.
A WARC record shall consist of a record header followed by a record content block and two new lines.
The WARC record header shall consist of one first line declaring the record to be in the WARC format
with a given version number, then a variable number of line-oriented named fields terminated by a
blank line. The WARC record header format shall follow the general rules of HTTP/1.1 [RFC2616] and
[RFC5322] headers with one major exception: it shall also allow UTF-8 characters, as specified in
[RFC3629].
The top-level view of a WARC file can be expressed in an ABNF grammar, reusing the augmented
constructs defined in section 2.1 of HTTP/1.1 [RFC2616]. (In particular, note that to avoid the risk
of confusion, where any WARC rule has the same name as an [RFC2616] rule, the definition here has
been made the same, except in the case of the CHAR rule, which in WARC includes multibyte UTF-8
characters.)
warc-file = 1*warc-record
warc-record = header CRLF
block CRLF CRLF
header = version warc-fields
version = "WARC/1.1" CRLF
warc-fields = *named-field CRLF
block = *OCTET
The record version shall appear first in every record and hence shall also begin the WARC file itself.
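The record layout described by the grammar above can be illustrated with a minimal Python sketch. The function name build_record and the sample field values are assumptions made for this illustration and are not part of this document; the sketch follows the common reading of the grammar in which each named field ends with CRLF.

```python
CRLF = b"\r\n"


def build_record(fields, block=b"", version="WARC/1.1"):
    """Serialize one record: version line, named fields, blank line, block, CRLF CRLF."""
    # Content-Length is derived from the block so that the two cannot disagree.
    fields = [(n, v) for n, v in fields if n.lower() != "content-length"]
    fields.append(("Content-Length", str(len(block))))
    header = version.encode("ascii") + CRLF
    for name, value in fields:
        # Field values may contain UTF-8 characters.
        header += f"{name}: {value}".encode("utf-8") + CRLF
    # A blank line ends the header; two CRLFs follow the block.
    return header + CRLF + block + CRLF + CRLF


# Illustrative values only (hypothetical record ID and target URI).
record = build_record(
    [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", "<urn:uuid:00000000-0000-0000-0000-000000000001>"),
        ("WARC-Date", "2017-08-01T00:00:00Z"),
        ("WARC-Target-URI", "file:///example.txt"),
        ("Content-Type", "text/plain"),
    ],
    b"hello",
)
```

A WARC file is then simply the concatenation of such byte strings, one per record.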
The WARC record relies heavily on named fields. Each named field consists of a name followed by a
colon (“:”) and the field value. Field names are not case-sensitive. The field value may be preceded
by any amount of linear white space (LWS), though a single space is preferred. Header fields can be
extended over multiple lines by preceding each extra line with at least one space or tab character.
Named fields may appear in any order and field values may contain any UTF-8 character. Both defined-
fields and extension-fields follow the generic named-field format. Extension-fields may be used in
extensions of the core format.
named-field   = field-name ":" [ field-value ]
field-name    = token
field-value   = *( field-content | LWS )   ; further qualified
                                           ; by field
                                           ; definitions
field-content = <the OCTETs making up the field-value
                and consisting of either *TEXT or combinations
                of token, separators, and quoted-string>
OCTET         = <any 8-bit sequence of data>
token         = 1*<any US-ASCII character
                except CTLs or separators>
separators    = "(" | ")" | "<" | ">" | "@"
              | "," | ";" | ":" | "\" | <">
              | "/" | "[" | "]" | "?" | "="
              | "{" | "}" | SP | HT
TEXT          = <any OCTET except CTLs,
                but including LWS>
CHAR          = <UTF-8 characters; RFC3629>   ; (0-191, 194-244)
DIGIT         = <any US-ASCII digit "0".."9">
CTL           = <any US-ASCII control character
                (octets 0 - 31) and DEL (127)>
CR            = <ASCII CR, carriage return>   ; (13)
LF            = <ASCII LF, linefeed>          ; (10)
SP            = <ASCII SP, space>             ; (32)
HT            = <ASCII HT, horizontal-tab>    ; (9)
CRLF          = CR LF
LWS           = [CRLF] 1*( SP | HT )          ; semantics same as
                                              ; single SP
quoted-string = ( <"> *(qdtext | quoted-pair ) <"> )
qdtext        = <any TEXT except <">>
quoted-pair   = "\" CHAR                      ; single-character quoting
uri           = <'URI' per RFC3986>
Although UTF-8 characters are allowed, the ‘encoded-word’ mechanism of [RFC2047] may also be used
when writing WARC fields and shall also be understood by WARC reading software.
NOTE In WARC 1.0 standard (ISO 28500:2009), uri was defined as “<” <ʹURIʹ per RFC3986> “>”. This rule has
been changed to meet requests from implementers.
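The named-field rules (case-insensitive names, optional white space before the value, continuation lines starting with a space or tab) can be sketched as a small Python parser. This is an illustrative sketch only: the function name parse_named_fields and the extension field X-Long-Value are hypothetical, and the ‘encoded-word’ mechanism is not handled.

```python
def parse_named_fields(header_text):
    """Parse named fields into a list of (lower-cased name, value) pairs."""
    fields = []
    for line in header_text.split("\r\n"):
        if not line:
            break  # a blank line terminates the named fields
        if line[0] in " \t":
            if not fields:
                raise ValueError("continuation line before any field")
            # Folded line: LWS has the same semantics as a single space.
            name, value = fields[-1]
            fields[-1] = (name, (value + " " + line.strip()).strip())
            continue
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a named field: {line!r}")
        # Field names are not case-sensitive, so normalize them.
        fields.append((name.lower(), value.strip()))
    return fields


example = (
    "WARC-Type: response\r\n"
    "warc-target-uri:http://example.com/\r\n"
    "X-Long-Value: first part\r\n"
    "\tsecond part\r\n"
    "\r\n"
)
parsed = parse_named_fields(example)
```

A list of pairs is used rather than a dictionary so that a field which may legitimately repeat is not silently overwritten.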
The rest of the WARC record grammar concerns defined-field parameters such as record identifier,
record type, creation time, content length, and content type.
defined-field = WARC-Type
| WARC-Record-ID
| WARC-Date
| Content-Length
| Content-Type
| WARC-Concurrent-To
| WARC-Block-Digest
| WARC-Payload-Digest
| WARC-IP-Address
| WARC-Refers-To
| WARC-Refers-To-Target-URI
| WARC-Refers-To-Date
| WARC-Target-URI
| WARC-Truncated
| WARC-Warcinfo-ID
| WARC-Filename ; warcinfo only
| WARC-Profile ; revisit only
| WARC-Identified-Payload-Type
| WARC-Segment-Origin-ID ; continuation only
| WARC-Segment-Number
| WARC-Segment-Total-Length ; continuation only
Every WARC record shall have a type, reported in the WARC-Type field. Eight WARC record types are
defined in this document as follows:
— ‘warcinfo’;
— ‘response’;
— ‘resource’;
— ‘request’;
— ‘metadata’;
— ‘revisit’;
— ‘conversion’;
— ‘continuation’.
The relevant fields for each record type are described in detail in Clause 6. Each field’s meaning and
legal value format are described in Clause 5.
The record block shall contain octet content, interpreted based on the record type and other header
values. All records shall include a Content-Length field to specify the length of the block.
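Because every record states the length of its block, a reader can walk a WARC file record by record without interpreting the blocks. The following Python sketch assumes an uncompressed binary stream and, to stay short, ignores folded header lines and keeps only the last value of a repeated field; read_records is a hypothetical name, and the sample records omit some mandatory fields for brevity.

```python
import io


def read_records(stream):
    """Yield (version, fields, block) for each record in a binary stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of file
        if not version.startswith(b"WARC/"):
            raise ValueError(f"expected version line, got {version!r}")
        fields = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b""):
                break  # blank line ends the header
            name, _, value = line.decode("utf-8").partition(":")
            fields[name.strip().lower()] = value.strip()
        # Content-Length gives the exact number of octets in the block.
        length = int(fields["content-length"])
        block = stream.read(length)
        if len(block) != length:
            raise ValueError("block shorter than Content-Length")
        if stream.read(4) != b"\r\n\r\n":
            raise ValueError("record not terminated by CRLF CRLF")
        yield version.strip().decode("ascii"), fields, block


sample = (
    b"WARC/1.1\r\nWARC-Type: resource\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n"
    b"WARC/1.1\r\nWARC-Type: metadata\r\nContent-Length: 0\r\n\r\n\r\n\r\n"
)
records = list(read_records(io.BytesIO(sample)))
```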
Some record types (and possibly future record types) also define a payload, such as a meaningful subset
of the block or content from a predecessor record. Some headers pertain to the payload of a record
rather than the block directly.
For example, in a ‘response’ record with a content block consisting of HTTP headers and a data
object, the payload would be the data object. All ‘response’, ‘resource’, ‘request’, ‘revisit’, ‘conversion’
and ‘continuation’ records may have a payload. All ‘warcinfo’ and ‘metadata’ records shall not have a
payload.
Content matching the warc-file rule shall have the MIME content-type “application/warc”, as specified
in Clause 8.
Content matching only the warc-fields rule is useful as a simple descriptive format, and has MIME
content-type “application/warc-fields”, as specified in Clause 8.
New named fields and new record types may be defined in extensions of the core format. However, it
is strongly recommended to discuss any addition beforehand, to verify that a suitable field or type
does not already exist and so avoid collisions. Discussion should notably be held within the IIPC.
See Reference [11] for more information.
5 Named fields
5.1 General
Named fields within a WARC record provide information about the current record. WARC both reuses
appropriate headers from other standards and defines new headers, all beginning “WARC-”, for WARC-
specific purposes.
WARC named fields of the same type shall not be repeated in the same WARC record (for example, a
WARC record shall not have several WARC-Date or several WARC-Target-URI), except as noted (e.g.
WARC-Concurrent-To).
Because new fields may be defined in extensions to the core WARC format, WARC processing software
shall ignore fields with unrecognized names.
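The repetition rule can be checked mechanically. The sketch below assumes that fields are available as (name, value) pairs and that WARC-Concurrent-To is the only repeatable field being considered; the name repeated_fields and the placeholder values are hypothetical.

```python
from collections import Counter

# Fields that this sketch allows to appear more than once in a record.
REPEATABLE = {"warc-concurrent-to"}


def repeated_fields(fields):
    """Return the lower-cased names that are repeated but not allowed to be."""
    counts = Counter(name.lower() for name, _ in fields)
    return sorted(n for n, c in counts.items() if c > 1 and n not in REPEATABLE)


fields = [
    ("WARC-Date", "2017-08-01T00:00:00Z"),
    ("warc-date", "2017-08-01T00:00:01Z"),
    ("WARC-Concurrent-To", "<urn:example:1>"),
    ("WARC-Concurrent-To", "<urn:example:2>"),
    ("X-Unknown-Field", "ignored by readers that do not recognize it"),
]
problems = repeated_fields(fields)
```

Unrecognized names such as the hypothetical X-Unknown-Field are simply passed over, in line with the requirement to ignore them.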
5.2 WARC-Record-ID (mandatory)
A WARC-Record-ID is an identifier assigned to the current record that is globally unique for its period
of intended use. No identifier scheme is mandated by this specification, but each WARC-Record-ID shall
be a legal URI and clearly indicate a documented and registered scheme to which it conforms (e.g. via
a URI scheme prefix such as “http:” or “urn:”). Care should be taken to ensure that this value is written
with no internal white space.
WARC-Record-ID = "WARC-Record-ID" ":" "<" uri ">"
All records shall have a WARC-Record-ID field.
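As no identifier scheme is mandated, the following Python sketch shows one possible choice, a random UUID expressed as a "urn:uuid:" URI. The function name new_record_id is an assumption for this illustration.

```python
import uuid


def new_record_id():
    """Return a WARC-Record-ID field value using the urn:uuid scheme."""
    # uuid4().urn yields "urn:uuid:xxxxxxxx-xxxx-..." with no internal white space.
    return f"<{uuid.uuid4().urn}>"


record_id = new_record_id()
field_line = f"WARC-Record-ID: {record_id}"
```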
5.3 Content-Length (mandatory)
The Content-Length is the number of octets in the block, similar to [RFC2616]. If no block is present, a
value of “0” (zero) shall be used.
Content-Length = "Content-Length" ":" 1*DIGIT
All records shall have a Content-Length field.
5.4 WARC-Date (mandatory)
The WARC-Date is a UTC timestamp as described in ISO 8601 [1], for example YYYY-MM-DDThh:mm:ssZ.
The timestamp shall represent the instant that data capture for record creation began. Multiple
records written as part of a single capture event (see 5.7) shall use the same WARC-Date, even though
the times of their writing will not be exactly synchronized.
WARC-Date may be specified at any of the levels of granularity described in [1]. If WARC-Date includes
a decimal fraction of a second, the decimal fraction of a second shall have a minimum of 1 digit and a
maximum of 9 digits. WARC-Date should be specified with as much precision as is accurately known.
This document recommends no particular algorithm for access software to choose a record by date
when an exact match is not available.
WARC-Date = "WARC-Date" ":" iso8601
iso8601 =
All records shall have a WARC-Date field.
See Annex A for examples on usage of WARC-Date fields.
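A minimal Python sketch of producing and checking such timestamps follows. It covers only the full date-and-time form with an optional fraction of 1 to 9 digits; coarser granularities are not handled, and the names format_warc_date and is_full_warc_date are assumptions for this illustration.

```python
import re
from datetime import datetime, timezone

# Full timestamp, optionally with a fraction of a second of 1 to 9 digits.
WARC_DATE = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d{1,9})?Z")


def format_warc_date(moment):
    """Format a timezone-aware datetime as a UTC WARC-Date value."""
    moment = moment.astimezone(timezone.utc)
    text = moment.strftime("%Y-%m-%dT%H:%M:%S")
    if moment.microsecond:
        # Keep only the digits that carry information.
        text += f".{moment.microsecond:06d}".rstrip("0")
    return text + "Z"


def is_full_warc_date(value):
    """Check the full date-and-time form, including the fraction digit limits."""
    return WARC_DATE.fullmatch(value) is not None


whole = format_warc_date(datetime(2017, 8, 1, 12, 30, 0, tzinfo=timezone.utc))
fractional = format_warc_date(datetime(2017, 8, 1, 12, 30, 0, 250000, tzinfo=timezone.utc))
```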
5.5 WARC-Type (mandatory)
WARC-Type is the type of WARC record. Record types defined in this document are
— ‘warcinfo’,
— ‘response’,
— ‘resource’,
— ‘request’,
— ‘metadata’,
— ‘revisit’,
— ‘conversion’, and
— ‘continuation’.
Other types of WARC records may be defined in extensions of the core format. Types are further
described in Clause 6.
A WARC file need not contain any particular record type, though starting all WARC files with a
‘warcinfo’ record is recommended.
WARC-Type = "WARC-Type" ":" record-type
record-type = "warcinfo" | "response" | "resource"
| "request" | "metadata" | "revisit"
| "conversion" | "continuation"
All records shall have a WARC-Type field.
WARC processing software shall ignore records of unrecognized type.
See Annex A for examples on usage of WARC-Type fields.
5.6 Content-Type
The Content-Type field is the MIME type (as defined in [RFC2045]) of information contained
in the record’s block. For example, in HTTP request and response records, this would be
‘application/http’ as specified in 19.1 of [RFC2616] (or ‘application/http; msgtype=request’ and
‘application/http; msgtype=response’ respectively). In particular, the content-type is not the value of
the HTTP Content-Type header in an HTTP response but a MIME type to describe the full archived HTTP
message (hence ‘application/http’ if the block contains request or response headers).
Content-Type = "Content-Type" ":" media-type
media-type = type "/" subtype *(";" parameter)
type = token
subtype = token
parameter = attribute "=" value
attribute = token
value = token | quoted-string
All records with a non-empty block (non-zero Content-Length), except ‘continuation’ records, should
have a Content-Type field. A reader may attempt to guess the media type via inspection of the content
and/or the name extension(s) of the URI used to identify the resource, but only if the media type is
not given by a Content-Type field. If the media type remains unknown, the reader should treat it as type
“application/octet-stream”.
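The fallback order described above can be sketched in Python. The sketch guesses only from the name extension of the URI, using the standard mimetypes module, and does not inspect the content; the function name effective_media_type is an assumption.

```python
import mimetypes


def effective_media_type(fields, target_uri=None):
    """Return the declared media type, else a guess, else application/octet-stream."""
    declared = fields.get("content-type")
    if declared:
        return declared  # a declared Content-Type always wins
    if target_uri:
        guessed, _encoding = mimetypes.guess_type(target_uri)
        if guessed:
            return guessed
    return "application/octet-stream"


declared = effective_media_type(
    {"content-type": "application/http; msgtype=response"},
    "http://example.com/page.html",
)
guessed = effective_media_type({}, "http://example.com/page.html")
unknown = effective_media_type({}, None)
```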
5.7 WARC-Concurrent-To
The WARC-Concurrent-To field (or fields) contains the WARC-Record-ID of any records created as part of
the same capture event as the current record. A capture event comprises the information automatically
gathered by a retrieval against a single WARC-Target-URI; for example, it may be represented by a
‘response’ or ‘revisit’ record plus its associated ‘request’ record.
WARC-Concurrent-To = "WARC-Concurrent-To" ":" "<" uri ">"
This field may be used to associate records of types ‘request’, ‘response’, ‘resource’, ‘metadata’, and
‘revisit’ with one another when they arise from a single capture event. (When so used, any WARC-
Concurrent-To association shall be considered bidirectional even if the header only appears on one
record.) The WARC-Concurrent-To field shall not be used in ‘warcinfo’, ‘conversion’, and ‘continuation’
records.
As an exception to the general rule, several WARC-Concurrent-To fields may be repeated within the
same WARC record.
See Annex A for examples on usage of WARC-Concurrent-To fields.
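Since the association is bidirectional even when the field appears on only one record, a reader may want to build a symmetric lookup table. The Python sketch below assumes records are given as pairs of a record ID and the list of its WARC-Concurrent-To values; concurrent_map and the "urn:example:" identifiers are hypothetical placeholders.

```python
from collections import defaultdict


def concurrent_map(records):
    """Map each record ID to the set of IDs it is concurrent with, in both directions."""
    links = defaultdict(set)
    for record_id, concurrent_to in records:
        links[record_id]  # make sure every record has an entry
        for other in concurrent_to:
            links[record_id].add(other)
            links[other].add(record_id)  # the reverse direction is implied
    return dict(links)


links = concurrent_map(
    [
        ("<urn:example:request>", ["<urn:example:response>"]),
        ("<urn:example:response>", []),
        ("<urn:example:unrelated>", []),
    ]
)
```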
5.8 WARC-Block-Digest
The WARC-Block-Digest is an optional parameter indicating the algorithm name and calculated value of
a digest applied to the full block of the record.
WARC-Block-Digest = "WARC-Block-Digest" ":" labelled-digest
labelled-digest = algorithm ":" digest-value
algorithm = token
digest-value = token
An example is a SHA-1 labelled Base32 ([RFC4648]) value:
WARC-Block-Digest: sha1:AB2CD3EF4GH5IJ6KL7MN8OPQ
No particular algorithm is recommended.
Any record may have a WARC-Block-Digest field.
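A labelled digest of this shape can be computed with the Python standard library. SHA-1 with Base32 encoding is used here only because it matches the example above; no algorithm is recommended, and the function name labelled_digest is an assumption.

```python
import base64
import hashlib


def labelled_digest(data, algorithm="sha1"):
    """Return 'algorithm:value' with the digest value Base32-encoded."""
    digest = hashlib.new(algorithm, data).digest()
    return f"{algorithm}:{base64.b32encode(digest).decode('ascii')}"


block = b"hello"
field_line = f"WARC-Block-Digest: {labelled_digest(block)}"
```

A SHA-1 digest is 20 octets long, so its Base32 form is 32 characters with no padding.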
5.9 WARC-Payload-Digest
A WARC-Payload-Digest is an optional parameter indicating the algorithm name and calculated value
of a digest applied to the payload referred to or contained by the record, which is not necessarily
equivalent to the record block.
WARC-Payload-Digest = "WARC-Payload-Digest" ":" labelled-digest
An example is a SHA-1 labelled Base32 ([RFC4648]) value:
WARC-Payload-Digest: sha1:3EF4GH5IJ6KL7MN8OPQAB2CD
No particular algorithm is recommended.
The payload of an application/http block is its ‘entity-body’ (specified in [RFC2616]). In contrast to
WARC-Block-Digest, the WARC-Payload-Digest field may also be used for data not actually present in the
current record block, for example when a block is left off in accordance with a ‘revisit’ profile (see 6.7),
or when a record is segmented (the WARC-Payload-Digest recorded in the segments of a segmented
record shall be the digest of the payload of the logical record).
The WARC-Payload-Digest field may be used on WARC records with a well-defined payload and shall
not be used on records without a well-defined payload.
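The difference between the block digest and the payload digest of an application/http block can be shown with a short Python sketch. It takes the entity-body to be everything after the first blank line of the HTTP message and does not address transfer codings; the function names are assumptions.

```python
import base64
import hashlib


def sha1_base32(data):
    """Return a SHA-1 labelled Base32 digest value."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


def http_entity_body(block):
    """Return the part of an HTTP message that follows the header section."""
    _head, separator, body = block.partition(b"\r\n\r\n")
    if not separator:
        raise ValueError("no blank line ending the HTTP header section")
    return body


block = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
block_digest = sha1_base32(block)                       # covers headers and body
payload_digest = sha1_base32(http_entity_body(block))   # covers the body only
```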
5.10 WARC-IP-Address
The WARC-IP-Address is the numeric Internet address contacted to retrieve any included content.
An IPv4 address shall be written as a “dotted quad”; an IPv6 address shall be written as specified in
[RFC4291]. For an HTTP retrieval, this will be the IP address used at retrieval time corresponding to the
hostname in the record’s target-URI.
WARC-IP-Address = "WARC-IP-Address" ":" (ipv4 | ipv6)
ipv4 = <"dotted quad">
ipv6 =
The WARC-IP-Address field may be used on ‘response’, ‘resource’, ‘request’, ‘metadata’, and ‘revisit’
records, but shall not be used on ‘warcinfo’, ‘conversion’ or ‘continuation’ records.
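A writer can validate and normalize the address before emitting the field. The Python sketch below relies on the standard ipaddress module, which accepts dotted-quad IPv4 and textual IPv6 addresses; the function name ip_address_field is an assumption.

```python
import ipaddress


def ip_address_field(address):
    """Return a WARC-IP-Address field line, or raise ValueError for a bad address."""
    parsed = ipaddress.ip_address(address)  # raises ValueError if not IPv4 or IPv6
    return f"WARC-IP-Address: {parsed}"


v4_line = ip_address_field("192.0.2.1")
v6_line = ip_address_field("2001:0db8:0:0:0:0:0:1")
```

The module prints IPv6 addresses in compressed form, so the second value is written as 2001:db8::1.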
5.11 WARC-Refers-To
The WARC-Refers-To field contains the WARC-Record-ID of a single record for which the present record
holds additional content.
WARC-Refers-To = "WARC-Refers-To" ":" "<" uri ">"
The WARC-Refers-To field may be used to associate a ‘metadata’ record to another record it describes.
The WARC-Refers-To field may also be used to associate a record of type ‘revisit’ or ‘conversion’ with
the preceding record which helped determine the present record content. The WARC-Refers-To field
shall not be used in ‘warcinfo’, ‘response’, ‘resource’, ‘request’, and ‘continuation’ records.
See Annex A for examples on usage of WARC-Refers-To fields.
5.12 WARC-Refers-To-Target-URI
The WARC-Refers-To-Target-URI field contains the WARC-Target-URI of the record of which the present
record is considered a revisit.
WARC-Refers-To-Target-URI = "WARC-Refers-To-Target-URI" ":" uri
The WARC-Refers-To-Target-URI field may be used to associate a record of type ‘revisit’ with another
record which contains the resource which has not been archived.
The WARC-Refers-To-Target-URI field may be used in ‘revisit’ records and shall not be used in ‘warcinfo’,
‘response’, ‘metadata’, ‘conversion’, ‘resource’, ‘request’, and ‘continuation’ records.
5.13 WARC-Refers-To-Date
The WARC-Refers-To-Date field contains the WARC-Date of the record of which the present record is
considered a revisit.
WARC-Refers-To-Date = "WARC-Refers-To-Date" ":" iso8601
iso8601 =
The WARC-Refers-To-Date field may be used to associate a record of type ‘revisit’ with another record
which contains the resource which has not been archived.
The WARC-Refers-To-Date field may be used in ‘revisit’ records and shall not be used in ‘warcinfo’,
‘response’, ‘metadata’, ‘conversion’, ‘resource’, ‘request’, and ‘continuation’ records.
5.14 WARC-Target-URI
The WARC-Target-URI is the original URI whose capture gave rise to the information content in this
record. In the context of web harvesting, this is the URI that was the target of a crawler’s retrieval
request. For a ‘revisit’ record, it is the URI that was the target of a retrieval request. Indirectly, such as
for a ‘metadata’, or ‘conversion’ record, it is a copy of the WARC-Target-URI appearing in the original
record to which the newer record pertains. The URI in this value shall be written as specified in
[RFC3986].
WARC-Target-URI = "WARC-Target-URI" ":" uri
All ‘response’, ‘resource’, ‘request’, ‘revisit’, ‘conversion’ and ‘continuation’ records shall have a WARC-
Target-URI field. A ‘metadata’ record may have a WARC-Target-URI field. A ‘warcinfo’ record shall not
have a WARC-Target-URI field.
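The per-type rule for WARC-Target-URI lends itself to a simple check. The following Python sketch assumes the record type and the list of field names are already known; target_uri_problem is a hypothetical name.

```python
# Record types that shall have a WARC-Target-URI field.
REQUIRED = frozenset(
    {"response", "resource", "request", "revisit", "conversion", "continuation"}
)
# Record types that shall not have one.
FORBIDDEN = frozenset({"warcinfo"})


def target_uri_problem(record_type, field_names):
    """Return a description of the violation, or None if the record is acceptable."""
    has_uri = "warc-target-uri" in {name.lower() for name in field_names}
    if record_type in REQUIRED and not has_uri:
        return f"'{record_type}' record is missing WARC-Target-URI"
    if record_type in FORBIDDEN and has_uri:
        return f"'{record_type}' record shall not have WARC-Target-URI"
    return None  # 'metadata' records may or may not carry the field
```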
5.15 WARC-Truncated
For practical reasons, writers of the WARC format may place limits on the time or storage allocated
to archiving a single resource. As a result, only a truncated portion of the original resource may be
available for saving into a WARC record.
Any record may indicate that truncation of its content block has occurred and give the reason with a
WARC-Truncated field.
WARC-Truncated = "WARC-Truncated" ":" reason-token
reason-token = "length" ; exceeds configured max
; length
| "time" ; exceeds configured max time
| "disconnect" ; network disconnect
| "unspecified" ; other/unknown reason
Other reasons may be defined in extensions of the core format.
For example, if the capture of what appeared to be a multi-gigabyte resource was cut short after a
transfer time limit was reached, the partial resource could be saved to a WARC record with this field.
The WARC-Truncated field may be used on any WARC record. The WARC Content-Length field shall still
report the actual truncated size of the record block.
5.16 WARC-Warcinfo-ID
When present, the WARC-Warcinfo-ID indicates the WARC-Record-ID of the associa
...
INTERNATIONAL STANDARD ISO 28500
Second edition
2017-08
Information and documentation —
WARC file format
Information et documentation — Format de fichier WARC
© ISO 2017, Published in Switzerland
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized otherwise in any form
or by any means, electronic or mechanical, including photocopying, or posting on the internet or an intranet, without prior
written permission. Permission can be requested from either ISO at the address below or ISO’s member body in the country of
the requester.
ISO copyright office
Ch. de Blandonnet 8 • CP 401
CH-1214 Vernier, Geneva, Switzerland
Tel. +41 22 749 01 11
Fax +41 22 749 09 47
copyright@iso.org
www.iso.org
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms, definitions and abbreviated terms
4 File and record model
5 Named fields
5.1 General
5.2 WARC-Record-ID (mandatory)
5.3 Content-Length (mandatory)
5.4 WARC-Date (mandatory)
5.5 WARC-Type (mandatory)
5.6 Content-Type
5.7 WARC-Concurrent-To
5.8 WARC-Block-Digest
5.9 WARC-Payload-Digest
5.10 WARC-IP-Address
5.11 WARC-Refers-To
5.12 WARC-Refers-To-Target-URI
5.13 WARC-Refers-To-Date
5.14 WARC-Target-URI
5.15 WARC-Truncated
5.16 WARC-Warcinfo-ID
5.17 WARC-Filename
5.18 WARC-Profile
5.19 WARC-Identified-Payload-Type
5.20 WARC-Segment-Number
5.21 WARC-Segment-Origin-ID
5.22 WARC-Segment-Total-Length
6 WARC record types
6.1 General
6.2 ‘warcinfo’
6.3 ‘response’
6.3.1 General
6.3.2 ‘http’ and ‘https’ schemes
6.3.3 Other URI schemes
6.4 ‘resource’
6.4.1 General
6.4.2 ‘http’ and ‘https’ schemes
6.4.3 ‘ftp’ scheme
6.4.4 ‘dns’ scheme
6.4.5 Other URI schemes
6.5 ‘request’
6.5.1 General
6.5.2 ‘http’ and ‘https’ schemes
6.5.3 Other URI schemes
6.6 ‘metadata’
6.7 ‘revisit’
6.7.1 General
6.7.2 Profile: Identical Payload Digest
6.7.3 Profile: Server Not Modified
6.7.4 Other profiles
6.8 ‘conversion’
6.9 ‘continuation’
7 Record segmentation
8 WARC file name, size and compression
Annex A (informative) Use cases for writing WARC records
Annex B (informative) Examples of WARC records
Annex C (informative) WARC file size and name recommendations
Annex D (informative) Compression recommendations
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards
bodies (ISO member bodies). The work of preparing International Standards is normally carried out
through ISO technical committees. Each member body interested in a subject for which a technical
committee has been established has the right to be represented on that committee. International
organizations, governmental and non-governmental, in liaison with ISO, also take part in the work.
ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of
electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are
described in the ISO/IEC Directives, Part 1. In particular the different approval criteria needed for the
different types of ISO documents should be noted. This document was drafted in accordance with the
editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives).
Attention is drawn to the possibility that some of the elements of this document may be the subject of
patent rights. ISO shall not be held responsible for identifying any or all such patent rights. Details of
any patent rights identified during the development of the document will be in the Introduction and/or
on the ISO list of patent declarations received (see www.iso.org/patents).
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation on the voluntary nature of standards, the meaning of ISO specific terms and
expressions related to conformity assessment, as well as information about ISO’s adherence to the
World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT), see the following
URL: www.iso.org/iso/foreword.html.
This document was prepared by Technical Committee ISO/TC 46, Information and documentation,
Subcommittee 4, Technical interoperability.
This second edition cancels and replaces the first edition (ISO 28500:2009), which has been technically
revised.
Introduction
Websites and web pages emerge and disappear from the World Wide Web every day. For the past 10
years, memory storage organizations have tried to find the most appropriate ways to collect and keep
track of this vast quantity of important material using web-scale tools such as web crawlers. A web
crawler is a program that browses the web in an automated manner according to a set of policies;
starting with a list of URLs, it saves each page identified by a URL, finds all the hyperlinks in the page
(e.g. links to other pages, images, videos, scripting or style instructions, etc.), and adds them to the list
of URLs to visit recursively. Storing and managing the billions of saved web page objects itself presents
a challenge.
At the same time, those same organizations have a rising need to archive large numbers of digital
files not necessarily captured from the web (e.g. entire series of electronic journals, or data generated
by environmental sensing equipment). A general requirement that appears to be emerging is for a
container format that permits one file simply and safely to carry a very large number of constituent
data objects for the purpose of storage, management, and exchange. Those data objects (or resources)
need to be of unrestricted type (including many binary types for audio, CAD, compressed files, etc.), but
fortunately the container needs only minimal knowledge of the nature of the objects.
The WARC (Web ARChive) file format offers a convention for concatenating multiple resource records
(data objects), each consisting of a set of simple text headers and an arbitrary data block into one long
file. The WARC format is an extension of the ARC file format (ARC) that has traditionally been used to
store “web crawls” as sequences of content blocks harvested from the World Wide Web. Each capture
in an ARC file is preceded by a one-line header that very briefly describes the harvested content and its
length. This is directly followed by the retrieval protocol response messages and content. The original
ARC format file has been used by the Internet Archive (IA) since 1996 for managing billions of objects,
and by several national libraries.
The motivation to extend the ARC format arose from the discussion and experiences of the International
Internet Preservation Consortium (IIPC), whose members include the national libraries of Australia,
Canada, Denmark, Finland, France, Iceland, Italy, Norway, Sweden, The British Library (UK), The
Library of Congress (USA), and the Internet Archive (IA). The California Digital Library and the Los
Alamos National Laboratory also provided input on extending and generalizing the format.
The WARC format offers a standard way to structure, manage and store billions of resources collected
from the web and elsewhere. It is used to build applications for harvesting, managing, accessing, mining
and exchanging content. While it represents the unique standard format for web archives, it has been
adopted beyond the web archiving community to store born-digital or digitized materials. The way
WARC files will be created and resources stored and rendered will depend on software and applications
implementations.
Besides the primary content recorded in ARCs, the extended WARC format accommodates related
secondary content, such as assigned metadata, abbreviated duplicate detection events, later-date
transformations, and segmentation of large resources. The extension may also be useful for more
general applications than web archiving. To aid the development of tools that are backwards compatible,
WARC content is clearly distinguishable from pre-revision ARC content.
The WARC file format is made sufficiently different from the legacy ARC format files so that software
tools can unambiguously detect and correctly process both WARC and ARC records; given the large
amount of existing archival data in the previous ARC format, it is important that access and use of this
legacy not be interrupted when transitioning to the WARC format.
INTERNATIONAL STANDARD ISO 28500:2017(E)
Information and documentation — WARC file format
1 Scope
This document specifies the WARC file format:
— to store both the payload content and control information from mainstream Internet application
layer protocols, such as HTTP, DNS, and FTP;
— to store arbitrary metadata linked to other stored data (e.g. subject classifier, discovered language,
encoding);
— to support data compression and maintain data record integrity;
— to store all control information from the harvesting protocol (e.g. request headers), not just response
information;
— to store the results of data transformations linked to other stored data;
— to store a duplicate detection event linked to other stored data (to reduce storage in the presence of
identical or substantially similar resources);
— to be extended without disruption to existing functionality;
— to support handling of overly long records by truncation or segmentation, where desired.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content
constitutes requirements of this document. For dated references, only the edition cited applies. For
undated references, the latest edition of the referenced document (including any amendments) applies.
RFC1035, Mockapetris, P. Domain names - implementation and specification. STD 13, November 1987. Available at: https://www.ietf.org/rfc/rfc1035.txt
RFC2045, Freed, N. and Borenstein, N. Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies. November 1996. Available at: https://www.ietf.org/rfc/rfc2045.txt
RFC2540, Eastlake, D. Detached Domain Name System (DNS) Information. March 1999. Available at: https://tools.ietf.org/html/rfc2540
RFC2616, Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P. and Berners-Lee, T. Hypertext Transfer Protocol -- HTTP/1.1. June 1999. Available at: https://www.ietf.org/rfc/rfc2616.txt
RFC3629, Yergeau, F. UTF-8, a transformation format of ISO 10646. STD 63, November 2003. Available at: https://tools.ietf.org/html/rfc3629
RFC3986, Berners-Lee, T., Fielding, R., Masinter, L. Uniform Resource Identifier (URI): Generic Syntax. STD 66, January 2005. Available at: https://www.ietf.org/rfc/rfc3986.txt
RFC4027, Josefsson, S. Domain Name System Media Types. April 2005. Available at: https://tools.ietf.org/html/rfc4027
RFC4291, Hinden, R. and Deering, S. IP Version 6 Addressing Architecture. February 2006. Available at: https://tools.ietf.org/html/rfc4291
RFC5322, Resnick, P. (ed.) Internet Message Format. October 2008. Available at: https://tools.ietf.org/html/rfc5322
3 Terms, definitions and abbreviated terms
3.1 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
ISO and IEC maintain terminological databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at http://www.iso.org/obp
— IEC Electropedia: available at http://www.electropedia.org/
3.1.1
WARC record
basic constituent of a WARC file, which itself consists of a sequence of WARC records
3.1.2
WARC record content block
part (zero or more octets) of a WARC record that follows the header and that forms the main body of a
WARC record
3.1.3
WARC record payload
data object referred to or contained by a WARC record, forming a meaningful subset of the content block
3.1.4
WARC record header
beginning of a WARC record, consisting of one first line declaring the record to be in the WARC format
with a given version number, followed by lines of named fields up to a blank line
3.1.5
WARC named fields
set of elements consisting of a name, a colon, and a value, with long values continued on indented lines
3.1.6
WARC logical record
record composed of multiple segments, each represented by a WARC record
3.2 Abbreviated terms
ABNF augmented Backus-Naur form
ARC archive
CRLF carriage return line feed
DNS domain name system
FTP file transfer protocol
HTTP hypertext transfer protocol
IANA Internet Assigned Numbers Authority
IESG Internet Engineering Steering Group
RFC request for comments
UR(I/L/N) uniform resource (identifier/locator/name)
WARC web archive
4 File and record model
A WARC format file is the simple concatenation of one or more WARC records. The first record usually
describes the records to follow. In general, record content is either the direct result of a retrieval
attempt (web pages, inline images, URL redirection information, DNS hostname lookup results,
stand-alone files, etc.) or is synthesized material (e.g. metadata, transformed content) that provides additional
information about archived content.
A WARC record shall consist of a record header followed by a record content block and two new lines.
The WARC record header shall consist of one first line declaring the record to be in the WARC format
with a given version number, then a variable number of line-oriented named fields terminated by a
blank line. The WARC record header format shall follow the general rules of HTTP/1.1 [RFC2616] and
[RFC5322] headers with one major exception: it shall also allow UTF-8 characters, as specified in
[RFC3629].
The top-level view of a WARC file can be expressed in an ABNF grammar, reusing the augmented
constructs defined in section 2.1 of HTTP/1.1 [RFC2616]. (In particular, note that to avoid the risk
of confusion, where any WARC rule has the same name as an [RFC2616] rule, the definition here has
been made the same, except in the case of the CHAR rule, which in WARC includes multibyte
UTF-8 characters.)
warc-file = 1*warc-record
warc-record = header CRLF
block CRLF CRLF
header = version warc-fields
version = "WARC/1.1" CRLF
warc-fields = *named-field CRLF
block = *OCTET
The record version shall appear first in every record and hence shall also begin the WARC file itself.
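The top-level grammar above can be illustrated with a small reader. The following is a minimal sketch, assuming an uncompressed stream and ignoring most error handling; the function name `read_records` is hypothetical and not part of this document.

```python
import io

def read_records(stream):
    """Yield (version, header_lines, block) tuples from a binary stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # clean end of file
        if not version.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line")
        header_lines = []
        while True:
            line = stream.readline()
            if line in (b"\r\n", b""):
                break  # the blank line closes warc-fields
            header_lines.append(line.rstrip(b"\r\n").decode("utf-8"))
        length = None
        for line in header_lines:
            if line[:1] in (" ", "\t"):
                continue  # continuation of the previous field
            name, _, value = line.partition(":")
            if name.strip().lower() == "content-length":
                length = int(value.strip())
        if length is None:
            raise ValueError("record has no Content-Length field")
        block = stream.read(length)
        if stream.read(4) != b"\r\n\r\n":
            raise ValueError("block is not followed by CRLF CRLF")
        yield version.strip().decode("ascii"), header_lines, block

sample = (
    b"WARC/1.1\r\nWARC-Type: resource\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n"
    b"WARC/1.1\r\nWARC-Type: resource\r\nContent-Length: 0\r\n\r\n\r\n\r\n"
)
for version, headers, block in read_records(io.BytesIO(sample)):
    print(version, headers, block)
```

The block is read by octet count rather than by scanning for a delimiter, because the block may itself contain CRLF sequences.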
The WARC record relies heavily on named fields. Each named field consists of a name followed by a
colon (“:”) and the field value. Field names are not case-sensitive. The field value may be preceded
by any amount of linear white space (LWS), though a single space is preferred. Header fields can be
extended over multiple lines by preceding each extra line with at least one space or tab character.
Named fields may appear in any order and field values may contain any UTF-8 character. Both
defined-fields and extension-fields follow the generic named-field format. Extension-fields may be used in
extensions of the core format.
named-field   = field-name ":" [ field-value ]
field-name    = token
field-value   = *(field-content | LWS) ; further qualified
                                       ; by field
                                       ; definitions
field-content = <the OCTETs making up the field-value
                and consisting of either *TEXT or combinations
                of token, separators, and quoted-string>
OCTET         = <any 8-bit sequence of data>
token         = 1*<any US-ASCII character
                except CTLs or separators>
separators    = "(" | ")" | "<" | ">" | "@"
              | "," | ";" | ":" | "\" | <">
              | "/" | "[" | "]" | "?" | "="
              | "{" | "}" | SP | HT
TEXT          = <any OCTET except CTLs,
                but including LWS>
CHAR          = <UTF-8 characters; RFC3629> ; (0-191, 194-244)
DIGIT         = <any US-ASCII digit "0".."9">
CTL           = <any US-ASCII control character
                (octets 0 - 31) and DEL (127)>
CR            = <ASCII CR, carriage return> ; (13)
LF            = <ASCII LF, linefeed> ; (10)
SP            = <ASCII SP, space> ; (32)
HT            = <ASCII HT, horizontal-tab> ; (9)
CRLF          = CR LF
LWS           = [CRLF] 1*(SP | HT) ; semantics same as
                                   ; single SP
quoted-string = ( <"> *(qdtext | quoted-pair) <"> )
qdtext        = <any TEXT except <">>
quoted-pair   = "\" CHAR ; single-character quoting
uri           = <'URI' per RFC3986>
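As an illustration of the named-field rules (case-insensitive names, optional leading white space, continuation lines), the following sketch parses header lines that have already been split on CRLF. The function name `parse_named_fields` is an assumption made for this example, and quoted-string and encoded-word handling are left out.

```python
def parse_named_fields(lines):
    """Return a list of (lowercased name, value) pairs from header lines."""
    fields = []
    for line in lines:
        if line[:1] in (" ", "\t") and fields:
            # Continuation line: LWS has the semantics of a single space.
            name, value = fields[-1]
            fields[-1] = (name, (value + " " + line.strip()).strip())
            continue
        name, separator, value = line.partition(":")
        if not separator:
            raise ValueError(f"not a named field: {line!r}")
        fields.append((name.strip().lower(), value.strip()))
    return fields

print(parse_named_fields([
    "WARC-Type: response",
    "warc-target-uri:http://example.org/",
    "X-Note: first",
    "\tsecond",
]))
```

A list of pairs is used instead of a dictionary so that repeatable fields such as WARC-Concurrent-To are not lost.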
Although UTF-8 characters are allowed, the ‘encoded-word’ mechanism of [RFC2047] may also be used
when writing WARC fields and shall also be understood by WARC reading software.
NOTE In WARC 1.0 standard (ISO 28500:2009), uri was defined as “<” <ʹURIʹ per RFC3986> “>”. This rule has
been changed to meet requests from implementers.
The rest of the WARC record grammar concerns defined-field parameters such as record identifier,
record type, creation time, content length, and content type.
defined-field = WARC-Type
| WARC-Record-ID
| WARC-Date
| Content-Length
| Content-Type
| WARC-Concurrent-To
| WARC-Block-Digest
| WARC-Payload-Digest
| WARC-IP-Address
| WARC-Refers-To
| WARC-Refers-To-Target-URI
| WARC-Refers-To-Date
| WARC-Target-URI
| WARC-Truncated
| WARC-Warcinfo-ID
| WARC-Filename ; warcinfo only
| WARC-Profile ; revisit only
| WARC-Identified-Payload-Type
| WARC-Segment-Origin-ID ; continuation only
| WARC-Segment-Number
| WARC-Segment-Total-Length ; continuation only
Every WARC record shall have a type, reported in the WARC-Type field. Eight WARC record types are
defined in this document as follows:
— ‘warcinfo’;
— ‘response’;
— ‘resource’;
— ‘request’;
— ‘metadata’;
— ‘revisit’;
— ‘conversion’;
— ‘continuation’.
The relevant fields for each record type are described in detail in Clause 6. Each field’s meaning and
legal value format are described in Clause 5.
The record block shall contain octet content, interpreted based on the record type and other header
values. All records shall include a Content-Length field to specify the length of the block.
Some record types (and possibly future record types) also define a payload, such as a meaningful subset
of the block or content from a predecessor record. Some headers pertain to the payload of a record
rather than the block directly.
For example, in a ‘response’ record with a content block consisting of HTTP headers and a data
object, the payload would be the data object. All ‘response’, ‘resource’, ‘request’, ‘revisit’, ‘conversion’
and ‘continuation’ records may have a payload. ‘warcinfo’ and ‘metadata’ records shall not have a
payload.
Content matching the warc-file rule shall have the MIME content-type “application/warc”, as specified
in Clause 8.
Content matching only the warc-fields rule is useful as a simple descriptive format, and has MIME
content-type “application/warc-fields”, as specified in Clause 8.
New named fields and new record types may be defined in extensions of the core format. However, it
is strongly recommended to discuss any addition to verify that a suitable field or type does not already
exist to avoid collision. Discussion should notably be held within the IIPC. See Reference [11] for more
information.
5 Named fields
5.1 General
Named fields within a WARC record provide information about the current record. WARC both reuses
appropriate headers from other standards and defines new headers, all beginning “WARC-”, for
WARC-specific purposes.
WARC named fields of the same type shall not be repeated in the same WARC record (for example, a
WARC record shall not have several WARC-Date or several WARC-Target-URI), except as noted (e.g.
WARC-Concurrent-To).
Because new fields may be defined in extensions to the core WARC format, WARC processing software
shall ignore fields with unrecognized names.
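The repetition rule can be checked mechanically. This is a minimal sketch, assuming fields are available as (name, value) pairs; the set `REPEATABLE` lists only the exception named in this clause, and the function name is hypothetical.

```python
from collections import Counter

# Only WARC-Concurrent-To is named here as repeatable.
REPEATABLE = {"warc-concurrent-to"}

def repeated_fields(fields):
    """Return the sorted names of fields that occur more than once but may not."""
    counts = Counter(name.lower() for name, _ in fields)
    return sorted(n for n, c in counts.items() if c > 1 and n not in REPEATABLE)

fields = [
    ("WARC-Date", "2017-08-01T00:00:00Z"),
    ("warc-date", "2017-08-02T00:00:00Z"),
    ("WARC-Concurrent-To", "<urn:uuid:a>"),
    ("WARC-Concurrent-To", "<urn:uuid:b>"),
    ("X-Unknown-Extension", "ignored by readers that do not know it"),
]
print(repeated_fields(fields))
```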
5.2 WARC-Record-ID (mandatory)
A WARC-Record-ID is an identifier assigned to the current record that is globally unique for its period
of intended use. No identifier scheme is mandated by this specification, but each WARC-Record-ID shall
be a legal URI and clearly indicate a documented and registered scheme to which it conforms (e.g. via
a URI scheme prefix such as “http:” or “urn:”). Care should be taken to ensure that this value is written
with no internal white space.
WARC-Record-ID = "WARC-Record-ID" ":" "<" uri ">"
All records shall have a WARC-Record-ID field.
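One way to obtain identifiers that satisfy these constraints is the `urn:uuid:` scheme; this is an illustrative choice for the sketch below, not a scheme mandated by this document.

```python
import uuid

def new_record_id():
    """Return a field value of the form <urn:uuid:...> with no white space."""
    return f"<urn:uuid:{uuid.uuid4()}>"

print("WARC-Record-ID: " + new_record_id())
```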
5.3 Content-Length (mandatory)
The Content-Length is the number of octets in the block, similar to [RFC2616]. If no block is present, a
value of “0” (zero) shall be used.
Content-Length = "Content-Length" ":" 1*DIGIT
All records shall have a Content-Length field.
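A writer can derive Content-Length from the block itself. The following sketch serializes one record according to the grammar in Clause 4; the function name `serialize_record` is hypothetical, and the caller is assumed to supply the other mandatory fields.

```python
def serialize_record(fields, block=b""):
    """Serialize (name, value) fields and a block into one WARC/1.1 record."""
    lines = ["WARC/1.1"]
    lines += [f"{name}: {value}" for name, value in fields]
    # Content-Length counts the octets of the block only, "0" if it is empty.
    lines.append(f"Content-Length: {len(block)}")
    header = "\r\n".join(lines).encode("utf-8") + b"\r\n\r\n"
    return header + block + b"\r\n\r\n"

print(serialize_record([("WARC-Type", "resource")], b"abc"))
```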
5.4 WARC-Date (mandatory)
The WARC-Date is a UTC timestamp as described in ISO 8601 [1], for example YYYY-MM-DDThh:mm:ssZ.
The timestamp shall represent the instant that data capture for record creation began. Multiple
records written as part of a single capture event (see 5.7) shall use the same WARC-Date, even though
the times of their writing will not be exactly synchronized.
WARC-Date may be specified at any of the levels of granularity described in [1]. If WARC-Date includes
a decimal fraction of a second, the decimal fraction of a second shall have a minimum of 1 digit and a
maximum of 9 digits. WARC-Date should be specified with as much precision as is accurately known.
This document recommends no particular algorithm for access software to choose a record by date
when an exact match is not available.
WARC-Date = "WARC-Date" ":" iso8601
iso8601 = <UTC timestamp per ISO 8601, e.g. YYYY-MM-DDThh:mm:ssZ>
All records shall have a WARC-Date field.
See Annex A for examples on usage of WARC-Date fields.
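The sketch below formats a capture instant as a WARC-Date and checks candidate values. The regular expression `WARC_DATE` is an assumption about the permitted levels of granularity (year, month, day, minutes, seconds, fractional seconds of 1 to 9 digits); it is not taken from this document and does not range-check the numbers.

```python
import re
from datetime import datetime, timezone

# Assumed granularities; the trailing "Z" is required once a time is given.
WARC_DATE = re.compile(
    r"^\d{4}(-\d{2}(-\d{2}(T\d{2}:\d{2}(:\d{2}(\.\d{1,9})?)?Z)?)?)?$"
)

def warc_date(moment):
    """Format an aware datetime as a second-precision UTC WARC-Date value."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

print(warc_date(datetime(2017, 8, 1, 12, 30, 5, tzinfo=timezone.utc)))
print(bool(WARC_DATE.match("2017-08-01T12:30:05.123Z")))
```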
5.5 WARC-Type (mandatory)
WARC-Type is the type of WARC record. Record types defined in this document are
— ‘warcinfo’,
— ‘response’,
— ‘resource’,
— ‘request’,
— ‘metadata’,
— ‘revisit’,
— ‘conversion’, and
— ‘continuation’.
Other types of WARC records may be defined in extensions of the core format. Types are further
described in Clause 6.
A WARC file need not contain any particular record types, though starting all WARC files with a
‘warcinfo’ record is recommended.
WARC-Type = "WARC-Type" ":" record-type
record-type = "warcinfo" | "response" | "resource"
| "request" | "metadata" | "revisit"
| "conversion" | "continuation"
All records shall have a WARC-Type field.
WARC processing software shall ignore records of unrecognized type.
See Annex A for examples on usage of WARC-Type fields.
5.6 Content-Type
The Content-Type field is the MIME type (as defined in [RFC2045]) of information contained
in the record’s block. For example, in HTTP request and response records, this would be
‘application/http’ as specified in 19.1 of [RFC2616] (or ‘application/http; msgtype=request’ and
‘application/http; msgtype=response’ respectively). In particular, the content-type is not the value of
the HTTP Content-Type header in an HTTP response but a MIME type to describe the full archived HTTP
message (hence ‘application/http’ if the block contains request or response headers).
Content-Type = "Content-Type" ":" media-type
media-type = type "/" subtype *(";" parameter)
type = token
subtype = token
parameter = attribute "=" value
attribute = token
value = token | quoted-string
All records with a non-empty block (non-zero Content-Length), except ‘continuation’ records, should
have a Content-Type field. Only if the media type is not given by a Content-Type field may a reader
attempt to guess the media type via inspection of its content and/or the name extension(s) of the URI
used to identify the resource. If the media type remains unknown, the reader should treat it as type
“application/octet-stream”.
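The fallback order described above could look as follows in a reader. This is a sketch only: fields are assumed to be held in a dictionary with lowercased names, guessing is limited to the URI name extension via Python's `mimetypes` module, and content inspection is omitted.

```python
import mimetypes
from urllib.parse import urlsplit

def effective_media_type(fields, default="application/octet-stream"):
    """Return the declared media type, else a guess from the URI, else the default."""
    declared = fields.get("content-type")
    if declared:
        return declared.split(";")[0].strip().lower()
    path = urlsplit(fields.get("warc-target-uri", "")).path
    guess, _encoding = mimetypes.guess_type(path)
    return guess or default

print(effective_media_type({"content-type": "application/http; msgtype=response"}))
print(effective_media_type({"warc-target-uri": "http://example.org/a.html"}))
print(effective_media_type({}))
```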
5.7 WARC-Concurrent-To
The WARC-Concurrent-To field (or fields) contains the WARC-Record-ID of any records created as part of
the same capture event as the current record. A capture event comprises the information automatically
gathered by a retrieval against a single WARC-Target-URI; for example, it may be represented by a
‘response’ or ‘revisit’ record plus its associated ‘request’ record.
WARC-Concurrent-To = "WARC-Concurrent-To" ":" "<" uri ">"
This field may be used to associate records of types ‘request’, ‘response’, ‘resource’, ‘metadata’, and
‘revisit’ with one another when they arise from a single capture event. (When so used, any
WARC-Concurrent-To association shall be considered bidirectional even if the header only appears on one
record.) The WARC-Concurrent-To field shall not be used in ‘warcinfo’, ‘conversion’, and ‘continuation’
records.
As an exception to the general rule, the WARC-Concurrent-To field may be repeated within the
same WARC record.
See Annex A for examples on usage of WARC-Concurrent-To fields.
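Because the association is bidirectional even when only one record carries the field, an index typically needs to store both directions. A minimal sketch, assuming each record is reduced to its WARC-Record-ID and the list of WARC-Concurrent-To values it carries (the function name is hypothetical):

```python
from collections import defaultdict

def concurrent_links(records):
    """Map each record ID to the set of IDs it is concurrent with."""
    links = defaultdict(set)
    for record_id, concurrent_ids in records:
        for other in concurrent_ids:
            links[record_id].add(other)
            links[other].add(record_id)  # reverse direction is implied
    return dict(links)

records = [
    ("<urn:uuid:response-1>", []),
    ("<urn:uuid:request-1>", ["<urn:uuid:response-1>"]),
]
print(concurrent_links(records))
```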
5.8 WARC-Block-Digest
The WARC-Block-Digest is an optional parameter indicating the algorithm name and calculated value of
a digest applied to the full block of the record.
WARC-Block-Digest = "WARC-Block-Digest" ":" labelled-digest
labelled-digest = algorithm ":" digest-value
algorithm = token
digest-value = token
An example is a SHA-1 labelled Base32 ([RFC4648]) value:
WARC-Block-Digest: sha1:AB2CD3EF4GH5IJ6KL7MN8OPQ
No particular algorithm is recommended.
Any record may have a WARC-Block-Digest field.
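A labelled digest in the style of the example can be produced with standard library calls. The sketch below uses SHA-1 with Base32 only to mirror the example; since no algorithm is recommended, the choice is an assumption, as is the function name.

```python
import base64
import hashlib

def labelled_digest(data, algorithm="sha1"):
    """Return 'algorithm:value' with a Base32-encoded digest of the octets."""
    digest = hashlib.new(algorithm, data).digest()
    return f"{algorithm}:{base64.b32encode(digest).decode('ascii')}"

block = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
print("WARC-Block-Digest: " + labelled_digest(block))
```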
5.9 WARC-Payload-Digest
A WARC-Payload-Digest is an optional parameter indicating the algorithm name and calculated value
of a digest applied to the payload referred to or contained by the record, which is not necessarily
equivalent to the record block.
WARC-Payload-Digest = "WARC-Payload-Digest" ":" labelled-digest
An example is a SHA-1 labelled Base32 ([RFC4648]) value:
WARC-Payload-Digest: sha1:3EF4GH5IJ6KL7MN8OPQAB2CD
No particular algorithm is recommended.
The payload of an application/http block is its ‘entity-body’ (specified in [RFC2616]). In contrast to
WARC-Block-Digest, the WARC-Payload-Digest field may also be used for data not actually present in the
current record block, for example when a block is left off in accordance with a ‘revisit’ profile (see 6.7),
or when a record is segmented (the WARC-Payload-Digest recorded in the segments of a segmented
record shall be the digest of the payload of the logical record).
The WARC-Payload-Digest field may be used on WARC records with a well-defined payload and shall
not be used on records without a well-defined payload.
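For an application/http block, the difference between the block and the payload can be shown by digesting only the octets after the HTTP header section. This sketch assumes the entity-body is everything after the first empty line and does not decode any transfer or content encoding; SHA-1 with Base32 is again an illustrative choice.

```python
import base64
import hashlib

def http_payload_digest(block):
    """Digest the octets that follow the HTTP headers of an application/http block."""
    _head, separator, entity_body = block.partition(b"\r\n\r\n")
    if not separator:
        raise ValueError("no blank line after the HTTP headers")
    digest = hashlib.sha1(entity_body).digest()
    return "sha1:" + base64.b32encode(digest).decode("ascii")

block = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
print("WARC-Payload-Digest: " + http_payload_digest(block))
```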
5.10 WARC-IP-Address
The WARC-IP-Address is the numeric Internet address contacted to retrieve any included content.
An IPv4 address shall be written as a “dotted quad”; an IPv6 address shall be written as specified in
[RFC4291]. For an HTTP retrieval, this will be the IP address used at retrieval time corresponding to the
hostname in the record’s target-URI.
WARC-IP-Address = "WARC-IP-Address" ":" (ipv4 | ipv6)
ipv4 = <"dotted quad">
ipv6 = <per section 2.2 of RFC4291>
The WARC-IP-Address field may be used on ‘response’, ‘resource’, ‘request’, ‘metadata’, and ‘revisit’
records, but shall not be used on ‘warcinfo’, ‘conversion’ or ‘continuation’ records.
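A writer might validate the address before emitting the field. The sketch below relies on Python's `ipaddress` module, which accepts dotted-quad IPv4 and textual IPv6 forms and rejects host names; the function name is hypothetical.

```python
import ipaddress

def ip_address_field(address):
    """Return a WARC-IP-Address header line, or raise ValueError for non-numeric input."""
    ip = ipaddress.ip_address(address)
    return f"WARC-IP-Address: {ip}"

print(ip_address_field("192.0.2.1"))
print(ip_address_field("2001:DB8::1"))
```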
5.11 WARC-Refers-To
The WARC-Refers-To field contains the WARC-Record-ID of a single record for which the present record
holds additional content.
WARC-Refers-To = "WARC-Refers-To" ":" "<" uri ">"
The WARC-Refers-To field may be used to associate a ‘metadata’ record to another record it describes.
The WARC-Refers-To field may also be used to associate a record of type ‘revisit’ or ‘conversion’ with
the preceding record which helped determine the present record content. The WARC-Refers-To field
shall not be used in ‘warcinfo’, ‘response’, ‘resource’, ‘request’, and ‘continuation’ records.
See Annex A for examples on usage of WARC-Refers-To fields.
5.12 WARC-Refers-To-Target-URI
The WARC-Refers-To-Target-URI field contains the WARC-Target-URI of the record of which the present
record is considered a revisit.
WARC-Refers-To-Target-URI = "WARC-Refers-To-Target-URI" ":" uri
The WARC-Refers-To-Target-URI field may be used to associate a record of type ‘revisit’ with another
record that contains the resource, which has not been archived again in the present record.
The WARC-Refers-To-Target-URI field may be used in ‘revisit’ records and shall not be used in ‘warcinfo’,
‘response’, ‘metadata’, ‘conversion’, ‘resource’, ‘request’, and ‘continuation’ records.
5.13 WARC-Refers-To-Date
The WARC-Refers-To-Date field contains the WARC-Date of the record of which the present record is
considered a revisit.
WARC-Refers-To-Date = "WARC-Refers-To-Date" ":" iso8601
iso8601 = <UTC timestamp per ISO 8601, e.g. YYYY-MM-DDThh:mm:ssZ>
The WARC-Refers-To-Date field may be used to associate a record of type ‘revisit’ with another record
that contains the resource, which has not been archived again in the present record.
The WARC-Refers-To-Date field may be used in ‘revisit’ records and shall not be used in ‘warcinfo’,
‘response’, ‘metadata’, ‘conversion’, ‘resource’, ‘request’, and ‘continuation’ records.
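Taken together, WARC-Refers-To-Target-URI and WARC-Refers-To-Date let access software locate the earlier capture without knowing its record ID. A minimal sketch, assuming an index keyed by (WARC-Target-URI, WARC-Date) and fields held in dictionaries with lowercased names; both structures are hypothetical.

```python
def resolve_revisit(revisit_fields, index):
    """Return the indexed record that a 'revisit' record points to, or None."""
    key = (
        revisit_fields.get("warc-refers-to-target-uri"),
        revisit_fields.get("warc-refers-to-date"),
    )
    return index.get(key)

index = {("http://example.org/", "2017-08-01T00:00:00Z"): "<urn:uuid:original>"}
revisit = {
    "warc-type": "revisit",
    "warc-target-uri": "http://example.org/",
    "warc-refers-to-target-uri": "http://example.org/",
    "warc-refers-to-date": "2017-08-01T00:00:00Z",
}
print(resolve_revisit(revisit, index))
```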
5.14 WARC-Target-URI
The WARC-Target-URI is the original URI whose capture gave rise to the information content in this
record. In the context of web harvesting, this is the URI that was the target of a crawler’s retrieval
request. For a ‘revisit’ record, it is the URI that was the target of a retrieval request. For records that
relate to a capture only indirectly, such as a ‘metadata’ or ‘conversion’ record, it is a copy of the WARC-Target-URI appearing in the original
record to which the newer record pertains. The URI in this value shall be written as specified in
[RFC3986].
WARC-Target-URI = "WARC-Target-URI" ":" uri
All ‘response’, ‘resource’, ‘request’, ‘revisit’, ‘conversion’ and ‘continuation’ records shall have a
WARC-Target-URI field. A ‘metadata’ record may have a WARC-Target-URI field. A ‘warcinfo’ record shall not
have a WARC-Target-URI field.
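The per-type rules stated in 5.7 and 5.10 to 5.14 lend themselves to a table-driven check. The sketch below encodes only the rules quoted in those subclauses; the table and function names are hypothetical, and other fields are not covered.

```python
RECORD_TYPES = {
    "warcinfo", "response", "resource", "request",
    "metadata", "revisit", "conversion", "continuation",
}

# Record types in which each field shall not be used.
NOT_ALLOWED = {
    "warc-concurrent-to": {"warcinfo", "conversion", "continuation"},
    "warc-ip-address": {"warcinfo", "conversion", "continuation"},
    "warc-refers-to": {"warcinfo", "response", "resource", "request", "continuation"},
    "warc-refers-to-target-uri": RECORD_TYPES - {"revisit"},
    "warc-refers-to-date": RECORD_TYPES - {"revisit"},
    "warc-target-uri": {"warcinfo"},
}

# Types that shall carry WARC-Target-URI ('metadata' may, 'warcinfo' shall not).
TARGET_URI_REQUIRED = RECORD_TYPES - {"warcinfo", "metadata"}

def field_problems(record_type, field_names):
    """Return human-readable violations for one record's field names."""
    names = {name.lower() for name in field_names}
    problems = [
        f"{name} not allowed in '{record_type}'"
        for name in sorted(names)
        if record_type in NOT_ALLOWED.get(name, ())
    ]
    if record_type in TARGET_URI_REQUIRED and "warc-target-uri" not in names:
        problems.append(f"warc-target-uri missing in '{record_type}'")
    return problems

print(field_problems("response", ["WARC-Refers-To-Date"]))
```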
5.15 WARC-Truncated
For practical reasons, writers of the WARC format may place limits on the time or storage allocated
to archiving a single resource. As a result, only a truncated portion of the original resource may be
available for saving into a WARC record.
Any record may indicate that truncation of its content block has occurred and give the reason with a
WARC-Truncated field.
WARC-Truncated = "WARC-Truncated" ":" reason-token
reason-token = "length" ; exceeds configured max
; length
| "time" ; exceeds configured max time
| "disconnect" ; network disconnect
| "unspecified" ; other/unknown reason
Other reasons may be defined in extensions of the core format.
For example, if the capture of what appeared to be a multi-gigabyte resource was cut short after a
transfer time limit was reached, the partial resource could be saved to a WARC record with this field.
The WARC-Truncated field may be used on any WARC record. The Content-Length field shall still
report the actual truncated size of the record block.
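A writer that enforces a length limit might apply it as follows. This is a sketch under the assumption of a simple byte limit; the function name and the returned dictionary of extra fields are hypothetical.

```python
def truncate_block(block, max_length):
    """Cut a block to max_length octets and return it with the fields to record."""
    if len(block) <= max_length:
        return block, {"Content-Length": str(len(block))}
    kept = block[:max_length]
    # Content-Length reports what was actually stored, not the original size.
    return kept, {"WARC-Truncated": "length", "Content-Length": str(len(kept))}

print(truncate_block(b"abcdef", 4))
```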
5.16 WARC-Warcinfo-ID
When present, the WARC-Warcinfo-ID indicates the WARC-Record-ID of the associated ‘warcinfo’
record for this record. Typically, the WARC-Warcinfo-ID parameter is used when the context of the
applicable ‘warcinfo’ record is unavailable, such as after distributing single records into separate WARC
files. WARC writing applications (such as web crawlers) may choose to always record this parameter.
WARC-Warcinfo-ID = "WARC-Warcinfo-ID" ":" "<" uri ">"
The WARC-Warcinfo-ID field value overrides any association with a previously occurring (in the WARC)
‘warcinfo’ record, thus providing a way to protect the true association when records are combined
from different WARC files.
The WARC-Warcinfo-ID field may be used in any record type except ‘warcinfo’.
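The override rule can be sketched as a single pass over the records of one file: the most recent ‘warcinfo’ record applies unless a record names another one explicitly. Records are assumed to be dictionaries with lowercased field names, and the function name is hypothetical.

```python
def associated_warcinfo(records):
    """Return, per record, the WARC-Record-ID of its 'warcinfo' record (or None)."""
    current = None
    result = []
    for fields in records:
        if fields.get("warc-type") == "warcinfo":
            current = fields.get("warc-record-id")
            result.append(None)  # a 'warcinfo' record has no such association
        else:
            # An explicit WARC-Warcinfo-ID overrides the preceding 'warcinfo'.
            result.append(fields.get("warc-warcinfo-id", current))
    return result

records = [
    {"warc-type": "warcinfo", "warc-record-id": "<urn:uuid:info-a>"},
    {"warc-type": "response"},
    {"warc-type": "response", "warc-warcinfo-id": "<urn:uuid:info-b>"},
]
print(associated_warcinfo(records))
```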
5.17 WARC-Filename
The WARC-Filename is the name of the file containing the current ‘warcinfo’ record.
WARC-Filename = "WARC-Filename" ":" (TEXT | quoted-string)
The WARC-Filename field may be used in ‘warcinfo’ type records and shall not be used for other
record types.
...