ISO 28500:2009
Information and documentation - WARC file format
ISO 28500:2009 specifies the WARC file format: to store both the payload content and control information from mainstream Internet application layer protocols, such as the Hypertext Transfer Protocol (HTTP), Domain Name System (DNS), and File Transfer Protocol (FTP); to store arbitrary metadata linked to other stored data (e.g. subject classifier, discovered language, encoding); to support data compression and maintain data record integrity; to store all control information from the harvesting protocol (e.g. request headers), not just response information; to store the results of data transformations linked to other stored data; to store a duplicate detection event linked to other stored data (to reduce storage in the presence of identical or substantially similar resources); to be extended without disruption to existing functionality; to support handling of overly long records by truncation or segmentation, where desired.
Information et documentation — Format de fichier WARC
Informatika in dokumentacija - Datotečna oblika zapisa WARC
ISO 28500:2009 is classified under the following ICS (International Classification for Standards) categories: 35.240.30 - IT applications in information, documentation and publishing. The ICS classification helps identify the subject area and facilitates finding related standards.
ISO 28500:2009 is linked to one other standard: ISO 28500:2017. Checking this relationship helps ensure you are using the most current and applicable version of the standard.
Standards Content (Sample)
INTERNATIONAL STANDARD ISO 28500
First edition 2009-05-15
Information and documentation — WARC
file format
Information et documentation — Format de fichier WARC
Reference number: ISO 28500:2009(E)
© ISO 2009
© ISO 2009
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 • CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
ii © ISO 2009 – All rights reserved
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms, definitions and abbreviated terms
3.1 Terms and definitions
3.2 Abbreviated terms
4 File and record model
5 Named fields
5.1 General
5.2 WARC-Record-ID (mandatory)
5.3 Content-Length (mandatory)
5.4 WARC-Date (mandatory)
5.5 WARC-Type (mandatory)
5.6 Content-Type
5.7 WARC-Concurrent-To
5.8 WARC-Block-Digest
5.9 WARC-Payload-Digest
5.10 WARC-IP-Address
5.11 WARC-Refers-To
5.12 WARC-Target-URI
5.13 WARC-Truncated
5.14 WARC-Warcinfo-ID
5.15 WARC-Filename
5.16 WARC-Profile
5.17 WARC-Identified-Payload-Type
5.18 WARC-Segment-Number
5.19 WARC-Segment-Origin-ID
5.20 WARC-Segment-Total-Length
6 WARC record types
6.1 General
6.2 'warcinfo'
6.3 'response'
6.4 'resource'
6.5 'request'
6.6 'metadata'
6.7 'revisit'
6.8 'conversion'
6.9 'continuation'
7 Record segmentation
8 Registration of MIME media types application/warc and application/warc-fields
8.1 General
8.2 application/warc
8.3 application/warc-fields
9 WARC file name, size and compression
Annex A (informative) Use cases for writing WARC records
Annex B (informative) Examples of WARC records
Annex C (informative) WARC file size and name recommendations
Annex D (informative) Compression recommendations
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies
(ISO member bodies). The work of preparing International Standards is normally carried out through ISO
technical committees. Each member body interested in a subject for which a technical committee has been
established has the right to be represented on that committee. International organizations, governmental and
non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the
International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of technical committees is to prepare International Standards. Draft International Standards
adopted by the technical committees are circulated to the member bodies for voting. Publication as an
International Standard requires approval by at least 75 % of the member bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO shall not be held responsible for identifying any or all such patent rights.
ISO 28500 was prepared by Technical Committee ISO/TC 46, Information and documentation, Subcommittee
SC 4, Technical interoperability.
Introduction
Websites and web pages emerge and disappear from the World Wide Web every day. For the past ten years,
memory storage organizations have tried to find the most appropriate ways to collect and keep track of this
vast quantity of important material using web-scale tools such as web crawlers. A web crawler is a program
that browses the web in an automated manner according to a set of policies; starting with a list of URLs, it
saves each page identified by a URL, finds all the hyperlinks in the page (e.g. links to other pages, images,
videos, scripting or style instructions, etc.), and adds them to the list of URLs to visit recursively. Storing and
managing the billions of saved web page objects itself presents a challenge.
At the same time, those same organizations have a rising need to archive large numbers of digital files not
necessarily captured from the web (e.g. entire series of electronic journals, or data generated by
environmental sensing equipment). A general requirement that appears to be emerging is for a container
format that permits one file simply and safely to carry a very large number of constituent data objects for the
purpose of storage, management, and exchange. Those data objects (or resources) need to be of unrestricted
type (including many binary types for audio, CAD, compressed files, etc.), but fortunately the container needs
only minimal knowledge of the nature of the objects.
The WARC (Web ARChive) file format offers a convention for concatenating multiple resource records (data
objects), each consisting of a set of simple text headers and an arbitrary data block into one long file. The
WARC format is an extension of the ARC file format (ARC) that has traditionally been used to store "web
crawls" as sequences of content blocks harvested from the World Wide Web. Each capture in an ARC file is
preceded by a one-line header that very briefly describes the harvested content and its length. This is directly
followed by the retrieval protocol response messages and content. The original ARC format file has been used
by the Internet Archive (IA) since 1996 for managing billions of objects, and by several national libraries.
The motivation to extend the ARC format arose from the discussion and experiences of the International
Internet Preservation Consortium (IIPC), whose members include the national libraries of Australia, Canada,
Denmark, Finland, France, Iceland, Italy, Norway, Sweden, The British Library (UK), The Library of Congress
(USA), and the Internet Archive (IA). The California Digital Library and the Los Alamos National Laboratory
also provided input on extending and generalizing the format.
The WARC format is expected to be a standard way to structure, manage and store billions of resources
collected from the web and elsewhere. It will be used to build applications for harvesting (such as the open
source Heritrix web crawler), managing, accessing, and exchanging content. The way WARC files will be
created and resources stored and rendered will depend on software and applications implementations.
Besides the primary content recorded in ARCs, the extended WARC format accommodates related secondary
content, such as assigned metadata, abbreviated duplicate detection events, later-date transformations, and
segmentation of large resources. The extension may also be useful for more general applications than web
archiving. To aid the development of tools that are backwards compatible, WARC content is clearly
distinguishable from pre-revision ARC content.
The WARC file format is made sufficiently different from the legacy ARC format files so that software tools can
unambiguously detect and correctly process both WARC and ARC records; given the large amount of existing
archival data in the previous ARC format, it is important that access and use of this legacy not be interrupted
when transitioning to the WARC format.
After the Internet Engineering Steering Group (IESG: http://www.ietf.org/iesg.html) approval, IANA (Internet
Assigned Numbers Authority: http://www.iana.org/) is expected to register the WARC type "application/warc"
using the application provided in this International Standard and following procedures defined in [RFC2048].
INTERNATIONAL STANDARD ISO 28500:2009(E)
Information and documentation — WARC file format
1 Scope
This International Standard specifies the WARC file format:
⎯ to store both the payload content and control information from mainstream Internet application layer
protocols, such as the HTTP, DNS, and FTP;
⎯ to store arbitrary metadata linked to other stored data (e.g. subject classifier, discovered language,
encoding);
⎯ to support data compression and maintain data record integrity;
⎯ to store all control information from the harvesting protocol (e.g. request headers), not just response
information;
⎯ to store the results of data transformations linked to other stored data;
⎯ to store a duplicate detection event linked to other stored data (to reduce storage in the presence of
identical or substantially similar resources);
⎯ to be extended without disruption to existing functionality;
⎯ to support handling of overly long records by truncation or segmentation, where desired.
2 Normative references
The following referenced documents are indispensable for the application of this document. For dated
references, only the edition cited applies. For undated references, the latest edition of the referenced
document (including any amendments) applies.
ISO 8601, Data elements and interchange formats — Information interchange — Representation of dates and
times
[RFC1035] Mockapetris, P. Domain names — Implementation and specification. STD 13, November 1987.
Available at: http://www.faqs.org/rfcs/rfc1035.html
[RFC1884] Hinden, R. and Deering, S. IP Version 6 Addressing Architecture. December 1995. Available at:
http://www.faqs.org/rfcs/rfc1884.html
[RFC2045] Freed, N. and Borenstein, N. Multipurpose Internet Mail Extensions (MIME) Part One: Format of
Internet Message Bodies. November 1996. Available at: http://www.faqs.org/rfcs/rfc2045
[RFC2540] Eastlake, D. Detached Domain Name System (DNS) Information. March 1999. Available at:
http://www.faqs.org/rfcs/rfc2540.html
[RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P. and Berners-Lee, T.
Hypertext Transfer Protocol — HTTP/1.1. June 1999 (TXT, PS, PDF, HTML, XML). Available at:
http://www.faqs.org/rfcs/rfc2616.html
[RFC2822] Resnick, P. (ed.) Internet Message Format. April 2001. Available at:
http://www.faqs.org/rfcs/rfc2822
[RFC3629] Yergeau, F. UTF-8, a transformation format of ISO 10646. STD 63, November 2003. Available at:
http://www.faqs.org/rfcs/rfc3629.html
[RFC3986] Berners-Lee, T., Fielding, R., Masinter, L. Uniform Resource Identifier (URI): Generic Syntax. STD
66, January 2005 (TXT, HTML, XML). Available at: http://www.faqs.org/rfcs/rfc3986.html
[RFC4027] Josefsson, S. Domain Name System Media Types. April 2005. Available at:
http://www.faqs.org/rfcs/rfc4027.html
[W3CDTF] Date and Time Formats: note submitted to the W3C. 15 September 1997 (W3C profile of
ISO 8601). Available at: http://www.w3.org/TR/NOTE-datetime
3 Terms, definitions and abbreviated terms
3.1 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
3.1.1
WARC record
basic constituent of a WARC file, which itself consists of a sequence of WARC records
3.1.2
WARC record content block
part (zero or more octets) of a WARC record that follows the header and that forms the main body of a WARC
record
3.1.3
WARC record payload
data object referred to, or contained by a WARC record as a meaningful subset of the content block
3.1.4
WARC record header
beginning of a WARC record, consisting of one first line declaring the record to be in the WARC format with a
given version number, followed by lines of named fields up to a blank line
3.1.5
WARC named fields
set of elements consisting of a name, a colon, and a value, with long values continued on indented lines
3.1.6
WARC logical record
in the context of segmentation, a logical record may be composed of multiple segments, each represented by
a WARC record
3.2 Abbreviated terms
ABNF augmented Backus-Naur form
ARC archive
CRLF carriage return line feed
DNS domain name system
FTP file transfer protocol
HTTP hypertext transfer protocol
IANA Internet Assigned Numbers Authority
IESG Internet Engineering Steering Group
RFC request for comments
UR (I/L/N) uniform resource (identifier/locator/name)
WARC web archive
4 File and record model
A WARC format file is the simple concatenation of one or more WARC records. The first record usually
describes the records to follow. In general, record content is either the direct result of a retrieval attempt (web
pages, inline images, URL redirection information, DNS hostname lookup results, stand-alone files, etc.) or is
synthesized material (e.g. metadata, transformed content) that provides additional information about archived
content.
A WARC record shall consist of a record header followed by a record content block and two new lines. The
WARC record header shall consist of one first line declaring the record to be in the WARC format with a given
version number, then a variable number of line-oriented named fields terminated by a blank line. The WARC
record header format shall follow the general rules of HTTP/1.1 [RFC2616] and [RFC2822] headers with one
major exception: it shall also allow UTF-8 characters, as specified in [RFC3629].
The top-level view of a WARC file can be expressed in an ABNF grammar, reusing the augmented constructs
defined in section 2.1 of HTTP/1.1 [RFC2616]. (In particular, note that to avoid the risk of confusion, where
any WARC rule has the same name as an [RFC2616] rule, the definition here has been made the same,
except in the case of the CHAR rule, which in WARC includes multibyte UTF-8 characters.)
warc-file   = 1*warc-record
warc-record = header CRLF
              block CRLF CRLF
header      = version warc-fields
version     = "WARC/1.0" CRLF
warc-fields = *named-field CRLF
block       = *OCTET
The record version shall appear first in every record and hence shall also begin the WARC file itself.
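As a non-normative illustration of the grammar above, the following Python sketch serializes one record. The function name `build_record` and all sample field values (record identifier, target URI, date) are assumptions made for this example only.

```python
CRLF = b"\r\n"

def build_record(fields, block):
    """Serialize one record: version line, named fields, blank line, block, two CRLFs."""
    lines = [b"WARC/1.0"]  # the version line comes first in every record
    for name, value in fields:
        lines.append(f"{name}: {value}".encode("utf-8"))
    # Content-Length counts the octets of the block only, not the header.
    lines.append(f"Content-Length: {len(block)}".encode("ascii"))
    header = CRLF.join(lines) + CRLF
    # header, blank line (CRLF), block, then CRLF CRLF closes the record
    return header + CRLF + block + CRLF + CRLF

record = build_record(
    [
        ("WARC-Type", "resource"),
        # Hypothetical identifier and target URI, for illustration only.
        ("WARC-Record-ID", "<urn:uuid:00000000-0000-0000-0000-000000000000>"),
        ("WARC-Target-URI", "<http://example.org/hello.txt>"),
        ("WARC-Date", "2009-05-15T00:00:00Z"),
        ("Content-Type", "text/plain"),
    ],
    b"hello",
)
```

Because a WARC file is a plain concatenation of records, the bytes returned by several such calls could simply be written one after another.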
The WARC record relies heavily on named fields. Each named field consists of a name followed by a colon
(":") and the field value. Field names are not case-sensitive. The field value may be preceded by any amount
of linear white space (LWS), though a single space is preferred. Header fields can be extended over multiple
lines by preceding each extra line with at least one space or tab character.
Named fields may appear in any order and field values may contain any UTF-8 character. Both defined-fields
and extension-fields follow the generic named-field format. Extension-fields may be used in extensions of the
core format.
named-field   = field-name ":" [ field-value ]
field-name    = token
field-value   = *( field-content | LWS ) ; further qualified
                                         ; by field
                                         ; definitions
field-content = <the OCTETs making up the field-value
                and consisting of either *TEXT or combinations
                of token, separators, and quoted-string>
OCTET         = <any 8-bit sequence of data>
token         = 1*<any US-ASCII character
                except CTLs or separators>
separators    = "(" | ")" | "<" | ">" | "@"
              | "," | ";" | ":" | "\" | <">
              | "/" | "[" | "]" | "?" | "="
              | "{" | "}" | SP | HT
TEXT          = <any OCTET except CTLs,
                but including LWS>
CHAR          = <UTF-8 characters; RFC3629> ; (0-191, 194-244)
DIGIT         = <any US-ASCII digit "0".."9">
CTL           = <any US-ASCII control character
                (octets 0 - 31) and DEL (127)>
CR            = <ASCII CR, carriage return> ; (13)
LF            = <ASCII LF, linefeed> ; (10)
SP            = <ASCII SP, space> ; (32)
HT            = <ASCII HT, horizontal-tab> ; (9)
CRLF          = CR LF
LWS           = [CRLF] 1*( SP | HT ) ; semantics same as
                                     ; single SP
quoted-string = ( <"> *(qdtext | quoted-pair ) <"> )
qdtext        = <any TEXT except <">>
quoted-pair   = "\" CHAR ; single-character quoting
uri           = "<" <'URI' per RFC3986> ">"
Although UTF-8 characters are allowed, the 'encoded-word' mechanism of [RFC2047] may also be used when
writing WARC fields and shall also be understood by WARC reading software.
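The named-field rules above (case-insensitive names, optional leading white space, continuation lines) can be sketched in Python as follows. This is a simplified, non-normative illustration: the name `parse_named_fields` and the sample field `X-Long-Field` are hypothetical, and the sketch does not decode the 'encoded-word' mechanism.

```python
def parse_named_fields(header_text):
    """Parse CRLF-separated named fields into (lower-cased name, value) pairs."""
    unfolded = []
    for line in header_text.split("\r\n"):
        if not line:
            break  # a blank line terminates the named fields
        if line[0] in " \t" and unfolded:
            # Continuation line: LWS has the same semantics as a single space.
            unfolded[-1] += " " + line.strip(" \t")
        else:
            unfolded.append(line)
    fields = []
    for line in unfolded:
        name, _, value = line.partition(":")
        # Field names are not case-sensitive, so normalise them for lookup.
        fields.append((name.lower(), value.strip(" \t")))
    return fields

sample = (
    "WARC-Type: response\r\n"
    "warc-target-uri:<http://example.org/>\r\n"
    "X-Long-Field: first part\r\n"
    "\tsecond part\r\n"
    "\r\n"
)
parsed = parse_named_fields(sample)
```

Returning a list of pairs rather than a dictionary keeps repeated fields (such as several WARC-Concurrent-To lines) intact.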
The rest of the WARC record grammar concerns defined-field parameters such as record identifier, record
type, creation time, content length, and content type.
defined-field = WARC-Type
| WARC-Record-ID
| WARC-Date
| Content-Length
| Content-Type
| WARC-Concurrent-To
| WARC-Block-Digest
| WARC-Payload-Digest
| WARC-IP-Address
| WARC-Refers-To
| WARC-Target-URI
| WARC-Truncated
| WARC-Warcinfo-ID
| WARC-Filename ; warcinfo only
| WARC-Profile ; revisit only
| WARC-Identified-Payload-Type
| WARC-Segment-Origin-ID ; continuation only
| WARC-Segment-Number
| WARC-Segment-Total-Length ; continuation only
Every WARC record shall have a type, reported in the WARC-Type field. Eight WARC record types are
defined in this International Standard as follows:
⎯ 'warcinfo',
⎯ 'response',
⎯ 'resource',
⎯ 'request',
⎯ 'metadata',
⎯ 'revisit',
⎯ 'conversion',
⎯ 'continuation'.
Other types of WARC records may be defined in extensions of the core format. The relevant fields for each
record type are described in detail in Clause 6. Each field's meaning and legal value format are described in
Clause 5.
The record block shall contain octet content, interpreted based on the record type and other header values. All
records shall include a Content-Length field to specify the length of the block.
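Because the block is arbitrary octet content, a reader locates its end through Content-Length rather than by scanning for a delimiter. A minimal, non-normative Python sketch follows; the name `read_record` is an assumption, and the sketch deliberately ignores continuation lines and repeated fields.

```python
import io

def read_record(stream):
    """Read one record from a binary stream; return None at end of file."""
    version = stream.readline()
    if not version:
        return None
    fields = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b""):
            break  # blank line: end of the record header
        name, _, value = line.decode("utf-8").partition(":")
        fields[name.lower()] = value.strip()
    # The block length comes from the mandatory Content-Length field.
    block = stream.read(int(fields["content-length"]))
    if stream.read(4) != b"\r\n\r\n":
        raise ValueError("record is not terminated by two CRLFs")
    return version.strip().decode("ascii"), fields, block

data = (
    b"WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n"
    b"WARC/1.0\r\nWARC-Type: metadata\r\nContent-Length: 0\r\n\r\n\r\n\r\n"
)
stream = io.BytesIO(data)
first = read_record(stream)
second = read_record(stream)
third = read_record(stream)
```

The sample data omits other mandatory fields for brevity; a conforming record would also carry WARC-Record-ID and WARC-Date.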
Some record types (and possibly future record types) also define a payload, such as a meaningful subset of
the block or content from a predecessor record. Some headers pertain to the payload of a record rather than
the block directly.
For example, in a 'response' record with a content block consisting of HTTP headers and a data object, the
payload would be the data object. All 'response', 'resource', 'request', 'conversion' and 'continuation' records
may have a payload. All 'warcinfo', 'metadata' and 'revisit' records shall not have a payload.
Content matching the warc-file rule shall have the MIME content-type "application/warc", as specified in 8.2.
Content matching only the warc-fields rule is useful as a simple descriptive format, and has MIME content-
type "application/warc-fields", as specified in 8.3.
5 Named fields
5.1 General
Named fields within a WARC record provide information about the current record. WARC both reuses
appropriate headers from other standards and defines new headers, all beginning "WARC-", for WARC-
specific purposes.
WARC named fields of the same type shall not be repeated in the same WARC record (for example, a WARC
record shall not have several WARC-Date or several WARC-Target-URI), except as noted (e.g.
WARC-Concurrent-To).
Because new fields may be defined in extensions to the core WARC format, WARC processing software shall
ignore fields with unrecognized names.
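A reader or validator might check the no-repetition rule along the following lines. This Python sketch is illustrative only: `repeated_fields` and the `REPEATABLE` set are hypothetical names, and the set lists just the one exception named as an example in this clause.

```python
from collections import Counter

# WARC-Concurrent-To is the exception given as an example in the text above.
REPEATABLE = {"warc-concurrent-to"}

def repeated_fields(fields):
    """Return the names that occur more than once but are not allowed to repeat."""
    counts = Counter(name.lower() for name, _ in fields)
    return sorted(n for n, c in counts.items() if c > 1 and n not in REPEATABLE)

fields = [
    ("WARC-Date", "2009-05-15T00:00:00Z"),
    ("warc-date", "2009-05-16T00:00:00Z"),       # same field, different case
    ("WARC-Concurrent-To", "<urn:example:a>"),   # hypothetical identifiers
    ("WARC-Concurrent-To", "<urn:example:b>"),
    ("X-Unknown", "unrecognized fields are simply not interpreted"),
]
problems = repeated_fields(fields)
```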
5.2 WARC-Record-ID (mandatory)
A WARC-Record-ID is an identifier assigned to the current record that is globally unique for its period of
intended use. No identifier scheme is mandated by this specification, but each WARC-Record-ID shall be a
legal URI and clearly indicate a documented and registered scheme to which it conforms (e.g. via a URI
scheme prefix such as "http:" or "urn:"). Care should be taken to ensure that this value is written with no
internal white space.
WARC-Record-ID = "WARC-Record-ID" ":" uri
All records shall have a WARC-Record-ID field.
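Since no identifier scheme is mandated, the choice is left to the writer. One possible approach, shown here purely as an assumption, is a random UUID under the "urn:" prefix; `new_record_id` is a hypothetical helper name.

```python
import uuid

def new_record_id():
    """Return a WARC-Record-ID value using the urn:uuid scheme, in angle brackets."""
    # The uri rule of the grammar wraps the URI in "<" and ">".
    return f"<urn:uuid:{uuid.uuid4()}>"

record_id = new_record_id()
header_line = f"WARC-Record-ID: {record_id}"
```

A random UUID contains no white space, which satisfies the advice above about internal white space.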
5.3 Content-Length (mandatory)
The Content-Length is the number of octets in the block, similar to [RFC2616]. If no block is present, a value
of "0" (zero) shall be used.
Content-Length = "Content-Length" ":" 1*DIGIT
All records shall have a Content-Length field.
5.4 WARC-Date (mandatory)
The WARC-Date is a 14-digit UTC time-stamp formatted as YYYY-MM-DDThh:mm:ssZ, and shall conform to
the W3C profile of ISO 8601, i.e. [W3CDTF]. The time-stamp shall represent the instant that data capture for
record creation began. Multiple records written as part of a single capture event (see 5.7) shall use the same
WARC-Date, even though the times of their writing will not be exactly synchronized.
WARC-Date = "WARC-Date" ":" w3c-iso8601
w3c-iso8601 = <YYYY-MM-DDThh:mm:ssZ>
All records shall have a WARC-Date field.
See Annex A for examples on usage of WARC-Date fields.
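As a non-normative sketch, a writer could produce this time-stamp in Python as shown below. The helper name `warc_date` is an assumption; it expects a timezone-aware datetime and converts it to UTC first.

```python
from datetime import datetime, timedelta, timezone

def warc_date(moment):
    """Format a timezone-aware datetime as YYYY-MM-DDThh:mm:ssZ in UTC."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

# The same instant expressed in UTC and in a UTC+02:00 zone.
utc_stamp = warc_date(datetime(2009, 5, 15, 12, 30, 0, tzinfo=timezone.utc))
local_stamp = warc_date(
    datetime(2009, 5, 15, 14, 30, 0, tzinfo=timezone(timedelta(hours=2)))
)
```

A writer would compute the value once at the start of a capture event and reuse it for every record of that event.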
5.5 WARC-Type (mandatory)
WARC-Type is the type of WARC record. Record types defined in this International Standard are:
⎯ 'warcinfo',
⎯ 'response',
⎯ 'resource',
⎯ 'request',
⎯ 'metadata',
⎯ 'revisit',
⎯ 'conversion', and
⎯ 'continuation'.
Other types of WARC records may be defined in extensions of the core format. Types are further described in
Clause 6.
A WARC file need not contain any particular record types, though starting all WARC files with a 'warcinfo'
record is recommended.
WARC-Type = "WARC-Type" ":" record-type
record-type = "warcinfo" | "response" | "resource"
| "request" | "metadata" | "revisit"
| "conversion" | "continuation" | future-type
future-type = token
All records shall have a WARC-Type field.
WARC processing software shall ignore records of unrecognized type.
See Annex A for examples on usage of WARC-Type fields.
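The rule that unrecognized record types are ignored can be sketched as a simple filter. This Python example is illustrative: `records_to_process` is a hypothetical name, 'screenshot' is an invented future type, and the sketch assumes WARC-Type values have already been read into a dictionary keyed by lower-cased field name.

```python
KNOWN_TYPES = {
    "warcinfo", "response", "resource", "request",
    "metadata", "revisit", "conversion", "continuation",
}

def records_to_process(records):
    """Yield only records whose WARC-Type is recognized; skip the rest silently."""
    for fields, block in records:
        if fields.get("warc-type") in KNOWN_TYPES:
            yield fields, block

records = [
    ({"warc-type": "response"}, b"a"),
    ({"warc-type": "screenshot"}, b"b"),  # hypothetical extension type
    ({"warc-type": "metadata"}, b"c"),
]
kept = [fields["warc-type"] for fields, _ in records_to_process(records)]
```

Skipping rather than rejecting keeps older software usable on files written with later extensions.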
5.6 Content-Type
The Content-Type field is the MIME type (as defined in [RFC2045]) of information contained in the record's
block. For example, in HTTP request and response records, this would be 'application/http' as specified in
19.1 of [RFC2616] (or 'application/http; msgtype=request' and 'application/http; msgtype=response'
respectively). In particular, the content-type is not the value of the HTTP Content-Type header in a HTTP
response but a MIME type to describe the full archived HTTP message (hence 'application/http' if the block
contains request or response headers).
Content-Type = "Content-Type" ":" media-type
media-type = type "/" subtype *( ";" parameter )
type = token
subtype = token
parameter = attribute "=" value
attribute = token
value = token | quoted-string
All records with a non-empty block (non-zero Content-Length), except 'continuation' records, should have a
Content-Type field. Only if the media type is not given by a Content-Type field may a reader attempt to guess
the media type via inspection of its content and/or the name extension(s) of the URI used to identify the
resource. If the media type remains unknown, the reader should treat it as type "application/octet-stream".
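The fallback order described above might be sketched as follows. This non-normative Python example covers only guessing from the URI's name extension (via the standard `mimetypes` module), not inspection of the content; `effective_media_type` is a hypothetical name.

```python
import mimetypes

def effective_media_type(fields, target_uri=None):
    """Return the declared media type, else a guess, else application/octet-stream."""
    declared = fields.get("content-type")
    if declared:
        return declared  # a declared Content-Type always wins
    if target_uri:
        guessed, _encoding = mimetypes.guess_type(target_uri)
        if guessed:
            return guessed
    return "application/octet-stream"

declared = effective_media_type({"content-type": "application/http; msgtype=response"})
guessed = effective_media_type({}, "http://example.org/page.html")
unknown = effective_media_type({})
```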
5.7 WARC-Concurrent-To
The WARC-Concurrent-To field (or fields) contains the WARC-Record-ID of any records created as part of the
same capture event as the current record. A capture event comprises the information automatically gathered
by a retrieval against a single WARC-Target-URI; for example, it may be represented by a 'response' or
'revisit' record plus its associated 'request' record.
WARC-Concurrent-To = "WARC-Concurrent-To" ":" uri
This field may be used to associate records of types 'request', 'response', 'resource', 'metadata', and 'revisit'
with one another when they arise from a single capture event. (When so used, any WARC-Concurrent-To
association shall be considered bidirectional even if the header only appears on one record.) The WARC-
Concurrent-To field shall not be used in 'warcinfo', 'conversion', and 'continuation' records.
As an exception to the general rule, several WARC-Concurrent-To fields may be repeated within the same
WARC record.
See Annex A for examples on usage of WARC-Concurrent-To fields.
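Because the association is bidirectional even when only one record carries the field, a reader typically needs to index it in both directions. A minimal Python sketch, with the hypothetical name `concurrent_links` and invented record identifiers:

```python
from collections import defaultdict

def concurrent_links(records):
    """Map each record ID to every ID associated with it, in both directions."""
    links = defaultdict(set)
    for record_id, concurrent_to in records:
        for other in concurrent_to:
            links[record_id].add(other)
            links[other].add(record_id)  # holds even if only one side declares it
    return dict(links)

# (WARC-Record-ID, list of WARC-Concurrent-To values); identifiers are made up.
records = [
    ("<urn:example:request>", ["<urn:example:response>"]),
    ("<urn:example:response>", []),
    ("<urn:example:metadata>", ["<urn:example:response>"]),
]
links = concurrent_links(records)
```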
5.8 WARC-Block-Digest
The WARC-Block-Digest is an optional parameter indicating the algorithm name and calculated value of a
digest applied to the full block of the record.
WARC-Block-Digest = "WARC-Block-Digest" ":" labelled-digest
labelled-digest = algorithm ":" digest-value
algorithm = token
digest-value = token
An example is a SHA-1 labelled Base32 ([RFC3548]) value:
WARC-Block-Digest: sha1:AB2CD3EF4GH5IJ6KL7MN8OPQ
No particular algorithm is recommended.
Any record may have a WARC-Block-Digest field.
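A SHA-1 labelled Base32 value of the kind shown above could be computed as in the following Python sketch. No algorithm is recommended by the text, so SHA-1 and the helper name `labelled_sha1` are choices made for illustration only.

```python
import base64
import hashlib

def labelled_sha1(data):
    """Return 'sha1:' followed by the Base32 encoding of the SHA-1 digest of data."""
    digest = hashlib.sha1(data).digest()  # 20 octets
    return "sha1:" + base64.b32encode(digest).decode("ascii")

block = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
header_line = "WARC-Block-Digest: " + labelled_sha1(block)
```

A 20-octet digest encodes to exactly 32 Base32 characters, so no "=" padding appears in the value.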
5.9 WARC-Payload-Digest
A WARC-Payload-Digest is an optional parameter indicating the algorithm name and calculated value of a
digest applied to the payload referred to or contained by the record, which is not necessarily equivalent to the
record block.
WARC-Payload-Digest = "WARC-Payload-Digest" ":" labelled-digest
An example is a SHA-1 labelled Base32 ([RFC3548]) value:
WARC-Payload-Digest: sha1:3EF4GH5IJ6KL7MN8OPQAB2CD
No particular algorithm is recommended.
The payload of an application/http block is its 'entity-body' (specified in [RFC2616]). In contrast to WARC-
Block-Digest, the WARC-Payload-Digest field may also be used for data not actually present in the current
record block, for example when a block is left off in accordance with a 'revisit' profile (see 6.7), or when a
record is segmented (the WARC-Payload-Digest recorded in the first segment of a segmented record shall be
the digest of the payload of the logical record).
The WARC-Payload-Digest field may be used on WARC records with a well-defined payload and shall not be
used on records without a well-defined payload.
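For an application/http block, the payload digest covers only the entity-body. The Python sketch below illustrates the difference from a block digest under simplifying assumptions: it splits at the first blank line and does not undo any transfer or content encoding. The names `http_entity_body` and `payload_digest` are hypothetical, and SHA-1 is again only an example algorithm.

```python
import base64
import hashlib

def http_entity_body(block):
    """Return everything after the first blank line of an HTTP message, or b''."""
    _headers, separator, body = block.partition(b"\r\n\r\n")
    return body if separator else b""

def payload_digest(block):
    """Labelled SHA-1 Base32 digest of the entity-body only."""
    body = http_entity_body(block)
    return "sha1:" + base64.b32encode(hashlib.sha1(body).digest()).decode("ascii")

block = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
header_line = "WARC-Payload-Digest: " + payload_digest(block)
```

Two responses with different HTTP headers but identical bodies yield the same payload digest, which is what makes the value useful for duplicate detection.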
5.10 WARC-IP-Address
The WARC-IP-Address is the numeric Internet address contacted to retrieve any included content. An IPv4
address shall be written as a "dotted quad"; an IPv6 address shall be written as specified in [RFC1884]. For a
HTTP retrieval, this will be the IP address used at retrieval time corresponding to the hostname in the record's
target-URI.
WARC-IP-Address = "WARC-IP-Address" ":" (ipv4 | ipv6)
ipv4 = <"dotted quad">
ipv6 = <per section 2.2 of RFC1884>
The WARC-IP-Address field may be used on 'response', 'resource', 'request', 'metadata', and 'revisit' records,
but shall not be used on 'warcinfo', 'conversion' or 'continuation' records.
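A writer may wish to confirm that the value is a numeric address rather than a hostname. The Python sketch below uses the standard `ipaddress` module; it is assumed here, not stated by this International Standard, that the module's textual forms are acceptable for this field. The helper names are hypothetical.

```python
import ipaddress

def is_numeric_address(text):
    """Return True if text parses as an IPv4 dotted quad or an IPv6 address."""
    try:
        ipaddress.ip_address(text)
    except ValueError:
        return False
    return True

def normalised_address(text):
    """Return the address in the module's canonical textual form."""
    return str(ipaddress.ip_address(text))

v4 = normalised_address("192.0.2.1")
v6 = normalised_address("2001:DB8:0:0:0:0:0:1")
```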
5.11 WARC-Refers-To
The WARC-Refers-To field contains the WARC-Record-ID of a single record for which the present record
holds additional content.
WARC-Refers-To = "WARC-Refers-To" ":" uri
The WARC-Refers-To field may be used to associate a 'metadata' record to another record it describes. The
WARC-Refers-To field may also be used to associate a record of type 'revisit' or 'conversion' with the
preceding record which helped determine the present record content. The WARC-Refers-To field shall not be
used in 'warcinfo', 'response', 'resource', 'request', and 'continuation' records.
See Annex A for examples on usage of WARC-Refers-To fields.
5.12 WARC-Target-URI
The WARC-Target-URI is the original URI whose capture gave rise to the information content in this record. In
the context of web harvesting, this is the URI that was the target of a crawler's retrieval request. For a 'revisit'
record, it is the URI that was the target of a retrieval request. Indirectly, such as for a 'metadata', or
'conversion' record, it is a copy of the WARC-Target-URI appearing in the original record to which the newer
record pertains. The URI in this value shall be written as specified in [RFC3986].
WARC-Target-URI = "WARC-Target-URI" ":" uri
All 'response', 'resource', 'request', 'revisit', 'conversion' and 'continuation' records shall have a WARC-Target-URI
field. A 'metadata' record may have a WARC-Target-URI field. A 'warcinfo' record shall not have a
WARC-Target-URI field.
5.13 WARC-Truncated
For practical reasons, writers of the WARC format may place limits on the time or storage allocated to
archiving a single resource. As a result, only a truncated portion of the original resource may be available for
saving into a WARC record.
Any record may indicate that truncation of its content block has occurred and give the reason with a
WARC-Truncated field.
WARC-Truncated = "WARC-Truncated" ":" reason-token
reason-token   = "length"       ; exceeds configured max length
               | "time"         ; exceeds configured max time
               | "disconnect"   ; network disconnect
               | "unspecified"  ; other/unknown reason
               | future-reason
future-reason  = token
For example, if the capture of what appeared to be a multi-gigabyte resource was cut short after a transfer
time limit was reached, the partial resource could be saved to a WARC record with this field.
The WARC-Truncated field may be used on any WARC record. The WARC Content-Length field shall still
report the actual truncated size of the record block.
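As an informative sketch, not part of this International Standard, a reader might separate the four reason tokens defined above from any future-reason token as follows. The name `classify_truncation` is hypothetical.

```python
# Reason tokens defined in 5.13; any other token is a future-reason.
KNOWN_REASONS = {"length", "time", "disconnect", "unspecified"}


def classify_truncation(field_value):
    """Return (reason, is_known) for a WARC-Truncated field value."""
    reason = field_value.strip()  # leading white space is permitted
    if not reason:
        raise ValueError("WARC-Truncated requires a reason-token")
    return reason, reason in KNOWN_REASONS
```

An unknown token is reported rather than rejected, because the grammar allows future reasons.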
5.14 WARC-Warcinfo-ID
When present, the WARC-Warcinfo-ID indicates the WARC-Record-ID of the associated 'warcinfo' record for
this record. Typically, the WARC-Warcinfo-ID parameter is used when the context of the applicable 'warcinfo'
record is unavailable, such as after distributing single records into separate WARC files. WARC writing
applications (such as web crawlers) may choose to always record this parameter.
WARC-Warcinfo-ID = "WARC-Warcinfo-ID" ":" uri
The WARC-Warcinfo-ID field value overrides any association with a previously occurring (in the WARC)
'warcinfo' record, thus providing a way to protect the true association when records are combined from
different WARC files.
The WARC-Warcinfo-ID field may be used in any record type except 'warcinfo'.
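The override rule can be sketched as follows. This is an informative illustration only: the function name `associate_warcinfo` and the use of dicts keyed by exactly-cased field names are assumptions, and a real reader would first normalize field names, since they are not case-sensitive.

```python
def associate_warcinfo(records):
    """Yield (record_id, warcinfo_id) for each non-'warcinfo' record.

    `records` is an iterable of dicts mapping field names to values,
    in file order (a hypothetical in-memory form of parsed headers).
    """
    current = None  # ID of the most recent 'warcinfo' record seen
    for fields in records:
        if fields["WARC-Type"] == "warcinfo":
            current = fields["WARC-Record-ID"]
            continue
        # An explicit WARC-Warcinfo-ID overrides the positional association.
        yield fields["WARC-Record-ID"], fields.get("WARC-Warcinfo-ID", current)
```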
5.15 WARC-Filename
The WARC-Filename is the name of the file containing the current 'warcinfo' record.
WARC-Filename = "WARC-Filename" ":" ( TEXT | quoted-string )
The WARC-Filename field may be used in 'warcinfo' type records and shall not be used for other record types.
5.16 WARC-Profile
The WARC-Profile is a URI signifying the kind of analysis and handling applied in a 'revisit' record. (Like an
XML namespace, the URI may, but need not, return human-readable or machine-readable documentation.) If
reading software does not recognize the given URI as a supported kind of handling, it shall not attempt to
interpret the associated record block.
WARC-Profile = "WARC-Profile" ":" uri
Subclause 6.7 ('revisit') defines two initial profile options for the WARC-Profile field of 'revisit' records.
The WARC-Profile field is mandatory on 'revisit' type records and undefined for other record types.
5.17 WARC-Identified-Payload-Type
The WARC-Identified-Payload-Type is the content-type of the record's payload as determined by an
independent check. This string shall not be arrived at by blindly promoting a HTTP Content-Type value up
from a record block into the WARC header without direct analysis of the payload, as such values may often be
unreliable.
WARC-Identified-Payload-Type = "WARC-Identified-Payload-Type" ":" media-type
The WARC-Identified-Payload-Type field may be used on WARC records with a well-defined payload and
shall not be used on records without a well-defined payload.
5.18 WARC-Segment-Number
The WARC-Segment-Number reports the current record's relative ordering in a sequence of segmented
records.
WARC-Segment-Number = "WARC-Segment-Number" ":" 1*DIGIT
In the first segment of any record that is completed in one or more later 'continuation' WARC records, this
parameter is mandatory. Its value there is "1". In a 'continuation' record, this parameter is also mandatory. Its
value is the sequence number of the current segment in the logical whole record, increasing by 1 in each next
segment.
See Clause 7 on record segmentation for full details on the use of WARC record segmentation.
5.19 WARC-Segment-Origin-ID
The WARC-Segment-Origin-ID identifies the starting record in a series of segmented records whose content
blocks are reassembled to obtain a logically complete content block.
WARC-Segment-Origin-ID = "WARC-Segment-Origin-ID" ":" uri
This field is mandatory on all 'continuation' records, and shall not be used in other records. See Clause 7 on
record segmentation for full details on the use of WARC record segmentation.
5.20 WARC-Segment-Total-Length
In the final record of a segmented series, the WARC-Segment-Total-Length reports the total length of all
segment content blocks when concatenated together.
WARC-Segment-Total-Length = "WARC-Segment-Total-Length" ":" 1*DIGIT
This field is mandatory on the last 'continuation' record of a series, and shall not be used elsewhere.
See Clause 7 on record segmentation for full details on the use of WARC record segmentation.
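The three segmentation fields in 5.18 to 5.20 can be checked together when a logical record is reassembled. The sketch below is informative only and assumes the segments have already been collected in order as (fields, block) pairs; the name `reassemble` and that representation are hypothetical. Clause 7 remains the authority on segmentation.

```python
def reassemble(segments):
    """Concatenate segment blocks after checking 5.18 to 5.20.

    `segments` is a list of (fields, block) pairs, first segment first,
    where `fields` is a dict of header values and `block` is bytes.
    """
    origin_id = segments[0][0]["WARC-Record-ID"]
    for expected, (fields, _block) in enumerate(segments, start=1):
        # Segment numbers start at 1 and increase by 1.
        if int(fields["WARC-Segment-Number"]) != expected:
            raise ValueError(f"expected segment number {expected}")
        if expected > 1:
            if fields["WARC-Type"] != "continuation":
                raise ValueError("later segments must be 'continuation' records")
            if fields["WARC-Segment-Origin-ID"] != origin_id:
                raise ValueError("segment belongs to a different series")
    payload = b"".join(block for _fields, block in segments)
    if len(segments) > 1:
        # Mandatory on the last 'continuation' record of a series.
        total = segments[-1][0].get("WARC-Segment-Total-Length")
        if total is None or int(total) != len(payload):
            raise ValueError("WARC-Segment-Total-Length missing or wrong")
    return payload
```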
6 WARC record types
6.1 General
The purpose and use of each defined record type is described in 6.2 to 6.9.
Because new record types that extend the WARC format may be defined in future standards, WARC
processing software shall skip records of unknown type.
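A minimal, informative sketch of this rule, assuming records are already available as (fields, block) pairs; the name `known_records` is hypothetical:

```python
# The eight record types defined in this International Standard.
KNOWN_TYPES = {"warcinfo", "response", "resource", "request",
               "metadata", "revisit", "conversion", "continuation"}


def known_records(records):
    """Yield only the records whose WARC-Type this reader understands."""
    for fields, block in records:
        if fields.get("WARC-Type") in KNOWN_TYPES:
            yield fields, block
        # Records of unknown type are skipped, not treated as errors.
```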
6.2 'warcinfo'
A 'warcinfo' record describes the records that follow it, up through end of file, end of input, or until next
'warcinfo' record. Typically, this appears once and at the beginning of a WARC file. For a web archive, it often
contains information about the web crawl which generated the following records.
The format of this descriptive record block may vary, though the use of the "application/warc-fields"
content-type is recommended. Allowable fields include, but are not limited to, all [DCMI] plus the following field
definitions. All fields are optional.
a) 'operator': contact information for the operator who created this WARC resource. A name or name and
email address is recommended.
b) 'software': the software and software version used to create this WARC resource. For example,
"heritrix/1.12.0".
c) 'robots': the robots policy followed by the harvester creating this WARC resource. The string 'classic'
indicates the 1994 web robots exclusion standard rules are being obeyed.
d) 'hostname': the hostname of the machine that created this WARC resource, such as
"crawling17.archive.org".
e) 'ip': the IP address of the machine that created this WARC resource, such as "123.2.3.4".
f) 'http-header-user-agent': the HTTP 'user-agent' header usually sent by the harvester along with each
request. Note that if 'request' records are used to save verbatim requests, this information is redundant.
(If a 'request' or 'metadata' record reports a different 'user-agent' for a specific request, the more specific
information should be considered more reliable.)
g) 'http-header-from': the HTTP 'from' header usually sent by the harvester along with each request. (The
same considerations as for 'user-agent' apply.)
So that multiple record excerpts from inside WARC files are also valid WARC files, it is optional that the first
record of a legal WARC be a 'warcinfo' description. Also, to allow the concatenation of WARC files into a
larger valid WARC file, it is allowable for 'warcinfo' records to appear anywhere in a WARC file.
See B.1 for an example of a 'warcinfo' record.
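As an informative sketch, a writer might build the block of a 'warcinfo' record from a mapping of the fields listed above. The name `warc_fields_block` is hypothetical, and the sketch assumes one CRLF-terminated line per field; it leaves any terminating blank line to the caller, and B.1 and 8.3 remain the authority on the exact layout.

```python
def warc_fields_block(fields):
    """Encode a mapping of field names to values as UTF-8 'name: value' lines."""
    lines = []
    for name, value in fields.items():
        if "\r" in value or "\n" in value:
            # Multi-line values would need continuation lines; not handled here.
            raise ValueError(f"line break in value of {name!r}")
        lines.append(f"{name}: {value}\r\n")
    return "".join(lines).encode("utf-8")
```

The length of the returned bytes is the value to report in the record's Content-Length field.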
6.3 'response'
6.3.1 General
A 'response' record should contain a complete scheme-specific response, including network protocol
information, where possible. The exact contents of a 'response' record are determined not just by the record
type but also by the URI scheme of the record's target-URI, as described in 6.3.2 to 6.3.3.
See B.2 for an example of a 'response' record.
6.3.2 'http' and 'https' schemes
For a target-URI of the 'http' or 'https' schemes, a 'response' record block should co
...
SLOVENIAN STANDARD
01 December 2009
Informatika in dokumentacija - Datotečna oblika zapisa WARC
Information and documentation - WARC file format
Information et documentation - Format de fichier WARC
This Slovenian standard is identical to: ISO 28500:2009
ICS:
35.240.30 IT applications in information, documentation and publishing
2003-01. Slovenski inštitut za standardizacijo (Slovenian Institute for Standardization). Reproduction of all or part of this standard is not permitted.
INTERNATIONAL STANDARD ISO 28500
First edition 2009-05-15
Information and documentation - WARC file format
Information et documentation - Format de fichier WARC
© ISO 2009
© ISO 2009
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 • CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms, definitions and abbreviated terms
3.1 Terms and definitions
3.2 Abbreviated terms
4 File and record model
5 Named fields
5.1 General
5.2 WARC-Record-ID (mandatory)
5.3 Content-Length (mandatory)
5.4 WARC-Date (mandatory)
5.5 WARC-Type (mandatory)
5.6 Content-Type
5.7 WARC-Concurrent-To
5.8 WARC-Block-Digest
5.9 WARC-Payload-Digest
5.10 WARC-IP-Address
5.11 WARC-Refers-To
5.12 WARC-Target-URI
5.13 WARC-Truncated
5.14 WARC-Warcinfo-ID
5.15 WARC-Filename
5.16 WARC-Profile
5.17 WARC-Identified-Payload-Type
5.18 WARC-Segment-Number
5.19 WARC-Segment-Origin-ID
5.20 WARC-Segment-Total-Length
6 WARC record types
6.1 General
6.2 'warcinfo'
6.3 'response'
6.4 'resource'
6.5 'request'
6.6 'metadata'
6.7 'revisit'
6.8 'conversion'
6.9 'continuation'
7 Record segmentation
8 Registration of MIME media types application/warc and application/warc-fields
8.1 General
8.2 application/warc
8.3 application/warc-fields
9 WARC file name, size and compression
Annex A (informative) Use cases for writing WARC records
Annex B (informative) Examples of WARC records
Annex C (informative) WARC file size and name recommendations
Annex D (informative) Compression recommendations
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies
(ISO member bodies). The work of preparing International Standards is normally carried out through ISO
technical committees. Each member body interested in a subject for which a technical committee has been
established has the right to be represented on that committee. International organizations, governmental and
non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the
International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of technical committees is to prepare International Standards. Draft International Standards
adopted by the technical committees are circulated to the member bodies for voting. Publication as an
International Standard requires approval by at least 75 % of the member bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO shall not be held responsible for identifying any or all such patent rights.
ISO 28500 was prepared by Technical Committee ISO/TC 46, Information and documentation, Subcommittee
SC 4, Technical interoperability.
Introduction
Websites and web pages emerge and disappear from the World Wide Web every day. For the past ten years,
memory storage organizations have tried to find the most appropriate ways to collect and keep track of this
vast quantity of important material using web-scale tools such as web crawlers. A web crawler is a program
that browses the web in an automated manner according to a set of policies; starting with a list of URLs, it
saves each page identified by a URL, finds all the hyperlinks in the page (e.g. links to other pages, images,
videos, scripting or style instructions, etc.), and adds them to the list of URLs to visit recursively. Storing and
managing the billions of saved web page objects itself presents a challenge.
At the same time, those same organizations have a rising need to archive large numbers of digital files not
necessarily captured from the web (e.g. entire series of electronic journals, or data generated by
environmental sensing equipment). A general requirement that appears to be emerging is for a container
format that permits one file simply and safely to carry a very large number of constituent data objects for the
purpose of storage, management, and exchange. Those data objects (or resources) need to be of unrestricted
type (including many binary types for audio, CAD, compressed files, etc.), but fortunately the container needs
only minimal knowledge of the nature of the objects.
The WARC (Web ARChive) file format offers a convention for concatenating multiple resource records (data
objects), each consisting of a set of simple text headers and an arbitrary data block into one long file. The
WARC format is an extension of the ARC file format (ARC) that has traditionally been used to store "web
crawls" as sequences of content blocks harvested from the World Wide Web. Each capture in an ARC file is
preceded by a one-line header that very briefly describes the harvested content and its length. This is directly
followed by the retrieval protocol response messages and content. The original ARC format file has been used
by the Internet Archive (IA) since 1996 for managing billions of objects, and by several national libraries.
The motivation to extend the ARC format arose from the discussion and experiences of the International
Internet Preservation Consortium (IIPC), whose members include the national libraries of Australia, Canada,
Denmark, Finland, France, Iceland, Italy, Norway, Sweden, The British Library (UK), The Library of Congress
(USA), and the Internet Archive (IA). The California Digital Library and the Los Alamos National Laboratory
also provided input on extending and generalizing the format.
The WARC format is expected to be a standard way to structure, manage and store billions of resources
collected from the web and elsewhere. It will be used to build applications for harvesting (such as the open
source Heritrix web crawler), managing, accessing, and exchanging content. The way WARC files will be
created and resources stored and rendered will depend on software and applications implementations.
Besides the primary content recorded in ARCs, the extended WARC format accommodates related secondary
content, such as assigned metadata, abbreviated duplicate detection events, later-date transformations, and
segmentation of large resources. The extension may also be useful for more general applications than web
archiving. To aid the development of tools that are backwards compatible, WARC content is clearly
distinguishable from pre-revision ARC content.
The WARC file format is made sufficiently different from the legacy ARC format files so that software tools can
unambiguously detect and correctly process both WARC and ARC records; given the large amount of existing
archival data in the previous ARC format, it is important that access and use of this legacy not be interrupted
when transitioning to the WARC format.
After the Internet Engineering Steering Group (IESG: http://www.ietf.org/iesg.html) approval, IANA (Internet
Assigned Numbers Authority: http://www.iana.org/) is expected to register the WARC type "application/warc"
using the application provided in this International Standard and following procedures defined in [RFC2048].
INTERNATIONAL STANDARD ISO 28500:2009(E)
Information and documentation — WARC file format
1 Scope
This International Standard specifies the WARC file format:
⎯ to store both the payload content and control information from mainstream Internet application layer
protocols, such as the HTTP, DNS, and FTP;
⎯ to store arbitrary metadata linked to other stored data (e.g. subject classifier, discovered language,
encoding);
⎯ to support data compression and maintain data record integrity;
⎯ to store all control information from the harvesting protocol (e.g. request headers), not just response
information;
⎯ to store the results of data transformations linked to other stored data;
⎯ to store a duplicate detection event linked to other stored data (to reduce storage in the presence of
identical or substantially similar resources);
⎯ to be extended without disruption to existing functionality;
⎯ to support handling of overly long records by truncation or segmentation, where desired.
2 Normative references
The following referenced documents are indispensable for the application of this document. For dated
references, only the edition cited applies. For undated references, the latest edition of the referenced
document (including any amendments) applies.
ISO 8601, Data elements and interchange formats — Information interchange — Representation of dates and
times
[RFC1035] Mockapetris, P. Domain names — Implementation and specification. STD 13, November 1987.
Available at: http://www.faqs.org/rfcs/rfc1035.html
[RFC1884] Hinden, R. and Deering, S. IP Version 6 Addressing Architecture. December 1995. Available at:
http://www.faqs.org/rfcs/rfc1884.html
[RFC2045] Freed, N. and Borenstein, N. Multipurpose Internet Mail Extensions (MIME) Part One: Format of
Internet Message Bodies. November 1996. Available at: http://www.faqs.org/rfcs/rfc2045
[RFC2540] Eastlake, D. Detached Domain Name System (DNS) Information. March 1999. Available at:
http://www.faqs.org/rfcs/rfc2540.html
[RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P. and Berners-Lee, T.
Hypertext Transfer Protocol — HTTP/1.1. June 1999 (TXT, PS, PDF, HTML, XML). Available at:
http://www.faqs.org/rfcs/rfc2616.html
[RFC2822] Resnick, P. (ed.) Internet Message Format. April 2001. Available at:
http://www.faqs.org/rfcs/rfc2822
[RFC3629] Yergeau, F. UTF-8, a transformation format of ISO 10646. STD 63, November 2003. Available at:
http://www.faqs.org/rfcs/rfc3629.html
[RFC3986] Berners-Lee, T., Fielding, R., Masinter, L. Uniform Resource Identifier (URI): Generic Syntax. STD
66, January 2005 (TXT, HTML, XML). Available at: http://www.faqs.org/rfcs/rfc3986.html
[RFC4027] Josefsson, S. Domain Name System Media Types. April 2005. Available at:
http://www.faqs.org/rfcs/rfc4027.html
[W3CDTF] Date and Time Formats: note submitted to the W3C. 15 September 1997 (W3C profile of
ISO 8601). Available at: http://www.w3.org/TR/NOTE-datetime
3 Terms, definitions and abbreviated terms
3.1 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
3.1.1
WARC record
basic constituent of a WARC file, which itself consists of a sequence of WARC records
3.1.2
WARC record content block
part (zero or more octets) of a WARC record that follows the header and that forms the main body of a WARC
record
3.1.3
WARC record payload
data object referred to, or contained by a WARC record as a meaningful subset of the content block
3.1.4
WARC record header
beginning of a WARC record, consisting of one first line declaring the record to be in the WARC format with a
given version number, followed by lines of named fields up to a blank line
3.1.5
WARC named fields
set of elements consisting of a name, a colon, and a value, with long values continued on indented lines
3.1.6
WARC logical record
in the context of segmentation, a logical record may be composed of multiple segments, each represented by
a WARC record
3.2 Abbreviated terms
ABNF augmented Backus-Naur form
ARC archive
CRLF carriage return line feed
DNS domain name system
FTP file transfer protocol
HTTP hypertext transfer protocol
IANA Internet Assigned Numbers Authority
IESG Internet Engineering Steering Group
RFC request for comments
UR (I/L/N) uniform resource (identifier/locator/name)
WARC web archive
4 File and record model
A WARC format file is the simple concatenation of one or more WARC records. The first record usually
describes the records to follow. In general, record content is either the direct result of a retrieval attempt (web
pages, inline images, URL redirection information, DNS hostname lookup results, stand-alone files, etc.) or is
synthesized material (e.g. metadata, transformed content) that provides additional information about archived
content.
A WARC record shall consist of a record header followed by a record content block and two new lines. The
WARC record header shall consist of one first line declaring the record to be in the WARC format with a given
version number, then a variable number of line-oriented named fields terminated by a blank line. The WARC
record header format shall follow the general rules of HTTP/1.1 [RFC2616] and [RFC2822] headers with one
major exception: it shall also allow UTF-8 characters, as specified in [RFC3629].
The top-level view of a WARC file can be expressed in an ABNF grammar, reusing the augmented constructs
defined in section 2.1 of HTTP/1.1 [RFC2616]. (In particular, note that to avoid the risk of confusion, where
any WARC rule has the same name as an [RFC2616] rule, the definition here has been made the same,
except in the case of the CHAR rule, which in WARC includes multibyte UTF-8 characters.)
warc-file = 1*warc-record
warc-record = header CRLF
block CRLF CRLF
header = version warc-fields
version = "WARC/1.0" CRLF
warc-fields = *named-field CRLF
block = *OCTET
The record version shall appear first in every record and hence shall also begin the WARC file itself.
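The top-level grammar can be illustrated with a short, informative serialization sketch. The name `serialize_record` is hypothetical, and the sketch assumes that each named field is written on its own CRLF-terminated line, as in HTTP/1.1 headers; it does not check which fields are mandatory.

```python
def serialize_record(fields, block=b""):
    """Serialize one WARC record.

    Layout: version line, named fields, blank line, block, two CRLFs.
    `fields` maps field names to string values; Content-Length is
    always recomputed from the block.
    """
    fields = dict(fields)
    fields["Content-Length"] = str(len(block))  # octet count of the block
    header = "WARC/1.0\r\n" + "".join(
        f"{name}: {value}\r\n" for name, value in fields.items()
    )
    return header.encode("utf-8") + b"\r\n" + block + b"\r\n\r\n"
```

Because a WARC file is a plain concatenation of records, writing several results of this function one after another yields content matching the warc-file rule, provided each record carries its mandatory fields.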
The WARC record relies heavily on named fields. Each named field consists of a name followed by a colon
(":") and the field value. Field names are not case-sensitive. The field value may be preceded by any amount
of linear white space (LWS), though a single space is preferred. Header fields can be extended over multiple
lines by preceding each extra line with at least one space or tab character.
Named fields may appear in any order and field values may contain any UTF-8 character. Both defined-fields
and extension-fields follow the generic named-field format. Extension-fields may be used in extensions of the
core format.
named-field   = field-name ":" [ field-value ]
field-name    = token
field-value   = *( field-content | LWS )   ; further qualified
                                           ; by field definitions
field-content = <the OCTETs making up the field-value
                and consisting of either *TEXT or combinations
                of token, separators, and quoted-string>
OCTET         = <any 8-bit sequence of data>
token         = 1*<any US-ASCII character
                except CTLs or separators>
separators    = "(" | ")" | "<" | ">" | "@"
              | "," | ";" | ":" | "\" | <">
              | "/" | "[" | "]" | "?" | "="
              | "{" | "}" | SP | HT
TEXT          = <any OCTET except CTLs,
                but including LWS>
CHAR          = <UTF-8 characters; RFC3629>   ; (0-191, 194-244)
DIGIT         = <any US-ASCII digit "0".."9">
CTL           = <any US-ASCII control character
                (octets 0 - 31) and DEL (127)>
CR            = <ASCII CR, carriage return>   ; (13)
LF            = <ASCII LF, linefeed>          ; (10)
SP            = <ASCII SP, space>             ; (32)
HT            = <ASCII HT, horizontal-tab>    ; (9)
CRLF          = CR LF
LWS           = [CRLF] 1*( SP | HT )          ; semantics same as
                                              ; single SP
quoted-string = ( <"> *(qdtext | quoted-pair ) <"> )
qdtext        = <any TEXT except <">>
quoted-pair   = "\" CHAR                      ; single-character quoting
uri           = "<" <'URI' per RFC3986> ">"
Although UTF-8 characters are allowed, the 'encoded-word' mechanism of [RFC2047] may also be used when
writing WARC fields and shall also be understood by WARC reading software.
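The named-field rules above (case-insensitive names, optional leading white space, continuation lines) can be sketched as follows. This is an informative illustration only: the name `parse_named_fields` is hypothetical, the input is assumed to be already decoded as UTF-8 text, and the 'encoded-word' mechanism of [RFC2047] is not decoded.

```python
def parse_named_fields(header_text):
    """Parse CRLF-separated named fields into a list of (name, value) pairs.

    Names are lower-cased because field names are not case-sensitive.
    A list is returned, not a dict, because some fields may repeat.
    """
    fields = []
    for line in header_text.split("\r\n"):
        if not line:
            break  # a blank line ends the named fields
        if line[0] in " \t" and fields:
            # Continuation line: LWS has the semantics of a single space.
            name, value = fields[-1]
            fields[-1] = (name, value + " " + line.strip())
            continue
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a named field: {line!r}")
        fields.append((name.lower(), value.strip()))
    return fields
```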
The rest of the WARC record grammar concerns defined-field parameters such as record identifier, record
type, creation time, content length, and content type.
defined-field = WARC-Type
| WARC-Record-ID
| WARC-Date
| Content-Length
| Content-Type
| WARC-Concurrent-To
| WARC-Block-Digest
| WARC-Payload-Digest
| WARC-IP-Address
| WARC-Refers-To
| WARC-Target-URI
| WARC-Truncated
| WARC-Warcinfo-ID
| WARC-Filename ; warcinfo only
| WARC-Profile ; revisit only
| WARC-Identified-Payload-Type
| WARC-Segment-Origin-ID ; continuation only
| WARC-Segment-Number
| WARC-Segment-Total-Length ; continuation only
Every WARC record shall have a type, reported in the WARC-Type field. Eight WARC record types are
defined in this International Standard as follows:
⎯ 'warcinfo',
⎯ 'response',
⎯ 'resource',
⎯ 'request',
⎯ 'metadata',
⎯ 'revisit',
⎯ 'conversion',
⎯ 'continuation'.
Other types of WARC records may be defined in extensions of the core format. The relevant fields for each
record type are described in detail in Clause 6. Each field's meaning and legal value format are described in
Clause 5.
The record block shall contain octet content, interpreted based on the record type and other header values. All
records shall include a Content-Length field to specify the length of the block.
Some record types (and possibly future record types) also define a payload, such as a meaningful subset of
the block or content from a predecessor record. Some headers pertain to the payload of a record rather than
the block directly.
For example, in a 'response' record with a content block consisting of HTTP headers and a data object, the
payload would be the data object. All 'response', 'resource', 'request', 'conversion' and 'continuation' records
may have a payload. Records of type 'warcinfo', 'metadata' and 'revisit' shall not have a payload.
Content matching the warc-file rule shall have the MIME content-type "application/warc", as specified in 8.2.
Content matching only the warc-fields rule is useful as a simple descriptive format, and has MIME
content-type "application/warc-fields", as specified in 8.3.
5 Named fields
5.1 General
Named fields within a WARC record provide information about the current record. WARC both reuses
appropriate headers from other standards and defines new headers, all beginning "WARC-", for
WARC-specific purposes.
WARC named fields of the same type shall not be repeated in the same WARC record (for example, a WARC
record shall not have several WARC-Date or several WARC-Target-URI), except as noted (e.g.
WARC-Concurrent-To).
Because new fields may be defined in extensions to the core WARC format, WARC processing software shall
ignore fields with unrecognized names.
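A reader or validator might detect forbidden repetitions as in the informative sketch below. The name `repeated_fields` is hypothetical, and the set of repeatable fields contains only WARC-Concurrent-To, the single exception named in this clause.

```python
from collections import Counter

# Fields that may legitimately occur more than once in one record.
REPEATABLE = {"warc-concurrent-to"}


def repeated_fields(fields):
    """Return the sorted names that repeat although they must not.

    `fields` is a list of (name, value) pairs; names are compared
    case-insensitively.
    """
    counts = Counter(name.lower() for name, _value in fields)
    return sorted(
        name for name, count in counts.items()
        if count > 1 and name not in REPEATABLE
    )
```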
5.2 WARC-Record-ID (mandatory)
A WARC-Record-ID is an identifier assigned to the current record that is globally unique for its period of
intended use. No identifier scheme is mandated by this specification, but each WARC-Record-ID shall be a
legal URI and clearly indicate a documented and registered scheme to which it conforms (e.g. via a URI
scheme prefix such as "http:" or "urn:"). Care should be taken to ensure that this value is written with no
internal white space.
WARC-Record-ID = "WARC-Record-ID" ":" uri
All records shall have a WARC-Record-ID field.
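Since no identifier scheme is mandated, the following informative sketch simply shows one possible choice: a random UUID in the registered "urn:uuid:" namespace, wrapped in the angle brackets required by the uri rule. The name `new_record_id` is hypothetical.

```python
import uuid


def new_record_id():
    """Return a WARC-Record-ID value based on a random (version 4) UUID."""
    return f"<urn:uuid:{uuid.uuid4()}>"
```

The value contains no internal white space, as this clause recommends.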
5.3 Content-Length (mandatory)
The Content-Length is the number of octets in the block, similar to [RFC2616]. If no block is present, a value
of "0" (zero) shall be used.
Content-Length = "Content-Length" ":" 1*DIGIT
All records shall have a Content-Length field.
5.4 WARC-Date (mandatory)
The WARC-Date is a 14-digit UTC time-stamp formatted as YYYY-MM-DDThh:mm:ssZ, and shall conform to
the W3C profile of ISO 8601, i.e. [W3CDTF]. The time-stamp shall represent the instant that data capture for
record creation began. Multiple records written as part of a single capture event (see 5.7) shall use the same
WARC-Date, even though the times of their writing will not be exactly synchronized.
WARC-Date = "WARC-Date" ":" w3c-iso8601
w3c-iso8601 = <YYYY-MM-DDThh:mm:ssZ>
All records shall have a WARC-Date field.
See Annex A for examples on usage of WARC-Date fields.
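An informative sketch of producing a value in this form with the Python standard library; the name `warc_date` is hypothetical, and the argument, when given, is assumed to be a timezone-aware datetime marking the instant that data capture began.

```python
from datetime import datetime, timezone


def warc_date(moment=None):
    """Format a timezone-aware datetime as YYYY-MM-DDThh:mm:ssZ in UTC.

    With no argument, the current time is used.
    """
    moment = moment or datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

Records written as part of one capture event would reuse a single value returned by this function, not call it once per record.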
5.5 WARC-Type (mandatory)
WARC-Type is the type of WARC record. Record types defined in this International Standard are:
⎯ 'warcinfo',
⎯ 'response',
⎯ 'resource',
⎯ 'request',
⎯ 'metadata',
⎯ 'revisit',
⎯ 'conversion', and
⎯ 'continuation'.
Other types of WARC records may be defined in extensions of the core format. Types are further described in
Clause 6.
A WARC file need not contain any particular record types, though starting all WARC files with a 'warcinfo'
record is recommended.
WARC-Type = "WARC-Type" ":" record-type
record-type = "warcinfo" | "response" | "resource"
| "request" | "metadata" | "revisit"
| "conversion" | "continuation" | future-type
future-type = token
All records shall have a WARC-Type field.
WARC processing software shall ignore records of unrecognized type.
See Annex A for examples on usage of WARC-Type fields.
5.6 Content-Type
The Content-Type field is the MIME type (as defined in [RFC2045]) of information contained in the record's
block. For example, in HTTP request and response records, this would be 'application/http' as specified in
19.1 of [RFC2616] (or 'application/http; msgtype=request' and 'application/http; msgtype=response'
respectively). In particular, the content-type is not the value of the HTTP Content-Type header in a HTTP
response but a MIME type to describe the full archived HTTP message (hence 'application/http' if the block
contains request or response headers).
Content-Type = "Content-Type" ":" media-type
media-type = type "/" subtype *( ";" parameter )
type = token
subtype = token
parameter = attribute "=" value
attribute = token
value = token | quoted-string
All records with a non-empty block (non-zero Content-Length), except 'continuation' records, should have a
Content-Type field. A reader may attempt to guess the media type, by inspecting the content and/or the name
extension(s) of the URI used to identify the resource, only if the media type is not given by a Content-Type
field. If the media type remains unknown, the reader should treat it as type "application/octet-stream".
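The fallback order can be sketched as follows. This informative example guesses from the URI name extension only, using Python's `mimetypes` module, and does not inspect the content; the name `effective_media_type` and the dict representation of the header are assumptions.

```python
import mimetypes
from urllib.parse import urlsplit


def effective_media_type(fields, target_uri=None):
    """Return the media type a reader should assume for a record block."""
    declared = fields.get("Content-Type")
    if declared:
        return declared  # a declared type is never second-guessed
    if target_uri:
        # Guess from the name extension of the URI path, if any.
        guessed, _encoding = mimetypes.guess_type(urlsplit(target_uri).path)
        if guessed:
            return guessed
    return "application/octet-stream"  # type remains unknown
```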
5.7 WARC-Concurrent-To
The WARC-Concurrent-To field (or fields) contains the WARC-Record-ID of any records created as part of the
same capture event as the current record. A capture event comprises the information automatically gathered
by a retrieval against a single WARC-Target-URI; for example, it may be represented by a 'response' or
'revisit' record plus its associated 'request' record.
WARC-Concurrent-To = "WARC-Concurrent-To" ":" uri
This field may be used to associate records of types 'request', 'response', 'resource', 'metadata', and 'revisit'
with one another when they arise from a single capture event. (When so used, any WARC-Concurrent-To
association shall be considered bidirectional even if the header only appears on one record.) The
WARC-Concurrent-To field shall not be used in 'warcinfo', 'conversion', and 'continuation' records.
As an exception to the general rule, several WARC-Concurrent-To fields may be repeated within the same
WARC record.
See Annex A for examples on usage of WARC-Concurrent-To fields.
5.8 WARC-Block-Digest
The WARC-Block-Digest is an optional parameter indicating the algorithm name and calculated value of a
digest applied to the full block of the record.
WARC-Block-Digest = "WARC-Block-Digest" ":" labelled-digest
labelled-digest = algorithm ":" digest-value
algorithm = token
digest-value = token
An example is a SHA-1 labelled Base32 ([RFC3548]) value:
WARC-Block-Digest: sha1:AB2CD3EF4GH5IJ6KL7MN8OPQ
No particular algorithm is recommended.
Any record may have a WARC-Block-Digest field.
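As an informative sketch, a labelled digest of the kind shown in the example above (SHA-1, Base32) can be computed with the Python standard library. No particular algorithm is recommended by this International Standard; SHA-1 is used here only because the example above uses it, and the name `labelled_sha1` is hypothetical.

```python
import base64
import hashlib


def labelled_sha1(data):
    """Return 'sha1:' followed by the Base32 form of the SHA-1 digest of `data`."""
    digest = hashlib.sha1(data).digest()  # 20 octets
    return "sha1:" + base64.b32encode(digest).decode("ascii")
```

A 20-octet digest encodes to exactly 32 Base32 characters with no padding. For WARC-Block-Digest, `data` is the full record block.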
5.9 WARC-Payload-Digest
A WARC-Payload-Digest is an optional parameter indicating the algorithm name and calculated value of a
digest applied to the payload referred to or contained by the record, which is not necessarily equivalent to the
record block.
WARC-Payload-Digest = "WARC-Payload-Digest" ":" labelled-digest
An example is a SHA-1 labelled Base32 ([RFC3548]) value:
WARC-Payload-Digest: sha1:3EF4GH5IJ6KL7MN8OPQAB2CD
No particular algorithm is recommended.
The payload of an application/http block is its 'entity-body' (specified in [RFC2616]). In contrast to
WARC-Block-Digest, the WARC-Payload-Digest field may also be used for data not actually present in the current
record block, for example when a block is left off in accordance with a 'revisit' profile (see 6.7), or when a
record is segmented (the WARC-Payload-Digest recorded in the first segment of a segmented record shall be
the digest of the payload of the logical record).
The WARC-Payload-Digest field may be used on WARC records with a well-defined payload and shall not be
used on records without a well-defined payload.
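The difference between block and payload can be sketched in Python for an application/http block: the payload digest covers only the entity-body, not the HTTP header section. This is a simplified sketch that assumes the header section ends at the first empty line and digests the entity-body octets exactly as stored; the function names are hypothetical.

```python
import base64
import hashlib


def http_payload(block: bytes) -> bytes:
    """Return the entity-body of an application/http block.

    Assumes the HTTP header section ends at the first CRLF CRLF.
    """
    _head, separator, body = block.partition(b"\r\n\r\n")
    if not separator:
        raise ValueError("no end of HTTP header section found")
    return body


def payload_digest_header(block: bytes) -> str:
    """Format a WARC-Payload-Digest header line (SHA-1, Base32)."""
    digest = hashlib.sha1(http_payload(block)).digest()
    value = base64.b32encode(digest).decode("ascii")
    return "WARC-Payload-Digest: sha1:" + value
```

Two responses with different HTTP headers but identical entity-bodies yield the same payload digest under this sketch, although their block digests would differ.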
5.10 WARC-IP-Address
The WARC-IP-Address is the numeric Internet address contacted to retrieve any included content. An IPv4
address shall be written as a "dotted quad"; an IPv6 address shall be written as specified in [RFC1884]. For an
HTTP retrieval, this will be the IP address used at retrieval time corresponding to the hostname in the record's
target-URI.
WARC-IP-Address = "WARC-IP-Address" ":" (ipv4 | ipv6)
ipv4 = <"dotted quad">
ipv6 = <as specified in [RFC1884]>
The WARC-IP-Address field may be used on 'response', 'resource', 'request', 'metadata', and 'revisit' records,
but shall not be used on 'warcinfo', 'conversion' or 'continuation' records.
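A writer might guard this field along the following lines. The sketch uses Python's standard ipaddress module to reject malformed addresses and refuses the record types on which the field shall not be used; the function name is hypothetical, and the module's compressed rendering of IPv6 addresses is assumed to be an acceptable textual form.

```python
import ipaddress

# Record types on which WARC-IP-Address shall not be used.
NO_IP_ADDRESS_TYPES = {"warcinfo", "conversion", "continuation"}


def ip_address_header(record_type: str, address: str) -> str:
    """Format a WARC-IP-Address header line, validating the address."""
    if record_type in NO_IP_ADDRESS_TYPES:
        raise ValueError(
            f"WARC-IP-Address shall not be used on '{record_type}' records"
        )
    # ip_address() raises ValueError for anything that is neither a
    # valid IPv4 nor a valid IPv6 address.
    parsed = ipaddress.ip_address(address)
    return f"WARC-IP-Address: {parsed}"
```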
5.11 WARC-Refers-To
The WARC-Refers-To field contains the WARC-Record-ID of a single record for which the present record
holds additional content.
WARC-Refers-To = "WARC-Refers-To" ":" uri
The WARC-Refers-To field may be used to associate a 'metadata' record to another record it describes. The
WARC-Refers-To field may also be used to associate a record of type 'revisit' or 'conversion' with the
preceding record which helped determine the present record content. The WARC-Refers-To field shall not be
used in 'warcinfo', 'response', 'resource', 'request', and 'continuation' records.
See Annex A for examples of the use of WARC-Refers-To fields.
5.12 WARC-Target-URI
The WARC-Target-URI is the original URI whose capture gave rise to the information content in this record. In
the context of web harvesting, this is the URI that was the target of a crawler's retrieval request. For a 'revisit'
record, it is the URI that was the target of a retrieval request. For a record that relates to a capture only
indirectly, such as a 'metadata' or 'conversion' record, it is a copy of the WARC-Target-URI appearing in the
original record to which the newer
record pertains. The URI in this value shall be written as specified in [RFC3986].
WARC-Target-URI = "WARC-Target-URI" ":" uri
All 'response', 'resource', 'request', 'revisit', 'conversion' and 'continuation' records shall have a
WARC-Target-URI field. A 'metadata' record may have a WARC-Target-URI field. A 'warcinfo' record shall not
have a WARC-Target-URI field.
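The per-type requirements above lend themselves to a small validation check. The following Python sketch assumes a record's header is available as a dictionary keyed by the exact field names shown in this clause; the function name and return convention are hypothetical.

```python
# Record types that shall carry a WARC-Target-URI field.
TARGET_URI_REQUIRED = {
    "response", "resource", "request",
    "revisit", "conversion", "continuation",
}
# Record types that shall not carry one.
TARGET_URI_FORBIDDEN = {"warcinfo"}


def check_target_uri(headers):
    """Return None if the header satisfies the rule, else a message."""
    record_type = headers.get("WARC-Type")
    present = "WARC-Target-URI" in headers
    if record_type in TARGET_URI_REQUIRED and not present:
        return f"'{record_type}' record shall have a WARC-Target-URI field"
    if record_type in TARGET_URI_FORBIDDEN and present:
        return "'warcinfo' record shall not have a WARC-Target-URI field"
    # 'metadata' records may go either way; other types are not judged here.
    return None
```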
5.13 WARC-Truncated
For practical reasons, writers of the WARC format may place limits on the time or storage allocated to
archiving a single resource. As a result, only a truncated portion of the original resource may be available for
saving into a WARC record.
Any record may indicate that truncation of its content block has occurred and give the reason with a
WARC-Truncated field.
WARC-Truncated = "WARC-Truncated" ":" reason-token
reason-token = "length" ; exceeds configured max length
             | "time" ; exceeds configured max time
             | "disconnect" ; network disconnect
             | "unspecified" ; other/unknown reason
             | future-reason
future-reason = token
For example, if the capture of what appeared to be a multi-gigabyte resource was cut short after a transfer
time limit was reached, the partial resource could be saved to a WARC record with this field.
The WARC-Truncated field may be used on any WARC record. The WARC Content-Length field shall still
report the actual truncated size of the record block.
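A writer-side sketch in Python of the length case: when a block exceeds a configured limit it is cut, the reason is recorded, and Content-Length reports the size of what was actually stored. The function name and the `max_length` parameter are assumptions for illustration.

```python
def truncate_block(block: bytes, max_length: int):
    """Apply a length limit and return (headers, stored_block)."""
    headers = {}
    if len(block) > max_length:
        block = block[:max_length]
        headers["WARC-Truncated"] = "length"
    # Content-Length always describes the block as written, not the
    # size of the original resource.
    headers["Content-Length"] = str(len(block))
    return headers, block
```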
5.14 WARC-Warcinfo-ID
When present, the WARC-Warcinfo-ID indicates the WARC-Record-ID of the associated 'warcinfo' record for
this record. Typically, the WARC-Warcinfo-ID parameter is used when the context of the applicable 'warcinfo'
record is unavailable, such as after distributing single records into separate WARC files. WARC writing
applications (such as web crawlers) may choose to always record this parameter.
WARC-Warcinfo-ID = "WARC-Warcinfo-ID" ":" uri
The WARC-Warcinfo-ID field value overrides any association with a previously occurring (in the WARC)
'warcinfo' record, thus providing a way to protect the true association when records are combined from
different WARC files.
The WARC-Warcinfo-ID field may be used in any record type except 'warcinfo'.
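The override rule can be sketched as follows in Python. The sketch walks the record headers of one WARC file in order and resolves each record's associated 'warcinfo' record: an explicit WARC-Warcinfo-ID wins, otherwise the most recent preceding 'warcinfo' record applies. Representing headers as dictionaries and the function name are assumptions.

```python
def resolve_warcinfo(records):
    """Map each non-'warcinfo' record ID to its 'warcinfo' record ID.

    `records` is the ordered sequence of header dictionaries of one
    WARC file. The mapped value is None when no association is known.
    """
    current = None  # most recent 'warcinfo' record seen so far
    resolved = {}
    for headers in records:
        record_id = headers["WARC-Record-ID"]
        if headers["WARC-Type"] == "warcinfo":
            current = record_id
            continue
        # An explicit WARC-Warcinfo-ID overrides the positional default.
        resolved[record_id] = headers.get("WARC-Warcinfo-ID", current)
    return resolved
```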
5.15 WARC-Filename
The WARC-Filename is the name of the file containing the current 'warcinfo' record.
WARC-Filename = "WARC-Filename" ":" ( TEXT | quoted-string )
The WARC-Filename field may be used in 'warcinfo' type records and shall not be used for other record types.
5.16 WARC-Profile
The WARC-Profile is a URI signifying the kind of analysis and handling applied in a 'revisit' record. (Like an
XML namespace, the URI may, but need not, return human-readable or machine-readable documentation.) If
reading software does not recognize the given URI as a supported kind of handling, it shall not attempt to
interpret the associated record block.
WARC-Profile = "WARC-Profile" ":" uri
Subclause 6.7 ('revisit') defines two initial profile options for the WARC-Profile header for 'revisit' records.
The WARC-Profile field is mandatory on 'revisit' type records and undefined for other record types.
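A reader could honour the rule on unrecognized profiles with a dispatch table, as in this Python sketch. The `handlers` mapping from profile URI to a handler function is a hypothetical design, and no real profile URI is assumed; returning None stands for "block left uninterpreted".

```python
def interpret_revisit(headers, block, handlers):
    """Interpret a 'revisit' record block only for a recognized profile.

    `handlers` maps supported WARC-Profile URIs to callables taking
    (headers, block); this mapping is supplied by the reading software.
    """
    profile = headers.get("WARC-Profile")
    if profile is None:
        raise ValueError("WARC-Profile is mandatory on 'revisit' records")
    handler = handlers.get(profile)
    if handler is None:
        # Unrecognized profile: do not attempt to interpret the block.
        return None
    return handler(headers, block)
```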
5.17 WARC-Identified-Payload-Type
The WARC-Identified-Payload-Type is the content-type of the record's payload as determined by an
independent check. This string shall not be arrived at by blindly promoting an HTTP Content-Type value up
from a record block into the WARC header without direct analysis of the payload, as such values may often be
unreliable.
WARC-Identified-Payload-Type = "WARC-Identified-Payload-Type" ":" media-type
The WARC-Identified-Payload-Type field may be used on WARC records with a well-defined payload and
shall not be used on records without a well-defined payload.
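One possible form of independent check is inspection of the leading octets of the payload. The Python sketch below recognizes a handful of well-known file signatures and returns None otherwise, in which case a writer would simply omit the field. The signature table is a small illustrative sample and the function name is hypothetical; real identification tools are far more thorough.

```python
# (leading octets, media type) pairs for a few common formats.
MAGIC_NUMBERS = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
]


def identify_payload_type(payload: bytes):
    """Return a media type derived from the payload itself, or None."""
    for prefix, media_type in MAGIC_NUMBERS:
        if payload.startswith(prefix):
            return media_type
    return None
```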
5.18 WARC-Segment-Number
The WARC-Segment-Number reports the current record's relative ordering in a sequence of segmented
records.
WARC-Segment-Number = "WARC-Segment-Number" ":" 1*DIGIT
In the first segment of any record that is completed in one or more later 'continuation' WARC records, this
parameter is mandatory. Its value there is "1". In a 'continuation' record, this parameter is also mandatory. Its
value is the sequence number of the current segment in the logical whole record, increasing by 1 with each
subsequent segment.
See Clause 7 on record segmentation for full details on the use of WARC record segmentation.
5.19 WARC-Segment-Origin-ID
The WARC-Segment-Origin-ID identifies the starting record in a series of segmented records whose content
blocks are reassembled to obtain a logically complete content block.
WARC-Segment-Origin-ID = "WARC-Segment-Origin-ID" ":" uri
This field is mandatory on all 'continuation' records, and shall not be used in other records. See Clause 7 on
record segmentation for full details on the use of WARC record segmentation.
5.20 WARC-Segment-Total-Length
In the final record of a segmented series, the WARC-Segment-Total-Length reports the total length of all
segment content blocks when concatenated together.
WARC-Segment-Total-Length = "WARC-Segment-Total-Length" ":" 1*DIGIT
This field is mandatory on the last 'continuation' record of a series, and shall not be used elsewhere.
See Clause 7 on record segmentation for full details on the use of WARC record segmentation.
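Taken together, the three segmentation fields allow a reader to reassemble and verify a logical record. The Python sketch below is a simplified reading of 5.18 to 5.20 only (Clause 7 has the full rules); it assumes each segment is available as a (headers, block) pair with dictionary headers, and the function name is hypothetical.

```python
def reassemble(segments):
    """Concatenate the blocks of one segmented record, with checks.

    `segments` is a list of (headers, block) pairs in any order.
    """
    if len(segments) < 2:
        raise ValueError("a segmented record has at least two segments")
    ordered = sorted(segments, key=lambda s: int(s[0]["WARC-Segment-Number"]))
    numbers = [int(headers["WARC-Segment-Number"]) for headers, _ in ordered]
    if numbers != list(range(1, len(ordered) + 1)):
        raise ValueError("segment numbers must run 1, 2, 3, ... without gaps")

    # Every later segment must be a 'continuation' record that points
    # back at the first segment.
    origin_id = ordered[0][0]["WARC-Record-ID"]
    for headers, _ in ordered[1:]:
        if headers.get("WARC-Type") != "continuation":
            raise ValueError("later segments must be 'continuation' records")
        if headers.get("WARC-Segment-Origin-ID") != origin_id:
            raise ValueError("WARC-Segment-Origin-ID does not name the first segment")

    whole = b"".join(block for _, block in ordered)
    declared = ordered[-1][0].get("WARC-Segment-Total-Length")
    if declared is None or int(declared) != len(whole):
        raise ValueError("WARC-Segment-Total-Length missing or inconsistent")
    return whole
```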
6 WARC record types
6.1 General
The purpose and use of each defined record type is described in 6.2 to 6.9.
Because new record types that extend the WARC format may be defined in future standards, WARC
processing software shall skip records of unknown type.
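In a reader, the skipping rule amounts to a filter over the record stream. A minimal Python sketch, assuming records arrive as (headers, block) pairs and using the eight record type names that appear in this document; the function name is hypothetical.

```python
# Record types named in this document.
KNOWN_TYPES = {
    "warcinfo", "response", "resource", "request",
    "metadata", "revisit", "conversion", "continuation",
}


def known_records(records):
    """Yield only records whose WARC-Type this reader understands."""
    for headers, block in records:
        if headers.get("WARC-Type") not in KNOWN_TYPES:
            continue  # unknown type: skip rather than fail
        yield headers, block
```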
6.2 'warcinfo'
A 'warcinfo' record describes the records that follow it, up through end of file, end of input, or until the next
'warcinfo' record. Typically, this appears once and at the beginning of a WARC file. For a web archive, it often
contains information about the web crawl which generated the following records.
The format of this descriptive record block may vary, though the use of the "application/warc-fields" content-
type is recommended. Allowable fields include, but are not limited to, all [DCMI] plus the following field
definitions. All fields are optional.
a) 'operator': contact information for the operator who created this WARC resource. A name or name and
email address is recommended.
b) 'software': the software and software version used to create this WARC resource. For example,
"heritrix/1.12.0".
c) 'robots': the robots policy followed by the harvester creating this WARC resource. The string 'classic'
indicates the 1994 web robots exclusion standard rules are being obeyed.
d) 'hostname': the hostname of the machine that created this WARC resource, such as
"crawling17.archive.org".
e) 'ip': the IP address of the machine that created this WARC resource, such as "123.2.3.4".
f) 'http-header-user-agent': the HTTP 'user-agent' header usually sent by the harvester along with each
request. Note that if 'request' records are used to save verbatim requests, this information is redundant.
(If a 'request' or 'metadata' record reports a different 'user-agent' for a specific request, the more
...









