Postal Services - Open Standard Interface - Address Data File Format for OCR/VCS Dictionary Generation

This document defines a file format for the generation of postal address directories. It is designed to hold all information necessary to support address reading software including data required for forwarding applications. In typical postal automation systems these files will be processed by directory generation software which creates application specific loadable data. This data - usually referred to as operational directory - is heavily compressed and contains access tables tailored for the specific reading software.
Not in the scope of this document are topics external to the file, such as compression, checksums, the interface for transmission to the supplier, modification permissions, error handling for inconsistent data and undo in updates.

Postalische Dienstleistungen - Offene Normschnittstelle - Adress Datei Format für die Generierung von Wörterbüchern in OCR/Videocodier-Systemen

1.1   Scope
This document specifies a file format for the generation of postal address directories. The file format is designed to hold all information required to support address reading software, including data for forwarding. In typical postal automation systems these files are processed by directory generation software, which produces application-specific, loadable data. This data, usually referred to as the operational directory, is heavily compressed and contains access tables tailored to the specific reading software.
Not in the scope of this document are topics external to the file, such as compression, checksums, the interface for transmission to the supplier, modification permissions, error handling for inconsistent data and undo functions in updates.
1.2   Purpose
The format was developed with the following requirements in mind:
- it must hold the following data:
  - addresses composed of address components (including aliases and range data);
  - person and organization names;
  - address codes, typically used as sort codes;
  - links between addresses, e.g. for forwarding;
- it should not restrict character encoding;
- it should be easily customizable for specific applications;
- it should allow complete as well as incremental updates, i.e. change-only data;
- it must be possible to split the data into multiple files for better handling.
The following concepts underlie this format:
- The format is based on XML.

Services posteaux - Interface de standard ouvert - Format de fichiers de données d'adresses pour la génération du dictionnaire OCR/VCS

Poštne storitve - Odprti standardni vmesnik - Datotečni format naslovnih podatkov za generiranje slovarja s pomočjo OCR/VCS (sistem za optično razpoznavanje znakov)

General Information

Status: Published
Publication Date: 06-Apr-2009
Technical Committee:
Current Stage: 6060 - National Implementation/Publication (Adopted Project)
Start Date: 02-Apr-2009
Due Date: 07-Jun-2009
Completion Date: 07-Apr-2009


Technical specification
TS CEN/TS 15873:2009
English language
27 pages

Standards Content (Sample)

SLOVENSKI STANDARD
SIST-TS CEN/TS 15873:2009
01-maj-2009
Poštne storitve - Odprti standardni vmesnik - Datotečni format naslovnih podatkov
za generiranje slovarja s pomočjo OCR/VCS (sistem za optično razpoznavanje
znakov)
Postal Services - Open Standard Interface - Address Data File Format for OCR/VCS
Dictionary Generation
Postalische Dienstleistungen - Offene Normschnittstelle - Adress Datei Format für die
Generierung von Wörterbüchern in OCR/Videocodier-Systemen
Services posteaux - Interface de standard ouvert - Format de fichiers de données
d'adresses pour la génération du dictionnaire OCR/VCS
Ta slovenski standard je istoveten z: CEN/TS 15873:2009
ICS:
03.240 Poštne storitve / Postal services
35.240.69 Uporabniške rešitve IT pri poštnih storitvah / IT applications in postal services
SIST-TS CEN/TS 15873:2009 en
2003-01.Slovenski inštitut za standardizacijo. Razmnoževanje celote ali delov tega standarda ni dovoljeno.

TECHNICAL SPECIFICATION
CEN/TS 15873
SPÉCIFICATION TECHNIQUE
TECHNISCHE SPEZIFIKATION
March 2009
ICS 03.240; 35.240.60
English Version
Postal Services - Open Standard Interface - Address Data File
Format for OCR/VCS Dictionary Generation
Services postaux - Interface de standard ouvert - Format de fichiers de données d'adresses pour la génération du dictionnaire OCR/VCS
Postalische Dienstleistungen - Offene Normschnittstelle - Adressdateiformat für die Generierung von Wörterbüchern in OCR/Videocodier-Systemen
This Technical Specification (CEN/TS) was approved by CEN on 1 March 2009 for provisional application.
The period of validity of this CEN/TS is limited initially to three years. After two years the members of CEN will be requested to submit their
comments, particularly on the question whether the CEN/TS can be converted into a European Standard.
CEN members are required to announce the existence of this CEN/TS in the same way as for an EN and to make the CEN/TS available
promptly at national level in an appropriate form. It is permissible to keep conflicting national standards in force (in parallel to the CEN/TS)
until the final decision about the possible conversion of the CEN/TS into an EN is reached.
CEN members are the national standards bodies of Austria, Belgium, Bulgaria, Cyprus, Czech Republic, Denmark, Estonia, Finland,
France, Germany, Greece, Hungary, Iceland, Ireland, Italy, Latvia, Lithuania, Luxembourg, Malta, Netherlands, Norway, Poland, Portugal,
Romania, Slovakia, Slovenia, Spain, Sweden, Switzerland and United Kingdom.
EUROPEAN COMMITTEE FOR STANDARDIZATION
COMITÉ EUROPÉEN DE NORMALISATION
EUROPÄISCHES KOMITEE FÜR NORMUNG
Management Centre: Avenue Marnix 17, B-1000 Brussels
© 2009 CEN All rights of exploitation in any form and by any means reserved worldwide for CEN national Members.
Ref. No. CEN/TS 15873:2009: E

CEN/TS 15873:2009 (E)
Contents
Foreword
1 Introduction
2 Scope and purpose
2.1 Scope
2.2 Purpose
3 Related Standards
3.1 UPU S42
4 Symbols and Abbreviations
5 XML Schema addressTree
5.1 , and
5.2 Address Tree in , and
5.3 Attributes for , and
5.4 String parts in , and
5.5 Ranges in , and
5.6 Aliases in , and
5.7 other XML files
5.8 Linking addresses via
5.9 Project specific part of the XML schema
6 XML Schema addressDeltaTree
6.1 Joining deltas via and file names
6.2 Update actions , and
7 Miscellaneous

Annex A
A.1 General XML Schema part
A.2 Example for a project specific XML Schema part
A.3 Initial addressTree Example
A.4 Update addressDeltaTree Example
A.5 Updated addressTree Example

Foreword
This document (CEN/TS 15873:2009) has been prepared by Technical Committee CEN/TC 331 “Postal
Services”, the secretariat of which is held by NEN.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. CEN [and/or CENELEC] shall not be held responsible for identifying any or all such patent rights.
According to the CEN/CENELEC Internal Regulations, the national standards organizations of the following
countries are bound to announce this Technical Specification: Austria, Belgium, Bulgaria, Cyprus, Czech
Republic, Denmark, Estonia, Finland, France, Germany, Greece, Hungary, Iceland, Ireland, Italy, Latvia,
Lithuania, Luxembourg, Malta, Netherlands, Norway, Poland, Portugal, Romania, Slovakia, Slovenia, Spain,
Sweden, Switzerland and the United Kingdom.
NOTE This document has been prepared by experts from CEN/TC 331 and UPU, in the framework of the Memorandum of
Understanding between UPU and CEN.

1 Introduction
In initial meetings of CEN/TC 331/WG 3, interfaces that would benefit from standardization were identified
and agreed on. Candidates for Open Interface standardization are:
- interface between the image handler and automatic address readers or video coding places;
- interface from machine control to Barcode Printers;
- interface from machine control to Barcode Reader / Verifier;
- interface between scanner, image handler and machine control;
- file format of Sort Plan;
- MIS Interface (Statistics);
- file format of Address data files.
The new intended standard deals with the file format of Address Data Files.
OCR results and video coder inputs have to be verified against the “real” existing addresses in order to reach
high recognition rates combined with low error rates. For that purpose postal operators provide postal address
directories to the OCR/VCS suppliers. Usually different postal operators use different file formats for these
(source) directories. In typical postal automation systems these files will be processed by directory generation
software which creates application specific loadable data. This data – usually referred to as “operational
directory” – is heavily compressed and contains access tables tailored for the specific reading software.
Usually different OCR/VCS suppliers use different operational directory formats.
This standard shall define a common Address Data File format for postal address directories to be provided
from the postal operators to the OCR/VCS suppliers.
This Address Data File format shall be designed to hold all information necessary to support address reading
and video coding software including data required for special recognition tasks e.g. forwarding applications.


2 Scope and purpose
2.1 Scope
This document defines a file format for the generation of postal address directories. It is designed to hold all
information necessary to support address reading software including data required for forwarding applications.
In typical postal automation systems these files will be processed by directory generation software which
creates application specific loadable data. This data – usually referred to as operational directory – is heavily
compressed and contains access tables tailored for the specific reading software.
Not in the scope of this document are topics external to the file, such as compression, checksums, the
interface for transmission to the supplier, modification permissions, error handling for inconsistent data and
undo in updates.
2.2 Purpose
The format has been designed with the following requirements in mind:
- must be able to hold the following data:
  - addresses composed of address components (including aliases and range-data);
  - person and organization names;
  - address codes typically used as sort codes;
  - links between addresses e.g. for use in forwarding;
- should not restrict character encoding;
- easily customizable for specific applications;
- should allow complete as well as incremental updates, i.e. change-only data;
- it must be possible to split data in multiple files for better handling.

The ideas behind this format are as follows:
- The format is based on XML.
- The basic XML structure is general. Project specifics are coded as attributes (the term project is used
throughout this document to describe a specific application such as address data for a specific country
or postal organization). This should make it easier to build project independent parsers and tools.
- Address data can be structured hierarchically. An address component appearing in many addresses
shall be written once, as the parent node of all addresses in which it is used in the XML address tree.
- Beyond the pure address data, there are general as well as optional project specific attributes at the level
of address components and string parts.
- To favour faster parsing and smaller file sizes, the names of XML elements that appear very
often are short strings.
- Semantics are defined only in a basic manner and have to be completed in the project specific tailoring
process. E.g. a street without numbers in the data may be interpreted as a street which has no numbers,
or as one where all numbers are valid. Users must therefore be aware that interoperability under this
Technical Specification may be limited to the specific project.

3 Related Standards
3.1 UPU S42
1) Beginning with version -5, UPU S42 is a two-part standard. Part a contains concepts and the
theoretical language description. Part b contains practical examples from different countries and may
be supplemented with new examples in the future.
2) UPU S42a defines the components an address is composed of, as well as postal entities which can be
“described” using these address components. The standard goes into great detail in defining a
globally usable set of specific address components such as “postcode”, “door”, ...
3) UPU S42b describes how to write an address given its constituent address components. It uses
templates to describe the order, line breaks, etc. The templates are country specific (US, Brazil,
England, ...) and also use a country-specific subset of the globally defined types.
4) UPU S42 address components are assumed to have a type and a string. They do not have
additional attributes and do not have aliases.
5) UPU S42 does not define a format for an individual address (a collection of address components) and
does not define a format for an address directory (a set of addresses).
6) UPU S42 has no concept of sort codes or forwarding information.

UPU S42 will not conflict with the format defined in this document, because it targets a completely different
application and type of information. The only thing it has in common with address data is the address-
component definitions themselves. These could be used in customizing the ADF for a specific project. UPU
S42’s excellent glossary should be reused where applicable.
4 Symbols and Abbreviations
XML eXtensible Markup Language
ADF Address Data File

5 XML Schema addressTree
The syntax is described as an XML schema, divided into a general and a project specific part. The general
part of the XML Schema defines the basic structures. It uses some types and attribute groups to be defined in
the project specific part. Basically the structure spans a tree of address components represented by XML
elements .
The general part of the schema is listed in section A.1. The project specific part is explained in section 5.9.
This document also contains another XML Schema, addressDeltaTree, explained in chapter 6.
The following Figure 1 shows the general structure of the XML schema. Since the project specific part does
not change the general structure the diagram is independent from any project specifics.

Figure 1 — General data structure of the XML schema
5.1 ,
and
Below the XML root element is one
and one section. The
stores a
string and optionally a list of global aliases. Aliases are described in section 5.6. A version string of
the data contained in the file is stored in the element.
5.2 Address Tree in , and
Addresses are stored in an address tree corresponding to the XML elements , and . is
the root node of this tree, address components stored in are the nodes and may be additional leaves.
An explanation of the element follows in section 5.8. One complete address corresponds to one leaf's root
path. Each node's root path identifies a partial address. Other XML elements and attributes carry additional
information for the address tree node. allows to split data into multiple files.
One address component in XML element holds a type mentioned in attribute tp and a string in child
elements or and . Attribute tp and other optional attributes for are described in section 5.3.
Other optional child elements are described in the following sections.
In this context an address component holds just a name with a type and does not necessarily describe a real
thing or place. Abstract data such as delivery point codes or sort codes may also be stored in address
components represented as XML elements .
Example: Some addresses in various formats:
In a table:
Country   City       Street       HNr
GERMANY   BERLIN
GERMANY   KONSTANZ   BUECKLESTR   1
GERMANY   KONSTANZ   BUECKLESTR   2
GERMANY   KONSTANZ   BUECKLESTR   3
GERMANY   KONSTANZ   BUECKLESTR   4
As address tree:
GERMANY (Country)
├── BERLIN (City)
└── KONSTANZ (City)
    └── BUECKLESTR (Street)
        ├── 1 (HNr)
        ├── 2 (HNr)
        ├── 3 (HNr)
        └── 4 (HNr)

Figure 2 — Addresses formatted as an address tree
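The relationship between the address table and the address tree can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names build_tree and leaf_paths are hypothetical, nodes are keyed by their string alone, and the type attribute tp is ignored.

```python
# Rows of the address table above; each tuple is one root path.
ROWS = [
    ("GERMANY", "BERLIN"),
    ("GERMANY", "KONSTANZ", "BUECKLESTR", "1"),
    ("GERMANY", "KONSTANZ", "BUECKLESTR", "2"),
    ("GERMANY", "KONSTANZ", "BUECKLESTR", "3"),
    ("GERMANY", "KONSTANZ", "BUECKLESTR", "4"),
]


def build_tree(rows):
    """Merge table rows into a nested dict so shared components appear once."""
    root = {}
    for row in rows:
        node = root
        for component in row:
            # Reuse the existing child node or create an empty one.
            node = node.setdefault(component, {})
    return root


def leaf_paths(tree, prefix=()):
    """Yield the root path of every leaf, i.e. every complete address."""
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            yield from leaf_paths(children, path)
        else:
            yield path


tree = build_tree(ROWS)
addresses = list(leaf_paths(tree))
```

In this sketch GERMANY, KONSTANZ and BUECKLESTR are each stored once, and walking the leaves gives back the five rows of the table.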
Address tree representation in XML:

    xsi:noNamespaceSchemaLocation="xml_projectAddressTree.xsd">
 

  0
 

 
  GERMANY
   BERLIN
   
   KONSTANZ
    BUECKLESTR
     1
     2
     3
     4
    
   
  
 


The following verbose XML representation of this example is more like the address table. Each address is
written as a child path of the element. Both XML representations are equivalent. Usage of the short
form as shown above is strongly recommended, due to less file size overhead and less risk of duplicating
errors.

    xsi:noNamespaceSchemaLocation="xml_projectAddressTree.xsd">
 

  0
 

 
  GERMANY
   BERLIN
   
  
  GERMANY
   KONSTANZ
    BUECKLESTR
     1
  
  GERMANY
   KONSTANZ
    BUECKLESTR
     2
  
  GERMANY
   KONSTANZ
    BUECKLESTR
     3
  
  GERMANY
   KONSTANZ
    BUECKLESTR
     4
  
 



5.3 Attributes for ,
and
To begin with, two attributes are available for certain XML elements, as noted in each description:
tp Type of the address component for XML elements and . Type names are to be defined
in the project specific part of the schema as enumeration in the xs:simpleType named typeAttrType.
Type names shall be short strings in order to reduce data amount.
id Identification code for XML elements and . See section 5.8 for details. The usage is
required if updates on basis of the XML schema addressDeltaTree as explained in chapter 6 should
take place. The id attribute allows distinguishing between different address entries with the same
spelling, if necessary.
The following general attributes are optional for , and
. Additionally attributes defined in the
project specific part of the schema as xs:attributeGroup named element_projectAttribGroup are available for
these XML elements also.
rg Indicates that the entry is a range. See section 5.5 for range details.
lang This attribute holds an xs:language string identifying a certain language. In multilingual countries,
this allows spellings in different languages to be distinguished.
ofl Boolean attribute indicating that the entry is the official name.
old Boolean attribute to mark an old name no longer valid but still commonly used.

Examples for general attribute usage. XML element is introduced in section 5.6:
...
 FLORENZ
     FIRENZE
 
...
   Siemens AG
             Siemens
         Siemens Dematic AG
         AEG Electrocom
   


5.4 String parts in , and
Text information is stored as strings in XML elements . Ranges explained in section 5.5 may use and
instead. As the type xs:token defines, only strings without control characters and single spaces as word
separators are allowed. The general attributes are listed here:
join Usually multiple string parts are combined with a word separating space character in between. If this
boolean attribute is true (=1) the actual string part is joined to the preceding string part.
Example: Three times the string “KONSTANZ AM BODENSEE”:
KONSTANZ AM BODENSEE
KONSTANZAMBODENSEE
KONSTANZ AM BODENSEE
opt This boolean attribute marks optional parts of strings.
Examples:
SIEMENSAG
WERNER VONSIEMENS STR
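The effect of the join and opt attributes can be sketched in Python. This is a minimal illustration, not part of the specification: the class StringPart and the function assemble are hypothetical names, and the handling of a joined part that follows a dropped optional part is an assumption the text does not settle.

```python
from dataclasses import dataclass


@dataclass
class StringPart:
    text: str
    join: bool = False  # True: append directly to the preceding part
    opt: bool = False   # True: this part may be left out


def assemble(parts, include_optional=True):
    """Combine string parts into one string.

    Parts are separated by a single space unless join is set.
    Optional parts are dropped when include_optional is False.
    """
    result = ""
    for part in parts:
        if part.opt and not include_optional:
            continue
        if result and not part.join:
            result += " "
        result += part.text
    return result
```

For example, three plain parts KONSTANZ, AM and BODENSEE are combined with separating spaces, while setting join on the second part would glue it to the first.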

Additional project-specific attributes for string parts can be defined as an xs:attributeGroup named
string_projectAttribGroup in the project specific part of the schema.
Note that the recommended way to split strings is the insertion of one whitespace character. Multiple
elements should be used only for applying attributes to parts of strings.
5.5 Ranges in , and
A range is indicated by the attribute rg in the element with one of the values:
“a” All values in the sequence from the first to the last value are included
“e” Even values only
“o” Odd values only
There are two forms. In the first form the string in carries the first and the last value separated by a dash
character (hyphen-minus: '-'). If the dash is missing, the whole value is taken as the first value. If there is
more than one dash, the first one is taken as separator. The string may be split into more than one XML
element. In other words: if the rg attribute is set, the string is interpreted as a range in any case. Due to better
readability and less file size overhead, this form should be preferred, if sufficient.
The second form uses for the first value and for the last value. The rg attribute value in the
version is by default “a” and may be omitted. and may occur more than once, allowing to specify
different attributes to the string parts just like defined for explained in section 5.4. Both the general and
project specific attributes of may be applied also in and . The first (last) range value is always the
concatenation of all () values.
The first, last or both values may be omitted. Missing values will be defaulted by project specific minimal and
maximal values.
The semantics of non numeric ranges have to be clarified project specifically.
Note that ranges are just a short form and the range entries could be enumerated as several children in the
address tree also. See house numbers 1-4 in the examples of section 5.2. Enumerations like this are the
recommended way for sets of entries beyond the simple semantics of the ranges explained here.
Examples:
In the following examples the project specific default minimal value is 1, maximum 999.
Some representations of the complete HNr range 1-999 follow, illustrating the two forms and the possibility
of omitting information. The first example is nevertheless recommended as the standard notation for such
a range.
  1-999
  1-
  1
  -999
  -
  
  
  1999
  1999
  1
  999

Note that
....1-999
does not represent a range from 1 through 999, but only the string "1-999".
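The first range form can be sketched in Python as follows. This is an illustration only: the function names split_range and expand_numeric_range are hypothetical, the defaults 1 and 999 are the project specific values assumed in the examples above, and only numeric ranges are handled because the semantics of non numeric ranges are left to the project.

```python
def split_range(text, default_first="1", default_last="999"):
    """Split a range string at the first dash.

    A missing dash means the whole string is the first value.
    Missing first or last values fall back to the project defaults.
    """
    first, _sep, last = text.partition("-")
    return (first or default_first, last or default_last)


def expand_numeric_range(text, rg="a", default_first="1", default_last="999"):
    """Enumerate a numeric range for rg values 'a' (all), 'e' (even), 'o' (odd)."""
    if rg not in ("a", "e", "o"):
        raise ValueError("rg must be 'a', 'e' or 'o'")
    first, last = split_range(text, default_first, default_last)
    values = range(int(first), int(last) + 1)
    if rg == "e":
        return [v for v in values if v % 2 == 0]
    if rg == "o":
        return [v for v in values if v % 2 == 1]
    return list(values)
```

Under these assumptions the strings "1-999", "1-", "1", "-999" and "-" all split into the same pair of first and last values, which matches the equivalent notations listed above.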
In the following examples for non numeric ranges the concrete contents of the ranges have to be clarified
project specifically:
  10A-16C
  11A-15C

Two times the range 97X-107X, where X is optional.
  97X
           -107X
  97X
        107X

The first and last values in the following duration range example are written as xs:dateTime values, which
contain dash characters. For this reason the only way to write it is to use and , since using
and a dash would lead to a weird range from “2006” to “09-05T00:00:00-2006-09-06T23:59:59”:
  
   2006-09-05T00:00:00
   2006-09-06T23:59:59
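The pitfall can be reproduced with a short Python sketch, assuming the single-string form were used and split at its first dash as the range rules require. The variable names are illustrative only.

```python
# The two xs:dateTime values written as one dash-separated string.
value = "2006-09-05T00:00:00-2006-09-06T23:59:59"

# The first dash is taken as the separator, so the split lands inside the date.
first, _sep, last = value.partition("-")
```

Here first holds only the year and last holds the remainder of both timestamps, which is why the two-element form is needed for such values.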
  


5.6 Aliases in ,
and
An alias gives a variant or an abbreviation of an address component's spelling. There are two kinds:
Local aliases are mentioned directly inside of XML element .
Global aliases are defined in the header of the XML file in elements and are valid for the whole XML
file. Such an alias is applied to address components if the types in and are equal and the
string for is equal to the first
or element below . Applying such an alias, leads to
adding all variants
and starting with the second one to the fitting address component in .
The usual alias element
stores the string in or and elements in the same manner as described
for above. A short form in element holds the string directly and cannot specify attributes on string parts.
Alias attributes are explained in section 5.3, attributes on string parts in section 5.4. Note that the type
attribute tp is not to be mentioned in
or , but in the parent element or .
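The matching rule for global aliases described above can be sketched in Python. This is a simplified model, not the schema itself: the classes Component and GlobalAlias, the function apply_global_aliases and the type value "street" are hypothetical, and skipping variants that are already present is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    tp: str                                        # type of the address component
    name: str                                      # primary spelling
    aliases: list = field(default_factory=list)    # local aliases
    children: list = field(default_factory=list)   # child components


@dataclass
class GlobalAlias:
    tp: str
    variants: list  # the first variant is the spelling to match


def apply_global_aliases(node, global_aliases):
    """Add variants 2..n of every matching global alias to the tree nodes."""
    for alias in global_aliases:
        matches = (
            alias.variants
            and alias.tp == node.tp
            and alias.variants[0] == node.name
        )
        if matches:
            for variant in alias.variants[1:]:
                if variant not in node.aliases:
                    node.aliases.append(variant)
    for child in node.children:
        apply_global_aliases(child, global_aliases)
```

With a global alias of type "street" whose variants are "AY Str" and "A STREET", every street component spelled "AY Str" gains the alias "A STREET", while components with other spellings or types are left untouched.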
Example:

    xsi:noNamespaceSchemaLocation="xml_projectAddressTree.xsd">
 

  0
        
   AY Str
   A STREET
  
 

 
  AY Str
  
  
  BEE Str
   B Str     
  
  SEE Str
   SEE DRIVE     
   
A BC Str
  
 


Another example can frequently be found in holiday regions, where houses tend to have names:

    xsi:noNamespaceSchemaLocation="xml_projectAddressTree.xsd">
 

  0
 

 
  
   Bergstraße
   
   2
   Haus Gletscherblick
   
  
 


5.7 other XML files
All address data in one file may lead to handling problems with one very large file. So a method to split data
into several files using the element is provided here. This element contains the filename of another
XML file. All XML files should be placed into the same directory and pathnames should not be used in the
string of the element. All address data is expected to be reachable by walking recursively through
the include hierarchy, starting with a master file.
The address tree of an included XML file is added as a subtree to the tree node in which the element is
mentioned. The partial address data identified by the root path of the element in the address
tree is therefore not to be repeated in the included file.
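Walking the include hierarchy from a master file can be sketched in Python. To avoid guessing at the XML syntax, this sketch takes the include relations as a plain mapping from file name to included file names. The function name walk_includes and the file names in the usage are hypothetical, and the guard against files that include each other is an added assumption, not something the text requires.

```python
def walk_includes(filename, includes_of, seen=None):
    """Yield every file reachable from filename, depth-first.

    includes_of maps a file name to the list of file names it includes.
    """
    if seen is None:
        seen = set()
    if filename in seen:
        return  # assumed guard: visit each file only once
    seen.add(filename)
    yield filename
    for included in includes_of.get(filename, []):
        # All files share one directory, so names must not carry a path.
        if "/" in included or "\\" in included:
            raise ValueError("path names are not expected: " + included)
        yield from walk_includes(included, includes_of, seen)


includes = {
    "master.xml": ["germany.xml", "france.xml"],
    "germany.xml": ["konstanz.xml"],
}
files = list(walk_includes("master.xml", includes))
```

A real reader would attach each included tree as a subtree at the node that carries the include, which this sketch leaves out.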
The strin
...
