SIST HD 592 S1:1997
Binary floating-point arithmetic for microprocessor systems
Describes systems which may be applied for binary floating-point arithmetic of microprocessors. Applies to any software and hardware.
Binäre Gleitpunkt-Arithmetik für Mikroprozessor-Systeme
Arithmétique binaire en virgule flottante pour systèmes à microprocesseurs
Defines, for new microprocessor systems, how binary floating-point arithmetic is to be handled in software, in hardware, or in any combination of the two. Note: For the price of this publication, please consult the ISO/IEC price-code list.
Binarna aritmetika s plavajočo vejico za mikroprocesorske sisteme (IEC 60559:1989)
SLOVENIAN STANDARD
SIST HD 592 S1:1997
01-August-1997
Binarna aritmetika s plavajočo vejico za mikroprocesorske sisteme (IEC 60559:1989)
Binary floating-point arithmetic for microprocessor systems
Binäre Gleitpunkt-Arithmetik für Mikroprozessor-Systeme
Arithmétique binaire en virgule flottante pour systèmes à microprocesseurs
This Slovenian standard is identical to: HD 592 S1:1991
ICS:
35.160 Microprocessor systems
SIST HD 592 S1:1997 en
2003-01. Slovenian Institute for Standardization. Reproduction of this standard in whole or in part is not permitted.
INTERNATIONAL STANDARD
IEC 60559
Second edition
1989-01
Binary floating-point arithmetic
for microprocessor systems
© IEC 1989. Copyright - all rights reserved
No part of this publication may be reproduced or utilized in
any form or by any means, electronic or mechanical,
including photocopying and microfilm, without permission in
writing from the publisher.
International Electrotechnical Commission, 3, rue de Varembé, Geneva, Switzerland
Telefax: +41 22 919 0300   e-mail: inmail@iec.ch   IEC web site: http://www.iec.ch
PRICE CODE S
For price, see current catalogue
CONTENTS
FOREWORD
PREFACE
Clause
1. Scope
   1.1 Implementation objectives
   1.2 Inclusions
   1.3 Exclusions
2. Definitions
3. Formats
   3.1 Sets of values
   3.2 Basic formats
   3.3 Extended formats
   3.4 Combinations of formats
4. Rounding
   4.1 Round to nearest
   4.2 Directed roundings
   4.3 Rounding precision
5. Operations
   5.1 Arithmetic
   5.2 Square root
   5.3 Floating-point format conversions
   5.4 Conversions between floating-point and integer
   5.5 Round floating-point number to integral value
   5.6 Binary ↔ decimal conversion
   5.7 Comparison
6. Infinity, NaNs and signed zero
   6.1 Infinity arithmetic
   6.2 Operations with NaNs
   6.3 The sign bit
7. Exceptions
   7.1 Invalid operations
   7.2 Division by zero
   7.3 Overflow
   7.4 Underflow
   7.5 Inexact
8. Traps
   8.1 Trap handler
   8.2 Precedence
APPENDIX A - Recommended functions and predicates
INTERNATIONAL ELECTROTECHNICAL COMMISSION
BINARY FLOATING-POINT ARITHMETIC
FOR MICROPROCESSOR SYSTEMS
FOREWORD
1) The formal decisions or agreements of the IEC on technical matters,
prepared by Technical Committees on which all the National Committees
having a special interest therein are represented, express, as nearly
as possible, an international consensus of opinion on the subjects
dealt with.
2) They have the form of recommendations for international use and they
are accepted by the National Committees in that sense.
3) In order to promote international unification, the IEC expresses the
wish that all National Committees should adopt the text of the IEC
recommendation for their national rules in so far as national
conditions will permit. Any divergence between the IEC recommendation
and the corresponding national rules should, as far as possible, be
clearly indicated in the latter.
PREFACE
This standard has been prepared by Sub-Committee 47B: Microprocessor
systems, of IEC Technical Committee No. 47: Semiconductor devices. (This
Sub-Committee has been taken over by ISO/IEC JTC 1.)
This second edition of IEC Publication 559 replaces the first edition
issued in 1982.
The text of this standard is based on the following documents:
Six Months' Rule    Report on Voting
47B(CO)19           47B(CO)26
Full information on the voting for the approval of this standard can be
found in the Voting Report indicated in the above table.
BINARY FLOATING-POINT ARITHMETIC
FOR MICROPROCESSOR SYSTEMS
1. Scope
1.1 Implementation objectives
It is intended that an implementation of a floating-point system
conforming to this standard can be realized entirely in software,
entirely in hardware, or in any combination of software and hardware.
It is the environment that the programmer or user of the system sees
that conforms or fails to conform to this standard. Hardware
components that require software support to conform shall not be said
to conform apart from such software.
1.2 Inclusions
This standard specifies:
1) basic and extended floating-point number formats;
2) add, subtract, multiply, divide, square root, remainder and
compare operations;
3) conversions between integer and floating-point numbers;
4) conversions between different floating-point formats;
5) conversions between basic format floating-point numbers and
decimal strings, and
6) floating-point exceptions and their handling, including
non-numbers (NaNs).
1.3 Exclusions
This standard does not specify:
1) formats of decimal strings and integers;
2) interpretation of the sign and significand fields of NaNs, or
3) binary ↔ decimal conversions to and from extended formats.
2. Definitions
Biased exponent
The sum of the exponent and a constant (bias) chosen to make the
biased exponent's range non-negative.
Binary floating-point number
A bit-string characterized by three components: a sign, a signed
exponent, and a significand. Its numerical value, if any, is the signed
product of its significand and two raised to the power of its
exponent. In this standard a bit-string is not always distinguished
from a number it may represent.
Denormalized number
A nonzero floating-point number, the exponent of which has a
reserved value, usually the format's minimum, and the explicit or
implicit leading significand bit of which is zero.
Destination
The location for the result of a binary or unary operation. The
destination may be either explicitly designated by the user or implicitly
supplied by the system (e.g. intermediate results in sub-expressions
or arguments for procedures). Some languages place the results of
intermediate calculations in destinations beyond the user's control.
Nonetheless, this standard defines the result of an operation in terms
of that destination's format as well as the operands' values.
Exponent
The component of a binary floating-point number that normally
signifies the integer power to which two is raised in determining the
value of the represented number. Occasionally the exponent is called
the signed or unbiased exponent.
Fraction
The field of the significand that lies to the right of its implied
binary point.
Mode
A variable that a user may set, sense, save and restore, to control
the execution of subsequent arithmetic operations. The default mode is
the mode that a program can assume to be in effect unless an
explicitly contrary statement is included either in the program or in its
specification.
The following modes shall be implemented:
1) rounding, to control the direction of rounding errors, and, in
certain implementations,
2) rounding precision, to shorten the precision of results.
The implementor may, at his option, implement the following modes:
3) traps disabled/enabled, to handle exceptions.
NaN
Not a number; a symbolic entity encoded in floating-point format.
There are two types of NaNs (see 6.2). Signalling NaNs signal the
invalid operation exception (see 7.1) whenever they appear as
operands. Quiet NaNs propagate through almost every arithmetic
operation without signalling exceptions.
Result
The bit-string (usually representing a number) that is delivered to
the destination.
Significand
The component of a binary floating-point number which consists of
an explicit or implicit leading bit to the left of its implied binary point
and a fraction field to the right.
Shall
The word "shall" signifies that which is obligatory in any conforming
implementation.
Should
The word "should" signifies that which is strongly recommended as
being in keeping with the intent of the standard, although
architectural or other constraints beyond the scope of this standard
may, on occasion, render the recommendations impractical.
Status flag
A variable that may take two states, set and clear. A user may clear
a flag, copy it, or restore it to a previous state. When set, a status
flag may contain additional system-dependent information, possibly
inaccessible to some users. The operations of this standard may, as a
side-effect, set some of the following flags: inexact result, underflow,
overflow, divide by zero and invalid operation.
User
Any person, hardware, or program not itself specified by this
standard, having access to and controlling those operations of the
programming environment specified in this standard.
3. Formats
This standard defines four floating-point formats in two groups,
basic and extended, each having two widths, single and double. The
standard levels of implementation are distinguished by the combinations
of formats supported.
3.1 Sets of values
This sub-clause concerns only the numerical values representable
within a format, not the encodings which are the subject of the
following sub-clauses. The only values representable in a chosen
format are those specified via the following three integer parameters:
p = number of significant bits (precision)
$E_{max}$ = maximum exponent, and
$E_{min}$ = minimum exponent
Each format's parameters are displayed in Table 1. Within each
format just the following entities shall be provided:
Numbers of the form $(-1)^s \, 2^E \, (b_0 . b_1 b_2 \ldots b_{p-1})$
where:
s is 0 or 1;
E is any integer between $E_{min}$ and $E_{max}$ inclusive, and each $b_i$ is 0
or 1.
Two infinities, +∞ and -∞;
at least one signalling NaN, and
at least one quiet NaN.
Table 1 - Summary of format parameters

                                          Format
Parameter                Single    Single Extended    Double    Double Extended
p                        24        ≥ 32               53        ≥ 64
E_max                    +127      ≥ +1 023           +1 023    ≥ +16 383
E_min                    -126      ≤ -1 022           -1 022    ≤ -16 382
Exponent bias            +127      Unspecified        +1 023    Unspecified
Exponent width (bits)    8         ≥ 11               11        ≥ 15
Format width (bits)      32        ≥ 43               64        ≥ 79
The foregoing description enumerates some values redundantly, for
example:
$2^0 (1.0) = 2^1 (0.1) = 2^2 (0.01) = \ldots$
However, the encodings of such nonzero values may be redundant
only in extended formats (see 3.3). The nonzero values of the form
$\pm 2^{E_{min}} (0 . b_1 b_2 \ldots b_{p-1})$ are called denormalized. Reserved exponents
may be used to encode NaNs, ±∞, ±0, and denormalized numbers. For
any variable that has the value zero, the sign bit s provides an extra
bit of information. Although all formats have distinct representations
for +0 and -0, the signs are significant in some circumstances, like
division by zero, and not in others. In this standard 0 and ∞ are
written without a sign when the sign does not matter.
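As an illustration of how the three parameters of Table 1 fix a format's value set, the following Python sketch derives the largest finite value, the smallest normalized value and the smallest denormalized value from p, E_max and E_min, then compares the double-format results with what the host reports. It assumes that the host's `float` is the double format of this standard, which is common but not guaranteed; the names `FORMATS` and `limits` are illustrative only.

```python
import sys
from fractions import Fraction

# (p, E_max, E_min) for the two basic formats, taken from Table 1.
FORMATS = {"single": (24, 127, -126), "double": (53, 1023, -1022)}

def limits(p, e_max, e_min):
    """Return (largest finite, smallest normalized, smallest denormalized)."""
    two = Fraction(2)
    largest = two ** e_max * (2 - two ** (1 - p))   # significand 1.11...1
    smallest_normal = two ** e_min                  # significand 1.00...0
    smallest_denormal = two ** (e_min - (p - 1))    # significand 0.00...1
    return largest, smallest_normal, smallest_denormal

largest, smallest_normal, smallest_denormal = limits(*FORMATS["double"])
# Assumption: the host float is the double format, so these should match.
print(float(largest) == sys.float_info.max)
print(float(smallest_normal) == sys.float_info.min)
print(float(smallest_denormal))
```

Exact rational arithmetic is used so that the derived limits are not themselves affected by rounding.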
3.2 Basic formats
Numbers in the single and double formats are composed of three
fields:
a 1-bit sign s,
a biased exponent e = E + bias, and
a fraction $f = . b_1 b_2 \ldots b_{p-1}$.
The range of the unbiased exponent E shall include every integer
between two values $E_{min}$ and $E_{max}$ inclusive, and also two other
reserved values: $E_{min} - 1$ to encode ±0 and denormalized numbers, and
$E_{max} + 1$ to encode ±∞ and NaNs. The foregoing parameters appear in
Table 1. Each nonzero numerical value has just one encoding. The
fields are interpreted as follows:
3.2.1 Single
A 32-bit single format number X is divided as shown in Figure 1.
The value v of X is inferred from its constituent fields thus:
1) If e = 255 and f ≠ 0, then v is a NaN regardless of s
2) If e = 255 and f = 0, then $v = (-1)^s \infty$
3) If 0 < e < 255, then $v = (-1)^s \, 2^{e-127} \, (1.f)$
4) If e = 0 and f ≠ 0, then $v = (-1)^s \, 2^{-126} \, (0.f)$ (denormalized
numbers)
5) If e = 0 and f = 0, then $v = (-1)^s \, 0$ (zero)

widths:   1    8              23
fields:   s    e              f
order:         msb ... lsb    msb ... lsb

"msb" means "most significant bit"
"lsb" means "least significant bit"
Figure 1 - Single format
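The five rules for the single format can be sketched in Python as follows. The function name `decode_single` is hypothetical, and the cross-check against `struct` assumes that the host stores single-format numbers as laid out in Figure 1. Every single-format value is exactly representable as a Python `float` on hosts where `float` is the double format, so no rounding is involved.

```python
import math
import struct

def decode_single(bits):
    """Interpret a 32-bit integer as a single-format number (rules 1 to 5)."""
    s = bits >> 31              # 1-bit sign
    e = (bits >> 23) & 0xFF     # 8-bit biased exponent
    f = bits & 0x7FFFFF         # 23-bit fraction
    sign = -1.0 if s else 1.0
    if e == 255:
        return math.nan if f else sign * math.inf            # rules 1 and 2
    if e == 0:
        return sign * math.ldexp(f, -126 - 23)                # rules 4 and 5: 0.f
    return sign * math.ldexp((1 << 23) | f, e - 127 - 23)     # rule 3: 1.f

# Cross-check against the host's own decoding of the same bit patterns.
for bits in (0x3F800000, 0xC0490FDB, 0x00000001, 0x80000000, 0x7F800000):
    host = struct.unpack(">f", bits.to_bytes(4, "big"))[0]
    print(f"{bits:#010x}", decode_single(bits), host)
```

The fraction is treated as a 23-bit integer and the binary point is restored by subtracting 23 from the power of two, which avoids any fractional arithmetic.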
3.2.2 Double
A 64-bit double format number X is divided as shown in Figure 2.
The value v of X is inferred from its constituent fields thus:
1) If e = 2 047 and f ≠ 0, then v is a NaN regardless of s
2) If e = 2 047 and f = 0, then $v = (-1)^s \infty$
3) If 0 < e < 2 047, then $v = (-1)^s \, 2^{e-1023} \, (1.f)$
4) If e = 0 and f ≠ 0, then $v = (-1)^s \, 2^{-1022} \, (0.f)$ (denormalized
numbers)
5) If e = 0 and f = 0, then $v = (-1)^s \, 0$ (zero)

widths:   1    11             52
fields:   s    e              f
order:         msb ... lsb    msb ... lsb

Figure 2 - Double format
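Going the other way, a host number can be split into the three fields of Figure 2. This is a minimal sketch that assumes the Python `float` is stored in the double format; the helper name `double_fields` is illustrative.

```python
import struct

def double_fields(x):
    """Split a Python float into the (s, e, f) fields of Figure 2."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    s = bits >> 63                  # 1-bit sign
    e = (bits >> 52) & 0x7FF        # 11-bit biased exponent
    f = bits & ((1 << 52) - 1)      # 52-bit fraction
    return s, e, f

print(double_fields(1.0))            # (0, 1023, 0)
print(double_fields(-2.0))           # (1, 1024, 0)
print(double_fields(5e-324))         # (0, 0, 1): smallest denormalized number
print(double_fields(float("inf")))   # (0, 2047, 0)
```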
3.3 Extended formats
The single extended and double extended formats encode in an
implementation-dependent way the sets of values in 3.1 subject to the
constraints of Table 1. This standard allows an implementation to
encode some values redundantly, provided that redundancy is
transparent to the user in the following sense: an implementation shall
either encode every nonzero value uniquely or not distinguish
redundant encodings of nonzero values. An implementation may also
reserve some bit strings for purposes beyond the scope of this
standard; when such a reserved bit string occurs as an operand the
result is not specified by this standard.
An implementation of this standard is not required to provide (and
the user should not assume) that single extended formats have greater
range than double extended formats.
3.4 Combinations of formats
All implementations conforming to this standard shall support the
single format. Implementations should support the extended format
corresponding to the widest basic format supported, and need not
support any other extended format.*
* Only if upward compatibility and speed are important issues should a
system supporting the double extended format also support the single
extended format.
4. Rounding
Rounding takes a number regarded as infinitely precise and, if
necessary, modifies it to fit in the destination's format while signalling
the inexact exception (see 7.5). Except for binary ↔ decimal
conversion (the weaker conditions of which are specified in 5.6),
every operation specified in clause 5 shall be performed as if it first
produced an intermediate result correct to infinite precision and with
unbounded range, and then rounded that result according to one of
the modes in this clause.
The rounding modes affect all arithmetic operations except
comparison and remainder. The rounding modes may affect the signs of
zero sums (see 6.3), and do affect the threshold beyond which overflow
(see 7.3) and underflow (see 7.4) may be signalled.
4.1 Round to nearest
An implementation of this standard shall provide round to nearest as
the default rounding mode. In this mode, the representable value
nearest to the infinitely precise result shall be delivered; if the two
nearest representable values are equally near, the one with its least
significant bit equal to zero shall be delivered. However, an infinitely
precise result with magnitude at least $2^{E_{max}} (2 - 2^{-p})$ shall round to
∞ with no change in sign; here $E_{max}$ and p are determined by the
destination format (clause 3) unless overridden by a rounding
precision mode (see 4.3).
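Both rules of round to nearest can be observed with integer-to-float conversion in Python. The sketch below assumes that `float` is the double format (p = 53, E_max = 1023) and relies on CPython rounding this conversion to nearest with ties to even. Python raises `OverflowError` where this standard would deliver an infinity, so the overflow threshold appears here as an exception rather than as +∞.

```python
import sys

# 2**53 + 1 lies exactly halfway between the doubles 2**53 and 2**53 + 2.
# The tie goes to the neighbour whose least significant bit is zero.
print(float(2**53 + 1) == 2**53)        # True
print(float(2**53 + 3) == 2**53 + 4)    # True

# Overflow threshold for the double format: 2**E_max * (2 - 2**-p).
threshold = 2**1024 - 2**970
print(float(threshold - 1) == sys.float_info.max)   # just below: largest finite
try:
    float(threshold)
except OverflowError:
    # The standard would deliver +infinity here; Python raises instead.
    print("at the threshold the conversion overflows")
```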
4.2 Directed roundings
An implementation shall also provide three user-selectable directed
rounding modes: round toward +∞, round toward -∞, and round
toward 0.
When rounding toward +∞, the result shall be the format's value
(possibly +∞) closest to and no less than the infinitely precise result.
When rounding toward -∞, the result shall be the format's value
(possibly -∞) closest to and no greater than the infinitely precise
result.
When rounding toward 0, the result shall be the format's value
closest to and no greater in magnitude than the infinitely precise
result.
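Python does not expose the binary rounding modes, so the following sketch emulates the three directed roundings in software for an exact rational input. It is an illustration under stated assumptions rather than a conforming implementation: it needs Python 3.9 or later for `math.nextafter`, it assumes integer true division is rounded to nearest (as in CPython), and it handles only finite, in-range values. The function name `round_directed` and its mode strings are hypothetical.

```python
import math
from fractions import Fraction

def round_directed(x, mode):
    """Round the exact Fraction x to a double; mode is 'up', 'down' or 'zero'."""
    if mode == "zero":
        mode = "down" if x > 0 else "up"   # toward 0 is 'down' for positive x
    y = x.numerator / x.denominator        # nearest double (CPython behaviour)
    if mode == "up" and Fraction(y) < x:
        y = math.nextafter(y, math.inf)    # step to the next double above
    elif mode == "down" and Fraction(y) > x:
        y = math.nextafter(y, -math.inf)   # step to the next double below
    return y

tenth = Fraction(1, 10)
for mode in ("up", "down", "zero"):
    print(mode, repr(round_directed(tenth, mode)), repr(round_directed(-tenth, mode)))
```

Starting from the nearest value and stepping at most one neighbour works because the nearest representable value is always one of the two values that bracket the exact result.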
4.3 Rounding precision
Normally a result is rounded to the precision of its destination.
However, some systems deliver results only to double or extended
destinations. On such a system the user, which may be a high-level
language compiler, shall be able to specify that a result be rounded
instead to single precision, though it may be stored in the double or
extended format with its wider exponent range.* Similarly, a system
that delivers results only to double extended destinations shall permit
the user to specify rounding to single or double precision. Note that
to meet the specifications in 4.1, the result cannot suffer more than
one rounding error.
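The requirement that a result suffer at most one rounding error can be illustrated with exact rational arithmetic. In the sketch below, `round_nearest` is a hypothetical helper that rounds a positive value to p significant bits with ties to even, assuming an unbounded exponent range. The chosen value rounds differently when it is first rounded to double precision (p = 53) and then to single precision (p = 24) than when it is rounded to single precision once.

```python
from fractions import Fraction

def round_nearest(x, p):
    """Round a positive Fraction to p significant bits, ties to even."""
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if x < Fraction(2) ** e:
        e -= 1                          # now 2**e <= x < 2**(e + 1)
    ulp = Fraction(2) ** (e - p + 1)    # spacing of p-bit values near x
    return round(x / ulp) * ulp         # round() on a Fraction ties to even

x = 1 + Fraction(1, 2**24) + Fraction(1, 2**60)

once = round_nearest(x, 24)                       # one rounding, to single
twice = round_nearest(round_nearest(x, 53), 24)   # double first, then single
print(once == 1 + Fraction(1, 2**23))   # True: x lies just above the halfway point
print(twice == 1)                       # True: the first rounding created a tie
```

The first rounding discards the small term that placed x above the halfway point, so the second rounding sees an exact tie and resolves it to the even neighbour.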
5. Operations
All conforming implementations of this standard shall provide
operations to add, subtract, multiply, divide, extract the square root, find
the remainder, round to integer in floating-point format, convert
between different floating-point formats, convert between floating-point
and integer formats, convert binary ↔ decimal, and compare. Whether
copying without change of format is considered as an operation is an
implementation option. Except for binary ↔ decimal conversion, each of
the operations shall be performed as if it first produced an
intermediate result correct to infinite precision and with unbounded range,
and then coerced this intermediate result to fit in the destination's
format (clauses 4 and 7). Clause 6 augments the following
specifications to cover ±0, ±∞, and NaNs; clause 7 enumerates exceptions
caused by exceptional operands and exceptional results.
5.1 Arithmetic
An implementation shall provide the add, subtract, multiply, divide,
and remainder operations for any two operands of the same format, for
each supported format; it should also provide the operations for
operands of differing formats. The destination format (regardless of
the rounding precision control of 4.3) shall be at least as wide as the
wider operand's format. All results shall be rounded as specified in
clause 4.
When y ≠ 0, the remainder r = x REM y is defined regardless of the
rounding mode by the mathematical relation $r = x - y \times n$, where n is
the integer nearest the exact value x/y; whenever $|n - x/y| = 1/2$, then
n is even. Thus, the remainder is always exact. If r = 0, its sign
shall be that of x. Precision control (see 4.3) shall not apply to the
remainder operation.
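The definition of REM can be followed step by step with exact rational arithmetic. The sketch below uses a hypothetical helper, `ieee_remainder`, restricted to finite x and nonzero finite y, and compares it with Python's `math.remainder` (available from Python 3.7, documented as the IEEE 754-style remainder) and with `math.fmod`, which truncates the quotient instead of rounding it to nearest.

```python
import math
from fractions import Fraction

def ieee_remainder(x, y):
    """x REM y for finite x and nonzero finite y, computed exactly."""
    fx, fy = Fraction(x), Fraction(y)
    n = round(fx / fy)            # nearest integer to x/y, ties to even
    r = float(fx - fy * n)        # no rounding: the remainder is representable
    return math.copysign(0.0, x) if r == 0 else r   # a zero takes the sign of x

for x, y in ((5.0, 2.0), (7.0, 2.0), (-4.0, 2.0), (1.0, 0.3)):
    print(x, y, ieee_remainder(x, y), math.remainder(x, y), math.fmod(x, y))
```

For 7 REM 2 the quotient 3.5 is a tie, so n is the even integer 4 and the remainder is -1, whereas `math.fmod` returns 1.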
* Control of rounding precision is intended to allow systems the destinations
of which are always double or extended to mimic, in the absence
of over/underflow, the precisions of systems with single and double
destinations. An implementation should not provide operations that
combine double or extended operands to produce a single result, nor
operations that combine double extended operands to produce a double
result, with just one rounding.
5.2 Square root
The square root operation shall be provided in all supported
formats. The result is defined and has a positive sign for all operands
≥ 0, except that $\sqrt{-0}$ shall be -0. The destination format shall be at
least as wide as the operand's. The result shall be rounded as specified
in clause 4.
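Both properties of the square root can be checked from Python, assuming that `math.sqrt` maps onto a square root operation that conforms to this standard (typical on current hardware, but an assumption here). The helper `is_correctly_rounded_sqrt` is illustrative and requires Python 3.9 or later for `math.nextafter`; it verifies with exact arithmetic that the returned value is the representable number nearest the true root.

```python
import math
from fractions import Fraction

# The square root of -0 keeps the negative sign.
print(math.copysign(1.0, math.sqrt(-0.0)))   # -1.0

def is_correctly_rounded_sqrt(x):
    """Check that math.sqrt(x) is the double nearest the exact root of x > 0."""
    s = math.sqrt(x)
    lo = (Fraction(s) + Fraction(math.nextafter(s, 0.0))) / 2       # midpoint below
    hi = (Fraction(s) + Fraction(math.nextafter(s, math.inf))) / 2  # midpoint above
    return lo * lo <= Fraction(x) <= hi * hi

print(all(is_correctly_rounded_sqrt(x) for x in (2.0, 3.0, 0.1, 1e300)))
```

Squaring the midpoints to the neighbouring values avoids ever computing an irrational root, so the check is exact.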
5.3 Floating-point format conversions
It shall be possible to convert floating-point numbers between all
supported formats. If the conversion is to a narrower precision, the
result shall be rounded as specified in clause 4. Conversion to a wider
precision shall be exact. There is no exception.
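Narrowing and widening between the two basic formats can be tried with Python's `struct` module. The sketch assumes that `float` is the double format and that packing with the `"f"` code rounds to the single format in the default round to nearest mode; the helper name `to_single` is illustrative.

```python
import struct
from fractions import Fraction

def to_single(x):
    """Round a double to the single format and return it widened back to double."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

narrowed = to_single(0.1)
print(narrowed == 0.1)                                   # False: narrowing rounded the value
print(Fraction(narrowed) == Fraction(13421773, 2**27))   # True: the single nearest to 0.1
print(to_single(narrowed) == narrowed)                   # True: widening lost nothing
```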
5.4 Conversions between floating-point and integer
It shall be possible to convert between all supported floating-point
formats and all supported integer formats. Conversion to integer shall
be effected by rounding as specified in clause 4. Conversions between
floating-point integers and integer formats shall be exact unless an
exception arises as specified in 7.1.
5.5 Round floating-point number to integral value
It shall be possible to round a floating-point number to an integral
valued floating-point number in the same format. The rounding shall
be as specified in clause 4, with the understanding that when
rounding to nearest, if the difference between the unrounded operand
and the rounded result is exactly one half, the rounded result is
even.
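In the round to nearest mode this operation can be sketched with Python's built-in `round`, which resolves halfway cases to the even integer. The helper name `round_to_integral` is hypothetical. Passing infinities and NaNs through unchanged and giving a zero result the sign of the operand are assumptions of this sketch, not statements taken from this sub-clause.

```python
import math

def round_to_integral(x):
    """Round x to an integral-valued float, ties to even (round to nearest mode)."""
    if not math.isfinite(x):
        return x                              # assumption: pass infinities and NaNs through
    return math.copysign(float(round(x)), x)  # assumption: a zero keeps the operand's sign

print([round_to_integral(v) for v in (0.5, 1.5, 2.5, -0.5, -2.5)])
# [0.0, 2.0, 2.0, -0.0, -2.0]
```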
5.6 Binary ↔ decimal conversion
Conversion between decimal strings in at least one format, and
binary floating-point numbers in all supported basic formats, shall be
pro
...