The REAL signature

The REAL signature specifies structures that implement floating-point numbers. The semantics of floating-point numbers should follow the IEEE standard 754-1985 and the ANSI/IEEE standard 854-1987. In addition, implementations of the REAL signature are required to use non-trapping semantics. Additional aspects of the design of the REAL and MATH signatures were guided by the Floating-Point C Extensions developed by the X3J11 ANSI committee and the lecture notes by W. Kahan on the IEEE standard 754.

The relation between the comparison predicates defined here and those defined by IEEE, ANSI C and FORTRAN is specified in the following table.

SML	IEEE	C	FORTRAN
`==`	`=`	`==`	`.EQ.`
`!=`	`?<>`	`!=`	`.NE.`
`<`	`<`	`<`	`.LT.`
`<=`	`<=`	`<=`	`.LE.`
`>`	`>`	`>`	`.GT.`
`>=`	`>=`	`>=`	`.GE.`
`?=`	`?=`	`!islessgreater`	`.UE.`
`not o ?=`	`<>`	`islessgreater`	`.LG.`
`unordered`	`?`	`isunordered`	`unordered`
`not o unordered`	`<=>`	`!isunordered`	`.LEG.`
`not o op <`	`?>=`	`! <`	`.UGE.`
`not o op <=`	`?>`	`! <=`	`.UG.`
`not o op >`	`?<=`	`! >`	`.ULE.`
`not o op >=`	`?<`	`! >=`	`.UL.`

In the functions below, unless specified otherwise, if any argument is a NaN, the return value is a NaN. In a list of rules specifying the behavior of a function in special cases, the first matching rule defines the semantics.

Rationale:

The specification of the default signature and structure for non-integer arithmetic, particularly concerning exceptional conditions, was the source of much debate, given the desire to allow implementations to provide efficient floating-point modules. Permitting implementations to differ on whether, for example, to raise Div on division by zero meant that the user really did not have a standard to program against. Portable code would require adopting the more conservative position of explicitly handling exceptions. A second alternative was to specify that functions in the Real structure must raise exceptions, but that implementations so desiring could provide additional structures matching REAL with explicit floating-point semantics. This was rejected because it meant that the default real type would not be the same as a defined floating-point real type. This would have conferred second-class status on the latter, while providing a default real of lesser performance and involving additional implementation complexity for little benefit.
Deciding whether real should be an eqtype and, if so, what equality should mean was also problematic. IEEE specifies that the sign of zeros be ignored in comparisons, and that equality evaluate to false if either argument is a NaN. These constraints are disturbing to the SML programmer. The former implies that 0 = ~0 is true while r/0 = r/~0 is false. The latter implies such anomalies as r = r being false, or that, for a ref cell rr, we could have rr = rr but not have !rr = !rr. We accepted the unsigned comparison of zeros, but felt that the reflexive property of equality, structural equality, and the desire that <> be equivalent to not o = ought to be preserved. Additional complications led to the decision not to have real be an eqtype.
The type, signature and structure identifiers real, REAL and Real, although misnomers in light of the floating-point-specific nature of the modules, were retained for historical reasons.

Synopsis

    signature REAL
    structure Real : REAL
    structure LargeReal : REAL
    structure Real{N} : REAL

Interface

    type real
    structure Math : MATH
    val radix : int
    val precision : int
    val maxFinite : real
    val minPos : real
    val minNormalPos : real
    val posInf : real
    val negInf : real
    val + : (real * real) -> real
    val - : (real * real) -> real
    val * : (real * real) -> real
    val / : (real * real) -> real
    val *+ : real * real * real -> real
    val *- : real * real * real -> real
    val ~ : real -> real
    val abs : real -> real
    val min : (real * real) -> real
    val max : (real * real) -> real
    val sign : real -> int
    val signBit : real -> bool
    val sameSign : (real * real) -> bool
    val copySign : (real * real) -> real
    val compare : (real * real) -> order
    val compareReal : (real * real) -> IEEEReal.real_order
    val < : (real * real) -> bool
    val <= : (real * real) -> bool
    val > : (real * real) -> bool
    val >= : (real * real) -> bool
    val == : (real * real) -> bool
    val != : (real * real) -> bool
    val ?= : (real * real) -> bool
    val unordered : (real * real) -> bool
    val isFinite : real -> bool
    val isNan : real -> bool
    val isNormal : real -> bool
    val class : real -> IEEEReal.float_class
    val fmt : StringCvt.realfmt -> real -> string
    val toString : real -> string
    val fromString : string -> real option
    val scan : (char, 'a) StringCvt.reader -> (real, 'a) StringCvt.reader
    val toManExp : real -> {man : real, exp : int}
    val fromManExp : {man : real, exp : int} -> real
    val split : real -> {whole : real, frac : real}
    val realMod : real -> real
    val rem : (real * real) -> real
    val nextAfter : (real * real) -> real
    val checkFloat : real -> real
    val realFloor : real -> real
    val realCeil : real -> real
    val realTrunc : real -> real
    val floor : real -> Int.int
    val ceil : real -> Int.int
    val trunc : real -> Int.int
    val round : real -> Int.int
    val toInt : IEEEReal.rounding_mode -> real -> int
    val toLargeInt : IEEEReal.rounding_mode -> real -> LargeInt.int
    val fromInt : int -> real
    val fromLargeInt : LargeInt.int -> real
    val toLarge : real -> LargeReal.real
    val fromLarge : IEEEReal.rounding_mode -> LargeReal.real -> real
    val toDecimal : real -> IEEEReal.decimal_approx
    val fromDecimal : IEEEReal.decimal_approx -> real

Description

type real

Note that, as discussed above, real is not an eqtype.

structure Math

radix

is the base of the representation, e.g., 2 or 10 for IEEE floating point.

precision

is the number of digits, each between 0 and radix-1, in the mantissa.

maxFinite

minPos

minNormalPos

are the maximum finite number, the minimum non-zero positive number and the minimum non-zero normalized number, respectively.

posInf

negInf

Positive and negative infinity values.

r1 + r2

r1 - r2

the sum and difference of r1 and r2. If one argument is finite and the other infinite, the result is infinite with the correct sign, e.g., 5 - (-infinity) = infinity. We also have infinity + infinity = infinity and (-infinity) + (-infinity) = (-infinity). Any other combination of two infinities produces a NaN.

r1 * r2

the product of r1 and r2. The product of zero and an infinity produces a NaN. Otherwise, if one argument is infinite, the result is infinite with the correct sign, e.g., -5 * (-infinity) = infinity, infinity * (-infinity) = -infinity.

r1 / r2

the quotient of r1 and r2. We have 0 / 0 = NaN and +-infinity / +-infinity = NaN. Dividing a finite, non-zero number by a zero, or an infinity by a finite number produces an infinity with the correct sign. (Note that zeros are signed.) A finite number divided by an infinity is 0 with the correct sign.
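
The rules for +, -, * and / on infinities and zeros can be spot-checked with a small sketch such as the following. The binding names are illustrative only, and the comments restate the outcomes described above rather than the output of any particular implementation.

```sml
(* Illustrative names; none of these bindings are part of REAL. *)
val inf  = Real.posInf
val ninf = Real.negInf

val sum1  = 5.0 - ninf     (* finite minus -infinity: +infinity *)
val sum2  = inf + ninf     (* infinities of opposite sign: NaN *)
val prod1 = 0.0 * inf      (* zero times an infinity: NaN *)
val prod2 = ~5.0 * ninf    (* +infinity *)
val quot1 = 1.0 / 0.0      (* finite non-zero over +0: +infinity *)
val quot2 = 1.0 / ~0.0     (* zeros are signed: -infinity *)
val quot3 = 1.0 / ninf     (* finite over -infinity: -0 *)
val quot4 = 0.0 / 0.0      (* NaN *)
```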

*+ (a, b, c)

*- (a, b, c)

return a*b + c and a*b - c, respectively. Their behaviors on infinities follow from the behaviors derived from addition, subtraction and multiplication.

The precise semantics of these operations depend on the language implementation and the underlying hardware. Specifically, certain architectures provide these operations as a single instruction, possibly using a single rounding operation. Thus, the use of these operations may be faster than performing the individual arithmetic operations sequentially, but may also cause different rounding behavior.
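
As a minimal sketch, the combined operations can be compared with their unfused counterparts. The operands here are chosen so that every intermediate result is exactly representable, which means both forms agree whatever rounding strategy the implementation uses; with other operands the two may differ in the last digit. The binding names are illustrative.

```sml
val a = 2.0
val b = 3.0
val c = 4.0

val fusedAdd = Real.*+ (a, b, c)   (* a*b + c *)
val fusedSub = Real.*- (a, b, c)   (* a*b - c *)
val plainAdd = a * b + c           (* two separately rounded operations *)
```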

~ r

the negation of r, i.e., (- r). ~ (+-infinity) = -+infinity.

abs r

the absolute value of r. abs (+-infinity) = infinity.

min (a, b)

max (a, b)

returns the smaller (respectively, larger) of a and b. If exactly one argument is NaN, return the other argument. If both arguments are NaN, return NaN.
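
A short sketch of the NaN handling described for min and max, assuming a NaN obtained from 0.0 / 0.0 (binding names are illustrative):

```sml
val nan = 0.0 / 0.0

val m1 = Real.min (nan, 2.0)               (* the non-NaN argument, 2.0 *)
val m2 = Real.max (3.0, nan)               (* the non-NaN argument, 3.0 *)
val m3 = Real.isNan (Real.min (nan, nan))  (* both NaN, so the result is NaN *)
val m4 = Real.min (1.0, 2.0)               (* ordinary case, 1.0 *)
```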

sign r

~1 if r is negative, 0 if r is zero, or 1 if r is positive. An infinity returns its sign; a zero returns 0 regardless of its sign. Raises Domain on NaN.

signBit r

returns true if and only if the sign of r (infinities, zeros and NaNs, included) is negative.

sameSign (r1, r2)

returns true if and only if signBit r1 equals signBit r2.

copySign (x, y)

returns x with the sign of y, even if y is a NaN.
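
The difference between sign, which treats zeros as unsigned, and the sign-bit functions, which do not, can be sketched as follows. The binding names are illustrative, and the string results are only a device for observing which exception, if any, is raised.

```sml
val s0   = Real.sign ~0.0               (* 0: a zero has sign 0 *)
val sInf = Real.sign Real.negInf        (* ~1: an infinity returns its sign *)
val b0   = Real.signBit ~0.0            (* true: the sign bit is set *)
val same = Real.sameSign (0.0, ~0.0)    (* false: the sign bits differ *)
val c    = Real.copySign (3.0, ~0.0)    (* ~3.0 *)

(* sign is documented to raise Domain on a NaN argument. *)
val signOfNan =
  (ignore (Real.sign (0.0 / 0.0)); "returned") handle Domain => "Domain"
```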

compare (r1, r2)

compareReal (r1, r2)

The function compare returns LESS, EQUAL or GREATER according to whether r1 is less than, equal to, or greater than r2. It raises IEEEReal.Unordered on unordered arguments.

The function compareReal behaves similarly except it returns values of type IEEEReal.real_order and returns IEEEReal.UNORDERED on unordered arguments.

Implementation note:

Implementations should try to optimize use of Real.compare, since it is necessary for catching NaNs.
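
A minimal sketch of the two comparison functions on ordered and unordered arguments (binding names are illustrative):

```sml
val nan = 0.0 / 0.0

val c1 = Real.compare (1.0, 2.0)        (* LESS *)
val c2 = Real.compareReal (1.0, nan)    (* IEEEReal.UNORDERED *)

(* compare signals unordered arguments with an exception instead. *)
val raised =
  (ignore (Real.compare (1.0, nan)); false)
  handle IEEEReal.Unordered => true
```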

r1 < r2

r1 <= r2

r1 > r2

r1 >= r2

return true if the corresponding relation holds between the two reals.

Note that these operators return false on unordered arguments, i.e., if either argument is NaN, so that the usual reversal of comparison under negation does not hold, e.g., a < b is not the same as not (a >= b).
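
The failure of the usual reversal can be seen directly in a small sketch, assuming a NaN produced by 0.0 / 0.0 (binding names are illustrative):

```sml
val nan = 0.0 / 0.0
val a = 1.0

val lt       = a < nan          (* false: the arguments are unordered *)
val ge       = a >= nan         (* false as well *)
val notGe    = not (a >= nan)   (* true, so it disagrees with a < nan *)
val ordinary = 1.0 < 2.0        (* true *)
```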

== (x, y)

!= (x, y)

The first returns true if and only if neither y nor x is NaN, and y and x are equal, ignoring signs on zeros. This is equivalent to the IEEE = operator.

The second function != is equivalent to not o op == and the IEEE ?<> operator.

?= (x, y)

returns true if either argument is a NaN or if the arguments are bitwise equal, ignoring signs on zeros. It is equivalent to the IEEE ?= operator.

unordered (x, y)

returns true if x and y are unordered, i.e., at least one of x and y is a NaN.
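
The equality-like predicates differ only in how they treat NaNs, as this sketch illustrates (binding names are illustrative):

```sml
val nan = 0.0 / 0.0

val e1 = Real.== (0.0, ~0.0)         (* true: signs of zeros are ignored *)
val e2 = Real.== (nan, nan)          (* false: a NaN is never == to anything *)
val e3 = Real.!= (nan, nan)          (* true: the negation of == *)
val e4 = Real.?= (nan, 1.0)          (* true: unordered counts as "maybe equal" *)
val e5 = Real.?= (1.0, 2.0)          (* false: ordered and different *)
val e6 = Real.unordered (nan, 1.0)   (* true *)
```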

isFinite x

returns true if x is neither a NaN nor an infinity.

isNan x

returns true if x is a NaN.

isNormal x

returns true if x is normal, i.e., neither zero, subnormal, infinite nor NaN.

class x

returns the IEEEReal.float_class to which x belongs.
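
The classification predicates can be exercised as in the sketch below. The last two bindings assume an implementation whose real type has subnormal numbers (as IEEE double precision does), so that minPos is subnormal; that assumption is not guaranteed by the text above. Binding names are illustrative.

```sml
val nan = 0.0 / 0.0

val f1 = Real.isFinite 1.0                 (* true *)
val f2 = Real.isFinite Real.posInf         (* false *)
val n1 = Real.isNan nan                    (* true *)
val z1 = Real.isNormal 0.0                 (* false: zero is not normal *)
val k1 = Real.isNormal Real.minNormalPos   (* true *)

(* Assumes subnormals exist, so that minPos lies below minNormalPos. *)
val tinyIsNormal = Real.isNormal Real.minPos
val tinyIsSubnormal =
  case Real.class Real.minPos of
      IEEEReal.SUBNORMAL => true
    | _ => false
```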

fmt spec r

toString r

convert reals into strings. The conversion provided by the function fmt is parameterized by spec, which has the following forms and interpretations.

SCI arg: Scientific notation: [~]d.dddE[~]dd, where there is always one digit before the decimal point, nonzero if the number is nonzero. arg specifies the number of digits to appear after the decimal point, with 6 the default if arg is NONE.
FIX arg: Fixed-point notation: [~]ddd.ddd. arg specifies the number of digits to appear after the decimal point, with 6 the default if arg is NONE.
GEN arg: Adaptive notation: the notation used is either scientific or fixed-point depending on the value converted. arg specifies the maximum number of significant digits used, with 12 the default if arg is NONE.
EXACT: Exact decimal notation: refer to IEEEReal.toString for a complete description of this format.

In all cases, positive and negative infinities are converted to "inf" and "~inf", respectively. If spec is not EXACT, NaN values are returned as "nan"; otherwise, NaN values are converted to the form "nan(d1d2...dn)".

fmt raises Size if spec is an invalid precision, i.e., if spec is

SCI (SOME i) with i < 0
FIX (SOME i) with i < 0
GEN (SOME i) with i < 1

The value returned by toString is equivalent to:

(fmt (StringCvt.GEN NONE) r)
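
A brief sketch of fmt and toString follows. Only fixed-point output and the infinity strings are shown with expected results, since the exact exponent layout of scientific notation is not pinned down here. Binding names are illustrative.

```sml
val fixed = Real.fmt (StringCvt.FIX (SOME 2)) 3.14159   (* "3.14" *)
val neg   = Real.fmt (StringCvt.FIX (SOME 1)) ~1.5      (* "~1.5" *)
val infS  = Real.toString Real.posInf                   (* "inf" *)
val ninfS = Real.toString Real.negInf                   (* "~inf" *)

(* toString is specified as fmt (StringCvt.GEN NONE). *)
val sameAsGen = Real.toString 0.1 = Real.fmt (StringCvt.GEN NONE) 0.1

(* A negative digit count is an invalid precision. *)
val badSpec =
  (ignore (Real.fmt (StringCvt.FIX (SOME ~1)) 1.0); "returned")
  handle Size => "Size"
```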

fromString s

returns SOME(r) if a real value can be scanned from a prefix of s, ignoring any initial whitespace; otherwise, returns NONE. Equivalent to StringCvt.scanString scan.

scan getc a

scans a real value from character source a using reader getc, ignoring initial whitespace. If successful, returns SOME(r,rest) where r is the scanned real value and rest is the unused portion of the character source a. Raises Overflow if the value cannot be represented as a real value.

The format for valid string representation of reals is given by the regular expression

	  [+~-]?(([0-9]+(\.[0-9]+)?)|(\.[0-9]+))([eE][+~-]?[0-9]+)?
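
As an illustration, fromString and scan can be used as below. The sketch uses Substring.getc as the character reader so that the unused remainder of the input can be inspected; the input strings and binding names are made up for the example.

```sml
(* Leading whitespace is skipped and scanning stops at the first
   character that cannot extend the number. *)
val parsed = Real.fromString "  3.5e2xyz"    (* SOME 350.0 *)
val none   = Real.fromString "abc"           (* NONE *)

(* scan also returns the rest of the character source. *)
val (value, rest) =
  case Real.scan Substring.getc (Substring.full " 42.5 apples") of
      SOME (r, s) => (r, Substring.string s)
    | NONE => (0.0, "")
```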

toManExp r

returns {man, exp}, where man and exp are the mantissa and exponent of r, respectively. Specifically, we have the relation

r = man * radix^(exp)

where 1.0 <= man * radix < radix. This function is comparable to frexp in the C library.

If r is +-0, man is +-0 and exp is +0. If r is +-infinity, man is +-infinity and exp is unspecified. If r is NaN, man is NaN and exp is unspecified.

fromManExp {man, exp}

returns man * radix^(exp). This function is comparable to ldexp in the C library. Note that non-exceptional arguments can produce zero or infinities, essentially because of underflows and overflows.

If man is +-0, the result is +-0. If man is +-infinity, the result is +-infinity. If man is NaN, the result is NaN.
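
A minimal round-trip sketch for toManExp and fromManExp is given below. It checks the relation r = man * radix^(exp) by recombining the parts, and separately evaluates the stated bound on man; no particular values of man and exp are assumed. Binding names are illustrative.

```sml
val r = 8.0
val {man, exp} = Real.toManExp r

(* Recombining the parts should give back the original value. *)
val back = Real.fromManExp {man = man, exp = exp}
val roundTrips = Real.== (back, r)

(* The bound stated above: 1.0 <= man * radix < radix. *)
val radixR = Real.fromInt Real.radix
val inRange = 1.0 <= man * radixR andalso man * radixR < radixR
```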

split r

realMod r

The former returns {whole, frac}, where frac and whole are the fractional and integral parts of r, respectively. Specifically, whole is integral, |frac| < 1.0, whole and frac have the same sign as r, and r = whole + frac. This function is comparable to modf in the C library.

If r is +-infinity, whole is +-infinity and frac is +-0. If r is NaN, both whole and frac are NaN.

realMod is equivalent to #frac o split.
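
For example, with values that are exactly representable in binary (binding names are illustrative):

```sml
val {whole, frac} = Real.split ~3.75   (* whole = ~3.0, frac = ~0.75 *)
val recombined = whole + frac          (* ~3.75 again *)
val fracOnly = Real.realMod 2.5        (* 0.5 *)
```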

rem (x, y)

returns the remainder x - n*y, where n = trunc (x / y). The result has the same sign as x and has absolute value less than the absolute value of y.

If x is an infinity or y is 0, rem returns NaN. If y is an infinity, rem returns x.
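
The definition x - n*y with n = trunc (x / y) can be worked through on a few finite, exactly representable operands. This sketch covers only the ordinary case, not the special cases for infinities and zeros; binding names are illustrative.

```sml
val r1 = Real.rem (7.5, 2.0)     (* n = 3, so the result is 1.5 *)
val r2 = Real.rem (~7.5, 2.0)    (* n = ~3, so the result is ~1.5 *)
val r3 = Real.rem (7.5, ~2.0)    (* n = ~3, so the result is 1.5 *)

(* The result takes its sign from x, not from y. *)
val signsFollowX =
  Real.sameSign (r1, 7.5) andalso Real.sameSign (r2, ~7.5)
  andalso Real.sameSign (r3, 7.5)
```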

nextAfter (r, t)

returns the next representable real after r in the direction of t. Thus, if t is less than r, nextAfter returns the largest representable floating-point number less than r. If r = t then it returns r. If r is +-infinity, it returns +-infinity. If either argument is a NaN, this returns NaN.

checkFloat x

raises Overflow if x is an infinity, and raises Div if x is a NaN. Otherwise, it returns its argument.

This can be used to synthesize trapping arithmetic from the non-trapping operations given here. Note, however, that infinities can be converted to NaNs by some operations, so that if accurate exceptions are required, checks must be done after each operation.
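
One way to use checkFloat for this purpose is sketched below. The helper safeDiv is hypothetical and not part of the REAL signature; it simply checks the result of a single division.

```sml
(* Hypothetical helper: division that raises instead of returning
   an infinity or a NaN. *)
fun safeDiv (x, y) = Real.checkFloat (x / y)

fun outcome (x, y) =
  (ignore (safeDiv (x, y)); "ok")
  handle Overflow => "Overflow"
       | Div => "Div"

val fine    = outcome (6.0, 3.0)   (* "ok" *)
val infCase = outcome (1.0, 0.0)   (* "Overflow": the quotient is an infinity *)
val nanCase = outcome (0.0, 0.0)   (* "Div": the quotient is a NaN *)
```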

realFloor r

realCeil r

realTrunc r

truncate reals to integer-valued reals. realFloor produces the largest integer not larger than r. realCeil produces the smallest integer not less than r. realTrunc rounds r towards zero. If r is NaN or an infinity, these functions return r.
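
A small sketch on a negative argument, where the three functions differ (binding names are illustrative):

```sml
val x = ~2.5

val fl = Real.realFloor x    (* ~3.0: largest integer not larger than x *)
val ce = Real.realCeil x     (* ~2.0: smallest integer not less than x *)
val tr = Real.realTrunc x    (* ~2.0: rounded towards zero *)

(* Infinities are returned unchanged. *)
val infStays = Real.== (Real.realFloor Real.posInf, Real.posInf)
```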

floor r

ceil r

trunc r

round r

convert reals to integers. floor produces the largest int not larger than r. ceil produces the smallest int not less than r. trunc rounds r towards zero. round yields the integer nearest to r; in the case of a tie, it rounds to the nearest even integer. They raise Overflow if the resulting value cannot be represented as an int, for example, on infinity. They raise Domain on NaN arguments.

These are respectively equivalent to:

         toInt IEEEReal.TO_NEGINF r
         toInt IEEEReal.TO_POSINF r
         toInt IEEEReal.TO_ZERO r
         toInt IEEEReal.TO_NEAREST r
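
The following sketch applies the four conversions to a tie value and shows the exceptional cases. The string results are only a way of observing which exception is raised; binding names are illustrative.

```sml
val x = ~2.5

val fl  = Real.floor x      (* ~3 *)
val ce  = Real.ceil x       (* ~2 *)
val tr  = Real.trunc x      (* ~2 *)
val rd  = Real.round x      (* ~2: the tie goes to the even integer *)
val rd2 = Real.round 3.5    (* 4: again the even neighbour *)

(* floor written with an explicit rounding mode. *)
val viaToInt = Real.toInt IEEEReal.TO_NEGINF x

fun outcome f r =
  (ignore (f r); "returned")
  handle Overflow => "Overflow"
       | Domain => "Domain"

val onInf = outcome Real.floor Real.posInf    (* "Overflow" *)
val onNan = outcome Real.round (0.0 / 0.0)    (* "Domain" *)
```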

toInt mode x

toLargeInt mode x

convert the argument x to an integral type using the specified rounding mode. Raise Overflow if the result is not representable, in particular, if x is an infinity. Raise Domain if the input real is a NaN.

fromInt i

fromLargeInt i

convert integers to type real.

toLarge x

fromLarge mode x

convert between values of type real and type LargeReal.real.

toDecimal r

fromDecimal d

convert between real values and decimal approximations. Decimal approximations are to be converted using the IEEEReal.TO_NEAREST rounding mode. toDecimal should produce only as many digits as are necessary for fromDecimal to convert back to the same number, i.e., for any NORMAL or SUBNORMAL real value r, we have:

    fromDecimal (toDecimal r) = r.

For toDecimal, when the kind field is not NORMAL or SUBNORMAL, then exp = 0 and digits = [], except if kind is NAN, in which case the digits field provides a decimal representation of the fraction field of r.

For fromDecimal, if kind is ZERO or INF, the resulting real is the appropriate signed zero or infinity, with the digits and exp fields ignored. If kind is NAN, a signed NaN is generated, where the exp field is ignored and the digits field is used as the decimal representation of the fractional field. If the resulting fractional field has all zero bits, which corresponds to an infinity, fromDecimal raises the Domain exception. If digits is empty, an implementation-dependent NaN is produced. If kind is NORMAL or SUBNORMAL, the sign, digits and exp fields are used to produce a real value. Note that the conversion itself should ignore the kind field, so that the resulting value might have class NORMAL, SUBNORMAL or ZERO. In particular, if digits is empty or a list of all 0's, the result should be a signed zero.

Implementation note:

Algorithms for accurately and efficiently converting between binary and decimal real representations are readily available, e.g., see the technical report by [CITE]Gay/.

Discussion

The sign of a zero is ignored in all comparisons.

Note that, if x is real, ~x is equivalent to ~(x), that is, it is identical to x but with its sign bit flipped. In particular, the literal ~0.0 is just 0.0 with its sign bit set. On the other hand, this might not be the same as 0.0-0.0, in which rounding modes come into play.

Except for the *+ and *- functions, arithmetic should be done in the exact precision specified by the precision value. In particular, arithmetic must not be done in some extended precision and then rounded.

Implementation note:

Implementations may choose to provide a debugging mode, in which NaNs and Infs are detected when they are generated.

The Standard ML Basis Library