My Height: A Model For Numeric Information

William Kent
Database Technology Department
Hewlett-Packard Laboratories
Palo Alto, California

March 1993

> 1 INTRODUCTION
> 2 THE PROBLEM
> 13 CONCLUSIONS
> 14 REFERENCES


Units of measure and data types have always been awkward to integrate into database schemas. It's especially tricky in object models, with their insistence on a clean separation between interface and implementation: on which side do units and data types belong? The problem surfaces yet again in dealing with domain mismatch when integrating heterogeneous data sources.

An elegant approach treats units and data types as distinct constructs in their own right, mapping in two stages from abstract magnitudes to communicable and recordable symbols.


I am six feet or 72 inches tall. How should that be modeled in a database schema, object-oriented or otherwise? Although different sorts of data models use different syntaxes, a property such as birthplace or height is always describable as some sort of mapping between kinds of things:

Birthplace: Person -> City,
Height: Person -> ????.

What exactly does height map to? To begin with, do we mean 6 feet (an integer) or 6.0 feet (a "real")?

Height: Person -> Integer?
Height: Person -> Real?

Are Real and Integer types in the same sense as Person and City? What is the relationship between Real and Integer data types? In type graphs, we tend to show Real and Integer as disjoint types, having no instances in common. But in the "real" world of mathematics, integers are a subset of the reals, i.e., the number 6 is both an integer and a real number. Do we really believe that the integer 6 and the real number 6 are different things? Are we confused between different numbers and different ways of representing numbers?

We're also confused about the names of our data types. "Real" is a misnomer; at best we might be talking about the rational numbers, not all the reals. To be more precise, we are talking about a subset of the rational numbers expressible with a finite and usually fixed precision. Some languages call these Floating Point, or Decimal, although Decimal sometimes means integers represented in base ten, rather than two or eight or sixteen. (The actual set of data types isn't what we want to focus on; we'll pick some arbitrary set for discussion purposes.)

Beyond that, what do we do about feet and inches and meters? What is the type of the value of the height property? We might introduce types like Feet and Meters, or Height_in_Feet and Height_in_Meters...

Height: Person -> Feet?
Height: Person -> Meters?
Height: Person -> Height_in_Feet?
Height: Person -> Height_in_Meters?

If the height of people is expressed in feet and the height of buildings is expressed in meters, do they or don't they have the same sort of values for the height property? In some sense, we'd like to say the height of anything is a distance.

Where would we put such types in a type graph? Sometimes we put them in as subtypes of the numeric types. Does that make sense? Are heights and weights and ages numbers? Or are they some other concepts, whose measurements are expressed in numbers? Distance is essentially a space between two points, not a number. Weight is essentially a heaviness, not a number.


Abstractly, we can describe a simple three-stage process to deal with all this. In the first stage, the value of a property such as my height is simply a distance, i.e., a space between two points. Picture the answer being given by someone with arm extended, palm down, saying "So high." No joke. That is the direct answer to the question of how tall I am.

The second stage is measurement, mapping this into numbers. Units of measure are different ways of mapping such things into numbers, yielding different numbers for different units. The numbers are still abstractions, not represented in any particular data type. If we ask someone my height in feet, imagine him holding up all the fingers of one hand and the index finger of the other: "This many." If we ask my height in inches, you'll have to imagine seven people holding up all their fingers, with yet another person holding up his thumbs, all saying "This many".

The third stage is representation. The number of feet in my height would be written as 6 in decimal, VI in Roman, 110 in binary, 111111 in unary, six in English, seis in Spanish, and so on. (I'm ignoring quoting conventions; that's another level of complexity, which we don't need to tackle here.)

We're trying to discuss concepts, but we can't communicate without using some form of representation. So let's say that, for the moment, /-/ represents an abstract distance corresponding to my height, while * represents an abstract number corresponding to the number of points in a snowflake. The three stages can then be summarized as

Height(Bill) = /-/
Feet(/-/) = *
Decimal(*) = 6.
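
As a rough illustration only, the three stages might be sketched in Python along the following lines. The `Distance` class, its private `_meters` field, the international-foot conversion factor, and all function names are assumptions of this sketch, not part of the model. A program has to hold the abstract distance in some internal form; the choice of exact meters here is arbitrary and deliberately kept out of the caller's view.

```python
from fractions import Fraction

class Distance:
    """An abstract magnitude; callers never see how it is held internally."""
    def __init__(self, meters):
        # Hypothetical internal form: exact meters. Any other choice would do.
        self._meters = Fraction(meters)

# Assumption: the international foot of exactly 0.3048 m.
METERS_PER_FOOT = Fraction(3048, 10000)

def height(person):
    """Stage 1, property: Person -> Distance."""
    return person["height"]

def feet(d):
    """Stage 2, measure: Distance -> Number."""
    return d._meters / METERS_PER_FOOT

def decimal(n):
    """Stage 3, representation: Number -> Symbol (a character string)."""
    return str(n)

bill = {"name": "Bill", "height": Distance(Fraction(18288, 10000))}
symbol = decimal(feet(height(bill)))
print(symbol)
```

Only the last function produces something that can be displayed or stored; the values passed between the first two stages are opaque to the caller.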

We can talk about Height(Person), or even Height(Bill), as an abstraction. That's okay so long as we only want to think about my height, as an abstract concept. As soon as we want to communicate it or record it, we have to make it more concrete and put it into some form of representation, requiring an intermediate stage of measurement.

Imagine the following dialog with a somewhat perverse computer:

You: Do you know Bill's height?

Computer: Yes.

You: Well, would you show it to me, please?

Computer: I can't. The screen's not big enough to display two points such that the distance between them is the same as Bill's height.

You: Listen, all I want is a number.

Computer: What number? Bill's height is a distance, not a number.

You: Would you please measure Bill's height!

Computer: Certainly. What units shall I measure it in?

You: Feet will do nicely, thank you.

Computer: All right.

You: Well, could you please show me the answer?

Computer: Sorry, I don't have any fingers to hold up. I can only display character graphics on your screen. In what sort of displayable symbol would you like the number represented?

You: A decimal integer, for Pete's sake.

Computer: Sorry, I'll try to remember that that's what you expect next time. The answer is 6.

You: Well, thanks a lot.


Let's use "magnitude" to mean a measurable abstraction such as distance, weight, chunks of time, and so on, before it gets measured or represented. Let's also agree that "number" means an abstract quantity, not expressed in any particular notation. Then we have:

             measure            datatype
Magnitude ------------> Number ------------> Symbol

Units of measure are functions that map magnitudes into numbers:

Feet: Distance -> Number
Inches: Distance -> Number
Meters: Distance -> Number
Miles: Distance -> Number

Pounds: Weight -> Number
Grams: Weight -> Number
Kilograms: Weight -> Number

Quarts: Volume -> Number
Gallons: Volume -> Number

Degrees: Angle -> Number
Radians: Angle -> Number


Thus, for a given distance /-/, the mappings

Feet(/-/)
Inches(/-/)
Meters(/-/)

each yield a different (abstract) number.

Data types, in turn, are functions that map abstract numbers into character string symbols. If % denotes the number of inches in my height, then we might have

DecimalType(%) = 72,
OctalType(%) = 110,
HexType(%) = 48.
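
A minimal sketch of these data type functions in Python follows. The function names are made up for the sketch, and a Python `int` merely stands in for the abstract number %, since a program can only ever hold some representation of it.

```python
def decimal_type(n: int) -> str:
    """Number -> Decimal symbol."""
    return format(n, "d")

def octal_type(n: int) -> str:
    """Number -> Octal symbol."""
    return format(n, "o")

def hex_type(n: int) -> str:
    """Number -> Hexadecimal symbol over {0-9,A-F}."""
    return format(n, "X")

inches_in_height = 72  # stands in for the abstract number %

for datatype in (decimal_type, octal_type, hex_type):
    print(datatype.__name__, datatype(inches_in_height))
```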

A data type has two aspects: a set of symbols and a mapping from numbers into that set. The set of symbols is typically defined as the finite sequences of characters from a specified character set. Hexadecimal symbols are defined over the set of characters {0-9,A-F}. The corresponding mapping is then defined as

HexType: Number -> Hexadecimal.

Dates need similar clarification: are we talking about a certain day, or a particular way of representing it? My birthday is a particular day in the past, which has its own abstract existence in the same way as the number ten. That day is that day. There are lots of ways to represent it: February 19, 1936; 2/19/36; 19-Feb-36; 19360219; 50/1936 (the 50th day of 1936); the 12,833rd day in the 20th century (counting from January 1, 1901); etc., etc., etc. Each of those corresponds to a different data type. We don't think of them as being different entity types, different kinds of things. They are just different representations of the same abstract concept.

As a matter of fact, we even have a kind of measurement concept for dates. Before we can represent a date, we have to pick a calendar in which to "measure" it. Dates come out quite different in the Gregorian, Chinese, Jewish, and other calendars, not to mention the various forms used inside computers. Each uses a different sort of "yardstick", as well as a different origin as reference point. Our calendar "measure" is analogous to an odd sort of yardstick, as though the three feet in a yard each had their own name, and each contained its own number of inches. So our year is divided into twelve months, each being named, and each having its own number of days.

Thus we again have a two stage process: pick a calendar in which to "measure" a day, then pick a representation format for the date within that calendar. Keeping this date analogy in mind might make it easier to understand the treatment of other numeric quantities.
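
The date analogy can be sketched in Python as follows. Here `datetime.date` stands in for the abstract day, which is itself a compromise: it already commits to the (proleptic) Gregorian calendar as the "measure". The three formatting functions are hypothetical names chosen for this sketch; each plays the role of a different data type for the same day.

```python
from datetime import date

# One day, already "measured" in the Gregorian calendar.
birthday = date(1936, 2, 19)

def as_compact(day: date) -> str:
    """Represent as YYYYMMDD."""
    return day.strftime("%Y%m%d")

def as_us_short(day: date) -> str:
    """Represent as M/D/YY."""
    return f"{day.month}/{day.day}/{day.year % 100:02d}"

def as_day_of_year(day: date) -> str:
    """Represent as day-of-year/year."""
    return f"{day.timetuple().tm_yday}/{day.year}"

for represent in (as_compact, as_us_short, as_day_of_year):
    print(represent(birthday))
```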


We can draw a fine line between concept and realization.

Suppose you ask me if I know how tall I am. I say yes; an idea forms in my mind, capturing in an indescribable way some notion of a particular degree of "bigness". But in order to tell you how tall I am, I ultimately have to transform that notion into something very concrete, involving very real matter and energy, which will have some impact on your eyes, ears, muscles, or other physical receptors. The same is true if I want to record that information somewhere, so you can get at it later on. Information has to be converted to very tangible configurations of matter and energy.

Symbols are an intermediate stage of that transformation from concept to material communication. Symbols make it possible for me to tell you how tall I am without having to scratch a mark on the ground as long as my height, and to tell you how much I weigh without having to hand you a rock as heavy as I am. Symbols are at the boundary between hardware and software. Software deals with computer systems as symbol manipulators; hardware implements symbols in the electrons, holes, and magnetic fields in silicon, copper, phosphors, etc.

Symbols are also at the interfaces of computers. For our purposes, a symbol is anything that can pass across a communication or storage interface of a computer. Technological advances are rapidly expanding the domain of such symbols to include sounds, pictures, cursor movements, finger touches, hand-drawn script, and so on. For our immediate purposes, we can concentrate on symbols formed as linear sequences of characters. Such symbols can represent elementary concepts, aggregates that organize the elementary concepts into useful information, and procedures to be executed by the computer in processing information. This paper focuses on symbols representing elementary concepts.

Of these three levels of abstraction - conceptual, symbolic, and material - only the first two are relevant to data modeling. We'll say that conceptual specifications describe information at the conceptual level, while concrete specifications describe the symbolic level (admittedly an abuse of metaphor, but the alliteration is attractive).


We are now in a better position to differentiate between conceptual and concrete information specifications. This applies to any sort of data model, object-oriented or not.

Conceptual specifications describe the information to be maintained inside the computing system. Concrete specifications tell how to map that information into symbols which can be recorded, or which can be communicated between the computer and users. Note that concrete specifications apply to two facets: how the computer communicates at interfaces with users, and how the computer records information internally in storage media:

                        conceptual
                       specification
                      ---------------
communicating<--------| information |-------->recording
  concrete            ---------------         concrete
specification                               specification

For our purposes, the communication side includes delivery of information into variables in an application's program space, and also the encoding of parameters for requests to be sent to the information system. Thus a conceptual specification might define operations for getting and setting the height of a person, or for moving a vertex, as

Height: Person -> Distance
Height: Person <- Distance
Move: Vertex, Distance, Direction -> Location

without specifying units or data types for any of the parameters. (That syntax is illustrative only, and should not provoke discussion of the semantics of attributes or update.) In an object model, that might be the form of an abstract object interface.

Conceptual specifications suffice to describe algorithms, procedures, and other relationships in the information. For example, operations can be specified to compare heights, or to add and subtract them, without reference to representation in terms of units or data types. The expression

Height(Bill) - Height(George)

is a meaningful operation on distances, not numbers. The result is an abstract distance, i.e., a space between two points (the tops of our heads? - there's a joke there).
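
One way to picture such unit-free operations is a sketch like the following, in which distances can be compared, added, and subtracted without any unit or data type appearing in the interface. The `Distance` class and its private exact-meters field are assumptions of the sketch; the sample values are invented.

```python
from fractions import Fraction
from functools import total_ordering

@total_ordering
class Distance:
    """Abstract distance; no units or data types appear in its interface."""
    def __init__(self, meters):
        # Hypothetical private storage form; callers never rely on it.
        self._meters = Fraction(meters)

    def __sub__(self, other):
        return Distance(self._meters - other._meters)

    def __add__(self, other):
        return Distance(self._meters + other._meters)

    def __eq__(self, other):
        return self._meters == other._meters

    def __lt__(self, other):
        return self._meters < other._meters

bill = Distance("1.8288")    # invented sample values
george = Distance("1.7526")

difference = bill - george   # a Distance, not a number
print(george < bill, george + difference == bill)
```

Nothing in the calling code mentions feet, meters, or decimal; those appear only when a result has to be communicated or recorded.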

Concrete specifications, augmenting conceptual specifications with units and data types, would define external interfaces, which might be characterized as client or application or presentation or communication interfaces.

This configuration corresponds nicely with the ANSI three-schema architecture [3]. The client interface is the external view, the conceptual specification is the conceptual view, and the storage specification is the internal view. Both the client and storage specifications are concrete.

Data independence arises from the separability of the concrete specifications for external and storage interfaces, being linked via the conceptual specifications. A given conceptual specification can be implemented in different storage specifications, involving different units and data types (as well as different data structures and program code). These differences might arise serially, as when an application is ported to different environments or underlying implementations are tuned for performance. Different implementations might also be experienced concurrently, as in multi-database integration. There can also be different external specifications, involving different units and data types, as well as different display structures, e.g., tables, pie charts, and bar charts.

In practice, an "access path" can be compiled for a particular pair of external and storage specifications, allowing for efficient execution of a single composite mapping.

This configuration contrasts with current object models, which lump the client interface and the conceptual specification into a single notion of object interface. The storage specification then corresponds to implementation.

                        conceptual
                       specification
                      ---------------
communicating<--------| information |-------->recording
  concrete            ---------------         concrete
specification                               specification
<-----------------------------------><------------------>
          OBJECT INTERFACE              IMPLEMENTATION

Though the partitioning is correct, this description is oversimplified by neglecting data structures and procedures.

It might be more accurate to describe current object models as

                        conceptual
                       specification
                      ---------------
communicating<--------| information |-------->recording
  concrete            ---------------         concrete
specification                               specification
<----------------------------------->
          OBJECT INTERFACE
<------------------------------------------------------->
                     IMPLEMENTATION

That is, an object interface includes the client's concrete interface, and an implementation includes a description of the object interface it implements. This configuration allows only one client interface, but many storage interfaces, for a given conceptual specification.


Conceptual specifications describe the categories of things occurring in the information, and the associations defined among those categories of things.

From the conceptual point of view, magnitudes such as distances, weights, time intervals, velocities, angles, etc., are all very distinct kinds of things, and all distinct from the notions of numbers. This is in keeping with a fundamental principle that distinguishes object models from value-based models: things have an existence and identity apart from the values of their properties. People exist independently of their social security numbers or employee numbers. Colors exist independently of the names we might call them. So then does a certain distance, or a certain degree of heaviness, exist apart from the units in which they are measured. So also do numbers exist apart from the symbols we use to represent them [1].

Our ontology of concepts includes such categories of things as persons, employees, colors, distances, time intervals, velocities, angles, real and imaginary numbers, rational numbers, integers, and so on. We know some relationships among these categories: employees constitute a subset of persons, and the integers are a subset of the rationals which in turn are a subset of the reals.

A diagram of this conceptual ontology might take the form

  |       |         |                                    |
Person  Color     Magnitude                            Number
  |                 |                                    |
  |      ----------------------------------          ---------
  |      |   |   |    |      |        |   |          |       |
  | Distance | Volume | Time_Interval | Angle       Real   Imaginary
  |         Area    Weight          Speed            |         
Employee                                           Integer

Symbols are themselves concepts, since we can think about character strings and numerals. For illustrative purposes, we might subdivide symbols into such things as images, sounds, and strings over various character sets. Aggregates such as multisets, sets, and lists are also abstract concepts. So also are programs, a generic term for various concepts expressing behavior. Thus the ontology might be extended with

                |                        |               |
              Symbol                  Aggregate        Program
                |                        |
        --------------------....     ----------....
        |         |        |         |        |
      String   Picture   Sound      List   Multiset
        |                                     |
     ----------....                          Set
     |       |
  ASCII   Binary
  |           |      |
Alphabetic   Hex   Decimal
  |                  |
 Roman             Octal

Each string category is a set of strings over a particular character set. Thus Hexadecimal contains strings over the set {0-9,A-F}.

Associations (mappings, functions, perhaps even operations) are defined in terms of the categories of things they associate. (In this paper we neglect further specifications of the criteria or algorithms which determine such associations, such as procedure bodies, constraints, or pre- and post-conditions.)

The concept of height is an association between persons and distances:

Height: Person -> Distance.

The concept of name is an association between persons and alphabetic strings:

Name: Person -> Alphabetic.

An operation for moving a vertex would also be defined here:

Move: Vertex, Distance, Direction -> Location.

At this conceptual level, we also understand the concepts of measurement and representation. Thus the following sorts of associations would also be in the conceptual specifications:

Feet: Distance -> Number
Inches: Distance -> Number
Meters: Distance -> Number
Miles: Distance -> Number
Pounds: Weight -> Number
Grams: Weight -> Number
Kilograms: Weight -> Number
Quarts: Volume -> Number
Gallons: Volume -> Number
Degrees: Angle -> Number
Radians: Angle -> Number

HexType: Number -> Hexadecimal
OctalType: Number -> Octal

That is, the notions of measuring magnitudes and representing numbers are understood at the conceptual level. What isn't specified at the conceptual level is how such things as the heights of people are measured or represented.


The conceptual specification treats the computer as a black box full of information. We don't yet know how it records or communicates the information. To design mechanisms for recording or communicating, the conceptual specification needs to be augmented with concrete specifications mapping all information into symbols. Otherwise we don't know how to tell the computer how tall I am, how the computer will remember that, or how the computer will pass that information on.

Given a fact abstractly specified as

Height: Person -> Distance,

the description of how to implement this fact in stored data, or how to communicate it in response to an inquiry, needs to be augmented in some form such as

Height: Person -> Binary(Feet(Distance)),

or

Height: Person -> Decimal(Inches(Distance)).

As a syntactic device to emphasize the separation of conceptual and concrete specifications, we could employ a notation such as

Height: Person -> Distance => Feet => Binary,
Name: Person -> Alphabetic.

A vertically aligned notation might make it easier to map multiple parameters:

Move: Vertex, Distance, Direction -> Location
        |        |          |           |
        |        cm      degrees    rect_coord =
        |        |          |         |     |
       oid    decimal    integer      cm    cm
                                      |     |
                                  decimal decimal

In that format, it might also be easier to distinguish conceptual and concrete portions of the schema, since they occur on separate lines. Different concrete specifications (implementations) can be obtained simply by replacing the "concrete" lines of the specification. (More elaborate implementation specifications would include data structures and method code. That could be addressed in another paper.)
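
The idea of swapping concrete lines under a fixed conceptual line might be sketched as below. Everything here is hypothetical: the `concretize` helper, the two lookup tables, the use of exact meters to stand in for the abstract distance, and the rounding to a whole number before representation (a precision decision that a real specification would have to state).

```python
from fractions import Fraction

# Assumed unit table: how many meters one unit spans.
METERS_PER = {
    "feet": Fraction(3048, 10000),
    "inches": Fraction(254, 10000),
}

# Assumed data type table: Number -> Symbol.
DATATYPES = {
    "decimal": lambda n: format(n, "d"),
    "binary": lambda n: format(n, "b"),
}

def concretize(unit, datatype):
    """Build one concrete mapping Distance => unit => datatype."""
    def to_symbol(distance_m):
        number = distance_m / METERS_PER[unit]   # measurement
        return DATATYPES[datatype](round(number))  # representation
    return to_symbol

my_height = Fraction(18288, 10000)  # stands in for the abstract distance

stored_form = concretize("feet", "binary")
display_form = concretize("inches", "decimal")
print(stored_form(my_height), display_form(my_height))
```

The conceptual fact (a person has a height, which is a distance) never changes; only the pair passed to `concretize` does.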

The mapping to concrete specification is not always a simple two-stage mapping. Clearly if the information is already symbolic, as with names, then no mapping is required at all. If the information is already numeric, as with counts (e.g., inventories) or ratios, then no measurement is involved, but data type mapping is still required.

Complex magnitudes, on the other hand, might involve more than two levels of mappings. Locations in two dimensions might be measured either in rectangular or polar coordinates (first level of mapping). For either of these, the distances and angles involved themselves go through two more mappings for measurement and representation.
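
A small sketch of the extra level of mapping for locations follows. The `Location` class, its private centimeter fields, and the function names are assumptions; to keep the example short, the measurement stage is collapsed by holding the point in centimeters internally, so only the coordinate-system choice and the final representation are shown.

```python
import math

class Location:
    """An abstract point in the plane; the stored form is a private assumption."""
    def __init__(self, x_cm, y_cm):
        self._x_cm = x_cm
        self._y_cm = y_cm

def rect_coord(loc):
    """First-level mapping: Location -> (distance, distance), in cm."""
    return (loc._x_cm, loc._y_cm)

def polar_coord(loc):
    """Alternative first-level mapping: Location -> (distance in cm, angle in degrees)."""
    r = math.hypot(loc._x_cm, loc._y_cm)
    theta = math.degrees(math.atan2(loc._y_cm, loc._x_cm))
    return (r, theta)

def fixed2(n):
    """Representation: Number -> decimal symbol with two fraction digits."""
    return format(n, ".2f")

corner = Location(3, 4)
rect_symbols = tuple(fixed2(n) for n in rect_coord(corner))
polar_symbols = tuple(fixed2(n) for n in polar_coord(corner))
print(rect_symbols, polar_symbols)
```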

There is potentially yet another level of mapping, for such things as edit formats or masks, e.g.,

These might be handled as a specialization of data types, but probably better handled as a separate level of mapping. These are symbol-to-symbol maps.

With conventional documentation techniques, it's difficult to maintain consistent replicas of the conceptual specs together with each different set of concrete specs in which it is realized. (E.g., class definitions are sometimes said to incorporate type definitions.) Modern media management techniques can alleviate this, allowing a single "master copy" of the conceptual specification to be overlaid with different concrete specifications.


The units and datatype functions described above are hypothetical abstractions, fictions and figments of our imaginations. They might be executed in our minds, but never in any real computers. Real computations are mappings among symbols, not among magnitudes or abstract numbers. There is no real operation that converts a height to an abstract number, or converts an abstract number to its decimal representation.

Instead of datatype mappings between numbers and symbols, we have type conversions between symbols in the compiled composite mappings between external and internal specifications. In principle, the conversion from hex to decimal involves mapping a hex symbol to the abstract number it represents, and then to the decimal representation of that number:

symbol(in hex) -> number -> symbol(in decimal).

In practice, of course, there's only one computation, directly from one symbol to the other:

symbol(in hex) -> symbol(in decimal).

The same is true of units conversions. A conversion from feet to meters never materializes an actual abstract distance as an intermediate result. The conversion first accounts for data type differences, and then maps from the representation in feet to the representation in meters.

Real requests typically involve a chain of computations beginning and ending with symbols: input data, output data, or stored data. Intermediate results can be opaque (not specifying units or data types), but the computation is ultimately governed by a pair of representations at the beginning and the end. That determines the conversions required; the intermediate mappings to magnitudes and abstract numbers need never be visibly (concretely) executed.
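
To make the point concrete, here is a hedged sketch of two symbol-to-symbol conversions. The names `hex_to_decimal` and `compile_path`, the unit table, and the fixed-point output format are all assumptions of the sketch. Note that even the intermediate Python values are themselves representations; no abstract number or distance is ever materialized.

```python
from fractions import Fraction

def hex_to_decimal(symbol: str) -> str:
    """symbol(in hex) -> symbol(in decimal), as one computation."""
    return format(int(symbol, 16), "d")

# Assumed unit table: how many meters one unit spans.
METERS_PER = {"feet": Fraction(3048, 10000), "meters": Fraction(1)}

def compile_path(unit_in: str, unit_out: str, digits: int):
    """'Compile' a composite mapping for one pair of specifications."""
    factor = METERS_PER[unit_in] / METERS_PER[unit_out]  # computed once
    def convert(symbol_in: str) -> str:
        value = float(Fraction(symbol_in) * factor)
        return f"{value:.{digits}f}"
    return convert

feet_to_meters = compile_path("feet", "meters", 4)
print(hex_to_decimal("48"), feet_to_meters("6"))
```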


If I'm six feet tall, then I'm also 1.8288036 meters tall - more or less.

Numbers map into symbols imperfectly. Communication and storage media can't handle infinitely long symbols. They work best with symbols of bounded length, limiting the precision with which numbers can be represented. Thus it might only be possible to store or communicate integers between 0 and 2**31.

Constrained precision blurs the clean line we try to draw between conceptual and concrete specifications. Sometimes that's handled by implicit conventions; while the conceptual specification may say integers are supported, everyone "knows" certain integers are too big to handle. This may be spelled out as an implementation restriction in a software manual, or it may be "common knowledge" based on the word length of the underlying computer.

A certain amount of respectability is gained by abstractly defining finite subsets of numbers, such as short and long integers, or 31-bit integers, in signed and unsigned variants. These at least have the merit of defining the populations abstractly in the conceptual specification, even though they are induced by underlying implementation constraints.

Such compromises are a fact of life we endure all the time. You can't hide implementation completely. It keeps sneaking in between the cracks, nudging us with "implementation restrictions". As hard as our calculators and computers try to implement the model of arithmetic, the best they can do is finite calculation. Truncation and roundoff keep intruding with non-arithmetic behavior: 3*(1/3) comes out 0.9999999999, not equal to 1. We don't alter our concept of arithmetic because of that.
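
The calculator behavior can be reproduced with a small sketch using Python's `decimal` module, assuming a ten-digit precision to mimic a ten-digit display. The exact rational computation alongside it shows that the concept of arithmetic is unaffected; only the bounded representation falls short.

```python
from decimal import Decimal, localcontext
from fractions import Fraction

# Ten significant digits, as on a ten-digit calculator (an assumption).
with localcontext() as ctx:
    ctx.prec = 10
    third = Decimal(1) / Decimal(3)   # rounded to ten digits
    product = 3 * third               # exact at this precision, but not 1

# Exact rational arithmetic: no precision bound, so the identity holds.
exact = 3 * Fraction(1, 3)

print(product, product == 1, exact == 1)
```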


Sometimes a client interface will not specify a desired unit of measure or data type. This may be because it is implicitly assumed that the stored data satisfies the client's needs (data dependence), or because the client wishes to accept the data in whatever form it is stored to avoid conversions. In the latter case the concrete specification may include the units and/or data types as data items, making the mappings more complex. In effect, the concrete specification may take the form

Height: Person -> Symbol x Measure x Datatype,

where the Measure and Datatype information might have to map to something in the mapping between the conceptual and storage specifications. This more complex situation bears further investigation.
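
One possible shape for such self-describing values is sketched below. The `Described` record, the two lookup tables, and the `to_meters` helper are hypothetical, and the sketch handles only whole-number symbols in a few bases.

```python
from dataclasses import dataclass
from fractions import Fraction

@dataclass(frozen=True)
class Described:
    """A value of the form Symbol x Measure x Datatype."""
    symbol: str
    measure: str
    datatype: str

BASES = {"decimal": 10, "octal": 8, "hex": 16}
METERS_PER = {"feet": Fraction(3048, 10000), "inches": Fraction(254, 10000)}

def to_meters(value: Described) -> Fraction:
    """Interpret the symbol using the measure and data type it carries."""
    number = int(value.symbol, BASES[value.datatype])
    return number * METERS_PER[value.measure]

a = Described("48", "inches", "hex")
b = Described("6", "feet", "decimal")
print(to_meters(a) == to_meters(b))
```

Two values that look nothing alike as symbols turn out to describe the same distance once their accompanying measure and data type are taken into account.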


The essential point is to recognize magnitudes, units of measure, numbers, data types, and symbols as distinct constructs. Measures and data types don't have to be modeled as functions.

Alternative models can be developed in terms of curried equivalences, a kind of equivalence transformation that can exist at the schema or model (meta-schema) level. At the schema level, it accounts for one sort of schema mismatch in integrating heterogeneous databases [2]. At the model level, it can define some equivalences among different models.

The general idea is that, given a function

f: X, Y -> Z,

there exists a set of functions

fi: Y -> Z,

one for each member xi of X, such that

fi(y) = f(xi,y).

Conversely, given a set of functions having similar signatures

fi: Y -> Z,

there is a corresponding set of objects X containing one xi for each fi, and a function

f: X, Y -> Z

such that

f(xi,y) = fi(y).

We will make use of the converse transformation. Let's relabel the units of measure discussed above as InchesF, FeetF, MetersF, etc., to emphasize that they are functions:

InchesF: Distance -> Number
FeetF: Distance -> Number
MetersF: Distance -> Number.

Those are the fi, with Y being Distance. For X, we can introduce a set of Units whose members xi are InchesU, FeetU, MetersU, etc. For f, we introduce a new function Measure:

Measure: Units, Distance -> Number

such that

Measure(InchesU,d) = InchesF(d)
Measure(FeetU, d) = FeetF (d)
Measure(MetersU,d) = MetersF(d)
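
The equivalence can be sketched with `functools.partial`, which fixes the first argument of a two-argument function. The `Units` enumeration, its conversion factors, and the use of exact meters to stand in for an abstract distance are assumptions of the sketch.

```python
from enum import Enum
from fractions import Fraction
from functools import partial

class Units(Enum):
    """The set X: one passive object per unit, carrying its size in meters."""
    INCHES = Fraction(254, 10000)
    FEET = Fraction(3048, 10000)
    METERS = Fraction(1)

def measure(unit: Units, d: Fraction) -> Fraction:
    """f: Units, Distance -> Number."""
    return d / unit.value

# The fi: one single-argument function per member of Units.
inches_f = partial(measure, Units.INCHES)
feet_f = partial(measure, Units.FEET)
meters_f = partial(measure, Units.METERS)

d = Fraction(18288, 10000)  # stands in for an abstract distance
print(inches_f(d), feet_f(d), meters_f(d))
```

Going from `measure` to the three partial functions is the forward transformation; the converse used in the text recovers `measure` from the family of unit functions.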

Conceptually, this formulation makes the process of measurement an explicitly visible activity, with the units of measure being passive participants (parameters). It is also more conducive to allowing units of measure to be stored with self-describing information. This latter might arguably be a more satisfying intuitive treatment, but there is nonetheless an equivalence mapping into the model developed earlier.

Data types can be treated similarly, using Represent as the collective function:

Datatype = {Integer, Hex, Octal, ...}

Integer: Number -> Symbol
Hex: Number -> Symbol

Represent: Datatype, Number -> Symbol

Represent(Integer,n) = Integer(n)
Represent(Hex,n) = Hex(n)

In this approach, the graph of conceptual categories is extended to include measures and data types:

                |                  |   
             Measure            Datatype

The two approaches can be unified if we don't mind letting functions also be things which themselves can occur as arguments to other functions. The members of Measure and Datatype could themselves be functions, such that InchesU is in fact InchesF. Then Measure and InchesF could both be functions, with the equivalence

Measure(InchesF,d) = InchesF(d)
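
In a language where functions are first-class values, the unified view is almost trivial to write down. The sketch below uses hypothetical names throughout and again lets exact meters stand in for an abstract distance.

```python
from fractions import Fraction

def inches_f(d: Fraction) -> Fraction:
    """A unit of measure, as a function: Distance -> Number."""
    return d / Fraction(254, 10000)

def hex_f(n: Fraction) -> str:
    """A data type, as a function: Number -> Symbol (whole numbers only)."""
    return format(int(n), "X")

def measure(unit_fn, d):
    """Measure takes the unit function itself as its first argument."""
    return unit_fn(d)

def represent(type_fn, n):
    """Represent takes the data type function itself as its first argument."""
    return type_fn(n)

d = Fraction(18288, 10000)  # stands in for an abstract distance
print(represent(hex_f, measure(inches_f, d)))
```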


Units of measure and data types are distinct constructs which map magnitudes into symbols. This approach eliminates a lot of confusion in conceptual information specifications, whether in the context of the ANSI three-schema architecture or in the context of abstract object interface specifications. The approach also facilitates a systematic treatment of domain mismatch for multi-database integration.

Magnitudes such as distance, weight, and time are distinct concepts, distinct from the numbers that measure them and from the symbols that represent those numbers. They are related to the properties of objects by all or some of the following mappings:

         property              measure            datatype
Object ------------> Magnitude ------------> Number ------------> Symbol

In conceptual specifications, information involving magnitudes is expressed purely in terms of those magnitudes:


e.g., Height: Person->Distance.

Concrete specifications are required to map these into symbols both externally for communication with clients and internally for recording in storage implementations. Concrete specifications use units of measure and data types to map magnitudes into symbols:

             measure            datatype
Magnitude ------------> Number ------------> Symbol

Data independence is obtained through the independent mappings of external and internal interfaces to the conceptual model. Current object models already separate object interfaces from implementation. One further distinction is required, separating the abstract object interface from client interfaces.

Domain mismatch in multi-database integration can be handled in similar steps. Correspondence between attributes should first be established at the conceptual level in terms of common magnitudes, e.g., distance or weight or time, etc. Then the units of measure and data types used to implement these can be identified, with appropriate conversions specified.
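
These two steps might look as follows in a toy setting. The two sources, their names, their contents, and the unit table are all invented for illustration; the common magnitude is held as exact meters purely as an internal convenience.

```python
from fractions import Fraction

# Assumed unit table: how many meters one unit spans.
METERS_PER = {"feet": Fraction(3048, 10000), "cm": Fraction(1, 100)}

# Two hypothetical sources recording the same attribute differently.
SOURCES = {
    "personnel": {"unit": "feet", "rows": {"Bill": "6"}},
    "clinic": {"unit": "cm", "rows": {"Bill": "182.88"}},
}

def height_as_distance(source: str, name: str) -> Fraction:
    """Step 1: both attributes correspond to the magnitude Distance.
    Step 2: apply the conversion implied by the source's unit and data type."""
    spec = SOURCES[source]
    number = Fraction(spec["rows"][name])  # decimal symbol -> number
    return number * METERS_PER[spec["unit"]]

same = height_as_distance("personnel", "Bill") == height_as_distance("clinic", "Bill")
print(same)
```

The symbols "6" and "182.88" never match directly; the correspondence only appears once both are traced back to a common magnitude.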


  1. William Kent, "A Rigorous Model of Object Reference, Identity, and Existence", Journal of Object-Oriented Programming 4(3) June 1991 pp. 28-38.
  2. William Kent, "Solving Domain Mismatch and Schema Mismatch Problems With an Object-Oriented Database Programming Language", Proc. 17th Intl. Conf. on Very Large Data Bases, Sept. 3-6, 1991, Barcelona, Spain.
  3. D. Tsichritzis and A. Klug (eds), "The ANSI/X3/SPARC DBMS Framework. Report of Study Group on Data Base Management Systems", AFIPS Press, Montvale NJ, 1977.