Meta-Data and Uncertainty

The uncertainty of information is becoming an important topic with the increased use of computers for data processing, presentation and analysis. This is being addressed with position statements such as the Environmental Protection Agency's Locational Data Policy (1991) and the National Center for Geographic Information and Analysis' Visualization of Data Quality initiative (MacEachren 1992, 47). Yet these cannot eliminate uncertainty, which exists at the most basic levels of measurement according to Heisenberg's Uncertainty Principle (Capra 1983). Because of this, David Rejeski (1991) suggests that uncertainties should be addressed openly in order to ensure that decisions have both utility and believability. Rejeski, like Granger Morgan and Max Henrion, recognizes that uncertainty can be valuable information. Morgan and Henrion (1990, 3) give three specific reasons for the inclusion of uncertainty in policy-oriented research:

  1. A central purpose of policy research and policy analysis is to help identify important factors and the sources of disagreement in a problem, and to help anticipate the unexpected. An explicit treatment of uncertainty forces us to think more carefully about such matters, helps us identify which factors are most and least important, and helps us plan for contingencies or hedge our bets.
  2. Increasingly we must rely on experts when we make decisions. It is often hard to be sure we understand exactly what they are telling us. It is harder still to know what to do when different experts appear to be telling us different things. If we insist they tell us about the uncertainty of their judgments, we will be clearer about how much they think they know and whether they really disagree.
  3. Rarely is any problem solved once and for all. Problems have a way of resurfacing. The details may change but the basic problems keep coming back again and again. Sometimes we would like to be able to use, or adapt, policy analyses that have been done in the past to help with the problems of the moment. This is much easier to do when the uncertainties of the past work have been carefully described, because then we can have greater confidence that we are using the earlier work in an appropriate way.

Uncertainty has dictionary definitions such as "uncertain in respect of duration, continuance, occurrence, etc.," "liability to chance," and "indeterminate as to magnitude or value" (Simpson and Weiner, Oxford English Dictionary, 1989, 899)1. A more useful interpretation for environmental risk analysis would be that uncertainty is the information contained in the data about data (that is, meta-data). By defining uncertainty this way, meta-data can be used as another piece of information in the analysis and presentation of data, including risk-based policy making.

Uncertainty has a taxonomy that should be useful in delimiting the origins of uncertainty and the reliability of data at any given point in an analysis (see Figure 1.2). Morgan and Henrion (1990) discuss "The Nature and Sources of Uncertainty" in detail in chapter 4 of Uncertainty: A Guide to Dealing with Uncertainty in Quantitative Risk and Policy Analysis; the presentation here is a distillation of several sources, including Morgan and Henrion and the others noted. The taxonomy can be broken down into three groups: uncertainties of the physical world; uncertainties of the computer world; and uncertainties of the human world. Problems of the measurement of natural phenomena constitute the first category; Alan MacEachren (1992, 48) reports that the National Center for Geographic Information and Analysis calls this "data quality". The uncertainty of the physical world can be further split into measurement uncertainty and parameter uncertainty.

Figure 1.2 A taxonomy of uncertainty.

Uncertainties of the Physical World

Measurement uncertainty for geographic data includes both location and attribute uncertainties. Location uncertainties (Rejeski and Kapuscinski 1990, 10) can be considered the accuracy (closeness to a 'true' value) of the instruments and the reliability, or precision (repeatability of a measurement), of the methods used to calculate a site's position (see Figure 1.3). For example, the location of a phenomenon can be determined by many methods, such as a professional surveyor's analysis, use of a global positioning system, or terrain analysis estimation. Each of these methods has a different accuracy and precision, and the meta-data that should be recorded includes the manner in which a site location was first defined, the method used to derive its location, the time it was derived, and an estimate of its accuracy.

Figure 1.3 An example of the difference between accuracy and reliability in spatial location measurements.

Attribute uncertainty can be considered the accuracy and reliability of the instruments used to take a measurement of the environment. For example, an instrument may be set up to measure atmospheric concentrations of carbon monoxide, measuring levels in parts per million and with an accuracy of plus or minus one part per million. The data in this situation is the part per million measurement and the meta-data includes the plus or minus one part per million accuracy of the instrument.
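
One way to keep data and meta-data together is to store them in a single record. The following is a minimal sketch, not a prescribed format: the class name, field names, and the sample reading (9.0 ppm, the monitor description, and the date) are all hypothetical and chosen only to mirror the carbon monoxide example above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Tuple


@dataclass(frozen=True)
class Measurement:
    """A reading stored together with its meta-data (hypothetical layout)."""
    value: float           # the data, e.g. carbon monoxide in ppm
    units: str
    accuracy: float        # meta-data: instrument accuracy, +/- in the same units
    method: str            # meta-data: how the value was obtained
    recorded_at: datetime  # meta-data: when the value was obtained

    def interval(self) -> Tuple[float, float]:
        """Range of values consistent with the stated accuracy."""
        return (self.value - self.accuracy, self.value + self.accuracy)


# Illustrative values only; they are not taken from any real instrument.
reading = Measurement(9.0, "ppm", 1.0, "fixed CO monitor", datetime(1992, 6, 1, 12, 0))
low, high = reading.interval()
print(f"{reading.value} {reading.units} (between {low} and {high} {reading.units})")
```

Because the accuracy travels with the value, any later analysis can report the interval rather than the bare number.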

Parameter uncertainty (Rejeski and Kapuscinski 1990, 11; MacEachren 1992, 47-8) involves the problem of the aggregation and generalization of point samples to areas for spatial data, or trend lines for linear data; the temporal variability between data items and between measurement times and data usage; and the logical consistency and completeness of data. This is the question of whether the measurements that are recorded and then used for analysis are adequate measurements of what was intended to be measured; it entails the consequences of the assumption of autocorrelation. Before a model is constructed to provide an explanation or a projection of environmental phenomena, it must be recognized that there are few (if any) phenomena that can be precisely defined and measured for all possible occurrences. Because of this, interpolation and extrapolation--whether linear (as in a measurement), spatial (as in an area generalization), or temporal (as in data from one time as estimates for some other time)--must be done even though the process introduces uncertainty into the analysis.

Figure 1.4 An example of a change from a hard to a 'fuzzy' boundary.

There have been suggestions made for reducing and representing parameter uncertainty. For generalizing spatial data, Cort, Rowe and Philpot (1985; in MacEachren 1992, 44) suggest that interpolation in spherical, rather than planar, coordinates will introduce less uncertainty in the creation of area information from point samples (particularly for large areas). MacEachren and Davidson (1987; in MacEachren 1992, 46) demonstrate that increasing sampling frequency will also reduce (but not eliminate) the uncertainty of interpolation; this should be true for linear and temporal data as well. For representing parameter uncertainty, Rejeski and Kapuscinski (1990, 10) suggest the use of transitional buffer zones to represent "fuzzy" boundaries, rather than one line demarcating "hard edges" (see Figure 1.4).
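
The effect of sampling frequency on interpolation can be illustrated with a small numerical experiment. This sketch is not drawn from MacEachren and Davidson; it assumes a smooth one-dimensional "phenomenon" (a sine curve), evenly spaced samples, and linear interpolation, and the function name is hypothetical.

```python
import math


def max_interpolation_error(n_samples: int, n_checks: int = 1000) -> float:
    """Largest gap between sin(x) on [0, pi] and its piecewise-linear
    interpolation through n_samples evenly spaced sample points."""
    xs = [math.pi * i / (n_samples - 1) for i in range(n_samples)]
    ys = [math.sin(x) for x in xs]
    worst = 0.0
    for k in range(n_checks + 1):
        x = math.pi * k / n_checks
        # Locate the sample interval that contains x.
        i = min(int(x / math.pi * (n_samples - 1)), n_samples - 2)
        t = (x - xs[i]) / (xs[i + 1] - xs[i])
        estimate = ys[i] + t * (ys[i + 1] - ys[i])
        worst = max(worst, abs(estimate - math.sin(x)))
    return worst


# More samples give a smaller worst-case error, but never exactly zero.
for n in (3, 5, 9, 17):
    print(n, max_interpolation_error(n))
```

Under these assumptions the worst-case error shrinks as samples are added yet remains positive, which is the "reduce but not eliminate" behavior described above.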

Uncertainties of the Computer World

Problems of the use of computers for storing and manipulating data constitute the second category of uncertainty. This can be subdivided into four groups: descriptive uncertainty, computational uncertainty, propagational uncertainty and modeling uncertainty. Thoughtful use of programs and programming techniques can reduce the amount of uncertainty introduced to data by computer manipulation. Herman Knoble (1990, 2) states that correct and accurate computer programs must be an ethical responsibility when dealing with numerical algorithms, all of which, "at the bottom line...affect people."

Descriptive uncertainty deals with the representation of data in computers, including both numeric and spatial problems. Numeric uncertainty arises from the method computers use to store data. For example, the binary number system does not have an exact representation of some common decimal numbers, such as 1/100. Another type of numeric uncertainty arising from number storage is the shifting of significant digits. If a number is input into a computer that has the ability to store a longer number than is input, the computer will 'pad' the extra space with zeros. These digits will be available for computation in numeric modeling even though they add no information and are meaningless. Significant digit shift can occur in the other direction as well. If a number is input into a computer that stores fewer digits than is input, the computer will round or truncate the number to fit its numeric scheme. These problems can be controlled, but not eliminated, by specifically programming the computer for the required operations (Knoble 1990), but for generally available software this is not possible.
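
The representation problem for 1/100 can be inspected directly. This is a sketch assuming IEEE 754 double precision, which is what Python floats use on common platforms; it is not the number format of the machines discussed in the sources.

```python
from decimal import Decimal
from fractions import Fraction

stored = 0.01  # the nearest binary double to the decimal number 1/100

# Decimal(float) shows every digit of the value that is actually stored.
print(Decimal(stored))

# The stored value is a binary fraction close to, but not equal to, 1/100.
error = Fraction(stored) - Fraction(1, 100)
print(error == 0)
print(float(error))
```

The long string of digits printed by the first line also illustrates the significant digit problem: all of those digits are available for computation, although only the first few reflect the number that was entered.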

Spatially, data is generally stored as either a table of vectors that define sharp boundaries between regions, or in regular tessellations (square, 'raster' grids) that force a predefined spatial pattern on an area (Rejeski and Kapuscinski 1990, 10). Both of these methods introduce uncertainty: the vector representation forces boundary lines where transition zones may be; and tessellations assume the entire grid cell is homogeneous. These problems can be reduced, but again not eliminated, by using smaller polygons or grid cells. Doing so, however, forces a trade-off: the decrease in uncertainty is paid for with larger data files and longer computation times, which may not be pragmatically feasible.

Computational uncertainty deals with the problems of numeric modeling by using computers (Knoble 1990; Rejeski and Kapuscinski 1990, 13). Once data are stored in a digital format, any further processing can introduce uncertainty. Type shifting (such as from an integer format to a real number format, or from reals to integers) can introduce uncertainty by forcing rounding to occur. More commonly, rounding occurs within real numbers when numbers that are not similar in value are arithmetically joined. Knoble (1990, 4-5) gives an example of the results of this type of rounding: a VS FORTRAN 2.4 program, run on an IBM 3090 Model 600, that for the formula:

P=((A+X)**2-A**2-2.*A*X)/X**2

generates an answer of -4999.99609 when A is equal to 1000.0 and X is equal to 0.01, despite the fact that the formula simplifies to P equalling one for all X's not equal to zero.
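
The same formula can be tried on current hardware. The sketch below is not Knoble's program: it assumes IEEE 754 single precision, emulated by rounding after every operation with the standard `struct` module, whereas the IBM 3090 used hexadecimal floating point, so the specific value reported above should not be expected. The helper names `f32`, `p_single` and `p_double` are hypothetical.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest IEEE 754 binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def p_single(a: float, x: float) -> float:
    """Evaluate ((A+X)**2 - A**2 - 2.*A*X) / X**2, rounding to single
    precision after every arithmetic step."""
    a, x = f32(a), f32(x)
    s = f32(a + x)
    numerator = f32(f32(f32(s * s) - f32(a * a)) - f32(f32(2.0 * a) * x))
    return f32(numerator / f32(x * x))


def p_double(a: float, x: float) -> float:
    """The same expression in ordinary double precision."""
    return ((a + x) ** 2 - a ** 2 - 2.0 * a * x) / x ** 2


print(p_single(1000.0, 0.01))  # far from the algebraic answer of 1
print(p_double(1000.0, 0.01))  # close to 1, but still not exact
```

Carrying more digits moves the computed answer closer to one, but the rounding is reduced rather than removed.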

Significant digit shift can also occur in arithmetic operations, particularly in operations involving subtraction of similar values. Subtracting 1.23456787 * 10^7 from 1.23456789 * 10^7 yields 0.2, but a computer can store this value as 2.00000000 * 10^-1, and as with data input, will allow computation on all of the digits to the right of the 2, which is the only meaningful digit in the number.
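
This subtraction can be reproduced as a short experiment. The sketch assumes IEEE 754 arithmetic: Python floats are taken to be double precision, and single precision is emulated with the standard `struct` module through a hypothetical helper `f32`.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


a, b = 1.23456789e7, 1.23456787e7

# Double precision: about 0.2, followed by digits that carry no information.
diff64 = a - b
print(diff64)

# Single precision cannot tell the two inputs apart at this magnitude,
# so even the one meaningful digit is lost.
diff32 = f32(f32(a) - f32(b))
print(diff32)
```

In neither case does the stored result indicate how many of its digits are meaningful; that information has to be carried separately, as meta-data.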

Two additional, similar problems that may occur are overflow and underflow. These result when a number is incremented or decremented beyond the storage type's ability to represent numbers. Depending on the operating system, this may cause an error, or it may be ignored, with the value either remaining the same or changing drastically. Borland's Turbo C++ 2.0, running under MS-DOS 5.0 on an IBM AT, compiles programs that allow adding one to the thirty-two bit, 'unsigned long' integer 4294967295 (which is equal to 2^32 - 1, and is the largest integer that can be represented in thirty-two bits), changing the value of the variable to 0. Turbo C++ compiled programs will give an overflow error when thirty-two bit real numbers (type 'float') are incremented outside of type float's range. When these computational uncertainty errors occur in a program, and are not handled well, the program could continue and generate an apparently correct answer purely by chance (Knoble 1990, 2).

Propagational uncertainty deals with the problem of how uncertainty from a physical world measurement or another computer-related uncertainty moves through successive iterations of a model. It can be tested by varying the input to a numeric model in small steps to see if small changes can make large differences in the output of a computation, which in principle should not occur. For example, by making the value of A equal to 100.0 and X equal to 0.01, Knoble's (1990, 5) program generates a value for P equal to -39.0624847; by changing X to 0.0078125, the program generates a value of P equal to 0.0000000. Propagational uncertainty can also be tested if the computer program can be rewritten in an algebraically equivalent but computationally different manner, which allows comparison between programs that should generate the same output. By simplifying the formula in Knoble's example to a command line such as:

if X not equal to 0 then P = 1 else print "DIVISION BY ZERO"

the problems of floating point arithmetic can be avoided.
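
Both tests, varying the input in small steps and comparing against an algebraically equivalent form, can be sketched together. As before, this is not Knoble's program: it assumes IEEE 754 single precision emulated with the standard `struct` module, so the value for X = 0.01 is expected to differ from the IBM 3090 result quoted above. The helper names are hypothetical.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def p_single(a: float, x: float) -> float:
    """The original formula, rounded to single precision after every step."""
    a, x = f32(a), f32(x)
    s = f32(a + x)
    numerator = f32(f32(f32(s * s) - f32(a * a)) - f32(f32(2.0 * a) * x))
    return f32(numerator / f32(x * x))


def p_simplified(x: float) -> float:
    """The algebraically equivalent form: P is 1 for every nonzero X."""
    if x == 0:
        raise ZeroDivisionError("DIVISION BY ZERO")
    return 1.0


a = 100.0
for x in (0.01, 0.0078125):
    # A small change in X produces a large change in the computed P,
    # while the simplified form stays at 1.
    print(x, p_single(a, x), p_simplified(x))
```

A disagreement between the two forms signals that the difference comes from the arithmetic and not from the model.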

Figure 1.5 Polygon overlay can cause shifting in boundary lines, or can create boundary slivers.

Propagational uncertainty not only involves the accuracy of real number computations for numeric data, but also the generation of polygon overlay slivers within vector spatial data, such as that used by ARC/INFO (see Figure 1.5). The slivers may be handled by "fuzzy tolerances" (ARC Command References, Commands J-Z, 1991, UNION 1), which would allow the shifting of close lines so that they merged in the output. This could cause the shifting of data from one layer of a known high accuracy (such as a surveyor's cadastral data) to correspond with a layer of lower or unknown accuracy (such as information digitized from a medium or small scale paper map). All future use of the merged data layer would include the uncertainty created by the overlaying of two data layers of varying accuracy, and the meta-data that should be attached to the new data layer would have to reflect an estimate of how much shifting occurred. This type of shifting can be eliminated by avoiding fuzzy overlays, although this can cause the generation of sliver regions along boundaries that will cause increased storage and processing time for use of the new layer, and may not hold any useful information other than the indication of the difference between the two layers used to create the new layer.
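
The idea of recording how much shifting a tolerance-based merge introduces can be sketched in a few lines. This is an illustration only and does not reproduce ARC/INFO's overlay algorithm: it snaps individual vertices, not line segments, and the function name, the two small layers, and the tolerance are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def snap_to_reference(points: List[Point], reference: List[Point],
                      tolerance: float) -> Tuple[List[Point], float]:
    """Move each point onto its nearest reference vertex when that vertex
    lies within `tolerance`; return the new points and the largest shift."""
    snapped: List[Point] = []
    max_shift = 0.0
    for p in points:
        nearest = min(reference, key=lambda r: math.dist(p, r))
        shift = math.dist(p, nearest)
        if shift <= tolerance:
            snapped.append(nearest)
            max_shift = max(max_shift, shift)
        else:
            snapped.append(p)  # too far away: left where it was
    return snapped, max_shift


surveyed = [(0.0, 0.0), (10.0, 0.0)]   # hypothetical high-accuracy layer
digitized = [(0.3, 0.4), (10.0, 2.0)]  # hypothetical lower-accuracy layer
merged, max_shift = snap_to_reference(digitized, surveyed, tolerance=1.0)
print(merged, max_shift)
```

Here the lower-accuracy vertices are moved onto the surveyed ones, which is the opposite of the undesirable case described above; in either direction, the returned maximum shift is the kind of estimate that would be attached to the merged layer as meta-data.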

Modeling uncertainty culminates the types of uncertainty associated with computers. Although there are several types of modeling (verbal, graphic, physical, and mathematical), each of which is subject to questions of robustness and validity versus the goal of the model (such as description or prediction), it is only mathematical models that tend to be dependent on computers for execution and are thus subject to the other computer uncertainties. The robustness of a model is reflected in its ability to handle all appropriate input and produce a reasonable output. It is thus tied to propagational uncertainty, but robustness entails more: not only must small changes in input not cause inappropriate changes in output, but the model's output, in practice, must also fall within the range expected of the mathematical model, in principle. The validity of a model is the question of whether a model, in practice, actually represents the phenomena being modeled, in principle. This questions the methods used to operationalize a numerical model, such as the validity of using 'if...then' statements to ensure the apparent robustness of a procedure that would otherwise produce non-robust output, when no such statements are apparent in the 'real world' phenomena.

Uncertainties of the Human World

Problems of the human communication of information and meanings constitute the third area of uncertainty (Rejeski and Kapuscinski 1990, 12). In attempting to communicate information, meta-data can be lost in two ways: the sender of information can give data without the meta-data, or the receiver of information does not understand that part of the message. The first way, not giving meta-data with data, can be the result of restrictions of space within publication materials, the desire for the appearance of greater accuracy (Star 1985), and, until recently, a lack of awareness of the potential importance of uncertainty information, particularly in the areas of human and environmental risk analysis (Rejeski and Kapuscinski 1990). The second way that meta-data can be lost is when the receiver of information does not receive that part of a message. This can result from ignoring the meta-data or from an inability to interpret it through lack of experience with the way it is conveyed. This failure of communication can be the result of the variable interpretations of both words and images. The meaning of words constitutes one of the major problems the U.S. Environmental Protection Agency has had to deal with in risk analysis (Rejeski and Kapuscinski 1990, 12). Effective communication of meta-data has been studied in a strictly human realm (body language, etc.), but little has been done in the area of communication of uncertainty within the realm of scientific communication, particularly with spatial data (Rejeski and Kapuscinski 1990, MacEachren 1992).

This taxonomy should prove useful, particularly for reducing human communication uncertainties. By recognizing that uncertainty exists in measurement and is propagated in computer manipulation of data, these uncertainties can be dealt with openly and honestly, as Rejeski (1991) suggests. This, aided by visual communication techniques, should reduce human uncertainty by increasing the amount of information (by communicating meta-data) given in the presentation of data.


1 Uncertainty begins as the vagueness of duration, etc., with Wyclif in 1382. By 1982, Oxford (1989, 900) reports: "What the uncertainty principle asserts is that for no state of any system can all dynamical variables be arbitrarily well-determined."