The Principles of Data

For geographic phenomena, two types of data make up the information about a phenomenon: attribute data (the measured characteristic of a location) and location data (the measured location of a characteristic). This pairing of information types is reflected in the two major approaches to geographic information systems: vector systems, which emphasize the attribute, and raster systems, which emphasize the location. ARC/INFO now combines both, allowing a wide variety of analysis and mapping, but the appropriate use of this information still depends on the analyst or cartographer.

Attribute data can be grouped into several categories, called empirical levels, which influence the design requirements for representing information. Each empirical level contains all of the information of the lower levels (and thus can be simplified to them) while adding further information (see Figure 1.1 for a comparison of attribute data levels with spatial characteristics and 'visual variables' for area data). The lowest empirical level is nominal data; this data only indicates that something is different from something else. Names of states are an example of nominal categorization. The next level is ordinal data; this data indicates that something is more (or less) than something else, but no quantity can be assigned to that difference. The terms high, medium and low reflect ordinal differences. The third level is interval data; here the differences between instances can be measured on a linear scale. The highest level is ratio data; this data indicates that something is more (or less) than something else (and thus different), that a linear distance can be used to gauge differences, and that ratio comparisons (such as: this is twice that) can also be made. The essential difference between ratio and interval data is that ratio data has a non-arbitrary zero point intrinsic to the measurement; this difference is small enough that, for visualization, the methods used to represent both types of data are the same. An example of the difference between interval and ratio measurement is temperature measured in degrees Celsius and in kelvins: the Celsius scale has an arbitrary zero (the freezing point of water), but the Kelvin scale's zero is not arbitrary (absolute zero, the point at which the particle motion that is 'temperature' reaches its minimum).
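The Celsius and Kelvin example can be checked with a little arithmetic. The following Python sketch is illustrative only; the function name celsius_to_kelvin and the sample temperatures are assumptions introduced here. It shows that differences agree on both scales, while a ratio is only meaningful on the scale with a non-arbitrary zero.

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvins by shifting the zero point."""
    return c + 273.15

# Hypothetical sample temperatures, chosen only for illustration.
a_c, b_c = 10.0, 20.0
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

# Interval property: the difference is the same on both scales.
diff_c = b_c - a_c
diff_k = b_k - a_k

# Ratio property: the two scales disagree, and only the Kelvin ratio
# reflects a physical relationship between the two temperatures.
ratio_c = b_c / a_c   # 2.0, yet 20 degrees C is not "twice as hot" as 10 degrees C
ratio_k = b_k / a_k   # roughly 1.035
```

The Celsius ratio changes if the zero point is moved, which is why "twice that" comparisons are reserved for ratio data.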

Figure 1.1 Spatial data models for area data and their cartographic representations; developed from MacEachren and DiBiase (1991) and DiBiase, Krygier, Reeves, MacEachren and Brenner (1991). In a thematic map, the information to be displayed can be categorized by its empirical level (nominal to ratio), how the data changes within any region it is aggregated into, and how the data changes between adjacent regions. Once this categorization is done, visual variables can be selected on the basis of the ability to represent the information so categorized.

Because of these differences in attribute data, care must be taken when comparisons between different levels of data are made (see Table 1.1 for the types of comparisons that can be performed on data of given levels). For example, ordinal data may be stored as integers that represent the order, but the values of these integers do not measure the variability of the data. This lack of measurement does not, however, prevent a software system from treating the data as if it were ratio data and thereby calculating essentially meaningless statistics (or spatial patterns). If data of different levels must be compared, either the higher level of data must be reduced to the lower, or a transformation (an addition of information) must be applied to raise the lower level to the higher. For nominal or ordinal data, the procedures of psychometrics can be used to recast the information as interval or ratio data. Essentially, this involves assigning utility values (such as money) to data that do not normally have this type of information associated with them (such as aesthetic values). This can be accomplished by conducting a survey to collect individual assignments of value, and then using the collected responses to assign overall values to the data.
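The risk of treating ordinal codes as ratio data can be sketched in a few lines of Python. Everything in this example is hypothetical (the two integer codings, the region observations, and the helper summarize); it only illustrates that a mean computed from ordinal codes depends on the arbitrary choice of integers, while an order-based statistic such as the median does not.

```python
from statistics import mean, median

# Two hypothetical integer codings of the same ordinal scale.
# Both preserve the order low < medium < high.
coding_a = {"low": 1, "medium": 2, "high": 3}
coding_b = {"low": 1, "medium": 2, "high": 10}

# Hypothetical observations for two regions.
region_x = ["medium", "medium", "medium"]
region_y = ["low", "low", "high"]

def summarize(values, coding):
    """Return (mean, median) of the integer codes for a list of ordinal values."""
    codes = [coding[v] for v in values]
    return mean(codes), median(codes)

mean_x_a, med_x_a = summarize(region_x, coding_a)
mean_y_a, med_y_a = summarize(region_y, coding_a)
mean_x_b, med_x_b = summarize(region_x, coding_b)
mean_y_b, med_y_b = summarize(region_y, coding_b)

# Under coding A the mean ranks region X above region Y; under coding B
# the ranking reverses. The medians rank X above Y under both codings.
```

Because both codings are equally legitimate representations of the order, a conclusion that flips between them says nothing about the data itself.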

Operation                  Nominal   Ordinal   Interval  Ratio
Rank Comparison            Invalid   Valid     Valid     Valid
Addition, Subtraction      Invalid   Invalid   Valid     Valid
Multiplication, Division   Invalid   Invalid   Invalid   Valid
Statistics:
  Parametric               Invalid   Invalid   Valid     Valid
  Nonparametric            Valid     Valid     Valid     Valid
Table 1.1 Valid data comparisons. Because of the varying degree of numerical precision associated with different data levels (nominal through ratio) only certain operations can be applied to comparisons between two data sets; comparisons between data of two different levels must occur at the lower of the two classes (ESRI Grid Class Notes 1992, 11-22).
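The rules in Table 1.1 can be encoded as a small lookup, which may be useful as a guard before running an analysis. This is a sketch in Python, not an ARC/INFO facility; the names LEVELS, MINIMUM_LEVEL, common_level and is_valid are assumptions introduced for illustration, and the entries are transcribed from the table.

```python
# Empirical levels in ascending order, as in Table 1.1.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

# Lowest level at which each operation is valid, transcribed from Table 1.1.
MINIMUM_LEVEL = {
    "rank comparison": "ordinal",
    "addition, subtraction": "interval",
    "multiplication, division": "ratio",
    "parametric statistics": "interval",
    "nonparametric statistics": "nominal",
}

def common_level(level_a, level_b):
    """Return the lower of two levels; comparisons must occur at this level."""
    return min(level_a, level_b, key=LEVELS.index)

def is_valid(operation, level_a, level_b=None):
    """Check an operation against one data set, or against a pair of data sets."""
    level = level_a if level_b is None else common_level(level_a, level_b)
    return LEVELS.index(level) >= LEVELS.index(MINIMUM_LEVEL[operation])

# Comparing an ordinal layer with a ratio layer happens at the ordinal level,
# so ranking is allowed but subtraction is not.
can_rank = is_valid("rank comparison", "ordinal", "ratio")
can_subtract = is_valid("addition, subtraction", "ordinal", "ratio")
```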

Although all information is already classified to some degree, because of the measurement uncertainty inherent in recording data (MacEachren 1992, 48), attribute data can be further classed into groups for both presentation and analysis (see Classifying Data for a discussion of data classification in ARC/INFO). Classification involves creating range categories into which individual instances fit, so that each instance takes on the value range of its category. This results in a loss of information and can reduce the empirical level of a measurement (interval or ratio data is reduced to ordinal, for example). This loss of precision is offset by the simpler presentation of the information. These classifications are presented by representing ranges of data as categories on the display medium, using symbolization that is appropriate to the level of measurement (see Figure 1.1).
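The reduction from ratio to ordinal data can be illustrated with a simple equal-interval classification. This Python sketch is generic and does not represent the ARC/INFO classification commands; the function equal_interval_class, the sample values, and the class labels are all assumptions made for illustration.

```python
def equal_interval_class(value, minimum, maximum, n_classes):
    """Return the 0-based class index of value among n_classes equal-width ranges."""
    if not minimum <= value <= maximum:
        raise ValueError("value lies outside the classified range")
    width = (maximum - minimum) / n_classes
    index = int((value - minimum) / width)
    # The maximum itself belongs to the top class rather than a class of its own.
    return min(index, n_classes - 1)

# Hypothetical ratio measurements on a 0 to 100 scale.
values = [3.0, 12.5, 47.0, 88.0, 100.0]
labels = ["low", "medium", "high"]

# After classification only the ordinal labels remain; 3.0 and 12.5
# become indistinguishable, which is the loss of information described above.
classes = [labels[equal_interval_class(v, 0.0, 100.0, 3)] for v in values]
```

Equal intervals are only one possible scheme; the point here is that any such scheme trades precision for a simpler display.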

Spatial data can be grouped into four categories: point, linear, area, and volume information. Each of these categories is scale specific: an area feature such as a stream may need to be mapped as a linear feature if, at the scale of the map, showing both its length and breadth is too tedious or difficult in the available media to justify the additional information retained. Volumetric information is also dependent on the means of display, since the appearance of three dimensions can only be approximated on a two-dimensional surface. Although the visual variables that best represent different levels of information do not change for each of the types of spatial data, the ARC/INFO methods for accomplishing those representations do change.

For environmental data, Mark Monmonier and Branden Johnson (1990, 5-7) have divided spatial data into three types: single location; single location and affected area; and multiple locations and the pattern of distribution. Single location answers not only 'what' but 'where'; this can be applied to all of the basic types of spatial data and allows the map user to relate environmental information to his or her own experience (this is what Edward Tufte (1990) calls micro/macro readings). Building on single location, single location and affected area adds information about how a 'what' influences its surroundings (because 'influence' may be more subject to interpretation than a measurement, presenting metadata alongside the data can become very important). Finally, multiple locations and the pattern of distribution integrates more than one instance of single location and affected area into one map.

Area phenomena can also be categorized on the basis of their spatial grouping and how they change over space (MacEachren and DiBiase 1991) (see Figure 1.1). Grouping ranges from continuous (no grouping) to discrete (complete grouping). This reflects the degree of spatial autocorrelation within areas. Changes over space range from smooth to abrupt. This reflects the degree of spatial autocorrelation between areas. These characteristics call for symbolization choices that accurately reflect the nature of the data. The possible choices include, but are not limited to: graduated symbols for abrupt, discrete data; dot density for smooth, discrete data; isopleth (or the '3-D' equivalent, fishnet) for smooth, continuous data; and choropleth for abrupt, continuous data. This contrasts with the all-too-common practice of making choropleth maps for all types of area data.
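The four pairings above can be written as a simple lookup table. This Python sketch is a simplification: the text describes grouping and change as ranges, whereas the hypothetical SYMBOLIZATION table and suggest_symbolization function below only handle the four extreme combinations named in the paragraph.

```python
# Symbolization suggested in the text, keyed by
# (change over space, spatial grouping).
SYMBOLIZATION = {
    ("abrupt", "discrete"): "graduated symbols",
    ("smooth", "discrete"): "dot density",
    ("smooth", "continuous"): "isopleth",
    ("abrupt", "continuous"): "choropleth",
}

def suggest_symbolization(change, grouping):
    """Return the suggested symbolization for one of the four extreme cases."""
    try:
        return SYMBOLIZATION[(change, grouping)]
    except KeyError:
        raise ValueError(
            "expected change in {'smooth', 'abrupt'} and "
            "grouping in {'continuous', 'discrete'}"
        ) from None

# A smoothly varying, continuous surface is not a choropleth case.
surface_choice = suggest_symbolization("smooth", "continuous")
```

Data that falls between the extremes still requires the cartographer's judgment; the table is a starting point rather than a rule.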