Classifying Data

Data to be mapped for presentation should generally be classified (into five or six groups) to make the map easier to interpret. When drawing features in ARCPLOT, the commands ARCLINES, LABELMARKERS, POINTMARKERS, and POLYGONSHADES, as well as AML-driven RESELECTs, allow the use of lookup tables to define a data-to-symbol relationship for feature display (the CLASS command also allows grouping of data). These tables allow user-defined ranges for cartographic products, rather than the default of directly relating the data item identifications to the symbol set identifications. The ranges must be determined by the cartographer; range types include quantiles, equal intervals, geometric progressions, mean and standard deviation intervals, and natural breaks (the Jenks' optimal classification) (see Figure 2.7). Certain data sets may need to be classified on the basis of predetermined breaks (such as the maximum allowable concentration of a pollutant). This can be accomplished by manually specifying all breakpoints, or by using another classification scheme with the externally defined break added in. For example, run the Jenks' optimal AML for five classes, note the breakpoints, then run the manual specification AML on a new lookup table and specify the Jenks' breakpoints plus the predefined point.
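As a rough illustration of how two of these range types produce breakpoints, and of adding an externally defined break to a computed set, here is a minimal Python sketch. It is not part of the AMLs described in this chapter. The function names, the sample values, and the limit of 50 are hypothetical, and quantile conventions differ between packages.

```python
def equal_interval_breaks(values, n_classes):
    """Return n_classes - 1 breakpoints that split the data range evenly."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_classes
    return [lo + width * i for i in range(1, n_classes)]


def quantile_breaks(values, n_classes):
    """Return n_classes - 1 breakpoints with roughly equal counts per class.

    Assumption: each break is the largest value placed in its class.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // n_classes - 1] for i in range(1, n_classes)]


def add_fixed_break(breaks, fixed):
    """Merge an externally defined break into an existing break list."""
    return sorted(set(breaks) | {fixed})


data = [2, 4, 7, 9, 12, 15, 21, 30, 44, 60, 75, 100]  # hypothetical values
ei = equal_interval_breaks(data, 4)
qu = quantile_breaks(data, 4)
with_limit = add_fixed_break(qu, 50)  # 50 stands in for a regulatory limit
```

Adding a fixed break increases the number of classes by one, so the symbol set must be extended to match.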

Figure 2.7 Breakpoints set by different classification schemes. Equal Interval is close to the statistically optimal (Jenks') for this example. Quantile separates relatively similar values at the extremes of the distribution. Mean and Standard Deviation fails because the distribution is bimodal, not normal; Exponential fails by grouping all of the items greater than 256 into one category.

Manual Classification

For both the CLASS command and lookup tables, break points can be chosen manually, by exporting the data to a statistical package, or with the assistance of the ARCPLOT command STATISTICS. The command syntax is:

STATISTICS <cover> <feature_class>

<cover>
contains the items to be classed
<feature_class>
includes POINTS, ARCS and POLYS

STATISTICS has several subcommands:

  • SUM <item>
  • MEAN <item>
  • MINIMUM <item>
  • MAXIMUM <item>
  • STANDARDDEVIATION <item>
    <item> is a field in the attribute table of <cover>.
  • END - indicates all subcommands have been entered.

These values can be used to calculate mean and standard deviation break points, as well as quantiles and geometric progressions.
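The calculation from summary values to breakpoints can be sketched as follows. This is a minimal Python illustration, assuming the mean, standard deviation, minimum and maximum have already been read from STATISTICS output. The function names and the sample numbers are hypothetical, and placing a break at the mean itself is only one of several conventions.

```python
def mean_sd_breaks(mean, sd, minimum, maximum, step=1.0):
    """Breaks at the mean and at multiples of step * sd around it.

    Only breaks that fall strictly inside the data range are kept.
    """
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    breaks = [mean]
    k = 1
    while mean - k * step * sd > minimum or mean + k * step * sd < maximum:
        for b in (mean - k * step * sd, mean + k * step * sd):
            if minimum < b < maximum:
                breaks.append(b)
        k += 1
    return sorted(breaks)


def geometric_breaks(minimum, maximum, n_classes):
    """Breaks whose class upper bounds grow by a constant ratio.

    Requires minimum > 0, since the ratio maximum / minimum must be defined.
    """
    if minimum <= 0:
        raise ValueError("minimum must be positive")
    ratio = (maximum / minimum) ** (1.0 / n_classes)
    return [minimum * ratio ** i for i in range(1, n_classes)]


# Hypothetical values as they might be reported by STATISTICS.
sd_breaks = mean_sd_breaks(mean=50.0, sd=20.0, minimum=5.0, maximum=110.0)
geo_breaks = geometric_breaks(minimum=1.0, maximum=256.0, n_classes=4)
```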

There are several methods of defining classes within ARC/INFO, as well as several methods of determining class breaks. As MacEachren (1992) notes, interval and quantile classification systems, which are automated parts of the CLASS command, generally are not the best methods for grouping data and therefore will not be discussed here. The syntax for manually specifying intervals with the ARCPLOT command CLASS is:

CLASS MANUAL <#classes> <break...break>

<#classes>
the number of classes that will be generated
<break...break>
the numeric class breaks. There must be <#classes> - 1 values.

CLASS NONE - turns off the classification scheme.

Until the CLASS NONE command is given, the classification remains in effect and will cause all subsequent uses of ARCLINES, LABELMARKERS, POINTMARKERS, and POLYGONSHADES to be classified.
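To make the relationship between <#classes> and the break values concrete, the following Python sketch assigns class numbers from a sorted list of breaks. It only illustrates the idea and is not ARCPLOT's implementation. The function name is hypothetical, and the rule that a value equal to a break falls in the lower class is an assumption.

```python
from bisect import bisect_left


def assign_class(value, breaks):
    """Return a 1-based class number for value, given sorted breaks.

    Assumption: a value equal to a break belongs to the lower class.
    """
    return bisect_left(breaks, value) + 1


breaks = [10, 20, 30, 40]  # four breaks define five classes
classes = [assign_class(v, breaks) for v in (3, 10, 11, 35, 99)]
```

Four breaks yield five classes, matching the rule that there must be <#classes> - 1 values.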

Another method of classifying data is to use lookup tables. These are INFO files that perform a function similar to that of the CLASS MANUAL command. A manual procedure from ARC for creating a lookup table is:

  1. Use PULLITEM to extract the attribute table item that holds the data values.
  2. Use ADDITEM to add a numeric field called SYMBOL.
  3. Enter INFO, SELECT the table and PURGE the old data.
  4. ADD records to the lookup table.
    1. Specify the breakpoint value for numeric data, or the alphanumeric for text data.
    2. Specify the symbol number.
  5. SORT on the data field (not the SYMBOL field).

It is possible to automate the creation and specification of lookup tables within ARCPLOT. SETMAN.aml allows the manual update of symbol values within a lookup table, as well as the creation of a new table. The AML handles both nominal and numeric data. The syntax is:

SETMAN <lookup_table>

<lookup_table>
will be created if it does not exist; existing tables will be selected for update.

Natural Breakpoints

Because the Jenks' optimal classification scheme is generally considered to provide the best classification of numeric data (MacEachren 1992), it should be included in ARC/INFO (and in any mapping program). Unfortunately it is not, but ARC/INFO does provide a macro language, commands to export data, and a method for calling system programs. These can be combined to automate the generation of breakpoints for the Jenks' optimal classification. The AML, JENKS.aml, exports the data to be classified, calls the C program jenks (which must be in the executable search path) to calculate the breakpoints, and constructs a lookup table. The cartographer must still specify symbol matches, though.

Please note that the routine generates a lookup table that contains more information than ARCPLOT requires (although ARCPLOT can still use it); this additional information is included for use by the AMLs presented in later chapters. The additional information is:

  • an initial line whose value is one smaller than the smallest data value (included only in non-nominal lookup tables, including the output from Jenks');
  • a column that contains the number of values in the associated coverage in the first record, and the number of values in each category in the following records;
  • a column that contains the coverage name in the first record; an 'n' (for nominal data), an 'o' (for ordinal data) or an 'r' (for interval/ratio data) in the second record; the coverage type (point, arc, etc.) in the third record; and, for route and section lookup tables, the route name in the fourth record.

This additional information allows these AMLs to select the coverage that the lookup table is based on and, for legends, to provide category totals and appropriate symbolization.

The C program jenks.c can be compiled on a workstation with an ANSI C compiler (including Data General's DG/UX 5.4) by using the command line:

        cc -ansi -o jenks jenks.c -lm

The command syntax for using JENKS.aml is:

JENKS <cover> <feature_class> <data_item> <out_lookup> <classes>

<cover>
the coverage that the lookup table will be created for
<feature_class>
the feature type (point, line, poly, etc) of <cover>
<data_item>
an interval or ratio level data field in the attribute table of <cover>
<out_lookup>
the name of the lookup table to be generated
<classes>
the number of data classes in the output lookup table.
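For readers who want to see the kind of calculation the external program performs, here is a minimal Python sketch of natural-breaks optimization by dynamic programming (often called the Fisher-Jenks approach). It is not the jenks.c program referred to above, whose internals are not shown here. The function name and sample data are hypothetical, and the sketch returns the largest value in each class except the last.

```python
def jenks_breaks(values, n_classes):
    """Return the upper bound of every class except the last.

    Minimizes the total within-class sum of squared deviations from the
    class means over all contiguous groupings of the sorted values.
    """
    data = sorted(values)
    n = len(data)
    if not 1 <= n_classes <= n:
        raise ValueError("n_classes must be between 1 and the number of values")

    # Prefix sums give the squared deviation of any slice in constant time.
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(data):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def ssd(i, j):
        """Sum of squared deviations of data[i:j] from its own mean."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[k][j]: best cost of splitting data[:j] into k classes.
    # start[k][j]: index where the last of those k classes begins.
    cost = [[inf] * (n + 1) for _ in range(n_classes + 1)]
    start = [[0] * (n + 1) for _ in range(n_classes + 1)]
    cost[0][0] = 0.0
    for k in range(1, n_classes + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = cost[k - 1][i] + ssd(i, j)
                if c < cost[k][j]:
                    cost[k][j] = c
                    start[k][j] = i

    # Walk back through the stored split points to recover the breaks.
    breaks = []
    j = n
    for k in range(n_classes, 1, -1):
        j = start[k][j]
        breaks.append(data[j - 1])
    return breaks[::-1]


sample = [1, 2, 3, 10, 11, 12, 50, 51, 52]  # hypothetical data
sample_breaks = jenks_breaks(sample, 3)
```

The triple loop makes this sketch quadratic in the number of values for each class, which is acceptable for small attribute tables but slow for very large ones.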

Eyton's Equiprobability Ellipse Bivariate Classification

The uniqueness of Eyton's ellipse as a bivariate mapping scheme requires that a column be added to the attribute table of the data to be mapped. As with the calculation of Jenks' optimal classifications, this is best done with a combination of a macro and a C program. EYTON.aml requires two ratio data items--the classification system is based on the parametric statistic, Pearson's r. It also requires that the system program, eyton, be in the executable search path. Like jenks.c, the eyton.c program can be compiled on a workstation with an ANSI C compiler by using the command line:

        cc -ansi -o eyton eyton.c -lm

The command line syntax for using EYTON.aml is:

EYTON <cover> <feature_class> <data_item_1> <data_item_2> <out_lookup> {classes} {chi_square_value}

<cover>
the coverage that the lookup table will be created for
<feature_class>
the feature type (point, line, poly, etc.) of <cover>
<data_item_1> <data_item_2>
interval or ratio data fields in the attribute table of <cover>
<out_lookup>
the output lookup table
{classes}
the number of classes on each axis; valid codes are 2 (default) or 3
{chi_square_value}
a value for selecting the number of points in the central ellipse;
defaults to 1.386, which encloses 50% of the observations.
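The default of 1.386 is consistent with a chi-square quantile on two degrees of freedom, for which the enclosed proportion p and the chi-square value c are related by c = -2 ln(1 - p). The Python sketch below assumes that interpretation and a bivariate normal distribution. The function names are hypothetical, and it is meant only as an aid for choosing other values of {chi_square_value}.

```python
import math


def chi_square_for_proportion(p):
    """Chi-square value (2 degrees of freedom) that encloses proportion p."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in the range [0, 1)")
    return -2.0 * math.log(1.0 - p)


def proportion_for_chi_square(c):
    """Proportion of a bivariate normal population inside the ellipse."""
    if c < 0.0:
        raise ValueError("c must not be negative")
    return 1.0 - math.exp(-c / 2.0)


default_value = chi_square_for_proportion(0.50)   # about 1.386
two_thirds = chi_square_for_proportion(2.0 / 3.0)  # a larger central ellipse
```

Real data are rarely exactly bivariate normal, so the observed share of points inside the ellipse will usually differ somewhat from the nominal proportion.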

Symbol Value Update

The lookup tables generated by these commands may require modification of the symbol numbers. There are several ways this can be accomplished: using '&SYS ARC INFO' to enter INFO from ARCPLOT; manually declaring a cursor and using it to update the lookup table; using the AML given above for manually changing values (SETMAN.aml); or using the AML SETAUTO.aml. This AML updates a lookup table by replacing the SYMBOL values with an ordered set of numbers. It requires a starting value and a step value, and can optionally be given two other values. Non-linear progressions can be specified by including a 'scale' value, and decreasing numbers can be obtained by specifying a 'subtraction_value'. Non-linear progressions should be used when a SYMBOL value specifies something other than a predefined symbol, such as gray value (lightness). In this case the value should range from black to very light gray, with a greater change among the black/dark-gray values (value differences are easier to discriminate for lighter values). For five classes, a start value of 0, a step value of 50, a scale of 0.8 and, if necessary, a subtraction value of 69 will set up an appropriate value progression.

SETAUTO <lookup_table> <start_value> <step_value> {scale} {subtraction_value}

<lookup_table>
the lookup table to be updated
<start_value>
the beginning value of the progression
<step_value>
the value used to change the beginning value
{scale}
an exponent that is applied to <step_value>--defaults to 1
{subtraction_value}
value that a symbol value will be subtracted from in order to generate decreasing progressions
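The exact formula used by SETAUTO.aml is not given here, so the following Python sketch is only one plausible reading of the parameters: the i-th symbol value is start + (i * step) raised to the scale exponent, optionally subtracted from the subtraction value. This reading was chosen because, for the five-class example of 0 50 0.8, it produces a largest value of about 69, which matches the suggested subtraction value. The function name and the rounding to whole numbers are assumptions.

```python
def symbol_progression(n, start, step, scale=1.0, subtraction_value=None):
    """Generate n symbol values under a hypothetical reading of SETAUTO.

    value_i = start + (i * step) ** scale, rounded to a whole number.
    If subtraction_value is given, each value is subtracted from it,
    which reverses the direction of the progression.
    """
    values = [round(start + (i * step) ** scale) for i in range(n)]
    if subtraction_value is not None:
        values = [subtraction_value - v for v in values]
    return values


increasing = symbol_progression(5, 0, 50, 0.8)
decreasing = symbol_progression(5, 0, 50, 0.8, subtraction_value=69)
linear = symbol_progression(5, 1, 2)
```

Under this reading the steps between successive values shrink as the values grow (23, 17, 15, 14), so the largest changes occur at the low end of the progression.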

Unclassed Maps

Unclassified maps (those that are continuously symbolized) can be created in ARC/INFO, but the available symbolization schemes can be limiting, and the potential improvement in the ability to depict data (as suggested by Monmonier, 1976) may not justify the effort for presentation of data. For exploration and analysis, however, unclassified maps may be worthwhile, although software and hardware limitations can impinge on truly unclassified displays. Eight-bit-plane graphic displays can show only 256 colors at once, which prevents unclassed display of data sets with more than 256 values. This can be partially circumvented by using the Jenks' optimal classification system to create a lookup table with 256 classes. For data sets with fewer than 256 data values, or for systems with 'true color' (24-bit-plane) displays, use the manual lookup table definition AML (SETMAN.aml) to define a nominal lookup table. A limitation of 'true color' displays is that 24-bit graphics systems can generate only 256 shades of gray. So, in general, for 'unclassed' maps, create a lookup table with 256 categories (or fewer if there are not 256 data values). Once a lookup table is created, symbol values must be assigned on the basis of the lookup table's item value. SETUNCL.aml accomplishes this. The command syntax is:

SETUNCL <lookup_table> {scale_factor} {subtraction_value}

<lookup_table>
the table with symbol values to be updated
{scale_factor}
an exponent for generating non-linear scaling, defaults to 1
{subtraction_value}
allows creation of descending scales.