Thematic Map Accuracy Assessment
The thematic map accuracy assessment procedure was conducted in three separate phases. First, a stratified random sampling design was used to generate positional coordinates for the field locations of the accuracy assessment sampling stations. Next, these sampling stations were occupied in the field and habitat information was collected at each station. The final phase included the actual accuracy assessment analyses, comparing the field data to the corresponding photointerpreted map data at the stratified random locations. These comparisons included summary descriptions and quantitative analyses that could be subject to scientifically sound statistical testing.
Stratified Random Sampling Design
The ESRI polygon shapefiles of the second-draft thematic habitat maps and georeferenced imagery were imported into an ESRI ArcView project. A total of twelve separate maps were imported, one map from each of the three different types of remotely sensed imagery used at each of the four test areas. Four separate views were created, one for each of the island test areas. This ArcView project was used to generate the locations of the stratified random sampling stations for accuracy assessment fieldwork.
For each test area, the maps created from color aerial photography were stratified by sets of polygons representing each second-level thematic class in the habitat classification scheme. Using an ArcView random point generator extension, 100 points were generated within each map stratum (habitat class). A seven-meter radius buffer was created around each sampling point, representing the areal extent of the sampling station. Sampling was without replacement, so if two or more points had overlapping buffers, all but one of the overlapping stations were omitted from sampling. The positional coordinates for the center point of each sampling station were converted into GPS waypoints for use in the field.
These sampling methods have been developed and validated by other researchers conducting similar accuracy assessments of comparable thematic maps (Hudson and Ramm, 1987; Congalton, 1991). Rosenfield et al. (1982) calculated the number of field stations (sample size) required for accuracy assessment using the same stratified random sampling design, and determined that a statistically valid data set (at the 90% to 95% confidence level) is obtained when at least 50 field observations per habitat type (stratum) are occupied. More than 50 random points were created within each stratum because it was anticipated that some points could not be safely occupied in the field due to unforeseen circumstances (shallow areas with dangerous waves, stations occupied by other boaters, stations with snorkelers or other users who may have presented a danger to either users or surveyors, etc.).
Paper maps of the georeferenced imagery overlaid with random sampling station GPS waypoints were printed and laminated for use in the field. The photointerpreted habitat type was not noted on these field maps to avoid biasing the field observer. Although sampling points were generated only within the strata of the maps made from color aerial photography, these stations were superimposed over the maps made from the other two types of imagery after preliminary field sampling was complete. If any stratum within the other maps had an insufficient number of sample stations, more random points were generated and occupied in the field until a sufficient sample size was obtained for maps made from all three types of imagery.
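The point generation and buffer check described above can be sketched in Python. This is a minimal illustration, not the ArcView extension used in the study: the polygon outline, the random seed, and all function names are hypothetical, and coordinates are assumed to be in projected meters.

```python
import math
import random


def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices in meters."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count the edges crossed by a horizontal ray starting at (x, y).
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def random_points_in_polygon(polygon, n_points, rng):
    """Draw n_points uniformly inside the polygon by rejection sampling."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    points = []
    while len(points) < n_points:
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_polygon(x, y, polygon):
            points.append((x, y))
    return points


def drop_overlapping(points, radius=7.0):
    """Keep a point only if its buffer does not overlap an already kept one."""
    kept = []
    for x, y in points:
        # Two buffers of equal radius overlap when centers are < 2r apart.
        if all(math.hypot(x - kx, y - ky) >= 2 * radius for kx, ky in kept):
            kept.append((x, y))
    return kept


# Hypothetical stratum outline (one habitat class) in projected meters.
stratum = [(0, 0), (1000, 0), (1000, 500), (500, 800), (0, 500)]
rng = random.Random(42)  # arbitrary seed, for repeatability only
candidates = random_points_in_polygon(stratum, 100, rng)
stations = drop_overlapping(candidates)
```

Generating more candidates than the 50 stations required per stratum leaves a margin for points dropped here or abandoned in the field.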
Accuracy Assessment Data Collection
The four test areas were surveyed consecutively beginning with Kāne‘ohe Bay (O‘ahu), and continuing with Northern Kona (Hawai‘i), the south shore of Moloka‘i, and Southwest Maui. A minimum of ten full field days was allotted at each of the four locations. The time between sampling at different test areas was kept to a minimum to avoid changes in field conditions occurring between acquisition of remotely sensed data and subsequent fieldwork.
A unique data dictionary containing the hierarchical Hawai‘i benthic habitat classification scheme and other field attributes was used for all fieldwork. The data dictionary was created using Trimble Pathfinder Office GPS software. It contained all levels of the classification scheme. It also contained several other attributes to be collected at each station in the field, including time and date, water depth, relative topographic relief, and automatically generated GPS statistics associated with the positional data.
The stratified random accuracy assessment GPS waypoints and the unique data dictionary were uploaded into resource grade Trimble GeoExplorer 3 hand-held GPS data loggers. These waypoints were occupied in the field, using a variety of watercraft ranging from two-person kayaks for shallower areas to 36 foot commercial charter boats for deeper and more distant locations. Some locations near shore were accessed by foot where possible.
Field personnel navigated to GPS waypoints and made benthic habitat characterizations for each station that could be safely occupied. When close to the random station, a weighted buoy was deployed to mark the center of the sampling station. The areal extent of the sample station was a circle with an estimated seven-meter radius around the buoy. The boat was positioned over the buoy, and positional GPS data were collected for the precise location. If the station was in water less than one foot deep or on emergent features, a small temporary survey marker was placed on the substrate to mark the center of the station. Approximately one hundred code phase GPS positions were collected at one-second intervals and averaged at each station.
The majority of the habitat data collected was entered directly into the GPS data logger. Depth at the buoy was obtained using a hand-held depth sounder. Three benthic habitat characterizations were recorded. A central GPS point assessment was conducted by recording the habitat class in a 1 square meter area directly under the buoy. Two area assessments were conducted within the seven-meter radius around the buoy and recorded in the data logger. The first assessment identified the most common habitat type within the area and the second identified the second most common habitat type within the area. The second habitat type had to occupy at least an estimated 10% of the overall sample area in order to be included. If the entire area was one homogeneous habitat type, that class was entered for all three assessments (the point, and the first and second habitats in the area). These three identifications had to conform to the classes of the Hawai‘i benthic habitat classification scheme for subsequent mapping and accuracy assessment to be valid. The same observer made all the characterization calls by either breath-hold diving, viewing the substrate through a glass-bottom look box, or observing from the surface in shallow water conditions.
Handwritten notes were entered in a waterproof field surveying book, including detailed information and descriptions about each sample station that could not be entered directly into the data logger. Any remarkable observances were also noted. High resolution digital underwater photographs were taken at representative sample stations. Information about the photos was also written in the survey books.
At some sampling stations, a special minimum mapping unit circumstance occurred where the sampling station was one particular habitat type, but the greater area around it was another habitat type. In these instances, the habitat in the sample station was large enough to occupy the entire seven-meter radius sample, but was not large enough to constitute an acre of habitat, and therefore not large enough to be mapped by the image interpreter. When this occurred, the surrounding habitat types were also noted in the field book. For instance, if the sample station fell within a patch of sand that was big enough to fill an entire seven-meter radius sample area but the sand patch was much smaller than the minimum mapping unit of one acre, the greater habitat type surrounding the sand patch was also noted.
At the end of the field day the GPS data logger was downloaded using Trimble Pathfinder Office software, and the positional data were differentially corrected from the nearest available continuously operating reference station (CORS). After differential correction, the data were exported as a flat file. A comments field was added and all the longhand notes were manually entered and the data were checked twice for errors or blunders. This modified file was archived and converted to an ESRI point shapefile for subsequent analysis in the GIS project. The underwater digital photographs were also downloaded and named according to the unique sampling station ID number.
The spatial accuracy of the digital imagery was checked against the GPS data periodically throughout the course of the accuracy assessment fieldwork. For at least five days at each test area, GPS data were collected on above-water features that were visible in the imagery. This was often done on piers, breakwaters, or right-angle corners of cement parking lots that could be identified to one pixel in the imagery. These GPS data were collected in code phase for approximately 100 continuous seconds at one-second intervals, exactly as GPS data were collected at sampling stations during the course of the day. These positions were differentially corrected and overlaid on the imagery at the end of the field day to see how closely the GPS points aligned with the features in the imagery. This gave an indication of how well the imagery registered with the corresponding field GPS data.
The spatial accuracy of the positional GPS data was checked against known National Geodetic Survey (NGS) benchmarks which were occupied in or around the study areas. At each test area, at least one benchmark was occupied and GPS data were collected in code phase for approximately 100 continuous seconds at one-second intervals, exactly as GPS data were collected at sampling stations during the course of the day. These GPS data were differentially post-processed and their positions were compared to the known benchmark positions.
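As a rough illustration of the benchmark comparison, the sketch below averages a set of one-second fixes and reports the horizontal offset from a known position. It assumes coordinates already differentially corrected and expressed in projected meters; the benchmark coordinates, the simulated fixes, and the function names are hypothetical.

```python
import math
from statistics import mean


def averaged_position(fixes):
    """Average a list of (easting, northing) fixes, in meters."""
    eastings = [e for e, _ in fixes]
    northings = [n for _, n in fixes]
    return (mean(eastings), mean(northings))


def horizontal_offset(position, benchmark):
    """Planar distance in meters between a position and a benchmark."""
    return math.hypot(position[0] - benchmark[0], position[1] - benchmark[1])


# Hypothetical benchmark and about 100 simulated fixes scattered around it.
benchmark = (620000.0, 2370000.0)
fixes = [
    (benchmark[0] + 1.0 + 0.5 * math.sin(i), benchmark[1] - 2.0 + 0.5 * math.cos(i))
    for i in range(100)
]
offset_m = horizontal_offset(averaged_position(fixes), benchmark)
```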
Accuracy Assessment Analyses
Error or confusion matrices were created for maps made from each of the three types of imagery at all four of the test areas. A total of twelve error matrices were created at the most detailed level of the classification scheme. An additional twelve matrices were created for the major hierarchical habitat classes by combining the detailed level classes together into their parent class. All statistical accuracy assessment analyses were based on these major habitat level error matrices. The matrices were generated using ArcView software. The point shapefiles created from the accuracy assessment sampling fieldwork were overlaid on the second draft thematic map polygon shapefiles and the detailed habitat attributes of the point file were compared to those of the map polygons in which the points were located. This was done using several standard selection tools available in ArcView. All map polygons of one detailed habitat type (stratum) were selected, and then a “select by theme” query was performed on the field data to select all sampling points occurring within those polygons. Summary tables were displayed for each stratum query, and the fields of the tabular error matrix were populated with the results of these summary tables. By convention, the rows of the matrix represent the polygons created by the image interpreter, and the columns represent the reference data from field sampling stations. This process was repeated for each of the maps made from the different types of imagery at all four test areas.
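The tabulation itself is simple once each field station carries both its map class and its field class. The sketch below is an illustration only, not the ArcView workflow; the class labels and the station pairs are hypothetical.

```python
def build_error_matrix(pairs, classes):
    """Tabulate (map_class, field_class) pairs into an error matrix.

    Rows are the photointerpreted map classes; columns are the field
    (reference) classes, following the convention stated in the text.
    """
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for map_class, field_class in pairs:
        matrix[index[map_class]][index[field_class]] += 1
    return matrix


# Hypothetical class labels and station results.
classes = ["sand", "SAV", "coral"]
pairs = [
    ("sand", "sand"), ("sand", "SAV"), ("SAV", "SAV"),
    ("coral", "coral"), ("coral", "sand"), ("SAV", "SAV"),
]
matrix = build_error_matrix(pairs, classes)
```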
The diagonal of the error matrix indicates class coincidence or agreement between the thematic map attribute and the field attribute. The off-diagonal fields in the matrix indicate where errors of omission and errors of commission have occurred. Errors of omission exist where the photointerpreter failed to include the sample station habitat type in the polygons of that habitat type (off-diagonal columns). Errors of commission exist where the photointerpreter included incorrect sample station habitats in the delineated map polygons (off-diagonal rows). All errors were individually scrutinized in the GIS and any patterns of error were noted.
Although all subsequent accuracy assessment analyses were derived from these detailed error matrices, the only measure of agreement reported at the detailed habitat classification level was the overall map accuracy. For all other reported statistics (user’s and producer’s accuracy, kappa, tau, and tests of significance) the detailed error matrices were aggregated into new matrices representing the four major classes of the habitat classification scheme (e.g., Table 6, bottom). This aggregation minimized the influence of small sample size, which occurred within several classes when all 28 detailed levels of the classification scheme had their own category in the matrices. For example, the major class of submerged aquatic vegetation (SAV) contained two second-level classes (seagrass and macroalgae). Within each of these second-level classes were another three third-level classes (10%-<50% cover, 50%-<90% cover, and 90%-100% cover). This totaled six detailed third-level classes within the major SAV class. There were very few map polygons for the second-level seagrass class because very little of this habitat occurs in the main Hawaiian Islands. There were no occurrences of continuous (90%-100% cover) seagrass in the state. Sample size would not be sufficient for analysis at this detailed level, so the seagrass polygons were aggregated with the macroalgae polygons to the first major level of the classification scheme hierarchy. This is one of the advantages of using a hierarchical classification scheme.
Overall accuracy was determined for the detailed-level matrices (up to 28 classes), and for matrices aggregated into the four major classes of the hierarchical classification scheme. The overall accuracy was determined by dividing the total number of correctly classified field stations by the total number of field stations sampled. This was done by summing the diagonal of the matrix and dividing this sum by the total number of sampling stations in the entire matrix. This proportion was multiplied by 100 and reported as an overall percentage of correct classifications.
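The aggregation step amounts to summing the cells of the detailed matrix into the cells of their parent classes. A minimal sketch follows, assuming a lookup from each detailed class to its major class; the labels, the lookup, and the counts are hypothetical and are not data from this study.

```python
def aggregate_matrix(matrix, detailed_classes, parent_of, major_classes):
    """Collapse a detailed error matrix into its parent (major) classes."""
    index = {name: i for i, name in enumerate(major_classes)}
    out = [[0] * len(major_classes) for _ in major_classes]
    for r, row_name in enumerate(detailed_classes):
        for c, col_name in enumerate(detailed_classes):
            # Each detailed cell is added to the cell of its parent classes.
            out[index[parent_of[row_name]]][index[parent_of[col_name]]] += matrix[r][c]
    return out


# Hypothetical detailed classes, parents, and counts.
detailed = ["seagrass", "macroalgae", "sand"]
parent_of = {"seagrass": "SAV", "macroalgae": "SAV", "sand": "sediment"}
major = ["SAV", "sediment"]
detailed_matrix = [
    [3, 2, 1],
    [1, 20, 4],
    [0, 5, 30],
]
major_matrix = aggregate_matrix(detailed_matrix, detailed, parent_of, major)
```

Confusion between seagrass and macroalgae moves onto the diagonal after aggregation, which is why accuracy at the major level is at least as high as at the detailed level.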
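The overall accuracy calculation can be written in a few lines. The matrix below is a made-up example used only to show the arithmetic (rows are map classes, columns are field classes).

```python
def overall_accuracy(matrix):
    """Percentage of stations on the diagonal of a square error matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


# Hypothetical 3-class error matrix with 150 stations.
example = [
    [45, 4, 1],
    [6, 40, 4],
    [2, 5, 43],
]
accuracy_pct = overall_accuracy(example)  # 128 correct of 150 stations
```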
Confidence intervals were calculated around the overall accuracy for matrices aggregated into the four major classes of the hierarchical classification scheme. These intervals were determined at the 95% confidence level, and give upper and lower limits around the overall accuracy percentage within which the true map accuracy is expected to fall 19 times out of 20. Conversely, the true accuracy could fall outside these intervals one time in twenty by chance alone (Sokal & Rohlf, 1981). These intervals are based on the binomial distribution of the diagonal (correct or incorrect) at a large sample size (N>100), as per Hord and Brooner (1976).
Producer’s accuracies were determined for each class in the aggregated matrices representing the four major classes of the habitat classification scheme. They were calculated from the matrix columns. The number of correctly classified samples of a particular class (the diagonal cell value) was divided by the total number of field samples in that class (the column total). This proportion was multiplied by 100 and reported as a percentage. This provides the percentage of each class that has been correctly identified by the image interpreter (Story & Congalton, 1986). It is a measurement of how well the image interpreter can classify a given habitat type.
User’s accuracies were determined for each class in the aggregated matrices representing the four major classes of the habitat classification scheme. They were calculated from the matrix rows. The number of correctly classified samples of a class (the diagonal cell value) was divided by the total number of samples included by the image interpreter in that class (the row total). This proportion was multiplied by 100 and reported as a percentage. This indicates for each class what percentage of the area on the map will actually be that class when visited in the field (Story & Congalton, 1986). It is a measurement of how often a map polygon is classified correctly.
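One common way to obtain such an interval is the large-sample normal approximation to the binomial, sketched below. This is an assumption made for illustration: the exact formula used in the cited source may differ, and the counts are hypothetical.

```python
import math


def accuracy_confidence_interval(correct, total, z=1.96):
    """Approximate 95% interval for overall accuracy, as percentages.

    Uses the normal approximation to the binomial (an assumption here),
    which is reasonable only for large samples.
    """
    p = correct / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    lower = max(0.0, p - half_width)
    upper = min(1.0, p + half_width)
    return 100.0 * lower, 100.0 * upper


# Hypothetical result: 128 of 150 stations classified correctly.
lower_pct, upper_pct = accuracy_confidence_interval(128, 150)
```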
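Producer's and user's accuracy differ only in whether the diagonal cell is divided by its column total or its row total. The sketch below shows both on a made-up matrix (rows are map classes, columns are field classes); the function names are illustrative.

```python
def producers_accuracy(matrix):
    """Diagonal cell divided by its column (field) total, as percentages."""
    size = len(matrix)
    column_totals = [sum(matrix[r][c] for r in range(size)) for c in range(size)]
    return [100.0 * matrix[i][i] / column_totals[i] for i in range(size)]


def users_accuracy(matrix):
    """Diagonal cell divided by its row (map) total, as percentages."""
    return [100.0 * matrix[i][i] / sum(matrix[i]) for i in range(len(matrix))]


# Hypothetical 3-class error matrix.
example = [
    [45, 4, 1],
    [6, 40, 4],
    [2, 5, 43],
]
producers = producers_accuracy(example)  # column totals are 53, 49, 48
users = users_accuracy(example)          # row totals are 50, 50, 50
```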
The Kappa Analysis
A kappa statistic was generated for each aggregated major level error matrix.
While the true kappa value cannot be determined, a Khat statistic ($\hat{K}$), which is the best estimate of the kappa, can be calculated. The Khat statistic is derived from the entire error matrix. The kappa is a discrete multivariate technique which takes into consideration the off-diagonal elements of the matrix (the errors of omission and commission) as a product of the marginal totals (Cohen, 1960). Since these values are discrete and multinomially distributed, (parametric) normal theory analysis techniques are not appropriate (Sokal & Rohlf, 1981).
Although normal theory may not apply directly to the kappa, its approximate large sample variance can be determined. The distribution of this variance approximates normality, and as such can be used to perform tests of significant differences between different Khat statistics. However, this only allows for significance tests between two different photointerpreters or classifications using the same reference (field) data, not between two different maps using different subsets of reference data. Since overall research objectives were to quantify and compare different imagery and map accuracy based on field data subsets, the kappa was not chosen for tests of significant differences between various maps. The Khat is reported for each matrix because it is entrenched in the accuracy assessment literature as an indication of how well the classification agrees with the reference (field) data. Green et al. (2000) state the Khat statistic “…expresses the proportionate reduction in error generated by a classification process compared with the error of a completely random classification” (p. 64).
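The kappa analysis generates a Khat ($\hat{K}$) statistic that typically falls between zero and one (negative values are possible when agreement is worse than expected by chance). It is generally described by Verbyla (1995) as:
$$\hat{K} = \frac{\text{overall classification accuracy} - \text{expected classification accuracy}}{1 - \text{expected classification accuracy}}$$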
More specifically, it is computed in terms of the rows, columns and marginals of the error matrix as follows:
$$\hat{K} = \frac{N \sum_{i=1}^{r} X_{ii} - \sum_{i=1}^{r} X_{i+} X_{+i}}{N^{2} - \sum_{i=1}^{r} X_{i+} X_{+i}} \qquad (1)$$
where r is the number of rows in the matrix, $X_{ii}$ is the number of observations in row i and column i, $X_{i+}$ is the marginal total of row i, $X_{+i}$ is the marginal total of column i, and N is the total number of observations (sample stations). See Congalton and Green (1999) for complete derivation of the kappa.
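Using the quantities just defined (the diagonal counts, the row and column marginal totals, and N), the Khat can be computed directly from an error matrix. The sketch below uses the standard form of the estimator and a made-up matrix; the function name is illustrative.

```python
def khat(matrix):
    """Kappa estimate from a square error matrix of station counts."""
    size = len(matrix)
    n = sum(sum(row) for row in matrix)
    diagonal = sum(matrix[i][i] for i in range(size))
    row_totals = [sum(matrix[i]) for i in range(size)]
    column_totals = [sum(matrix[r][c] for r in range(size)) for c in range(size)]
    # Sum over classes of (row marginal) x (column marginal).
    chance = sum(row_totals[i] * column_totals[i] for i in range(size))
    return (n * diagonal - chance) / (n * n - chance)


# Hypothetical 3-class error matrix with 150 stations.
example = [
    [45, 4, 1],
    [6, 40, 4],
    [2, 5, 43],
]
k = khat(example)  # (150*128 - 7500) / (150**2 - 7500)
```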
The Tau Coefficient
A tau coefficient (Te) was generated for each aggregated major level error matrix. The tau coefficient is similar to the kappa statistic; however, it is easier to calculate and interpret (Ma & Redmond, 1995). Like the kappa, the tau is a discrete multivariate technique taking the matrix off-diagonal into account as a product of the marginals. One difference, however, is that the tau may or may not use probabilities of class membership that are determined from the matrix.
There are two types of tau coefficients commonly used for accuracy assessment of thematic maps made from remotely sensed data. In one type, the probability of class membership is known a priori, and in the other type it is not. In the second case, probability may be determined from the matrix marginals or the number of classes present. Ma and Redmond (1995) describe tau (Te) where the classes are assigned equal probabilities based on the number of classes present in the classification scheme (our thematic strata). This probability is determined by the reciprocal of the number of classes. For instance, 4 classes would have an equal probability of 0.25 (or ¼) for each class.
For this mapping project, it was not known a priori if there was a greater likelihood of one map class being confused with another map class. For example, it was not known ahead of time how often the photointerpreter would assign seagrass habitat to the sand map class. Consequently, a tau based on equal probability of group membership (Te) was preferred for this research. Simply stated, the tau coefficient measures how many more sample stations are classified correctly than would be expected by chance alone (Green et al., 2000).
The tau analysis generates a statistic with a value ranging from +1 to -1. A value of +1 indicates complete agreement between map classes and field sample stations, values near zero indicate agreement no better than expected by chance, and negative values indicate agreement worse than chance. The general form of the tau (Te) coefficient for equal class probability is:
$$T_e = \frac{\text{overall classification accuracy} - \text{equal probability of class}}{1 - \text{equal probability of class}}$$
More specifically, it is computed in terms of the error matrix using the same notation as equation 1:
$$T_e = \frac{\frac{1}{N} \sum_{i=1}^{r} X_{ii} - \frac{1}{M}}{1 - \frac{1}{M}} \qquad (2)$$
where r is the number of rows in the matrix, $X_{ii}$ is the number of observations in row i and column i, M is the number of classes present, and N is the total number of observations (sample stations). The 1/M term gives the equal probability of class assignment based on number of classes. See Ma and Redmond (1995) for complete derivation of the tau.
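With equal class probabilities, the tau needs only the diagonal sum, N, and the number of classes M. The sketch below assumes M equals the number of rows in the matrix and uses a made-up matrix; the function name is illustrative.

```python
def tau_equal_probability(matrix):
    """Tau (Te) with equal class probabilities of 1/M for M classes."""
    m = len(matrix)  # assumes one row per class present
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(m)) / n  # overall accuracy
    chance = 1.0 / m
    return (observed - chance) / (1.0 - chance)


# Hypothetical 3-class error matrix with 150 stations.
example = [
    [45, 4, 1],
    [6, 40, 4],
    [2, 5, 43],
]
te = tau_equal_probability(example)
```

For this example the tau equals the Khat computed from the same matrix (both 0.78), because every row total is the same; with unequal row totals the two statistics generally differ.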
Tests for Significant Differences Between Two Tau Coefficients
A sample variance for each tau (Te) coefficient was determined. As in the case of the sample variance determined for the kappa statistic, the actual data in the error matrix are discrete and multinomially distributed; however, the sample variances approximate normality. Because of this, these sample variances allow tests for significant differences between various pairs of tau (Te) coefficients to be performed. These significance tests were performed between all pairwise combinations of the three tau (Te) coefficients generated at each test area. For example, the (Te) for the map generated from color aerial photography at the Kona test site was tested against the (Te) for the map generated from the hyperspectral imagery at the Kona test site. Then the (Te) from the map interpreted from IKONOS imagery at the Kona test site was tested against both the color photography (Te) and the hyperspectral (Te). This process was repeated for each test area, for a total of 12 contrasts.
The following assumptions regarding sample variance allow for tests of significant differences between two tau coefficients. The percentage agreement (overall accuracy) used in the tau follows a binomial distribution, and a large sample variance distribution approximates normality (N>100). Unlike the kappa, significance tests performed on the tau are valid with different sample sizes per class, and different numbers of classes per matrix (Ma & Redmond, 1995; Naesset, 1996). The sample variance ($\sigma^2$) for the tau (Te) with equal probabilities can be calculated as follows:
$$\sigma^{2} = \frac{P_o (1 - P_o)}{N (1 - P_r)^{2}} \qquad (3)$$
where $P_o$ is the overall classification accuracy as in equation 1 or 2, N is the total number of observations (sample stations), and $P_r$ is the equal probability (1/M in equation 2).
With this sample variance, Z-tests can be performed for significant differences. The general form of the test statistic, as per Green et al. (2000), is:
$$Z = \frac{T_1 - T_2}{\sqrt{\sigma_1^{2} + \sigma_2^{2}}} \qquad (4)$$
where $T_1$ and $T_2$ are the two different Te being compared and the denominator is the square root of the sum of the sample variances from $T_1$ and $T_2$, respectively. These calculated z statistics were compared to a critical table where Z is standardized and normally distributed (the standard normal deviate). The two Te were significantly different at the 95% confidence level if the absolute value of the z statistic was greater than 1.96 ($Z_{\alpha = 0.05} = 1.96$).
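The variance and Z-test can be sketched together. The code assumes the equal-probability tau, the large-sample variance Po(1 - Po) / (N(1 - Pr)^2) with Pr = 1/M, and the 1.96 critical value described in the text; both matrices are made up and the function names are illustrative.

```python
import math


def tau_and_variance(matrix):
    """Return (Te, sample variance) for a square error matrix."""
    m = len(matrix)  # number of classes, so Pr = 1/M
    n = sum(sum(row) for row in matrix)
    p_o = sum(matrix[i][i] for i in range(m)) / n  # overall accuracy
    p_r = 1.0 / m
    te = (p_o - p_r) / (1.0 - p_r)
    variance = p_o * (1.0 - p_o) / (n * (1.0 - p_r) ** 2)
    return te, variance


def z_statistic(matrix_1, matrix_2):
    """Z for the difference between the tau coefficients of two maps."""
    t1, v1 = tau_and_variance(matrix_1)
    t2, v2 = tau_and_variance(matrix_2)
    return (t1 - t2) / math.sqrt(v1 + v2)


# Hypothetical matrices for two maps of the same test area.
map_a = [[45, 4, 1], [6, 40, 4], [2, 5, 43]]
map_b = [[35, 10, 5], [9, 33, 8], [6, 7, 37]]
z = z_statistic(map_a, map_b)
significant = abs(z) > 1.96  # two-tailed test at alpha = 0.05
```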
Last Update: 04/21/2008
By: Lea Hollingsworth
Hawai‘i Coral Reef Assessment & Monitoring Program
Hawai‘i Institute of Marine Biology
P.O. Box 1346
Kāne‘ohe, HI 96744