GEO 465/565
Data Quality


Data Quality & Accuracy

Introduction

* How well do spatial objects represent the real world?
* How well do GIS algorithms compute the true values of derived products?
* Accuracy of data is among the most important technical issues in GIS
* Data quality, error, uncertainty, scale, resolution, and precision all affect ways in which data can be used & interpreted

Introduction (cont.)

All spatial data are inaccurate to some degree, BUT data are generally represented in the computer to high precision

Accuracy

the closeness of results, computations, or estimates to TRUE values (or values accepted to be true)
since spatial data are usually a generalization of the real world, it is often difficult to identify a TRUE value

e.g., in measuring the accuracy of a digitized contour, we compare it to the contour as drawn on the source map, since the contour does not exist as a real line on the surface of the earth

Accuracy (cont.)

accuracy of the database may have little relationship to the accuracy of products computed from it; e.g., the accuracy of a slope, aspect, or watershed computed from a DEM is not easily related to the accuracy of the elevations in the DEM itself

Precision

* not necessarily the closeness of results, but the number of decimal places or significant digits in a measurement
* Precision is not the same as accuracy!
* Repeatability vs. "truth"
* A large number of significant digits doesn't necessarily indicate that the measurement is accurate
* A GIS works at high precision, usually much higher than the accuracy of the data themselves

Accuracy vs. Precision Graphic

Important Questions

since all spatial data are of limited accuracy, the important questions are:
how to measure accuracy
how to track the way errors are propagated through GIS operations
how to ensure that users don't ascribe greater accuracy to data than they deserve

Components of Data Quality

Alternative definitions . . .
conformance to expectations: fulfilling arbitrary thresholds, or following established procedures, as with geodetic standards
fitness for use: truth in labelling (distinct roles of producer and consumer)

Components (cont.)

A standard for digital cartographic data quality was recently developed through a coordinated national effort in the U.S. by the National Committee for Digital Cartographic Data Standards (NCDCDS)

1982-1988: Members of the American Congress on Surveying and Mapping met to produce a draft standard
Later became the Spatial Data Transfer Standard

Components (cont.)

standard model to be used for describing digital data accuracy
similar standards being adopted in other countries

Components (cont.)

National Committee for Digital Cartographic Data Standards (NCDCDS) identifies several components of data quality:
positional accuracy
attribute accuracy
logical consistency
completeness
lineage

Positional Accuracy

defined as the closeness of locational information (usually coordinates) to the true position
maps are accurate to roughly one line width or 0.5 mm
equivalent to 12 m on 1:24,000 or 125 m on 1:250,000 maps
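
To make the arithmetic concrete, here is a minimal Python sketch of the line-width rule above; the 0.5 mm figure and the scale denominators come from the slide, everything else is illustrative:

```python
LINE_WIDTH_MM = 0.5  # typical plotted line width

def ground_accuracy_m(scale_denominator, line_width_mm=LINE_WIDTH_MM):
    """Ground-distance equivalent (m) of a map-distance error (mm)."""
    return line_width_mm / 1000.0 * scale_denominator

for scale in (24000, 250000):
    print(f"1:{scale}: +/- {ground_accuracy_m(scale):.0f} m")
# 1:24000:  +/- 12 m
# 1:250000: +/- 125 m
```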

Positional Accuracy (cont.)

within a database a typical UTM coordinate pair might be:
Easting 579124.349 m
Northing 5194732.247 m

If the database was digitized from a 1:24,000 map sheet, the last four digits in each coordinate (units, tenths, hundredths, thousandths) would be questionable

Testing Positional Accuracy

* Use an independent source of higher accuracy:
* find a larger scale map
* use GPS
* use raw survey data
* Use internal evidence:
* digitized polygons that are unclosed, lines that overshoot or undershoot nodes, etc. are indications of inaccuracy
* sizes of gaps, overshoots, etc. may be a measure of positional accuracy

Testing Accuracy (cont.)

Compute accuracy from knowledge of the errors introduced by different sources
e.g., 1 mm in source document
0.5 mm in map registration for digitizing
0.2 mm in digitizing

if sources combine independently, we can get an estimate of overall accuracy by summing the squares of each component and taking the square root of the sum
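
A short Python sketch of this root-sum-of-squares combination, using the example figures above:

```python
import math

# Independent error sources in map-distance units (mm), from the example above.
sources = {"source document": 1.0, "map registration": 0.5, "digitizing": 0.2}

overall = math.sqrt(sum(v ** 2 for v in sources.values()))
print(f"overall error ~ {overall:.2f} mm")  # ~1.14 mm
```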

Attribute Accuracy

* defined as the closeness of attribute values to their true value
* while location may not change with time, attributes often do
* attribute accuracy must be analyzed in different ways depending on the nature of the data
* for continuous attributes (surfaces) such as DEMs or TINs:
* accuracy is expressed as a measurement of error: e.g., elevation to +/- 1 m
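
For a continuous surface, such an error figure is typically derived by comparing the surface against check points of higher accuracy. A minimal Python sketch, with hypothetical paired elevation samples:

```python
import math

# Hypothetical paired samples (m): DEM elevations vs. surveyed check points.
dem_values = [102.4, 98.7, 110.1, 95.3]
check_values = [101.9, 99.5, 109.8, 96.0]

# Root-mean-square error, the usual summary for continuous attributes.
rmse = math.sqrt(sum((d - c) ** 2 for d, c in zip(dem_values, check_values))
                 / len(dem_values))
print(f"RMSE = {rmse:.2f} m")  # ~0.61 m
```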

Attribute Accuracy (cont.)

* for categorical attributes such as classified polygons:
* are the categories appropriate, sufficiently detailed and defined?
* gross errors, such as a polygon classified as A (shopping center) when it should have been B (golf course), are possible
* more likely, the polygon will be heterogeneous, e.g., vegetation zones where the area may be 70% A and 30% B
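
One common way to quantify categorical accuracy is to compare mapped classes against field-checked reference classes and report the overall proportion correct. A small Python sketch with hypothetical class labels:

```python
# Hypothetical class labels: mapped category vs. field-checked reference.
mapped    = ["A", "A", "B", "B", "A", "B", "A", "B"]
reference = ["A", "B", "B", "B", "A", "A", "A", "B"]

correct = sum(m == r for m, r in zip(mapped, reference))
print(f"overall accuracy = {correct / len(mapped):.0%}")  # 75%
```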

Logical Consistency

* refers to the internal consistency of the data structure, particularly as it applies to topological consistency
* is the database consistent with its definitions? . . .
* if there are polygons, do they close?
* is there exactly one label within each polygon?
* are there nodes wherever arcs cross, or do arcs sometimes cross without forming nodes?
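
Checks like these are easy to automate. A minimal Python sketch, assuming polygons stored as vertex lists and label points tagged with a polygon id (both hypothetical structures):

```python
def polygon_closes(ring):
    """A ring is consistent only if its first and last vertices coincide."""
    return ring[0] == ring[-1]

def label_count_ok(labels, polygon_id):
    """Exactly one label point should belong to each polygon."""
    return sum(1 for lab in labels if lab["polygon"] == polygon_id) == 1

ring = [(0, 0), (4, 0), (4, 3), (0, 0)]
labels = [{"polygon": 1, "xy": (2, 1)}]
print(polygon_closes(ring), label_count_ok(labels, 1))  # True True
```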

Completeness

concerns the degree to which the data exhaust the universe of possible items
e.g., are all possible objects included within the database?
affected by rules of selection, by generalization, and by scale

Lineage

* a record of the data sources and of the operations which created the database. . .
* how was it digitized, from what documents?
* when were the data collected?
* what agency collected the data?
* what steps were used to process the data?
* what was the precision of the computational results?
* Lineage is often useful as an indicator of accuracy

Additional stuff not covered in video lecture:

ERROR in Database Creation

error is introduced at almost every step of database creation
What are these steps, and what kinds of error are introduced?

Positional Measurement Error

the most accurate basis of absolute positional data is the geodetic control network = a series of points whose positions are known with high precision

GPS is a powerful way of augmenting this network

GEOID

Geodetic control points correlate lat/long, height, scale and orientation throughout the U.S. - based on the geoid

The geoid is the shape that would be approximated by an undisturbed mean sea level - a surface to which gravity is everywhere perpendicular

must correct for the deflections of the vertical so that measurements of distance on the earth's surface will be consistent with those determined by astronomic observations

Geoid Map

Geoid heights range from lows in the Atlantic to highs in the Rockies

Positional Measurement Error

most positional data on land are derived from air photos
here accuracy depends on the establishment of good control points
data from remote sensing are more difficult to position accurately because of the size & sheer quantity of pixels

some positional data come from text descriptions, e.g., old surveys tied in to marks on trees or boundary following a watershed or midline of a river

Digitizing Errors

* digitizers encode manuscript lines as sets of x-y coordinate pairs
* resolution of coordinate data is dependent on the mode of digitizing:
* point-mode - selecting & encoding only those points deemed critical to truly representing a line
* stream-mode - the digitizing device automatically selects points based on a distance or time parameter
* a high density of coordinate pairs results (see the sketch below)
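
A minimal Python sketch of distance-based stream-mode point selection, assuming the cursor trace is a simple list of positions; the coordinates and threshold are illustrative:

```python
import math

def stream_filter(points, min_dist):
    """Keep a vertex only after the cursor moves min_dist from the last kept one."""
    kept = [points[0]]
    for p in points[1:]:
        if math.dist(kept[-1], p) >= min_dist:
            kept.append(p)
    return kept

trace = [(0, 0), (0.1, 0.05), (0.5, 0.4), (1.2, 0.9), (1.25, 0.95)]
print(stream_filter(trace, min_dist=0.5))
# [(0, 0), (0.5, 0.4), (1.2, 0.9)]
```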

Digitizing Errors

2 types of errors normally occur in stream-mode digitizing

physiological - hitting the button twice or involuntary muscle spasms that tend to produce spikes, switchbacks, or loops

Digitizing Errors Cont.

psychomotor - the digitizing operator can't see the line or can't properly move the cross-hairs along it

* may also involve misinterpretation or too much generalization
* not easy to remove these automatically
* in spite of physiological and psychomotor errors, digitizing itself is not a major source of positional error
* also errors in registration & control points, as well as shrinkage or stretching of paper

Attribute Errors

* attributes usually obtained through a combination of field collection and interpretation
* categories may be subjective (e.g., "diversity," or "old growth" used in forest mgmt.)
* attributes such as these may not be easy to check in the field
* for social data, a major source of inaccuracy is undercounting, e.g., missing certain social groups in a Census

Compilation Errors

* common practices in map compilation introduce further inaccuracies
* generalization - the practice of reducing the # of points on a line while still keeping the line's appearance (see the sketch after this list)
* line smoothing - reducing the # of points on a line & also changing the line's appearance
* separation of features - e.g., railroad moved on a map so as not to overlap adjacent road
* these practices affect the usefulness & meaning of the data as well
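
A classic point-reduction method for generalization is the Douglas-Peucker algorithm (not named in the slides, but the standard example of this practice). A compact Python sketch, with illustrative coordinates and tolerance:

```python
import math

def perp_dist(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    if a == b:
        return math.dist(p, a)
    (px, py), (ax, ay), (bx, by) = p, a, b
    num = abs((bx - ax) * (ay - py) - (ax - px) * (by - ay))
    return num / math.dist(a, b)

def douglas_peucker(pts, tol):
    """Drop vertices that deviate less than tol from the simplified line."""
    if len(pts) < 3:
        return pts
    # Find the vertex farthest from the chord joining the endpoints.
    idx, dmax = 0, 0.0
    for i in range(1, len(pts) - 1):
        d = perp_dist(pts[i], pts[0], pts[-1])
        if d > dmax:
            idx, dmax = i, d
    if dmax <= tol:  # everything within tolerance: keep only the chord
        return [pts[0], pts[-1]]
    left = douglas_peucker(pts[:idx + 1], tol)
    right = douglas_peucker(pts[idx:], tol)
    return left[:-1] + right  # drop the duplicated split vertex

line = [(0, 0), (1, 0.1), (2, -0.1), (3, 5), (4, 6), (5, 7), (6, 8.1), (7, 9)]
print(douglas_peucker(line, tol=0.5))
# [(0, 0), (2, -0.1), (3, 5), (7, 9)]
```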

Processing Errors

mathematical errors
accuracy lost due to low precision computations
rasterization of vector data
misuse of logic
generalization/smoothing and related problems of interpretation
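
The low-precision point is easy to demonstrate: the UTM easting from earlier in this lecture cannot survive a round trip through a 32-bit float. A quick Python illustration:

```python
import struct

easting = 579124.349  # the UTM easting from earlier in the lecture

# Round-trip through a 4-byte (single-precision) float.
as_f32 = struct.unpack("f", struct.pack("f", easting))[0]
print(f"{as_f32:.3f}")   # 579124.375 -- wrong at the centimetre level
print(f"{easting:.3f}")  # 579124.349 (Python floats are 64-bit)
```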

Data Quality Reports

Reports issued under U.S. national standards . . .

current standards were developed by the NCDCDS (National Committee for Digital Cartographic Data Standards), which set standards for data quality documentation as well as the components of data quality

Questions?

The End!

