DOCUMENTING HETEROGENEOUS GEOSPATIAL DATASETS

An Impediment to Implementing FGDC's Metadata Standard

Geoffrey Dutton, 22 November 1994

 

The new provisional standard for geospatial metadata is a big step in the right direction, taken after many years in which important aspects of spatial data quality went unrecorded and uncommunicated to actual and potential dataset users. FGDC should be congratulated for its timely and thoughtful effort in developing the standard. In many ways it is sensible and usable, and should make an immediate and needed contribution to the dissemination and application of geospatial data. We know it will be used, because Federal agencies are mandated to begin documenting all new datasets starting January first, but we do not know how carefully or enthusiastically they will adhere to the standard, nor how useful their submissions of metadata will be to data users.

The metadata standard is perforce formulated from a producer's perspective. It is, one assumes, the responsibility of data producers to document published datasets, and there is not much consumers can do other than offer feedback on the adequacy of the organization, usability and quality of datasets they acquire. But the standard seems to assume tacitly that most datasets will be created according to standardized procedures and will be relatively uniform with respect to sources of data, methods of compilation and quality control. This is, after all, the mode of production used by major Federal geodata generators such as USGS, Census, FEMA, SCS, DMA and NOAA. But it may not be how most private sector, academic and other government organizations digitize and assemble spatial data, and this reality may lessen the standard's viability.

Beyond Fed-land, the standard is entirely voluntary, to be followed as one's conscience, policy and budget dictate. It will take substantial amounts of blood, sweat and tears for those of us in the provinces to comply with it, and one cannot assume that compliance will be full or uniform. But a lot of people do want to play, and would like to parade their wares through the FGDC Clearinghouse. They are willing to get their docs lined up, and do not need many more incentives (though a few more grants-in-aid would still be welcome). The problem is that the current standard may not let them describe all the elements of their datasets easily or adequately.

Geodata generated by small-scale, part-time or ad hoc producers tends to combine information borrowed from more than one source, and may have been compiled at different scales and times, with various methods, for different purposes. Sometimes a given layer or theme will come from a single source but may have been registered to some other layer or handled in other ways that have differential effects across the space it covers. In many cases, however, different regions will contain geometry from different sources, even within one layer: street networks from a local DPW may be inset into a TIGER file or a state DOT road network; hydrography may be updated in an area where water projects have been undertaken; GPS surveys may alter the locations of some control points. And map overlay operations will nearly always yield coverages that are inherently heterogeneous with respect to data quality. The explosion of GIS use in the 1990s has led to a proliferation of such geospatial data, and we had better get set to deal with it.

The metadata standard is intended to apply to entire datasets, and while it does not prescribe how they should be organized, it does presume that everything within them is uniform. The only way that differences in data quality can be documented under the standard is via narrative fields, which are free-form and not easily parsed automatically. That's Trouble in River City: it starts with T, and that rhymes with B, and that stands for Boole. Said differently, what we have here is a failure to communicate.

That failure to communicate can be overcome. Logic needs to be added to the metadata standard that allows the quality of subsets of data within a file to be described. Some way needs to be found to express which data differ, how they differ and where they differ, when they differ.
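
To make that concrete, here is one way such logic might be sketched. Nothing below comes from the FGDC standard; the element names and structure are invented purely for illustration. The idea is simply that a dataset-level record could carry any number of subset-level quality entries, each tied to a spatial extent and to the layers it affects:

    # Hypothetical sketch only: these element names are NOT part of the FGDC
    # standard; they illustrate one way subset-level quality might be recorded.

    from dataclasses import dataclass, field
    from typing import List, Optional, Tuple


    @dataclass
    class QualitySubset:
        """Quality description for one spatially delimited subset of a dataset."""
        extent: Tuple[float, float, float, float]  # (west, south, east, north), decimal degrees
        layers: List[str]          # which themes/layers the description applies to
        source: str                # where this subset's geometry came from
        source_scale: int          # denominator of the source map scale, e.g. 24000
        compilation_date: str      # when this subset was compiled or updated
        horizontal_accuracy_m: Optional[float] = None  # estimated positional accuracy


    @dataclass
    class DatasetMetadata:
        """Dataset-level record plus zero or more subset-level quality entries."""
        title: str
        overall_source: str
        subsets: List[QualitySubset] = field(default_factory=list)


    # Example: a street layer that mixes TIGER geometry with a locally digitized inset.
    md = DatasetMetadata(
        title="Anytown street centerlines",
        overall_source="Census TIGER/Line",
        subsets=[
            QualitySubset(
                extent=(-71.10, 42.30, -71.05, 42.35),
                layers=["streets"],
                source="Anytown DPW field survey",
                source_scale=1200,
                compilation_date="1994-06",
                horizontal_accuracy_m=2.0,
            ),
        ],
    )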

To fully describe the diversity of a coverage's lineage, it may be useful for the logic to be recursive, as any number of generations may be represented in a geodataset. How to accomplish this is less clear; documenting heterogeneity could become so cumbersome that producers will not bother, or may make mistakes in attempting to provide metadata. The FGDC standard is simple enough to implement using word processor templates. If various levels and subsets of data need to be identified, the complexity of compliance and the chances for introducing errors would obviously increase.
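
If recursion is wanted, the same kind of record can be allowed to nest. The sketch below is again purely hypothetical and no part of the standard; it simply lets each source of a dataset carry its own sources, so that any number of generations can be represented and counted:

    # Hypothetical sketch: a recursive lineage record, in which each source may
    # itself be a derived dataset with its own sources, to any number of
    # generations. Nothing here is prescribed by the FGDC standard.

    from dataclasses import dataclass, field
    from typing import List


    @dataclass
    class LineageNode:
        name: str                        # dataset or source identifier
        process: str = "original"        # how this generation was produced
        sources: List["LineageNode"] = field(default_factory=list)

        def depth(self) -> int:
            """Number of generations represented at or beneath this node."""
            return 1 + max((s.depth() for s in self.sources), default=0)


    # Example: a land-use coverage built by overlaying two coverages, one of
    # which was itself derived from soil sheets registered to hydrography.
    landuse = LineageNode(
        name="landuse_1994",
        process="polygon overlay",
        sources=[
            LineageNode(name="parcels", process="digitized from 1:1200 tax maps"),
            LineageNode(
                name="soils",
                process="registered to hydrography",
                sources=[
                    LineageNode(name="SCS soil sheets, 1:20000"),
                    LineageNode(name="USGS DLG hydrography, 1:24000"),
                ],
            ),
        ],
    )

    print(landuse.depth())  # 3 generations represented in this example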

As messy as it is, the heterogeneity issue must be faced and overcome if we are to serve trustworthy geodata in an internetworked environment. As time goes on, more and more datasets will interbreed and ramify; if we have no viable means of describing differences between generations of datasets, those differences may go unnoticed, but their consequences can be many, and not necessarily benign.

This essay has been posted as an attempt to articulate a looming problem, in the hope that others will come forward to speak to it. Your opinions are sought in order to gauge the extent of the problem and awareness of it, and to begin to develop approaches to handling it. Specifically:

  1. What is the extent of heterogeneity in geodata?
  2. What types of heterogeneity can be expected?
  3. What problems do users face in coping with heterogeneity?
  4. How can metadata help to overcome those problems?
  5. What solutions does the current metadata standard enable?
  6. How can the standard be revised or extended to encourage and enable more thorough documentation of heterogeneity?

A personal note: My own opinion is that many of the difficulties in coping with positional data accuracy (which is only one aspect of metadata, but an important one) stem from the incapacity of coordinate tuples to identify the scale or accuracy of the locations they represent. Overcoming this limitation may require instituting some fundamental changes in locational notation, such as replacing coordinates with hierarchical location codes that can capture such metadata. This could be implemented in ways transparent to users, while allowing each point location to document its own reliability and eliminating the need to maintain external metadata to describe positional accuracy (though not necessarily lineage).
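
To illustrate the principle only (this is a plain quadtree key over a latitude/longitude rectangle, not the notation I have in mind), consider how the length of a hierarchical code can itself convey the resolution, and hence a rough positional accuracy, of the location it names:

    # Illustration only: a plain quadtree key over a latitude/longitude
    # rectangle. It shows how a code's length can carry the resolution of the
    # location it represents, something a bare coordinate pair cannot do.


    def quad_code(lon: float, lat: float, levels: int) -> str:
        """Encode a point as a quadtree digit string; each digit quarters the cell."""
        west, south, east, north = -180.0, -90.0, 180.0, 90.0
        digits = []
        for _ in range(levels):
            mid_lon = (west + east) / 2.0
            mid_lat = (south + north) / 2.0
            quadrant = 0
            if lon >= mid_lon:
                quadrant += 1
                west = mid_lon
            else:
                east = mid_lon
            if lat >= mid_lat:
                quadrant += 2
                south = mid_lat
            else:
                north = mid_lat
            digits.append(str(quadrant))
        return "".join(digits)


    def cell_size_degrees(code: str) -> float:
        """Longitude extent of the cell named by a code: the code is its own precision."""
        return 360.0 / (2 ** len(code))


    # The same point encoded at two resolutions; the longer code asserts more accuracy.
    print(quad_code(-71.06, 42.36, 8))   # coarse: cell roughly 1.4 degrees across
    print(quad_code(-71.06, 42.36, 20))  # fine: cell roughly 0.0003 degrees across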

During the five years I have been advocating such an approach, I have not observed any groundswell of support for it. But as time goes on and ever greater amounts of untrustworthy geodata wash over us, I believe that an increasing number of people will be looking for more direct solutions, and will independently or cooperatively arrive at similar conclusions.
