Scientific Information Model
This paper appears in Marine Geodesy, 20(4): 367-379, 1997.

Dawn J. Wright**, Christopher G. Fox*, and Andra M. Bobbitt+

**Dept. of Geosciences, Oregon State University
*NOAA-PMEL, OSU Hatfield Marine Science Center
+CIMRS-PMEL, OSU Hatfield Marine Science Center

A Scientific Information Model for Deepsea Mapping and Sampling


Copyright reserved by Dawn Wright. May be freely distributed electronically in whole or in part, but please keep this notice attached and do not alter the text.

Abstract

The Vents Program of NOAA's Pacific Marine Environmental Laboratory is an interdisciplinary research initiative that brings together scientists from a wide range of disciplines, including geophysics, geology, physical oceanography, chemistry, and biology. Each discipline collects data of varying types and structures, all of which require intercomparison. The challenge of scientific information management is thus approached with a view to supporting data that come from multiple survey, mapping, and sampling tools and that are subject to multiple levels of interpretation. The ultimate objective is a system that integrates the functions of data storage, selective retrieval, display, and archiving. Our ongoing efforts in scientific information modeling and management have produced a relational database in which marine geological, geophysical, chemical, and biological observations can be accessed by any investigator.

Key Words: Information management and modeling, mapping and exploration, GIS, seafloor-spreading, interdisciplinary science

Introduction

The Pacific Marine Environmental Laboratory (PMEL) of the National Oceanic and Atmospheric Administration (NOAA) conducts interdisciplinary scientific investigations in physical oceanography, marine meteorology, marine geology and geophysics, geochemistry, and related subjects. The Vents Program at PMEL focuses on determining the oceanic impacts and consequences of submarine hydrothermal venting (Hammond et al., 1991; Fox, 1995). Most of its efforts are directed toward determining patterns and pathways for the regional transport of hydrothermal emissions, and their relationship to the geology and tectonics of northeast Pacific Ocean seafloor-spreading centers. The understanding obtained from this relatively isolated system will eventually be extended to a prediction of the impact of seafloor hydrothermal systems on the global ocean. The attainment of the overall program goal therefore requires a long-term interdisciplinary approach (Hammond et al., 1991; Fox, 1995).

The increasing amount and inherent complexity of oceanographic data collected in multiagency, multidisciplinary national and international research programs will require the implementation of more powerful data management and analytical techniques (e.g., Bobbitt et al., 1993). The cost of acquiring the data alone justifies the development of dedicated systems for integration of these data. Not only do a wide variety of data sources need to be dealt with, but a myriad of data structures as well (e.g., tables of chemical concentration versus raster images versus gridded bathymetry versus four-dimensional data, etc.). The synergy of different types of data provides the scientific community with more information and insight than that obtained by considering each type of data separately. Such an approach requires that the vast amount of information that will accrue be intelligently catalogued, as well as spatially and temporally coregistered. Scales of information range from hundreds of kilometers to millimeters, and decades to milliseconds. It is important that data are in such a format that the realtime information is available to the broad community within one fieldwork cycle (~1 year). In this fashion all related and germane experiments performed in one year will be able to draw on previously obtained information and insights. Data quality criteria must be established and Internet connections made available to all. Here a scientific information management infrastructure, which provides an essential technology for data dissemination, sharing, cataloguing, archiving, display, and mapping, has an obvious relevance. The Vents Program has therefore established scientific information management as an organizational priority.

According to Gritton and Baxter (1992), any comprehensive scientific information management infrastructure should have at least three components: (1) provision of uniform access to data from multiple sources; (2) testing and evaluation of both commercial and "homegrown" technologies; and (3) assimilation of new data acquisition instrumentation, computers and database management software into the existing system. The Vents Program has adopted a technical approach that emphasizes these components. A geographic information system (GIS) is the foundation of the computing environment. It integrates data from multiple conventional sensor and sample analysis systems, as well as interpretive information from video and sidescan sonar data analysis. One crucial component in the effective implementation of the entire infrastructure is the formulation of a scientific information model (SIM).

Rationale for a SIM

Specific scientific goals of work at PMEL include: (1) the coordinated collection of geological, chemical, biological, and physical data sets; and (2) high level analyses of the processes and process interactions represented in the data sets. A SIM is the formal description of the scientific information requirements for this work. Formulating it is normally the first and most critical activity of any database system implementation (Environmental Systems Research Institute, 1994; Gritton and Baxter, 1992). Specifically, it represents the objects, places, concepts, or events, as well as their interrelationships, which are pertinent to the real-world scenario of data flow and data analysis. It includes important concepts of scientific information management such as flexible data grouping, data traceability, and data quality assessment. Such models provide a basis for communicating information requirements, performing database design, and enhancing user interaction with the information management system.

The formulation of a SIM is desirable for the following reasons:

SIM Formulation and Implementation

One approach to formulating a SIM puts emphasis on aligning database structures with sensor acquisition systems. This is fine if one wants to manage only the sensor data itself, and not the information derived from it. Vents Program researchers deal with so many different acquisition systems that this is not feasible (e.g., Sea Beam, Hydrosweep, Sea Beam 2000, or SeaMARC II bathymetry; Alvin, Turtle, or Shinkai 6500 submersible observations and samples; SeaMARC I, IA, II, or DeepTow sidescan sonar imagery). Also, sensor acquisition systems are constantly in flux. What happens when the technology changes?

For the Vents Program a better approach is to emphasize the management of the scientific information that comes out of the acquisition systems. This better reflects the information requirements of the scientists and thus helps them to gain a better understanding of the scientific problems to be addressed. This is especially important as the interdisciplinary research of the Vents Program often requires each contributor to work with information in which they may have limited expertise, but must still interpret the results. Such an approach also allows the model to remain stable despite changes in data acquisition systems or perhaps even data management systems. The result is the ability to evolve to more advanced technical solutions. The SIM must therefore identify the inherent structure of oceanographic facts and hypotheses (e.g., hypotheses about processes associated with seafloor-spreading), rather than the structure of the sensor data record.

In order to address its scientific goals and objectives, the Vents Program performs certain organizational functions to accommodate its interdisciplinary goals. These functions were the starting point for the design of the SIM. Modeling proceeded with the identification of the functions and a general description of the activities within each function (Figure 1). Once the functions were compiled, the next logical step was to identify the data supporting each function. Table 1 is a data/function cross-reference matrix, where function is essentially a particular subdiscipline of oceanography. The data/function matrix shows a high-level classification of data for the Vents Program, the interdependence of data and function, those functions or disciplines which create commonly required data, and the interdependence of certain functions (such as Geology and Chemistry in Table 1). It is particularly useful for pinpointing information generalities (i.e., data sets useful to more than one or all subdisciplines), which should in turn be reflected in the SIM.
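As an illustration only, a data/function cross-reference matrix of this kind can be sketched in a few lines of Python. The data classes and their assignments to functions below are hypothetical placeholders, not the entries of Table 1; the two helper functions are likewise assumptions made for this sketch. It shows how such a matrix can be queried for data sets shared by several subdisciplines and for the data that tie two functions together.

```python
# Hypothetical data/function cross-reference matrix (NOT the contents of Table 1).
# Each data class maps to the set of functions (subdisciplines) that use it.
DATA_FUNCTION_MATRIX = {
    "bathymetry": {"Geology", "Geophysics", "Chemistry", "Biology"},
    "water_temperature": {"Chemistry", "Physical Oceanography"},
    "rock_samples": {"Geology", "Chemistry"},
    "faunal_counts": {"Biology"},
}


def shared_data(matrix, min_functions=2):
    """Return data classes used by at least `min_functions` functions, sorted by name."""
    return sorted(name for name, funcs in matrix.items() if len(funcs) >= min_functions)


def interdependent_data(matrix, a, b):
    """Return the data classes that functions `a` and `b` both rely on, sorted by name."""
    return sorted(name for name, funcs in matrix.items() if a in funcs and b in funcs)


# Data sets useful to more than one subdiscipline ("information generalities").
print(shared_data(DATA_FUNCTION_MATRIX))
# Data sets that link two particular functions.
print(interdependent_data(DATA_FUNCTION_MATRIX, "Geology", "Chemistry"))
```

Data classes returned by the first query are the candidates for shared entities in the model, since more than one discipline depends on them.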

Description of the Vents Program SIM

The SIM (e.g., Figure 2a) is expressed mainly in terms of entities, attributes, and relationships. An entity (shown as a labeled box in Figure 2a) is an information template corresponding to objects, places, concepts, or events about which one wishes to store information (Fleming and von Halle, 1989; Gritton et al., 1989). Attributes (plain, unitalicized text in Figure 2b) define the information to be recorded about entities, and relationships (labeled lines in Figure 2) express an association between entities (Fleming and von Halle, 1989; Gritton et al., 1989). Figure 2 illustrates a very high level view of the SIM; for simplicity, attributes are not included for every entity, and relationship cardinality includes only two classes: one to one or one to many.
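The entity, attribute, and relationship vocabulary described above can be made concrete with a minimal Python sketch. The class names, the "1:1" and "1:N" cardinality codes, and the example entities are assumptions chosen for illustration; they are not taken from Figure 2.

```python
from dataclasses import dataclass, field

# The two cardinality classes used in the high-level view of the model.
CARDINALITIES = ("1:1", "1:N")


@dataclass
class Entity:
    """An information template for an object, place, concept, or event."""
    name: str
    attributes: list = field(default_factory=list)


@dataclass
class Relationship:
    """A labeled association between two entities."""
    label: str
    source: Entity
    target: Entity
    cardinality: str

    def __post_init__(self):
        # Reject anything other than one-to-one or one-to-many.
        if self.cardinality not in CARDINALITIES:
            raise ValueError(f"unsupported cardinality: {self.cardinality}")


# Hypothetical example: one cruise may conduct many dives.
cruise = Entity("Research Cruise", ["cruise_id", "vessel", "start_date"])
dive = Entity("Submersible Dive", ["dive_id", "site", "dive_date"])
conducts = Relationship("conducts", cruise, dive, "1:N")
print(f"{conducts.source.name} {conducts.label} {conducts.target.name} ({conducts.cardinality})")
```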

The SIM illustrates a three-part pipeline from the marine environment being measured to an abstract description of the scientist's understanding of that environment. Part 1 (Figure 2a) focuses on data acquisition, the first piece of the pipeline, and is thus concerned with the accurate sensing and collection of measurements from the marine environment and the transformation of these measurements from raw into processed form for scientific consumption. Part 1 of the SIM therefore illustrates a situation where the entities are largely independent of the data model of the GIS or database management system (DBMS). Here data collection is made in two ways: (1) short-term (2 weeks to 1 month) mobile entities, such as research cruises employing various instrument platforms, ROV deployments, or submersible dives; or (2) long-term (1 year or more) fixed entities, such as hydrophones and various other types of moored instrumentation. For the mobile entities, data collection runs gather in situ samples, casts, and tows of various types for future laboratory analysis. These runs may be made at predefined sites, or along predefined transects or tracklines. For the fixed entities, raw data are sent back to shore from the hydrophones or moorings for encryption and/or processing. Ocean observations from the mobile and fixed entities are made at points in space and time and may have multiple assessments of the geophysical, geological, chemical, physical, and biological properties that exist in the marine environment. Each component of an ocean observation group may have one to many assessments of data quality or error. In turn, one to many observation sets or types may arise from a particular ocean observation group. These observation sets or types may be derived by a theoretical method, derived from a sensor, created from a laboratory run, or derived by observing a process (Figure 2a).
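The one-to-many relationships in Part 1 map naturally onto relational tables. The following is a minimal sketch using Python's built-in sqlite3 module, not the Vents Program's actual schema: the table names, column names, and sample values are all hypothetical. It shows one ocean observation group carrying several data quality assessments and giving rise to several observation sets, with the four derivation types enforced by a CHECK constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
CREATE TABLE observation_group (
    id          INTEGER PRIMARY KEY,
    platform    TEXT NOT NULL,   -- mobile or fixed entity that made the observation
    latitude    REAL NOT NULL,
    longitude   REAL NOT NULL,
    observed_at TEXT NOT NULL    -- observations are points in space and time
);
CREATE TABLE quality_assessment (
    id        INTEGER PRIMARY KEY,
    group_id  INTEGER NOT NULL REFERENCES observation_group(id),
    component TEXT NOT NULL,
    note      TEXT NOT NULL
);
CREATE TABLE observation_set (
    id         INTEGER PRIMARY KEY,
    group_id   INTEGER NOT NULL REFERENCES observation_group(id),
    derivation TEXT NOT NULL
        CHECK (derivation IN ('theoretical', 'sensor', 'laboratory', 'process'))
);
""")

# Made-up sample rows: one observation group with many assessments and sets.
conn.execute(
    "INSERT INTO observation_group VALUES (1, 'submersible dive', 46.5, -129.6, '1993-07-15T12:00:00Z')"
)
conn.executemany(
    "INSERT INTO quality_assessment (group_id, component, note) VALUES (?, ?, ?)",
    [(1, "temperature", "sensor calibrated before dive"),
     (1, "position", "navigation error estimate recorded")],
)
conn.executemany(
    "INSERT INTO observation_set (group_id, derivation) VALUES (?, ?)",
    [(1, "sensor"), (1, "laboratory")],
)

counts = conn.execute(
    "SELECT (SELECT COUNT(*) FROM quality_assessment WHERE group_id = 1),"
    "       (SELECT COUNT(*) FROM observation_set WHERE group_id = 1)"
).fetchone()
print(counts)
```

Keeping quality assessments in their own table, rather than as columns on the observation, is what allows each component to carry any number of assessments.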

Part 2 of the SIM (Figure 2b) focuses on the next portion of the pipeline, database management. This portion must provide integrated management of diverse scientific data which will be accessed, often simultaneously, in numerous ways by a variety of users. Further, each end-user scientist must be able to access the database in a way pertinent to their particular scientific function or discipline. In this part of the SIM the entities are thus very much constrained by the data model of the chosen GIS or DBMS. The Vents Program uses the Arc/Info GIS. The entities in Part 2 are essentially coverages, the basic unit of vector data storage in Arc/Info. A coverage stores geographic features as primary features, such as arcs, nodes, polygons, and label points, and secondary features, such as map extents, links, and annotation (Environmental Systems Research Institute, 1994). Associated attributes of the primary geographic features are stored in feature attribute tables. In Figure 2b entities are once again shown as labeled boxes, but each entity representing an Arc/Info coverage is shaded according to its topology, which is classed as point, line, polygon, line & point, or line & polygon. Topology is the spatial relationship between connecting or adjacent features in a coverage (Environmental Systems Research Institute, 1994). In Figure 2b the associated attributes for each Arc/Info coverage are listed below it. The boxed M next to each coverage stands for the metadata and lineage information associated with that coverage. Metadata, or data about data, would include such ancillary information as sensor calibration, data quality assessment, and the processing algorithm used, while lineage would be the time stamp for each of these.
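To make the structure of a Part 2 entity concrete, the sketch below models a coverage as a plain Python record with a topology class, a feature attribute table, and metadata items whose time stamps form the lineage. This is not the Arc/Info API or file format; the class names, field names, and sample values are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# The five topology classes used to shade coverages in the model diagram.
TOPOLOGIES = {"point", "line", "polygon", "line & point", "line & polygon"}


@dataclass
class MetadataItem:
    kind: str            # e.g. "sensor calibration", "data quality assessment"
    description: str
    timestamp: datetime  # lineage: when this metadata item was recorded


@dataclass
class Coverage:
    name: str
    topology: str
    attribute_table: list = field(default_factory=list)  # one dict per feature
    metadata: list = field(default_factory=list)

    def __post_init__(self):
        if self.topology not in TOPOLOGIES:
            raise ValueError(f"unknown topology: {self.topology}")

    def lineage(self):
        """Return (timestamp, kind) pairs in chronological order."""
        return sorted((m.timestamp, m.kind) for m in self.metadata)


# Hypothetical point coverage with two features and two metadata items.
vent_sites = Coverage(
    name="vent_sites",
    topology="point",
    attribute_table=[
        {"site_id": 1, "temperature_c": 12.5},
        {"site_id": 2, "temperature_c": 8.0},
    ],
    metadata=[
        MetadataItem("processing algorithm", "gridded and smoothed",
                     datetime(1993, 8, 1, tzinfo=timezone.utc)),
        MetadataItem("sensor calibration", "pre-cruise calibration",
                     datetime(1993, 7, 1, tzinfo=timezone.utc)),
    ],
)
for stamp, kind in vent_sites.lineage():
    print(stamp.date(), kind)
```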

Part 3 of the SIM (Figure 2c) focuses on the final piece of the pipeline, data presentation and analysis, in which scientists derive meaning and knowledge from the data. In this portion of the SIM, the entities are largely dependent on the scientist. Here it is more of a flow chart than an entity-attribute-relationship diagram as it illustrates the flow of activity by Vents Program users in the course of doing empirical and positivist science. For example, in Figure 2c scientists may retrieve and display an Arc/Info coverage either directly from Arc/Info, ArcView (the graphical user interface to Arc/Info), or by way of a World Wide Web page that provides an interactive link to ArcView and the Arc/Info coverages illustrated in Figure 2b (the Uniform Resource Locator for this page is http://www.pmel.noaa.gov/vents/coax/coax.html; Wright et al., in prep.). Concurrently, data manipulation or analysis may occur in the form of overlays, buffering, clipping, and the like. The end result after data retrieval/display and/or data manipulation/analysis is that a Vents Program scientist has the power to view, query, summarize, calculate, and make decisions (Figure 2c). Data are interpreted, new knowledge is gained, and hypotheses may be derived and/or tested. This cycle of activity may then return either to more data retrieval/display, more manipulation/analysis or to the marine environment in Part 1 of the SIM (Figure 2a) to collect more data.
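Two of the manipulations named above, buffering and clipping, can be illustrated for point data with plain Python. This is a simplified sketch, not how Arc/Info implements these operations: it assumes planar coordinates in meters, treats a buffer as a circle around a single point, and treats clipping as selection by a rectangular extent. The function names and sample points are hypothetical.

```python
import math


def buffer_select(points, center, radius):
    """Return the points lying within `radius` of `center` (planar distance)."""
    cx, cy = center
    return [p for p in points if math.hypot(p["x"] - cx, p["y"] - cy) <= radius]


def clip(points, xmin, ymin, xmax, ymax):
    """Return the points lying inside the rectangle, edges included."""
    return [p for p in points if xmin <= p["x"] <= xmax and ymin <= p["y"] <= ymax]


# Made-up sample sites in projected coordinates (meters).
sites = [
    {"id": "A", "x": 0.0, "y": 0.0},
    {"id": "B", "x": 300.0, "y": 400.0},
    {"id": "C", "x": 1000.0, "y": 1000.0},
]

near_origin = buffer_select(sites, center=(0.0, 0.0), radius=500.0)
in_extent = clip(sites, 200.0, 200.0, 1200.0, 1200.0)
print([p["id"] for p in near_origin])
print([p["id"] for p in in_extent])
```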

Insights Resulting from the SIM

These are the preliminary insights resulting from the SIM at this stage in its development:

Conclusion

The Vents Program has provided a framework for successful integrated data management by establishing scientific information management as an organizational priority. The ultimate objective is a system that integrates the functions of data storage, selective retrieval, display and archiving. The results of our ongoing efforts in scientific information modeling and management have produced a relational database in which marine geological, geophysical, chemical, and biological observations can be accessed by any investigator within the Vents Program and throughout the marine science community. The Vents Program SIM provides an effective schema for database design, simplifies a complex data management scenario for end-user scientists, serves as an effective communications tool for scientists and database managers, and demonstrates that GIS technology can be an effective, if not crucial, tool for supporting oceanographic research.

References

Applegate, T. B. 1990. Volcanic and structural morphology of the south flank of Axial Volcano, Juan de Fuca Ridge: Results from a Sea MARC I side scan sonar survey. Journal of Geophysical Research, 95(B8): 12,765-12,783.

Baker, E. T., and G. A. Cannon. 1993. Long-term monitoring of hydrothermal heat flux using moored temperature sensors, Cleft segment, Juan de Fuca Ridge. Geophys. Res. Lett., 20(17): 1855-1858.

Bobbitt, A., T.-K. Lau, and C. G. Fox. 1993. Integrating multidisciplinary data sets from the Juan de Fuca Ridge using geographic information systems (abstract). Eos, Transactions of the American Geophysical Union, 74(43): 88.

Cannon, G. A., D. J. Pashinski, and T. J. Stanley. 1995. Fate of event hydrothermal plumes on the Juan de Fuca Ridge. Geophys. Res. Lett., 22(2): 163-166.

Chadwick, W. W., Jr., R. W. Embley, and C. G. Fox. 1995a. Sea Beam depth changes associated with recent lava flows, CoAxial segment, Juan de Fuca Ridge: Evidence for multiple eruptions between 1981-1993. Geophys. Res. Lett., 22(2): 167-170.

Chadwick, W. W., H. B. Milburn, and R. W. Embley. 1995b. Acoustic extensometer: Measuring mid-ocean spreading. Sea Technol., 36(4): 33-38.

Dziak, R. P., C. G. Fox, and A. E. Schreiner. 1995. The June-July 1993 seismo-acoustic event at CoAxial segment, Juan de Fuca Ridge: Evidence for a lateral dike injection. Geophys. Res. Lett., 22(2): 135-138.

Embley, R. W., W. W. Chadwick, Jr., I. R. Jonasson, D. A. Butterfield, and E. T. Baker. 1995. Initial results of the rapid response to the 1993 CoAxial event: Relationships between hydrothermal and volcanic processes. Geophys. Res. Lett., 22(2): 143-146.

Environmental Systems Research Institute. 1994. ARC/INFO Data Management: Concepts, Data Models, Database Design, and Storage. Redlands, California: Environmental Systems Research Institute.

Fleming, C. C., and B. von Halle. 1989. Handbook of Relational Database Design. New York: Addison-Wesley.

Fotheringham, A. S., and P. A. Rogerson. 1993. GIS and spatial analytical problems. Int. J. Geographical Info. Sys., 7(1): 3-19.

Fox, C. G. 1993. Five years of ground deformation monitoring on Axial Seamount using a bottom pressure recorder. Geophys. Res. Lett., 20(17): 1859-1862.

Fox, C. G. 1995. Special collection on the June 1993 volcanic eruption on the CoAxial segment, Juan de Fuca Ridge. Geophys. Res. Lett., 22(2): 129-130.

Fox, C. G., and S. R. Hammond. 1994. The VENTS program T-phase project and NOAA's role in ocean environmental research. Mar. Technol. Soc. J., 27(4): 70-73.

Fox, C. G., K. M. Murphy, and R. W. Embley. 1988. Automated display and statistical analysis of interpreted deep-sea bottom photographs. Mar. Geol., 78: 199-216.

Gritton, B. R., and C. H. Baxter. 1992. Video database systems in the marine sciences. Mar. Technol. Soc. J., 26(4): 59-72.

Gritton, B., D. Badal, D. Davis, K. Lashkari, G. Morris, A. Pearce, and H. Wright. 1989. Data management at MBARI. In Oceans '89 Proceedings: The Global Ocean, 1681-1685.

Hammond, S. R., E. B. Baker, E. N. Bernard, G. J. Massoth, C. G. Fox, R. A. Feely, R. W. Embley, P. A. Rona, and G. A. Cannon. 1991. The NOAA VENTS Program: Understanding chemical and thermal oceanic effects of hydrothermal activity along the mid-ocean ridge. Eos, Transactions of the American Geophysical Union, 72(50): 561-566.

Hamre, T. 1993. Integrating remote sensing, in situ and model data in a marine information system (MIS). Proc. Neste Generasjons GIS 1993, Norway, 181-192.

Lupton, J. E., J. R. Delaney, H. P. Johnson, and M. K. Tivey. 1985. Entrainment and vertical transport of deep-ocean water by buoyant hydrothermal plumes. Nature, 316: 621-623.

Mason, D. C., M. A. O'Conaill, and S. B. M. Bell. 1994. Handling four-dimensional geo-referenced data in environmental GIS. Int. J. Geographical Info. Sys., 8(2): 191-215.

Openshaw, S. 1991. Developing appropriate spatial analysis methods for GIS. In Geographical Information Systems: Principles and Applications, eds. Maguire, D. J., Goodchild, M. F., and Rhind, D. W., 389-402. New York: John Wiley and Sons.

Raper, J. F., and B. Kelk. 1991. Three-dimensional GIS. In Geographical Information Systems: Principles and Applications, eds. Maguire, D. J., Goodchild, M. F., and Rhind, D. W., 299-317. New York: John Wiley and Sons.

Figure and Table Captions



