Author: Austin Troy, U.C. Berkeley Department of Environmental Science, Policy and Management
email: austint@nature.berkeley.edu
http://www.ucgis.org/oregon/papers/troy.htm
This project examines the effects of the recently passed California Natural Hazard Disclosure Law (AB 1195) on property markets throughout the state. The law requires property sellers to inform potential buyers of several types of natural hazard that may affect the property, including flood, wildfire and seismic hazards. The law was intended to inform consumers and to discourage development of hazardous areas by capitalizing some of the perceived and real costs associated with natural hazards into the selling price of a property. The study was initiated in response to the lack of research on the impacts of hazard disclosure on property markets (Cross 1985 and Palm 1982 are two of the few exceptions). The project aims to answer three questions about hazard disclosure: 1) What impact has the California Hazard Disclosure Law had on markets for developed property? 2) What impact has this law had on markets for vacant land? 3) How do these effects vary with demographic indicators, including housing supply-demand balance, race and neighborhood socio-economic status?
This paper reviews the methodology used in this project. A commonly used method for housing market analysis is hedonic analysis. In hedonic analysis, observed housing prices are regressed against vectors of structural, neighborhood and locational attributes so as to derive an index of marginal, implicit attribute prices. Here, hedonic analysis is used to isolate the effects of hazard disclosure on property values and to assess how that price effect changes with differing socio-economic and supply-demand conditions. By controlling for the wide range of structural, locational and neighborhood attributes, an implicit price, or unobserved capitalized value, can be derived for a home's location in or out of designated hazard zones.
This methodology differs from past hedonic studies in several ways. First, it attempts to draw conclusions at a dramatically larger spatial extent than any previous study. Second, it attempts to draw conclusions across an extremely heterogeneous series of populations and housing markets. Finally, it parameterizes far more locational attributes that are of importance to the housing price function than do most other hedonic studies. Therefore, this methodology represents a novel approach that could prove useful in conducting future housing market and non-market valuation research at the state or regional scale. More specifically, it is a useful template for those who wish to analyze the diversity of price effects of an environmental amenity or disamenity that is widely and unevenly distributed across a very large geographic area.
This paper explores one possible approach for undertaking a large-extent hedonic analysis and scrutinizes some possible permutations of that approach. Firstly, it looks at how GIS software and pre-existing digital data sets can be used to relatively easily quantify most of the variables needed in a hedonic analysis. Second, it looks at how to sample extremely large data sets (in this case, all real estate in California) so as to produce robust and meaningful parameter estimates while capturing diversity. Here, a two-stage cluster sample methodology is used to obtain relatively unbiased samples of households across diverse cross sections of the population. Finally, it presents several possible approaches for dealing with the problem of stratifying data so as to model the appropriate spatial level of housing submarket activity and obtain the best fit.
By looking at the preliminary results of the larger study, this paper analyzes some of the tradeoffs between estimating equations across large and small areas. It also makes the point that the level of stratification chosen for regressing housing data should depend on the nature of the variable whose hedonic price is of interest. In this case, a relatively sparse and heteroskedastic distribution of properties with hazard zone designations means that, while estimating localized hedonic equations yields a good fit, the coefficients on the hazard variables are often meaningless in the narrowly defined strata where hazard properties are scarce. Oversampling certain hazard-prone zip codes in the first stage and hazard zone properties in the second stage of sampling helps remedy this to a certain extent, but not enough. Therefore, pooling data may be the best method for data sets in which the variable of interest has an irregular distribution across the landscape and in which there are not enough observations to remedy that through oversampling. Localized stratifications appear to be best when the variable of interest is more abundantly and/or evenly distributed across sample areas. Where it is unclear which method is appropriate, it may be best to try both methods and compare results.
A significant body of research has used hedonic analysis to try to isolate the housing price effects of location in natural hazard zones. However, these studies suffer from several major shortfalls. First, none are able to control for the role of information. Housing markets are only affected by disamenities inasmuch as people know about them. Without assurances of perfect or near-perfect information, a hedonic study on hazards loses meaning. No comprehensive hazard disclosure requirement for residential sales existed prior to the 1998 California law. Second, all studies cover a very limited spatial extent, generally one or several neighborhoods. Not only does this make their models less generalizable, but it limits their conclusions to one or a few housing markets. Given the significant differences in the housing price function between submarkets (Goodman and Thibodeau 1998, Straszheim 1975), omitting the diversity of effects across geographic areas could seriously oversimplify the complexity of the effects of such an environmental encumbrance. Because most of these studies lack spatial breadth, they are less able to draw conclusions about how such price effects vary with demographic attributes such as income, race and education. Nevertheless, the existing research does provide a good road map for the range of effects to be expected and the methodological pitfalls to avoid.
The major studies on the housing price effects of hazard disclosure (Cross 1985, Palm 1982, 1981) are inadequate, because prior to the 1998 California law, disclosure requirements were ill-defined and maps were not readily available, so disclosure was routinely ignored or glossed over. Therefore, it is not surprising that both authors found no effect from hazard disclosure. Both also conducted studies that were small in spatial extent and included only wealthy, high-demand neighborhoods, which theory predicts would be the least likely to see a negative capitalization effect. This again underlines the need for including a diversity of neighborhoods, and hence a large spatial extent, in this type of study. Palm's studies used hedonic analysis along with interview techniques, but the hedonic analysis, limited by the technology at the time, was unable to parameterize enough variables to control for much of the locational and neighborhood variance.
Several studies used hedonic methods to see if properties in the FEMA 100 year floodplain sell for less than comparable properties outside the floodplain without actually controlling for information. MacDonald et al. (1987) found a selling price differential of between 6 and 8 percent, depending on the value of the home. They found this differential consists of the capitalized cost of insurance premiums plus an "option price" for risk aversion, which is a reflection of the perception of potential future damages over and above what the insurance policy pays. Shilling et al. (1985) found that floodplain location for NFIP-participants in Louisiana reduced home values on average by 6.4 percent from comparable houses outside the floodplain. They found that this reduction in value equaled the capitalized value of insurance costs calculated at a 5 percent discount rate, but that an "option price" appeared when a higher discount rate was used. Damianos and Shabman (1976) also used hedonics to look at comparable properties within and outside of the floodplain, concluding that floodplain location lowered residential property values by 8 to 12 percent. Donnelly (1989) concluded that floodplain location reduced sales price by 12 percent in his study area in Wisconsin. Muckleston (1983) is one of the few authors who found no appreciable effect on price from floodplain location. He examined property appreciation rates of homes inside and outside of floodplains in western Oregon and concluded that neither flood hazard nor floodplain regulations affected property value. However, this study has been criticized for not controlling for enough variables in analyzing the effect of floodplain location (Holway and Burby 1990). These studies suffered because they could not control for information. That is, lacking an adequate disclosure law, their data sets included transactions where the parties both did and did not know about the hazard designation at the time of sale.
They also suffered because they limited themselves to a single market, or just a few linked markets, limiting their explanation of the diversity of price effects. Only one study covered geographic areas diverse enough to be able to draw conclusions about how price effects vary with other factors. Tobin and Montz (1994) found that recent occurrence of a flood reduced transacted property values in their study sites in Pennsylvania, California and Illinois. The magnitude of this relationship was found to be a function of flood severity as well as several local factors not related to flooding. Two of the local factors of importance were the demand for housing and the availability of flood-free land. They were able to make such conclusions because they included diverse geographical areas and stratified their data by housing submarket to better model that variation. Nevertheless, this study suffers from poor statistical methods. Rather than combining a hedonic analysis of housing prices (cross sectional) with analysis of difference in prices before and after a flood event (longitudinal), the hedonic and the analysis of difference are separated, making the longitudinal comparisons less reliable. Moreover, the fit of the hedonic equations is very poor, with R squared values from .17 to .56, due to the lack of variables.
Economists including Rosen (1974), Quigley and Kain (1970) and Griliches (1971) developed the concept of hedonics, which is based on the idea that the observed price of a good represents a bundle of implicit attribute prices. That is, the market price of a good can be thought of as the sum of the unobserved prices of a large bundle of attributes. Implicit prices are reflections of consumers' willingness to pay (WTP) for those attributes under equilibrium. Rosen (1974) interpreted the hedonic function as an equilibrium price schedule for an attribute. He interpreted the partial derivatives of the price function with respect to an attribute as the market-clearing marginal price for that attribute. Or, put more simply, it is the price for a one unit increase in that attribute at that particular quantity. The hedonic equation is based on the concept that households maximize their utility by moving along the marginal price schedule for each attribute until they reach the point where marginal WTP for an additional unit of each characteristic equals the marginal implicit price of that characteristic (Harrison and Rubinfeld 1976).
Because hedonic analysis can be used to estimate implicit prices, it is very useful in real estate and land use research. Hedonics allows the price of housing or land to be disaggregated into a series of implicit prices for various structural, neighborhood and locational attributes. Since most of those attributes can be quantified, their relationship to observed price can be modeled and marginal values can be given to those attributes. Not only does this help more accurately model housing and land prices, but it is instrumental in valuing amenities and disamenities that have no market price. For instance, hedonics can be used to determine the marginal WTP function for positives, like additional open space and clean air, or for negatives, like presence in a hazard zone, or proximity to a hazardous waste site. The WTP function can then be used, if adequate household data is available, to derive aggregate consumer surplus values and to measure costs and benefits.
Econometrically, hedonic price indices are estimated by regressing observed market price on a vector of neighborhood, locational and structural attributes, in addition to the (dis)amenity of interest, using ordinary least squares or weighted least squares. Once the hedonic price equation has been estimated, each household’s willingness to pay for a marginal change in the attribute can be determined by taking first order conditions for utility maximization subject to the budget constraint. This gives the partial derivative of the hedonic price equation for housing with respect to a marginal attribute change in equilibrium:
$$W_a(h) = \frac{\partial p(h)}{\partial(-a)},$$
where $W_a(h)$ is the WTP for a marginal change in the amenity, $p(h)$ is the hedonic equation and $a$ is the level of the (dis)amenity.
Calculated separately for each household this derivative is an estimate of the household’s WTP for marginal increases in the amenity. A marginal WTP function for all households can then be determined by second step regression of this derivative against the amenity in question and demographic variables, including income. This yields a series of functions, analogous to an inverse demand curve, from which aggregate lower and upper bound changes in economic surplus can be estimated. In this case, only first step regression will be used since the purpose of this study is simply to see if hazard disclosure has been capitalized negatively into land markets.
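As a rough illustration of the first-step regression, the sketch below fits a semi-log hedonic equation by ordinary least squares and reads the hazard-zone coefficient as an approximate percentage price effect. The eight homes, the attribute set and the "true" coefficients are invented for the example and are not taken from the study's data; the small `ols` helper is a stand-in for whatever statistical package is actually used.

```python
import math

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Hypothetical homes: (structure size in 1000 sq ft, bedrooms, in hazard zone).
homes = [(1.2, 2, 0), (1.5, 3, 1), (1.8, 3, 0), (2.1, 4, 1),
         (2.4, 4, 0), (1.6, 2, 1), (2.0, 3, 0), (2.6, 5, 1)]

# Invented implicit prices used to generate noise-free sale prices:
# intercept, size, bedrooms, hazard-zone dummy.
TRUE_BETA = [11.0, 0.40, 0.03, -0.05]

X = [[1.0, s, b, h] for s, b, h in homes]
prices = [math.exp(sum(c * v for c, v in zip(TRUE_BETA, row))) for row in X]

# Semi-log hedonic regression: ln(price) on the attribute vector.
beta = ols(X, [math.log(p) for p in prices])

# In a semi-log model, exp(coefficient) - 1 is the proportional price effect
# of the dummy, holding the other attributes fixed.
hazard_effect_pct = (math.exp(beta[3]) - 1.0) * 100.0
```

Because the synthetic prices contain no noise, the regression recovers the invented coefficients; with real transactions the hazard coefficient would carry a standard error that determines whether the discount is significant.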
The first step of the hedonic analysis involves identification and quantification of all the non-hazard attributes that serve as control variables. These include vectors of structural features (e.g. number of bedrooms and bathrooms, pools or other special features), site features (e.g. views, slope, lot size, sewer and utility access) and location features (e.g. proximity to urban centers, highways and amenities, as well as distance from disamenities). Next, a proper functional form for the hedonic price function must be chosen. Although some economists believe that proper segmentation of submarkets can yield approximate linearities of price functions across the range of data for each submarket (Dale-Johnson 1980), it is generally accepted that such linearities do not occur, because consumers are not able to freely untie and repackage bundles of attributes (Rosen 1974, Freeman 1979). While more complex non-linear functional forms may yield a much better fit, this is done at the expense of easy interpretation and, moreover, the best-fit specification may not “explain” the effects of any specific variable (Cassel and Mendelsohn 1985). Semi-log and trans-log models are also considered to be easily interpretable. In the former, coefficients translate into percent changes in the response variable for a marginal change in the predictor. For the latter, coefficients can be interpreted as elasticities. Nevertheless, such transformations are important and will be explored for their contribution to fit. After the methods of Halvorsen and Pollakowski (1978) and Bender et al. (1980), one way of gaining insight into best fit functional forms is by Box-Cox maximum likelihood analysis of dependent and/or independent variables. The value of λ yields information on the best fitting form, with λ=1 implying a linear specification, λ=.5 implying a square root transformation and λ=0 implying a natural logarithmic form.
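The Box-Cox search can be sketched as a grid search over λ that transforms the dependent variable, fits a regression and compares concentrated log-likelihoods. The example below is a minimal illustration with one regressor and synthetic data that are log-linear by construction; the grid, the data and the function names are assumptions for the example, not the procedure of the cited authors.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a positive series; lam == 0 is the log form."""
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]

def residual_sum_of_squares(x, z):
    """RSS from a one-regressor least squares fit of z on x."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxz = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z))
    slope = sxz / sxx
    intercept = mz - slope * mx
    return sum((zi - intercept - slope * xi) ** 2 for xi, zi in zip(x, z))

def profile_loglik(x, y, lam):
    """Concentrated log-likelihood of lam, up to an additive constant.

    The (lam - 1) * sum(ln y) term is the Jacobian of the transformation,
    which makes likelihoods comparable across values of lam.
    """
    n = len(y)
    rss = residual_sum_of_squares(x, box_cox(y, lam))
    return -0.5 * n * math.log(rss / n) + (lam - 1.0) * sum(math.log(v) for v in y)

# Synthetic data: ln(y) is linear in x plus a small alternating disturbance.
x = list(range(1, 11))
y = [math.exp(1.0 + 0.3 * xi + (0.02 if i % 2 == 0 else -0.02))
     for i, xi in enumerate(x)]

grid = [i / 4 for i in range(-4, 5)]          # -1.0, -0.75, ..., 1.0
best_lambda = max(grid, key=lambda lam: profile_loglik(x, y, lam))
```

For these synthetic data the search selects the logarithmic form, as expected from how they were generated. With real sale prices the likelihood surface is usually flatter, so the neighbouring values of λ are worth inspecting as well.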
One important question is whether the hedonic price function varies by housing sub-market. Many economists (e.g. Dale-Johnson 1979, Goodman and Thibodeau 1998, Straszheim 1974) have found that because there is a series of separate compartmentalized housing markets in any urban area, there is a need to estimate separate, stratified hedonic functions for each submarket. Stratifying geographically can dramatically increase the goodness of fit of estimated models because the relationship between attributes and price can vary significantly between housing market segments. That is, the marginal price of a third bathroom may vary significantly between two neighborhoods. If this were not the case, it would imply that there is perfect substitutability between housing markets in different locations. This is clearly not the case because structural types and neighborhood amenities vary by location, and those location-dependent attributes cannot easily be overcome because of structural fixity (Goodman and Thibodeau 1998). A significant question, though, is how to define a submarket both geographically and temporally. While it is no longer accepted that all housing markets can be defined perfectly by one single housing price function, there is considerable debate as to what constitutes the appropriate spatial and temporal extent of a housing market. Submarkets could be defined in terms of contiguous areas based on similarities in structural characteristics, fixed neighborhood characteristics or population demographic characteristics. In many cases, market segmentation results from past or current racial discrimination (King and Mieszkowski 1973).
Goodman and Thibodeau (1998) devise a methodology in which housing submarkets are defined based on the quality of the public school of the district in which a house is located. For every possible contiguous sub-district area combination, they look to see if there is a significant coefficient on the cross product for square footage and school test score. If two contiguous sub-districts are regressed and the cross product is insignificant, that means those two districts are part of the same sub-market. If the coefficient is significant, they are assigned to two different sub-markets. Or, put differently, if the implicit price of square footage is found to vary depending on test scores, it can be determined that two contiguous areas are not part of the same market. Of course, the problem with this methodology is the vast number of possible combinations of tests of contiguous areas that must be performed. For Dallas county, this would require 2282 tests.
This study uses the hedonic method to assess the effects of hazard disclosure both cross-sectionally and longitudinally. The cross-sectional analysis is used to compare properties within and outside of designated AB 1195 hazard zones, adjusting for all other relevant value factors. In this case, transacted property price was regressed on vectors of locational, neighborhood and structural characteristics, as well as dummy variables representing whether the property lies in designated flood and/or fire zones. Significant negative coefficients on one or both of those variables are an indication that there is some negative association between hazard zones and property values, adjusting for all other included factors. However, such a result would not necessarily assign causation to disclosure under AB 1195. Property values may be lower in response to other indicators of hazard, such as previous occurrence of hazard in the area, brush clearance/fire danger placards, local knowledge of potential hazards, or visual cues. Therefore a longitudinal approach is also needed. By comparing the magnitude of the coefficients on hazard for the groups of sales before and after June 1998, it can be determined whether the premium differs significantly between the two periods. This was done both by stratifying regressions and by including period as a dummy variable. Both methods will eventually be compared using F tests.
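One way to write the pooled, period-dummy variant of this comparison is sketched below. The semi-log dependent variable, the symbols and the interaction term are assumptions made for illustration, since the text does not spell out the exact specification.

```latex
\ln P_i = \alpha + \mathbf{x}_i'\boldsymbol{\beta} + \gamma H_i + \delta D_i
          + \theta \,(H_i \times D_i) + \varepsilon_i
```

Here $P_i$ is the transacted price, $\mathbf{x}_i$ the vector of structural, neighborhood and locational controls, $H_i$ equals 1 if the property lies in a designated hazard zone, and $D_i$ equals 1 if the sale occurred after June 1998. Under this form $\gamma$ is the hazard-zone price differential before disclosure, $\gamma + \theta$ is the differential afterwards, and a test of $\theta = 0$ asks whether the differential changed between the two periods.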
Previous to desktop GIS technology, hedonic analyses were greatly limited by difficulties in quantifying neighborhood and locational attributes for large data sets. GIS today allows for the systematic wide-scale quantification of location and neighborhood variables across vast geographic areas, without the loss of precision. Additionally, the creation of large digital geo-spatial data sets allows for the relatively easy coding of numerous locational and neighborhood attributes across large spatial extents, rather than having to search for, order and put into digital format data sets available from individual counties and cities. While many important geo-data sets that could significantly improve a hedonic analysis are still only available from local governments, enough data is available universally for the country to make a large-extent hedonic study practicable without having to integrate a variety of inconsistent local data sets.
By automating and simplifying the assigning of attributes, large, geographically disparate areas can be analyzed together and conclusions can be drawn across diverse cross sections of the population and the housing market. Interestingly, though, only very recently have researchers begun to integrate hedonic analysis with GIS. Being at the California state level, this study represents perhaps the largest-extent hedonic study to date. In doing this it has laid out a methodology for looking at the housing price effects of large-extent spatial phenomena and determining the diversity of these effects on different areas. It has also laid out a preliminary and still in-progress means of stratifying hedonic equations so as to model the pertinent level of housing submarket activity.
This study is intended to draw on a representative cross section of California’s populations and real estate markets so as to get a picture of the diversity of effects of hazard disclosure. Getting such a cross section based on relevant attributes is extremely difficult given the size and heterogeneity of the state. To do this, it was decided to employ a two stage cluster sampling methodology, utilizing a stratified random sample of zip codes, followed by a random sample of transacted properties within those zip codes, stratified by location in or out of hazard zones. Allocation of samples in this way is designed to reduce selection bias as well as the standard errors of the population parameters.
For a state the size of California (with roughly 2000 zip codes), getting a representative cross section is not a trivial task. The intent was to sort the zip codes of California into a matrix based on the hazard zones they contain, population density (urban-suburban-rural) and housing market segment, and sample from the cells of the matrix. Arc View, Access, Excel and S-Plus were chosen as the programs for handling the data.
The first task of this phase was to input TIGER base map layers that would be needed for reference, including zip code boundaries, county boundaries and major highways. The next set of data obtained was on the AB 1195 designated hazard zones. The hazard zones included are the FEMA class A flood zones, state responsibility fire zones (SRAs) and local responsibility fire zones, or Very High Fire Hazard Severity Zones (VHFHSZs). Flood zone outlines came from FEMA’s Q3 digital data set. Statewide VHFHSZ and SRA data files were downloaded from the California Department of Forestry and Fire Protection web site as Arc Info export files.
Next, the needed zip code level information was joined to the existing zip code layer database file. This file already had zip code area and 1997 population. Therefore, it was trivial to create a column assigning average population densities per square mile to each zip code. It was decided that recent median transacted sales price would be a good metric for housing market segment classification. Data of median home sales price and price per square foot for 1999 and 1998 by zip code were obtained from the Rand California web site. About 700 of the 1900 California zip codes, most of them rural, had no median housing price data available, probably because of the low number of transactions. The resulting map can be seen in figure 1. While the map may seem to have large gaps with no data, most of these zip codes are either predominately public land (i.e. national forest), or so sparsely populated as to have few transactions, a quality that is not desired for sample zip codes. The next step was to assign hazard zone metrics to each zip code. Using Arc View’s “tabulate areas” command, the area (in square meters) of flood, SRA and VHFHS zones was calculated for each zip code. This was then divided by the area of the zip code to get a percentage area by zip code for each hazard zone.
Figure 1. Median 1999 Home price by zip code

Categorization intervals were then chosen for the attributes by which zip codes would be stratified. First, a sampling population was defined as the subset of zip codes where data existed for 1999 transacted property price and for which population densities were greater than 60 persons per square mile. This proved to be 1065 zip codes, or about 56% of California's zip codes. This clearly introduced a selection bias against rural, low population regions, but there is a certain utility to this bias because these are the same areas where the data set of recent property transactions is likely to be sparse. For these 1065 zip codes, the intervals were determined that broke population density and median transacted property price into roughly equal quantiles. Intuition was used to adjust these intervals slightly, making them more interpretable. The resulting intervals in Table 1 were decided upon, yielding a nine-cell matrix. The marginal totals in Table 2 show how these intervals broke the data into roughly similar, but not exactly equal divisions. This adjustment was necessary especially in the case of population density, since the quantile approach would have resulted in the lowest density category favoring extremely sparsely populated zip codes, where property transactions are too few to sample. Biasing the categories in favor of more dense areas may result in undersampling of very rural areas, but this is not a problem for this study since there are too few properties in those zip codes to be useful. Clearly there is a desire to see how disclosure impacts suburban and urban-wildland fringe land, but the most rural markets are generally not where development is an issue. The zip codes were also stratified by four hazard categories: only flood zones present, only fire zones present, both present and neither present. In this way, all categories were mutually exclusive.
The “presence” of a hazard in a zip code was assigned based on whether the percent of hazardous land in the zip code exceeded a given threshold. For floods and urban fire zones, this threshold was 5% and for wildland fire areas it was 25%. A larger threshold was used for wildland fire areas because these zones are usually mapped in much larger units and because they fall in rural areas where there tend to be far fewer properties and property size tends to be larger. So, to ensure that a large enough number of properties fell in the SRA zone, a large amount of SRA land was needed, while in typically more densely populated floodplain and VHFHSZ land, a lower threshold was needed. The number of zip codes falling into each hazard category is given under “Total” for each 3x3 matrix in Table 1. As shown, the number of zip codes in the fire category is much greater than for flood.
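The classification rule just described can be sketched in a few lines. The function and argument names are hypothetical; the thresholds (5% for flood zones and VHFHSZs, 25% for SRAs) and the four mutually exclusive categories come from the text, and the example areas are invented.

```python
FLOOD_THRESHOLD_PCT = 5.0    # FEMA flood zones
VHFHSZ_THRESHOLD_PCT = 5.0   # urban (local responsibility) fire zones
SRA_THRESHOLD_PCT = 25.0     # wildland (state responsibility) fire zones

def percent_area(hazard_area_m2, zip_area_m2):
    """Share of a zip code's area covered by a hazard zone, in percent."""
    return 100.0 * hazard_area_m2 / zip_area_m2

def hazard_category(flood_pct, vhfhsz_pct, sra_pct):
    """Assign a zip code to one of four mutually exclusive hazard categories."""
    flood = flood_pct > FLOOD_THRESHOLD_PCT
    fire = vhfhsz_pct > VHFHSZ_THRESHOLD_PCT or sra_pct > SRA_THRESHOLD_PCT
    if flood and fire:
        return "both"
    if flood:
        return "flood only"
    if fire:
        return "fire only"
    return "neither"

# Invented example: a 100 km^2 zip code with 6 km^2 of flood zone and
# 10 km^2 of SRA land (below the 25% wildland threshold).
example = hazard_category(percent_area(6e6, 1e8), 0.0, percent_area(1e7, 1e8))
```

In this invented case the zip code counts as "flood only", because 10% SRA coverage does not reach the higher wildland threshold.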
Table 1. Number of Zip Codes Categorized by Population Density and House Price
Table 2. Sum of Matrices Across Hazard Zones with Marginal Totals
With the population of zip codes stratified, samples were pulled from each cell, except for the cells in Table 1 shaded in gray. These strata simply have too few observations and so are considered to be anomalous, or representative of insignificant populations. The next question was how to sample from this matrix so as to draw on a representative sample across the categories. The question was whether to choose by self-weighting proportional allocation (i.e. the sample size of each stratum is proportional to the stratum's population, or every nth sample) or by another rule, such as equal numbers of samples per cell. For the self-weighting method, sampling weights are the same across groups and are equal to the reciprocal of the proportion that is sampled. When appropriate, self-weighting samples are desirable because they yield smaller variances and robust sample statistics (Kish 1992). If proportional allocation is not used, then the unequal sampling rate for each cell can be dealt with by assigning sampling weights to observations from each group (Lohr 1999). Sampling weights ($w_{hj}$) can be assigned to observations based on their strata, where $w_{hj} = N_h / n_h$, that is, the population of stratum $h$ divided by the number of observations sampled from that stratum. Observations in the regression analysis would then be weighted according to their strata.
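The weighting rule can be illustrated with a short sketch. The stratum names and counts are invented; only the formula (stratum population divided by stratum sample size) comes from the text.

```python
def sampling_weights(population, sampled):
    """Weight for every observation in stratum h: N_h / n_h."""
    return {h: population[h] / sampled[h] for h in population}

# Hypothetical strata with 90 and 45 zip codes respectively.
population = {"suburban": 90, "rural": 45}

# Proportional (self-weighting) allocation: one in nine from each stratum.
proportional = sampling_weights(population, {"suburban": 10, "rural": 5})

# Equal allocation: ten from each stratum, so weights differ by stratum.
equal = sampling_weights(population, {"suburban": 10, "rural": 10})
```

Under proportional allocation every observation gets the same weight (9.0 here), which is why no first-stage weights are needed; under equal allocation the rural observations would be down-weighted (4.5) relative to the suburban ones (9.0).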
Another question was how the four hazard category strata should be represented. Clearly, some zip codes with flood zones and some with fire zones are required. However, it was not clear that zip codes with both hazards or with none needed to be sampled representatively since almost all zip codes with hazardous land also had hazard-free land and since flood and fire zones are usually geographically mutually exclusive. In fact, it was found that even excluding zip codes from the “No hazard” category, only about one transaction in five fell within hazard zones. Therefore, it was decided not to sample from zip codes in the “no hazards” category. This did have the result of undersampling densely urban zip codes, because the “no hazard” stratum contains the highest proportion of zip codes in the highest population density category, relative to those with hazards. This however, is not a problem, since far fewer dense urban zip codes are needed, given their large numbers of transactions relative to suburban and rural zip codes.
Eventually, it was decided to take a proportional sample of zip codes from all but the no hazards grouping of zip codes. The proportional approach was chosen because first, it would eliminate the need to use first stage sampling weights. It was also chosen because, given that most of the dense urban zip codes were in the excluded no hazards grouping, there was a higher number of suburban and rural zip codes relative to dense urban zip codes in the remaining nine cell matrices. This meant that rural and suburban zip codes would be oversampled, which was a desirable outcome, since lower population density areas have a lower number of transactions than dense urban areas, and so far more zip codes are needed to achieve the same statistical power for those types of areas. Based on the estimated number of zip codes needed, it was decided to randomly sample one of every nine zip codes (a .1111 sampling rate) from the population of zip codes within each cell. This resulted in a stratified random sample of 63 zip codes for the whole state.
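A minimal sketch of the first-stage draw follows, assuming simple random sampling without replacement inside each cell and rounding of the one-in-nine quota to the nearest whole zip code. The text does not say how fractional quotas were handled, so the rounding rule, the cell names and the zip code identifiers are all hypothetical.

```python
import random

def sample_cells(cells, every_nth=9, seed=0):
    """Draw roughly one of every `every_nth` members from each cell."""
    rng = random.Random(seed)            # fixed seed for a repeatable draw
    sample = {}
    for name in sorted(cells):           # stable order across runs
        members = cells[name]
        quota = round(len(members) / every_nth)   # assumed rounding rule
        sample[name] = rng.sample(members, quota)
    return sample

# Hypothetical cells of the stratification matrix with placeholder IDs.
cells = {
    "flood/low-density/low-price": [f"A{i:03d}" for i in range(27)],
    "fire/mid-density/mid-price": [f"B{i:03d}" for i in range(45)],
    "both/high-density/high-price": [f"C{i:03d}" for i in range(18)],
}

first_stage = sample_cells(cells)
sample_sizes = {name: len(chosen) for name, chosen in first_stage.items()}
```

With these invented cell sizes the draw yields 3, 5 and 2 zip codes, keeping each cell's share of the sample equal to its share of the population.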
Once those 63 zip codes were chosen, it was necessary to see if some of these zip codes did not adequately fit the criteria needed for the study. For example, it was hoped to avoid zip codes that were overly clustered geographically, or zip codes where no developed areas fell within the hazard zone. First, these 63 zip codes were plotted in Arc View. In the few cases where several zip codes were overly clustered or adjacent, one or several of those zip codes would be discarded and replaced by another zip code from the same cell group. This was done until there were no significant spatial clusters of zip codes. Next, local street themes were overlaid with the hazard themes for the selected zip codes. Zip codes where few or none of the road intersections overlaid hazard zones were likely to have few property transactions in the hazard zones, the density of road intersections being an indication of the density of habitation. These zip codes were eliminated and another zip code from each respective group cell was randomly selected.
Finally, a list of zip codes was ready from which property transactions would be downloaded. Property data were obtained from Metroscan, an online service of TransAmerica Intellitech. However, the first several downloads indicated another serious problem in the sampling. Property data are organized by county, because it is county assessor’s offices that are responsible for keeping track of those records. It soon became evident that while some counties kept adequate or excellent records, many kept records so poor, they were essentially useless for this study. Many rural counties, especially those in far northern California (e.g. Lassen, Shasta, Humboldt, Tuolumne), had extremely limited data fields available, and many of those fields had few records with data in them. Because the data were so poor, it was clear that the ten or so zip codes that fell within these counties had to be replaced. Therefore, a final replacement sampling was done, in which zip codes from “problem counties” were replaced with zip codes from their respective cell groups, but from counties with useable data. Luckily, numerous rural counties, such as Placer, Stanislaus, El Dorado, Nevada, Kern, Tulare and San Joaquin did have excellent data records, so a sample of rural zip codes could still be obtained. The map of the final zip code sample is shown in Figure 2, below.
Figure 2. Sample Zip Codes Selected for First Stage Sampling

In the next stage, property level attributes were mapped and coded. Some of these coded attributes were used to generate stratifications for data sampling, while most were coded for use in regressions. The sections below describe how the property points were assigned a spatial location, coded with spatial and neighborhood attributes, and then sampled. All attribute names, alternate names and descriptions are given in table 3, in the Results section below.
Finally, with the problem of poor county-based data taken care of, individual transaction data could be downloaded. Since not all counties had the same data fields available, the same fields could not be consistently downloaded. Rather, the fields that were common to most counties were downloaded. Those common to all counties included sales price, number of bathrooms, number of bedrooms, assessed structure value, lot square footage, structure square footage, site address, document date, year built, pool and fireplace. Some attributes were common to most counties but had to be imputed for others, including building class, number of stories and total rooms. The group of transactions to download within each zip code was based on time frame, land use and price. Properties were selected for downloading that fell within the time period of January 1997 to the time of downloading (January 2000 and February 2000), that were located in any area zoned residential or vacant land, and that sold for above $10,000. The final criterion was included so as to exclude properties with missing price data and at least some properties that were transferred for a below-market nominal price (e.g. intra-family transfers). For most zip codes, all property transactions meeting these criteria were downloaded, but in some denser zip codes with very many transactions, only a percentage of those were downloaded, that percentage being noted for use in weighting the regressions.
Once those data were downloaded as DBF files, each zip code file was standardized so that all had consistent fields, in terms of number, title and data type. Special care was taken to make sure that address fields were consistent and clean, as they would be used for geocoding. All the sample zip codes in any given county were then joined into county files. These county-level DBF files were then imported in Arc View for geocoding.
Once into Arc View, the county level property record files were geocoded using county-level street files, obtained from 1995 Tiger Line Street Files. Once brought into Arc View, geocoding indexes were generated for the street files, specifying the address and zip code preferences. Each county file was then geocoded separately. Geocoding match preferences were set such that a minimum match score of 70% in spelling recognition was required for a batch match. Anything between 70% and 50% had to be interactively rematched, one at a time. Anything below 50% was discarded. In most counties, between 70 and 90% of all households were successfully geocoded. However in several zip codes where recent development has resulted in many new streets in recent years or months, it was found that 1995 street data were insufficient. For five zip codes where match percentages were particularly bad, the latest year 2000 street files were purchased from GDT. Using these street files resulted in dramatically higher geocoding match rates.
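The match score rule above amounts to a three-way classification. A minimal Python sketch follows; the function name is hypothetical, and the handling of scores exactly at 70 or 50 is an assumption, since the text does not say which side of each boundary they fall on.

```python
def classify_match(score, batch_min=70, discard_below=50):
    """Classify a geocoding match score (0-100) by how it is handled.

    Assumption: a score of exactly 70 counts as a batch match and a
    score of exactly 50 is sent to interactive rematching.
    """
    if score >= batch_min:
        return "batch"
    if score >= discard_below:
        return "interactive"
    return "discard"
```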
Geocoding resulted in the creation of numerous point themes, with each point representing a house. Since each house now had a spatial location assigned to it, it was now possible to derive and assign neighborhood and locational attributes for them. First, it was necessary to join the various county-level address point themes so as to reduce the redundancy of attribute coding work. Since the file resulting from joining all of them would run too slowly on the computer, it was decided to create one join file for addresses in Northern California zip codes (figure 3) and one for Southern California (figure 4). Thus, all spatial attribute operations would only have to be performed twice, and most with far less preparatory work on the second round.
The first task in attribute coding was to quantify the proximity of households to various facilities with perceived negative or positive value. The best source for such facility data was found to be the Geographic Names Information System (GNIS), from the US Geological Survey. For California, the GNIS had a list of over 100,000 places, with their names, type, county and latitude/longitude coordinates. It was chosen to include airports, golf courses/country clubs, hospitals, industrial facilities, libraries, cultural facilities, city parks, shopping centers, schools, marinas and ski areas. Distance was first calculated using a road distance to nearest facility function. This worked well for small numbers of properties, but when attempted for zip code or county level aggregations of data, it generally froze or crashed the computer. Therefore, it was decided instead to rely on straight line distances. These were computed using the point to point spatial join function of the Arc View geoprocessing wizard.
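For readers without Arc View, the straight line distance to the nearest facility can be approximated directly from latitude/longitude pairs. The sketch below uses the haversine great-circle formula, which is a stand-in technique and not the spatial join the study actually used; the function names and the spherical Earth radius are assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_distance_m(house, facilities):
    """Distance from a (lat, lon) house to the closest (lat, lon) facility."""
    return min(haversine_m(house[0], house[1], f[0], f[1]) for f in facilities)
```

This brute-force search compares every house with every facility, so for large files a spatial index would be preferable.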
Figure 3. Sample Zip Codes for Northern California
Figure 4. Sample Zip Codes for Southern California
Figure 5. Sample Property points, hazard zones and facilities, near Pacifica

Many other distances were calculated from features represented in diverse data sources. Distances were also calculated from houses to hazardous waste sites. These data were obtained from ESRI’s Business Map Pro, which in turn gets its data from Claritas. Hazardous waste sites were geocoded by address and zip code. Distance to open space preserves was determined using a layer depicting all public land in California. A query on the data layers was conducted for land jurisdictions open to public recreational use. The vertices of the selected polygons were then decomposed into points using the anyshape2points.ave Arc View script. This was done because the geoprocessing wizard will only calculate distances between points or between points and lines, not between points and polygons. Distances from houses to the nearest open space were then calculated. The distance to the coast and to inland water bodies was determined by decomposing 1995 Tiger polygon files of water bodies and coastal areas to points and using the geoprocessing wizard. Distance to redwood forests was coded in a similar way using data from the California Department of Forestry and Fire Protection (CDF). Distance to major highways was coded by querying for all the major highways from the Tiger California major roads theme and creating a new theme. Since this was a line file, the distance to houses could then be calculated using the geoprocessing wizard. Finally, average rainfall at a household was assigned by overlaying the property point theme with an average precipitation isoline map from CDF.
For several of the distance terms, dummy variables were used instead of absolute distances. This was done for spatial occurrences that, intuitively, only affect properties within a certain distance of them. Dummies were coded for the following: properties within 1500 meters of marinas (to account for the premium on dock-side properties); within 20 kilometers of major ski areas (to account for the vacation-ski house premium); within 800 meters of major redwood forests (to account for the premium from the likely presence of redwood vegetation on the property); within 800 meters of inland water bodies, such as lakes and reservoirs (to account for the premium for living on a lake); and within 300 meters of a hazardous waste site or industrial facility. Figure 5 shows the positions of facilities relative to houses in Pacifica.
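The distance cutoffs listed above translate into 1-0 dummies as in the following sketch. The dictionary keys are hypothetical labels rather than the field names used in the data, and treating a distance exactly equal to the cutoff as "within" is an assumption.

```python
# Cutoffs in meters, taken from the text; the keys are illustrative labels.
THRESHOLDS_M = {
    "marina": 1500,
    "ski_area": 20_000,
    "redwood_forest": 800,
    "inland_water": 800,
    "hazwaste_or_industrial": 300,
}


def proximity_dummies(distances_m):
    """Return a 1-0 dummy per feature type from distances in meters."""
    return {
        name: int(distances_m[name] <= limit)  # assumption: cutoff is inclusive
        for name, limit in THRESHOLDS_M.items()
    }
```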
The next task was to assign some measure of access to businesses and employment opportunities. It seemed intuitive that the traditional method of coding distance to central business district was not appropriate given the dispersion and heterogeneous size of California’s work districts. Rather, an index of business and employment access was developed using the following methods. In ESRI’s Business Map Pro, using the Business Tracker Extension, all California businesses with over $10 million in annual sales were plotted on a map. Business densities were then recorded by drawing circles of varying radii around areas of contiguous businesses. The more dispersed the businesses, the larger the circle needed to contain them, and so the lower the calculated business density per area within that circle. In areas of dense contiguous businesses, the circles needed to contain the contiguous businesses were of much smaller radii, and hence yielded much higher densities. The points themselves were then plotted and each point was assigned a density value depending on what type of business district it belonged to. Thus, San Francisco and parts of Silicon Valley were coded as highly dense business areas, while Los Angeles had no highly dense business districts, but many large and moderately dense business districts, due to the more dispersed nature of that city. Secondary business districts, such as Sacramento and Fresno, were also included but scored with fairly low density values because of the low number of large businesses and because of their dispersion. Once these points were plotted, all business districts were divided into three categories (A, B and C districts), with A being densest and C being least dense (figure 6). Distances were then calculated from each household to the nearest point in an A, a B and a C district. The following index (q) was then generated:
$$q = \ln\left(100{,}000\left(\frac{w_i}{d_i} + \frac{w_j}{d_j} + \frac{w_k}{d_k}\right)\right)$$
where:
$d_i$ = straight line distance to nearest A business district, i (> .5 major businesses/square mile)
$d_j$ = straight line distance to nearest B business district, j (> .1 and < .5 major businesses/square mile)
$d_k$ = straight line distance to nearest C business district, k (< .1 major businesses/square mile)
$w_i$ = number of businesses per square mile for nearest A business district i
$w_j$ = number of businesses per square mile for nearest B business district j
$w_k$ = number of businesses per square mile for nearest C business district k
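A minimal Python sketch of the index follows. It assumes that each density is divided by the distance to its own district (so the A district density is paired with the A district distance), that all distances are positive, and that distances share one unit; the text does not state the unit, and the function and argument names are illustrative.

```python
import math


def business_access_index(w_a, d_a, w_b, d_b, w_c, d_c, scale=100_000):
    """Logged business and employment access index q.

    w_* are business densities of the nearest A, B and C districts and
    d_* are the straight line distances to them (must be positive).
    """
    return math.log(scale * (w_a / d_a + w_b / d_b + w_c / d_c))
```

Because each term is a density divided by a distance, halving the distance to any one district raises the index, while the log compresses the gap between the best placed properties and the rest.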
A similar, alternative index was generated without logging the right side. These indices were compared by plotting each property’s index value on a map. Visually, the log index seemed to track a property’s access to business and employment opportunities fairly well (figure 7). Logging attenuated the extreme influence of proximity to very important business districts such as Silicon Valley, putting diverse properties on a more comparable scale. Without the log scale, the index value for the best placed properties was high enough relative to all other properties that it would have had to be transformed for regression purposes. The plot of the logged index seems to make intuitive sense; properties very close to Silicon Valley and San Francisco have extremely high index values. With distance from those areas, the index decays slowly at first, and then with increasing rapidity at around 30-40 miles, roughly the area of the commute-shed for the Bay Area. The distance decline is attenuated, though, if a secondary or tertiary business district is in the vicinity. For instance, index values for properties along the I-580 corridor decline at a much slower rate with distance from Silicon Valley and, at one point, even experience a smaller spike, indicating the presence of a substantial secondary business district. Hence, the properties near Livermore and Pleasanton have relatively high index values not only because they are near an important secondary district, but also because they are within commuting distance, albeit rather far, of Silicon Valley. This is why some of the remote looking properties shown in far eastern Contra Costa county have considerably higher index values than properties in Solano county, which are a similar distance from a primary business district but have no major secondary district nearby. Overall, although it is only meaningful as a relative number index, this index seems logical and consistent, with no groupings of properties assigned values that seem egregiously wrong.
Figure 6. Primary, Secondary and Tertiary Central Business District points

Figure 7. Log CBD Index Plotted for Northern California
Several educational attributes were coded for each household. Rather than assigning performance measures based on the nearest school, performance measures were assigned based on the school district in which the property fell. While some areas had Unified School Districts, others had separate High School, Middle School and Elementary School Districts. Therefore, some average measure had to be derived. Two measures were chosen for school district performance: the 1999 Academic Performance Index (API) and Statewide Ranking. The API, which ranges from 1-800, summarizes the performance of a school’s students on the STAR academic achievement test. The statewide rank is a number from 1 to 10 that describes which decile of the state’s schools a given school falls in. Mean APIs and ranks were determined by taking the average of all schools for a school district. In areas where there were elementary and high school districts, the scores were averaged to get a mean API and rank.
Next, a spatial join was done to assign census tract level attributes to properties. Projected 1997 tract level data were used from ESRI’s Business Map Pro (in turn obtained from the Claritas Corporation). Attributes coded included population density, percent vacancy, population change from 1990 to 1997, median age, percent black, percent Asian, percent Hispanic, percent owner occupied, percent university educated, percent unemployed, percent executive level workers, percent professional workers, percent laborers, and median household income. All figures were 1997 projections except the first two, from 1990.
Finally, numerous zip code level housing market data were assigned to properties in Excel. These data were obtained from the Rand Corporation and from the California Association of Realtors. Many of the variables mentioned below are extremely similar to each other, being slightly different quantifications of the same phenomenon. It is important to note that these similar variables were coded so as to compare which performed best in different regression models. Since all of these market data were available for different time periods, it was decided that each property transaction would be assigned market data based on the time period in which it took place. To assist in this, the date of sale field was recoded by quarter and year. The zip code level variables included median price per square foot (Ppsqft) and median home price (MedprZip), both by zip code. These data were available for 1997, 1998 and 1999, so the median price or price per square foot value assigned to a property depended on the year in which the transaction happened. Several indices of market activity were created, based on the number of transactions. The first index (SDINDEX1) was simply the ratio of number of transactions per year to population for the zip code. The second (SDind2) was the ratio of SDINDEX1 to the similar ratio at the county level. The third index (SDind3) was the ratio of SDINDEX1 to the similar ratio at the state level. For all three of these, again, the number of transactions figure used was based on the year of the date of sale. The population estimate, however, was only available for 1997. The ratio of price by zip code to price by state was also coded separately as its own variable, again varying by year of sale (Prratio). While the zip code level indices could be said to be independent of the response in that they are from a different data source, this is likely not the case since a very large proportion of all the property sales for the given time frame are represented in the sample.
Therefore, Prratio, MedprZip and Ppsqft are likely to suffer from simultaneous equation bias. Of those, though, only Prratio was used widely in regressions. Prratio would likely suffer from simultaneous equation bias less than MedprZip and Ppsqft because it is a ratio that includes state level information. A median price variable at the city level was also created, in response to the problem of simultaneous equation bias, but the variable did not perform as well as Prratio, because of the extreme heterogeneity in home prices that is often found within cities.
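The three market activity indices described above are simple ratios, and a short Python sketch may make their relationship clearer. The function and argument names are illustrative, and the figures in the usage note are invented.

```python
def market_activity_indices(zip_sales, zip_pop,
                            county_sales, county_pop,
                            state_sales, state_pop):
    """Return (SDINDEX1, SDind2, SDind3) for one zip code and sale year.

    SDINDEX1: transactions per capita in the zip code.
    SDind2:   SDINDEX1 relative to the same ratio for the county.
    SDind3:   SDINDEX1 relative to the same ratio for the state.
    """
    sd1 = zip_sales / zip_pop
    sd2 = sd1 / (county_sales / county_pop)
    sd3 = sd1 / (state_sales / state_pop)
    return sd1, sd2, sd3
```

For example, a zip code with 50 sales among 1,000 residents in a county with 400 sales among 10,000 residents has a second index of 1.25, meaning its market is 25 percent more active per capita than the county's.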
Finally, a home price index was coded to adjust for trends in housing prices over the three years across which the data spanned. Again, this variable was based on data from the California Association of Realtors. With the threat of simultaneous equation bias in mind, it was decided to base this price adjustment factor on home prices at the city level. The median price for each city in which one of the sample zip codes fell was taken for every fiscal quarter, from Jan 1997 to December 1999. Therefore, if two sample zip codes fell in the same city, they had the same values. These were entered into a matrix in Excel, with city/zip code in the rows and quarter in the columns. In a second matrix, the price adjustment factor was calculated as the given quarter’s median price over the first quarter of 1997’s median price. Hence, quarter 1 of 1997 had a value of 1, and all subsequent price adjustments were defined in relation to that quarter. Using the LOOKUP function in Excel, these were then assigned to property records in the data, based on the records’ city and quarter.
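The two-matrix Excel procedure above can be mirrored with nested dictionaries. The sketch below is an illustration only: the quarter label format, the function names and the example prices in the usage note are assumptions.

```python
def quarter_of(year, month):
    """Label a sale date by calendar quarter, e.g. (1998, 8) -> '1998Q3'."""
    return f"{year}Q{(month - 1) // 3 + 1}"


def price_adjustment_table(median_prices, base_quarter="1997Q1"):
    """Divide each quarter's city median price by the base quarter's.

    median_prices maps city -> {quarter label -> median price}.
    The base quarter therefore gets a factor of exactly 1.
    """
    return {
        city: {q: price / by_quarter[base_quarter]
               for q, price in by_quarter.items()}
        for city, by_quarter in median_prices.items()
    }
```

A record's factor is then found with `table[city][quarter_of(year, month)]`, which plays the role of the LOOKUP step. With a hypothetical city median of 200,000 in the first quarter of 1997 and 250,000 in the third quarter of 1998, the later factor is 1.25.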
The previously mentioned tasks resulted in a single data table for all the successfully geocoded records from the sample zip codes in Northern California, with approximately 35,000 records. The field heading names and descriptions are given in table 3. With the data now in place, the records could be prepared for sampling. Before any remediation could take place, data were converted to the data type that would be used in the regression. Hence, all categorical variables were converted to 1-0 dummy variables. Any numerical field with a non-numeric character in it had the offending record deleted.
Table 3. All Attribute Names, Alternate Names and Descriptions
| Variable | Alternate name in text | Description |
| --- | --- | --- |
| PRICE | | Transacted selling price of property |
| FLOOD | | 1= in the FEMA Class A Flood Zone; 0= not in that zone |
| FIREBOTH | FIRE | 1= in either the VHFHSZ or SRA fire zones; 0= not in those zones |
| AFTER | | 1= transacted after June 1998; 0=before June 1998 |
| ASSDSTCT | | Appraised value of structure |
| BATHTOT | | Number of bathrooms |
| BEDROOMS | | Number of bedrooms |
| BLDGCLS | | Building class |
| Pool | | 1= pool, 0=no pool |
| TOTALRMS | | Total number of rooms |
| TOTALSF | | Total structure square footage |
| LOTSQFT | | Total lot square footage |
| NOSTORY_ | | Number of stories |
| yrsold | | Years old |
| antique | | 1= structure more than 75 years old; 0= less than 75 years old |
| VIEW | | 1= view, 0= no view (only available for some counties) |
| FIREPL2 | | 1= fireplace; 0=no fireplace |
| D2AIRPORT | | Distance to nearest airport (by straight line) |
| D2GOLF_CC | | Distance to nearest golf course or country club |
| D2HOS | | Distance to nearest hospital |
| D2INDTRY | | Distance to nearest municipal industrial facility, stadium or treatment plant |
| nearfclt | | Within 300 meters of a hazardous waste site or industrial facility |
| D2LIB | | Distance to library |
| D2PARK | | Distance to municipal park |
| D2SCHL | | Distance to nearest school |
| D2SHOPCNT | | Distance to nearest shopping center |
| NRMARINA | | Within 1500 meters of a marina |
| D2RURFIRE | | Distance to nearest CDF rural fire station |
| D2HAZWST | | Distance to nearest hazardous waste site |
| D2CULTL | | Distance to nearest theater, museum or cultural center |
| NRSKI | | Within 20 km of major ski area |
| D2OPENSP | | Distance to nearest open space |
| MEAN_API | | School Academic Performance Index (1-800) |
| MEAN_RANK | | School state relative ranking (1-10) |
| T_POPD | | 1990 population density by tract |
| T_P_VAC | | 1990 percent vacancy by tract |
| POPCHG9097 | | Population change between 1990 and 1997 by tract, based on projections |
| TMEDAGE97 | | Projected 1997 median age by tract |
| T_PBLK97 | BLACK | Projected 1997 percentage African American population by tract |
| T_PASIA97 | | Projected 1997 percentage Asian-American population by tract |
| T_PHISP97 | HISPANIC | Projected 1997 percentage Hispanic population by tract |
| T_POWNOC97 | | Projected 1997 percentage owner occupied housing by tract |
| T_PUNIV97 | | Projected 1997 percentage university educated population by tract |
| T_PUNEMP97 | UNEMP | Projected 1997 percentage unemployment by tract |
| T_PEXEC97 | | Projected 1997 percentage of executive level workers by tract |
| T_PPROF | PROF | Projected 1997 percentage of professional workers by tract |
| T_MHHINC97 | INCOME | Projected 1997 median household income by tract |
| D2CBDA | | Distance to primary business district |
| D2CBDB | | Distance to secondary business district |
| D2CBDC | | Distance to tertiary business district |
| CBDINDEX | | Central Business District Index |
| CBDIND2 | | Logged Central Business District Index |
| RAIN | | Mean precipitation per year at a given location |
| D2COAST | | Distance to nearest coastline |
| D2HIWAY | | Distance to nearest highway |
| tr_af_fr | | 1= fire occurred within 10 km within last 5 years |
| tr_af_fl | | 1= flood occurred within 5 km within last 5 years |
| Z00M2RDWD | | 1=within 200 meters of redwood forest |
| Z00M2WTBD | | 1= within 200 meters of lake, coastline, river |
| Ppsqft | | Median price per square foot, by zip code |
| SDindex1 | SDINDEX1 | Number of transaction by zip code over the population |
| SDindex2 | SDINDEX2 | Sdindex 1 / similar index at the county level |
| SDind3 | SDINDEX3 | Sdindex1 / similar index at the state level |
| prratio | | Ratio of median price per zip code to median state price |
| medprCty | PRBYCTY | Median home price by city |
| padj-abs | | Percentage change in medprCty by quarter from first quarter price |
| weight | | Regression weights |
Then, all records of vacant parcels were separated from the residential parcels. The status of a parcel was determined from its land use designation. Although a few of the designated “vacant” parcels did have structures on them, the structures were usually an insignificant part of the value of the parcel. A new file of vacant parcels was created with 677 records, to be used in separate regressions. Back in the developed parcel spreadsheet, records with important missing values were eliminated. All those records with no price, no sales date, or with most property information left blank (i.e. no bedroom, bathroom or square footage numbers) were eliminated. Property records where assessed structure value was considerably greater than the selling price were also eliminated. These records likely represented intra-family sales or similar non-market transfers and so are useless, or even misleading, for the purposes of this study.
Next, values were imputed for records that had less important missing values. Many counties lacked attributes in their assessor’s data sets, such as building class, number of stories, or total rooms. However, enough counties did have them that the values could be somewhat reliably imputed, using regression, for the other counties. Other missing values were from fields available in all counties, but missing for several given records.
It is important to note that even after data remediation, there were still many problems with the data that could not easily be dealt with. For instance, the presence of pools appeared to be erratically recorded. Most counties had a dummy field for “Fireplace,” “Pool” and for “View”, but many counties left the cells of these fields almost entirely blank, clearly under-representing the number of pools and/or views. However, little could be done about this, so the fields were left as is. Views could have been determined by doing a viewshed analysis of the properties on 1:24,000 scale digital elevation models, but this would have been extremely time consuming and likely not worth the effort.
Once an appropriately cleaned subset of property records for Northern California was completed, property records could be sampled. At one point it was considered a possibility that sampling of properties was not needed since there were only about 35,000 records for northern California, a small enough number for the computer to handle. However, this was not done for two reasons. First, with the large number of fields, sorting and regressing 35,000 records was found to be extremely unwieldy. Second, and most importantly, within those 35,000 records, only a small proportion—a little over 5000 of them—were in hazard zones. This meant that the flood or fire variables were effectively “swamped” by other trends—in other words, they could only explain a relatively small proportion of the variance in sales price given their small proportional representation in the population. Rather, it was decided that a relatively equal number of hazard zone and non hazard zone property records were needed in the final sample so that the effects of flood or fire—the main effects in this case—would not be swamped out.
It was decided that in the first set of regressions, data would be stratified by whether the transfer date was before or after the law, and by whether the property was in a flood, fire or no hazard zone. Using the assumption that the number of transactions before and after the law was similar enough that stratifying by this attribute was not worth the effort, it was decided that the weighting of samples would be based on the proportion of samples falling in each zone by zip code. The few properties falling in both flood and fire zones were counted as falling in only one zone.
To deal with this rather complex weighting structure, an Excel spreadsheet was created that automatically calculated the weights of properties within each sample cell (Table 4). In order to use this spreadsheet, however, counts were needed for the number of records within each cell. To do this, a series of SQL queries were conducted on the data in Access. Once the counts were known, the next step was to decide how many samples to take from each cell. Then actual records could be randomly sampled and weights assigned to each sample based on its cell. Deciding the number of samples required a judgment call. It was decided that the sampling rate should vary with the population size of the cell. Since the no hazard cells tended to have very high populations relative to hazard cells, they needed to have much lower sampling rates. The sampling rates by cell population for flood and fire zones are shown in table 5 below.
For the non-hazard cells, the Excel formula below describes the sampling rule for one particular row, where column K is the number in both hazard zones for a zip code, L is the number in the non-hazard zones for a zip code, and M is the number of records sampled within both hazard zones for a zip code:
=IF(K3>0, IF(AND(L3>K3),M3, (1/(LOG(L3)*2))*L3), 1/(LOG(L3)*2)*L3)+IF(L3>35,20,0)
In words, this says that if the number of records in the total hazard zone for that zip code is greater than zero and if the number of records outside the hazard zone is greater than the number within it, then return the number of records sampled in the total hazard zone, plus 20 additional records if the number of records in the non-hazard zone is greater than 35. If either or both of the first two conditions (K>0 and L>K) are not met, then return the number of records in the non-hazard zone divided by two times the log of that number, again plus 20 records if there are more than 35 records in the cell. The purpose of this formula is to return roughly as many non-hazard records as there are hazard records. First, though, two conditions have to be met: there must be some records in the hazard zone for that zip code and there must be more non-hazard records than hazard records. If either or both of those conditions are not met (that is, if it is not possible to pick a number equal to the hazard samples, either because there are no hazard samples or because there are not enough non-hazard samples), then a certain proportion of the available non-hazard samples is picked, based on the log function given. The final term helps deal with the fact that the sampling rule as described up to this point would result in fewer non-hazard samples than hazard samples. This happens because there are several zip codes where there are no non-hazard properties, or fewer than the hazard properties. Therefore, additional non-hazard properties would have to be added to the aggregate sample from zip codes in which there is a surplus of non-hazard records. There was approximately a 600 record deficit in this respect; that is, 600 more hazard records would have been sampled than non-hazard records due to the zip codes with few or no non-hazard properties.
20 was selected as the number of non-hazard records to add from zip codes with surplus non-hazard properties because it roughly made up for the deficit when multiplied by the number of zip codes with surpluses. In the end there were 3193 hazard zone samples and 3201 non-hazard zone samples. Of the hazard samples, 1611 were from the flood zone and 1583 from the fire zone.
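The Excel sampling rule for non-hazard cells can be restated as a small Python function, which may be easier to read than the nested IF. Excel's one-argument LOG is base 10, so `math.log10` is used. The function and argument names are illustrative, the result is left unrounded because the text does not say how fractions were handled, and, as in Excel, the rule is undefined when the non-hazard count is 0 or 1.

```python
import math


def non_hazard_sample_size(k, l, m, bonus=20, bonus_threshold=35):
    """Number of non-hazard records to sample in one zip code.

    k: records in the hazard zones, l: records outside them,
    m: hazard records already sampled in this zip code.
    """
    if k > 0 and l > k:
        base = m                          # match the hazard sample size
    else:
        base = l / (2 * math.log10(l))    # log-damped share of non-hazard records
    # Top-up drawn from zip codes with a surplus of non-hazard records.
    return base + (bonus if l > bonus_threshold else 0)
```

For a zip code with 50 hazard records, 400 non-hazard records and 40 hazard records sampled, the rule returns 40 + 20 = 60.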
Table 4. Second Stage Sampling Chart
Table 5. Variable Sampling Rate Chart for Flood and Fire Zones
| Cell population | 0-9 | 10-100 | 105-200 | 205-300 | 305-400 | 405-500 | 500+ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Rate | 1 | .85 | .75 | .65 | .55 | .45 | .4 |
 
 
 
Before the weights could be calculated, one more piece of information was needed. It had to be determined what percentage of each cell’s original sampled population (that is, all houses in the given zip codes fitting the given land use, time frame and price criteria) made it to the final sampling sub-population. First, the percentage of housing records downloaded for a zip code from Metroscan had to be given. In most cases, this was 100%, but in several very large and dense zip codes, 50% or 60% were downloaded. Additionally, the difference between the number in a cell downloaded and the number making it to the final spreadsheet had to be noted. This would account for records lost in geocoding and in data cleaning. Weights could then be calculated by dividing the population of each cell by the number of samples taken from the cell, multiplied by the percentage of property records included in the final cell population, or:
$$w_{hj} = \frac{N_{h1}}{n_{h1}} \times \frac{N_{h2}}{N_{h1}}$$
where
$w_{hj}$ = the sampling weight of sample member $j$ in stratum $h$
$N_{h1}$ = the original sample population size of stratum $h$
$N_{h2}$ = the sample sub-population size of stratum $h$ following geocoding and cleaning
$n_{h1}$ = the number of samples taken in stratum $h$
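As a minimal sketch of the weight calculation above, the Python function below computes the weight shared by all sample members of one stratum. The function name, argument names and the example counts are illustrative assumptions, not the spreadsheet formulas actually used in the study.

```python
def sampling_weight(n_original, n_cleaned, n_sampled):
    """Sampling weight for every sample member in one stratum (cell).

    n_original: original sample population size of the stratum (N_h1)
    n_cleaned:  sub-population size after geocoding and cleaning (N_h2)
    n_sampled:  number of samples taken in the stratum (n_h1)
    """
    if n_original <= 0 or n_sampled <= 0:
        raise ValueError("population and sample counts must be positive")
    expansion = n_original / n_sampled        # N_h1 / n_h1
    retained_share = n_cleaned / n_original   # N_h2 / N_h1
    return expansion * retained_share


# Hypothetical cell: 400 records originally, 360 left after cleaning, 90 sampled.
example_weight = sampling_weight(400, 360, 90)
```

Keeping the two factors separate mirrors the formula as written, even though they reduce algebraically to the cleaned sub-population divided by the number of samples.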
Next, a unique random number field was generated for all records, to be used for random sampling. The whole spreadsheet was then brought into Access, which is a more efficient query engine, and a query was built for each cell. Each query specified the zip code and a 1 or 0 for each of the flood and fire zones, and generated a table of all records in the cell, sorted by the random number field. The first n records, where n is the number of samples to be taken in that stratum, were then selected from the table and pasted into a new table. A weight column was added to the new table, giving the weight corresponding to the stratum to which each record belonged. The table was then ready to be imported into S-Plus for statistical analysis.
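The per-cell selection described above can be sketched in Python as follows. This is an illustration of the same idea (filter a cell, sort on a random key, keep the first n records), not the Access queries used in the study; the field names `zip`, `flood`, `fire` and `id`, the example records and the weight value are all hypothetical, and the random key is generated inside the function rather than stored as a field beforehand.

```python
import random


def sample_cell(records, zip_code, flood, fire, n_samples, weight, seed=None):
    """Randomly pick n_samples records from one zip/flood/fire cell."""
    rng = random.Random(seed)
    # Equivalent of the per-cell query: zip code plus 1/0 flags for each zone.
    cell = [r for r in records
            if r["zip"] == zip_code and r["flood"] == flood and r["fire"] == fire]
    # Equivalent of the random-number field: one random key per record.
    keyed = [(rng.random(), i, rec) for i, rec in enumerate(cell)]
    keyed.sort()  # sort on the random key; the index breaks any ties
    # Keep the first n_samples records and attach the stratum weight.
    return [dict(rec, weight=weight) for _, _, rec in keyed[:n_samples]]


# Hypothetical records: 12 sales in one zip code, 8 of them in the flood zone.
records = [{"id": i, "zip": "94901", "flood": int(i < 8), "fire": 0}
           for i in range(12)]
flood_sample = sample_cell(records, "94901", flood=1, fire=0,
                           n_samples=5, weight=1.6, seed=42)
```

Sorting on a random key and taking the first n rows is equivalent to simple random sampling without replacement within the cell.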
Once the sample data set was available, with over 6500 records, numerous stratifications were created in S-Plus. The stratifications are given below, in Table 6. The purpose of these stratifications was to see how the models' fits, errors and parameter estimates varied depending on the data segment analyzed. The first seven stratifications estimate models by pooling data across geographic housing submarkets and instead defining boundaries by the main effects (hazard zones) and by whether the transaction took place before or after the law. In other words, these stratifications ignore housing submarket variability and assume that the factors upon which stratification should depend are the main effects. Flood zone strata (4 and 5) were defined so as to include properties in non-hazard zones but exclude properties in the fire zone, to avoid possible confounding. Likewise, fire zone strata (6 and 7) were defined so as to include properties in non-hazard zones but exclude those in flood zones.
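A minimal sketch of how the flood and fire strata described above might be selected is given below. The record layout (`flood`, `fire` and `period` fields) and the example sales are assumptions made for illustration; the actual subsetting was done in S-Plus.

```python
def hazard_stratum(records, hazard, period):
    """Select one hazard-by-period stratum.

    Keeps properties in the named hazard zone together with properties in no
    hazard zone, and drops properties in the other hazard zone.
    hazard: "flood" or "fire"; period: "before" or "after" the law.
    """
    if hazard not in ("flood", "fire"):
        raise ValueError("hazard must be 'flood' or 'fire'")
    other = "fire" if hazard == "flood" else "flood"
    return [r for r in records if r[other] == 0 and r["period"] == period]


# Hypothetical sales records.
sales = [
    {"id": 1, "flood": 1, "fire": 0, "period": "after"},
    {"id": 2, "flood": 0, "fire": 1, "period": "after"},
    {"id": 3, "flood": 0, "fire": 0, "period": "after"},
    {"id": 4, "flood": 1, "fire": 0, "period": "before"},
]
flood_after = hazard_stratum(sales, "flood", "after")
```

Under this rule a non-hazard property appears in both the flood and the fire strata for its period, which matches the description of each stratum as a hazard zone plus the "No" zone.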
Stratifications 8 through 13 were created based on two different housing market categorizations and three temporal categories. For 8, 9 and 10, the Northern California data set was broken up into four relatively large sub-regions. Regressions were then run for each stratum across the entire time period and separately for the periods before and after the law. For 11, 12 and 13, seven smaller sub-regions were defined, with more "splitting up" occurring in dense urban areas, where housing markets can vary radically within a few kilometers.
Figure 8. Four-way geographic stratification
Figure 9. Seven-way geographic stratification

The submarket stratifications presented here represent one possible way of controlling for variation in the housing price function across geographic regions. Clearly, this method is more ad hoc and less systematic than the system used by Goodman and Thibodeau (1998), but their method would be almost impossible for this spatial extent, requiring literally hundreds of millions of comparisons, as described above. In this case, sample zip codes were used as the unit of grouping for submarkets. All properties within those zip codes were then assigned to the designated submarket. The first submarket stratification grouped sample zip codes into four intuitively defined regions: the South Bay, the East Bay, the North Bay and the Central Valley/Sierra Foothills region. This grouping is shown in Figure 8. The second submarket stratification, shown in Figure 9, was slightly finer grained, but cut across some of the original designations. This series of stratifications was designed to better reflect housing market and neighborhood differences.
In this stratification Marin and Pacifica were grouped into a single submarket, with the South Bay classed separately. The poorer communities along the Route 80 corridor between Richmond and Dixon were classed into their own market. Sonoma and Napa were put into one group. Separate groups were also made for the Sierra communities and the Sacramento Valley communities. For the two geographic stratification schemes, data were also broken down by time period (before and after the law) and compared to models with all time periods. This is an important attribute on which to stratify, because housing markets can change significantly in relatively short periods.
Table 6. Models and Stratifications Used

| STRATIFICATIONS | MODELS |
|---|---|
| 1. All sample data | 1. Linear |
| 2. Before, all zones | 2. Semi-log |
| 3. After, all zones | 3. Square root |
| 4. Flood zone and "No" zone, after | 4. Quadratic structure and distance terms |
| 5. Flood zone and "No" zone, before | 5. 2 and 3 way interaction terms |
| 6. Fire zone and "No" zone, after | 6. Weighted and non-weighted |
| 7. Fire zone and "No" zone, before | 7. Log distance terms |
| 8. Four submarket stratifications, before and after | |
| 9. Four submarket stratifications, before | |
| 10. Four submarket stratifications, after | |
| 11. Seven submarket stratifications, before and after | |
| 12. Seven submarket stratifications, before | |
| 13. Seven submarket stratifications, after | |
While the results of the larger study on the effects of disclosure on property markets are available for Northern California, this paper focuses more on: 1) how well the approach given above modeled housing markets in California and 2) how different geographic and temporal stratification approaches worked in modeling the effects of flood and fire disclosure zones.
Table 7. Comparison of model fits between geographically pooled data and four-way geographic stratification

| Geog. stratification | Time strat. | Model | R-sq | F stat | SSE | df |
|---|---|---|---|---|---|---|
| Sierra/C. Valley | before | linear with 120 interaction terms | 0.96 | 58.22 | 33110 | 186 and 498 |
| East Bay | before | linear with 120 interaction terms | 0.99 | 121.3 | 47380 | 186 and 340 |
| North Bay | before | linear with 120 interaction terms | 0.65 | 9.733 | 475400 | 169 and 876 |
| South Bay/Penins. | before | linear with 120 interaction terms | 0.97 | 34.27 | 115700 | 169 and 153 |
| all | before | linear with 120 interaction terms | 0.55 | 15.39 | 382100 | 186 and 2394 |
| Sierra/C. Valley | after | linear with 120 interaction terms | 0.91 | 43.76 | 60770 | 186 and 823 |
| East Bay | after | linear with 120 interaction terms | 0.92 | 35.91 | 139800 | 186 and 589 |
| North Bay | after | linear with 120 interaction terms | 0.81 | 34.23 | 235400 | 169 and 1319 |
| South Bay/Penins. | after | linear with 120 interaction terms | 0.91 | 17.9 | 204400 | 169 and 285 |
| all | after | linear with 120 interaction terms | 0.80 | 77.08 | 214200 | 186 and 3543 |
| Sierra/C. Valley | all | linear | 0.84 | 162.6 | 66280 | 54 and 1640 |
| East Bay | all | linear | 0.79 | 84.69 | 183700 | 54 and 1248 |
| North Bay | all | linear | 0.42 | 34.56 | 472900 | 53 and 2481 |
| South Bay/Penins. | all | linear | 0.77 | 46.37 | 264500 | 53 and 724 |
| all | all | linear | 0.53 | 128.1 | 347200 | 54 and 6256 |
| Sierra/C. Valley | after | linear | 0.83 | 87.22 | 76440 | 54 and 955 |
| East Bay | after | linear | 0.84 | 70.97 | 176700 | 54 and 721 |
| North Bay | after | linear | 0.67 | 51.3 | 302400 | 56 and 1432 |
| South Bay/Penins. | after | linear | 0.79 | 27.16 | 272100 | 54 and 400 |
| all | after | linear | 0.7 | 158 | 259200 | 54 and 3675 |
| Sierra/C. Valley | before | linear | 0.91 | 114.5 | 42690 | 54 and 630 |
| East Bay | before | linear | 0.9 | 77.02 | 105400 | 54 and 472 |
| North Bay | before | linear | 0.39 | 11.81 | 591500 | 54 and 991 |
| South Bay/Penins. | before | linear | 0.85 | 27.85 | 211900 | 54 and 268 |
| all | before | linear | 0.42 | 33.98 | 419400 | 54 and 2526 |
Table 8. Comparison of model fits between geographically pooled data and seven-way geographic stratification

| Geog. stratification | Time strat. | Model | R-sq | F stat | SSE | df |
|---|---|---|---|---|---|---|
| Route 80 Corr. | after | linear with 120 interaction terms | 0.97 | 49.16 | 107100 | 154 and 260 |
| Peninsula/Marin | after | linear with 120 interaction terms | 0.88 | 19.85 | 188300 | 155 and 435 |
| South Bay | after | linear with 120 interaction terms | 0.92 | 14.63 | 235200 | 153 and 188 |
| Sonoma-Napa | after | linear with 120 interaction terms | 0.84 | 21.84 | 182300 | 153 and 617 |
| East Bay-Alameda | after | linear with 120 interaction terms | 0.90 | 27.21 | 163100 | 153 and 447 |
| Central Valley | after | linear with 120 interaction terms | 0.94 | 40.56 | 57590 | 152 and 378 |
| Sierra | after | linear with 120 interaction terms | 0.92 | 23.27 | 53360 | 153 and 325 |
| all | after | linear with 120 interaction terms | 0.80 | 77.08 | 214200 | 186 and 3543 |
| Route 80 Corr. | before | linear with 120 interaction terms | 0.92 | 11.79 | 438600 | 153 and 161 |
| Peninsula/Marin | before | linear with 120 interaction terms | 0.92 | 23.92 | 147000 | 151 and 310 |
| South Bay | before | linear with 120 interaction terms | 0.98 | 29.36 | 129900 | 151 and 88 |
| Sonoma-Napa | before | linear with 120 interaction terms | 0.94 | 34.85 | 132000 | 151 and 331 |
| East Bay-Alameda | before | linear with 120 interaction terms | 0.98 | 97.33 | 53610 | 151 and 244 |
| Central Valley | before | linear with 120 interaction terms | 0.97 | 46.97 | 32600 | 151 and 228 |
| Sierra | before | linear with 120 interaction terms | 0.96 | 26.97 | 31070 | 151 and 153 |
| all | before | linear with 120 interaction terms | 0.55 | 15.39 | 382100 | 186 and 2394 |
| Route 80 Corr. | after | linear | 0.66 | 13.49 | 289000 | 53 and 361 |
| Peninsula/Marin | after | linear | 0.7 | 24.2 | 261600 | 53 and 537 |
| South Bay | after | linear | 0.78 | 19.74 | 319400 | 52 and 289 |
| Sonoma-Napa | after | linear | 0.69 | 30.69 | 236900 | 53 and 717 |
| East Bay-Alameda | after | linear | 0.84 | 53.96 | 189800 | 53 and 547 |
| Central Valley | after | linear | 0.87 | 65.06 | 75600 | 51 and 479 |
| Sierra | after | linear | 0.84 | 43.8 | 63970 | 52 and 426 |
| all | after | linear | 0.7 | 158 | 259200 | 54 and 3675 |
| Route 80 Corr. | before | linear | 0.48 | 4.629 | 867200 | 52 and 262 |
| Peninsula/Marin | before | linear | 0.77 | 26.67 | 217300 | 52 and 409 |
| South Bay | before | linear | 0.87 | 23.25 | 233800 | 52 and 187 |
| Sonoma-Napa | before | linear | 0.85 | 48.02 | 182500 | 52 and 430 |
| East Bay-Alameda | before | linear | 0.89 | 54.77 | 116000 | 52 and 343 |
| Central Valley | before | linear | 0.95 | 124 | 33880 | 52 and 327 |
| Sierra | before | linear | 0.92 | 57.92 | 35350 | 52 and 252 |
| all | before | linear | 0.42 | 33.98 | 419400 | 54 and 2526 |
Tables 7 and 8 compare R-squared, F statistic and SSE values for several model specifications of the different stratifications. Rather than giving lengthy results from all models, only the simple linear model and a full linear model with 120 interaction terms and ten quadratic terms are given here. As the tables show, both the four-way and seven-way geographic stratifications substantially increase fit as measured by R-squared. For some strata in the Before group, R-squared reached .99. However, this comes at the expense of degrees of freedom, which is often reflected in lower F statistics. In the After group, stratifying does not seem to be merited, as the R-squared on the non-geographically stratified data ranges from .7 in the simple linear model to .8 in the full model (120 interaction terms and quadratic terms). In the simple linear model, the four-way stratified data sets are little better than the pooled data and sometimes worse, while in the full model three of the four reach the low nineties. It is in the Before group, however, that there is a marked difference. For the pooled data, the Before stratum runs from an R-squared of .42 in the linear model to .55 in the full model, while two of the four-way strata are in the low nineties for the simple linear model and three are in the high nineties for the full model, a very large increase in fit from pooled to stratified models. This suggests that markets may have been more segmented in the mid-1990s and became less so as the regional housing market heated up in the last two years.
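The trade-off between R-squared and degrees of freedom noted above can be illustrated with the standard identity for the overall F statistic of an ordinary least squares regression with an intercept, F = (R-squared / df1) / ((1 - R-squared) / df2). The sketch below is only a rough check under that assumption: the tables report R-squared to two decimals, and the weighted models may not follow the identity exactly. The function name is illustrative.

```python
def overall_f(r_squared, df_model, df_resid):
    """Overall F statistic implied by R-squared and the two degrees of freedom."""
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must be in [0, 1)")
    if df_model <= 0 or df_resid <= 0:
        raise ValueError("degrees of freedom must be positive")
    explained_per_df = r_squared / df_model
    unexplained_per_df = (1 - r_squared) / df_resid
    return explained_per_df / unexplained_per_df


# Pooled simple linear model, all periods (Table 7): R-sq .53, df 54 and 6256.
f_pooled_linear = overall_f(0.53, 54, 6256)

# The same hypothetical R-squared of .90 with few versus many regressors.
f_simple = overall_f(0.90, 54, 630)
f_full = overall_f(0.90, 186, 498)
```

For the pooled linear model the identity lands near the reported F statistic of 128.1, and the second comparison shows how adding regressors at a fixed R-squared lowers F.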
Three of the four strata in the first geographic stratification scheme performed extremely well, but the North Bay performed extremely poorly, often with worse fit than the pooled model. This probably reflects the fact that this stratum contains several significantly different housing markets with varying relationships between price and attributes. Likely, the rural Sonoma and Napa housing markets have little in common with the exclusive south Marin neighborhoods.
The seven-way stratification was partly devised to remedy the problems in the North Bay market. However, since there were not enough properties in south Marin to make its own stratum, south Marin was joined with Pacifica, which was taken away from the South Bay category. The East Bay was also split up into the wealthier Alameda county part and the poorer I-80 corridor part. The seven-way stratification succeeded in increasing the fit for all models. But still, the Marin group and the Napa-Sonoma group had consistently lower R-squared values than other strata. This is likely because Marin, Napa and Sonoma embody very diverse housing markets, from wealthy, high demand, exclusive suburbs to poor neighborhoods. The time frame still made a difference, though. All strata had good-fitting models in the before group, but in the after group, Sonoma-Napa and Peninsula-Marin's R-squared hovered near .7 for the simple model and in the mid eighties for the full model, while the other strata were in the mid eighties for the simple linear model and the low to mid nineties for the full model.
Despite the better-fitting models, geographic stratification made it more difficult to assess the effects of location in the flood or fire zone. As can be seen in Tables 9 and 10 below, very few of the geographic strata for the After temporal group (the group that matters, since the group from before the law would be expected to undergo less of an effect) yielded significant coefficients for FLOOD, FIRE or the interaction term, FLOOD:HISPANIC. That interaction term was chosen through hundreds of regressions on the pooled data. Of the dozens of two- and three-way interactions involving FLOOD or FIRE that were tried in these models, FLOOD:HISPANIC was almost always significant, and when it was put in the model, other interaction terms with FLOOD became insignificant. Moreover, FLOOD:HISPANIC's sign and magnitude were robust to model specification for the After group. No interaction terms with FIRE were consistently significant across models.
For the full model with 120 interaction terms, the only After groups with significant coefficients on one of these three terms were Central Valley in the seven-way and East Bay in the four-way stratification. Neither of the FLOOD terms had the expected sign (negative), but in the Central Valley grouping, the FLOOD:HISPANIC coefficient was negative. A positive coefficient on FLOOD and a negative one on FLOOD:HISPANIC was common within models using the pooled data regardless of functional form and specification and appears to be an important result. FIRE was nowhere significant for either the four- or seven-way strata for the After temporal group. For the After temporal group with the simple linear model, the Peninsula/Marin area does not have a significant coefficient on FLOOD but does have a negative coefficient on FLOOD:HISPANIC with a magnitude that is consistent with other strata. A similar negative coefficient is also found for FLOOD:HISPANIC in the North Bay stratum of the four-way stratification.
With the Before temporal group, the coefficients from the stratified regions have little consistency with the pooled results or with the stratified After group results. For the seven-way full model, the Route 80 corridor has a positive coefficient on FLOOD:HISPANIC and a non-significant one on FLOOD. For Sonoma/Napa and the Sierras, FLOOD is highly negative (although with a large standard error relative to the coefficient) while FLOOD:HISPANIC is positive. For the Central Valley, FLOOD is negative while FLOOD:HISPANIC is not significant. For the four-way stratification full model, Sierra/Central Valley also has a negative, although smaller, coefficient on FLOOD and a positive one on FLOOD:HISPANIC. For the simple linear model, the seven-way stratification gives a negative coefficient on FLOOD for the Central Valley and a positive one on FLOOD:HISPANIC for the Route 80 corridor.
The geographically stratified results are somewhat inconsistent with the results from the pooled data, which seem more in line with the expected results. For both the full and simple linear models, the geographically pooled After group yields a positive coefficient on FLOOD and a highly significant and negative coefficient on FLOOD:HISPANIC. For the Before group, there is generally a higher positive coefficient on FLOOD and a non-significant coefficient on FLOOD:HISPANIC. This suggests that not only did the positive premium associated with living in a floodplain (say, river views, frontage on water) get reduced, but after the law a negative premium for living in the floodplain appeared in more heavily Hispanic neighborhoods. The meaning behind this finding will not be explored in this paper, but the variable HISPANIC may possibly be correlated with some other factor, such as housing demand.
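To make the interpretation of the two coefficients concrete, the sketch below combines the FLOOD main effect with the FLOOD:HISPANIC interaction for the pooled After full model (coefficients 41297 and -2178 from Tables 9 and 10). It is a hedged illustration: the units of HISPANIC are not stated in this section, so the break-even value is expressed in whatever units that variable was measured in, and the function names are assumptions.

```python
def flood_premium(b_flood, b_flood_hispanic, hispanic):
    """Implied price effect of flood zone location at a given HISPANIC value."""
    return b_flood + b_flood_hispanic * hispanic


def break_even_hispanic(b_flood, b_flood_hispanic):
    """HISPANIC value at which the implied flood zone effect changes sign."""
    if b_flood_hispanic == 0:
        raise ValueError("interaction coefficient must be non-zero")
    return -b_flood / b_flood_hispanic


# Pooled After group, full model: FLOOD = 41297, FLOOD:HISPANIC = -2178.
B_FLOOD, B_INTERACTION = 41297, -2178
threshold = break_even_hispanic(B_FLOOD, B_INTERACTION)
effect_at_zero = flood_premium(B_FLOOD, B_INTERACTION, 0)
effect_at_thirty = flood_premium(B_FLOOD, B_INTERACTION, 30)
```

Below the break-even value the implied flood zone effect is a premium; above it the effect is a discount, which is the pattern described for more heavily Hispanic neighborhoods after the law.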
Table 9. Comparison of flood and fire coefficients between pooled data and seven-way stratification

| Geographic strat. | Time strat. | Model | FLOOD | FIRE | FLOOD:HISPANIC |
|---|---|---|---|---|---|
| Route 80 Corr | after | 120 interaction terms | NS | NS | NS |
| Peninsula/Marin | after | 120 interaction terms | NS | NS | NS |
| South Bay | after | 120 interaction terms | NS | NS | NS |
| Sonoma/Napa | after | 120 interaction terms | NS | NS | NS |
| East Bay-Alam. | after | 120 interaction terms | NS | NS | NS |
| Central Valley | after | 120 interaction terms | 33115 (.01) | NA | -1448 (.03) |
| Sierra | after | 120 interaction terms | NS | NS | NS |
| ALL | after | 120 interaction terms | 41297 (0) | NS | -2178 (0) |
| Route 80 Corr | before | 120 interaction terms | NS | NS | 37652 (.01) |
| Peninsula/Marin | before | 120 interaction terms | NS | NS | NS |
| South Bay | before | 120 interaction terms | NS | NS | NS |
| Sonoma/Napa | before | 120 interaction terms | -143983 (.05) | NS | 7113 (.03) |
| East Bay-Alam. | before | 120 interaction terms | NS | NS | NS |
| Central Valley | before | 120 interaction terms | -17631 (.05) | NS | NS |
| Sierra | before | 120 interaction terms | -223404 (.02) | NS | 38295 (.01) |
| ALL | before | 121 interaction terms | 54301 (.02) | NS | NS |
| Route 80 Corr | after | linear | NS | NS | NS |
| Peninsula/Marin | after | linear | NS | NS | -2704 (.02) |
| South Bay | after | linear | NS | NS | NS |
| Sonoma/Napa | after | linear | NS | NS | NS |
| East Bay-Alam. | after | linear | NS | NS | NS |
| Central Valley | after | linear | NS | NS | NS |
| Sierra | after | linear | NS | NS | NS |
| ALL | after | linear | 65691 (0) | 19780 (.02) | -3892 (0) |
| Route 80 Corr | before | linear | NS | NS | 40353 (.02) |
| Peninsula/Marin | before | linear | NS | NS | NS |
| South Bay | before | linear | NS | NS | NS |
| Sonoma/Napa | before | linear | NS | NS | NS |
| East Bay-Alam. | before | linear | NS | NS | NS |
| Central Valley | before | linear | -14348 (.03) | NS | NS |
| Sierra | before | linear | NS | NS | NS |
| ALL | before | linear | 57163 (.01) | NS | NS |

NS= not significant, NA= not applicable (variable all=0 for that stratum)
First number is coefficient value. Number in parentheses is P value, where alpha=.05
Table 10. Comparison of flood and fire coefficients between pooled data and four-way stratification

| Geographic strat. | Time strat. | Model | FLOOD | FIRE | FLOOD:HISPANIC |
|---|---|---|---|---|---|
| Sierra/C. Valley | after | linear | NS | NS | NS |
| East Bay | after | linear | NS | NS | NS |
| North Bay | after | linear | NS | NS | -3204 (0) |
| South Bay | after | linear | NS | NS | NS |
| ALL | after | linear | 65691 (0) | 19780 (.02) | -3892 (0) |
| Sierra/C. Valley | before | linear | NS | NS | NS |
| East Bay | before | linear | NS | NS | NS |
| North Bay | before | linear | NS | NS | NS |
| South Bay | before | linear | NS | NS | NS |
| ALL | before | linear | 57163 (.01) | NS | NS |
| Sierra/C. Valley | before | 120 interaction terms | -16856 (.01) | NS | 734 (.03) |
| East Bay | before | 120 interaction terms | NS | NS | NS |
| North Bay | before | 120 interaction terms | NS | NS | NS |
| South Bay | before | 120 interaction terms | NS | NS | NS |
| ALL | before | 121 interaction terms | 54301 (.02) | NS | NS |
| Sierra/C. Valley | after | 120 interaction terms | NS | NS | NS |
| East Bay | after | 120 interaction terms | 8579 (.03) | NS | NS |
| North Bay | after | 120 interaction terms | NS | NS | NS |
| South Bay | after | 120 interaction terms | NS | NS | NS |
| ALL | after | 121 interaction terms | 41297 (0) | NS | -2178 (0) |

NS= not significant, NA= not applicable (variable all=0 for that stratum)
First number is coefficient value. Number in parentheses is P value, where alpha=.05
This paper outlines a methodology for using hedonics to analyze the price effects of a widely and irregularly distributed environmental phenomenon, in this case presence of a house in a statutorily designated natural hazard zone. Rather than offering an easy cookbook solution to this problem, this paper outlines some of the tradeoffs and potential pitfalls of the different approaches.
The first lesson derived from this undertaking is that Geographic Information Systems (GIS) can be used extremely effectively to code large numbers of locational and neighborhood variables that could otherwise take a lifetime to quantify. The proliferation of fast computers and GIS software is aided by the fact that an increasing number of geo-spatial data sets are becoming publicly available, including all TIGER layers, the Geographic Names Information System, GDT-2000 streets data, land ownership layers, climate layers and many other data sets. Because these data sets are standardized across the country, they can easily be used for large areas, such as the state of California, in a minimal number of steps. This is in contrast to previous times, when most data, where available, had to be obtained from individual city or county governments and so were highly inconsistent, requiring a great deal of remediation. One of the great limiting factors, however, as seen here, is that the actual property data can be very inconsistent. While some counties keep excellent records, with large numbers of attribute fields, other counties have next to nothing available. This makes it difficult to conduct a hedonic analysis that reaches across counties. In this case, where a given county had insufficient data (all such counties were northern and rural), a replacement zip code was sampled from another county in the same cell of the stage one sampling matrix.
The second methodological lesson of this project relates to sampling. This project underlines the need to conduct careful, multi-tier sampling when sampling an entity as big as California. A single-stage approach would likely result in serious geographic clustering and undersampling of rural areas, since the highest density of properties is in spatially small urban areas. For some studies, where rural markets are not an issue, this may be fine. However, in this case, where an attempt is being made to determine how something like hazard disclosure will affect properties across the spectrum from urban to rural (so as to get an idea of how it may in the long run affect development patterns), it is extremely important to ensure that areas with low property densities are represented. Doing this means oversampling rural areas, since a sample proportional to the population of houses would contain too few rural properties to be significant. This oversampling was achieved in the first tier of sampling by specifying that a proportional number of urban, suburban and rural zip codes would be sampled, while excluding extremely rural zip codes where transactions are lacking. Since there are a large number of rural zip codes, the representation of rural areas was boosted at this stage.
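A simplified sketch of the first-tier idea described above is given below: zip codes are grouped by density class, those with too few transactions are dropped, and the same fraction of zip codes is drawn from each class. It collapses the stage one sampling matrix to a single dimension, and the field names, the sampling fraction and the transaction cutoff are hypothetical values chosen for illustration.

```python
import random
from collections import defaultdict


def sample_zip_codes(zip_codes, fraction, min_transactions, seed=None):
    """Draw the same fraction of zip codes from each density class."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for z in zip_codes:
        # Exclude zip codes where transactions are lacking.
        if z["transactions"] >= min_transactions:
            by_class[z["density_class"]].append(z["zip"])
    sample = {}
    for density_class, zips in by_class.items():
        k = max(1, round(fraction * len(zips)))  # at least one zip per class
        sample[density_class] = rng.sample(sorted(zips), k)
    return sample


# Hypothetical frame: 10 urban, 10 suburban, 20 usable rural and 5 sparse rural zips.
frame = (
    [{"zip": f"U{i}", "density_class": "urban", "transactions": 100} for i in range(10)]
    + [{"zip": f"S{i}", "density_class": "suburban", "transactions": 80} for i in range(10)]
    + [{"zip": f"R{i}", "density_class": "rural", "transactions": 50} for i in range(20)]
    + [{"zip": f"X{i}", "density_class": "rural", "transactions": 2} for i in range(5)]
)
first_tier = sample_zip_codes(frame, fraction=0.2, min_transactions=10, seed=7)
```

Because the draw is proportional to the number of zip codes rather than the number of houses, classes with many sparsely populated zip codes receive more sample zip codes than a house-proportional design would give them.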
Another major sampling problem stemmed from ensuring that the appropriate number of hazard and non-hazard properties were sampled. This presented a significant challenge at the first sampling phase since, lacking geocoded property points at that stage, it was hard to know whether a zip code with large amounts of hazard land would actually have many houses falling in that hazard land. The stopgap measure taken was to set a threshold for the minimum percentage of land that needed to fall in a hazard zone. A second measure was to overlay a streets layer on the hazard layer for all chosen sample zip codes; where no or very few streets overlaid the hazard zones, those zip codes were discarded and replacements were sampled from the remaining pool of zip codes in that sampling matrix cell.
This approach led to a serious problem that would become evident at later phases. Since both flood and fire zones were being looked at, it was desired that a permissible zip code either meet the threshold for flood or for fire, not both (which is very rare). Doing this resulted in a situation where there were many zip codes with only flood hazards and many with only fire hazards. In an attempt to make sure that enough properties in and out of hazard zones were sampled, the geographically pooled sample was created so as to get an equal number of hazard zone and non hazard zone property transactions (with half the hazard zone transactions being in the flood zone and half in the fire zone). Unfortunately, this did not address the highly irregular distribution of hazard zone properties in general, nor did it address the irregular distribution of each type of hazard zone property. For geographically pooled regressions, this meant that regressions were comparing flood zone properties to non-flood zone properties in areas far from any flood zone. For geographically stratified regressions, it meant that a whole stratum might have very few property transactions in the flood zone or the fire zone relative to the total number in the stratum's sample, making one or both of those coefficients essentially meaningless. It is recommended that future studies sample based on only one environmental attribute of interest, rather than several. If, say, only flood prone zip codes were sampled, then geographically stratified regressions would be much more meaningful and robust, with higher numbers of observations for that dummy variable.
Overall, the geographically stratified regressions yield extremely accurate and well-fitting models of housing price functions. Unfortunately, F tests have not yet been run to compare stratified to non-stratified models, but judging from the high SSE values of the geographic stratifications, it seems likely that the restricted (pooled) model would be chosen in most circumstances. The geographically stratified models are extremely useful for valuing highly localized externalities or deriving price indices for local housing markets. Interestingly, though, the North Bay seemed not to benefit from geographic stratification in the four-way stratification. Further subdivision of the North Bay in the seven-way stratification into two areas improved the model somewhat, but it seems likely that the appropriate level of geographic stratification for that particular area is much smaller than for other areas, and possibly smaller than is practicable for this study.
These spatially narrow, geographically stratified models are not optimal for picking up the effects of a phenomenon like location in a flood zone, since the distribution of that phenomenon is widespread and heteroskedastic. The results of the pooled regressions appear to be much more robust to changes in model and temporal stratification than do the geographically stratified ones. The pooled results indicate fairly strongly that the positive premium on floodplain properties dropped after the law and that a negative premium appeared after the law for floodplain properties in highly Hispanic neighborhoods. Some of the geographic stratifications support this, but overall, the results of those stratifications are ambiguous. Nevertheless, doing the two in conjunction can be useful. While the pooled model yields consistent and fairly robust results on the variable of interest, the geographic stratifications yield some clues as to the spatial nature of these effects, suggesting a strong negative FLOOD:HISPANIC interaction after the law in the North Bay and the Central Valley. When the data are further broken down into the seven-category stratification, we see that of the North Bay areas, the real effect happens in Marin and not in Sonoma or Napa.
This brings up the final question, which is the appropriate spatial and temporal scale for hedonic stratifications. Clearly, there is no easy answer, for the spatial extent of housing markets varies with every state and every region. In some areas, housing markets encompass entire commute-sheds, and in others they may extend only a matter of blocks. Therefore, the best approach appears to be to start with regionally pooled data. If the fit is very poor and parameter estimates are extremely sensitive to model change, the data should be broken down into a small number (two to four) of intuitively defined contiguous units. Units for which the fit is dramatically increased can be kept as submarkets. Units that do not experience a significant increase in fit should be further subdivided. If it takes too many subdivisions to get that increase in fit, then it is likely not worth stratifying. Full versus restricted F tests can be used to test quantitatively whether further stratification is merited. It is also helpful in some cases to stratify by time period, especially in markets that undergo cyclical booms and cooldowns, for the actual price-attribute relationship could change significantly between these time periods. Overall, it is critical when conducting such a study to keep in mind the underlying workings of housing markets. With an understanding of land and housing economics, as well as an intuitive sense for markets in a region, it is far more likely that methods will be employed that appropriately and meaningfully model the housing market.
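The full versus restricted F test mentioned above can be sketched as follows, treating the pooled regression as the restricted model and the set of separate stratum regressions as the full model. This is a generic textbook form, not a calculation from the study: it assumes unweighted least squares on the same observations and dependent variable, takes residual degrees of freedom as inputs, and uses made-up numbers rather than the values in Tables 7 and 8.

```python
def stratification_f_test(sse_pooled, df_pooled, stratum_sses, stratum_dfs):
    """F statistic for pooled (restricted) versus stratified (full) models.

    sse_pooled, df_pooled: SSE and residual df of the pooled regression
    stratum_sses, stratum_dfs: SSE and residual df of each stratum regression
    Returns (F, numerator df, denominator df).
    """
    sse_full = sum(stratum_sses)
    df_full = sum(stratum_dfs)
    extra_params = df_pooled - df_full  # parameters added by stratifying
    if extra_params <= 0 or df_full <= 0 or sse_full <= 0:
        raise ValueError("inconsistent degrees of freedom or SSE values")
    f_stat = ((sse_pooled - sse_full) / extra_params) / (sse_full / df_full)
    return f_stat, extra_params, df_full


# Hypothetical example: one pooled model versus two stratum models.
f_stat, df_num, df_den = stratification_f_test(
    sse_pooled=1000.0, df_pooled=194,
    stratum_sses=[300.0, 400.0], stratum_dfs=[95, 95],
)
```

A large F relative to the critical value for the two degrees of freedom would favor stratifying; a p-value could be obtained from an F distribution, for example with `scipy.stats.f.sf(f_stat, df_num, df_den)` if SciPy is available.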
Finally, this issue of time is perhaps one of the most difficult. In this case, it appears that the overall housing market behaved significantly differently in the period before the law (1997 and early 1998) than in the period after the law (late 1998 to 2000). It seems likely that increased demand, activity and prices resulted in a significantly different relationship between price and marginal attribute quantities between those periods. Because of this, it was of great use to stratify data sets by transaction before or after the law, which, by chance, seemed to coincide rather well with the time at which the market began to heat up. A very interesting result was that the fit of the Before group models was very poor for the pooled group but extremely high for the geographically stratified groups. This suggests that regional markets may have been far more segmented and unsubstitutable while the market was cooler, and that when it heated up, demand resulted in more substitutability between markets than had been present before, as reflected in the better fit of the pooled data models. Such a result underlines the need to consider both time and space when modeling housing markets.
This research was made possible by grants from the California Policy Research Center and the Lincoln Institute of Land Policy.
Bender, B., T. Gronberg and H.S. Hwang. 1980. Choice of functional form and the demand for air quality. Review of Economics and Statistics. 62:638-43.
Burby, R. J., S. A. Bollens, J. M. Holway, E. J. Kaiser, D. Mullan, and J. R. Sheaffer. 1988. Cities Under Water: A Comparative Evaluation of Ten Cities' Efforts to Manage Floodplain Land Use. Program on Environment and Behavior, Monograph #47. Boulder, CO: Institute of Behavioral Science, University of Colorado.
Cassel, E. and R. Mendelsohn. 1985. The choice of functional forms for hedonic price equations: Comment. Journal of Urban Economics. 18:135-42.
Cross, J. A., 1985. The effects of flood hazard information disclosure by realtors: the case of the lower Florida Keys. University of Colorado, Institute of Behavioral Science, Natural Hazards Research and Applications Information Center: Boulder, CO, Working Paper No. 52.
Dale-Johnson, D. 1980. Hedonic prices and price indexes in housing markets; the existing empirical evidence and proposed extensions. Program in Real Estate and Urban Economics Working Paper Series: 80-5. Institute of Business and Economic Research, UC Berkeley.
Damianos, D., and L. Shabman. 1976. Land Prices in flood hazard areas: applying methods of land value analysis. Blacksburg, VA: Virginia Water Resources Research Center. Bulletin 95, April.
Davenport, C.W. and T.C. Smith. 1985. Geologic hazards, negligence, and real estate sales. California Geology. 38 (7): 159-160.
Donnelly, W. A. 1989. Hedonic price analysis of the effect of a floodplain on property values. Water Resources Bulletin. 25 (3): 581-586.
Epple, D. 1987. Hedonic prices and implicit markets. Journal of Political Economy 95(1):59.
Freeman, A.M. 1979. The hedonic price approach to measuring demand for neighborhood characteristics. In: The Economics of Neighborhood, David Segal, ed. New York: Academic Press.
Garrod, G. and K. Willis. 1999. Economic Valuation of the Environment. Northampton, MA: Edward Elgar.
Goodman, A.C. and T.G. Thibodeau. 1998. Housing market segmentation. Journal of Housing Economics. 7: 121-143.
Greenberg, M. and J. Hughes. 1993. Impact of hazardous waste sites on property value and land use: tax assessors’ appraisal. Appraisal Journal, January, 42-51.
Griliches, Z. 1971. Price Indexes and Quality Change. Cambridge, MA: Harvard University Press.
Halverson, R. and H. Pollakowski. 1981. Choice of functional form for hedonic price equations. Journal of Urban Economics. 10:37-49.
Harrison, D and Rubinfeld, D. 1976. Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management. 5, 81-102.
Holway, J. M., and R. J. Burby. 1990. The effects of floodplain development controls on residential land values. Land Economics 66 (3): 259-71.
Holway, J. M., and R.J. Burby. 1993. Reducing flood losses: local planning and land use controls. Journal of the American Planning Association 59 (2):205-217.
King, A.T. and P. Mieskowski. 1973. Racial discrimination, segregation and the price of housing. Journal of Political Economy. 81: 590-606.
Kish, L. 1992. Weighting for unequal Pi. Journal of Official Statistics 8:183-200.
Lohr, S.L. 1999. Sampling: Design and Analysis. Duxbury Press, Pacific Grove.
MacDonald, D.N., J.C. Murdoch and H.L. White. 1987. Uncertain hazards and consumer choice: evidence from housing markets. Land Economics. 63 (4): 361-371.
MacDonald, D. N. , H.L. White, P.M. Taube, and W.L. Huth. 1990. Flood hazard pricing and insurance premium differentials: evidence from the housing market. Journal of Risk and Insurance. 57 (4): 654-663.
Montz, B. E. 1992. The impact of hazard area disclosure on property values in three New Zealand communities. Boulder, Colo., University of Colorado, Institute of Behavioral Science, Natural Hazards Research and Applications Information Center, Working Paper No. 76.
Muckleston, Keith W. 1983. The Impact of floodplain regulations on residential land values in Oregon. Water Resources Bulletin 19 (1): 1-7.
Owens, R.W. and J.R. Roberts. 1991. Appraising floodplain properties. Appraisal Journal. 59 (2): 191-197.
Palm, R. I. 1982. Earthquake hazards information: The experience of mandated disclosure. In: Geography and the Urban Environment: Progress in Research and Applications. New York, N.Y.: John Wiley & Sons. Vol. 5, 241-277.
Palm, R. I. 1981. Real estate agents and special studies zones disclosure: the response of California home buyers to earthquake hazards information. Boulder, Colo., University of Colorado, Institute of Behavioral Science, Natural Hazards Research and Applications Information Center, Program on Technology, Environment and Man--Monograph No. 32, 145 pp.
Quigley, J.M. and J.F. Kain. 1970. Measuring the value of housing quality. Journal of the American Statistical Association. 65 (330): 532-548.
Rosen, S. 1974. Hedonic prices and implicit markets: product differentiation in pure competition. Journal of Political Economy 82 (1): 34-55.
Shilling, J.D., J.D. Benjamin and C.F. Sirmans. 1985. Adjusting comparable sales for floodplain location. The Appraisal Journal. July: 429-36.
Strazheim, M. 1974. Hedonic estimation of housing market prices: A further comment. Review of Economics and Statistics 56: 404-406.
Strazheim, M. 1975. An Econometric Analysis of the Urban Housing Market. New York: National Bureau of Economic Research.
Tobin, G. A. and B.E. Montz. 1994. Flood hazard and dynamics of the urban residential land market. Water Resources Bulletin. 30 (4): 673-685.