New method increases the reliability of statistical estimates | MIT News

Suppose an epidemiologist is investigating whether exposure to air pollution is associated with lower birth weight in a given county.

They can train a machine learning model to estimate the size of this association, because machine learning methods are particularly good at learning complex relationships.

Standard machine learning methods do an excellent job of making predictions, and some can attach uncertainty to those predictions in the form of confidence intervals. However, they typically do not provide estimates or confidence intervals for whether two variables are associated. Other methods have been developed specifically for this association problem and do provide confidence intervals. But in spatial settings, MIT researchers found that these confidence intervals can be completely wrong.

When variables such as air pollution levels or rainfall vary from one location to another, common methods for generating confidence intervals may report high confidence even when the estimate has completely missed the true value. These faulty confidence intervals can mislead a user into trusting a model that has failed.

After identifying this shortcoming, the researchers developed a new method designed to generate valid confidence intervals for problems involving spatially varying data. In simulations and experiments with real data, their method was the only technique that consistently produced accurate confidence intervals.

This work could help researchers in fields such as environmental science, economics and epidemiology better understand when to trust the results of specific experiments.

“There are so many problems where people are interested in understanding phenomena in space, such as weather or forest management. We have shown that for this broad class of problems there are more appropriate methods that can give us better performance, a better understanding of what is happening and results that are more reliable,” says Tamara Broderick, an associate professor in MIT's Department of Electrical Engineering and Computer Science (EECS), a member of the Laboratory for Information and Decision Systems (LIDS) and the Institute for Data, Systems, and Society (IDSS), an affiliate of the Computer Science and Artificial Intelligence Laboratory (CSAIL), and senior author of a paper on this technique.

Broderick is joined on the paper by co-authors David R. Burt, a postdoctoral fellow, and Renato Berlinghieri, a graduate student in EECS; and Stephen Bates, an assistant professor in EECS and a member of LIDS. The research was recently presented at the Conference on Neural Information Processing Systems.

Incorrect assumptions

Spatial association analysis studies how a variable and an outcome of interest are related across a geographic area. For example, one could examine how tree cover in the United States relates to elevation.

To tackle this type of problem, a scientist can collect observational data from multiple locations and use it to estimate the association at another location where no data are available.

The MIT researchers realized that in this setting, existing methods often generate confidence intervals that are completely wrong. A model might report 95 percent confidence that its estimates reflect the true relationship between tree cover and elevation when it has not captured that relationship at all.
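This failure can be seen in a toy simulation (a hypothetical sketch, not the researchers' code): fit a misspecified linear model to data from one region of space, then check how often a textbook 95 percent confidence interval covers the true value at a faraway target location. The true relationship, noise level, and locations below are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Invented nonlinear "truth"; the linear model we fit is misspecified
    return np.sin(x)

trials = 1000
covered = 0
for _ in range(trials):
    # Source data clustered in [0, 2], like monitors concentrated in one region
    x = rng.uniform(0.0, 2.0, 50)
    y = true_fn(x) + rng.normal(0.0, 0.1, 50)
    # Ordinary least-squares fit of a straight line
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - 2)        # residual variance estimate
    # Standard 95% interval for the mean at a shifted target location x = 4
    x0 = np.array([1.0, 4.0])
    pred = x0 @ beta
    half = 1.96 * np.sqrt(s2 * x0 @ np.linalg.inv(X.T @ X) @ x0)
    covered += (pred - half <= true_fn(4.0) <= pred + half)

print(f"empirical coverage of nominal 95% intervals: {covered / trials:.2f}")
```

Because the line is extrapolated far outside the region where data were collected, the nominal 95 percent intervals almost never contain the true value, exactly the kind of silent failure described above.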

After investigating this issue, the researchers found that the assumptions underlying confidence interval methods do not hold when data vary spatially.

Assumptions are the conditions that must hold for the results of a statistical analysis to be valid. Common methods for generating confidence intervals rely on several of them.

First, they assume that the source data, meaning the observational data collected to train the model, are independent and identically distributed. This implies that the chance of one location being included in the data has no effect on whether another location is included. In practice this often fails: U.S. Environmental Protection Agency (EPA) air sensors, for example, are placed with the locations of other sensors in mind.

Second, existing methods often assume the model is exactly correct, an assumption that is never true in practice. Finally, they assume the source data are similar to the target data, the data at the locations where estimates are needed.

In spatial settings, however, the source data may be fundamentally different from the target data, because the target lies somewhere other than where the source data were collected.

For example, a scientist might use data from EPA pollution monitors to train a machine learning model that predicts health effects in rural areas that have no monitors. But EPA monitors tend to be located in urban areas with more traffic and heavy industry, so their air quality data will differ significantly from conditions in rural areas.

In this case, association estimates built from urban data are biased, because the target data systematically differ from the source data.

A smooth solution

The new method for generating confidence intervals explicitly takes this potential bias into account.

Instead of assuming that the source and target data are similar, the researchers assume only that the data vary smoothly across space.

For example, with fine-particulate air pollution, one would not expect the pollution level on one block to differ dramatically from the level on the next block. Instead, pollution levels decline gradually as you move away from the source of the pollution.

“For these types of problems, the assumption of spatial smoothness is more appropriate. It better fits what is actually happening in the data,” Broderick says.
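The smoothness idea can be sketched in a toy example. This is a simplified one-dimensional illustration with invented values, not the researchers' actual method: if the pollution field is assumed to be Lipschitz-smooth with a known constant `L`, an interval built from nearby observations can be widened by a worst-case bias term of `L` times the distance to those observations, so the interval stays honest even at locations with no data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical pollution field that varies smoothly with location (1-D here):
# it decays gradually away from a source at location 0
def pollution(loc):
    return 10.0 * np.exp(-0.5 * loc)

L = 5.0          # assumed smoothness bound: |f(a) - f(b)| <= L * |a - b|
sigma = 0.3      # assumed measurement-noise standard deviation

# Monitors clustered in [0, 1], near the pollution source
source_locs = rng.uniform(0.0, 1.0, 30)
obs = pollution(source_locs) + rng.normal(0.0, sigma, 30)

def smooth_interval(target, k=5):
    """Average the k nearest observations, then widen the 95% interval
    by a worst-case bias term implied by the smoothness assumption."""
    d = np.abs(source_locs - target)
    idx = np.argsort(d)[:k]
    est = obs[idx].mean()
    noise_term = 1.96 * sigma / np.sqrt(k)
    bias_term = L * d[idx].max()   # largest distance among the neighbors used
    return est - noise_term - bias_term, est + noise_term + bias_term

# A target location outside the monitored range, where naive intervals fail
lo, hi = smooth_interval(1.5)
print(f"interval at target 1.5: [{lo:.2f}, {hi:.2f}]")
print("contains truth:", lo <= pollution(1.5) <= hi)
```

The widened interval is more conservative than a naive one, but it accounts for how far the target sits from the nearest observations, which is what a naive interval ignores.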

When they compared their method with other common techniques, theirs was the only one that consistently generated reliable confidence intervals in spatial analyses. Moreover, it remained robust even when the observational data were corrupted by random noise.

In the future, researchers want to apply this analysis to different types of variables and explore other applications where it could provide more reliable results.

This research was funded in part by a seed grant from the MIT Social and Ethical Responsibilities of Computing (SERC), the Office of Naval Research, Generali, Microsoft, and the National Science Foundation (NSF).
