Model Selection and Validation

Problem Statement

The purpose of the representative population points generator is to assist in measuring the adequacy of health insurance networks. If a remote cluster of points has no representative point among them, the ensuing network adequacy analysis will utterly overlook the people who live there. Accordingly, good algorithms for this particular use case should ensure that all points are near to at least one representative point.

Caveat: the Garbage In, Garbage Out principle applies here. If the input data does not accurately reflect where people actually live and work, then any points selected by an algorithm won’t either. For the purposes of this document, we assume that the input is valid and concern ourselves only with the problem of selecting representative points from trustworthy data. See this section for an explanation of the data sources underlying the application.

Facility Location Problems

We can consider this problem as belonging to a class of facility location problems, where the goal is to choose the optimal placements of facilities to minimize a certain objective function (for example, transportation costs). See this paper for an overview of facility location problems in the field of healthcare.

Specifically, consider the Metric k-center Problem: Given n cities with specified locations, build warehouses in k different cities to minimize the maximum distance of a city to the nearest warehouse.

This problem comes with its own built-in validation metric, viz. the largest distance between a city and the nearest city with a warehouse.

Intuitively, we want every input population point (city in the analogy) in a service area to have at least one selected representative population point (warehouse) nearby. If after applying the algorithm even a single input population point does not have a selected point nearby, the maximum distance between the input and the selected points will still be high. This facet of the problem means that solutions must explicitly prioritize remote points, an appropriate feature given the ultimate use of the selected points in measuring network adequacy.

On the other hand, if the maximum distance between the input and selected points is some small value d, we have guaranteed that distances that replace the input point with the selected one will have an error bounded by d. Taken together, these features entail that solutions to this more specific problem make for reasonable solutions to the broader one.

Farthest-First Traversal Algorithm

The metric k-center problem is NP-hard. The literature contains several polynomial-time approximations to make the computation more tractable. For an analysis of the problem's complexity and an overview of optimal algorithms, see [1], [2], and [3].

One straightforward such approximation employs farthest-first traversal. At each stage, this algorithm finds the point that is farthest away from the currently selected points and chooses it as the next point. Essentially the algorithm looks to see what point is the most problematic (in terms of the objective function) and eliminates the problem: By selecting the farthest point next, its distance from the selected points drops to 0. The algorithm greedily puts out the biggest fire at each successive step.

This solution has an approximation factor of 2, ensuring that the objective function is at worst two times the smallest value that could be achieved considering any possible subset of size k from the original input. The algorithm has a runtime of O(kn).

A further advantage of this algorithm is the sequential selection of points: once some number K of representative points are chosen, each initial segment of size k for k < K can be taken as a selection of k representative population points. This means that a single calculation with a high value of K is sufficient to provide users with sets of representative population points of all smaller sizes.

The implementation allows users to specify alternative stopping rules based on the value of the objective function. Given a fixed distance (in miles), once all input points are within this distance of their nearest selected point, the model will stop selecting new points. This feature allows the model to select only a few representative population points in the situation when selecting additional points has diminishing returns. This is especially valuable for the performance of downstream applications, as points will not be needlessly generated in regions of high density.

The implementation selects the first point as the nearest input point to the population-weighted center of mass. This ensures that the algorithm is fully deterministic, in contrast to other implementations using a random initial point. This feature is especially helpful when the number of points for the region is small; e.g., when using a relatively large distance cutoff as described above.

For an illustration of the algorithm applied to the application's underlying data, see this IPython notebook.

Next Steps

Apply Methods from Supervised Learning. We can treat the problem of representative population points as a supervised problem and use standard cross-validation techniques to evaluate the algorithm's expected performance on unseen data. This would help to account for the uncertainty inherent in the source data.

Account for Population Density. The current algorithm doesn't take population information into account when selected the next point, but such information is present in the source data. A weighted version of the algorithm could more accurately reflect where people live on average. Care must be taken to ensure that the modified algorithm does not systematically ignore points in regions of low population density.

Data Sources

We use US Postal Service Every Door Direct Mail (EDDM) data as our source for identifying residential and commercial addresses as a basis for our calculation of representative population points.

Access

The EDDM data can be accessed via ArcGIS REST API. For an example, see this webpage: https://gis.usps.com/arcgis/rest/services/EDDM/selectZIP/GPServer/routes/execute?f=json&env%3AoutSR=102100&ZIP=94601&Rte_Box=R&UserName=EDDM

The data can be retrieved one ZIP code at a time. Since the ArcGIS script is called "selectZIP," it seems unlikely that bulk access is possible through online means. We have initiated a FOIA request for the data; we expect response of USPS to provide valuable information (how they generate the data, what the terms of use are, &c).

Advantages

In addition to the requirements above, this data source has the following advantages:

1. Explicitly accessible by ZIP

Unlike census data, the EDDM data comes with points assigned to ZIP codes. Because ZIPs are explicitly not defined as polygons, we believe it is easier to move in this direction (ZIP → census) than vice versa. Because the source is USPS, we expect this information to be accurate and up-to-date.

2. Reliability

Unlike other open source address data (e.g., Openaddresses.io, OpenStreetMap, DoT National Address Database), the EDDM data appears to closely correspond to the truth "on the ground." When the EDDM tool has a point, there typically is a building there (as validated by satellite imagery).

3. Has population estimates

The USPS has implemented a crosswalk with census data to obtain these estimates. Using this data source obviates the need to reinvent the wheel and perform the crosswalk manually.

Limitations

No data source is perfect. The following are some potential concerns with the using data scraped from the EDDM tool.

1. Missing ZIP codes

Not every ZIP code is present in the EDDM tool. For example, zip code 92055 corresponds to Camp Pendleton, a Marine Corps base. It makes sense that the USPS would not allow for commercial mailers to target this ZIP for advertising campaigns.

There are ZIP codes found in other sources such as Google Maps that are missing from the EDDM tool entirely. Examples include 94615, 94617, 94622, 94649, 94659. These could be deprecated ZIPs no longer in use, or PO boxes for which EDDM is unavailable for whatever reason (similar to Camp Pendleton).

2. ZIP codes with one point

In certain rural ZIPs, every resident gets their mail at the nearest post office: There are no mailing routes. This means that the EDDM tool will have only a single point (corresponding to the PO Box) for the entire ZIP. Examples include 95915, 93627, 95426, 96006, 95701.

These ZIPs can cover large areas, so representing them with a single point is problematic when performing adequacy calculations.

3. Lack of census information

Because the points come from USPS, they do not come tagged with relevant census-type information (e.g., county, census block, etc.). Such information has to be determined by intersecting geometries.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Table of Contents

Model Selection and Validation

Problem Statement

Facility Location Problems

Farthest-First Traversal Algorithm

Next Steps

Data Sources

Access

Advantages

1. Explicitly accessible by ZIP

2. Reliability

3. Has population estimates

Limitations

1. Missing ZIP codes

2. ZIP codes with one point

3. Lack of census information

Files

README.md

Latest commit

History

README.md

File metadata and controls

Table of Contents

Model Selection and Validation

Problem Statement

Facility Location Problems

Farthest-First Traversal Algorithm

Next Steps

Data Sources

Access

Advantages

1. Explicitly accessible by ZIP

2. Reliability

3. Has population estimates

Limitations

1. Missing ZIP codes

2. ZIP codes with one point

3. Lack of census information