Mechanics of Spatial Data

We kicked off the chapter with two examples that didn’t require too many new concepts, but it’s time to backtrack a bit and properly cover the mechanics of working with spatial data.

The data types and operations are extremely well standardized by the Open Geospatial Consortium. Nearly all of the operations below have identical behavior within Oracle, PostGIS, SQL Server, and other industrial-strength geospatial systems. In fact, the geospatial toolkits for Pig (Pigeon) and Hive (Esri-SFFH) are particularly simpatico, as they both use Esri’s wonderful Esri Geometry API under the hood.

Spatial Data Types

  • Point — a single location in space, given by its horizontal, then vertical coordinates. That’s an easy convention to swallow when you think in terms of x, y, but it also means you should always list coordinates in the order longitude first, then latitude. Get in the habit of always using that ordering.

  • LineString — a single continuous path, described as an ordered sequence of points. To describe a closed path, repeat the line’s start point as its end point. A path is 'simple' if it does not cross or touch itself; a path is a 'ring' if it is both simple and closed.

  • Polygon — a connected surface in space, described by at least one closed simple path defining its exterior, and zero, one, or many non-crossing rings defining any interior holes. The exterior ring is always listed first, and no ring is permitted to cross or touch itself or any other ring.

  • MultiPoint — a collection of points regarded as a single shape.

  • MultiLineString — a collection of lines regarded as a single shape. Although a Polygon also has multiple chains of coordinates, a Polygon is not a MultiLineString. Most importantly, a Polygon represents a 2-D shape with an interior; a MultiLineString represents a collection of 1-D shapes. What’s more, the line strings defining a polygon must be non-intersecting rings, while the elements of a LineString or MultiLineString are permitted to be either open or closed, and may cross or touch.

  • MultiPolygon — you guessed it, a collection of polygons regarded as a single shape. These polygons are allowed to overlap, lie within each other, or anything else they want to do.

  • Envelope — an axis-aligned rectangle depicting the minimum and maximum extent of a shape in each coordinate. Since its sides are aligned with the axes, we only have to give the coordinates of two of its corners. From the perspective of the geometry libraries this does not live in the same type hierarchy as the geometry objects above, but it’s easy enough to generate the polygon corresponding to an envelope or the envelope of any shape. Any time you’re specifying a bounding box, follow the convention of numerically-lowest-coordinates then numerically-highest-coordinates, i.e. ( (min_x, min_y), (max_x, max_y) ). Like the longitude-then-latitude convention, it’s violated just often enough to drive you crazy.
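To make the envelope convention concrete, here is a minimal Python sketch. The helper names `envelope` and `envelope_to_ring` are our own illustrations, not functions from Pigeon or the Esri Geometry API, and the sample coordinates are made up. It computes the numerically-lowest-then-numerically-highest corners from a list of (longitude, latitude) pairs and then builds the closed ring of the corresponding rectangle.

```python
def envelope(coords):
    """Bounding box of (x, y) pairs as ((min_x, min_y), (max_x, max_y))."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys)), (max(xs), max(ys))


def envelope_to_ring(env):
    """Closed exterior ring of the polygon corresponding to an envelope."""
    (min_x, min_y), (max_x, max_y) = env
    # Repeat the first corner at the end so the path is closed.
    return [(min_x, min_y), (min_x, max_y), (max_x, max_y),
            (max_x, min_y), (min_x, min_y)]


# Illustrative (longitude, latitude) pairs: x comes first, then y.
path = [(-97.7, 30.3), (-122.4, 37.8), (-87.6, 41.9)]
env = envelope(path)
ring = envelope_to_ring(env)
```

Only two corners are stored, which is why the envelope is so cheap to keep alongside every shape.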

Those are the essential data types used by geospatial libraries everywhere. However, when adapting geospatial methods to Pig there are really only two families of shapes to consider:

  • Points, which lack spatial extent

  • Regions (i.e. all geometries that are not of type Point), which span more than one location in space

The first example in this chapter, we were careful to clarify, only covered spatial aggregations of points. As soon as we’ve walked through the core mechanics, we’ll demonstrate the same pattern but for spatial aggregations of regions, and you’ll see the important difference that causes.

Note
Some terminology notes: We’ll use the term 'geometry' to mean any of our internal data structures: Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon. The term 'shape' refers to the geometric shape it describes: one or many points, lines or polygons. Don’t read too much into the distinction; mostly, it gives us another word to use because otherwise our sentences are just "blah blah geo geometry geo blah geogeometry blahgeo the geoblah". We’ll refer to Point, LineString, and Polygon collectively as singular-types, and MultiPoint, MultiLineString, MultiPolygon as multi-types. And to say "either a point or a multi-point", we’ll write '(multi)point' (or similarly '(multi)line', '(multi)polygon', '(multi)geometry') — "the dimension of a (multi)line is one, and the dimension of a (multi)polygon is two". (REVIEWER: which is better, '(multi)point', or 'point/multi-point'?)

Spatial Operations in Pig

Fully explaining and exploring each of the core spatial operations would be a waste of your time and shelf space. For a deep understanding of how they work, you’re better off consulting traditional GIS resources — what we’re after here is to show you how to use the map/reduce framework to coordinate those core operations. On the other hand, once you know that the operation for "take an object and make a new shape that covers all the points within 100 meters of it" is spelled GeoBuffer(geom), you’re 75% of the way there. And reading other resources, we found that the more explanation an author supplied, the larger the amount of topology in the explanation and the less we understood it.

So we’re including here what worked for us: a semi-formal description of the operation, pretty pictures demonstrating it, and the list of important specific cases spelled out. If it feels like some bullet points repeat the same description in a different way, that’s our intention, as some people latch onto a mathematical statement and others to a physical description. We won’t use more than a handful of the operations below in the rest of the book, so you’re welcome to just skim this section now (skipping ahead to "Spatial Aggregations on Regions" (REF)) and refer back to it as a reference.

Shape Transforming Operations

We’ll start with the operations that transform a shape on its own to produce a new geometry. Some operations return a related but dissimilar object: a shape’s bounding box, the point at its "center", and so forth. Others feel more like modifications of their input: reducing the level of detail, finding the

  • GeomEnvelope — Bounding box for the given shape: an axis-aligned rectangle spanning the shape’s minimum and maximum extent in each dimension. Working with rectangles is dramatically simpler and faster than working with arbitrary shapes. This is our frontline tool for performing the rough-carpentry work of figuring out what might be relevant.

  • GeomCentroid(geom) — The point at the geometric center of a geometry. The centroid lies at the arithmetic mean of all the input coordinates, weighted by area for (multi)polygon, by line segment length for (multi)lines, or equally for (multi)points. The centroid of a point is itself; the centroid of a single line segment is its midpoint; and the centroid of an empty geometry is empty. This is used in some geometric algorithms, and on occasions where you’d like to informally represent a region as a point at its center — for example, visualizing data about a region as clickable "pushpins" on a map. Be careful, however: the centroid of a polygon is not guaranteed to lie within its interior.

  • GeomPointOnSurface(poly), PointOnSurface(multi_poly) — An arbitrary point guaranteed to lie on the surface of a given Polygon or MultiPolygon.

  • StartPoint(line), EndPoint(line), PointN(line, idx) — the first, last or n’th point on the given line. StartPoint and EndPoint are sugar for PointN(line, 0) and PointN(line, NumPoints(line) - 1) respectively, since the index counts from zero.

  • ExteriorRing(poly), InteriorRingN(poly, idx) — A LineString giving the polygon’s outermost (ExteriorRing) or n-th innermost ring (InteriorRingN), counting inward.

  • GeomBoundary(geom) — The shape separating the object’s interior from its exterior. The boundary will always have one fewer dimension than its input: a point’s boundary is empty; a line’s boundary is its start and end point; and a polygon’s boundary is a MultiLineString of its rings. The boundary of any multi-type geometry is the combined boundaries of its parts.

  • GeomConvexHull(geom) — The convex polygon (i.e. no PacMan-like indentations) that minimally covers all points in its input. Think of this as the shape a rubber band would make if you stretched it around all parts of the input geometry. The ConvexHull is always a single point, line or (usually) polygon — it is never a multi-point/line/polygon.

  • GeomSimplify(geom, tolerance) — intelligently reduces the number of points in the input by washing out deviations smaller than the given tolerance. For example, an intensely simplified circle will become a triangle. You may call Simplify with any geometry type, but it only affects (multi)lines and (multi)polygons — it will not eliminate points from a MultiPoint input.

  • GeomBuffer(geom, distance) — Shape covering the area within a given distance from the input. A Polygon for singular inputs, a MultiPolygon for multi-inputs. This is useful for doing a "within x distance" spatial join, as you saw above. However, you must be careful to use geodetic ("great circle") distances if your points are on a sphere. // IMPROVEME: explain a bit better
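The warning about centroids is easy to verify by hand. The sketch below is plain Python, not the library's implementation; `ring_centroid` is a hypothetical helper using the standard shoelace formulas for a single hole-free ring. It computes the area-weighted centroid of a U-shaped polygon, and the result lands in the notch of the U, outside the polygon itself.

```python
def ring_centroid(ring):
    """Area-weighted centroid and area of a closed, simple ring.

    `ring` repeats its first point as its last point. Uses the shoelace
    formulas, so it assumes planar (x, y) coordinates and no holes.
    """
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        cross = x1 * y2 - x2 * y1
        twice_area += cross
        cx += (x1 + x2) * cross
        cy += (y1 + y2) * cross
    area = twice_area / 2.0
    return (cx / (6.0 * area), cy / (6.0 * area)), abs(area)


# A 6x6 square with a notch (2 < x < 4, 2 < y < 6) cut into its top edge.
u_shape = [(0, 0), (6, 0), (6, 6), (4, 6), (4, 2),
           (2, 2), (2, 6), (0, 6), (0, 0)]
(centroid_x, centroid_y), area = ring_centroid(u_shape)
```

Here the centroid is (3, 19/7), which sits inside the notch. If you need a representative point that is guaranteed to be on the shape, that is the job of the point-on-surface operation.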

Constructing and Converting Geometry Objects

Somewhat related are operations that change the data types used to represent a shape.

Going from shape to coordinates-as-numbers lets you apply general-purpose manipulations.

As a concrete example (but without going into the details), to identify patterns of periodic spacing in a set of coordinates [1] you’d quite likely want to extract the coordinates of your shapes as a bag of tuples and apply a generic UDF implementing the 2-D FFT (Fast Fourier Transform) algorithm.

The files in GeoJSON, WKT, or the other geographic formats described later in this chapter (REF) produce records directly as geometry objects.

There are functions to construct Point, MultiPoint, LineString, …​ objects from coordinates you supply, and counterparts that extract a shape’s coordinates as plain-old-Pig-objects.

  • Point / MultiPoint / LineString / MultiLineString / Polygon / MultiPolygon — construct given geometry.

  • GeoPoint(x_coord, y_coord) — constructs a Point from the given coordinates

  • GeoEnvelope( (x_min, y_min), (x_max, y_max) ) — constructs an Envelope object from the numerically lowest and numerically highest coordinates. Note that it takes two tuples as inputs, not naked coordinates.

  • GeoMultiToBag(geom) — splits a (multi)geometry into a bag of simple geometries. A MultiPoint becomes a bag of Points; a Point becomes a bag with a single Point, and so forth.

  • GeoBagToMulti(geom) — combines a bag of geometries into a single multi geometry. For instance, a bag with any mixture of Point and MultiPoint geometries becomes a single MultiPoint object, and similarly for (multi)lines and (multi)polygons. All the elements must have the same dimension — no mixing (multi)points with (multi)lines, etc.

  • FromWKText(chararray), FromGeoJson(chararray) — converts the serialized description of a shape into the corresponding geometry object. We’ll cover these data formats a bit later in the chapter. Similarly, ToWKText(geom) and ToGeoJson(geom) serialize a geometry into a string.
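The split-and-recombine behavior described in the list above can be sketched with ordinary Python dicts shaped like GeoJSON geometries. The functions `multi_to_bag` and `bag_to_multi` below are stand-ins we wrote for illustration; they mimic the described behavior of GeoMultiToBag and GeoBagToMulti but are not the Pig UDFs themselves, and a Python list plays the role of the bag.

```python
SINGULAR_TO_MULTI = {
    "Point": "MultiPoint",
    "LineString": "MultiLineString",
    "Polygon": "MultiPolygon",
}
MULTI_TO_SINGULAR = {multi: single for single, multi in SINGULAR_TO_MULTI.items()}


def multi_to_bag(geom):
    """Split a (multi)geometry into a list of singular geometries."""
    if geom["type"] in SINGULAR_TO_MULTI:
        return [geom]  # a singular geometry becomes a one-element bag
    single = MULTI_TO_SINGULAR[geom["type"]]
    return [{"type": single, "coordinates": c} for c in geom["coordinates"]]


def bag_to_multi(geoms):
    """Combine same-dimension geometries into one multi geometry."""
    parts = [part for g in geoms for part in multi_to_bag(g)]
    kinds = {part["type"] for part in parts}
    if len(kinds) != 1:
        raise ValueError("all elements must share one dimension, got %r" % kinds)
    return {"type": SINGULAR_TO_MULTI[kinds.pop()],
            "coordinates": [part["coordinates"] for part in parts]}


bag = [
    {"type": "Point", "coordinates": [102.0, 0.5]},
    {"type": "MultiPoint", "coordinates": [[10.0, 2.0], [8.0, 20.0]]},
]
merged = bag_to_multi(bag)
```

Mixing a Point with a LineString in the same bag raises an error, matching the "no mixing" rule.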

Properties of Shapes

  • GeoArea(geom)

  • MinX(geom), MinY(geom), MaxX(geom), MaxY(geom) — the numerically least and greatest extent of a shape in the specified dimension.

  • GeoX(point), GeoY(point) — X or Y coordinates of a point

  • GeoLength(geom)

  • GeoLength2dSpheroid(geom) — Calculates the 2D length of a linestring/multilinestring on an ellipsoid. This is useful if the coordinates of the geometry are in longitude/latitude and a length is desired without reprojection.

  • GeoDistance(geom_a, geom_b) — the 2-dimensional cartesian minimum distance (based on spatial ref) between two geometries in projected units.

  • GeoDistanceSphere(geom_a, geom_b) — Returns minimum distance in meters between two lon/lat geometries. Uses a spherical earth and radius of 6370986 meters.
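To see why the spherical distance matters, here is a small Python sketch of a great-circle distance between two points using the haversine formula and the same 6370986 meter radius quoted above. We are assuming haversine for illustration; we have not confirmed which formula the library uses, and `distance_sphere` is our own name.

```python
import math

EARTH_RADIUS_M = 6370986.0  # spherical-earth radius quoted in the text


def distance_sphere(lon1, lat1, lon2, lat2, radius=EARTH_RADIUS_M):
    """Great-circle distance in meters between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))


# One degree of longitude, measured at the equator and at 60 degrees north.
at_equator = distance_sphere(0.0, 0.0, 1.0, 0.0)
at_60_north = distance_sphere(0.0, 60.0, 1.0, 60.0)
```

A degree of longitude at 60 degrees north covers roughly half the ground it does at the equator, which is exactly what a naive cartesian distance on lon/lat values gets wrong.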

There is also a set of meta-operations that report on the geometry objects representing a shape:

  • Dimension(geom) — This operation returns zero for Point and MultiPoint; 1 for LineString and MultiLineString; and 2 for Polygon and MultiPolygon, regardless of whether those shapes exist in a 2-D or 3-D space

  • CoordDim(geom) — the number of axes in the coordinate system being used: 2 for X-Y geometries, 3 for X-Y-Z geometries, and so on. Points, lines and polygons within a common coordinate system will all have the same value for CoordDim

  • GeometryType(geom) — string representing the geometry type: 'Point', 'LineString', …​, 'MultiPolygon'.

  • IsGeoEmpty(geom) — 1 if the geometry contains no actual points.

  • IsGeoClosed(line) — 1 if the given LineString’s end point meets its start point.

  • IsGeoSimple — 1 if the geometry has no anomalous geometric aspects, such as intersecting or being tangent to itself. A multipoint is 'simple' if none of its points coincide.

  • IsLineRing — 1 if the given LineString is a ring — that is, closed and simple.

  • NumGeometries(geom_collection)

  • NumInteriorRings(poly)

Operations that Combine Shapes

The power players of our toolkit are operations that combine shapes to produce new ones, most prominently set operations such as intersection or union. These behave similarly to the set operations on elements in a bag that we explored in chapter (REF), because the underlying mathematics are the same. But whereas the sets in those operations were the elements in two given bags, these operations apply to the topological point sets that our geometry objects represent.

  • GeoUnion(geom_a, geom_b) — geometry representing the merger of the two shapes. A region is within the result if and only if it is within either input.

  • GeoIntersection(geom_a, geom_b) — geometry representing the intersection of the two shapes. A region is within the result if and only if it is within both inputs.

  • GeoDifference(geom_a, geom_b) — geometry representing the portion of the first shape excluding the extent of the second shape. A region is within the result if and only if it is within the first input but not the second.

  • GeoSymmDifference(geom_a, geom_b) — geometry representing the portion of either shape that is not within the other shape. A region is within the result if and only if it is within one but not both inputs.
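The four set operations relate to each other exactly as they do for ordinary sets, which you can check with areas. The Python sketch below is restricted to axis-aligned rectangles given as (min_x, min_y, max_x, max_y), because the intersection of two such rectangles is again a rectangle. That restriction is ours, chosen to keep the example short; real libraries handle arbitrary shapes, and the function names here are illustrative.

```python
def rect_area(rect):
    min_x, min_y, max_x, max_y = rect
    return (max_x - min_x) * (max_y - min_y)


def rect_intersection(a, b):
    """Overlapping rectangle, or None if the interiors do not overlap."""
    min_x, min_y = max(a[0], b[0]), max(a[1], b[1])
    max_x, max_y = min(a[2], b[2]), min(a[3], b[3])
    if min_x >= max_x or min_y >= max_y:
        return None  # disjoint, or touching only along an edge or corner
    return (min_x, min_y, max_x, max_y)


def set_operation_areas(a, b):
    """Areas of the union, intersection, difference and symmetric difference."""
    overlap = rect_intersection(a, b)
    shared = rect_area(overlap) if overlap else 0.0
    return {
        "intersection": shared,
        "union": rect_area(a) + rect_area(b) - shared,
        "difference": rect_area(a) - shared,  # a minus b
        "symm_difference": rect_area(a) + rect_area(b) - 2 * shared,
    }


a = (0.0, 0.0, 4.0, 4.0)
b = (2.0, 2.0, 6.0, 6.0)
areas = set_operation_areas(a, b)
```

Note that the union and the differences of two rectangles are generally not rectangles, which is why this sketch reports only their areas.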

Testing the Relationship of Two Shapes

The geospatial toolbox has a set of precisely specified spatial relationships. They each represent a set of constraints on how the boundary, interior, and exterior of one geometry relate to the boundary, interior, and exterior of the other geometry. Our caveat at the top of the chapter about the difficulty of describing these operations correctly without explaining them into incoherence is especially true here. For best results, grab the scripts from the sample code repository (REF) and try various cases.

  • Equals(geom_a, geom_b) — 1 if the shapes are equal.

  • OrderingEquals(geom_a, geom_b) — 1 if the shapes are equal and their coordinates have the same ordering

  • Intersects(geom_a, geom_b) — 1 if the shapes intersect: at least one point from the boundary or interior of one shape lies on the boundary or interior of the other.

  • Disjoint(geom_a, geom_b) — 1 if the shapes do not intersect. This operation is sugar for (Intersects(geom_a, geom_b) == 0 ? 1 : 0).

  • EnvIntersects(geom_a, geom_b) — 1 if the bounding envelopes of the two shapes intersect.

  • Contains(geom_a, geom_b) — 1 if geom_a completely contains geom_b: that is, the shapes' interiors intersect, and no part of geom_b lies in the exterior of geom_a. If two shapes are equal, then it is true that each contains the other. Contains(A, B) is exactly equivalent to Within(B, A).

  • Within(geom_a, geom_b) — 1 if geom_a is completely contained by geom_b: that is, the shapes' interiors intersect, and no part of geom_a lies in the exterior of geom_b. If two shapes are equal, then it is true that each is within the other.

  • Covers(geom_a, geom_b) — 1 if no point in geom_b is outside geom_a. CoveredBy(geom_a, geom_b) is sugar for Covers(geom_b, geom_a). (TODO: verify: A polygon covers its boundary but does not contain its boundary.)

  • Crosses(geom_a, geom_b) — 1 if the shapes cross: their geometries have some, but not all, interior points in common; and the dimension of the intersection is one less than the higher-dimension of the two shapes. That’s a mouthful, so let’s just look at the cases in turn:

    • A MultiPoint crosses a (multi)line or (multi)polygon as long as at least one of its points lies in the other shape’s interior, and at least one of its points lies in the other shape’s exterior. Points along the border of the polygon(s) or the endpoints of the line(s) don’t matter.

    • A Line/MultiLine crosses a Polygon/MultiPolygon only when part of some line lies within the polygon(s)' interior and part of some line lies within the polygon(s)' exterior. Points along the border of a polygon or the endpoints of a line don’t matter.

    • A Line/MultiLine crosses another Line/MultiLine only when the intersection of their interiors consists of one or more points, but no line segments. The endpoints of the lines don’t matter.

    • A Point is never considered to cross any other shape, since you need part of one shape to lie outside the other.

    • A Polygon/MultiPolygon is never considered to cross a Polygon/MultiPolygon, since if their interiors intersect anywhere it is necessarily in a Polygon (and thus not of lower dimension).

  • Overlaps(geom_a, geom_b) — 1 if the shapes overlap: their intersection has the same dimension as, but is not equal to, either of the given objects.

  • Touches(geom_a, geom_b) — 1 if the shapes touch: their interiors do not intersect but the boundary of one object intersects the interior or boundary of the other.
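A common pattern with these predicates is to run the cheap envelope test first and the exact test only on the survivors. The Python sketch below illustrates that idea for the simplest case, a point against a polygon with holes, using a ray-casting test. The function names are ours, not the library's, and points lying exactly on a boundary are not treated rigorously here, which is precisely the subtlety that separates Contains from Covers.

```python
def env_intersects(a, b):
    """True if two (min_x, min_y, max_x, max_y) envelopes share any point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def point_in_ring(point, ring):
    """Ray casting: count edge crossings of a ray running to the right."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > y) != (y2 > y):  # this edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def polygon_contains_point(rings, point):
    """`rings` lists the exterior ring first, then any holes."""
    exterior, holes = rings[0], rings[1:]
    if not point_in_ring(point, exterior):
        return False
    return not any(point_in_ring(point, hole) for hole in holes)


# A rectangle from (0,0) to (8,20) with a 2x2 hole around its center.
rings = [
    [(0.0, 0.0), (0.0, 20.0), (8.0, 20.0), (8.0, 0.0), (0.0, 0.0)],
    [(3.0, 9.0), (3.0, 11.0), (5.0, 11.0), (5.0, 9.0), (3.0, 9.0)],
]
in_body = polygon_contains_point(rings, (1.0, 1.0))
in_hole = polygon_contains_point(rings, (4.0, 10.0))
outside = polygon_contains_point(rings, (9.0, 1.0))
```

The point in the hole passes the envelope test but fails the exact test, which is why the envelope check can only ever be a first filter.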

Warning
The Pig and Hive libraries are fairly new — in fact, a large part of the Pig methods described here were contributed by your authors — so don’t be surprised to find functionality that hasn’t been implemented yet. In particular, 3-D and higher geometries are poorly supported; CRS (coordinate reference system) awareness is weak and the catalogue of map projections is small; and many opportunities for optimization remain.

Data Formats

Let’s take a moment to look at the different file formats used for geographic data. Each has particular tradeoffs that may lead you to make different choices than we have.

GeoJSON

GeoJSON is a new but well-thought-out geodata format, able to represent arbitrary-dimensional geometries in a way that translates nicely to other leading geospatial formats. Its principal advantage is that it is in all respects a JSON file, compatible with any system capable of reading and writing JSON (which is by now most systems). A GeoJSON-aware system will load the coordinates field as a shape, but to anything else it’s still recognizable as a regular old array. Ironically, the place you’re most likely to have a compatibility fail is with traditional GIS systems; due to its young age many GIS systems will lack GeoJSON drivers. However, it’s quite easy to convert to and from GeoJSON (see "Converting Among Geospatial Data Formats" below (REF)). The other drawbacks of GeoJSON are those common to any JSON format: it’s not particularly space-efficient, and you must parse the whole object before using any part of it. As we’ll mention several times, data compression makes the space-efficiency matter less than you think, and compared to disk throughput and the cost of geospatial operations, parsing JSON is faster than you think.

If you won’t always want to use the geometry data, however, you may also choose to use a TSV format with embedded JSON, as we did with WKT/TSV above. Serialize the feature’s raw geometry object into a field as GeoJSON, and its properties into individual fields as usual. JSON strings cannot contain a raw tab or newline, and whitespace between JSON tokens is optional, so a geometry serialized without pretty-printing is perfectly safe to store in a TSV field. The FromGeoJson UDF (note capitalization) in Pig will convert a GeoJSON geometry into a geometry object.
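Here is a minimal Python sketch of that TSV-with-embedded-GeoJSON layout. The three-column schema (id, name, geometry) and the function names are assumptions made for this example only. The compact `separators` argument guarantees there is no whitespace between JSON tokens, and `json.dumps` escapes any tab or newline inside a string.

```python
import json


def feature_to_tsv(feature_id, name, geometry):
    """One TSV record: plain property fields, then the geometry as JSON."""
    geom_json = json.dumps(geometry, separators=(",", ":"))
    return "\t".join([str(feature_id), name, geom_json])


def tsv_to_feature(line):
    """Inverse of feature_to_tsv; parses the geometry only when asked to."""
    feature_id, name, geom_json = line.rstrip("\n").split("\t")
    return feature_id, name, json.loads(geom_json)


geometry = {"type": "Point", "coordinates": [102.0, 0.5]}
line = feature_to_tsv(17, "example place", geometry)
round_trip = tsv_to_feature(line)
```

Jobs that never touch the geometry can carry the third field along as an opaque string and skip the JSON parse entirely.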

In all, GeoJSON makes an excellent interchange format among different data analysis systems and is a sound choice for development and exploratory analytics.

A GeoJSON geometry defines only the shape’s type, and coordinates. A GeoJSON feature simply contains both a geometry object and a properties object holding string-key / arbitrary-value pairs according to any scheme you design. Additionally, any GeoJSON object can optionally specify a string identifier (id), its bounding box (bbox) and the Coordinate Reference System (crs) that should be used to interpret its coordinates. Wrap an array of features in a FeatureCollection object and you’re ready to map! Here is an example GeoJSON feature collection:

  {
    "type": "FeatureCollection",
    "features": [
      { "type":       "Feature",
        "properties": {"prop0": "value0"},
        "geometry":   {"type": "Point", "coordinates": [102.0, 0.5]}
      },
      { "type":       "Feature",
        "properties": {"prop0": "value0"},
        "geometry":   {"type": "LineString", "coordinates": [[10.0, 2.0],[102.0, 0.5]]}
      },
      { "type":       "Feature",
        "properties": {
          "prop0":    "value0",
          "prop1":    {"this": "that"}
        },
        "bbox":       [0.0,0.0,8.0,20.0],
        "geometry": {
          "type":     "Polygon",
          "coordinates": [
            [ [0.0,0.0],[0.0,20.0],[8.0,20.0],[8.0,0.0],[0.0,0.0] ],
            [ [3.0,9.0],[3.0,11.0],[5.0,11.0],[5.0,9.0],[3.0,9.0] ]
          ]
        }
      }
    ]
  }

We pretty-printed that example along multiple lines, but in practice you will want to treat each GeoJSON feature as an independent JSON object, each on its own line. You can restore such a file to the status of GeoJSON feature collection by replacing all newlines that precede a record with a comma, then stapling {"type": "FeatureCollection","features":[ and ]} to the file’s front and back.
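The stapling recipe takes only a few lines of Python. This sketch shows both the text-level version described above and an equivalent version that parses each line; the function names and the two sample features are ours.

```python
import json


def staple_feature_collection(lines):
    """Text-level recipe: join records with commas, add a front and a back."""
    records = [line.strip() for line in lines if line.strip()]
    return '{"type": "FeatureCollection","features":[' + ",".join(records) + "]}"


def parse_feature_collection(lines):
    """Same result, built by parsing each line as its own JSON object."""
    features = [json.loads(line) for line in lines if line.strip()]
    return {"type": "FeatureCollection", "features": features}


lines = [
    '{"type":"Feature","properties":{"prop0":"value0"},'
    '"geometry":{"type":"Point","coordinates":[102.0,0.5]}}\n',
    '{"type":"Feature","properties":{"prop0":"value0"},'
    '"geometry":{"type":"LineString","coordinates":[[10.0,2.0],[102.0,0.5]]}}\n',
]
stapled = json.loads(staple_feature_collection(lines))
parsed = parse_feature_collection(lines)
```

The text-level version never parses a record, so it works as a streaming step over files of any size.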

GeoJSON geometries encode the full set of primitives we like to use:

  • For Point geometries, just supply an array in x,y order: [longitude, latitude]

  • For LineString paths, supply an array of points in order. A LineString having the same initial and final coordinates will be interpreted as a closed path.

  • Polygon objects are specified using an array of rings, each of which is an array of points. You must repeat the first point in each ring to make it a closed path. The Polygon in the example above describes a rectangle from (0,0) to (8,20), with a 2x2 hole in its center. The first array is the outer ring; other paths in the array are interior rings or holes. For example, South Africa’s outer border would be the first ring in the array, followed by the coordinates of the inner ring delimiting Lesotho (an independent country lying completely within South Africa). Regions with multiple parts such as Hawaii or Denmark require a MultiPolygon instead.

  • The MultiPoint/MultiLineString/MultiPolygon types expect an array of the coordinates as appropriate for the singular type. It’s fine to supply a Multi type with only one element, but you must have it enclosed in an array: {"type": "Point", "coordinates": [102.0, 0.5]} and {"type": "MultiPoint", "coordinates": [[102.0, 0.5]]} should behave equivalently.

  • For ease of processing, you can attach a Bounding Box (bbox) annotation to any GeoJSON object. Supply the coordinates in [left, bottom, right, top] order — that is, [xmin, ymin, xmax, ymax]. A bbox is not an independent geometry: it is an annotation on another geometry.

The GeoJSON standard is as readable a specification as you’ll see, so refer to it for anything deeper than we cover here.

Well-Known Text + Tab-Separated Values

At this point in the book you’ve long since either quit reading in disgust, or you’ve gotten used to the idea that until performance concerns demonstrate otherwise, in most cases the best data format is the silly-seeming TSV (tab-separated values) scheme. To restate its tradeoffs, a TSV file is easily inspectable, travels anywhere, and can be manipulated from the commandline as plain-text. It’s restartable (the damage from a corrupt record lasts only until the next newline), doesn’t require special quoting or escaping, and is trivial to parse. Representing numbers in decimal gives mediocre space-efficiency, but keep in mind that a well-configured hadoop cluster (REF) compresses most data as it hits the disk, and so the overhead is nowhere near as large as it might appear. By now you’re quite comfortable working around the lack of complex types and need for an explicit schema.

Given that, we’ll continue to tax your credulity and advise that until performance concerns demonstrate otherwise, the primitive but oh-so-simple Well-Known Text format is the right choice for geospatial data.

WKT encodes our familiar primitives: Point, LineString, Polygon and MultiPoint, MultiLineString, MultiPolygon. (There are additional geometries for specifying circles, triangle surfaces, meshes and parameterized curves, but we won’t get into those.) A WKT object is given by simply stating the geometry type followed by its comma-separated coordinates within parentheses. Whitespace is ignored and any other content is disallowed. Here’s an example (the newlines and the spacing around the parentheses are purely ornamental):

POINT (102.0 0.5)
LINESTRING (10.0 2.0, 102.0 0.5)
POLYGON (
    (0.0 0.0, 0.0 20.0, 8.0 20.0, 8.0 0.0, 0.0 0.0),
    (3.0 9.0, 3.0 11.0, 5.0 11.0, 5.0 9.0, 3.0 9.0) )
MULTIPOLYGON ( (
    (0.0 0.0, 0.0 20.0, 8.0 20.0, 8.0 0.0, 0.0 0.0),
    (3.0 9.0, 3.0 11.0, 5.0 11.0, 5.0 9.0, 3.0 9.0) ) )
MULTIPOLYGON (
    ((0.0 0.0, 0.0 20.0, 8.0 20.0, 8.0 0.0, 0.0 0.0)),
    ((3.0 9.0, 3.0 11.0, 5.0 11.0, 5.0 9.0, 3.0 9.0)) )

Coordinate pairs are given in longitude/latitude (x/y) order (hopefully you’ve begun to internalize that convention), with a space between the two values. Each string of points, whether a linestring path or a ring in a polygon, is comma-delimited and wrapped in its own set of parentheses.
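Because the two formats share the same primitives, translating a GeoJSON-style geometry into WKT is mostly a matter of swapping brackets for parentheses. The Python sketch below is a hand-rolled illustration covering only a few types; `to_wkt` is our own name, and for real work you would use the library's ToWKText or a GDAL conversion instead.

```python
def _fmt_point(point):
    # Space-separated coordinates, x (longitude) first.
    return " ".join(repr(float(c)) for c in point)


def _fmt_path(path):
    # One comma-delimited string of points in its own parentheses.
    return "(" + ", ".join(_fmt_point(p) for p in path) + ")"


def to_wkt(geom):
    """Serialize a GeoJSON-style geometry dict as Well-Known Text."""
    kind, coords = geom["type"], geom["coordinates"]
    if kind == "Point":
        return "POINT (" + _fmt_point(coords) + ")"
    if kind == "LineString":
        return "LINESTRING " + _fmt_path(coords)
    if kind == "Polygon":
        return "POLYGON (" + ", ".join(_fmt_path(r) for r in coords) + ")"
    if kind == "MultiPolygon":
        polys = ("(" + ", ".join(_fmt_path(r) for r in poly) + ")" for poly in coords)
        return "MULTIPOLYGON (" + ", ".join(polys) + ")"
    raise ValueError("unsupported geometry type: %s" % kind)


point_wkt = to_wkt({"type": "Point", "coordinates": [102.0, 0.5]})
line_wkt = to_wkt({"type": "LineString", "coordinates": [[10.0, 2.0], [102.0, 0.5]]})
poly_wkt = to_wkt({"type": "Polygon", "coordinates": [
    [[0.0, 0.0], [0.0, 20.0], [8.0, 20.0], [8.0, 0.0], [0.0, 0.0]],
    [[3.0, 9.0], [3.0, 11.0], [5.0, 11.0], [5.0, 9.0], [3.0, 9.0]],
]})
```

Notice how each ring of the polygon gets its own parentheses, and the whole ring list gets one more set around it.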

All the objections to TSV weigh in against WKT as well — it is unsophisticated, not terribly space efficient, and seems clunky at first use. But all the advantages carry over too — it’s commandline friendly, travels anywhere, can be manipulated even in the absence of a parser. For development use, we generally like to work with TSV files holding shapes as embedded WKT fields.

Well-Known Binary (WKB)

Well-Known Binary (WKB) is a straightforward binary encoding of Well-Known Text (WKT), and translating between the two is easy. You’ll give up the direct access and commandline friendliness, but WKB is nearly as widely understood as WKT and has exactly the same capabilities. Since WKB is more space-efficient and somewhat faster to decode, you may wish to move to it for production work or high-scale applications.
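To give a feel for how plain the binary encoding is, this Python sketch packs and unpacks a 2-D Point as WKB with the standard `struct` module: one byte-order flag (1 for little-endian), a 32-bit geometry type code (1 for Point), and two 64-bit floats. It handles only that one case and the function names are ours; treat it as an illustration of the layout rather than a WKB library.

```python
import struct

WKB_LITTLE_ENDIAN = 1
WKB_POINT = 1


def point_to_wkb(x, y):
    """Encode a 2-D point as 21 bytes of little-endian WKB."""
    return struct.pack("<BIdd", WKB_LITTLE_ENDIAN, WKB_POINT, x, y)


def wkb_to_point(buf):
    """Decode a 2-D WKB point, honoring its byte-order flag."""
    order = "<" if buf[0] == WKB_LITTLE_ENDIAN else ">"
    geom_type, x, y = struct.unpack(order + "Idd", buf[1:21])
    if geom_type != WKB_POINT:
        raise ValueError("not a WKB Point: type code %d" % geom_type)
    return x, y


wkb = point_to_wkb(102.0, 0.5)
decoded = wkb_to_point(wkb)
```

The same point as `POINT (102.0 0.5)` takes 17 characters of text, so for a single point the savings are nil; the payoff comes with long coordinate lists, where every value costs exactly eight bytes and needs no decimal parsing.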

Other Important Formats

The preceding sections describe all the file formats we’ve found worthwhile for use within Hadoop, but Wikipedia lists several dozen other geospatial file formats. It’s worth calling out a few others you’ll encounter.

The Shapefile (aka Esri Shapefile or Arcview Shapefile) format is a complex and powerful geospatial vector data format, ubiquitous in the traditional GIS world. Like GeoJSON, it can represent both shapes and metadata, shares the same (Multi)Point/(Multi)Line/(Multi)Polygon primitives, and can handle two- or three-dimensional data [2]. Unlike GeoJSON, it’s not a useful interchange format: although every GIS system will have Shapefile facilities, few applications outside of that realm will. Don’t go near the specification: it’s incredibly complex and only mostly-specified, and there are excellent open-source libraries for working with shapefiles that will give far better results than anything you should attempt. A shapefile is actually a collection of multiple files: a .shp, a .shx and a .dbf file and potentially others as well. Each collection is intended to represent a single layer of data and so can only contain a single geometry type: you cannot combine airports (point), flight paths (lines) and air traffic control zones (polygons) in the same file. Those limitations (multiple files and homogeneous layers) make it a poor choice for representing data on your cluster.

TopoJSON is a companion format to GeoJSON, specifically optimized for data visualization in the browser. (The "Map Viewer for Chimps" tool that we’ve distributed uses it for reference data.) The other data formats described here represent regions independently (closed squarish polygons for Colorado, Utah, Arizona, etc.), so every border shared by two regions is stored twice. Those duplicate paths cause excess storage size and redundant data processing; in the worst case, numerical error can cause borders that should be coincident to stray, leading to visual artifacts and incorrect results. TopoJSON instead maintains the mesh of edges that define those polygons, along with metadata to recover the original regions. Its pre-constructed mesh avoids those problems and makes many tasks possible or simpler, such as cartograms (independent rescaling of each shape based on an attribute) or geometry simplification (eliminating fine-grained detail for rendering). At present, its principal adoption is limited to the d3 JavaScript library. Everything we see indicates that d3 is emerging as the best toolkit for lightweight data visualization primitives, and we expect increasing adoption of its byproducts. However, the same mesh representation that is great for rendering is exactly wrong for our use. We want to be able to peel shapes apart and send them to the correct context group. We want to store a shape and its associated data in the same record on disk (as GeoJSON and TSV+WKT do), both to accommodate file splitting and to enable processing as plain data. The TopoJSON project (github.com/mbostock/topojson) has tools for converting TopoJSON to and from GeoJSON, ESRI Shapefiles, and a few other formats.

Keyhole Markup Language is the XML-based format used by Google Earth. Keyhole, a company acquired by Google, built both the core of Google’s online geographic offering and an internet community of enthusiasts who curate geolocation and 3-D models of earth features. The signal-to-noise ratio is often low (and Google occasionally gaslights file locations), but with patience you can find some fairly remarkable data sets under open licenses through Google Earth or the surviving Keyhole community. KML files are distributed with either a .kml extension (plaintext XML) or with a .kmz extension (a ZIP 2.0-compressed bundle containing that .kml file). You should not build your data flows around KML. It’s first of all a bad choice for high-scale data analysis in all the ways that any XML-based format is a bad choice — see our "Airing of Grievances" in chapter (REF). KML is even less compact than GeoJSON and lacks the ability to specify a coordinate reference system. Though some traditional GIS applications will import KML, they’re just as likely to accept GeoJSON; outside of the GIS world, generic JSON is far superior to generic XML. Seek out and import, yes, but otherwise avoid working with KML.

OpenStreetMap (OSM) is one of the crown jewels of the open data movement: a massive database of places, roads and spatial data, community-generated and available under an open license. Together with GeoCommons and Natural Earth, it means anyone with an internet connection can freely access geospatial data sets that used to cost millions of dollars, if they were available at all. OSM distributes its data in a variety of formats documented on its wiki, none of them useful for data analysis at any scale. See the instructions given by Michal Migurski or Jacob Perkins for how to convert the data directly in Hadoop.

Converting Among Geospatial Data Formats

As always, our advice is to pick one data format to work with and tolerate no others. Convert all foreign data formats immediately upon receipt, and produce exports (where necessary) in a separate and final step. As we advised above with WKB (REF), you may judiciously choose two formats: one for efficiency and one for interchange with other applications. But do yourself a favor and prove that the interchange format actually costs you enough money to deal with the hassle.

The open-source (MIT License) GDAL Library is a superb tool for converting among all the data formats you’ll encounter in practice. It handles not only the vector formats we’ve focused on in this chapter but also raster data, such as satellite imagery from the National Geodetic Survey, .PNG files from Google Maps and other tileservers, and so on. Closely following Mike Bostock’s "Let’s Make a Map" presentation (bost.ocks.org/mike/map/), here’s a brief demonstration of using GDAL to translate an ESRI shapefile to both GeoJSON (for use in Hadoop) and TopoJSON (for efficient rendering in D3).

Install GDAL using your system’s package manager (for Mac OSX users running homebrew it’s brew install gdal) or download binaries from the GDAL website. If running ogr2ogr --help from the commandline dumps a bewildering soup of options to the screen you’ve probably installed it correctly.

------
# Set up the directories where your data will live
datadir=~/data
mkdir -p "$datadir"/{ripd,rawd,out}
mkdir -p "$datadir/rawd/natural_earth/shp"

# Base name shared by the zip archive and the shapefile inside it
file=ne_10m_admin_1_states_provinces

# yeah, the link has the http part repeated...
wget -nc "http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/$file.zip" -O "$datadir/ripd/$file.zip"
# extract the zip file into the shp directory
(cd "$datadir/rawd/natural_earth/shp" ; unzip "$datadir/ripd/$file.zip")

# convert to GeoJSON; note that ogr2ogr takes the destination first, then the source
ogr2ogr -f GeoJSON \
  "$datadir/rawd/natural_earth/great_britain_subunits.json" \
  "$datadir/rawd/natural_earth/shp/$file.shp"
------

It's not immediately apparent how to export a TSV file containing WKT (Well-Known Text); you need to use the CSV driver with options as shown:

------
ogr2ogr -f CSV \
  -lco GEOMETRY=AS_WKT -lco SEPARATOR=TAB -lco CREATE_CSVT=YES \
  /tmp/great_britain_subunits \
  $datadir/rawd/natural_earth/shp/$file.shp
mv /tmp/great_britain_subunits/$file.csv \
  $datadir/rawd/natural_earth/great_britain_subunits.wkt.tsv
------

Although the output file lands in a subdirectory and carries a `.csv` extension, it is nonetheless a tab-separated file. The above code block exports it to a temporary location and then renames it.

Incidentally, `ogr2ogr` also offers a simple set of predicates for extracting only selected features. The following command chooses only the states within Ireland and the United Kingdom:

------
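# As before, the CSV driver writes a directory, so export to a temporary
# location (the name here is only illustrative) and then rename the result.
ogr2ogr -f CSV \
  -lco GEOMETRY=AS_WKT -lco SEPARATOR=TAB -lco CREATE_CSVT=YES \
  -where "ADM0_A3 IN ('GBR', 'IRL')" \
  /tmp/great_britain_subunits_selected \
  $datadir/rawd/natural_earth/shp/$file.shp
mv /tmp/great_britain_subunits_selected/$file.csv \
  $datadir/rawd/natural_earth/great_britain_subunits.wkt.tsv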
------

Whenever you meet a new data set listing data for the United Kingdom, check whether "admin-1" covers the state-level units (Northumberland, Liverpool, etc.) rather than the intermediate subdivisions of England, Wales, Scotland, etc. (The same caution applies to Greece, Canada and a few other countries.) We're safe here, because the Natural Earth dataset maintains separate fields for that information.

1. The methodical rows of trees in an apple orchard will appear as isolated frequency peaks oriented to the orchard plan; an old-growth forest would show little regularity and no directionality
2. but not higher, as opposed to the arbitrary dimensions available to GeoJSON