Analyzing Logs and Event Streams

////This intro here is a good model for other chapters - this one’s rough, but the bones are here. Amy////

Much of Hadoop’s adoption is driven by organizations realizing they have the opportunity to measure every aspect of their operation (website pageviews, ad clicks, purchases, sensor readings), unify those data sources, and act on the patterns that Hadoop and other Big Data tools uncover.////Share a few examples of things that can be measured (website hits, etc.) Amy////

For an e-commerce site, an advertising broker, or a hosting provider, there’s no wonder inherent in being able to measure every customer interaction, no controversy that it’s enormously valuable to uncover patterns in those interactions, and no lack of tools to act on those patterns in real time.

A political campaign can use the clickstream of interactions with each email it sends to understand and tune its message; this was one of the cardinal advantages cited in the success of Barack Obama’s 2012 campaign.

This chapter’s techniques will help, say, a hospital process the stream of data from every ICU patient; a retailer trace the entire purchase-decision process from first pageview to checkout; or a political campaign understand and tune the response to each email batch and advertising placement.

Hadoop cut its teeth at Yahoo, where it was primarily used for processing internet-sized web crawls (see the next chapter on text processing) and webserver logs much like the ones this chapter works with.

Quite likely, server log processing either a) is the reason you got this book or b) seems utterly boring.////I think you should explain this for readers. Amy//// For the latter folks, stick with it; hidden in this chapter are basic problems of statistics (finding the histogram of pageviews), text processing (regular expressions for parsing), and graphs (constructing the tree of paths that users take through a website).

Webserver Log Parsing

We’ll represent loglines with the following model definition:

link:code/serverlogs/models/logline--fields.rb[role=include]

Since most of our questions are about what visitors do, we’ll mainly use visitor_id (to identify common requests for a visitor), uri_str (what they requested), requested_at (when they requested it) and referer (the page they came from). Don’t worry if you’re not deeply familiar with the rest of the fields in our model — they’ll become clear in context.

Two notes, though. In these explorations, we’re going to use the ip_address field for the visitor_id. That’s good enough if you don’t mind artifacts, like every visitor from the same coffee shop being treated identically. For serious use, though, many web applications assign an identifying "cookie" to each visitor and add it as a custom logline field. Following good practice, we’ve built this model with a visitor_id method that decouples the semantics ("visitor") from the data ("the IP address they happen to have visited from"). Also please note that, though the dictionary blesses the term 'referrer', the early authors of the web used the spelling 'referer' and we’re now stuck with it in this context. ////Here is a great example of supporting with real-world analogies, above where you wrote, "like every visitor from the same coffee shop being treated identically…" Yes! That is the kind of simple, sophisticated tying-together type of connective tissue needed throughout, to greater and lesser degrees. Amy////
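To make that decoupling concrete, here is a minimal plain-Ruby sketch of what such a model might look like. The field names follow the discussion above and the Apache log format shown below; the book’s actual Wukong model lives in the file included above, and the exact field list here is an assumption:

[source,ruby]
----
# Illustrative sketch only; the real model is pulled in by the include above.
# The field list is assumed from the Apache log format discussed below.
Logline = Struct.new(
    :ip_address, :requested_at, :http_method, :uri_str, :protocol,
    :response_code, :bytesize, :referer, :user_agent) do
  # Decouple the semantics ("visitor") from the data we happen to have
  # (the IP address). For serious use, return a tracking-cookie field instead.
  def visitor_id
    ip_address
  end
end
----

A real Wukong model would declare typed fields rather than lean on a Struct, but the visitor_id indirection is the part that matters.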

Simple Log Parsing

////Help the reader in by writing something quick, short, like, "The core nugget that you should know about simple log parsing is…​" Amy////

Webserver logs typically follow some variant of the "Apache Common Log" format — a series of lines describing each web request:

154.5.248.92 - - [30/Apr/2003:13:17:04 -0700] "GET /random/video/Star_Wars_Kid.wmv HTTP/1.0" 206 176250 "-" "Mozilla/4.77 (Macintosh; U; PPC)"

Our first task is to leave that arcane format behind and extract healthy structured models. Since every line stands alone, the parse script is as simple as can be: a transform-only script that passes each line to the Logline.parse method and emits the model object it returns.
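In sketch form, and assuming the Wukong processor DSL that appears later in this chapter, that script is little more than the following (the included file below holds the real version):

[source,ruby]
----
# Sketch of a transform-only parser; assumes Logline.parse returns a model
# object for well-formed lines and nil for lines it cannot make sense of.
processor :logline_parser do
  def process(line)
    logline = Logline.parse(line)
    yield logline if logline
  end
end
----

If Logline.parse returns nil for a malformed line, the guard simply drops it, a cheap way to honor the "good brittle" advice below.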

link:code/serverlogs/parser--processor.rb[role=include]
Star Wars Kid serverlogs

For sample data, we’ll use the webserver logs released by blogger Andy Baio. In 2003, he posted the famous "Star Wars Kid" video, which for several years ranked as the biggest viral video of all time. (It augments a teenager’s awkward Jedi-style fighting moves with the special effects of a real lightsaber.) Here’s his description:

I’ve decided to release the first six months of server logs from the meme’s spread into the public domain — with dates, times, IP addresses, user agents, and referer information. …​ On April 29 at 4:49pm, I posted the video, renamed to "Star_Wars_Kid.wmv" — inadvertently giving the meme its permanent name. (Yes, I coined the term "Star Wars Kid." It’s strange to think it would’ve been "Star Wars Guy" if I was any lazier.) From there, for the first week, it spread quickly through news site, blogs and message boards, mostly oriented around technology, gaming, and movies. …​

This file is a subset of the Apache server logs from April 10 to November 26, 2003. It contains every request for my homepage, the original video, the remix video, the mirror redirector script, the donations spreadsheet, and the seven blog entries I made related to Star Wars Kid. I included a couple weeks of activity before I posted the videos so you can determine the baseline traffic I normally received to my homepage. The data is public domain. If you use it for anything, please drop me a note!

The details of parsing are mostly straightforward — we use a regular expression to pick apart the fields in each line. That regular expression, however, is another story:

link:code/serverlogs/models/logline--parse.rb[role=include]

It may look terrifying, but taken piece-by-piece it’s not actually that bad. Regexp-fu is an essential skill for data science in practice — you’re well advised to walk through it. Let’s do so.

  • The meat of each line describes the contents to match — \S+ for "a sequence of non-whitespace", \d+ for "a sequence of digits", and so forth. If you’re not already familiar with regular expressions at that level, consult the excellent tutorial at regular-expressions.info.

  • This is an 'extended-form' regular expression, as requested by the x at the end of it. An extended-form regexp ignores whitespace and treats # as a comment delimiter — constructing a regexp this complicated would be madness otherwise. Be careful to backslash-escape spaces and hash marks.

  • The \A and \z anchor the regexp to the absolute start and end of the string respectively.

  • Fields are selected using 'named capture group' syntax: (?<ip_address>\S+). You can retrieve its contents using match[:ip_address], or get all of them at once using captures_hash as we do in the parse method.

  • Build your regular expressions to be 'good brittle'. If you only expect HTTP request methods to be uppercase strings, make your program reject records that are otherwise. When you’re processing billions of events, those one-in-a-million deviants start occurring thousands of times.
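Here, as a toy illustration of those pieces (not the book’s full regexp), is how the first few fields of a logline could be matched with an extended-form regexp and named capture groups:

[source,ruby]
----
# Toy example only: matches just the leading fields of a logline.
LOGLINE_HEAD_RE = %r{\A
  (?<ip_address>   \S+) \s    # the visitor's IP address, e.g. "154.5.248.92"
  (?<identd>       \S+) \s    # identd name, almost always "-"
  (?<authuser>     \S+) \s    # HTTP auth user, almost always "-"
  \[(?<requested_at> [^\]]+)\]  # "30/Apr/2003:13:17:04 -0700"
}x

line = '154.5.248.92 - - [30/Apr/2003:13:17:04 -0700] "GET /random/video/Star_Wars_Kid.wmv HTTP/1.0" 206 176250 "-" "Mozilla/4.77 (Macintosh; U; PPC)"'
mm = LOGLINE_HEAD_RE.match(line)
mm[:ip_address]    # => "154.5.248.92"
mm[:requested_at]  # => "30/Apr/2003:13:17:04 -0700"
----

The full regexp in the included file goes on to capture the request, response code, byte count, referer and user agent in exactly the same style.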

That regular expression does almost all the heavy lifting, but isn’t sufficient to properly extract the requested_at time. Wukong models provide a "security gate" for each field in the form of the receive_(field name) method. The setter method (requested_at=) applies a new value directly, but the receive_requested_at method is expected to appropriately validate and transform the given value. The default method performs simple 'do the right thing'-level type conversion, sufficient to (for example) faithfully load an object from a plain JSON hash. But for complicated cases you’re invited to override it as we do here.

link:code/serverlogs/models/logline--requested_at.rb[role=include]
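The included file shows the book’s version; the essential move, sketched below under assumed method and field names, is to parse Apache’s timestamp format explicitly rather than trusting a generic conversion:

[source,ruby]
----
require 'time'

class Logline
  # Sketch of the receive_ gatekeeper for requested_at (assumed shape).
  # Apache timestamps look like "30/Apr/2003:13:17:04 -0700".
  def receive_requested_at(val)
    val = Time.strptime(val, '%d/%b/%Y:%H:%M:%S %z') if val.is_a?(String)
    @requested_at = val
  rescue ArgumentError
    @requested_at = nil   # reject malformed timestamps rather than guessing
  end
end
----

Time.strptime refuses to guess: anything that doesn’t match the Apache format comes back nil instead of silently turning into the wrong date.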

There’s a general lesson here for data-parsing scripts. Don’t try to be a hero and get everything done in one giant method. The giant regexp just coarsely separates the values; any further special handling happens in isolated methods.

Run the parser on a handful of lines first, to spot-check that the fields land where you expect:

link:code/serverlogs/output-parser-map.log[role=include]

Then run it on the full dataset to produce the starting point for the rest of our work:

TODO

Pageview Histograms

////Ease the reader in with something like, "Our goal here will be…​" Amy////

Let’s start exploring the dataset. Our first goal is a simple one: a histogram of pageviews, hour by hour, so we can watch the Star Wars Kid meme take off across Andy Baio’s logs.

link:code/serverlogs/old/logline-02-histograms-mapper.rb[role=include]

We want to group on date_hr, so just add a 'virtual accessor' — a method that behaves like an attribute but derives its value from another field:

link:code/serverlogs/old/logline-00-model-date_hr.rb[role=include]
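A sketch of such a virtual accessor, assuming requested_at holds a Time object as arranged above (the exact bucket format is illustrative):

[source,ruby]
----
class Logline
  # Virtual accessor: behaves like a field, but derives its value
  # from requested_at, bucketing each request into an hour.
  def date_hr
    requested_at.strftime('%Y%m%d%H')
  end
end
----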

This is the advantage of having a model and not just a passive sack of data.

Run it in map mode:

link:code/serverlogs/old/logline-02-histograms-02-mapper-wu-lign-sort.log[role=include]

TODO: digression about wu-lign.

Sort and save the map output; then write and debug your reducer.

link:code/serverlogs/old/logline-02-histograms-full.rb[role=include]
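If you haven’t written a streaming reducer before, the pattern is worth seeing in miniature. The sketch below is plain Ruby, not the book’s Wukong version: it counts runs of identical keys in the sorted map output, which is all a pageview histogram requires.

[source,ruby]
----
#!/usr/bin/env ruby
# Minimal streaming-reducer sketch. Input: "key<TAB>..." lines sorted by key
# (assumed layout); output: "key<TAB>count".
current, count = nil, 0
ARGF.each_line do |line|
  key = line.chomp.split("\t", 2).first
  if key == current
    count += 1
  else
    puts [current, count].join("\t") if current
    current, count = key, 1
  end
end
puts [current, count].join("\t") if current
----

Hadoop guarantees the reducer sees its keys in sorted order, so a single pass with one counter is all it takes.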

When things are working, this is what you’ll see. Notice that the …/Star_Wars_Kid.wmv file already has five times the pageviews of the site root (/).

link:code/serverlogs/old/logline-02-histograms-03-reduce.log[role=include]

You’re ready to run the script in the cloud! Fire it off and you’ll see dozens of workers start processing the data.

link:code/serverlogs/old/logline-02-histograms-04-freals.log[role=include]

User Paths through the site ("Sessionizing")

We can use the user logs to assemble a picture of how users navigate the site — 'sessionizing' their pageviews. Marketing and e-commerce sites have a great deal of interest in optimizing their "conversion funnel", the sequence of pages that visitors follow before filling out a contact form, or buying those shoes, or whatever it is the site exists to serve. Visitor sessions are also useful for defining groups of related pages, in a manner far more robust than what simple page-to-page links would define. A recommendation algorithm using those relations would for example help an e-commerce site recommend teflon paste to a visitor buying plumbing fittings, or help a news site recommend an article about Marilyn Monroe to a visitor who has just finished reading an article about John F Kennedy. Many commercial web analytics tools don’t offer a view into user sessions — assembling them is extremely challenging for a traditional datastore. It’s a piece of cake for Hadoop, as you’re about to see.

////This spot could be an effective place to say more about "Locality" and taking the reader deeper into thinking about that concept in context. Amy////

NOTE: Take a moment and think about the locality: what feature(s) do we need to group on? What additional feature(s) should we sort with?

The mapper spits out [ip, date_hr, visit_time, path] for each logline.
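In sketch form (plain Ruby; the book’s full script is included just below), that record looks like this — partition on the first field, sort on the first three so each visitor’s requests stay in time order:

[source,ruby]
----
# Sketch of the breadcrumb mapper's output record, tab-separated, following
# the [ip, date_hr, visit_time, path] layout above (field formats assumed).
def breadcrumb_record(logline)
  [logline.visitor_id,                          # "ip": stands in for the visitor
   logline.date_hr,                             # hour bucket
   logline.requested_at.strftime('%H:%M:%S'),   # visit_time within the hour
   logline.uri_str                              # path
  ].join("\t")
end
----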

link:code/serverlogs/old/logline-03-breadcrumbs-full.rb[role=include]

You might ask why we don’t partition directly on, say, both visitor_id and date (or other time bucket). Partitioning by date would break the locality of any visitor session that crossed midnight: some of the requests would be in one day, the rest would be in the next day.

Run it in map mode:

link:code/serverlogs/old/logline-02-histograms-01-mapper.log[role=include]
link:code/serverlogs/old/logline-03-breadcrumbs-02-mapper.log[role=include]

Then group on the visitor:

link:code/serverlogs/old/logline-03-breadcrumbs-03-reducer.log[role=include]

We use the secondary sort so that each visit is in strict order of time within a session.

You might ask why that is necessary — surely each mapper reads the lines in order? Yes, but you have no control over what order the mappers run, or where their input begins and ends.

This script will accumulate multiple visits of a page.
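To make the grouping concrete, here is a plain-Ruby sketch of a sessionizing reducer (again, not the book’s Wukong version), assuming the [ip, date_hr, visit_time, path] layout above, partitioned by visitor and secondary-sorted by time:

[source,ruby]
----
#!/usr/bin/env ruby
# Sessionizing-reducer sketch. Input: "ip<TAB>date_hr<TAB>visit_time<TAB>path",
# grouped by ip and secondary-sorted by (date_hr, visit_time) -- assumed layout.
# Output: "ip<TAB>path1;path2;..." -- each visitor's trail, in time order.
current, trail = nil, []
emit = lambda { puts [current, trail.join(';')].join("\t") unless trail.empty? }

ARGF.each_line do |line|
  ip, _date_hr, _visit_time, path = line.chomp.split("\t", 4)
  if ip != current
    emit.call
    current, trail = ip, []
  end
  trail << path
end
emit.call
----

Because the secondary sort already put each visitor’s requests in time order, the reducer only has to concatenate.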

TODO: say more about the secondary sort. ////This may sound wacky, but please try it out: use the JFK/MM exmaple again, here. Tie this all together more, the concepts, using those memorable people. I can explain this live, too. Amy////

Web-crawlers and the Skew Problem

In a dataset like this one, some of the heaviest "visitors" are not people at all but web crawlers, and a single crawler can account for an outsized share of all requests. Group on visitor_id and every one of those requests lands on the same reducer, which grinds away long after its peers have finished: the classic skew problem.

It’s important to use real data when you’re testing algorithms: a skew problem like this rarely shows up in a small, well-behaved sample, but it will certainly show up at scale.

Page-Page similarity

What can you do with the sessionized logs? Well, each row lists a visitor-session on the left and a bunch of pages on the right. We’ve been thinking about that as a table, but it’s also a graph — actually, a bunch of graphs! The "Affinity Graph" sidebar below describes an affinity graph pairing sessions with pages, but we can build a simpler graph that just connects pages to pages, by counting the number of times a pair of pages were visited by the same session. Every time a person requests the /archive/2003/04/03/typo_pop.shtml page and the /archive/2003/04/29/star_war.shtml page in the same visit, that’s one point towards their similarity. The chapter on [graph_processing] has lots of fun things to do with a graph like this, so for now we’ll just lay the groundwork by computing the page-page similarity graph defined by visitor sessions.
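As a sketch of the mapper side of that computation (plain Ruby with an illustrative helper name; the book’s version follows in the included files), each session’s page list is turned into canonical page pairs, and a reducer then counts identical pairs exactly like the pageview histogram above:

[source,ruby]
----
# Sketch: turn one visitor session's page list into page-page co-visit pairs.
def page_pairs(paths)
  paths.uniq.sort.combination(2).to_a
end

page_pairs(%w[
  /archive/2003/04/29/star_war.shtml
  /archive/2003/04/03/typo_pop.shtml
  /archive/2003/04/29/star_war.shtml
])
# => [["/archive/2003/04/03/typo_pop.shtml", "/archive/2003/04/29/star_war.shtml"]]
----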

link:code/serverlogs/old/logline-04-page_page_edges-full.rb[role=include]
link:code/serverlogs/old/logline-04-page_page_edges-03-reducer.log[role=include]
Affinity Graph

First, you can think of it as an affinity graph pairing visitor sessions with pages. The Netflix prize motivated a lot of work to help us understand affinity graphs — in that case, a pairing of Netflix users with movies. Affinity graph-based recommendation engines simultaneously group similar users with similar movies (or similar sessions with similar pages).

Imagine the following device. Set up two long, straight lengths of wire, with beads that can slide along them. Beads on the left represent visitor sessions, ones on the right represent pages. These are magic beads, in that they can slide through each other, and they can clamp themselves to the wire. They are also slightly magnetic, so with no other tension they would not clump together but instead arrange themselves at some separated interval along the wire. Clamp all the beads in place for a moment and tie a small elastic string between each session bead and each page bead in that session. (These elastic bands also magically don’t interfere with each other.) To combat the crawler-robot effect, choose tighter strings when there are few pages in the session, and weaker strings when there are lots of pages in the session.

Once you’ve finished stringing this up, unclamp one of the session beads. It will snap to a position opposite the middle of all the pages it is tied to. If you now unclamp each of those page beads, they’ll move to sit opposite that first session bead. As you continue to unclamp all the beads, you’ll find that they organize into clumps along the wire: when a bunch of sessions link to a common set of pages, their mutual forces combine to drag them opposite each other. That’s the intuitive view; there are proper mathematical treatments, of course, for this kind of co-clustering.

TODO: figure showing bipartite session-page graph

Geo-IP Matching

You can learn a lot about your site’s audience in aggregate by mapping IP addresses to geolocation: not just on its own, but also joined against other datasets, like census data, store locations, weather and time. [1]

Maxmind makes their GeoLite IP-to-geo database available under an open license (CC-BY-SA)[2]. Out of the box, its columns are beg_ip, end_ip, location_id, where the first two columns show the low and high ends (inclusive) of a range that maps to that location. Every address lies in at most one range; locations may have multiple ranges.

This arrangement caters to range queries in a relational database, but it isn’t suitable for our needs: a single IP-geo block can span thousands of addresses, so there is no single value we can use to co-locate a logline with the block that contains its address.

To get the right locality, take each range and break it at some block level. Instead of having 1.2.3.4 to 1.2.5.6 on one line, let’s use the first three quads (first 24 bits) and emit rows for 1.2.3.4 to 1.2.3.255, 1.2.4.0 to 1.2.4.255, and 1.2.5.0 to 1.2.5.6. This lets us use the first segment as the partition key, and the full ip address as the sort key.

    lines            bytes  description                  file
15_288_766   1_094_541_688  24-bit partition key         maxmind-geolite_city-20121002.tsv
 2_288_690     183_223_435  16-bit partition key         maxmind-geolite_city-20121002-16.tsv
 2_256_627      75_729_432  original (not denormalized)  GeoLiteCity-Blocks.csv
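Here is a sketch of that splitting step in plain Ruby (hypothetical helper name and row layout; it works on the integer form of the addresses and breaks each range at 24-bit boundaries):

[source,ruby]
----
require 'ipaddr'
require 'socket'

BLOCK = 256  # 24-bit blocks: each covers one x.y.z.0 .. x.y.z.255 range

# Split an inclusive [beg_ip, end_ip] range at 24-bit boundaries, emitting one
# row per block: [partition_key, beg, end]. In the real table, the location_id
# rides along on each emitted row.
def split_ip_range(beg_ip, end_ip)
  lo, hi = IPAddr.new(beg_ip).to_i, IPAddr.new(end_ip).to_i
  rows = []
  while lo <= hi
    block_beg = lo - (lo % BLOCK)                # x.y.z.0 for this block
    block_end = [block_beg + BLOCK - 1, hi].min  # x.y.z.255, or the range's end
    rows << [IPAddr.new(block_beg, Socket::AF_INET).to_s,
             IPAddr.new(lo,        Socket::AF_INET).to_s,
             IPAddr.new(block_end, Socket::AF_INET).to_s]
    lo = block_end + 1
  end
  rows
end

split_ip_range('1.2.3.4', '1.2.5.6')
# => [["1.2.3.0", "1.2.3.4", "1.2.3.255"],
#     ["1.2.4.0", "1.2.4.0", "1.2.4.255"],
#     ["1.2.5.0", "1.2.5.0", "1.2.5.6"]]
----

The 16-bit variant in the table above is the same idea with BLOCK = 65_536, producing far fewer rows at the cost of coarser partitions.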

Range Queries

////Gently introduce the concept. "So, here’s what range queries are all about, in a nutshell…​" Amy////

This is a generally-applicable approach for doing range queries.

  • Choose a regular interval, fine enough to avoid skew but coarse enough to avoid ballooning the dataset size.

  • Wherever a range crosses an interval boundary, split it into multiple records, each filling or lying within a single interval.

  • Emit a compound key of [interval, join_handle, beg, end], where

    • interval is the regular-interval bucket that the record falls in

    • join_handle identifies the originating table, so that records are grouped for a join (this is what ensures that matching records from each table land together on the same reducer). If the interval is transparently a prefix of the index (as it is here), you can instead just ship the remainder: [interval, join_handle, beg_suffix, end_suffix].

  • Use the interval as the partition key and the full compound key for sorting, just as we used the first IP segment as the partition key and the full address as the sort key above.

In the geodata section, the "quadtile" scheme is (if you bend your brain right) something of an extension on this idea — instead of splitting ranges on regular intervals, we’ll split regions on a regular grid scheme.

Using Hadoop for website stress testing ("Benign DDos")

Hadoop is engineered to consume the full capacity of every available resource up to the currently-limiting one. So in general, you should never issue requests against external services from a Hadoop job — one-by-one queries against a database; crawling web pages; requests to an external API. The resulting load spike will effectively be attempting what web security folks call a "DDoS", or distributed denial of service attack.

Unless of course you are trying to test a service for resilience against an adversarial DDoS — in which case that assault is a feature, not a bug!

.elephant_stampede
[source,ruby]
----
require 'faraday'

# Replay each logline's URL against a live server, so that Hadoop's many
# mappers generate the load. Emits one record per request with its timing.
processor :elephant_stampede do

  def process(logline)
    beg_at = Time.now.to_f
    resp   = Faraday.get(url_to_fetch(logline))
    yield summarize(resp, beg_at)
  end

  # How long did the request take, and how large was the response?
  def summarize(resp, beg_at)
    duration = Time.now.to_f - beg_at
    bytesize = resp.body.bytesize
    { duration: duration, bytesize: bytesize }
  end

  # The URL to hit for this logline.
  def url_to_fetch(logline)
    logline.url
  end
end

flow(:mapper){ input > parse_loglines > elephant_stampede }
----

You must use Wukong’s eventmachine bindings to make more than one simultaneous request per mapper.


1. These databases only impute a coarse-grained estimate of each visitor’s location — they hold no direct information about the person. Please consult your priest/rabbi/spirit guide/grandmom or other appropriate moral compass before diving too deep into the world of unmasking your site’s guests.
2. For serious use, there are professional-grade datasets from Maxmind, Quova, and Digital Element, among others.