Analyzing Text

  1. article → wordbag

  2. join on page data to get geolocation

  3. use pagelinks to get larger pool of implied geolocations

  4. turn geolocations into quadtile keys

  5. aggregate topics by quadtile

  6. take summary statistics aggregated over term and quadkey

  7. combine those statistics to identify terms that occur more frequently than the base rate would predict

  8. explore and validate the results

  9. filter to find strongly-flavored words, and other reductions of the data for visualization

Geographic Flavor

There's no better example of data that is huge, unruly, organic, high-dimensional and deeply connected than Wikipedia. Six million articles, bearing XXX million associated properties and connected by XXX million links, are viewed by XXX million people each year (TODO: add numbers). The full data — articles, properties, links and aggregated pageview statistics — is free for anyone to access. (See [overview_of_datasets] for how.)

The Wikipedia community has attached latitude and longitude coordinates to more than a million articles: not just populated places like Austin, TX, but also landmarks like the University of Texas' Memorial Stadium, Snow's BBQ ("The Best Texas BBQ in the World") and TACC (the Texas Advanced Computing Center, home of the world's largest academic supercomputer). This lets us put not just each article, but the cloud of data around it, in geographical context.

What happens if we apply this context to not just the article, but those articles' words? Barbeque is popular all through Texas and the Southeastern US — is the term "Barbeque" overrepresented in articles from that region? We can brainstorm a few more terms with strong place affinity, like "beach" (the coasts) or "wine" (France, Napa Valley), and ones without, like "hat" or "couch". Hadoop, combined with the Wikipedia dataset, will let us rigorously identify words with this kind of geographic flavor, along with the locales they have affinity for.

At a high level, what we’ll do is this:

  • Divide the world into grid cells, and group all the words in Wikipedia onto their article's grid cell

  • Determine the overall frequency of each word in the Wikipedia corpus

  • Identify prominent (unusually frequent) words on each grid cell

  • Identify words that are prominent on a large number of grid cells — these have strong geographic "flavor"

Decompose Wikipedia Articles

Next, we will summarize each article by preparing its "word bag" — a simple count of the terms on its Wikipedia page. This is an application of "Exploding One Record into Many" (REF), but in this case we've written a Pig UDF (User-Defined Function) that uses a proper linguistics toolkit (OpenNLP (?)) to perform the splitting.

REGISTER path/to/udf ...;
wordbags = FOREACH articles {
    -- WORDBAG returns tuple(tot_usages:long, num_terms:long, num_onces:long,
    --   wordbag:bag{tuple(article_term_usages:long, term:chararray)})
    wds = WORDBAG(text);
    GENERATE page_id, quadcell,
        wds.tot_usages AS tot_usages, wds.num_terms AS num_terms,
        wds.num_onces  AS num_onces,  wds.wordbag   AS wordbag;
};
term_article_freqs = FOREACH wordbags GENERATE
    page_id, quadcell, tot_usages, num_terms, FLATTEN(wordbag);
STORE term_article_freqs INTO '/data/work/geo_flavor/term_article_freqs';

Here’s the output:

Wordbag for "Lexington, Texas"
Lexington,_Texas 023130130 TODO:tot_usages TODO:num_terms TODO:num_onces {(4,"texas"),(2,"lexington"),(2,"best"),(2,"bbq"),(1,"barbecue"), ...}

And the output after the flatten:

"term_article_freqs" result
Lexington,_Texas 023130130 tot_usages num_terms 4   "texas"
Lexington,_Texas 023130130 tot_usages num_terms 2   "lexington"
Lexington,_Texas 023130130 tot_usages num_terms 2   "best"
Lexington,_Texas 023130130 tot_usages num_terms 2   "bbq"
Lexington,_Texas 023130130 tot_usages num_terms 1   "barbecue"

Determining Frequency, Range and Dispersion of Terms within a Corpus

  • doc-term-usages-ct

  • doc-term-usages-rate

  • doc-term-frac

  • doc-all-terms-ct, doc-all-usages-ct

  • raw rate in doc; adjusted rate in doc

  • docs-ct — count of documents

  • terms-ct — count of terms

  • usages-ct — count of usages

  • (doc_id, term, usages-ct, doc-term-usages-rate, all-term-usages-freq)

The term "however" crops up at a rate of XXX ppm across XXX articles. In contrast, the word "function" accounts for XXX ppm of term usages but appears in only XXX% of articles — relatively fewer articles employ the term, but those that do use it at a concentrated rate. (TODO: histogram of usage counts) We can get a sense of this "lumpiness" using the dispersion function.[1]
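
Juilland's D, the dispersion measure worked out in the notes at the end of this chapter, is built from the variation coefficient of a term's sub-frequencies across the n segments of the corpus:

<math>
V = \frac{s}{\bar{x}}, \qquad D = 1 - \frac{V}{\sqrt{n-1}}
</math>

where x-bar is the term's mean sub-frequency, s is the standard deviation of those sub-frequencies, and D runs from near 0 (occurrences bunched into a few segments) to 1 (spread evenly across all segments).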

(Implementation notes: decorate each record in the map phase; then GROUP ... ALL and COUNT_STAR for the corpus-wide totals.)

Term Statistics by Grid Cell

Now comes the fun part.

taf_g = GROUP term_article_freqs BY (quadcell, term);
cell_freqs = FOREACH taf_g GENERATE
      group.quadcell AS quadcell,
      group.term AS term,
      SUM(term_article_freqs.article_term_usages) AS cell_term_usages;
cf_g = GROUP cell_freqs BY quadcell;
term_cell_freqs = FOREACH cf_g GENERATE
    group AS quadcell,
    SUM(cell_freqs.cell_term_usages) AS cell_usages,
    FLATTEN(cell_freqs.(cell_term_usages, term)) AS (cell_term_usages, term);
"cell_freqs" result
023130130 7 "bbq"
023130130 20 "texas"
"cf_g" result
023130130 {(7,"bbq"),(20,"texas"),...}
"term_cell_freqs" result
023130130 95 7 "bbq"
023130130 95 20 "texas"

Term Statistics

We will be defining the prominence of a term on a grid cell by comparing its local frequency to the overall frequency of the term. The occurrence frequency of the term "the" is XX parts per million (ppm), while that of "barbeque" is XX ppm. However, on the quadcell surrounding Lexington, Texas, "the" occurs at XX ppm and "barbeque" at XX ppm — a significantly elevated rate.

Let’s now prepare those global statistics.

all_terms = GROUP term_article_freqs BY term;
term_info = FOREACH all_terms GENERATE
    group AS term,
    COUNT_STAR(term_article_freqs) AS num_articles,
    SUM(term_article_freqs.article_term_usages) AS term_usages;
global_term_info_g = GROUP term_info ALL;
global_term_info = FOREACH global_term_info_g GENERATE
    COUNT_STAR(term_info) AS num_terms,
    SUM(term_info.term_usages) AS global_usages;
STORE global_term_info INTO '/data/work/geo_flavor/global_term_info';

(The actual code is somewhat different from what you see here — we'll explain below.)

  1. (TODO describe term_info)

Pattern: Re-injecting global totals

We also extract two global statistics: the number of distinct terms, and the number of distinct usages. This brings up one of the more annoying things about Hadoop programming. The global_term_info result is two lousy values, needed to turn the global counts for each term into the global frequency for each term. But a pig script just orchestrates the top-level motion of data: there’s no intrinsic way to bring the result of a step into the declaration of following steps. The proper recourse is to split the script into two parts, and run it within a workflow tool like Rake, Drake or Oozie. The workflow layer can fish those values out of the HDFS and inject them as runtime parameters into the next stage of the script.

We prefer to cheat. We instead ran a version of the script that found the global count of terms and usages, then copy/pasted their values as static parameters at the top of the script. This also lets us calculate the ppm frequency of each term and the other term statistics in a single pass. To ensure our time-traveling shenanigans remain valid, we add an ASSERT statement which compares the memoized values to the actual totals.

%declare MEMOIZED_NUM_TERMS XXX;
%declare MEMOIZED_GLOBAL_USAGES XXX;
all_terms = GROUP term_article_freqs BY term;
term_info = FOREACH all_terms GENERATE
    group AS term,
    COUNT_STAR(term_article_freqs) AS num_articles,
    SUM(term_article_freqs.article_term_usages) AS term_usages,
    1000000.0 * SUM(term_article_freqs.article_term_usages) / $MEMOIZED_GLOBAL_USAGES AS term_ppm:double
    ;
-- Validate the memoized global term statistics
global_term_info_g = GROUP term_info ALL;
global_term_info = FOREACH global_term_info_g GENERATE
   COUNT_STAR(term_info) AS num_terms,
   SUM(term_info.term_usages) AS global_usages;
STORE global_term_info INTO '/data/work/geo_flavor/global_term_info';
ASSERT global_term_info BY (num_terms == $MEMOIZED_NUM_TERMS);
ASSERT global_term_info BY (global_usages == $MEMOIZED_GLOBAL_USAGES);

(TODO: just realized the way we’ve done this finds global term stats on only geolocated articles. To find them on all articles will complicate the script: we have to do a left join and then filter, or we’d have to do wordbags first then join on geolocations.)

A pause, to think

Let’s look at the fundamental pattern that we’re using. Our steps:

  1. transform each article individually into its wordbag

  2. augment the wordbags with their geo coordinates by joining on page ID

  3. organize the wordbags into groups having the same grid cell;

  4. form a single combined wordbag for each grid cell.

//// Consider adding some text here that guides the reader with regard to the findings they might expect to result. For example, "…​if you were to use the example of finding symptoms that intersect with illness as part of an epidemic, you would have done x, y, and z…​" This will bring the activity to life and help readers appreciate how it applies to thier own data at hand. Amy////

It’s a sequence of transforms (operations on each record in isolation: steps 1 and 4) and pivots — operations that combine records, whether from different tables (the join in step 2) or the same dataset (the group in step 3).

In doing so, we’ve turned articles that have a geolocation into coarse-grained regions that have implied frequencies for words. The particular frequencies arise from this combination of forces:

  • signal: Terms that describe aspects of the human condition specific to each region, like "longhorns" or "barbecue", and direct references to place names, such as "Austin" or "Texas"

  • background: The natural frequency of each term — "second" is used more often than "syzygy" — slanted by its frequency in geo-locatable texts (the word "town" occurs far more frequently than its natural rate, simply because towns are geolocatable).

  • noise: Deviations introduced by the fact that we have a limited sample of text to draw inferences from.

Our next task — the sprint home — is to use a few more transforms and pivots to separate the signal from the background and, as far as possible, from the noise.

Match Wikipedia Article Text with Article Geolocation

Let's start by assembling the data we need. The Wikipedia dataset has three different tables for each article: the metadata for each page (page id, title, size, last update time, and so forth); the full text of each article (a very large field); and the article geolocations. Below are snippets from the articles table and the geolocations table:

Wikipedia article record for "Lexington, Texas"
Lexington,_Texas  Lexington is a town in Lee County, Texas, United States. ... Snow's BBQ, which Texas Monthly called "the best barbecue in Texas" and The New Yorker named "the best Texas BBQ in the world" is located in Lexington.
Article coordinates
Lexington,_Texas -97.01 30.41 023130130

Since we want to place the words in each article in geographic context, our first step is to reunite each article with its geolocation: a Direct Inner Join (pattern REF). While we've got the data streaming by, we also attach its quadtile sorting key ("Partition Points onto a Spatial Grid", REF).

article_text = LOAD '...';
article_geolocations = LOAD '...';
articles = JOIN article_geolocations BY page_id, article_text BY page_id;
articles = FOREACH articles GENERATE article_text::page_id AS page_id, QUADKEY(lng, lat) AS quadcell, text;

Here’s the result:

Article with coordinates
Lexington,_Texas 023130130 Lexington is a town in Lee County, Texas ...

Pulling signal from noise

To isolate the signal, we'll pull out a trick called "Pointwise Mutual Information" (PMI). Though it may sound like an insurance holding company, PMI is in fact a simple way to separate the signal from the background and the noise. It compares the following:

  • the rate the term 'barbecue' is used

  • the rate that terms are used on grid cell 023130130

  • the rate the term 'barbecue' is used on grid cell 023130130

Just as above, we can transform and pivot to get those figures:

  • group the data by term; count occurrences

  • group the data by tile; count occurrences

  • group the data by term and tile; count occurrences

  • count total occurrences

  • combine those counts into rates, and form the PMI scores.

Rather than step through each operation, I’ll wave my hands and pull its output from the oven:

023130130 {("texas",X), ..., ("longhorns",X), ..., ("bbq",X), ...}

As expected, in [baldridge_bbq_wine] you see BBQ loom large over Texas and the Southern US; Wine, over the Napa Valley[2].

Takeaway #1: Start with a Question

We accomplished an elaborate data exploration, yet at no point did we do anything complex. Instead of writing a big hairy monolithic program, we wrote a series of simple scripts that either transformed or pivoted the data.

As you’ll see later, the scripts are readable and short (none exceed a few dozen lines of code). They run easily against sample data on your desktop, with no Hadoop cluster in sight; and they will then run, unchanged, against the whole of Wikipedia on dozens or hundreds of machines in a Hadoop cluster. ////This sounds hard to believe. Consider saying more here, as it comes off as a bit over-simplified. Amy////

That’s the approach we’ll follow through this book: develop simple, maintainable transform/pivot scripts by iterating quickly and always keeping the data visible; then confidently transition those scripts to production as the search for a question becomes the rote production of an answer.

The challenge, then, isn't to learn to "program" Hadoop — it's to learn how to think at scale: to choose a workable series of chess moves connecting the data you have to the insight you need. In the first part of the book, after briefly becoming familiar with the basic framework, we'll proceed through a series of examples to help you identify the key locality and thus the transformation each step calls for. In the second part of the book, we'll apply this to a range of interesting problems, building up a set of reusable tools for asking deep questions in actual practice.

Exemplars and Touchstones

There are three touchstones to hit in every data exploration:

  • Confirm the things you know.

  • Confirm or refute the things you suspect.

  • Uncover at least one thing you never suspected.

Things we know:

  • Common words should show no geographic flavor.

  • Geographic features — "beach", "mountain", etc — should be intensely localised.

  • Compared to other color words, there will be larger regional variation for the terms "white" and "black", as they describe race as well as color.

Things we suspect: taken as a whole, the terms that have a strong geographic flavor should largely be cultural terms (foods, sports, etc).

And the third touchstone: you don't have to stop exploring when you find a new mystery, but no data exploration is complete until you uncover at least one thing you never suspected.

Next, we'll choose some exemplars: familiar records to trace through each step. "Barbeque" should cover ...

stream do |article|
  words = Wukong::TextUtils.tokenize(article.text, remove_stopwords: true)
  words.group_by(&:to_s).each do |word, occurs|
    yield [article.id, word, occurs.count]
  end
end

Reading it as prose, the script says: "for each article, break its text into a list of words; group all occurrences of each word and count them; then output the article id, word and count."

Snippet from the Wikipedia article on "Barbecue"

Each Southern locale has its own particular variety of barbecue, particularly concerning the sauce. North Carolina sauces vary by region; eastern North Carolina uses a vinegar-based sauce, the center of the state enjoys Lexington-style barbecue which uses a combination of ketchup and vinegar as their base, and western North Carolina uses a heavier ketchup base. Lexington boasts of being "The Barbecue Capital of the World" and it has more than one BBQ restaurant per 1,000 residents. In much of the world outside of the American South, barbecue has a close association with Texas. Many barbecue restaurants outside the United States claim to serve "Texas barbecue", regardless of the style they actually serve. Texas barbecue is often assumed to be primarily beef. This assumption, along with the inclusive term "Texas barbecue", is an oversimplification. Texas has four main styles, all with different flavors, different cooking methods, different ingredients, and different cultural origins. In the June 2008 issue of Texas Monthly Magazine Snow’s BBQ in Lexington was rated as the best BBQ in the state of Texas. This ranking was reinforced when New Yorker Magazine also claimed that Snow’s BBQ was "The Best Texas BBQ in the World".

=== Pointwise Mutual Information

Pointwise Mutual Information sounds like an insurance holding company, but is in fact a simple way to expose signal from background.

Let's pick up the example from [first_exploration]:

  • rate the word 'barbecue' is used

  • rate that words are used on grid cell 023130130

  • rate the word 'barbecue' is used on grid cell 023130130

    pmi(x; y) := log[ p(x,y) / (p(x) * p(y)) ]

    <math>
    \operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)\,p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}.
    </math>
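
As a concrete rendering of that formula, here is a minimal Pig sketch (not the production script): it joins the cell-level counts against the global term statistics built earlier, and assumes the term_cell_freqs and term_info relations and the memoized global usage total from above.

-- Each of the three probabilities is a usage count over the global usage total,
-- so the PMI score reduces to a ratio of counts.
cell_term_info = JOIN term_cell_freqs BY term, term_info BY term;
cell_term_pmi  = FOREACH cell_term_info GENERATE
    term_cell_freqs::quadcell    AS quadcell,
    term_cell_freqs::term        AS term,
    LOG( ((double)cell_term_usages / $MEMOIZED_GLOBAL_USAGES) /
         ( ((double)term_usages    / $MEMOIZED_GLOBAL_USAGES) *
           ((double)cell_usages    / $MEMOIZED_GLOBAL_USAGES) ) ) AS pmi:double;

Terms whose local rate on a cell beats their global rate get a positive score; terms that are merely common everywhere land near zero.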

==== Smoothing the counts ====

The count of each word is an imperfect estimate of the probability of seeing that word in the context of the given topic. Consider for instance the words that would have shown up if the article were 50% longer, or the cases where an author chose one synonym out of many equivalents. This is particularly significant considering words with zero count.

We want to treat "missing" terms as having occurred some number of times, and adjust the probabilities of all the observed terms.

Note
Minimally Invasive

It’s essential to use "minimally invasive" methods to address confounding factors.

What we’re trying to do is expose a pattern that we believe is robust: that it will shine through any occlusions in the data. Occasionally, as here, we need to directly remove some confounding factor. The naive practitioner thinks, "I will use a powerful algorithm! That’s good, because powerful is better than not powerful!" No — simple and clear is better than powerful.

Suppose you were instead telling a story set in space - somehow or another, you must address the complication of faster-than-light travel. Star Wars does this early and well: its choices ("Ships can jump to faraway points in space, but not from too close to a planet and only after calculations taking several seconds; it happens instantaneously, causing nearby stars to appear as nifty blue tracks") are made clear in a few deft lines of dialog.

A ham-handed sci-fi author instead brings in complicated machinery requiring a complicated explanation resulting in complicated dialogue. There are two obvious problems: first, the added detail makes the story less clear. It's literally not rocket science: concentrate on heroes and the triumph over darkness, not on rocket engines. Second, writing that dialog is wasted work. If it's enough to just have the Wookiee hit the computer with a large wrench, do that.

But it’s essential to appreciate that this also introduces extra confounding factors. Rather than a nifty special effect and a few lines shouted by a space cowboy at his hairy sidekick, your junkheap space freighter now needs an astrophysicist, a whiteboard and a reason to have the one use the other. The story isn’t just muddier, it’s flawed.

We’re trying to tell a story ("words have regional flavor"), but the plot requires a few essential clarifications ("low-frequency terms are imperfectly estimated"). If these patterns are robust, complicated machinery is detrimental. It confuses the audience, and is more work for you; it can also bring more pattern to the data than is actually there, perverting your results.

The only time you should bring in something complicated or novel is when it’s a central element of your story. In that case, it’s worth spending multiple scenes in which Jedi masters show and tell the mechanics and limitations of The Force.

There are two reasonable strategies: be lazy; or consult a sensible mathematician.

To be lazy, add a 'pseudocount' to each term: pretend you saw it an extra small number of times. For the common pseudocount choice of 0.5, you would treat absent terms as having been seen 0.5 times, terms observed once as having been seen 1.5 times, and so forth. Calculate probabilities using the adjusted count divided by the sum of all adjusted counts (so that they sum to 1). It's not well-justified mathematically, but is easy to code.
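
Here is what the lazy approach looks like on the cell-level counts, as a sketch: it assumes a pseudocount of 0.5 and re-uses the memoized vocabulary size from earlier, since every term in the vocabulary (present or absent) contributes 0.5 to the denominator.

-- Pseudocount smoothing, sketched: add 0.5 to each observed count and grow the
-- denominator by 0.5 for every term in the vocabulary, so probabilities sum to 1.
cell_term_probs_sm = FOREACH term_cell_freqs GENERATE
    quadcell, term,
    ((double)cell_term_usages + 0.5) / ((double)cell_usages + 0.5 * $MEMOIZED_NUM_TERMS)
        AS cell_term_prob_sm:double;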

Consult a mathematician: for something that is mathematically justifiable, yet still simple enough to be minimally invasive, she will recommend "Good-Turing" smoothing.

In this approach, we expand the dataset to include both the pool of counts for terms we saw, and an "absent" pool of fractional counts, to be shared by all the terms we didn’t see. Good-Turing says to count the terms that occurred once, and guess that an equal quantity of things would have occurred once, but didn’t. This is hand-wavy, but minimally invasive; we oughtn’t say too much about the things we definitionally can’t say much about.

We then make the following adjustments (a sketch in Pig follows this list):

  • Set the total count of words in the absent pool equal to the number of terms that occur once. There are of course tons of terms in this pool; we’ll give each some small fractional share of an appearance.

  • Specifically, treat each absent term as occupying the same share of the absent pool as it does in the whole corpus (minus this doc). So, if "banana" does not appear in the document, but occurs at (TODO: value) ppm across all docs, we’ll treat it as occupying the same fraction of the absent pool (with slight correction for the absence of this doc).

  • Finally, estimate the probability for each present term as its count divided by the total count in the present and absent pools.
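
A minimal sketch of that adjustment at the document level, assuming the num_onces field produced by the WORDBAG UDF counts the terms seen exactly once (and so sizes the absent pool):

-- Good-Turing-flavored smoothing, sketched: a present term's probability is its
-- raw count over (present usages + absent pool); the absent terms share the rest.
doc_term_pool = FOREACH wordbags GENERATE
    page_id,
    (double)tot_usages + (double)num_onces AS pool_total:double,
    FLATTEN(wordbag);                       -- yields (article_term_usages, term)
doc_term_probs = FOREACH doc_term_pool GENERATE
    page_id, term,
    (double)article_term_usages / pool_total AS doc_term_prob_sm:double;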

  • Tokenize Articles

  • Wordbag articles

  • Corpus term stats

    • document+term: (all rate figures are ppb)

    • doc_id, bag:{(term, doc-term-n-usages, doc-term-rate, doc-term-rate-squared, range=1)}, doc-n-terms, doc-n-usages

    • doc-max-term-usages: usages-ct of term with highest usages-ct

    • doc-term-tf_1 = 0.5*(1 + (doc-term-n-usages/doc-max-term-usages))

    • doc-term-tf = 1 + log(doc-term-n-usages) (or 0 if d-t-n-usages == 0)

    • doc-term-tf_3 = log(doc-term-n-usages)/(1+log(doc-avg-term-usages))

    • doc-n-singletons: number of terms in document used exactly once

    • removing stopwords

    • all:

    • tot-usages: sum(doc-term-n-usages) for all docs&terms

    • n-terms = count(distinct terms): count of distinct terms

    • n-docs = count(docs): count of docs

    • avg-doc-terms: avg(doc-n-terms) — average number of distinct terms in a doc

    • avg-doc-usages = avg(doc-n-usages) = (tot-usages/n-docs): average number of usages in a doc

    • sum of doc-term-rate on all docs and terms should be same as docs ct — see if we get an underflow situation? Do Knuth trick?

    • term:

    • term-n-usages: sum of doc-term-n-usages

    • term-rate: 1e9*sum(doc-term-n-usages)/tot-usages

    • term-range: count of docs with word ie. sum of range column

    • idf: ln(n-docs/term-range)

    • term-avg-doc-rate: avg(doc-term-rate)

    • term-sdv-doc-rate: sqrt(n-docs * sum(doc-term-rate-squared) - sum(doc-term-rate)^2) / n-docs

    • rank-term-rate

    • ?rank of dispersion*term-rate

    • term-doc:

    • term

    • doc-term-n-usages

    • doc-term-rate

    • term-rate

    • doc-term-tf = 0.5*(1 + (doc-term-n-usages/doc-max-term-usages))

    • doc-term-tf-idf = tf*idf: significance of document for this term in global corpus

    • doc-term-rate-sm = (1 - doc-unseen-rate)*(doc-term-n-usages / doc-n-usages)

    • doc-n-terms

    • doc-n-usages

    • doc-unseen-rate = clamp(doc-n-singletons/doc-n-usages, 1/doc-n-usages, 0.5)

    • doc-unseen-term-rate = ( doc-unseen-rate * (term-rate/(1 - (doc-n-usages/tot-usages))) ) — rate of a specific unseen word

    • all

    • n-grids

    • avg-docs-per-grid

    • grid

    • term

    • grid-term-n-usages — with grid-n-usages, the term rate if longer docs count more. In effect, this treats the whole grid as one document.

    • grid-avg-doc-term-rate — term rate if every document weights the same

    • term-rate — global term rate, for normalization

    • grid-term-tf: 0.5*(1 + (doc-term-n-usages/doc-max-term-usages))

    • grid-term-tf-idf = tf*idf: significance of document for this term in global corpus

    • avg-doc-term-tf-idf

    • grid-n-usages

    • grid-n-terms

    • grid-n-docs

    • n-grids

    • n-docs
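
To make a few of those statistics concrete, here is a hedged Pig sketch; the doc_terms and grid_terms relations and the $N_DOCS parameter are illustrative stand-ins that follow the field names in the notes, not any script shown earlier:

-- doc_terms: (doc_id, term, doc_term_n_usages), one row per document-term pair.
term_g     = GROUP doc_terms BY term;
term_stats = FOREACH term_g GENERATE
    group                                                 AS term,
    COUNT_STAR(doc_terms)                                 AS term_range,    -- docs containing the term
    SUM(doc_terms.doc_term_n_usages)                      AS term_n_usages,
    LOG((double)$N_DOCS / (double)COUNT_STAR(doc_terms))  AS idf:double;    -- idf = ln(n-docs / term-range)

-- grid_terms: (quadcell, term, grid_term_n_usages, grid_max_term_usages).
grid_join  = JOIN grid_terms BY term, term_stats BY term;
grid_tfidf = FOREACH grid_join GENERATE
    grid_terms::quadcell AS quadcell,
    grid_terms::term     AS term,
    (0.5 + 0.5 * (double)grid_term_n_usages / grid_max_term_usages) * idf AS grid_term_tf_idf:double;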

The way to think about it is as... You won't believe it... As if there were a chimpanzee at a typewriter sitting at each grid cell, littering the floor with documents. Whatever the coherence of their output, they do generally use words at the same rate as normal prose — but each one has words they've grown particularly fond of.

=== How to weight tiles

We can think of a few ways to look at it. Suppose, say, a term's usage counts across the five documents on a tile are 10/1000, 2/100, 1/100, 0/1000 and 0/300 (term usages over document length); a Pig sketch of the first three schemes follows this list.

  • grid as doc: use grid PMI, grid tf-idf.

  • average of doc stats: use average of doc PMI, doc tf-idf. A term used 2 times out of 200 in a smallish document counts the same as a term used 10 times out of 1000 in a larger document. The number of documents mentioning a term doesn’t matter if the same average rate is hit: you get the same outcome for a term that appears 1% of the time in both A and B as if it appeared 2% of the time in A and never in B.

  • RMS averaged docs: use sum-squared average (sqrt(sum(x^2))/ct). A term appearing 1% of the time in both A and B contributes about 30% less than if it appeared 2% of the time in A and never in B.

  • thresholded doc stats: a term is prominent if many docs feature it at some level. Three docs featuring a term at any above-cutoff rate contribute much more than one doc which shouts about it; a grid with three docs that use the term just under the threshold counts the same as one that doesn't mention it at all.

  • an ad-hoc mixture of the above.
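
A sketch of the first three schemes, assuming a hypothetical doc_cell_rates relation with one row per (document, term) carrying the per-document counts and rates named in the notes above:

-- Precompute the squared rate so the RMS variant is a plain aggregate.
doc_cell_rates_2 = FOREACH doc_cell_rates GENERATE
    quadcell, doc_id, term, doc_term_n_usages, doc_n_usages, doc_term_rate,
    doc_term_rate * doc_term_rate AS doc_term_rate_sq;
by_cell_term = GROUP doc_cell_rates_2 BY (quadcell, term);
weighted = FOREACH by_cell_term GENERATE
    FLATTEN(group) AS (quadcell, term),
    -- grid as doc: pool all usages, so longer documents count for more
    (double)SUM(doc_cell_rates_2.doc_term_n_usages) / SUM(doc_cell_rates_2.doc_n_usages)
                                                        AS grid_as_doc_rate,
    -- average of doc stats: every document weighs the same
    AVG(doc_cell_rates_2.doc_term_rate)                 AS avg_doc_rate,
    -- RMS average: concentration in one document counts for more than the same rate spread thin
    SQRT(SUM(doc_cell_rates_2.doc_term_rate_sq)) / COUNT_STAR(doc_cell_rates_2)
                                                        AS rms_doc_rate;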

=== Count-min-sketch

  • Count-min-sketch filter for bigrams

    • hash

    • choosing a hash function; watch out for deanonymization

    • choosing number, length of hash; using parts of a hash as hash. (2^21 = 2 million, getting three from a 64-bit hash; 2^25 = 33 million, getting five from two 64-bit hashes; 2^26 = 66 million, getting six from five 32-bit hashes / 2 64-bit and a 32-bit; 2^27 = 134 million, getting seven from three 64-bit hashes.)

    • http://web.stanford.edu/class/cs166/lectures/12/Small12.pdf

    • w is the number of possible hash outcomes. Eg for 21-bit hash get 2 million slots; for 25-bit hash get 33 million slots

    • dh is the number of hash functions we use

    • a_i is the usages count for term a

    • U = |a|_1 (the l1 norm of a) is sum(a_i): the total usage count

    • |a|_2 (l2 norm) is sqrt(sum(a_i^2)); it is strictly less than |a|_1.

    • the naive background error with one hash function will be U/w: scattering the U observations evenly across w outcomes. If I have 4 million articles averaging 100 words each, and 25 bits giving 33 million slots, the expected slop is 12 counts. For 21 bits, it’s 400m/2m= 191 counts.

    • with many hash functions, if we take

    • eps = e * U / w = 2.718 * U / 2^hashbits

    • delta = 1/exp(dh) — with 4 hashes it’s 1/55, 5 hashes it’s ~1/150, with seven it’s ~1/1100

    • then with probability at least 1 − delta, the error in count will be less than eps.

    • summary

    • w = 2^hbits

    • data size = dh*w

    • background bin count = U/w

    • eps = e * U/w

    • delta = 1/exp(dh)

    • for 400m usages, 21 bits,

  • BNC stats: #1 "the": 62k ppm; #10, "was": 9.2k ppm; 36 ppm: "at once", empire, sugar; 17 ppm: "as opposed to", horrible, measurements; 10 ppm "relative to", dirt, sexually
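
A worked instance of those bounds, using the figures above (U = 400 million usages, a 25-bit hash so w = 2^25 ≈ 33.5 million slots, dh = 4 hash functions):

<math>
\epsilon = \frac{e \cdot U}{w} = \frac{2.718 \times 4\times10^{8}}{2^{25}} \approx 32 \text{ counts},
\qquad \delta = e^{-4} \approx \tfrac{1}{55}
</math>

So with probability at least 1 − 1/55 (about 98%), the sketch's estimate for any term exceeds its true count by fewer than roughly 32 usages; with a 21-bit hash the same bound loosens to roughly 520 counts.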

===== notes on figures

The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.

Range (Ra) is a simple count of how many segments include the tag in question. Dispersion (Disp) is a statistical coefficient (Juilland's D) of how evenly distributed a tag is across successive segments of the corpus. This is useful, because many segments and texts are made up of a number of smaller, relatively independent units – for example, sectors and stories in newspapers. It may be that, even within a text, certain word classes are overused in a given part – e.g. the football-reporting sector of a newspaper. Juilland's D is more sensitive to this degree of variation. It is calculated as shown below, where x is the mean sub-frequency of the tag in the subcorpus (i.e. its frequency in each segment, averaged) and s is the standard deviation of these sub-frequencies. We selected Juilland's D as it has been shown by Lyne (1985) to be the most reliable of the various dispersion coefficients that are available. It varies between 0 and 1, where values closer to 0 show that occurrences are focussed in a small number of segments, and values closer to 1 show a more even distribution across all segments.

D =1- V/sqrt(n-1)

where n is the number of segments in the subcorpus. The variation coefficient V is given by:

V = s / x
http://www.linguistics.ucsb.edu/faculty/stgries/research/2008_STG_Dispersion_IJCL.pdf

Let us assume this is our corpus of length l = 50, where letters represent words and the pipes mark the division of the corpus into different parts (here, n = 5 equally-sized parts):

bamnibeupk|basatbewqn|bcagabesta|baghabeaat|bahaabeaxa

The percentages that each of the parts makes up of the whole corpus — in this case all 0.2 — are denoted as s1 to s5. Let's assume we are interested in the word a in the corpus. The frequencies with which a occurs in each part are denoted as v1, v2, etc.; as you can see, a occurs once in the first part, twice in the second part, and so on, such that v1=1, v2=2, v3=3, v4=4, v5=5. The vector with all observed frequencies (1, 2, 3, 4, 5) is referred to as v, and the sum of all observed frequencies, i.e. the number of occurrences of a, is referred to as f (f = Σv = 15). Note also some other words' distributions: x occurs only once in the whole corpus; e occurs once in each segment.

The first are general stats, not specifically geared to the dispersion of linguistic items in texts but more often used as general indices of variation/dispersion:

* range: the number of parts containing the term (at least x times, where x is usually, but need not be, 1) = 5
* max-min difference: max(v) - min(v) = 4
* standard deviation: sd = sqrt( N*sum(x^2) - sum(x)^2 ) / N
* variation coefficient vc: the coefficient of variation (CV) is the ratio of the standard deviation to the mean, sd/mean (http://en.m.wikipedia.org/wiki/Coefficient_of_variation)
* chi-squared χ²: the sum over parts of (observed − expected)²/expected, with an expected count of f/n = 3 per part; ≈ 3.33 for the example

Next, a few well-known "classics" from the early 1970s:

* Juilland et al.'s (1971) D: D = 1 − V/sqrt(n−1) ≈ 0.736 for the example (a worked check follows this list)
* Rosengren's (1971) S
* Carroll's (1970) D2, an entropy-based measure
* idf
* Juilland's D adjusted for differing corpus sizes: ≈ 7.906 for the example
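
A quick check of that 0.736 figure on the toy corpus, assuming the sample standard deviation is what's plugged into V:

<math>
\bar{v} = \tfrac{1+2+3+4+5}{5} = 3, \qquad
s = \sqrt{\tfrac{(1-3)^2+(2-3)^2+0^2+(4-3)^2+(5-3)^2}{5-1}} \approx 1.581
</math>

<math>
V = s/\bar{v} \approx 0.527, \qquad
D = 1 - \tfrac{V}{\sqrt{n-1}} = 1 - \tfrac{0.527}{2} \approx 0.736
</math>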

To determine the degree of dispersion DP of word a in a corpus with n parts, one needs to take three simple steps:

i. Determine the sizes s1−n of each of the n corpus parts, which are normalized against the overall corpus size and correspond to expected percentages which take differently-sized corpus parts into consideration.
ii. Determine the frequencies v1−n with which a occurs in the n corpus parts, which are normalized against the overall number of occurrences of a and correspond to an observed percentage.
iii. Compute all n pairwise absolute differences of observed and expected percentages, sum them up, and divide the result by two. The result is DP, which can theoretically range from approximately 0 to 1, where values close to 0 indicate that a is distributed across the n corpus parts as one would expect given the sizes of the n corpus parts. By contrast, values close to 1 indicate that a is distributed across the n corpus parts exactly the opposite way one would expect given the sizes of the n corpus parts.

* Divide DP by 1 − (1/n) to yield DPnorm.
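
Worked on the toy corpus above:

<math>
\tfrac{v}{f} = (\tfrac{1}{15},\tfrac{2}{15},\tfrac{3}{15},\tfrac{4}{15},\tfrac{5}{15}) \approx (0.067,\,0.133,\,0.200,\,0.267,\,0.333),
\qquad s_{1..5} = 0.2
</math>

<math>
\mathrm{DP} = \tfrac{1}{2}\sum_i \left|\tfrac{v_i}{f} - s_i\right| = \tfrac{0.133+0.067+0+0.067+0.133}{2} = 0.2,
\qquad \mathrm{DP}_{norm} = \tfrac{0.2}{1-\tfrac{1}{5}} = 0.25
</math>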


Example DP values (word, DP):

Broad dispersion: a 0.08, to 0.103, and 0.106, with 0.155, but 0.158, in 0.159, not 0.165, this 0.166, the 0.168, have 0.178, be 0.207, are 0.223, that 0.227, there 0.243, of 0.249

Lumpy dispersion: macari 0.999 10, mamluks 0.998 10, lemar 0.996 10, sem 0.994 10, hathor 0.994 10, tatars 0.989 10, scallop 0.989 10, malins 0.988 10, ft 0.986 102, defender 0.98 10, scudamore 0.98 10, pre 0.945 10, diamond 0.941 102, carl 0.938 102, proclaimed ...

Medium dispersion: includes, thousands, plain, formal, anywhere, properly, excuse, hardly, er, each, lot, house, tell, came

1. There are several competing formulas to capture the concept we have in mind; this version, "Juilland's D", is easy to calculate
2. This is a simplified version of work by Jason Baldridge, Ben Wing (TODO: rest of authors), who go farther and show how to geolocate texts based purely on their content. An article mentioning barbecue and Willie Nelson would be placed near Austin, TX; one mentioning startups and trolleys, in San Francisco. See: Baldridge et al (TODO: reference)