
Tw4Years-Filters

Code to accompany: Masis, Eggleston, Green, Jones, Armstrong, and O'Connor. "Large-scale Twitter data reveals systematic geographic and social variation in African American English." Currently under peer review. 2023.

Contact: Chloe Eggleston, Tessa Masis (tmasis@cs.umass.edu), Brendan O'Connor (brenocon@cs.umass.edu)

This software was used to filter an archive of Gardenhose messages to create the "Twitter4Years" dataset described below.

See also: https://github.com/slanglab/aae-twitter

Description (copied from paper draft)

The Twitter4Years dataset is a corpus of approximately 227M tweets authored from May 2011 through April 2015 that are geolocated with latitude/longitude coordinates in the U.S., with various metadata filters applied to help focus on "everyday" people (attempting to exclude, for example, celebrities and automated accounts). Messages’ locations have also been augmented with geographical shapefile lookups to TIGER/Line blockgroups distributed by the U.S. Census, making the dataset integrable with demographic information from Census data products at any level of its geographical hierarchy (blockgroups, tracts, counties, etc.).
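
For illustration, here is a minimal sketch of such a blockgroup lookup using geopandas. This is not the repository's code; the shapefile path, the GEOID column name, and the choice of geopandas are assumptions for the example.

```python
# Minimal sketch of a point-in-polygon blockgroup lookup, assuming a
# local TIGER/Line blockgroup shapefile with a GEOID attribute (as in
# Census distributions). Not the repository's actual lookup code.
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical path to one state's blockgroup shapefile.
blockgroups = gpd.read_file("tl_2013_25_bg.shp")

def lookup_geoid(lon, lat):
    """Return the GEOID of the blockgroup containing (lon, lat), or None."""
    hits = blockgroups[blockgroups.contains(Point(lon, lat))]
    return hits.iloc[0]["GEOID"] if len(hits) else None
```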

This dataset was constructed by filtering an archive of Twitter Gardenhose messages that were collected in real time during the period 2011-2015; we call this the "initial archive." During this period, Twitter called the service "Gardenhose", a version of Twitter’s Sample Stream API whose documentation stated it gave access to a 10% sample of all publicly posted tweets.\footnote{Gardenhose may be technically similar to a later service called "Decahose"; see also discussion of limitations of sample stream randomness in \citet{morstatter2013sample, kergl2014endogenesis, pfeffer2018tampering}.} We used no query filters when obtaining data from the API.

A subset of this initial archive was used to create the 56M tweet corpus described in \citet{blodgett16}, which consisted of US geolocated tweets authored in the year 2013. Other selections of this initial archive have been analyzed in earlier work as well, such as \citet{o2010tweets, eisenstein2010latent, eisenstein2014diffusion}.

We now discuss the filtering and analysis, performed in service of this work, that derived Twitter4Years from the initial archive. The collection decisions in this work follow broadly similar motivations as \citet{blodgett16}, with a number of differences and intended improvements. (It contains 92.8% of the messages in \citet{blodgett16}’s corpus.) From the initial archive, we filtered to messages fulfilling all of the following criteria (illustrated in the code sketch after the list):

  • Geotagged using latitude/longitude coordinates (often called “GPS coordinates”).
  • Author had fewer than 5,000 followers when the tweet was authored.
  • Message is not a retweet and its text does not contain any of the following words (case-insensitive matching): rt, follow, mention.
  • Message text has fewer than three hashtags.
  • Message was authored from May 2011 through April 2015, a 48-month period. We chose a multiple of 12 months as the duration in order to eliminate anomalous effects from annually occurring holidays or other cultural events. The specific range was chosen as one with a high prevalence of geotagged tweets within the initial archive. Outside this time range, users less frequently posted their location (perhaps due to software and user-interface changes).
  • The tweet’s authoring platform (Twitter client, the \emph{source} field in the API's tweet object) was limited to one of approximately two dozen typically used by everyday people, as opposed to automated marketing and other uses that sometimes publish to a particular location. The list was manually selected by examining the 125 most frequent authoring platforms, which make up approximately 99% of the dataset. Our whitelist includes, for example, popular clients from the time that were used to access Twitter from the Web, iPhone, Blackberry, and Android, and excludes cross-posts from other social media platforms like Instagram and FourSquare, as well as social media hiring platforms like CareerArc.
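
The sketch below restates the message-level criteria as code, assuming Twitter API v1.1 tweet JSON. It approximates, but is not, the repository's filters.py: the word-boundary matching for the banned words is an assumption, and the date-range and geographic checks are handled by other pipeline stages.

```python
# Sketch of the message-level filters above, assuming v1.1 tweet JSON.
# Field names follow the v1.1 API; the banned-word matching here
# (word boundaries) is an assumption about the original filters.
import re

BANNED = re.compile(r"\b(rt|follow|mention)\b", re.IGNORECASE)

def passes_filters(tweet, whitelist):
    if not tweet.get("coordinates"):               # exact lat/lon required
        return False
    if tweet["user"]["followers_count"] >= 5000:   # fewer than 5,000 followers
        return False
    if "retweeted_status" in tweet:                # no retweets
        return False
    if BANNED.search(tweet["text"]):               # no rt/follow/mention
        return False
    if len(tweet["entities"]["hashtags"]) >= 3:    # fewer than three hashtags
        return False
    return tweet["source"] in whitelist            # whitelisted client only
```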

This results in 227,253,198 messages (including duplicates); 226,631,796 deduplicated messages; and 7,414,332 unique authors.
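
As a point of reference, counting under these definitions might look like the minimal sketch below; keying deduplication on the tweet id is an assumption, and the repository's accounting may differ.

```python
# Sketch: total, deduplicated, and unique-author counts over an
# iterable of parsed tweet dicts. Dedup by tweet "id" is an assumption.
def corpus_counts(tweets):
    total, ids, authors = 0, set(), set()
    for tweet in tweets:
        total += 1
        ids.add(tweet["id"])
        authors.add(tweet["user"]["id"])
    return total, len(ids), len(authors)
```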

Whitelist description

The file whitelist contains the sources (Twitter client IDs) that are included. The list below describes each platform, with its line number(s) in the whitelist file; a code sketch of whitelist matching follows the list:

  1. Official Client/Browser (1-15)
    • The official Twitter clients, covering iPhone, Android, Blackberry, Windows Phone, Nokia n40, Windows, iPad, Android tablets, and web browsers
  2. TweetCaster (16,21,27)
  3. UberSocial (17,24,25,34,44,55)
    • A third-party Twitter client. Includes versions for Blackberry, iOS, and Android, as well as stand-alone branded skins for various media properties (Sherlock Holmes, Tower Heist). Website Archive
  4. Tweetbot (18,20,36,39)
  5. TweetDeck (19)
    • A third-party Twitter client. Acquired by Twitter in 2011. Wikipedia
  6. NightfoxDuo (22,53)
    • A Japanese game that uses Twitter for a free-form social networking component. Description
  7. Echofon (23,50)
  8. Plume (26,30,37)
  9. Endomondo (28)
    • A fitness platform that lets users post exercise updates to Twitter. Wikipedia
    • Note: Upon further review, this source does not strictly fit the criteria for inclusion and should have been excluded. However, it makes up only 0.008% of the dataset, so we do not anticipate that downstream analyses would be affected by this error.
  10. UberTwitter (29,35)
  11. Gravity (31,38)
  12. Tweetlogix (32)
  13. Hootsuite (33,46)
    • A social media management platform for handling social media accounts across multiple services. Wikipedia
  14. Twittelator (40)
  15. Twitbird (41,52)
  16. Twidroyd (42,43)
    • A third-party Twitter client, eventually acquired by UberMedia and rebranded as UberSocial for Android. Website Archive
  17. A.plus (45)
    • A third-party Twitter client for desktop, eventually integrated into UberSocial as a theme. Website Archive.
  18. Seesmic (47)
    • A social media management platform, acquired by HootSuite in 2012. Wikipedia
  19. twicca (48)
  20. Fenix (49)
  21. Tweetie (51)
    • A third-party Twitter client. Acquired by Twitter in 2010. Wikipedia
  22. Tweedle (54)
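
Since the whitelist file's exact entry format is not documented above, the sketch below assumes one source string per line and exact matching against the tweet's source field; the actual matching in filters.py may differ.

```python
# Sketch: load the whitelist and test a tweet's "source" field against
# it. One entry per line and exact string matching are assumptions.
def load_whitelist(path="whitelist"):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def source_ok(tweet, whitelist):
    # In v1.1 JSON the source field is an HTML anchor, e.g.
    # '<a href="..." rel="nofollow">Twitter for iPhone</a>'.
    return tweet["source"] in whitelist
```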

Usage

The pipeline is run in the following order; these notes record the data file locations on our internal server during development. A sketch of invoking the stages follows the list.

Filter 1: Geographical coordinates
	Script: geo.sh (originally at /home/ceggleston/geo.sh)
	Source Directory: /data/tweets/all
	Output Directory: /home/ceggleston/extracted

Filter 2: USA Located
	Script: usaify.py (originally at /home/ceggleston/usaify/usaify.py)
	Source Directory: /home/ceggleston/extracted
	Output Directory: /home/ceggleston/usa_extracted

Filter 3: Source Whitelist + RT/Follow/Mention + Followers Count + Hashtags Count
	Script: filters.py (originally at /home/ceggleston/samples/filters.py)
	Source Directory: /home/ceggleston/usa_extracted
	Output Directory: /home/ceggleston/dataset/tweets

Script 4: Attaching GEOIDs to each tweet for use in ACS data
	Script: gen_geoids.py (originally at /home/ceggleston/dataset/gen_geoids.py)
	Source Directory: /home/ceggleston/dataset/tweets
	Output Directory: /home/ceggleston/dataset/geoids
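
For orientation, here is a hedged sketch of driving the four stages in order. The scripts' command-line interfaces are not documented here, so the bare, argument-free invocations below are assumptions.

```python
# Sketch: run the pipeline stages in order. The invocations are
# assumptions; each script's real CLI (arguments, paths) may differ.
import subprocess

STAGES = [
    ["bash", "geo.sh"],           # Filter 1: geographical coordinates
    ["python", "usaify.py"],      # Filter 2: USA located
    ["python", "filters.py"],     # Filter 3: whitelist and message filters
    ["python", "gen_geoids.py"],  # Script 4: attach GEOIDs
]

for cmd in STAGES:
    subprocess.run(cmd, check=True)  # halt the pipeline on any failure
```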
