Time (min:sec) | Explanation |
0:00 | load data |
0:12 | load data clicked |
0:15 | metadata loaded; 1.66 billion rows; no data displayed |
0:20 | display timezone column |
0:25 | data displayed, sorted on timezone column. There are 438M rows
with no timezone, 5.6M rows with timezone Abu Dhabi, etc. |
0:34 | displayed histogram of CreatedAt column. Some tweets have a date
of 1801; start zooming into the last bar using the mouse |
0:58 | zoomed into January 1 and 2, 2013 |
1:01 | assigned each bucket a different color based on date |
1:02 | drag-and-drop yellow intersection sign from CreatedAt histogram to
spreadsheet; choose "intersection". |
1:11 | intersection data has 1.63 billion rows |
1:21 | histogram AdultScore column (values between 0 and 1) |
1:24 | drag-and-drop colors from AdultScore histogram to CreatedAt histogram |
1:27 | result is a 2D histogram of CreatedAt, where each bar is
divided into colors according to AdultScore |
1:41 | sort descending on AdultScore (second sort column remains
Timezone, not visible on screen) - lexicographic sort on 2 columns |
1:43 | grouped visible columns to the left |
1:44 | there are 1996 rows with an adult score of 1 and an empty
timezone; 2 rows with an adult score of 1 and an Abu Dhabi
timezone, etc. |
1:55 | add SpamScore column to sort order, in first position (data
now sorted on 3 columns).
There are 402 rows with SpamScore 0, AdultScore 1 and no
Timezone. |
2:04 | Draw heatmap of AdultScore vs Timezone |
2:19 | Heatmap drawn; color shows density |
2:25 | chosen logarithmic colors for density. Each pixel shows count
of points; counts vary between 2 (cyan) and 48M (orange) |
2:30 | zoom into lower-left corner of heatmap (A-H timezones,
0-0.2 AdultScore) |
2:35 | New heatmap drawn; density between 2 and 33M/pixel |
2:58 | show tweet Text column; loading takes 25 seconds - most data
is in this column
(some video excised; re-sorted alphabetically on tweet text). |
3:24 | first tweets shown have funny unicode characters |
3:28 | add a new computed column to spreadsheet (Map computation)
Name: Length, Type: Integral, Code: row.text.Length (C# code) |
4:03 | New column computed and added |
4:10 | histogram Length column |
4:13 | Length histogram displayed; Length goes up to 510 characters! |
4:17 | Zoom into tweets with long length; histogram of tweets with
length > 500 displayed (2884 tweets) |
4:19 | Intersect these tweets with spreadsheet to see text of long
tweets |
4:22 | Long tweets displayed: they all have quoted XML characters |
4:27 | back button pressed: display previous set of 1.63 billion tweets in
spreadsheet (instantaneous redisplay of cached rendering) |
4:30 | back button for Length spreadsheet |
4:35 | Zoomed into tweets with length 0-150 |
4:54 | In CreatedAt window zoom into tweets on Jan 2 only |
5:02 | drag-and-drop color from CreatedAt to Length. Grey bars show
tweets that have length displayed but are not on Jan 2. |
5:19 | zoom into Lengths 0-140, display with 35 buckets |
5:41 | intersect Jan 2 dataset with Length dataset to display only
lengths for Jan 2; 1.186 billion tweets left |
5:38 | normalize histogram bars to discover correlation between
Length and CreatedAt. No strong correlation. |
5:52 | drag-and-drop colors from AdultScore onto Length to discover
AdultScore/Length correlation |
5:53 | Short tweets tend to have smaller adult scores |
6:00 | end |
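
The Map computation at 3:28 and the bucketed Length histogram at 5:19 can be sketched in a few lines. This is only an illustration in Python, not the tool's own code (the demo itself uses the C# expression row.text.Length); the function names add_length_column and histogram and the sample rows are hypothetical. Note also that Python's len counts Unicode code points, whereas C# string.Length counts UTF-16 code units, so the two can differ on tweets with unusual characters.

```python
def add_length_column(rows):
    """Map computation: derive an integer Length column from each row's text."""
    return [{**row, "Length": len(row["text"])} for row in rows]


def histogram(values, lo, hi, buckets):
    """Count values in [lo, hi] into equal-width buckets; values outside are ignored."""
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in values:
        if lo <= v <= hi:
            # Clamp so that v == hi falls into the last bucket.
            idx = min(int((v - lo) / width), buckets - 1)
            counts[idx] += 1
    return counts


# Hypothetical sample rows standing in for the tweet table.
rows = [
    {"text": "hi"},
    {"text": "a" * 140},
    {"text": "&lt;" * 125},  # 500 characters of quoted XML, like the long tweets at 4:22
]

with_length = add_length_column(rows)
# Same settings as the 5:19 step: lengths 0-140, 35 buckets.
counts = histogram([r["Length"] for r in with_length], 0, 140, 35)
```

With these settings each bucket covers 4 length values, and the 500-character row falls outside the zoomed range, just as the long tweets drop out of view after zooming to 0-140.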