PyData Q2 Meetup Notes
The Q2 PyData Triangle Meetup featured Peter Baumgartner from RTI. I have shared my notes below.
Ginny Ghezzo (@GinnyGhezzo), May 2, 2018
News agencies noted that the US government was capturing only about half of actual arrest-related deaths. Starting in 2011 there was a push to record the data more accurately.
RTI (Peter’s employer) worked with the Bureau of Justice Statistics to redo their data collection.
They scrape news sources to gather the raw data.
- some sources
- National News
- De-Dupe as much as possible
- Text similarity
- Relevancy classifier
- Human web front end for final classification
Stack is fully open source
- 1.2M Keyword Matched Articles
- 245K Uniques
- 8750 Relevant Articles
- 135 Unique Events
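
As a quick check on how sharply each stage narrows the data, the counts above can be turned into stage-to-stage retention rates. This is only arithmetic on the rounded figures from the notes; the stage labels and the `stage_retention` helper are names chosen for this sketch, not anything from the talk.

```python
# Counts as reported in the notes (1.2M and 245K are rounded figures).
funnel = [
    ("keyword-matched articles", 1_200_000),
    ("unique articles", 245_000),
    ("relevant articles", 8_750),
    ("unique events", 135),
]


def stage_retention(stages):
    """Return (name, count, fraction kept from the previous stage) per stage."""
    rows = []
    previous = None
    for name, count in stages:
        kept = None if previous is None else count / previous
        rows.append((name, count, kept))
        previous = count
    return rows


for name, count, kept in stage_retention(funnel):
    note = "" if kept is None else f" ({kept:.1%} of previous stage)"
    print(f"{count:>9,} {name}{note}")
```

Read this way, de-duplication keeps roughly one article in five, and the relevancy step keeps only a few percent of what remains.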
Word embedding module ([Word2Vec](https://deeplearning4j.org/word2vec.html))
Term Frequency-Inverse Document Frequency (TF-IDF) is used to reduce the number of articles and dedupe them.
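
To make the TF-IDF idea concrete, here is a minimal pure-Python sketch that weights terms and compares articles by cosine similarity. It is not the pipeline from the talk: the tokenizer, the smoothed IDF formula, and the three sample headlines are all assumptions made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    Uses relative term frequency and a smoothed IDF,
    ln((1 + N) / (1 + df)) + 1, which is an assumption of this sketch.
    """
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented headlines, for illustration only.
docs = [
    "Man dies in police custody after arrest in Durham",
    "Durham man dies in police custody following arrest",
    "City council approves new budget for road repairs",
]
vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]))  # high: near-duplicate stories
print(cosine(vectors[0], vectors[2]))  # no shared terms
```

Pairs that score above some chosen threshold would be candidates for de-duplication; the threshold itself would need tuning on real data.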
Use Jaccard similarity to decide which of the near-duplicates to throw out.
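
Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The sketch below applies it to word sets and drops any article too similar to one already kept. The 0.7 threshold, the keep-the-first-seen rule, and the sample headlines are assumptions for illustration; the notes do not say how the real system chooses which copy to discard.

```python
import re


def word_set(text):
    """Return the set of lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of the word sets of two strings."""
    set_a, set_b = word_set(a), word_set(b)
    if not set_a and not set_b:
        return 1.0  # treat two empty texts as identical
    return len(set_a & set_b) / len(set_a | set_b)


def drop_near_duplicates(articles, threshold=0.7):
    """Keep an article only if it is below the threshold against all kept ones."""
    kept = []
    for article in articles:
        if all(jaccard(article, other) < threshold for other in kept):
            kept.append(article)
    return kept


# Invented headlines, for illustration only.
articles = [
    "Man dies in police custody after arrest in Durham",
    "Durham man dies in police custody following arrest",
    "City council approves new budget for road repairs",
]
for article in drop_near_duplicates(articles):
    print(article)
```

Comparing every article against every kept article is quadratic, so at the scale of the counts above a real system would likely need blocking or hashing to limit the comparisons.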
Machine learning helps classify which articles are really valid. With each pass they take a random sample and classify each article as in or out.
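
The sample-and-label loop might look something like the sketch below. It only covers drawing a random batch on each pass and recording an in/out decision; training the classifier on the accumulated labels is left out. The function name, its parameters, and the stand-in `fake_reviewer` are hypothetical, since the notes do not describe the actual tooling.

```python
import random


def label_in_passes(articles, human_label, sample_size, passes, seed=0):
    """Draw a random batch each pass and record a label for every article in it.

    Returns (labeled, unlabeled), where labeled is a list of
    (article, label) pairs that could later be used as training data.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    unlabeled = list(articles)
    labeled = []
    for _ in range(passes):
        if not unlabeled:
            break
        batch = rng.sample(unlabeled, min(sample_size, len(unlabeled)))
        for article in batch:
            labeled.append((article, human_label(article)))
            unlabeled.remove(article)
    return labeled, unlabeled


def fake_reviewer(article):
    """Stand-in for a human decision; a real reviewer would read the article."""
    return "in" if "custody" in article else "out"


# Invented article stubs, for illustration only.
articles = [f"story {i} about custody" for i in range(5)]
articles += [f"story {i} about weather" for i in range(5, 10)]

labeled, remaining = label_in_passes(articles, fake_reviewer, sample_size=3, passes=2)
print(len(labeled), "labeled;", len(remaining), "still unlabeled")
```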