PyData Q2 Meetup Notes

The Q2 PyData Triangle Meetup featured Peter Baumgartner from RTI. My notes are below.

Thank you @pmbaumgartner for a fantastic presentation #MachineLearning pic.twitter.com/1OSlcCMFoa

— Ginny Ghezzo (@GinnyGhezzo) May 2, 2018

News agencies noted that the US government was capturing only about half of actual arrest-related deaths. Starting in 2011, there was a push to record the data more accurately.

RTI (Peter’s employer) worked with the Bureau of Justice Statistics to redo their data collection.

They scrape news sources to gather the raw data.

Exclude some sources

National News

De-Dupe as much as possible

Text similarity

Relevancy Classifier

Human web front end for final classification

Stack is fully open source

Filtering Down

1.2M Keyword Matched Articles

245K Uniques

8750 Relevant Articles

135 Unique Events
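To get a feel for how steep this funnel is, here is a small sketch of the arithmetic. The counts are the ones from the notes (1.2M and 245K are rounded figures); the function and variable names are my own.

```python
# Counts at each stage of the pipeline, as written down in the notes.
stages = [
    ("Keyword Matched Articles", 1_200_000),
    ("Uniques", 245_000),
    ("Relevant Articles", 8_750),
    ("Unique Events", 135),
]

def retention(stages):
    """Return (label, count, fraction kept from the previous stage)."""
    rows = []
    previous = None
    for label, count in stages:
        kept = None if previous is None else count / previous
        rows.append((label, count, kept))
        previous = count
    return rows

for label, count, kept in retention(stages):
    note = "" if kept is None else f" ({kept:.1%} of previous stage)"
    print(f"{count:>9,} {label}{note}")
```

Roughly one in five keyword matches survives deduplication, and each unique event ends up backed by about 65 relevant articles on average (8750 / 135).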

Word embedding module ([Word2Vec](https://deeplearning4j.org/word2vec.html))

Term Frequency-Inverse Document Frequency (TF-IDF) is used to reduce the number of articles by de-duping them.
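My notes do not record exactly how TF-IDF was applied, so the following is only a minimal sketch of one common approach: weight each article's terms by TF-IDF, then compare articles with cosine similarity. The tokenizer, the smoothed IDF formula (`log(N / df) + 1`), and the sample headlines are all my assumptions, not something shown in the talk.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep runs of letters (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * (math.log(n_docs / df[term]) + 1.0)
            for term, count in tf.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up headlines for illustration only.
articles = [
    "Man dies in police custody after arrest downtown",
    "Man dies in police custody following downtown arrest",
    "City council approves new budget for road repairs",
]
vectors = tfidf_vectors(articles)
print(round(cosine(vectors[0], vectors[1]), 2))  # near-duplicates score high
print(round(cosine(vectors[0], vectors[2]), 2))  # no shared terms, so 0.0
```

Pairs that score above some cutoff would be treated as the same story. The cutoff itself is a tuning choice I did not capture in my notes.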

Use Jaccard similarity to decide which of the duplicate articles to throw out.
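Jaccard similarity is the size of the intersection of two sets divided by the size of their union. Here is a rough sketch of using it on word sets to drop near-duplicates. The 0.7 threshold, the "keep the first one seen" rule, and the sample headlines are my assumptions; the notes do not say how the team chose which copy to keep.

```python
import re

def token_set(text):
    """Set of lowercase word tokens in the text."""
    return set(re.findall(r"[a-z']+", text.lower()))

def jaccard(a, b):
    """|A & B| / |A | B| for two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def drop_near_duplicates(articles, threshold=0.7):
    """Keep an article only if it is not too similar to one already kept."""
    kept = []
    for text in articles:
        tokens = token_set(text)
        if all(jaccard(tokens, token_set(k)) < threshold for k in kept):
            kept.append(text)
    return kept

# Made-up headlines for illustration only.
articles = [
    "Man dies in police custody after arrest downtown",
    "Man dies in police custody following downtown arrest",
    "City council approves new budget for road repairs",
]
print(round(jaccard(token_set(articles[0]), token_set(articles[1])), 2))  # 7 shared of 9 total -> 0.78
print(drop_near_duplicates(articles))  # the second headline is dropped
```

This compares every article against every kept article, which is fine for a sketch but would need something smarter at the scale of hundreds of thousands of articles.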

Machine learning helps classify which articles are actually valid. With each pass they take a random sample and classify each article as in or out.
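The labeling passes can be sketched roughly as follows. I did not note which model the team used, so the tiny naive Bayes classifier here is purely a stand-in, and `human_label` is a hypothetical placeholder for the reviewer working in the web front end. The sample articles are made up.

```python
import math
import random
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing (a stand-in model)."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            score = math.log(self.doc_counts[c] / total_docs)
            for token in tokenize(text):
                if token in self.vocab:  # ignore words never seen in training
                    count = self.word_counts[c][token]
                    score += math.log((count + 1) / (total_words + len(self.vocab)))
            scores[c] = score
        return max(scores, key=scores.get)

def human_label(text):
    """Hypothetical stand-in for a human reviewer marking an article in or out."""
    lowered = text.lower()
    return "in" if "custody" in lowered or "arrest" in lowered else "out"

def labeling_passes(pool, n_passes=2, sample_size=3, seed=0):
    """Each pass: draw a random sample, label it, retrain on all labels so far."""
    rng = random.Random(seed)
    unlabeled = list(pool)
    texts, labels = [], []
    model = None
    for _ in range(n_passes):
        sample = rng.sample(unlabeled, min(sample_size, len(unlabeled)))
        for text in sample:
            unlabeled.remove(text)
            texts.append(text)
            labels.append(human_label(text))
        model = NaiveBayes().fit(texts, labels)
    return model, unlabeled

pool = [
    "Man dies in police custody after arrest",
    "Suspect shot during arrest dies at hospital",
    "Inmate dies in custody at county jail",
    "City council approves budget for road repairs",
    "Local bakery wins state pastry award",
    "School board votes on new bus routes",
]
model, remaining = labeling_passes(pool)
print(model.predict("Deputies say man dies after arrest"))
print(model.predict("Council votes on road budget"))
```

The idea is that each pass adds a few more human labels, so the model that filters the remaining articles improves over time.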

Additional Reading

Demo GitHub Notebooks

Written on May 2, 2018