Beckerfuffle

Go fuffle yourself!

PyData NYC: The Really Short Version

Here are my notes from PyData with links for more details. This isn’t a complete list, and in some cases my notes don’t really do justice to the actual talks, but I hope that these will be helpful to anyone who’s feeling PyData FOMO until the videos are released.

Disclaimer: I took almost no notes on the second day so a bunch of my favorite talks are missing.

High Performance Text Processing with Rosetta

Rosetta is a highly optimized NLP library with a focus on memory efficiency.

  • TextFileStreamer - provides a streaming tokenizer. Good for memory efficiency.
  • DBStreamer - Ditto but for data in a DB.
  • Can be easily combined with online learning methods.
  • An IPython notebook with example code can be found here.

Python in the Hadoop/Spark Ecosystem

  • Bcolz - A columnar data container that can be compressed (supported by Blaze).
  • Impyla - Python client and Numba-based UDFs for Impala.
  • Streamparse - lets you run Python code against real-time streams of data. Integrates with Apache Storm.
  • Kafka-python - Kafka protocol support in Python.
  • Anaconda cluster - Bringing the Python ecosystem to Hadoop and Spark.
  • Libcloud - Python library for interacting with many of the popular cloud service providers using a unified API.

Data warehouse and conceptual modelling with Cubes 1.0

Light-weight Python framework and OLAP HTTP server for easy development of reporting applications and aggregate browsing of multi-dimensionally modeled data. The slides for this talk are already online.

  • Works best with aggregating categorical data.
  • Cubes visualizer - Cubes Visualizer is an application for browsing and visualizing data from a cubes Slicer server.
  • Has Google Analytics support built-in (a good way to drill into Google Analytics data?).
  • Cubesviewer - Visual tool for exploring and analyzing OLAP databases.
  • http://checkgermany.de/ - example application.

How to Make Your Future Data Scientists Love You

Excellent talk with common mistakes made by many companies, and how to avoid making Data Science hard or impossible in the future. My notes on this talk don’t really do it justice, so please see Sasha’s blog for more details.

  • Is your data set complete?
  • Is your data correct?
  • Is your data connectable?

Command-line utilities for exploring your data

Recalling with precision

Awesome talk about measuring and tracking predictive model performance. The speaker, Julia, open sourced the web app she developed at Stripe, called “top model”, right before her talk.

Simple Machine Learning with SKLL 1.0

SKLL is a wrapper around scikit-learn that makes prototyping predictive algorithms as easy as creating a CSV and running a Python script. I got a chance to talk extensively with the speaker, and it seems like they’ve done a good job of handling most of the typical gotchas of scikit-learn.

That’s all for now, I’ll send out an update once the videos are live.


Data Science With Python: Part 1

This is the first post in a multi-part series wherein I will explain the details surrounding the language prediction model I presented in my Pycon 2014 talk. If you make it all the way through, you will learn how to create and deploy a language prediction model of your own.

Realtime predictive analytics using scikit-learn & RabbitMQ

OSEMN

I’m not sure if Hilary Mason originally coined the term OSEMN, but I certainly learned it from her. OSEMN (pronounced awesome) is a typical data science process that is followed by many data scientists. OSEMN stands for Obtain, Scrub, Explore, Model, and iNterpret. As Hilary put it in a blog post on the subject: “Different data scientists have different levels of expertise with each of these 5 areas, but ideally a data scientist should be at home with them all.” As a common data science process, this is a great start, but sometimes this isn’t enough. If you want to make your model a critical piece of your application, you must also make it accessible and performant. For this reason, I’ll also discuss two more steps, Deploy and Scale.

Obtain & Scrub

In this post, I’ll cover how I obtained and scrubbed the training data for the predictive algorithm in my talk. For those who didn’t have a chance to watch my talk, I used data from Wikipedia to train a predictive algorithm to predict the language of some text. We use this algorithm at the company I work for to partition user generated content for further processing and analysis.

Pagecounts

So step 1 is obtaining a dataset we can use to train a predictive model. My friend Rob recommended I use Wikipedia for this, so I decided to try it out. A few datasets extracted from Wikipedia are available online at the time of this writing; otherwise, you need to generate the dataset yourself, which is what I did. I grabbed hourly page views per article for the past 5 months from dumps.wikimedia.org and wrote some Python scripts to aggregate these counts and dump the top 50,000 articles from each language.
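
A minimal sketch of that aggregation, assuming the pagecounts-raw line format (project code, page title, hourly view count, bytes transferred) and using made-up file names, looks something like this:

import glob
import gzip
from collections import Counter, defaultdict

# language code -> view counts per article title
counts = defaultdict(Counter)

# hypothetical layout: one gzipped pagecounts file per hour
for path in glob.glob('pagecounts/pagecounts-*.gz'):
    with gzip.open(path, 'rb') as f:
        for line in f:
            fields = line.decode('utf-8', errors='replace').split(' ')
            if len(fields) != 4:
                continue
            project, title, views, _size = fields
            # plain project codes like 'en' or 'fr' are the main wikipedias;
            # codes like 'en.b' or 'en.d' belong to other Wikimedia projects
            if '.' in project:
                continue
            counts[project][title] += int(views)

# dump the top 50,000 articles for each language
for lang, counter in counts.items():
    with open('top_articles_{0}.txt'.format(lang), 'w', encoding='utf-8') as out:
        for title, views in counter.most_common(50000):
            out.write('{0}\t{1}\n'.format(title, views))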

Export bot

After this, I wrote an insanely simple bot to execute queries against the Wikipedia Special:Export page. Originally, I was considering using scrapy for this since I’ve been looking for an excuse to use it, but a quick read-through of the tutorial left me feeling like scrapy was overkill for my problem, so I decided a simple bot would be more appropriate. While inspecting the fields of the web form for the Special:Export page using Chrome Developer Tools, I stumbled upon a pretty cool trick. Under the “Network” tab, if you ctrl-click on a request, you can use “Copy as cURL” to get a curl command that will reproduce the exact request made by the Chrome browser (headers, User-Agent and all). This makes it easy to write a simple bot that just interacts with a single web form. The bot code looks a little something like this:

from subprocess import call
from urllib.parse import urlencode  # on Python 2: from urllib import urlencode

# "Copy as cURL" template captured from Chrome Developer Tools; {0} is the
# url-encoded form data, {1} the output file, and {2} the language subdomain
curl = """curl 'https://{2}.wikipedia.org/w/index.php?title=Special:Export&action=submit' -H 'Origin: https://{2}.wikipedia.org' -H 'Accept-Encoding: gzip,deflate,sdch' -H 'User-Agent: Mozilla/5.0 Chrome/35.0' -H 'Content-Type: application/x-www-form-urlencoded' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' -H 'Cache-Control: max-age=0' -H 'Referer: https://{2}.wikipedia.org/wiki/Special:Export' -H 'Connection: keep-alive' -H 'DNT: 1' --compressed --data '{0}' > {1}"""

lang = 'en'                  # example values for illustration; the real bot
filename = 'export_en.xml'   # fills these in per request

data = {
    'catname': '',
    'pages': 'Main_Page\nClimatic_Research_Unit_email_controversy\nJava\nundefined',
    'curonly': '1',
    'wpDownload': '1',
}

enc_data = urlencode(data)
call(curl.format(enc_data, filename, lang), shell=True)

The final version of my bot splits the list of articles into small chunks since the Special:Export page throws 503 errors when the requests are too large.
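
The chunking itself is nothing fancy; a sketch along these lines builds on the snippet above, where article_titles is a hypothetical list of page titles and the batch size of 50 is arbitrary:

def chunks(titles, size=50):
    """Yield successive fixed-size batches from a list of article titles."""
    for i in range(0, len(titles), size):
        yield titles[i:i + size]

for i, batch in enumerate(chunks(article_titles)):
    data['pages'] = '\n'.join(batch)
    filename = 'export_{0}_{1:04d}.xml'.format(lang, i)
    enc_data = urlencode(data)
    # one Special:Export request per batch, exactly as in the snippet above
    call(curl.format(enc_data, filename, lang), shell=True)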

Convert to plain text

The Special:Export page on Wikipedia returns an XML file that contains the page contents and other pieces of information. The page contents include wiki markup, which for my purposes is not useful. I needed to scrub the Wikipedia markup to convert the pages to plain text. Fortunately, I found a tool that already does this. The one downside of this tool is that it produces output in a format that looks strikingly similar to XML but is not actually valid XML. To address this, I wrote a simple parser using a regex that looks something like this:

import bz2
import re

# Each article produced by the markup-to-text tool looks like:
#   <doc id="..." url="..." title="...">
#   ...plain text...
#   </doc>
article = re.compile(r'<doc id="(?P<id>\d+)" url="(?P<url>[^"]+)" title="(?P<title>[^"]+)">\n(?P<content>.+)\n<\/doc>', re.S|re.U)

def parse(filename):
  data = ""
  with bz2.BZ2File(filename, 'r') as f:
    for line in f:
      line = line.decode('utf-8')
      data += line
      # accumulate lines until the closing </doc>, then emit one article
      if line.count('</doc>'):
        m = article.search(data)
        if m:
          yield m.groupdict()
        data = ""

This function will read every page in filename and return a dictionary with the id (an integer tracking the version of the page), the url (a permanent link to this version of the page), the title, and the plain text content of the page. Going through the file one article at a time and using the yield keyword makes this function a generator, which keeps its memory footprint small no matter how large the dump is.
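
Putting it together, iterating over a dump (with a made-up file name) looks like this:

for page in parse('enwiki_plaintext.xml.bz2'):
    # page['content'] holds the plain text, ready for tokenization
    print(page['title'])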

What’s next?

In my next post I will cover the explore step using some of Python’s state-of-the-art tools for data manipulation and visualization. You’ll also get your first taste of scikit-learn, my machine learning library of choice. If you have any questions or comments, please post them below!


PyCon 2014: The Long Journey North

PyCon 2014 Logo

The Journey Begins

DataPhilly

A little over a year ago I was frustrated with the lack of data meetups in the Philadelphia area, so I started DataPhilly. I quickly learned that when you start a tech meetup you’re going to have to do some public speaking to get the ball rolling. I was DataPhilly’s first speaker because no one else volunteered to present at the first meeting. Even though I’d had the opportunity to practice public speaking several times prior to this as part of our weekly Tutorial Tuesday series at AWeber, my first talk for DataPhilly felt really rough. Despite this, DataPhilly quickly gained steam, and we’ve had a ton of excellent speakers and fantastic talks. Since then I’ve had the opportunity to speak at both DataPhilly and PhillyPUG.

PyData Logo

Last July I was fortunate enough to speak at PyData in Boston, where I learned an important lesson: do not commit to delivering more than one talk at a conference. Giving two talks back to back was a bit much, but like my first talk at DataPhilly, I made it through and learned a lot in the process.

PyCon

After my experience at PyData, I decided to submit a talk to PyCon. I became anxious waiting to see if my talk would be accepted, so I decided to hack the process a bit. I wanted to gain insight into the PyCon planning process, and so I joined the PyCon Program Committee. One really cool thing about the PyCon planning process is that anyone can get involved! With the number of submissions growing each year, I encourage anyone interested to volunteer. It doesn’t take a lot of time (really you can devote as little or as much time as you can spare), and it is a great way to give back to the community. On top of this, you’ll help decide which talks get accepted at the next PyCon!

Only one in seven submitted talks was selected by the program committee for PyCon this year. A lot of great talks were rejected, and I feel very lucky that my submission made the cut. Being picked to speak at such a selective conference is truly an honor!

The Speaking Experience

Speaking at PyCon was made really simple by the awesome staff of volunteers and A/V crew. A green room is provided for speakers so they can hide from their fans and prepare for their talks. One cool side-effect of this is that, as a speaker, you get to hang out with all these famous people from the Python scene. I got to personally thank Fernando Pérez (the creator of IPython, not the baseball player) for his hard work on IPython. Fernando was very humble about his work and gave credit to the rest of the IPython community rather than taking credit for himself. This was a common theme at PyCon: all of the “big name” people I talked to seemed equally humble. It seems that the Python community is not a place for egos, and that’s really refreshing. I suppose part of this is the nature of open source projects. The projects started by people who seek inclusiveness are the most successful, and an important part of inclusiveness is sharing credit with others. So it only makes sense that many successful open source projects have a BDFL who is humble and honest about their own part in the effort.

Keynotes

All of the keynotes were excellent, but I especially enjoyed those by Guido van Rossum, Jessica McKellar, and Fernando Pérez.

Diversity in Tech

Jessica McKellar

Jessica gave an excellent talk about teaching computer science to the next generation and how to fix the diversity gap in the tech community. She talked about how so few high school students are taking the AP computer science exam. “There are entire states in the United States where no African-American students take the exam at all. There are states where no Hispanic students have taken the exam. And, despite being 50 percent of the population, there are even states where no girls took the exam”. Despite the fact that the President of the United States of America believes that it is important for children to learn to code, in many school districts, there isn’t even a computer science class offered to high school students. Jessica’s talk continued with a series of suggestions on how to help change the status quo. She also pointed out that there isn’t much incentive for Computer Scientists to become teachers. They will make far more money in the software industry than teaching. This is a serious problem. Why can’t some of our top tech companies help fix this problem? They certainly can afford to!

Guido’s Q&A

Guido Van Rossum

Guido’s talk was very entertaining. His entire talk was an extended Q&A. He started off by live coding a random question chooser to pick questions from twitter. He then took questions from the audience, but only women, “because throughout the conference, I’ve been attacked by people with questions, and they were almost all men.” Overall, I found this keynote much more entertaining than his keynote last year, so I would definitely recommend watching it.

Fernando’s Keynote

Fernando Pérez

In his keynote, Fernando told the story of how he started the IPython project. “It began as me trying to procrastinate a little bit on finishing my dissertation,” he said. (See “Set your code free” for some useful information on running your own open source projects.) In addition to the IPython project, Fernando covered a lot of the progress made in the SciPy community as a whole over the last year.

Recommended Talks

The full list of talks can be found on pyvideo.org, but I’d like to highlight a few of my favorites.

Technical Onboarding

Kate Heddleston & Nicole Zuckerman on technical onboarding

One great talk I attended was by Kate Heddleston & Nicole Zuckerman on technical onboarding. This talk was chock-full of practical advice. I’ll definitely be re-watching this video and taking detailed notes.

Moar Data!

Maybe I’m biased, but one of the most exciting things about PyCon was all of the data related talks! I still have a ton of them left to watch, but of the ones I’ve seen, my favorites are:

  • Diving into Open Data with IPython Notebook & Pandas
  • Enough Machine Learning to Make Hacker News Readable Again
  • How to Get Started with Machine Learning
  • Realtime predictive analytics using scikit-learn & RabbitMQ

Yes, I really like my own talk ;-).

I also highly recommend watching the lightning talks. They were all very high quality and packed with lots of great insights. I hope they expand the lightning talks next year; they’re an excellent use of time!

The Hallway Track

On the second day of the conference, I met in one of the open spaces with a bunch of other data people (Thanks Julia Evans for arranging this!). The guys at Plotly demoed their product, and Cameron Davidson-Pilon showed off his new project lifelines, a library for survival analysis, which I’ll definitely be having a closer look at in the future.

Sprinting

scikit-learn Logo

Finally, my PyCon experience ended with two days of sprinting on the scikit-learn project. I started by updating the Travis-CI build from Python 3.3 to Python 3.4, a simple task but a perfect one for my first commit. Then Olivier Grisel helped me fix an issue I had found while working on my talk. Contributing to scikit-learn was made really easy by both scikit-learn’s solid test coverage/CI system and by Olivier’s help. Overall it seems like an extremely well-run project, and I recommend getting involved if you have the opportunity. If you’re interested, check out the issue tracker and look for “Easy” issues. Another good place to look is pull requests that are ready to merge; reviewing open pull requests and testing them out in your environment is always helpful.

See You Next Year!

Montreal is an awesome town and a great place for a conference. I ate lots of poutine and met a ton of wonderful people! To everyone I met, I hope to see you next year in Montreal!


Pycon 2014: A Preview

Pycon 2014 Logo

After my successful talks at PyData Boston in July, I decided to submit one of my talks to PyCon. I’m happy to say my talk was accepted! This will be my first PyCon and I’m really excited! Montréal is an amazing city with some awesome cuisine (une poutine végétarienne, s’il vous plaît) and the best micro-brew pubs I’ve ever been to (sorry Philly)! Besides the awesome location, I’m really psyched about some of the talks this year. Here are some of my favorites.

Discovering Python (David Beazley)

Looking for an entertaining way to learn about various built-in Python libraries? Look no further! Imagine if you had access to Python but couldn’t install any 3rd-party modules! Maybe this is you? If so, then you won’t want to miss this talk!

So, what happens when you lock a Python programmer in a secret vault containing 1.5 TBytes of C++ source code and no internet connection? Find out as I describe how I used Python as a secret weapon of “discovery” in an epic legal battle.

Enough Machine Learning to Make Hacker News Readable Again (Ned Jackson Lovely)

Machine Learning talk, `nuff said!

It’s inevitable that online communities will change, and that we’ll remember the community with a fondness that likely doesn’t accurately reflect the former reality. We’ll explore how we can take a set of articles from an online community and winnow out the stuff we feel is unworthy. We’ll explore some of the machine learning tools that are just a “pip install” away, such as scikit-learn and nltk.

Diving into Open Data with IPython Notebook & Pandas (Julia Evans)

I’m glad to see a Pandas talk at PyCon, even if it’s not being given by Wes McKinney ;-).

I’ll walk you through Python’s best tools for getting a grip on data: IPython Notebook and pandas. I’ll show you how to read in data, clean it up, graph it, and draw some conclusions, using some open data about the number of cyclists on Montréal’s bike paths as an example.

The Sorry State of SSL (Hynek Schlawack)

These days it’s important not just to use encryption, but to configure it properly. Having spent many hours reading about ciphers and theoretical attacks in the past, I expect this to be a good talk even for coders who already know something about security.

Those web pages with shiny lock icons boasting that your data is safe because of “256 bit encryption”? They are lying. In times of mass surveillance and commercialized Internet crime you should know why that’s the case. This talk will give you an overview that will help you to assess your personal security more realistically and to make your applications as secure as possible against all odds.

Ansible - Python-Powered Radically Simple IT Automation (Michael DeHaan)

Having experienced both Puppet and Chef, I’m looking forward to learning more about this Python alternative.

Learn about Ansible – a radically simple way to deploy applications, configure operating systems, and orchestrate IT operations including zero downtime rolling updates. Let’s bring about SkyNet faster.

Modern Web Services, Lessons Learned and Why REST is not the Best (Armin Ronacher)

Talk about writing APIs from the creator of Flask? Yes please!

A few years of experiences writing RESTful APIs, especially my experiences working for Fireteam’s online services. What worked, what did not work so well and about how to avoid making mistakes with REST.

Games for Science: Creating interactive psychology experiments in Python with Panda3D (Jessica Hamrick)

Applying games to science?! How can you lose?

Have you ever wanted to play video games while also contributing to science? In psychology experiments developed by myself and Peter Battaglia, participants are immersed in an interactive 3D world which is experimentally well-controlled, yet also extremely fun. This talk will explain how we created these “game-like” experiments in Python using the Panda3D video game engine.

Cache me if you can: memcached, caching patterns and best practices (Guillaume Ardaud)

Caching is an important tool in any developer’s toolkit. When should you use memcached instead of a SQL database? This talk should help answer that question and more!

Memcached is a popular, blazing fast in-RAM key/object store mainly used in web applications (although it can be used in virtually any software). You will walk out of this talk with a solid understanding of memcached and what it does under the hood, and become familiar with several patterns and best practices for making the most of it in your own Python applications.

Postgres Performance for Humans (Craig Kerstiens)

Probably the most important tool in a developer’s toolkit is the database. Knowing how to manage your databases is an important skill (often learned when something starts to go wrong).

To many developers the database is a black box. You expect to be able to put data into your database, have it to stay there, and get it out when you query it… hopefully in a performant manner. When its not performant enough the two options are usually add some indexes or throw some hardware at it. We’ll walk through a bit of a clearer guide of how you can understand and manage DB performance.

Getting Started Testing (Ned Batchelder)

This talk looks like the be-all and end-all of Python testing talks. You’ll learn pretty much everything there is to know about testing Python code.

If you’ve never written tests before, you probably know you should, but view the whole process as a bureaucratic paperwork nightmare to check off on your ready-to-ship checklist. This is the wrong way to approach testing. Tests are a solution to a problem that is important to you: does my code work? I’ll show how Python tests are written, and why.

Technical on-boarding, training, and mentoring (Kate Heddleston)

After onboarding a few new devs myself, I know that I could use some guidance in this area. This talk looks like it should have some great takeaways.

This is a talk about how to make junior and new engineers into independent and productive members of your engineering team faster and cheaper. We will focus on python specific resources and libraries that will help you create a simple but effective on boarding program, and talk about case studies of companies that have had success using these techniques.


Elephant Enlightenment: Part 1

For some light vacation reading, I picked up Hadoop Beginner’s Guide. I made it through about half of the book, and I wanted to share some random facts that I found particularly enlightening.

Data Serialization: Compression and Splitting

Splitting refers to Hadoop’s ability to break input files into chunks that are fed to the map phase of a MapReduce job. Splitting is important for two reasons:

  1. It allows the map phase to be parallelized. The more splits you can make of the data, the more map processes can run simultaneously.
  2. It allows for data locality. Hadoop distributes data storage across the cluster; if the data a map task processes is stored on the same node the task runs on, the map phase will be more efficient.

When choosing a “container format” (a.k.a. serialization format) for your data, you need to pick one that is splittable, compressible, and fast. A few container file formats fit the bill, including SequenceFile, RCFile, and Avro; all of them support both splitting and compression. Of these, Avro seems the most promising, as it has good cross-language support. The main issue with using these formats is that you probably need a pre-processing phase where you convert your data into the chosen format.
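
As a rough illustration of that pre-processing phase, here is a sketch that writes records into a compressed Avro file with the fastavro package (fastavro is simply my choice of library here, and the schema, records, and file name are made up):

from fastavro import parse_schema, writer

schema = parse_schema({
    'name': 'PageView',
    'type': 'record',
    'fields': [
        {'name': 'title', 'type': 'string'},
        {'name': 'views', 'type': 'long'},
    ],
})

records = [
    {'title': 'Main_Page', 'views': 1234},
    {'title': 'Hadoop', 'views': 56},
]

# 'deflate' compression is applied per block, so the file stays splittable
with open('pageviews.avro', 'wb') as out:
    writer(out, schema, records, codec='deflate')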

If you don’t want to use one of the container formats, but you still want your data to be splittable and compressed, you have two options:

  1. Use bzip2 compression; this is the only compression format that supports splitting out of the box.
  2. Manually split your data into chunks and compress each chunk (a minimal sketch of this follows the list). You can find a great cheat sheet for compression & splitting in Table 4-1 of Hadoop: The Definitive Guide.
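
A minimal sketch of option 2 in plain Python; the 64 MB chunk size is arbitrary, and in practice you would match it to your HDFS block size:

import bz2

def split_and_compress(path, chunk_bytes=64 * 1024 * 1024):
    """Split a large text file into roughly chunk_bytes-sized pieces and
    bz2-compress each piece, breaking only on line boundaries."""
    part, written = 0, 0
    out = bz2.BZ2File('{0}.part{1:04d}.bz2'.format(path, part), 'w')
    with open(path, 'rb') as f:
        for line in f:
            out.write(line)
            written += len(line)
            if written >= chunk_bytes:
                out.close()
                part, written = part + 1, 0
                out = bz2.BZ2File('{0}.part{1:04d}.bz2'.format(path, part), 'w')
    out.close()

split_and_compress('clickstream.log')  # hypothetical input file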

Data Loss

Data in Hadoop is replicated, but there are still many ways you can lose it, so replication is not an alternative to backups. Here are just a few ways you can lose data in Hadoop.

Parallel node failure

“As the cluster size increases, so does the failure rate and having three node failures in a narrow window of time becomes less and less unlikely. Conversely, the impact also decreases but rapid multiple failures will always carry a risk of data loss.” [1]

Cascading failures

A failure on one node will cause under-replicated data to be replicated to other nodes, which could result in additional failures, cascading to other machines, and so on. While this scenario is unlikely, it can occur.

Human Error

Data is not backed up or check-pointed in Hadoop. If someone accidentally deletes data, it’s gone.

High Availability

With Hadoop 1.0, there is a single point of failure: the NameNode. The NameNode contains the fsimage file, which tracks where all the data lives in the Hadoop cluster. If you lose your NameNode, you won’t be able to use your cluster, and if you don’t properly back up the fsimage, you will experience data loss. “Having to move NameNode due to a hardware failure is probably the worst crisis you can have with a Hadoop cluster.” [2] This issue has been addressed in Hadoop 2.0, where NameNode High Availability has been implemented.

Bloom Filters

Often in a MapReduce job you want to logically combine two data sources, or “join” them. There are a couple of methods for doing this; one is to join the data during the map portion of the job, which is more efficient than doing it in the reduce portion. To accomplish a map-side join, you must be able to store one of your data sources in memory on every node in the cluster. But what if you can’t fit all of the data into memory? “In cases where we can accept some false positives while still guaranteeing no false negatives, a Bloom filter provides an extremely compact way of representing such information.” [3] “The use of Bloom filters is in fact a standard technique for joining in distributed databases, and it’s used in commercial products such as Oracle 11g.” [4] More information on Bloom filters & Hadoop can be found in the book Hadoop in Action in section 5.2. An example application of this can also be found here.
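
To make the idea concrete, here is a toy Bloom filter in pure Python. This is a sketch only; a real job would use Hadoop's built-in Bloom filter classes or a library, and the sizing below is arbitrary:

import hashlib

class BloomFilter(object):
    """Toy Bloom filter: membership tests can give false positives,
    but never false negatives."""

    def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # derive num_hashes bit positions from salted md5 digests
        for i in range(self.num_hashes):
            digest = hashlib.md5('{0}:{1}'.format(i, key).encode('utf-8')).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

# Build the filter from the small side of the join, then use it in the
# mapper to drop most records from the large side that can't possibly match.
small_side_keys = BloomFilter()
small_side_keys.add('user_42')
print('user_42' in small_side_keys)   # always True once added
print('user_99' in small_side_keys)   # usually False; false positives are possible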

That’s all for now. Check back in the future for further Elephant Enlightenments!


Working with email content

When it comes to tokenization, email content presents some unique challenges. Some messages have a plain text version, some have an HTML version, and some have both. Before you can do cool things with this data, like natural language processing or predictive analysis, you have to convert it into a uniform format (a step sometimes referred to as scrubbing) prior to tokenization. In my case, I wanted all of my data to be plain text.

If you have a plain text version of the email, it is probably safe to use it without scrubbing. However, if you encounter an e-mail without a plain text version, you’ll need to convert the HTML version to text. Searching the web, you’re likely to find a myriad of solutions for converting HTML to text. Python is my language of choice, and a few suggestions I found used lxml, BeautifulSoup, and nltk to convert HTML to text.

lxml and soupparser, an exercise in futility

lxml has a Cleaner class which “cleans the document of each of the possible offending elements.” The biggest problem with using lxml is that it doesn’t handle malformed HTML gracefully. To handle these edge cases, you can use the lxml soupparser to parse malformed HTML. While in most cases this works without error, it doesn’t produce the desired results for all input. For example, in the following case soupparser produces an empty HTML document even though there is clearly text in the data:

from lxml.etree import tostring
from lxml.html.soupparser import fromstring

data = '</form all my text is at the end of this malformed html'
root = fromstring(data)
print tostring(root)
# prints '<html/>' -- the text has been silently dropped

This is because the default HTML parser used by BeautifulSoup is Python’s built-in HTMLParser. As the BeautifulSoup4 docs point out, in older versions of Python (before 2.7.3 or 3.2.2) “Python’s built-in HTML parser is just not very good”. There are workarounds for this: if you’re using BeautifulSoup4, you can specify a different parser, which can produce better results. Still, this all seems like a lot of work just to convert HTML to text. Isn’t there a better way?
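
For example, with BeautifulSoup4 the parser is an explicit argument (the lxml and html5lib parsers have to be installed separately, and how much text each one recovers still depends on the input):

from bs4 import BeautifulSoup

data = '</form all my text is at the end of this malformed html'

# try each available parser and compare what text survives
for parser in ('html.parser', 'lxml', 'html5lib'):
    soup = BeautifulSoup(data, parser)
    print('{0}: {1!r}'.format(parser, soup.get_text()))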

nltk.util.clean_html is full of win!

Enter nltk.util.clean_html. clean_html uses regular expressions to strip HTML tags from text. This approach helps avoid the issues found with lxml and BeautifulSoup. Looking at our previous example, we don’t lose our text data using clean_html:


from nltk import clean_html

data = '</form all my text is at the end of this malformed html'
print clean_html(data)
# prints '</form all my text is at the end of this malformed html'

Looking at the implementation, it almost seems too simple. There are six regular expressions, and that’s it! I’ve tested all three solutions against several million real e-mail messages, and in all cases nltk provided the best results.
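
If you ever need something similar without NLTK, a stripper in the same spirit (a sketch only, not the actual NLTK source) takes just a handful of substitutions:

import re

def strip_html(html):
    """Regex-based tag stripper, similar in spirit to nltk.util.clean_html."""
    # remove scripts and styles along with their contents
    text = re.sub(r'(?is)<(script|style).*?>.*?</\1>', '', html.strip())
    # remove HTML comments
    text = re.sub(r'(?s)<!--.*?-->', '', text)
    # replace any remaining complete tags with a space
    text = re.sub(r'(?s)<.*?>', ' ', text)
    # normalize non-breaking spaces and whitespace (the real thing handles more)
    text = text.replace('&nbsp;', ' ')
    return re.sub(r'\s+', ' ', text).strip()

print(strip_html('</form all my text is at the end of this malformed html'))
# keeps the text, since the malformed tag never closes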

Sifting through all the possible solutions for converting HTML to text and testing each of them was pretty time-consuming. If your goal is to scrub HTML for further analysis, nltk’s clean_html is definitely the way to go!