PyData NYC: The Really Short Version

Nov 24, 2014

Here are my notes from PyData with links for more details. This isn’t a complete list, and in some cases my notes don’t really do justice to the actual talks, but I hope that these will be helpful to anyone who’s feeling PyData FOMO until the videos are released.

Disclaimer: I took almost no notes on the second day, so a bunch of my favorite talks are missing.

High Performance Text Processing with Rosetta

Rosetta is a highly optimized NLP library with a focus on memory efficiency.

TextFileStreamer - provides a streaming tokenizer. Good for memory efficiency.
DBStreamer - Ditto but for data in a DB.
Can be easily combined with online learning methods.
An IPython notebook with example code can be found here.
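
To make the "stream, then learn online" idea concrete, here is a minimal sketch using only the Python standard library. It does not use Rosetta's API: the names `stream_labeled_tokens` and `OnlinePerceptron`, the tab-separated input format, and the toy data are all my own assumptions for illustration.

```python
import io
import re

TOKEN_RE = re.compile(r"[a-z0-9']+")


def stream_labeled_tokens(lines):
    """Yield (label, tokens) one line at a time, never holding the whole input.

    Assumes each line looks like "<0 or 1><TAB><text>" (a made-up format).
    """
    for line in lines:
        label, _, text = line.rstrip("\n").partition("\t")
        yield int(label), TOKEN_RE.findall(text.lower())


class OnlinePerceptron:
    """Tiny binary perceptron that is updated one example at a time."""

    def __init__(self):
        self.weights = {}

    def predict(self, tokens):
        score = sum(self.weights.get(t, 0.0) for t in tokens)
        return 1 if score > 0 else 0

    def update(self, tokens, label):
        error = label - self.predict(tokens)  # -1, 0, or +1
        if error:
            for t in tokens:
                self.weights[t] = self.weights.get(t, 0.0) + error


# A file-like object stands in for a large text file on disk.
data = io.StringIO(
    "1\tgreat talk, loved it\n"
    "0\tboring slides, fell asleep\n"
    "1\tloved the demo\n"
    "0\tboring and slow\n"
)

model = OnlinePerceptron()
for label, tokens in stream_labeled_tokens(data):
    model.update(tokens, label)
```

Because the generator hands over one document at a time, memory use stays flat no matter how large the input is, which is the property that makes streaming pair well with online learners.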

Python in the Hadoop/Spark Ecosystem

Bcolz - A columnar data container that can be compressed (supported by Blaze).
Impyla - Python client and Numba-based UDFs for Impala.
Streamparse - lets you run Python code against real-time streams of data. Integrates with Apache Storm.
Kafka-python - Kafka protocol support in Python.
Anaconda cluster - Bringing the Python ecosystem to Hadoop and Spark.
Libcloud - Python library for interacting with many of the popular cloud service providers using a unified API.

Data warehouse and conceptual modelling with Cubes 1.0

Cubes is a lightweight Python framework and OLAP HTTP server for easy development of reporting applications and aggregate browsing of multi-dimensionally modeled data. The slides for this talk are already online.

Works best with aggregating categorical data.
Cubes Visualizer - an application for browsing and visualizing data from a Cubes Slicer server.
Has Google Analytics support built in (a good way to drill into Google Analytics data?).
Cubesviewer - Visual tool for exploring and analyzing OLAP databases.
http://checkgermany.de/ - example application.
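
For anyone unfamiliar with what "aggregate browsing" over categorical dimensions means, here is a rough sketch in plain Python. It is not the Cubes API: the `aggregate` helper, the fact rows, and the dimension names are hypothetical and only illustrate the drill-down idea.

```python
from collections import defaultdict

# Hypothetical fact rows: two categorical dimensions and one numeric measure.
facts = [
    {"year": 2013, "category": "books", "amount": 10},
    {"year": 2013, "category": "music", "amount": 5},
    {"year": 2014, "category": "books", "amount": 7},
    {"year": 2014, "category": "books", "amount": 3},
]


def aggregate(rows, dimensions, measure="amount"):
    """Sum and count a measure, grouped by the given categorical dimensions."""
    totals = defaultdict(lambda: {"sum": 0, "count": 0})
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key]["sum"] += row[measure]
        totals[key]["count"] += 1
    return dict(totals)


# Start coarse, then "drill down" by adding a dimension.
by_year = aggregate(facts, ["year"])
by_year_and_category = aggregate(facts, ["year", "category"])
```

Each extra dimension splits the previous totals into finer cells, which is why this style of browsing suits categorical data.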

How to Make Your Future Data Scientists Love You

Excellent talk on common mistakes many companies make, and how to avoid making data science hard or impossible in the future. My notes on this talk don’t really do it justice, so please see Sasha’s blog for more details.

Is your data set complete?
Is your data correct?
Is your data connectable?
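
Here is one way those three questions could be turned into quick checks. This is my own reading of them, not code from the talk: the two CSV snippets, the column names, and the date format are invented for the example.

```python
import csv
import io
from datetime import datetime

# Made-up sample data: a users table and an events table that refers to it.
users_csv = "user_id,signup_date\n1,2014-01-05\n2,\n3,2014-02-30\n"
events_csv = "event_id,user_id\n10,1\n11,2\n12,9\n"


def read_rows(text):
    return list(csv.DictReader(io.StringIO(text)))


def is_valid_date(value):
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


users = read_rows(users_csv)
events = read_rows(events_csv)

# Complete? Find users with no signup date recorded.
missing = [u["user_id"] for u in users if not u["signup_date"]]

# Correct? Find signup dates that are present but not real calendar dates.
invalid = [
    u["user_id"]
    for u in users
    if u["signup_date"] and not is_valid_date(u["signup_date"])
]

# Connectable? Find events whose user_id matches no known user.
known_users = {u["user_id"] for u in users}
orphans = [e["event_id"] for e in events if e["user_id"] not in known_users]
```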

Command-line utilities for exploring your data:

csvkit
bitly data_hacks - histogram.py

Recalling with precision

Awesome talk about measuring and tracking predictive model performance. The speaker, Julia, open-sourced “top model”, the web app they developed at Stripe, right before her talk.
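
As a refresher on the two metrics the talk title plays on, here is a small sketch of computing precision and recall for a binary classifier. It is not code from the Stripe app: the `precision_recall` helper and the labels are made up for illustration.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up ground truth and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

precision, recall = precision_recall(y_true, y_pred)
```

Precision answers "of everything the model flagged, how much was right?", while recall answers "of everything it should have flagged, how much did it catch?". Recomputing both on fresh data at regular intervals is one simple way to track a model over time.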

Simple Machine Learning with SKLL 1.0

SKLL is a wrapper around scikit-learn that makes prototyping predictive algorithms as easy as creating a CSV and running a Python script. I got a chance to talk extensively with the speaker and it seems like they’ve done a good job of handling most of the typical gotchas of scikit-learn.

That’s all for now. I’ll send out an update once the videos are live.