Bridging the divide between big data and (big) algorithms

Posted by Alice Zheng on Feb 25, 2014 1:33:00 PM

Topics: Big Data Analytics

Earlier this month, GraphLab took a road trip to Strata Santa Clara, a Big Data conference organized by O'Reilly. It was a gathering of more than 3,100 attendees--engineers, business folks, industry evangelists, and data scientists. We had a lot of fun meeting and socializing with our peers and customers.  Amidst all the conference excitement, we presented two talks. Carlos Guestrin, our intrepid CEO, held a tutorial on large-scale machine learning. I gave a talk in the Hardcore Data Science track.  (Check out the highlights from Ben Lorica's blog post.)

Disconnect between data structures and machine learning

Given the diversity of the audience, this was a difficult talk to pin down. After banging my head against the wall for some time, I decided to go with what interests me. As a machine learning researcher and an industry observer, I've always puzzled over these questions: What exactly is Big Data? What kind of tools do we really need to build?  How?  Big Data discussions often span a bewildering spectrum of topics. At one end of the spectrum, people talk about Big Data, data processing, data cleaning, and simple analytics. At the other end, people talk about complex machine learning models.  There is a disconnect. There is something in between that is seldom talked about, and yet is crucial for efficient analysis: data structures.

Data structures are the glue between data and algorithms. Raw data must be turned into data structures--whether in memory or on disk--before they can be operated on.  Algorithms depend on the underlying data structures to support their computation needs. An efficient implementation of the right data structure can be the key to efficient analysis.  GraphLab is known for its distributed graphs. But graphs are not the whole story. Many algorithms are indeed naturally situated on top of graphs: PageRank, label propagation, and Gibbs sampling are but a few examples. But many other algorithms, such as stochastic gradient descent and decision tree learning, are more amenable to flat tables.  Furthermore, raw data often comes in the form of logs, which can be easily translated into flat tables. With GraphLab's upcoming offering of SFrames, we are now handling large-scale flat tables as well as graphs.
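To make the graph-versus-table distinction concrete, here is a minimal sketch in plain Python. It is not GraphLab's API and does not use SFrames; the log lines, variable names, and the `pagerank` helper are all hypothetical and exist only for illustration. The same raw log is first parsed into a flat table (a list of rows), which suits a single-scan aggregation, and then into an adjacency-list graph, which suits an algorithm like PageRank that repeatedly walks each node's neighbors.

```python
from collections import defaultdict

# Hypothetical raw log: each line is "source_page destination_page".
raw_log = ["a b", "a c", "b c", "c a"]

# Flat table: one row per log record.
table = [tuple(line.split()) for line in raw_log]

# Table-friendly computation: one sequential scan over the rows
# to count outgoing clicks per source page.
out_counts = defaultdict(int)
for src, _dst in table:
    out_counts[src] += 1

# Graph: adjacency list built from the same rows.
nodes = sorted({page for row in table for page in row})
graph = {node: [] for node in nodes}
for src, dst in table:
    graph[src].append(dst)


def pagerank(graph, damping=0.85, iterations=50):
    """Simple power-iteration PageRank over an adjacency list (illustrative only)."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in graph}
        for node, neighbors in graph.items():
            if neighbors:
                # Split this node's rank evenly among the pages it links to.
                share = damping * rank[node] / len(neighbors)
                for neighbor in neighbors:
                    new_rank[neighbor] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[node] / n
                for other in graph:
                    new_rank[other] += share
        rank = new_rank
    return rank


ranks = pagerank(graph)
print(dict(out_counts))
print(ranks)
```

The aggregation only ever needs the rows in sequence, so a flat table is enough. PageRank needs fast access to each node's neighbors on every iteration, which is what the adjacency list provides.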

Alice Zheng presents at Strata Santa Clara

So that was my talk. (Here are the slides.) I talked about data, I talked about algorithms, and I talked about what it takes to go from data to analysis using algorithms. It felt supremely satisfying to unite the two ends of the spectrum. Apparently I wasn't the only one. The talk struck a chord with the audience. Many people came up afterwards, eager to learn more. Which algorithms are better suited to graphs, and which to flat tables? How should one pick between the two? What metrics might one use? It was great to see people becoming interested in the messy details of tool building.

To be honest, data structures was one of my least favorite subjects in college. It seemed so dry and abstract…and complicated! But when we take the perspective of the interplay of raw data and algorithms, the subject comes alive. One person came up to me afterwards and said, “I’m just getting started with data science. Thanks for making a difficult subject accessible!” That comment alone made it all worth the effort. At GraphLab, this is the kind of stuff that we live and breathe every day. For each algorithm and each data set, we weigh the alternatives and implement the most suitable data structures. We do the dirty work so that others don't have to.

GraphLab Create beta will be released in early March. Come play with us and learn more!

