Due to its ability to support a wide variety of data engineering tasks across a growing range of data sources, Apache Spark has become an integral part of the Hadoop ecosystem. In this post, we introduce the new spark-sframe package, which unites the data ingestion and processing capabilities of Apache Spark with the sophisticated machine learning tools of GraphLab Create, enabling simplified development of rich machine learning models on many kinds of data sources.
Often the most challenging part of machine learning is getting the right data into the right form. Apache Spark provides rich Java, Scala, SQL, and Python APIs for bulk data processing, and leverages fault-tolerant distributed execution to accelerate IO- and CPU-intensive operations. However, once the data has been cleaned and transformed, training models is often most efficiently done with specialized ML tools that exploit the structure of ML algorithms.
Over the past several years we have been developing SFrame, a column-based data frame specifically optimized for ML algorithms. A few weeks ago we announced the open-source release of SFrame, and today we are excited to announce the open-source release of the spark-sframe package. The spark-sframe package unifies the bulk data processing capabilities of Apache Spark with the optimized open-source SFrame data structure by providing a simple and efficient API for moving between SFrame and RDD representations of the data.
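To make the column-based idea concrete, here is a toy illustration in plain Python (all names here are hypothetical; this is not SFrame's actual implementation). Storing each column contiguously means whole-column operations, the access pattern most ML algorithms rely on, never need to materialize row objects, while a `to_rows` method mirrors the kind of row-oriented view an RDD conversion would produce.

```python
# A minimal sketch of a column-oriented frame. This only illustrates the
# columnar layout described above; SFrame itself is a disk-backed, scalable
# C++ data structure with a far richer API.
class ToyColumnarFrame:
    def __init__(self, columns):
        # columns: dict mapping column name -> list of values.
        lengths = {len(values) for values in columns.values()}
        assert len(lengths) <= 1, "all columns must have the same length"
        self.columns = columns

    def apply_to_column(self, name, fn):
        # Transform a single column without touching any other column --
        # the cheap, cache-friendly case in a columnar layout.
        return [fn(x) for x in self.columns[name]]

    def mean(self, name):
        # Whole-column aggregation scans one contiguous list of values.
        col = self.columns[name]
        return sum(col) / len(col)

    def to_rows(self):
        # Pivot back to row-oriented records -- the shape an RDD of
        # dictionaries would have on the Spark side.
        names = list(self.columns)
        return [dict(zip(names, vals)) for vals in zip(*self.columns.values())]

frame = ToyColumnarFrame({"x": [1.0, 2.0, 3.0], "y": [10, 20, 30]})
print(frame.mean("x"))     # 2.0
print(frame.to_rows()[0])  # {'x': 1.0, 'y': 10}
```

The round trip between the columnar frame and row records sketched by `to_rows` is, conceptually, what the spark-sframe conversion API does at scale between SFrames and RDDs.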