Today's data analytics tools are slow to answer even simple queries, as they typically must sift through huge amounts of data stored on disk, and they are even less suitable for complex computations, such as machine learning algorithms. To address these challenges, for the past four years we have been developing the Berkeley Data Analytics Stack (BDAS), an open source data analytics stack. At the core of BDAS is Spark, an in-memory parallel execution engine, which enables us to provide unified support for batch, streaming, and interactive computations, as well as to support sophisticated graph-based and machine learning algorithms. Today, Spark and other BDAS components are used in production by tens of companies and institutions. In this talk, I'll present the architecture and the main design decisions we made in Spark, as well as our future plans.
Ion Stoica is a Professor in the EECS Department at the University of California, Berkeley. He received his PhD from Carnegie Mellon University in 2000. He does research on cloud computing and networked computer systems. Past work includes Dynamic Packet State (DPS), the Chord DHT, the Internet Indirection Infrastructure (i3), declarative networks, replay debugging, and multi-layer tracing in distributed systems. His current research focuses on resource management and scheduling for data centers, cluster computing frameworks, and network architectures. He is an ACM Fellow and has received numerous awards, including the SIGCOMM Test of Time Award (2011) and the ACM Doctoral Dissertation Award (2001). In 2006, he co-founded Conviva, a startup to commercialize technologies for large-scale video distribution, and in 2013, he co-founded Databricks, a startup to commercialize technologies for Big Data processing.