The ability to store, aggregate, and analyze large amounts of data has transformed nearly every industry. Whether in finance, medicine, entertainment, government, or technology, the dream is the same: use enormous amounts of data to understand problems, predict outcomes, and take effective action. While many advances make the dream of "big data" possible, one of the most important components of the technical stack is the engine that provides distributed computing.
In many organizations, Apache Spark is the computational engine that powers big data. A general-purpose unified analytics engine built to transform, aggregate, and analyze large amounts of information, Spark has become the de facto brain behind large-scale data processing, machine learning, and graph analysis. It has seen rapid adoption by companies such as Netflix, Google, and eBay, which use it to analyze data at massive scale, processing petabytes on clusters of thousands of nodes.
In this course, we will explore how Apache Spark can be used for data processing. We will cover the fundamentals of Spark including the architecture and internals, the core APIs and data structures, and how Spark can be used for machine learning and analyzing streaming data sets. Throughout the course, you will:
- Understand when and where to use Spark.
- Develop data-driven questions that can provide scientific or business value.
- Learn how to use Apache Spark to load, summarize, query, and visualize structured and semi-structured data.
- Apply common machine learning techniques to solve supervised and unsupervised problems inside of Spark.
- Learn how to analyze streaming data using Spark Streaming.
- Gain hands-on experience with techniques for deploying Spark as part of a larger software system.