Big Data Infrastructure

Course Overview

In recent years, the amount of data produced, stored, and tracked has increased exponentially. What tools and techniques are used to work with that information, and how can it be used to provide insight and inform decision making? This course introduces the fundamental infrastructure for working with Big Data: containers, clouds, storage, compute, and search.

4 days (with an optional day to dive into a case study)
    • Explain “Big Data” and the challenges that can arise when trying to gain insights from it: volume, velocity, veracity, variability, and complexity.
    • Introduce the open source systems commonly used to work with data at scale: Cassandra, ElasticSearch, Hadoop/HDFS, Spark, and Kafka.
    • Provide overviews of each member of the big data ecosystem and examples of healthcare problems they are intended to address.
    • Deep dive into Kafka and see how it can be used to build pipelines for managing streams of data.
    • Deep dive into Spark to show how it can be used to analyze different types of data stored in relational databases, Cassandra, ElasticSearch, and Hadoop/HDFS.
  • Getting Started

    • Introduce the course and the scope of the material to be covered.
    • Demonstrate tools for keeping software environments organized.
    • Use Docker as a way to package applications.
    • Introduce Jupyter and Jupyter notebooks.
    • Introduce Kubernetes as a way of managing complex workflows and environments.

    Lab: Tools, Toolboxes, Techniques, and Workbenches

    • Introduce Docker and show how it can be used to package tools into portable containers.
    • Show how containers can be used to solve development challenges in addition to their role as an operations technology.
    • Create a custom Docker Image with Python 3 and Jupyter; show common Docker runtime options that will be used throughout the course.

    Lab: Kubernetes

    • Demonstrate how to deploy applications to Kubernetes.
    • Show how services can be used to expose applications (pods) and manage their interactions.

    Data: Big and Little

    • The Challenges of Big Data
      • Understand some of the challenges posed by data:
        • Volume: Modern business operations generate a very large volume of data, which creates problems in its storage and management.
        • Velocity: Not only is the data generated in large quantities, but it is also generated quickly. Moreover, it often needs to be acted upon in a timely fashion.
        • Veracity: Much of the data in modern systems is redundant, incomplete, or misleading, which can make it difficult to know what to trust.
        • Variability: There is a large degree of variation between different sources of data. Some sources may be comprehensive but have a high degree of duplication or redundancy. Others are more concise but lack important context.
        • Complexity: Given its volume and variability, working with healthcare data is complex. Analyzing it to gain insights often involves complex algorithms, logic, and specialized tooling or techniques.
    • What is “Big Data”?
      • Learning objectives:
        • What do we mean by “Big Data”?
        • What are “Big Data” systems and how do they differ from traditional data storage technologies?
    • Technology to the Rescue
      • Learning objectives:
        • Discuss technology trends which are making it easier to work with large sets of data.
        • Understand Virtualization, Cloud, and Containers.
        • Everything as a Service (EaaS): Increasingly sophisticated resources are available for managing complex computational tools. It is possible to manage infrastructure and platform in much the same way that we manage software resources.
        • Distributed Storage and Computation: Increasingly sophisticated data systems make it possible to capture, store, and perform computation on data in larger volumes.
        • Powerful automation technologies which allow for data to be processed in batches (bulk) or as individual records (streaming).
    • The Properties of a Big Data Solution
      • Learning objectives:
        • Understand the properties of a big data solution.
        • Availability of storage mechanisms at scale: Distributed filesystems and databases.
        • Distributed computational engines: Process and query the data at scale.
        • Using tools to manage the resources and services used to implement the systems.
        • What role does clustering technology play in big data? How are the systems similar at the architectural level?
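The distinction drawn above between processing data in batches (bulk) and as individual records (streaming) can be illustrated with a minimal Python sketch. The sample readings and function names are hypothetical and are not part of the course material. Both approaches arrive at the same mean, but the streaming version keeps only a running count and total, so it can report a result after every record.

```python
from typing import Iterable, Iterator, List


def batch_mean(records: List[float]) -> float:
    """Batch: the whole dataset is available before computing."""
    return sum(records) / len(records)


def streaming_mean(records: Iterable[float]) -> Iterator[float]:
    """Streaming: update the result as each record arrives."""
    count = 0
    total = 0.0
    for value in records:
        count += 1
        total += value
        yield total / count  # mean of all records seen so far


# Hypothetical sample values (for example, heart-rate readings).
readings = [72.0, 75.0, 71.0, 90.0]

final_batch = batch_mean(readings)
running = list(streaming_mean(readings))
```

The last value yielded by `streaming_mean` matches `batch_mean`; the difference is when results become available and how much data must be held in memory.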

    Big Data Platforms

    • Overview of the Big Data Landscape
      • What are the main players in the Big Data World and how are they used?
      • Kafka: Reimagining data processing as streams.
      • Hadoop and HDFS: Distributing computation and analytic processes.
      • Spark: A general engine for large-scale data processing.
      • ElasticSearch: Storage versus Search.
      • NoSQL and unstructured data stores (Cassandra).
    • Introduction to the Hadoop Ecosystem
    • Hadoop HDFS: Distributed Storage and Analysis
      • Learning objectives:
        • What are Hadoop and HDFS? Explore the role that they play in the world of big data and discuss what they do well (and not so well).
        • Introduce MapReduce and how it works.
        • Talk about the extended Hadoop ecosystem and its members: Hive, Pig, and HBase.
        • Discuss the HDFS API and how to access data from processing scripts.
      • Lab: Working with Hadoop/HDFS (Hello Big Data)
        • Demonstrate installation and configuration of Hadoop/HDFS within DC/OS.
        • Show how to work with HDFS using the Hadoop command line tools.
        • Use the native HDFS client APIs in Python and Java to create data ingestion programs.
        • Ingest a small set of clinical notes into Hadoop for later analysis.
    • Hadoop MapReduce: A Powerful Data Analysis Paradigm
      • Learning objectives:
        • Explain the core MapReduce algorithm and provide examples of how it works.
        • Explore the Hadoop MapReduce implementation and describe its architecture.
        • Demonstration: MapReduce in Action
      • Lab: Hadoop MapReduce Operations
        • Show how common data operations such as joining data, calculating batch metrics, and common analytics algorithms are implemented.
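As a rough illustration of the core MapReduce idea named above, the following is a minimal single-process sketch in plain Python. It is not Hadoop code, and the function names and sample notes are hypothetical; it only shows the three conceptual phases (map, shuffle, reduce) that a framework such as Hadoop distributes across a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample notes.
notes = [
    "patient reports chest pain",
    "patient denies chest pain",
    "no pain reported",
]
counts = word_count(notes)
```

In a real cluster the map and reduce functions run in parallel on different machines, and the shuffle step moves data over the network; the logic of each phase stays the same.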
    • Introduction to Spark: A General Engine for Large Scale Data Processing
      • Learning objectives:
        • What is Spark? How does it compare to Hadoop/MapReduce?
        • How is it used in practice and how can we build systems that play to its strengths?
      • Lab: Working with Spark
        • Demonstrate installation and configuration of Spark; look at how jobs are launched.
        • Introduce the Spark REPL environment and show how it can be used to interactively explore data.
        • Rite of passage: Count the number of words in a set of clinical notes previously uploaded to HDFS.
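The word-count exercise above might look roughly like the following PySpark sketch. It assumes the `pyspark` package is installed and that the notes are plain text files; the application name, the tokenizing rule, and any path passed to `count_words` are illustrative assumptions rather than the course's actual lab code.

```python
import re
from typing import List, Tuple


def tokenize(line: str) -> List[str]:
    """Lowercase a line and split it into words (letters and apostrophes)."""
    return re.findall(r"[a-z']+", line.lower())


def count_words(notes_path: str, top_n: int = 10) -> List[Tuple[str, int]]:
    """Return the top_n most frequent words in the text files at notes_path.

    pyspark is imported inside the function so that tokenize() can be
    tried on its own without a Spark installation.
    """
    from operator import add

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("NotesWordCount").getOrCreate()
    lines = spark.sparkContext.textFile(notes_path)
    counts = (
        lines.flatMap(tokenize)           # one record per word
        .map(lambda word: (word, 1))      # pair each word with a count of 1
        .reduceByKey(add)                 # sum the counts for each word
    )
    top = counts.takeOrdered(top_n, key=lambda pair: -pair[1])
    spark.stop()
    return top
```

A call might look like `count_words("hdfs:///path/to/notes/*.txt")`, where the path is a placeholder for wherever the notes were uploaded.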
    • Introduction to Kafka
      • Provide a guided tour of how Kafka functions as a queue.
      • Demonstrate practical examples of how data can be put into topics, retrieved for enrichment or caching to a third-party system, and republished as modified data into a separate topic for further action.
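The consume, enrich, and republish pattern described above can be sketched as follows. This is a minimal illustration that assumes the `kafka-python` client library; the topic names, broker address, and the `enrich` rule are hypothetical. The pipeline function accepts any iterable of messages and any object with `send` and `flush` methods, so the logic can be exercised without a running broker.

```python
import json
from typing import Any, Dict, Iterable


def enrich(record: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical enrichment: add a word count for the note text."""
    enriched = dict(record)
    enriched["word_count"] = len(record.get("text", "").split())
    return enriched


def republish(messages: Iterable[Any], producer: Any, out_topic: str) -> int:
    """Read messages, enrich each value, and send it to out_topic."""
    sent = 0
    for message in messages:
        producer.send(out_topic, value=enrich(message.value))
        sent += 1
    producer.flush()  # make sure buffered records are delivered
    return sent


def main() -> None:
    # Requires the kafka-python package and a reachable broker.
    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer(
        "notes-raw",                          # hypothetical input topic
        bootstrap_servers="localhost:9092",   # assumed broker address
        auto_offset_reset="earliest",
        consumer_timeout_ms=10000,            # stop iterating after 10 s idle
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda obj: json.dumps(obj).encode("utf-8"),
    )
    republish(consumer, producer, "notes-enriched")  # hypothetical output topic
```

Keeping `enrich` as a pure function separates the business logic from the Kafka plumbing, which makes it easy to test in isolation.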
      • Lab: Working with Kafka
        • Demonstrate how to install and configure Kafka (Confluent Platform) within DC/OS.
        • Learn to work with Kafka within the context of DC/OS.
        • Create a simple application for logging Kafka topic output.
    • ElasticSearch: Storage versus Search
      • Learning objectives:
        • Understand the role of search within big data and how it differs from other data storage technologies.
        • Learn how data is stored in ElasticSearch, shared across a cluster, and how searches are executed.
        • Review the available APIs and understand which features enable the use of ElasticSearch as an analytics platform: aggregations, word parsing, and tokenization.
        • Introduce Natural Language Processing (NLP) and the core constructs used to work with human language, and describe how ElasticSearch can help facilitate NLP workflows.
    • Demonstration: Working with ElasticSearch
      • Show how the ElasticSearch APIs work via practical examples.
      • Show how data can be prepared and persisted to ElasticSearch using Apache Spark.
      • Lab: Exploring ElasticSearch (ElasticSearch and Kibana)
        • Demonstrate installation and configuration of ElasticSearch.
        • Introduce the ElasticSearch REST interface and show how it can be used from command-line HTTP clients (curl) and client libraries.
        • Show ingestion and indexing of a dataset from HDFS.
        • Attempt to use the built-in ElasticSearch index to create annotations for the clinical notes in the index.
        • Use significant terms aggregations to see if the annotations created previously are meaningful when grouping data into “populations”.
      • Lab: Visualizing Data with Kibana (ElasticSearch and Kibana)
        • Explore the data indexed in ElasticSearch using the visualization tools of Kibana.
        • Create a simple dashboard which provides a summary of the clinical notes ingested into ElasticSearch.
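To make the REST interface and the significant terms aggregation mentioned in the labs above more concrete, here is a minimal sketch that builds (but does not send) a search request using only the Python standard library. The index name, field names, and host are hypothetical, and the sketch assumes the aggregated field is mapped as a keyword field.

```python
import json
import urllib.request
from typing import Any, Dict


def significant_terms_body(
    match_field: str, match_text: str, terms_field: str, size: int = 10
) -> Dict[str, Any]:
    """Build a search body: select a population, then find its unusual terms."""
    return {
        "size": 0,  # return only aggregation results, no document hits
        "query": {"match": {match_field: match_text}},
        "aggs": {
            "significant_annotations": {
                "significant_terms": {"field": terms_field, "size": size}
            }
        },
    }


def build_search_request(
    base_url: str, index: str, body: Dict[str, Any]
) -> urllib.request.Request:
    """Wrap the body in an HTTP POST to the index's _search endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical index and field names; assumes a local, unsecured node.
body = significant_terms_body("text", "chest pain", "annotations")
request = build_search_request("http://localhost:9200", "clinical-notes", body)
# To send it: urllib.request.urlopen(request)
```

The same JSON body could be sent with curl or a client library; only the transport differs.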
    • Integration Project 1: Moving Data Between Separate Systems
      • Lab: Building a Data Lake with Kafka Connect
        • Ingest clinical data from a dataset persisted to MySQL using Kafka Connect and stream the information to ElasticSearch or Hadoop/HDFS.
        • Demonstrate how Kafka Connect implements change data capture (CDC).
        • Show how Kafka Connect translates schema changes from a relational database to the Hadoop/HDFS dataset.
    • Cassandra: Unstructured Data Storage at Scale
      • Purpose of Cassandra. What problems does it solve and when would it be used instead of other data storage technologies? How does it compare to traditional SQL databases, and in what ways is it superior? What are the major tradeoffs and drawbacks?
      • Parts and pieces. How is Cassandra organized architecturally? How do the services work together to provide fast throughput, scalability, and fault tolerance?
      • API and integration. What are the major APIs? How can Cassandra be integrated as a component into larger applications?
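One part of the architectural question above is how Cassandra spreads rows across nodes: each partition key is hashed to a token, and the token's position on a ring determines which nodes store the row. The sketch below illustrates that idea in plain Python under simplifying assumptions. It uses MD5 rather than Cassandra's default Murmur3 partitioner, gives each node a single token (no virtual nodes), and places replicas on the next nodes clockwise. The node names and key are hypothetical.

```python
import bisect
import hashlib
from typing import List, Tuple


def token(key: str) -> int:
    """Hash a key to a position on the ring (MD5 used only for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class TokenRing:
    """A simplified token ring with one token per node."""

    def __init__(self, nodes: List[str], replication_factor: int = 2) -> None:
        self.replication_factor = min(replication_factor, len(nodes))
        self._ring: List[Tuple[int, str]] = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, partition_key: str) -> List[str]:
        """Return the owning node and the following nodes clockwise."""
        # First node whose token is >= the key's token, wrapping around.
        start = bisect.bisect_left(self._tokens, token(partition_key)) % len(self._ring)
        return [
            self._ring[(start + i) % len(self._ring)][1]
            for i in range(self.replication_factor)
        ]


# Hypothetical three-node cluster.
ring = TokenRing(["node-a", "node-b", "node-c"], replication_factor=2)
owners = ring.replicas("patient-42")
```

Because placement depends only on the hash, any node can work out where a row lives without consulting a central coordinator, which is one reason the design scales and tolerates node failures.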

    Analytics and Machine Learning: Getting Value from Data

    • The Machines are Coming: Machine Learning and Artificial Intelligence
      • What are machine learning and artificial intelligence?
      • What are some of the artificial intelligence techniques and how can they be used to solve healthcare problems?
      • Supervised versus unsupervised learning; what are the differences?
    • Querying and Analyzing Data with Spark
    • Creating Machine Learning Models
    • Spark Streaming: Working with data in (near) real-time
  • UNIX Command Line

    Familiarity with Unix operating systems and the Bash command line interface (CLI) is assumed. Students should be comfortable with executing commands from the terminal, capturing and redirecting output, and analyzing program logs and output.

    Version Control

    Students will use the Git version control system extensively and should be comfortable with the add, commit, push, pull, remote, and submodule commands. Resources for learning Git:
    • Try Git by GitHub and Code School. Interactive tutorial provided by GitHub which covers the main git commands. Available online at
    • Learn Git Branching. In-depth tutorial that covers Git essentials in addition to more advanced topics such as reversing changes, cherry-picking commits, and other intermediate techniques. Available online at

    Python Programming Language

    Course examples are written in Python. Students should be familiar with the syntax and basic program structure.

    Structured Query Language (SQL)

    Students should have a basic understanding of SQL syntax and structure. Many of the case studies in the course will either use SQL to retrieve data or leverage tools that do so. Students will need to create simple statements that use the CREATE, SELECT, UPDATE, DELETE, and DROP operations.
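    As a self-check for the SQL prerequisite, the short sketch below runs each of the listed operations against an in-memory SQLite database using Python's standard `sqlite3` module. The table and its contents are hypothetical, SQLite stands in for whatever database the course uses, and an INSERT statement is included only so that there is data to work with.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary database, nothing is saved
cur = conn.cursor()

# CREATE: define a hypothetical table of notes.
cur.execute(
    "CREATE TABLE notes (id INTEGER PRIMARY KEY, patient TEXT, status TEXT)"
)

# INSERT: add a few sample rows.
cur.executemany(
    "INSERT INTO notes (patient, status) VALUES (?, ?)",
    [("A", "new"), ("B", "new"), ("C", "reviewed")],
)

# UPDATE: change the status of one row.
cur.execute("UPDATE notes SET status = 'reviewed' WHERE patient = ?", ("A",))

# DELETE: remove one row.
cur.execute("DELETE FROM notes WHERE patient = ?", ("B",))

# SELECT: read back what is left.
cur.execute("SELECT patient, status FROM notes ORDER BY patient")
rows = cur.fetchall()

# DROP: remove the table entirely, then list any remaining tables.
cur.execute("DROP TABLE notes")
remaining_tables = cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()

conn.close()
```

    Students who can predict the contents of `rows` and `remaining_tables` before running the script are likely comfortable enough with SQL for the case studies.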