Apache Spark is a unified analytics engine designed for large-scale data processing, with built-in modules for batch processing, interactive queries, streaming, and machine learning.
-
Apache Spark: A unified analytics engine that provides fast, in-memory data processing capabilities across diverse workloads, supporting batch, interactive, and streaming analytics.
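As a minimal sketch of how an application connects to this engine (assuming Spark 2.x or later, where SparkSession is the unified entry point; the application name and local master URL below are illustrative placeholders):

```scala
import org.apache.spark.sql.SparkSession

object Bootstrap {
  def main(args: Array[String]): Unit = {
    // SparkSession is the single entry point for SQL, DataFrames,
    // streaming, and (via .sparkContext) the core RDD API.
    val spark = SparkSession.builder()
      .appName("spark-glossary-example") // hypothetical application name
      .master("local[*]")                // run locally on all available cores
      .getOrCreate()

    println(s"Running Spark ${spark.version}")
    spark.stop()
  }
}
```

The sketches under the entries below assume this `spark` session (and `sc = spark.sparkContext`) already exists.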
-
Spark Core API: The foundational API, available in Java, Scala, Python, and R, built around resilient distributed datasets (RDDs) and providing basic I/O functionality, task scheduling, and memory management.
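A brief sketch of the RDD primitives this API exposes, reusing the `spark` session from the example above:

```scala
val sc = spark.sparkContext

// Distribute a local collection as an RDD, then transform and aggregate it.
val numbers = sc.parallelize(1 to 100)
val sumOfSquares = numbers
  .map(n => n.toLong * n) // transformation: lazy, only records lineage
  .reduce(_ + _)          // action: triggers the distributed computation

println(s"Sum of squares 1..100 = $sumOfSquares") // 338350
```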
-
Spark Core Engine: The distributed execution engine that leverages in-memory computing and optimized execution plans (DAGs of stages) to avoid the repeated disk I/O that limits traditional MapReduce frameworks.
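One way to see the in-memory aspect is explicit caching: the first action materializes a dataset in executor memory, and later actions reuse it rather than recomputing from the source. A sketch, where the input path is a hypothetical placeholder:

```scala
val logs = sc.textFile("data/events.log") // hypothetical input file
val errors = logs.filter(_.contains("ERROR")).cache()

// The first action scans the file and caches the filtered partitions ...
val total = errors.count()
// ... subsequent actions read the cached data instead of re-scanning.
val distinctMessages = errors.distinct().count()
```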
-
Spark SQL (+ DataFrames): A module for structured data processing through SQL queries and the DataFrame API, with built-in support for common data sources and formats such as JSON, Parquet, Hive tables, and JDBC.
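A sketch of the two equivalent styles, assuming a hypothetical JSON file of user records with `age` and `country` fields:

```scala
import spark.implicits._ // enables the $"column" syntax

val users = spark.read.json("data/users.json") // hypothetical path

// DataFrame API:
users.filter($"age" >= 18).groupBy($"country").count().show()

// The same query in SQL, after registering a temporary view:
users.createOrReplaceTempView("users")
spark.sql(
  "SELECT country, COUNT(*) AS n FROM users WHERE age >= 18 GROUP BY country"
).show()
```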
-
Spark MLlib: A comprehensive machine learning library that provides scalable implementations of common algorithms including classification, regression, clustering, and collaborative filtering.
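A sketch of training one of those algorithms (logistic regression) on a tiny, made-up dataset; the column names and values are illustrative only:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// Hypothetical training data: a binary label and two numeric features.
val training = spark.createDataFrame(Seq(
  (0.0, 1.1, 0.5),
  (1.0, 0.2, 3.1),
  (0.0, 1.4, 0.3),
  (1.0, 0.1, 2.8)
)).toDF("label", "f1", "f2")

// MLlib estimators expect the features packed into one vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")

val model = new LogisticRegression()
  .setMaxIter(10)
  .fit(assembler.transform(training))

println(s"Coefficients: ${model.coefficients}")
```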
-
Spark Streaming: A scalable and fault-tolerant stream processing engine that enables real-time analytics on live data streams with micro-batch processing capabilities.
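A classic micro-batch sketch using the DStream API this entry describes: a word count over a socket source with 5-second batches. Host and port are placeholders; `nc -lk 9999` can stand in as a test source.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(spark.sparkContext, Seconds(5))

// Each 5-second micro-batch of lines is tokenized and counted.
val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .print()

ssc.start()
ssc.awaitTermination()
```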
-
Spark GraphX: A distributed graph processing framework that provides APIs for graph computation and supports common graph algorithms like PageRank and connected components.
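A sketch running the built-in PageRank on a tiny, hypothetical follower graph:

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Vertices carry a name attribute; edges carry a relationship label.
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(2L, 3L, "follows"),
  Edge(3L, 1L, "follows")
))

val graph = Graph(vertices, edges)

// Iterate PageRank until scores change by less than the tolerance.
val ranks = graph.pageRank(0.0001).vertices
ranks.join(vertices).collect().foreach { case (_, (rank, name)) =>
  println(f"$name%-6s $rank%.4f")
}
```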
-
SparkR (R on Spark): An R package that provides a lightweight frontend enabling R users to leverage Spark's distributed computing capabilities for large-scale data analysis.
-
Spark Shell: Interactive shells, available as spark-shell (Scala), pyspark (Python), and sparkR (R), for exploratory data analysis and prototyping.
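A short sample session, assuming a local Spark installation launched with `./bin/spark-shell` (pyspark and sparkR behave analogously); the shell pre-creates `spark` and `sc`, so no setup code is needed:

```scala
// Typed at the spark-shell prompt; `sc` is created by the shell itself.
scala> sc.parallelize(1 to 5).map(n => n * n).collect()
res0: Array[Int] = Array(1, 4, 9, 16, 25)
```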