Apache Spark
| Apache Spark | |
|---|---|
| Original author | Matei Zaharia |
| Developer | Apache Spark |
| Initial release | May 26, 2014 |
| Stable release | 4.0.1 (Scala 2.13) / September 6, 2025 |
| Written in | Scala[1] |
| Operating system | Windows, macOS, Linux |
| Available in | Scala, Java, SQL, Python, R, C#, F# |
| Type | Data analytics, machine learning algorithms |
| License | Apache License 2.0 |
| Website | spark.apache.org |
| Repository | github.com/apache/spark |
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab starting in 2009, the Spark codebase was donated in 2013 to the Apache Software Foundation, which has maintained it since.
Overview
Apache Spark has its architectural foundation in the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines that is maintained in a fault-tolerant way.[2] The DataFrame API was released as an abstraction on top of the RDD, followed by the Dataset API. In Spark 1.x, the RDD was the primary application programming interface (API), but as of Spark 2.x use of the Dataset API is encouraged[3] even though the RDD API is not deprecated.[4][5] The RDD technology still underlies the Dataset API.[6][7]
Spark and its RDDs were developed in 2012 in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.[8]
Inside Apache Spark the workflow is managed as a directed acyclic graph (DAG). Nodes represent RDDs while edges represent the operations on the RDDs.
Spark facilitates the implementation of both iterative algorithms, which visit their data set multiple times in a loop, and interactive/exploratory data analysis, i.e., the repeated database-style querying of data. The latency of such applications may be reduced by several orders of magnitude compared to an Apache Hadoop MapReduce implementation.[2][9] Among the class of iterative algorithms are the training algorithms for machine learning systems, which formed the initial impetus for developing Apache Spark.[10]
Apache Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone native Spark, Hadoop YARN, Apache Mesos or Kubernetes.[11] A standalone native Spark cluster can be launched manually or by the launch scripts provided by the install package. It is also possible to run the daemons on a single machine for testing. For distributed storage Spark can interface with a wide variety of distributed systems, including Alluxio, Hadoop Distributed File System (HDFS),[12] MapR File System (MapR-FS),[13] Cassandra,[14] OpenStack Swift, Amazon S3, Kudu, Lustre file system,[15] or a custom solution can be implemented. Spark also supports a pseudo-distributed local mode, usually used only for development or testing purposes, where distributed storage is not required and the local file system can be used instead; in such a scenario, Spark is run on a single machine with one executor per CPU core.
Spark Core
Spark Core is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic I/O functionalities, exposed through an application programming interface (for Java, Python, Scala, .NET[16] and R) centered on the RDD abstraction (the Java API is available for other JVM languages, but is also usable for some other non-JVM languages that can connect to the JVM, such as Julia[17]). This interface mirrors a functional/higher-order model of programming: a "driver" program invokes parallel operations such as map, filter or reduce on an RDD by passing a function to Spark, which then schedules the function's execution in parallel on the cluster.[2] These operations, and additional ones such as joins, take RDDs as input and produce new RDDs. RDDs are immutable and their operations are lazy; fault-tolerance is achieved by keeping track of the "lineage" of each RDD (the sequence of operations that produced it) so that it can be reconstructed in the case of data loss. RDDs can contain any type of Python, .NET, Java, or Scala objects.
Besides the RDD-oriented functional style of programming, Spark provides two restricted forms of shared variables: broadcast variables reference read-only data that needs to be available on all nodes, while accumulators can be used to program reductions in an imperative style.[2]
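A minimal sketch of both kinds of shared variable (the application name and sample data are illustrative, not taken from the Spark documentation):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("shared_variables_sketch") // illustrative application name
val sc = new SparkContext(conf)
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2)) // broadcast variable: read-only map shipped once to every node
val missing = sc.longAccumulator("missingKeys") // accumulator: an imperative-style counter aggregated on the driver
val values = sc.parallelize(Seq("a", "b", "c")).map { key =>
  if (!lookup.value.contains(key)) missing.add(1) // count keys absent from the broadcast map
  lookup.value.getOrElse(key, 0)
}
values.count() // an action forces evaluation; accumulator values are only dependable after it completes
println(missing.value) // prints 1 for the sample data above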
A typical example of RDD-centric functional programming is the following Scala program that computes the frequencies of all words occurring in a set of text files and prints the most common ones. Each map, flatMap (a variant of map) and reduceByKey takes an anonymous function that performs a simple operation on a single data item (or a pair of items), and applies its argument to transform an RDD into a new RDD.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
val conf: SparkConf = new SparkConf().setAppName("wiki_test") // create a spark config object
val sc: SparkContext = new SparkContext(conf) // Create a spark context
val data: RDD[String] = sc.textFile("/path/to/somedir") // Read the files in "somedir" into an RDD of lines.
val tokens: RDD[String] = data.flatMap(_.split(" ")) // Split each file into a list of tokens (words).
val wordFreq: RDD[(String, Int)] = tokens.map((_, 1)).reduceByKey(_ + _) // Add a count of one to each token, then sum the counts per word type.
val topWords: Array[(Int, String)] = wordFreq.sortBy(s => -s._2).map(x => (x._2, x._1)).top(10) // Get the top 10 words. Swap word and count to sort by count.
Spark SQL
Spark SQL is a component on top of Spark Core that introduced a data abstraction called DataFrames,[a] which provides support for structured and semi-structured data. Spark SQL provides a domain-specific language (DSL) to manipulate DataFrames in Scala, Java, Python or .NET.[16] It also provides SQL language support, with command-line interfaces and an ODBC/JDBC server. Although DataFrames lack the compile-time type-checking afforded by RDDs, as of Spark 2.0 the strongly typed Dataset is fully supported by Spark SQL as well.
import org.apache.spark.sql.{DataFrame, SparkSession}
val url: String = "jdbc:mysql://yourIP:yourPort/test?user=yourUsername&password=yourPassword" // URL for your database server.
val spark: SparkSession = SparkSession.builder().getOrCreate() // Create a Spark session object
val df: DataFrame = spark
.read
.format("jdbc")
.option("url", url)
.option("dbtable", "people")
.load()
df.printSchema() // Looks at the schema of this DataFrame.
val countsByAge: DataFrame = df.groupBy("age").count() // Counts people by age
Or alternatively via SQL:
df.createOrReplaceTempView("people")
val countsByAge: DataFrame = spark.sql("SELECT age, count(*) FROM people GROUP BY age")
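A short sketch of the typed Dataset API mentioned above, reusing the spark session and df DataFrame from the example; the Person case class is illustrative and assumes the people table has name and age columns:
case class Person(name: String, age: Long) // illustrative schema matching the assumed columns of the "people" table
import spark.implicits._ // provides the encoders needed to convert a DataFrame into a typed Dataset
val people = df.as[Person] // field names and types are now checked at compile time
val adultNames = people.filter(p => p.age >= 18).map(p => p.name) // typed, lambda-based transformations
adultNames.show()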
Spark Streaming
Spark Streaming uses Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD transformations on those mini-batches of data. This design enables the same set of application code written for batch analytics to be used in streaming analytics, thus facilitating easy implementation of lambda architecture.[19][20] However, this convenience comes with the penalty of latency equal to the mini-batch duration. Other streaming data engines that process event by event rather than in mini-batches include Storm and the streaming component of Flink.[21] Spark Streaming has built-in support for consuming from Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP/IP sockets.[22]
In Spark 2.x, a separate technology based on Datasets, called Structured Streaming, that has a higher-level interface is also provided to support streaming.[23]
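A minimal Structured Streaming sketch (using Spark's built-in rate test source; the application name, window length, and watermark are illustrative) that counts events in 30-second event-time windows while tolerating data that arrives up to one minute late:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window
val spark = SparkSession.builder().appName("structured_streaming_sketch").getOrCreate()
import spark.implicits._
val events = spark.readStream.format("rate").option("rowsPerSecond", "10").load() // test source with timestamp and value columns
val counts = events
  .withWatermark("timestamp", "1 minute") // drop state for events more than one minute late
  .groupBy(window($"timestamp", "30 seconds")) // tumbling 30-second event-time windows
  .count()
val query = counts.writeStream
  .outputMode("update") // emit only the windows whose counts changed in each micro-batch
  .format("console")
  .start()
query.awaitTermination()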
Spark can be deployed in a traditional on-premises data center as well as in the cloud.[24]
MLlib machine learning library
Spark MLlib is a distributed machine-learning framework on top of Spark Core that, due in large part to the distributed memory-based Spark architecture, is as much as nine times as fast as the disk-based implementation used by Apache Mahout (according to benchmarks done by the MLlib developers against the alternating least squares (ALS) implementations, and before Mahout itself gained a Spark interface), and scales better than Vowpal Wabbit.[25] Many common machine learning and statistical algorithms have been implemented and are shipped with MLlib, which simplifies large-scale machine learning pipelines (a brief pipeline sketch follows the list), including:
- summary statistics, correlations, stratified sampling, hypothesis testing, random data generation[26]
- classification and regression: support vector machines, logistic regression, linear regression, naive Bayes classification, decision trees, random forests, and gradient-boosted trees
- collaborative filtering techniques including alternating least squares (ALS)
- cluster analysis methods including k-means, and latent Dirichlet allocation (LDA)
- dimensionality reduction techniques such as singular value decomposition (SVD), and principal component analysis (PCA)
- feature extraction and transformation functions
- optimization algorithms such as stochastic gradient descent, limited-memory BFGS (L-BFGS)
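A brief sketch of such a pipeline (the data, column names, and parameters are illustrative): a tokenizer and a hashing term-frequency transformer feed a logistic regression classifier, and the whole chain is trained as a single estimator.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("mllib_pipeline_sketch").getOrCreate()
val training = spark.createDataFrame(Seq(
  (0L, "spark makes big data simple", 1.0),
  (1L, "mapreduce writes intermediate results to disk", 0.0)
)).toDF("id", "text", "label") // tiny illustrative dataset
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words") // split text into words
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features") // hash words into feature vectors
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)) // feature extraction plus classifier as one estimator
val model = pipeline.fit(training) // fits every stage in order
model.transform(training).select("text", "prediction").show()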
GraphX
GraphX is a distributed graph-processing framework on top of Apache Spark. Because it is based on RDDs, which are immutable, graphs are immutable and thus GraphX is unsuitable for graphs that need to be updated, let alone in a transactional manner like a graph database.[27] GraphX provides two separate APIs for implementation of massively parallel algorithms (such as PageRank): a Pregel abstraction, and a more general MapReduce-style API.[28] Unlike its predecessor Bagel, which was formally deprecated in Spark 1.6, GraphX has full support for property graphs (graphs where properties can be attached to edges and vertices).[29]
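A minimal GraphX sketch (the vertex and edge data are illustrative) that builds a small property graph from RDDs and runs PageRank via the built-in operator:
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("graphx_sketch")
val sc = new SparkContext(conf)
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol"))) // (vertex id, property) pairs
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))
val graph = Graph(vertices, edges) // an immutable property graph backed by RDDs
val ranks = graph.pageRank(0.0001).vertices // iterate until the ranks converge within the given tolerance
ranks.join(vertices).map { case (_, (rank, name)) => (name, rank) }.collect().foreach(println)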
Like Apache Spark, GraphX initially started as a research project at UC Berkeley's AMPLab and Databricks, and was later donated to the Apache Software Foundation and the Spark project.[30]
Language support
Apache Spark has built-in support for Scala, Java, SQL, R, and Python, with third-party support for the .NET CLR,[31] Julia,[32] and more.
History
Spark was initially started by Matei Zaharia at UC Berkeley's AMPLab in 2009, and open sourced in 2010 under a BSD license.[33]
In 2013, the project was donated to the Apache Software Foundation and switched its license to Apache 2.0. In February 2014, Spark became a Top-Level Apache Project.[34]
In November 2014, Spark founder M. Zaharia's company Databricks set a new world record in large scale sorting using Spark.[35][33]
Spark had in excess of 1000 contributors in 2015,[36] making it one of the most active projects in the Apache Software Foundation[37] and one of the most active open source big data projects.
| Version | Original release date | Latest version | Release date |
|---|---|---|---|
| 0.5 | 2012-06-12 | 0.5.2 | 2012-11-22 |
| 0.6 | 2012-10-15 | 0.6.2 | 2013-02-07 |
| 0.7 | 2013-02-27 | 0.7.3 | 2013-07-16 |
| 0.8 | 2013-09-25 | 0.8.1 | 2013-12-19 |
| 0.9 | 2014-02-02 | 0.9.2 | 2014-07-23 |
| 1.0 | 2014-05-26 | 1.0.2 | 2014-08-05 |
| 1.1 | 2014-09-11 | 1.1.1 | 2014-11-26 |
| 1.2 | 2014-12-18 | 1.2.2 | 2015-04-17 |
| 1.3 | 2015-03-13 | 1.3.1 | 2015-04-17 |
| 1.4 | 2015-06-11 | 1.4.1 | 2015-07-15 |
| 1.5 | 2015-09-09 | 1.5.2 | 2015-11-09 |
| 1.6 | 2016-01-04 | 1.6.3 | 2016-11-07 |
| 2.0 | 2016-07-26 | 2.0.2 | 2016-11-14 |
| 2.1 | 2016-12-28 | 2.1.3 | 2018-06-26 |
| 2.2 | 2017-07-11 | 2.2.3 | 2019-01-11 |
| 2.3 | 2018-02-28 | 2.3.4 | 2019-09-09 |
| 2.4 LTS | 2018-11-02 | 2.4.8 | 2021-05-17[38] |
| 3.0 | 2020-06-18 | 3.0.3 | 2021-06-01[39] |
| 3.1 | 2021-03-02 | 3.1.3 | 2022-02-18[40] |
| 3.2 | 2021-10-13 | 3.2.4 | 2023-04-13[41] |
| 3.3 | 2022-06-16 | 3.3.3 | 2023-08-21[42] |
| 3.4 | 2023-04-13 | 3.4.4 | 2024-10-27[43] |
| 3.5 LTS | 2023-09-09 | 3.5.8 | 2026-01-15[44] |
| 4.0 | 2025-05-23 | 4.0.1 | 2025-09-06[45] |
| 4.1 | 2025-12-16 | 4.1.1 | 2026-01-09[46] |
| 4.2 | 2026-01-11 | 4.2.0-preview1 | 2026-01-11[47] |
Legend: Unsupported, Supported, Latest version, Preview version
Scala version
Spark 3.5.2 is based on Scala 2.13 (and thus works with Scala 2.12 and 2.13 out-of-the-box), but it can also be made to work with Scala 3.[48]
Developers
Apache Spark is developed by a community. The project is managed by a group called the "Project Management Committee" (PMC).[49]
Maintenance releases and EOL
Feature release branches will, generally, be maintained with bug fix releases for a period of 18 months. For example, branch 2.3.x is no longer considered maintained as of September 2019, 18 months after the release of 2.3.0 in February 2018. No more 2.3.x releases should be expected after that point, even for bug fixes.
The last minor release within a major release will typically be maintained for longer as an “LTS” release. For example, 2.4.0 was released on November 2, 2018, and was maintained for 31 months until 2.4.8 was released in May 2021. 2.4.8 is the last release and no more 2.4.x releases should be expected, even for bug fixes.[50]
See also
Notes
References
- ^ "Spark Release 2.0.0".
MLlib in R: SparkR now offers MLlib APIs [..] Python: PySpark now offers many more MLlib algorithms
- ^ a b c d Zaharia, Matei; Chowdhury, Mosharaf; Franklin, Michael J.; Shenker, Scott; Stoica, Ion. Spark: Cluster Computing with Working Sets (PDF). USENIX Workshop on Hot Topics in Cloud Computing (HotCloud).
- ^ "Spark 2.2.0 Quick Start". apache.org. 2017-07-11. Retrieved 2017-10-19.
we highly recommend you to switch to use Dataset, which has better performance than RDD
- ^ "Spark 2.2.0 deprecation list". apache.org. 2017-07-11. Retrieved 2017-10-10.
- ^ Damji, Jules (2016-07-14). "A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets: When to use them and why". databricks.com. Retrieved 2017-10-19.
- ^ Chambers, Bill (2017-08-10). "12". Spark: The Definitive Guide. O'Reilly Media.
virtually all Spark code you run, whether DataFrames or Datasets, compiles down to an RDD
- ^ "What is Apache Spark? Spark Tutorial Guide for Beginner". janbasktraining.com. 2018-04-13. Retrieved 2018-04-13.
- ^ Zaharia, Matei; Chowdhury, Mosharaf; Das, Tathagata; Dave, Ankur; Ma, Justin; McCauley, Murphy; J., Michael; Shenker, Scott; Stoica, Ion (2010). Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (PDF). USENIX Symp. Networked Systems Design and Implementation.
- ^ Xin, Reynold; Rosen, Josh; Zaharia, Matei; Franklin, Michael; Shenker, Scott; Stoica, Ion (June 2013). Shark: SQL and Rich Analytics at Scale (PDF). SIGMOD 2013. arXiv:1211.6176. Bibcode:2012arXiv1211.6176X.
- ^ Harris, Derrick (28 June 2014). "4 reasons why Spark could jolt Hadoop into hyperdrive". Gigaom. Archived from the original on 24 October 2017. Retrieved 25 February 2016.
- ^ "Cluster Mode Overview - Spark 2.4.0 Documentation - Cluster Manager Types". apache.org. Apache Foundation. 2019-07-09. Archived from the original on 2025-09-26. Retrieved 2019-07-09.
- ^ Figure showing Spark in relation to other open-source Software projects including Hadoop
- ^ MapR ecosystem support matrix
- ^ Doan, DuyHai (2014-09-10). "Re: cassandra + spark / pyspark". Cassandra User (Mailing list). Retrieved 2014-11-21.
- ^ Wang, Yandong; Goldstone, Robin; Yu, Weikuan; Wang, Teng (May 2014). "Characterization and Optimization of Memory-Resident MapReduce on HPC Systems". 2014 IEEE 28th International Parallel and Distributed Processing Symposium. IEEE. pp. 799–808. doi:10.1109/IPDPS.2014.87. ISBN 978-1-4799-3800-1. S2CID 11157612.
- ^ a b dotnet/spark, .NET Platform, 2020-09-14, retrieved 2020-09-14
- ^ "GitHub - DFDX/Spark.jl: Julia binding for Apache Spark". GitHub. 2019-05-24.
- ^ "Spark Release 1.3.0 | Apache Spark".
- ^ "Applying the Lambda Architecture with Spark, Kafka, and Cassandra | Pluralsight". www.pluralsight.com. Retrieved 2016-11-20.
- ^ Shapira, Gwen (29 August 2014). "Building Lambda Architecture with Spark Streaming". cloudera.com. Cloudera. Archived from the original on 14 June 2016. Retrieved 17 June 2016.
re-use the same aggregates we wrote for our batch application on a real-time data stream
- ^ Chintapalli, Sanket; Dagit, Derek; Evans, Bobby; Farivar, Reza; Graves, Thomas; Holderbaugh, Mark; Liu, Zhuo; Nusbaum, Kyle; Patil, Kishorkumar; Peng, Boyang Jerry; Poulosky, Paul (May 2016). "Benchmarking Streaming Computation Engines: Storm, Flink and Spark Streaming". 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE. pp. 1789–1792. doi:10.1109/IPDPSW.2016.138. ISBN 978-1-5090-3682-0. S2CID 2180634.
- ^ Kharbanda, Arush (17 March 2015). "Getting Data into Spark Streaming". sigmoid.com. Sigmoid (Sunnyvale, California IT product company). Archived from the original on 15 August 2016. Retrieved 7 July 2016.
- ^ Zaharia, Matei (2016-07-28). "Structured Streaming In Apache Spark: A new high-level API for streaming". databricks.com. Retrieved 2017-10-19.
- ^ "On-Premises vs. Cloud Data Warehouses: Pros and Cons". SearchDataManagement. Retrieved 2022-10-16.
- ^ Sparks, Evan; Talwalkar, Ameet (2013-08-06). "Spark Meetup: MLbase, Distributed Machine Learning with Spark". slideshare.net. Spark User Meetup, San Francisco, California. Retrieved 10 February 2014.
- ^ "MLlib | Apache Spark". spark.apache.org. Retrieved 2016-01-18.
- ^ Malak, Michael (14 June 2016). "Finding Graph Isomorphisms In GraphX And GraphFrames: Graph Processing vs. Graph Database". slideshare.net. sparksummit.org. Retrieved 11 July 2016.
- ^ Malak, Michael (1 July 2016). Spark GraphX in Action. Manning. p. 89. ISBN 9781617292521.
Pregel and its little sibling aggregateMessages() are the cornerstones of graph processing in GraphX. ... algorithms that require more flexibility for the terminating condition have to be implemented using aggregateMessages()
- ^ Malak, Michael (14 June 2016). "Finding Graph Isomorphisms In GraphX And GraphFrames: Graph Processing vs. Graph Database". slideshare.net. sparksummit.org. Retrieved 11 July 2016.
- ^ Gonzalez, Joseph; Xin, Reynold; Dave, Ankur; Crankshaw, Daniel; Franklin, Michael; Stoica, Ion (Oct 2014). GraphX: Graph Processing in a Distributed Dataflow Framework (PDF). OSDI 2014.
- ^ ".NET for Apache Spark | Big data analytics". 15 October 2019.
- ^ "Spark.jl". GitHub. 14 October 2021.
- ^ a b Clark, Lindsay. "Apache Spark speeds up big data decision-making". ComputerWeekly.com. Retrieved 2018-05-16.
- ^ "The Apache Software Foundation Announces Apache™ Spark™ as a Top-Level Project". apache.org. Apache Software Foundation. 27 February 2014. Retrieved 4 March 2014.
- ^ Spark officially sets a new record in large-scale sorting
- ^ Open HUB Spark development activity
- ^ "The Apache Software Foundation Announces Apache™ Spark™ as a Top-Level Project". apache.org. Apache Software Foundation. 27 February 2014. Retrieved 4 March 2014.
- ^ "Spark 2.4.8 released". spark.apache.org. Archived from the original on 2021-08-25.
- ^ "Spark 3.0.3 released". spark.apache.org.
- ^ "Spark 3.1.3 released". spark.apache.org. Archived from the original on 2022-06-18.
- ^ "Spark 3.2.4 released". spark.apache.org.
- ^ "Spark 3.3.3 released". spark.apache.org.
- ^ "Spark 3.4.4 released". spark.apache.org.
- ^ "Spark 3.5.8 released". spark.apache.org.
- ^ "Spark 4.0.1 released". spark.apache.org.
- ^ "Spark 4.1.1 released". spark.apache.org.
- ^ "Preview release of Spark 4.2.0". spark.apache.org.
- ^ "Using Scala 3 with Spark". 47 Degrees. Retrieved 29 July 2022.
- ^ "Apache Committee Information".
- ^ "Versioning policy". spark.apache.org.
External links
Apache Spark
Introduction
Definition and Purpose
Apache Spark is an open-source, distributed computing framework that provides a unified analytics engine for large-scale data processing, leveraging in-memory computing to handle vast datasets efficiently across clusters.[1] Developed under the Apache Software Foundation, it supports high-level APIs in multiple languages including Scala, Java, Python, and R, allowing developers to perform complex data operations without managing low-level distributed systems details.[6] The primary purposes of Apache Spark include enabling batch processing for large static datasets, real-time streaming for continuous data flows, machine learning workflows through its MLlib library, and graph processing via GraphX, all integrated into a single cohesive platform that reduces the need for multiple specialized tools.[1][7] This unified approach facilitates seamless transitions between different data processing paradigms, supporting data engineering, science, and analytics tasks at scale.[1]
In comparison to predecessors like Hadoop MapReduce, which rely on disk-based processing, Spark delivers up to 100 times faster performance in memory and 10 times faster on disk by employing directed acyclic graph (DAG) execution for optimized task scheduling and lazy evaluation to minimize unnecessary computations.[8] At the heart of Spark's design lies the Resilient Distributed Dataset (RDD), a fault-tolerant, immutable distributed collection that serves as the core abstraction for parallel data operations, ensuring reliability without full data replication.[9][10] Spark's versatility makes it suitable for high-level use cases such as extract-transform-load (ETL) operations to prepare data pipelines, interactive queries for ad-hoc analysis, and real-time analytics to derive insights from streaming sources, powering applications in industries like finance, healthcare, and e-commerce.[7]
Key Features and Benefits
Apache Spark's in-memory computation capability allows data to be cached directly in RAM across cluster nodes, significantly reducing disk I/O overhead and accelerating iterative algorithms such as machine learning training. This approach enables Spark to achieve up to 100 times faster performance compared to disk-based systems like Hadoop MapReduce for multi-pass applications.[5] By keeping intermediate results in memory, Spark minimizes data movement, making it particularly effective for workloads requiring repeated data access.[9]
As a unified analytics engine, Spark provides a single framework for diverse data processing tasks, including batch processing, real-time streaming, machine learning, and graph analytics, thereby reducing the need for multiple specialized tools and simplifying development workflows.[6] This integration fosters developer productivity through high-level APIs that abstract low-level distributed systems complexities, allowing code written on a single machine to scale seamlessly to large clusters.[1]
Spark ensures fault tolerance through its Resilient Distributed Datasets (RDDs), which track data lineage to automatically recompute lost partitions upon node failures without requiring manual checkpoints or data replication.[5] This mechanism supports reliable operation across distributed environments, enhancing system resilience for large-scale deployments. Spark's scalability extends to petabyte-scale datasets distributed over thousands of nodes, leveraging efficient execution graphs for high-throughput processing.[6]
Key benefits include substantial cost savings from accelerated processing times, with in-memory operations often yielding orders-of-magnitude improvements in iterative tasks, and improved developer efficiency via intuitive APIs in languages like Scala, Python, Java, and R.[5][6] Additionally, Spark's interoperability allows it to run on various cluster managers such as Hadoop YARN, Kubernetes, and standalone modes, as well as integrate with storage systems like HDFS.[11]
Architecture
Spark Core and RDDs
Spark Core serves as the foundational engine of Apache Spark, providing distributed task dispatching, scheduling, and basic input/output functionalities that underpin all higher-level components. It manages the execution of tasks across a cluster, handling resource allocation in coordination with external cluster managers, and supports in-memory computation to accelerate data processing. This core layer enables Spark's unified analytics capabilities by abstracting the complexities of distributed computing.[12]
At the heart of Spark Core is the Resilient Distributed Dataset (RDD), an immutable, partitioned collection of objects that can be processed in parallel across nodes in a cluster. RDDs provide a fault-tolerant abstraction for in-memory cluster computing, allowing users to explicitly cache data in memory and control its partitioning to optimize performance. They can be created from external data sources like Hadoop files or by transforming other RDDs, ensuring data locality and parallelism. Examples include parallelizing a local collection into an RDD or loading data from HDFS.[9][10]
RDDs support two primary types of operations: transformations and actions. Transformations, such as map, filter, and join, create a new RDD from an existing one without immediately computing the result, enabling composition of complex pipelines. Actions, like count, collect, and reduce, trigger the execution of the computation and return a value or write data to an output system. These operations allow developers to express data processing workflows in a functional style.[10]
The execution model of Spark Core relies on lazy evaluation, where transformations build a lineage graph without performing computations until an action is invoked. This lineage is represented as a directed acyclic graph (DAG) of dependencies between RDDs, which tracks the sequence of transformations applied. Upon an action, the DAG scheduler divides the graph into stages of tasks, optimizing for pipelining narrow dependencies while handling wide dependencies separately. This approach minimizes unnecessary work and supports iterative algorithms efficiently.[9][10]
RDD lineage enables fault tolerance by allowing lost partitions to be recomputed from the original data using the recorded transformations, without requiring data replication. In case of node failure, Spark uses the DAG to regenerate only the affected partitions, ensuring reliability in large-scale distributed environments. This recomputation mechanism contrasts with traditional checkpointing, offering low-overhead recovery for in-memory data.[9][10]
Data in RDDs is distributed across the cluster via partitioning, where each RDD consists of multiple partitions stored on different nodes to enable parallelism. The number of partitions can be controlled during creation or via operations like repartition, influencing the granularity of parallel tasks. Shuffling occurs during wide dependencies, such as groupByKey or reduceByKey, where data is redistributed across partitions based on keys, often involving network I/O and disk spills if memory is insufficient; this is one of the most costly operations in Spark. To mitigate shuffle overhead, users can tune partition sizes and use combiners for aggregations.[10]
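A small sketch of partitioning and of a shuffle-producing wide dependency (the partition counts are illustrative):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("partitioning_sketch")
val sc = new SparkContext(conf)
val pairs = sc.parallelize(1 to 1000000, 8).map(i => (i % 100, 1)) // 8 initial partitions of (key, 1) pairs
println(pairs.getNumPartitions) // 8
val counts = pairs.reduceByKey(_ + _, 16) // wide dependency: values are shuffled by key into 16 partitions
println(counts.getNumPartitions) // 16
val rebalanced = counts.repartition(4) // a full shuffle that redistributes the data into 4 partitions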
Spark Core's memory management is enhanced by the Tungsten project, which optimizes serialization through columnar in-memory layouts and whole-stage code generation for CPU efficiency. Tungsten also supports off-heap memory allocation to store data outside the JVM heap, reducing garbage collection pauses and enabling larger working sets. These features, including efficient binary serialization and cache-aware computation, significantly improve performance for memory-intensive workloads.[13]
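The off-heap allocation described above is controlled through configuration; a minimal sketch (the size shown is illustrative):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("offheap_sketch")
  .config("spark.memory.offHeap.enabled", "true") // allow Tungsten to allocate memory outside the JVM heap
  .config("spark.memory.offHeap.size", "2g") // must be set when off-heap allocation is enabled; value is illustrative
  .getOrCreate()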
Cluster Managers and Deployment
Apache Spark relies on external cluster managers to distribute workloads across a cluster of machines, handling resource allocation, task scheduling, and fault recovery independently of Spark's core engine. The cluster manager allocates resources such as CPU, memory, and optionally GPUs to Spark applications by launching executor processes on worker nodes. These executors perform computations and store data in memory or disk, while the manager ensures fault tolerance through mechanisms like restarting failed executors or reallocating tasks upon node failures. This separation allows Spark to integrate with various resource management frameworks without modifying its internal execution model.[11]
Spark supports several cluster managers, each suited to different environments. The built-in Standalone mode provides a simple, self-contained option for deployments without external dependencies, using a master-worker architecture where the master tracks resources and workers host executors. Installation involves placing Spark binaries on cluster nodes and starting the master daemon, which then accepts worker registrations; it supports dynamic resource allocation and fault recovery by relaunching executors on healthy nodes. For Hadoop ecosystems, Spark integrates with YARN (Yet Another Resource Negotiator), leveraging YARN's resource management to schedule Spark applications alongside other workloads; this integration, introduced in Spark 0.6.0, allows Spark to request containers for executors and handles fault tolerance via YARN's application master process.[14][15]
Historically, Spark supported Apache Mesos for fine-grained or coarse-grained resource sharing, where coarse-grained mode (default) allocates entire executors statically and fine-grained mode (deprecated) shares resources at the task level for more dynamic allocation; however, Mesos support was deprecated in Spark 3.2.0 and fully removed in Spark 4.0.0 to streamline maintenance. Since Spark 2.3.0, Spark offers native integration with Kubernetes for containerized deployments, using Kubernetes' scheduler to manage pods for drivers and executors; this enables dynamic resource allocation, scaling, and portability across cloud-native environments, with production readiness achieved in Spark 3.1.0.[16][17][18]
Spark applications can be deployed in client mode or cluster mode, determining the location of the driver program, the process that creates the SparkContext and coordinates execution. In client mode, the driver runs on the machine submitting the application (e.g., a developer's laptop), suitable for interactive sessions where output is needed locally, while executors run on cluster nodes. In cluster mode, the driver runs inside the cluster (e.g., as a YARN application master or Kubernetes pod), allowing the client to disconnect after submission; this mode is ideal for production jobs to avoid driver failures due to client connectivity issues. Security features, such as Kerberos authentication, are supported across managers to secure access to protected resources like HDFS; configuration involves setting properties like spark.kerberos.renewal.credentials to use ticket caches for authentication without exposing credentials.[11][19][15][20]
Monitoring Spark applications occurs primarily through the built-in Web UI, accessible at http://<driver-node>:4040 during execution, which provides tabs for jobs (showing scheduler stages and task progress), stages (detailing task metrics like duration and shuffle data), executors (listing active instances with memory and disk usage), storage (RDD persistence details), and environment (configuration and dependencies). For completed applications, the Spark History Server (port 18080) aggregates logs if event logging is enabled via spark.eventLog.enabled=true. Integration with cluster managers extends monitoring; for example, YARN provides executor logs via its ResourceManager UI, enhancing visibility into resource utilization and failures.[21]
Data Processing Modules
Spark SQL and DataFrames
Spark SQL is a module within Apache Spark designed for processing structured data, enabling users to query data using SQL or the DataFrame API while leveraging Spark's distributed execution engine.[22] It supports HiveQL syntax and provides seamless integration for structured data operations, allowing queries on data from various sources such as files, databases, and existing RDDs.[22] Unlike the low-level RDD API, Spark SQL exposes more information about data structure and computations to enable advanced optimizations.[23]
DataFrames represent distributed collections of data organized into named columns, akin to relational database tables or data frames in R and Python, but with built-in optimizations for distributed processing.[22] They are lazily evaluated and implemented on top of RDDs, supporting operations like filtering, aggregation, and joining through a domain-specific language.[24] DataFrames can be created from structured data files, Hive tables, external databases, or RDDs, and are available across Scala, Java, Python, and R APIs.[22] The Dataset API extends DataFrames by providing a type-safe interface, primarily in Scala and Java, where users can work with strongly typed objects while retaining the benefits of structured operations and optimizations.[25] Datasets combine the functional programming capabilities of RDDs with the relational model of DataFrames, allowing transformations using both functional and SQL-like methods.[22] In Python and R, the DataFrame API serves a similar role without the type safety of Datasets.[22]
Query optimization in Spark SQL is handled by the Catalyst optimizer, which performs logical and physical planning to generate efficient execution plans.[26] Catalyst uses a tree-based representation of queries and applies rule-based and cost-based optimizations, leveraging Scala's pattern matching and quasiquotes for extensibility.[26] It resolves tables and columns, applies predicate pushdown, and selects join algorithms based on data statistics, significantly improving performance over unoptimized plans.[23]
Key features of Spark SQL include support for user-defined functions (UDFs), which allow custom logic on rows or aggregates, and window functions for computations over a set of rows related to the current row, such as ranking or cumulative sums.[27][28] It also natively handles data formats like JSON, where schemas can be inferred and loaded as DataFrames, and Parquet, preserving schema and enabling efficient columnar storage and compression.[29][30] In Spark 4.0 (released May 2025), ANSI SQL mode is enabled by default for improved compliance, and support for the VARIANT data type was added for handling semi-structured data like JSON and Avro.[17] When spark.sql.ansi.enabled is set to false (via Spark configuration or session setting, for example for compatibility in Databricks Runtime), to_date(expr) or CAST(expr AS DATE) provides lenient string-to-date conversion by returning NULL for malformed or invalid dates instead of raising an error. With an explicit format, to_date(expr, 'fmt') behaves the same way when ANSI mode is disabled, returning NULL on invalid inputs.[31]
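A small sketch of the lenient conversion behavior described above, assuming an existing SparkSession named spark:
spark.conf.set("spark.sql.ansi.enabled", "false") // disable ANSI mode so invalid dates yield NULL rather than errors
spark.sql("SELECT to_date('2024-02-30') AS d").show() // February 30 is not a valid date, so d is NULL
spark.sql("SELECT to_date('31/12/2024', 'dd/MM/yyyy') AS d").show() // explicit format; malformed input would likewise yield NULL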
Spark SQL integrates with the Hive metastore to access metadata for Hive tables, supporting Hive SerDes and UDFs for compatibility with existing Hive warehouses.[32] It also exposes a distributed SQL engine through JDBC and ODBC servers, enabling external tools to connect and execute queries directly without writing Spark code.[33][34]
Introduced in Spark 3.0, Adaptive Query Execution (AQE) enhances runtime performance by dynamically adjusting execution plans based on runtime statistics, including handling data skew in joins and aggregations through techniques like splitting skewed partitions.[35][36] AQE is enabled by default in recent versions and includes features like dynamic coalesce for reducing shuffle partitions post-aggregation.[35]
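The AQE behavior described above is governed by a handful of SQL configuration flags; a minimal sketch, assuming an existing SparkSession named spark (these flags default to true in recent releases):
spark.conf.set("spark.sql.adaptive.enabled", "true") // turn on Adaptive Query Execution
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true") // merge small shuffle partitions at runtime
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true") // split skewed partitions during sort-merge joins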
Structured Streaming
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine, introduced in Apache Spark 2.0 as an evolution from the earlier Spark Streaming model that relied on Discretized Streams (DStreams).[37] It replaces the DStream-based API with a Dataset/DataFrame-based API, enabling unified batch and streaming processing while providing end-to-end exactly-once semantics through integration with replayable data sources and idempotent sinks.[37] This shift allows developers to express streaming computations using the same high-level APIs as batch processing, simplifying development and maintenance.
At its core, Structured Streaming treats input data streams as unbounded tables that are continuously appended, where streaming computations are expressed as operations on these tables, similar to static DataFrames.[37] The engine processes data incrementally using a micro-batch approach by default, where incoming data is batched into small, discrete units and processed as a series of deterministic batch jobs triggered at configurable intervals.[37] Triggers control the scheduling, such as processing available data upon arrival (default micro-batch) or at fixed intervals, enabling control over latency and throughput.[37] Since Spark 2.3, an experimental continuous processing mode has been available, allowing sub-millisecond end-to-end latencies by continuously running tasks without batching, though it supports only at-least-once guarantees and a limited set of operations; this mode remains experimental in Spark 3.x and has seen refinements in Spark 4.0 for improved performance and observability. In Spark 4.0 (May 2025), the Arbitrary State API v2 was introduced for more flexible stateful processing, along with the State Data Source for easier debugging of streaming state.[38][17]
Structured Streaming supports a variety of input sources for ingesting continuous data, including Apache Kafka for message queues, file systems for monitoring directories, and TCP sockets for simple network streams, among others that provide replayability for fault recovery.[37] For file-based sources such as JSON and CSV, an explicit schema is required by default; schema inference can be re-enabled for ad-hoc use by setting the spark.sql.streaming.schemaInference configuration to true.
It is recommended as a best practice to explicitly define the schema rather than relying on inference, as inferred schemas can change over time across incoming files and cause runtime failures.[37] Outputs, or sinks, allow writing processed results to destinations such as the console for debugging, in-memory tables for testing, or custom logic via the foreach sink for integration with external systems, with support for idempotent operations to ensure reliable delivery.[37]
Fault tolerance is achieved through checkpointing, where the engine periodically saves the current progress, including offsets from sources and intermediate state, to a reliable storage like HDFS or cloud object stores, enabling recovery from failures by restarting from the last checkpoint.[37] For handling late-arriving or out-of-order data based on event time, watermarking defines a threshold beyond which old data is discarded, allowing the system to bound state size in aggregations and joins while processing within acceptable delays; for instance, a 10-minute watermark retains data up to 10 minutes late.[37][39] The processing model guarantees at-least-once delivery by default through output modes like append or complete, but achieves end-to-end exactly-once semantics when using replayable sources (e.g., Kafka with committed offsets) and idempotent sinks, combined with Write-Ahead Logs and checkpointing to prevent duplicates or losses during failures.[37] This ensures reliable stream processing even in distributed environments.
In terms of performance, Structured Streaming enables low-latency processing with sub-second trigger intervals in micro-batch mode, often achieving latencies under 250 milliseconds for operational workloads through optimizations like adaptive query execution. It integrates seamlessly with Spark SQL, allowing streaming DataFrames to join with static tables or other streams using familiar query expressions, supporting complex operations like aggregations and windowing without custom code.[37] For scalability, queries can be deployed on Spark cluster managers like YARN or Kubernetes, distributing processing across nodes to handle high-throughput streams.[37]
MLlib Machine Learning Library
MLlib is Apache Spark's open-source distributed machine learning library, designed to enable scalable implementation of common machine learning algorithms and workflows across clusters. It integrates seamlessly with Spark's core APIs, supporting languages such as Scala, Java, Python, and R, and operates on distributed data structures like DataFrames for efficient processing of large-scale datasets. The library emphasizes ease of use by providing high-level abstractions for data preparation, model training, and evaluation, while leveraging Spark's in-memory computation to achieve performance gains over traditional disk-based systems.[40][41] MLlib supports a range of algorithms for core machine learning tasks, including classification with logistic regression and decision trees, regression via generalized linear models, clustering through K-means, dimensionality reduction using Principal Component Analysis (PCA), and recommendation systems employing Alternating Least Squares (ALS). These algorithms are implemented to distribute computations across cluster nodes, allowing them to handle datasets that exceed single-machine memory limits. For instance, logistic regression in MLlib uses stochastic gradient descent for training on massive feature vectors, while K-means employs scalable variants like k-means++ for initialization to ensure convergence on billions of data points. Decision trees and their ensembles, such as random forests, support both classification and regression by recursively partitioning data based on feature splits, optimized for distributed execution.[42][43][44] Feature engineering in MLlib is facilitated by a suite of transformers and estimators that process raw data into suitable formats for modeling. Transformers such as Tokenizers convert text into word arrays, while StandardScaler normalizes features to unit variance, enabling consistent scaling across distributed partitions. The Pipeline API allows users to chain these stages—such as feature extraction, transformation, and selection—into reusable workflows, treating the entire sequence as a single estimator for simplified management and serialization. This design promotes modularity, where stages like vector assemblers combine multiple feature columns into a single vector, streamlining input to downstream algorithms.[45][46] Model training in MLlib utilizes distributed optimization techniques tailored for large datasets, including Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) for smooth convex problems like logistic regression, which iteratively approximates the Hessian to converge efficiently on terabyte-scale data. Hyperparameter tuning is supported through tools like CrossValidator, which performs k-fold cross-validation over parameter grids in a distributed manner, evaluating models on held-out subsets to select optimal configurations without manual intervention. For advanced features, MLlib includes Gaussian Mixture Models for probabilistic clustering via expectation-maximization, modeling data as mixtures of multivariate Gaussians, and survival regression through Accelerated Failure Time (AFT) models based on Weibull distributions for time-to-event analysis. 
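A brief sketch of distributed hyperparameter tuning with CrossValidator (the parameter grid is illustrative, and trainingData is assumed to be a DataFrame with "features" and "label" columns):
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
val lr = new LogisticRegression().setMaxIter(20)
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5))
  .build() // four candidate parameter combinations
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator()) // area under the ROC curve by default
  .setEstimatorParamMaps(grid)
  .setNumFolds(3) // 3-fold cross-validation, evaluated across the cluster
val cvModel = cv.fit(trainingData) // trainingData is an assumed DataFrame with "features" and "label" columns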
Integration with external libraries, such as TensorFlow, is possible via community wrappers like TensorFlowOnSpark, enabling hybrid workflows where Spark handles data preprocessing and TensorFlow performs deep learning inference.[46][43][47] In Spark 3.x releases, MLlib received enhancements for improved GPU support through integration with the RAPIDS Accelerator, which offloads compatible algorithms like linear regression and K-means to NVIDIA GPUs for up to 10x speedups on certain workloads, alongside optimizations in vector assembly for faster feature construction. In Spark 4.0 (May 2025), MLlib gained support for execution via Spark Connect, allowing remote ML pipeline operations.[48][17]
Evaluation metrics are built into MLlib's evaluators, providing functions to compute accuracy for classification tasks and F1-score as a harmonic mean of precision and recall, applicable to both binary and multiclass scenarios via DataFrame-based APIs. These metrics facilitate model assessment on distributed test sets, often combined with pipelines for end-to-end validation. Data preparation for MLlib commonly leverages Spark SQL DataFrames for querying and transforming structured data prior to modeling.[48]
GraphX Graph Processing
GraphX is Apache Spark's API for graph-parallel computation, providing abstractions for representing and processing large-scale graphs within the Spark ecosystem. Built directly on top of Spark's resilient distributed datasets (RDDs), it enables developers to perform graph analytics alongside other data processing tasks in a unified framework. This integration allows seamless mixing of graph operations with Spark's core dataflow capabilities, facilitating efficient handling of graph data in distributed environments.[49][50]
In GraphX, graphs are modeled as property graphs, where vertices are represented as an RDD of tuples in the form (VertexId, VertexValue), with VertexId being a unique 64-bit long identifier and VertexValue holding arbitrary user-defined attributes. Edges are similarly represented as an RDD of Edge[EdgeValue] objects, each containing source and destination VertexIds along with an EdgeValue for properties like weights. This dual-RDD structure supports flexible property graphs while leveraging Spark's fault-tolerant distribution and in-memory computation. GraphX enforces no specific ordering on vertex identifiers, allowing broad applicability to various graph topologies.[51]
GraphX offers a range of operators for manipulating graphs, categorized into basic and structural types. Basic operators include joinVertices, which augments vertex attributes based on neighbor information, and subgraph, which filters vertices and edges to create induced subgraphs. Structural operators provide high-level analytics such as connectedComponents for identifying disjoint graph partitions and triangleCounting for measuring local clustering. These operators are optimized for parallel execution, reducing the need for low-level RDD manipulations.[51]
For iterative algorithms, GraphX implements the Pregel API, an abstraction for message-passing computations inspired by the Pregel system. It supports algorithms like PageRank, which iteratively propagates rankings across vertices, and single-source shortest paths, where messages accumulate distances from a source. The API operates in supersteps, sending messages along edges and updating vertex states until convergence, all while maintaining Spark's distribution and fault tolerance.[51][50]
Graphs in GraphX can be constructed from existing RDDs of vertices and edges or built from collections such as edge lists using utility methods like GraphLoader.edgeListFile. Optimizations include edge partitioning strategies, such as random vertex cut, which balance computational load by co-partitioning vertices with high-degree neighbors to minimize communication overhead. These builders ensure efficient loading and partitioning for large-scale graphs.[51]
Performance-wise, GraphX leverages Spark's in-memory processing to handle graphs with billions of edges; for instance, it computes PageRank on a 1-billion-edge graph across 32 machines in under 40 seconds. It integrates with Spark's MLlib for graph-based machine learning tasks, such as community detection, by converting graphs to compatible formats. This enables scalable analytics without data movement.[50]
Despite its strengths, GraphX has limitations, particularly in handling dynamic graphs with frequent updates, where its static representation can lead to inefficiencies. For scenarios requiring SQL-like declarative queries on graphs, the GraphFrames library, built on DataFrames, is recommended as a more flexible alternative.[51][52]
Language Support and APIs
Supported Programming Languages
Apache Spark primarily supports Scala as its native programming language, in which the framework is implemented on the Java Virtual Machine (JVM). As of Spark 4.0, applications using the Scala API must use Scala 2.13 for compatibility.[6] This allows developers familiar with functional programming paradigms to leverage Scala's concise syntax for building Spark applications directly against the core abstractions.[6]
The Java API provides a full-featured interface tailored for enterprise environments, offering semantics closely aligned with the Scala API to ensure consistency in operations like data transformations and job submissions. Spark 4.0 requires Java 17 or 21.[6] It enables Java developers to interact with Spark's resilient distributed datasets (RDDs) and higher-level structures without needing to learn Scala-specific constructs.
PySpark offers a high-level Python API through Pythonic wrappers around Spark's JVM-based components, making it particularly popular among data scientists for its integration with libraries like NumPy and Pandas. As of Spark 4.0, Python 3.9 or higher is required, building on support for Python 3.6+ in early 3.x releases and 3.7+ in later 3.x versions.[6] Key features include Pandas User-Defined Functions (UDFs), which enable vectorized operations on batches of data using Apache Arrow for efficient transfer between Spark and Pandas DataFrames.[53] Additionally, PySpark supports broadcast joins via hints that mark small DataFrames for efficient distribution across the cluster, optimizing performance in join operations. Spark 4.0 introduces enhancements such as an improved pandas API on Spark for better compatibility with pandas workloads, a native Python Data Source API for custom data sources in Python, and a lightweight pyspark-client (1.5 MB) for easier remote connections.[17] In Spark 4.1.0, further improvements to PySpark UDFs and Data Sources include new Arrow-native UDF and UDTF decorators for efficient PyArrow execution without Pandas conversion overhead, and Python Data Source filter pushdown to reduce data movement.[54]
SparkR integrates R's DataFrame API with Spark, allowing R users to perform distributed data processing with operations resembling those in base R and packages like dplyr, such as filtering, aggregation, and grouping. As of Spark 4.0, the R API requires R 3.5+ but is deprecated and scheduled for removal in a future release.[6] It supports column-based access through functions like select() and column(), enabling intuitive manipulation of Spark DataFrames in an R-native style.
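The broadcast-join hint mentioned above has a direct counterpart in the Scala API; a minimal sketch, where factDf and dimDf are assumed DataFrames sharing an id column and dimDf is small enough to ship to every executor:
import org.apache.spark.sql.functions.broadcast
val joined = factDf.join(broadcast(dimDf), Seq("id")) // hint Spark to broadcast dimDf and use a broadcast hash join instead of shuffling both sides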
These language APIs are utilized across Spark's modules, including Spark SQL and MLlib, to provide unified access to data processing and machine learning capabilities.[6] Language support evolves with Spark releases; for instance, the deprecation of SparkR in Spark 4.0 signals a shift toward stronger Python and Scala focus.[6]