Flink
Apache Flink is an open-source, distributed stream processing framework and engine designed for stateful computations over both unbounded (streaming) and bounded (batch) data streams.[1]
Originating from the Stratosphere research project initiated in 2009 at the Technische Universität Berlin, Flink was donated to the Apache Software Foundation in 2014, where it entered the incubator and was renamed from Stratosphere to Flink.[2] The project graduated to top-level Apache status in December 2014, marking a pivotal shift toward widespread adoption in big data processing.[3] Key milestones include the inaugural Flink Forward conference in 2015, the first large-scale production deployment by Alibaba in 2016, and Alibaba's acquisition of data Artisans (Flink's founding company, later renamed Ververica) in 2019, which integrated enhancements from Alibaba's Blink codebase.[2] In 2023, Flink received the ACM SIGMOD Systems Award for its contributions to stream processing technology.[2]
Flink's architecture emphasizes scalability, fault tolerance, and low-latency performance through a scale-out design that supports common cluster environments like Kubernetes, YARN, and standalone modes, enabling in-memory computations at high throughput.[1] It provides multiple APIs for development, including the high-level SQL API for stream and batch data, the DataStream API for fine-grained stream processing, and the ProcessFunction for advanced time- and state-based logic, all unified under a single runtime.[4] Core features include exactly-once state consistency via checkpointing and two-phase commit protocols, native event-time processing with support for late data, and incremental checkpoints for efficient handling of large state—up to petabyte-scale in production.[1] Operationally, Flink ensures high availability with automatic failover, savepoints for zero-downtime upgrades, and comprehensive monitoring tools.[5]
Common use cases span real-time analytics, event-driven applications, and continuous ETL pipelines, powering fraud detection, recommendation systems, and monitoring at companies like Alibaba (for search optimization), Capital One (real-time monitoring), and Uber (processing billions of events daily).[6][7] As of November 2025, the project boasts nearly 2,000 contributors—approximately half from China—and its latest stable release, Apache Flink 2.1.1, advances cloud-native capabilities, unified batch-stream processing, and integrations with AI/ML workflows, solidifying its role in modern data lakehouses alongside tools like Apache Paimon.[8][2]
Overview
Definition and Purpose
Apache Flink is an open-source, distributed processing engine for stateful computations over unbounded and bounded data streams, supporting both streaming and batch processing within a unified dataflow model.[1] This framework enables the execution of complex, iterative, and multi-stage data processing pipelines across distributed clusters, treating all data as streams to ensure consistent semantics for real-time and historical analytics.[9] The primary purpose of Flink is to deliver low-latency, high-throughput processing of real-time data, powering scalable analytics, event-driven applications, and continuous data pipelines in big data ecosystems. By leveraging in-memory computation and fault-tolerant mechanisms, it facilitates immediate insights from continuous event streams while scaling to handle massive volumes without compromising performance.[6] This design supports diverse use cases, from fraud detection to ETL operations, by providing exactly-once processing guarantees and sophisticated event-time handling.[9] Originating from the Stratosphere research project at the Technical University of Berlin in 2009, Flink was developed to address limitations in earlier stream processors, such as the high latency and added complexity introduced by micro-batching techniques that approximate continuous streaming with discrete intervals. Unlike hybrid systems that maintain separate engines for streaming and batch workloads, Flink unifies them by modeling batch jobs as finite streams, enabling a single runtime for both paradigms with optimized execution.[9]

Core Principles
Apache Flink's core principles revolve around ensuring reliable, efficient, and versatile data processing in distributed environments. A foundational principle is the exactly-once processing guarantee, which ensures that each incoming event affects the final results precisely once, even in the presence of failures, preventing data loss or duplication. This is achieved through a lightweight distributed snapshotting mechanism known as checkpointing, where Flink periodically captures the current application state and input stream positions asynchronously to durable storage such as HDFS or S3. Upon failure recovery, the system restores from the latest checkpoint, replaying only the events processed since that point, combined with a two-phase commit protocol for sinks that coordinates pre-commit and commit phases across operators and external systems like Kafka.[10] Another key principle is the support for event-time semantics over processing-time, enabling accurate handling of out-of-order and delayed data in streams. Event-time processing bases computations on the timestamps embedded in events themselves, rather than the machine's wall-clock time, which decouples application logic from variable network delays, backpressure, or recovery times. Flink facilitates this through watermarks, which indicate the progress of event time and allow the system to bound latency while processing late-arriving events via side outputs or result updates, ensuring consistent and correct results for applications like real-time analytics on unbounded streams.[4][11] Flink emphasizes stateful stream processing, where applications maintain and update state across long-running jobs to support complex operations such as aggregations, joins, and sessionization over unbounded data. This state—potentially spanning terabytes—is stored locally in memory or efficient on-disk structures for low-latency access, with parallelization across thousands of cores to handle trillions of events daily. 
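The watermark mechanics described above can be illustrated with a minimal stdlib-Python sketch (a toy model, not Flink's `WatermarkStrategy` API): the generator trails the maximum event timestamp seen by a fixed out-of-orderness bound, and an event counts as late once the watermark has passed its timestamp. The class name and the 200 ms bound are illustrative assumptions.

```python
# Toy sketch of bounded-out-of-orderness watermarks (not Flink's API).
# Assumption: timestamps are integer milliseconds embedded in each event.

class BoundedOutOfOrdernessWatermarks:
    """Emit watermarks trailing the max seen event time by a fixed bound."""

    def __init__(self, max_out_of_orderness_ms: int):
        self.bound = max_out_of_orderness_ms
        self.max_ts = float("-inf")

    def on_event(self, event_ts: int) -> int:
        """Observe an event timestamp and return the current watermark."""
        self.max_ts = max(self.max_ts, event_ts)
        return self.max_ts - self.bound

    def is_late(self, event_ts: int) -> bool:
        """An event is late if the watermark already passed its timestamp."""
        return event_ts <= self.max_ts - self.bound


wm = BoundedOutOfOrdernessWatermarks(max_out_of_orderness_ms=200)
print(wm.on_event(1_000))  # watermark advances to 800
print(wm.on_event(1_500))  # watermark advances to 1300
print(wm.is_late(1_200))   # True: behind the current watermark of 1300
print(wm.is_late(1_400))   # False: still within the out-of-orderness bound
```

In Flink proper, late elements within the configured allowed lateness would still update window results; here the predicate only classifies them.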
Fault tolerance is integrated via incremental checkpointing, which asynchronously persists state changes to ensure exactly-once consistency without halting processing, making it suitable for production-scale event-driven applications.[12] Underpinning these is Flink's unified batch-stream model, which treats batch jobs as special cases of finite (bounded) streams, allowing a single runtime, APIs, and semantics to handle both continuous streaming and historical batch processing seamlessly. This streaming-first philosophy enables consistent query execution—such as using the same DataStream or Table API for real-time and offline analytics—while optimizing bounded workloads with specialized operators like hybrid hash joins for improved throughput. By unifying computation, Flink simplifies development, supports mixed execution modes, and facilitates tasks like data reprocessing or bootstrapping streaming jobs from batch results.[13][14]

History
Origins and Early Development
Apache Flink originated as the Stratosphere research project, initiated in 2009 and funded by the German Research Foundation (DFG) under grant FOR 1306.[15][16] The project was led by Volker Markl at Technische Universität Berlin (TU Berlin), in collaboration with Humboldt-Universität zu Berlin (HU Berlin) and the Hasso Plattner Institute (HPI) at the University of Potsdam.[15][17] This academic consortium aimed to develop a scalable platform for big data analytics, building on prior work in parallel query processing and optimization techniques from database systems.[17] The primary motivations for Stratosphere were to overcome key limitations in existing big data frameworks like Hadoop MapReduce, which struggled with efficiency in iterative algorithms common in machine learning, graph processing, and statistical analysis, as well as real-time streaming workloads.[17] MapReduce's batch-oriented, single-pass model incurred high overhead from repeated data loading and disk spills, making it unsuitable for programs requiring multiple iterations over datasets until convergence.[17] Stratosphere addressed these gaps through a pipelined execution engine (Nephele) that supported both bulk and incremental iterations, enabling low-latency processing and better resource utilization on distributed clusters, while integrating declarative higher-level APIs for improved programmer productivity.[17][16] After approximately three years of development, the Stratosphere team released its first open-source version in 2011, marking the project's transition from pure research to a publicly available platform.[16] To broaden adoption and ensure long-term sustainability, the core developers donated the codebase to the Apache Software Foundation on April 9, 2014, entering the Apache Incubator shortly thereafter.[3][18] Due to trademark conflicts with the existing "Stratosphere" name held by a commercial entity, the project was renamed Apache Flink—derived from the Low German word "flink," meaning swift or agile—to emphasize its focus on efficient, low-latency data processing.[16] Flink graduated to top-level Apache project status on December 17, 2014, solidifying its position as an open-source standard for unified batch and stream processing.[18]

Major Releases and Milestones
Apache Flink's journey under Apache governance began with its entry into the Apache Incubator in April 2014, marking a significant milestone in its transition from an academic project to an open-source framework supported by a growing community. Early incubating releases included Apache Flink 0.6-incubating in August 2014 and 0.8-incubating in January 2015, introducing foundational streaming capabilities and laying the groundwork for unified batch and stream processing. These early releases emphasized low-latency data processing and fault tolerance, attracting early industry interest, including contributions from Alibaba, which began integrating Flink into its real-time platforms.[2][19] Subsequent releases built on this foundation, with Apache Flink 1.0 launching on March 8, 2016, as the project's first stable version, solidifying its API stability and introducing the DataStream API for complex event processing. By 2019, Alibaba's contributions accelerated growth through the upstreaming of its Blink fork, enhancing query optimization and performance for large-scale deployments. The project continued to mature, reaching Apache Flink 1.20 LTS in August 2024, which focused on improved operator chaining and resource management for better scalability in cloud environments.[13] In 2024, Flink celebrated its 10th anniversary at Flink Forward Berlin, highlighting community achievements and future directions, including the donation of Flink CDC by Ververica in April 2024 for simplified change data capture integrations and no-code data pipelines.
The momentum carried into 2025 with Apache Flink 2.0, released on March 24, 2025, which dropped support for Java 8, shifted to Python 3.10+ for better ecosystem compatibility, and introduced disaggregated state backends to enable horizontal scaling of state beyond single-job limits, improving fault tolerance for massive workloads.[20][21][22] The stable Apache Flink 2.1.0 followed in July 2025, refining these advancements with optimizations for Kubernetes-native deployments and further ecosystem integrations. On November 10, 2025, Apache Flink 2.1.1 was released as the latest stable version, incorporating bug fixes, vulnerability patches, and minor improvements. These developments reflect the project's robust growth under Apache stewardship and contributions from over 165 developers in recent cycles.[8][23]

Architecture
Runtime Environment
Apache Flink's runtime environment provides a distributed execution platform for processing large-scale data streams and batch jobs, consisting primarily of the JobManager and TaskManager components. The JobManager acts as the central coordinator, responsible for managing job submissions, scheduling tasks, and overseeing the overall execution lifecycle.[12] TaskManagers, on the other hand, are worker nodes that execute the actual computational tasks in parallel, handling data buffering, stream exchanges, and local resource management.[12] Flink employs a pipelined execution model where applications are compiled into a logical dataflow graph and parallelized into subtasks distributed across TaskManagers for concurrent processing. This model enables efficient, low-latency operations by allowing data to flow continuously through operator chains without intermediate materialization, optimizing for both streaming and batch workloads.[12] Each TaskManager divides its resources into task slots, the smallest unit of scheduling, which can host multiple operators from the same pipeline stage to maximize throughput.[24] Flink supports multiple cluster deployment modes to accommodate various environments, including Standalone for local development and testing, which runs a simple cluster without external dependencies.[25] For production setups, it integrates with resource managers such as YARN for dynamic allocation in Hadoop ecosystems and Kubernetes for container orchestration.[25] Additionally, Flink facilitates Docker-based deployments through its resource providers and offers cloud-native options via vendor solutions like those from AWS and Alibaba Cloud, enabling seamless scaling in managed environments.[25] Scalability in Flink's runtime is achieved through horizontal scaling, where additional TaskManagers can be added to the cluster to handle increased workloads, supporting applications that process trillions of events per day across thousands of cores.[12] Dynamic resource allocation allows the system to automatically request or release resources based on job requirements and configured parallelism, ensuring efficient utilization without manual intervention.[12] High availability is provided by deploying multiple JobManagers in a leader election setup, with standby instances taking over in case of failures to maintain continuous operation.[25] Resource management in Flink relies on integration with external orchestrators, such as Kubernetes, which handles container provisioning, scaling, and isolation for JobManagers and TaskManagers.[25] In application mode, each job runs in its own isolated cluster, preventing resource contention, while session mode shares a single cluster among multiple jobs for better resource efficiency in multi-tenant scenarios.[25]

Dataflow Model
Flink's dataflow programming model represents applications as directed acyclic graphs (DAGs), where nodes correspond to operators that perform transformations on data, and edges represent the flow of data streams between these operators. Sources initiate the graph by ingesting data, while sinks terminate it by outputting results; common operators include map for one-to-one transformations (e.g., converting a string to an integer) and filter for selecting elements based on predicates. This paradigm allows developers to compose complex topologies declaratively, with the runtime handling parallelization and distribution transparently.[26] A key aspect of the model is the uniform treatment of unbounded and bounded streams, enabling a single API to handle both continuous, event-driven data (unbounded streams that arrive indefinitely) and finite datasets (bounded streams processed in batch fashion). Unbounded streams are processed incrementally as events occur, supporting real-time applications, whereas bounded streams are fully buffered before complete computation, akin to traditional batch processing. This unification simplifies development by avoiding separate codebases for streaming and batch workloads, with the same graph structure applying to both.[26] Operator chaining optimizes the dataflow by automatically pipelining compatible operators into single tasks, reducing serialization, deserialization, and network overhead. For instance, consecutive one-to-one operators like map followed by filter can be fused if they preserve partitioning and ordering, executing within the same thread to minimize latency. Developers can control chaining explicitly using methods like startNewChain() to break chains where needed, such as before operations that redistribute the stream. This pipelining is enabled by default and forms the basis for efficient stream processing.
The runtime further enhances efficiency through graph-level optimizations, including rewrites that fuse operators and adjust the execution plan based on the topology. These transformations convert the logical StreamGraph (built from the user's program) into an optimized JobGraph for execution, incorporating fusions to eliminate intermediate buffers and streamline data movement. Such optimizations ensure low-latency processing while maintaining fault tolerance, without requiring user intervention.
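The effect of operator fusion can be sketched in plain Python (a toy model, not Flink's runtime): two one-to-one operators are composed into a single per-element function, so no intermediate buffer or handoff sits between them. The `fuse` helper and the parse-then-keep-evens pipeline are hypothetical examples chosen to mirror the map/filter illustration above.

```python
# Toy illustration of operator chaining: fusing consecutive one-to-one
# operators so each element takes a single function call end to end,
# instead of being serialized and handed off between separate tasks.
from typing import Callable, Optional

def fuse(map_fn: Callable[[str], int],
         filter_fn: Callable[[int], bool]) -> Callable[[str], Optional[int]]:
    """Compose a map and a filter into one per-element function."""
    def fused(element: str) -> Optional[int]:
        mapped = map_fn(element)                      # one-to-one transform
        return mapped if filter_fn(mapped) else None  # drop filtered elements
    return fused

# Example mirroring the text: parse strings to integers, keep even values.
pipeline = fuse(int, lambda x: x % 2 == 0)
stream = ["1", "2", "3", "4"]
results = [r for e in stream if (r := pipeline(e)) is not None]
print(results)  # [2, 4]
```

In Flink the analogous fusion happens automatically when compatible operators share partitioning and ordering; this sketch only shows why fusion removes per-element overhead.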
State Management and Fault Tolerance
Flink's state management enables stateful stream processing by allowing operators to maintain and update local data structures across events, supporting both keyed and operator state primitives. Keyed state is partitioned by keys and accessed locally for scalability, while operator state is scoped to parallel operator instances. These mechanisms ensure that applications can perform computations that depend on historical data, such as aggregations or windowing, without losing information during processing.[27] Checkpointing provides the core fault tolerance in Flink through periodic, distributed snapshots of application state, coordinated via checkpoint barriers injected into data streams at sources. These barriers propagate downstream, triggering operators to snapshot their state asynchronously while continuing to process records, minimizing latency impact through a copy-on-write approach. This mechanism, inspired by the Chandy-Lamport algorithm, ensures consistent global snapshots without pausing the stream, enabling exactly-once processing semantics when combined with replayable sources and transactional sinks.[28][29] Savepoints extend checkpointing for operational flexibility, serving as manually triggered, portable snapshots that capture the entire job state for purposes beyond recovery, such as version upgrades, cluster rescaling, or migrations. Unlike automatic checkpoints, which are primarily for fault recovery and may be discarded after use, savepoints are retained indefinitely and stored in a configurable directory on durable storage like HDFS or S3, allowing jobs to resume from them even across different Flink versions or cluster configurations. Creation is initiated via the Flink CLI or REST API, and resumption supports partial state restoration if needed.[30] State backends define how and where state is stored locally and snapshotted during checkpoints, with options tailored to workload needs. 
The HashMapStateBackend keeps state in JVM heap objects for fast access but is memory-bound and uses full snapshots, making it suitable for smaller states. In contrast, the EmbeddedRocksDBStateBackend leverages RocksDB for disk-persistent storage, supporting incremental checkpoints to reduce overhead for large states (up to terabytes) and object reuse safety, though it incurs serialization and I/O costs. An experimental ForStStateBackend uses remote LSM-tree storage for cloud-native scenarios with massive state sizes.[31] Fault tolerance is achieved by storing checkpoints in durable locations, such as distributed file systems, allowing rapid recovery upon failure: the system restarts tasks from the latest checkpoint, replays input from the corresponding offset, and restores state without data duplication or loss. Asynchronous checkpointing ensures minimal processing disruption, with configurable parameters like interval (e.g., 1-5 seconds in production) and concurrency limiting resource use. In Flink 2.0, disaggregated state management introduces remote storage as the primary backend, enabling faster rescaling for terabyte-scale states, reduced local disk dependency, and optimized recovery via native file copying (e.g., 2x speedup on S3 with s5cmd), alongside adaptive scheduling to align checkpoints with rescaling operations.[29][32]

Features
Processing Capabilities
Apache Flink excels in processing large-scale streaming data with sub-second latencies, enabling real-time applications to respond rapidly to incoming events.[21] It achieves high throughput, handling tens of millions of events per second in production environments, which supports demanding workloads like fraud detection and recommendation systems.[33] Flink's performance is bolstered by in-memory computations, where data is processed at memory speeds to minimize I/O overhead and maximize efficiency across distributed clusters.[1] A key strength of Flink lies in its event-time processing, which uses timestamps embedded in the data itself rather than the processing machine's clock, ensuring accurate results even with out-of-order or delayed arrivals.[34] Watermarks serve as a mechanism to track progress in event time; a watermark carrying timestamp t indicates that no further events with timestamps less than or equal to t are expected, allowing operators to advance and finalize computations.[34] For late data—events arriving after the watermark surpasses their timestamp—Flink provides configurable allowed lateness, where elements within this grace period are still incorporated into windows and can trigger updates, while excess lateness leads to dropping or redirection to side outputs.[35] Flink supports flexible windowing for aggregations over streams, including time-based windows (tumbling for non-overlapping fixed intervals or sliding for overlapping ones), count-based windows (triggered by element counts), and session windows (grouping active periods separated by inactivity gaps).[36] These windows enable computations like sums or averages, with incremental aggregation via ReduceFunction or AggregateFunction to update results as data arrives, avoiding full recomputation and reducing state overhead.[36]
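The tumbling-window aggregation described above can be sketched with stdlib Python (a toy model, not Flink's window API): each event's timestamp is aligned to the start of its window and a running sum is updated in place, mirroring incremental aggregation with a ReduceFunction rather than buffering raw events. The 1-second window size and the sample events are assumptions for illustration.

```python
# Toy sketch of tumbling event-time windows with incremental aggregation.
from collections import defaultdict

WINDOW_MS = 1_000  # assumed tumbling window size of one second

def window_start(ts: int) -> int:
    """Align a timestamp (ms) to the start of its tumbling window."""
    return ts - (ts % WINDOW_MS)

sums = defaultdict(int)  # one running sum per window, no raw-event buffer
events = [(100, 3), (900, 4), (1_200, 5), (1_999, 1)]  # (timestamp_ms, value)
for ts, value in events:
    sums[window_start(ts)] += value  # incremental update per arriving event

print(dict(sums))  # {0: 7, 1000: 6}
```

Because only the aggregate is kept per window, state grows with the number of open windows rather than the number of events, which is the point of incremental aggregation.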
Flink unifies batch and streaming processing, treating bounded datasets as finite streams to leverage the same runtime and APIs for both paradigms.[37] In batch mode, it optimizes for large-scale ETL pipelines and analytics on static data by enabling sequential scheduling, efficient joins, and materialization of intermediates, delivering exact results with lower resource demands compared to streaming mode on bounded inputs.[37] This approach simplifies development for scenarios like historical data analysis or data warehousing.[6]
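The batch-as-bounded-stream idea can be illustrated with a small Python sketch (not Flink's API): one pipeline definition consumes either a finite collection or a never-ending source, and only the input differs. The `running_total` generator is a hypothetical stand-in for a stateful operator.

```python
# Toy illustration of batch as a bounded stream: the same per-element
# dataflow runs over a finite collection or an incrementally produced
# source; the processing logic does not change.
import itertools
from typing import Iterable, Iterator

def running_total(stream: Iterable[int]) -> Iterator[int]:
    """One pipeline definition usable for both batch and streaming input."""
    total = 0
    for value in stream:
        total += value
        yield total

# Bounded input ("batch"): the stream ends, so the final result is available.
print(list(running_total([1, 2, 3])))     # [1, 3, 6]

# Unbounded-style input: results are consumed incrementally as events arrive.
live = running_total(itertools.count(1))  # source never terminates
print([next(live) for _ in range(3)])     # [1, 3, 6]
```

The bounded case can additionally be scheduled and optimized as a whole (as Flink's batch mode does), while the unbounded case must emit results continuously; the shared definition is what the unified model buys.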
