Presto (SQL query engine)
| Presto | |
|---|---|
| Original authors | Martin Traverso, Dain Sundstrom, David Phillips, Eric Hwang |
| Initial release | 10 November 2013 |
| Written in | Java |
| Operating system | Cross-platform |
| Standard | SQL |
| Type | Data warehouse |
| License | Apache License 2.0 |
| Website | |

Presto (including PrestoDB and PrestoSQL, which was rebranded to Trino) is a distributed query engine for big data using the SQL query language. Its architecture allows users to query data sources such as Hadoop, Cassandra, Kafka, AWS S3, Alluxio, MySQL, MongoDB and Teradata,[1] and allows use of multiple data sources within a query. Presto is community-driven open-source software released under the Apache License.
History
Presto was originally designed and developed at Facebook, Inc. (later renamed Meta) for their data analysts to run interactive queries on its large data warehouse in Apache Hadoop. The first four developers were Martin Traverso, Dain Sundstrom, David Phillips, and Eric Hwang. Before Presto, the data analysts at Facebook relied on Apache Hive for running SQL analytics on their multi-petabyte data warehouse.[2] Hive was deemed too slow for Facebook's scale, and Presto was invented to fill the gap to run fast queries.[3] Original development started in 2012, and Presto was deployed at Facebook later that year. In November 2013, Facebook announced its open source release.[3][4]
In 2014, Netflix disclosed they used Presto on 10 petabytes of data stored in the Amazon Simple Storage Service (S3).[5] In November 2016, Amazon announced a service called Athena that was based on Presto.[6] In 2017, Teradata spun out a company called Starburst Data to commercially support Presto, which included staff Teradata had acquired from Hadapt in 2014.[7] Teradata's QueryGrid software allowed Presto to access a Teradata relational database.[8]
In January 2019, the Presto Software Foundation was announced. The foundation is a not-for-profit organization for the advancement of the Presto open source distributed SQL query engine.[9][10] At the same time, Presto development forked: PrestoDB, maintained by Facebook, and PrestoSQL, maintained by the Presto Software Foundation, with some cross-pollination of code.
In September 2019, Facebook donated PrestoDB to the Linux Foundation, establishing the Presto Foundation.[11] Neither the creators of Presto, nor the top contributors and committers, were invited to join this foundation.[12]
By 2020, all four of the original Presto developers had joined Starburst.[13] In December 2020, PrestoSQL was rebranded as Trino, since Facebook had obtained a trademark on the name "Presto" (also donated to the Linux Foundation).[14]
Another company called Ahana was announced in 2020 to commercialize the PrestoDB fork as a cloud service and was acquired by IBM in 2023.[15]
Architecture
Presto's architecture is very similar to that of other database management systems using cluster computing, sometimes called massively parallel processing (MPP). One coordinator works in sync with multiple workers. Clients submit SQL statements that are parsed and planned, following which parallel tasks are scheduled to workers. Workers jointly process rows from the data sources and produce results that are returned to the client. Compared to the original Apache Hive execution model, which used the Hadoop MapReduce mechanism on each query, Presto does not write intermediate results to disk, resulting in a significant speed improvement. Presto is written in Java.
A Presto query can combine data from multiple sources. Presto offers connectors to data sources including files in Alluxio, Hadoop Distributed File System (often called a data lake), Amazon S3, MySQL, PostgreSQL, Microsoft SQL Server, Amazon Redshift, Apache Kudu, Apache Phoenix, Apache Kafka, Apache Cassandra, Apache Accumulo, MongoDB and Redis. Unlike other Hadoop distribution-specific tools, such as Apache Impala, Presto can work with any variant of Hadoop or without it. Presto supports separation of compute and storage and may be deployed on-premises or using cloud computing.
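As an illustration of such a federated query, the sketch below uses the open-source presto-python-client to join a table exposed through a Hive catalog with one from a MySQL catalog. The coordinator address, catalog, schema, table, and column names are illustrative assumptions rather than features of any particular deployment.

```python
# Minimal sketch of a federated query through the Presto DB-API client
# (pip install presto-python-client). All names below are hypothetical.
import prestodb

conn = prestodb.dbapi.connect(
    host="localhost",   # assumed coordinator address
    port=8080,          # default coordinator HTTP port
    user="analyst",
    catalog="hive",     # default catalog for unqualified table names
    schema="default",
)
cur = conn.cursor()

# One statement can reference tables from different connectors by using
# fully qualified catalog.schema.table names.
cur.execute("""
    SELECT o.order_id, c.customer_name, o.total
    FROM hive.sales.orders AS o
    JOIN mysql.crm.customers AS c
      ON o.customer_id = c.customer_id
    WHERE o.order_date >= DATE '2024-01-01'
""")
for row in cur.fetchall():
    print(row)
```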
See also
References
[edit]- ^ 1.1. Teradata Distribution of Presto — Teradata Distribution of Presto 0.167-t.0.2 Documentation
- ^ Mike Volpi (November 20, 2019). "Starburst and Presto: with Stellar Velocity". Index Ventures Blog. Retrieved January 27, 2022.
- ^ a b Joab Jackson (November 6, 2013). "Facebook goes open source with query engine for big data". Computer World. Retrieved April 26, 2017.
- ^ Jordan Novet (June 6, 2013). "Facebook unveils Presto engine for querying 250 PB data warehouse". Giga Om. Archived from the original on June 8, 2013. Retrieved April 26, 2017.
- ^ Eva Tse; Zhenxiao Luo; Nezih Yigitbasi (October 7, 2014). "Using Presto in our Big Data Platform on AWS". Netflix technical blog. Retrieved April 26, 2017.
- ^ Jeff Barr (November 30, 2016). "Amazon Athena – Interactive SQL Queries for Data in Amazon S3". AWS News Blog. Retrieved January 27, 2022.
- ^ Philip Howard (December 21, 2017). "Teradata spins off Starburst". Bloor. Retrieved January 26, 2022.
- ^ Lindsay Clark (December 17, 2020). "Hey Presto! Teradata admits its vision is dead by hooking QueryGrid analytics platform up to rival data warehouses". The Register. Retrieved January 26, 2022.
- ^ "Presto Software Foundation Launches to Advance Presto Open Source Community". Press release. January 31, 2019. Retrieved January 2, 2022.
- ^ "Presto's New Foundation Signals Growth for the Big Data SQL Engine". The New Stack. 2019-01-31. Retrieved 2019-02-01.
- ^ "Facebook, Uber, Twitter and Alibaba form Presto Foundation to Tackle Distributed Data Processing at Scale". 23 September 2019. Retrieved 2019-11-12.
- ^ Piotr Findeisen (November 22, 2019). "What's the relationship between prestosql and prestodb?". Comment on issue #38 of Trino Github. Retrieved January 27, 2022.
- ^ "Original Presto Co-Creators Reunite on the Starburst Technical Leadership Team". Press release. September 22, 2020. Retrieved January 26, 2022.
- ^ Martin Traverso, Dain Sundstrom, David Phillips (December 27, 2020). "We're rebranding PrestoSQL as Trino". Trino blog. Retrieved January 26, 2022.
- ^ Gillin, Paul (April 14, 2023). "IBM acquires Ahana, joins the Presto Foundation". SiliconANGLE. Retrieved April 20, 2023.
External links
Presto (SQL query engine)
Overview
Definition and Purpose
Presto is an open-source, distributed SQL query engine designed for interactive ad-hoc analytics on big data.[6] It enables users to execute standard SQL queries across heterogeneous data sources, such as Hadoop, Cassandra, and relational databases, without requiring data movement or preprocessing.[6] Developed initially at Facebook in 2012, Presto addressed the need for rapid querying of the company's vast data warehouse, allowing analysts to derive insights in seconds rather than hours.[6][7]

The core purpose of Presto is to facilitate fast analytic queries on petabyte-scale datasets, emphasizing accessibility for data analysts through a familiar SQL interface.[8] By federating queries across multiple storage systems in a single cluster, it eliminates the need for extract, transform, and load (ETL) processes, reducing complexity and enabling real-time decision-making.[6] At organizations like Meta, Presto processes hundreds of petabytes daily, supporting diverse workloads from sub-second reporting to longer-running jobs.[6][7]

In its basic workflow, users submit SQL queries that are parsed, optimized, and executed in parallel across a distributed architecture, leveraging in-memory processing for high performance.[6] This design prioritizes scalability and extensibility, allowing seamless integration with various data connectors while maintaining ANSI SQL compliance.[8]

Key Distributions
Presto, originally developed at Facebook, has evolved into two primary distributions following a project split, each maintaining distinct focuses within the open-source ecosystem. PrestoDB is maintained by the Presto Foundation, which operates under the Linux Foundation umbrella to ensure neutral governance and community collaboration.[1][9] This distribution emphasizes core engine stability and seamless enterprise integrations, such as its use in Amazon Web Services' Athena for serverless querying of data lakes.[10] As of 2025, PrestoDB remains the choice for environments prioritizing reliability in production-scale deployments tied to its foundational contributions from early adopters like Facebook and Uber.[11]

In contrast, Trino (formerly known as PrestoSQL) was forked in 2018 to accelerate innovation beyond the original project's pace and rebranded in 2020 under the Trino Software Foundation for independent governance.[12][13] This variant prioritizes community-driven enhancements, broader support for diverse data connectors, and rapid iteration to address evolving analytics needs.[14] Trino's governance model fosters a more decentralized, volunteer-led structure, distancing it from the original Facebook-influenced direction of PrestoDB.[13]

Key governance differences highlight their divergent paths: PrestoDB retains ties to its origins through contributions from founding companies like Facebook, focusing on conservative stability, while Trino exhibits higher open-source activity with more frequent releases, often quarterly or faster, to incorporate new features and optimizations.[15] As of 2025, adoption trends show PrestoDB prevalent in proprietary, managed services like AWS Athena for cost-effective, integrated querying, whereas Trino dominates open ecosystems, powering platforms such as Starburst for federated data access in hybrid environments.[16][17]

History
Origins and Development
Presto was developed in the fall of 2012 by a small team of engineers in Facebook's Data Infrastructure group, including Martin Traverso, Dain Sundstrom, David Phillips, Eric Hwang, Nileema Shingte, and Ravi Murthy, to enable interactive SQL queries on the company's vast data warehouse.[18][12] The project addressed key limitations in the existing Hadoop ecosystem, where tools like MapReduce and Hive were designed for high-throughput batch processing rather than low-latency ad-hoc analysis, often leaving data analysts waiting hours for query results on terabyte- and petabyte-scale datasets.[18]

The initial motivations stemmed from the need to boost productivity for Facebook's data scientists, analysts, and engineers by supporting complex, interactive queries across diverse storage systems without the inefficiencies of disk-based MapReduce jobs. Early evaluations of external query engines revealed shortcomings in flexibility and scalability for Facebook's environment, prompting the team to build a custom solution. The prototypes were implemented in Java, emphasizing an in-memory, pipelined execution model to minimize latency and avoid intermediate disk spills, while incorporating extensible connectors for sources like HDFS (via Hive) and Cassandra.[18]

Presto's first internal deployment occurred in early 2013, initially supporting queries across HDFS, Hive, and Cassandra to handle Facebook's petabyte-scale data warehouse. By spring 2013, it had scaled to over 1,000 nodes and was fully rolled out company-wide, marking a significant shift toward interactive analytics. This internal success led to its open-sourcing later that year.[18]

Open-Sourcing and Forks
Presto was initially developed internally at Facebook and open-sourced in 2013 under the Apache License 2.0, with the original GitHub repository hosted at github.com/facebook/presto, featuring contributions primarily from Facebook engineers such as Dain Sundstrom and Martin Traverso.[19] The project quickly gained traction within the open-source community, leading to widespread adoption by major organizations; by 2015, companies like Netflix and Uber had integrated Presto into their data analytics pipelines, with Netflix deploying it in production as early as 2014 to query petabyte-scale data across diverse sources. This growth culminated in the formation of the Presto Foundation under the Linux Foundation in September 2019, established by founding members including Facebook, Uber, Twitter, and Alibaba to provide neutral governance, foster community contributions, and ensure the project's long-term sustainability.[4]

In March 2018, tensions arose within the community over the project's direction, particularly concerns about increasing commercialization efforts, which some contributors felt risked prioritizing proprietary features over open development.[14] This led to a fork by key maintainers, including Dain Sundstrom, Martin Traverso, and David Phillips, who created PrestoSQL to maintain a focus on rapid innovation and community-driven enhancements without commercial constraints. The original project was subsequently renamed PrestoDB to distinguish the variants, with PrestoSQL continuing active development until it was rebranded as Trino in December 2020 due to trademark conflicts, as Facebook had registered "Presto" and donated it to the Presto Foundation, prompting the fork's maintainers to seek a new identity to avoid legal issues and affirm their independent path.[12][16]

As of November 2025, PrestoDB has reached version 0.295, released on October 1, 2025, emphasizing stability through incremental improvements in query reliability and connector compatibility while maintaining compatibility with existing deployments.[20] In parallel, Trino has advanced to version 478, released on October 29, 2025, incorporating enhanced fault tolerance features such as improved task retry mechanisms and adaptive query recovery to better handle failures in large-scale distributed environments.[21] These developments reflect the divergent yet complementary evolutions of the two projects, with PrestoDB prioritizing enterprise stability under the Presto Foundation and Trino focusing on cutting-edge scalability through its own Trino Software Foundation (established in 2019 as the Presto Software Foundation and renamed in 2020) to support ongoing community governance.[22][12][23]

Technical Features
SQL Standards and Extensions
Presto adheres to ANSI SQL standards, supporting core constructs such as SELECT statements, JOIN operations, GROUP BY clauses, subqueries, and window functions, which enable complex analytical queries across distributed data sources.[24] This compliance facilitates seamless integration with standard SQL tools and clients, including business intelligence platforms like Tableau and Power BI.[5] While PrestoDB and Trino (the primary continuation of PrestoSQL) both maintain this foundational support, their implementations ensure compatibility with SQL:2011 features where applicable.[25]

Presto extends standard SQL with specialized functions optimized for big data environments, including approximate aggregates like approx_distinct, which estimates the number of unique values using HyperLogLog sketches for efficient processing of large datasets.[26] Geospatial capabilities are provided through ST_ prefixed functions, such as ST_Area and ST_Buffer, compliant with the Open Geospatial Consortium (OGC) Simple Features specification for spatial analysis. Additionally, built-in JSON operators like json_extract and json_value allow querying and manipulating semi-structured data without external preprocessing.[27]
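The sketch below illustrates a few of these extensions through the same kind of DB-API client; the connection details, table names, and columns are hypothetical, and the exact function set available varies between PrestoDB and Trino versions.

```python
# Illustrative use of approximate, geospatial, and JSON functions.
# Connection details and tables are hypothetical.
import prestodb

conn = prestodb.dbapi.connect(
    host="localhost", port=8080, user="analyst",
    catalog="hive", schema="web",
)
cur = conn.cursor()

# approx_distinct: HyperLogLog-based estimate of unique users per day.
cur.execute("""
    SELECT dt, approx_distinct(user_id) AS est_users
    FROM page_views
    GROUP BY dt
""")
print(cur.fetchall())

# Geospatial: area of a polygon supplied as WKT text.
cur.execute(
    "SELECT ST_Area(ST_GeometryFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))"
)
print(cur.fetchall())

# JSON: extract a scalar field from a JSON-encoded column and aggregate.
cur.execute("""
    SELECT json_extract_scalar(payload, '$.event_type') AS event_type,
           count(*) AS events
    FROM raw_events
    GROUP BY 1
""")
print(cur.fetchall())
```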
As a distributed query engine, Presto is oriented toward read-heavy analytical workloads rather than transactional data modification; DDL and DML support, such as CREATE TABLE AS and INSERT, is limited and depends on the capabilities of the underlying connector, and the engine instead focuses on federated querying across heterogeneous sources.[28] Parameterized queries are supported through client drivers, enhancing security by preventing SQL injection in interactive and ad-hoc workloads.[29]
Dialect variations exist between PrestoDB and Trino, with Trino introducing advanced extensions such as machine learning functions (e.g., learn_classifier for training SVM models within SQL), which expand analytical capabilities beyond PrestoDB's core offerings.[30] These differences are minor and generally backward-compatible, allowing most queries to execute across both distributions with minimal adjustments.[14]
Performance and Scalability
Presto achieves high performance through its in-memory, pipelined query execution model, which processes data in columnar format without intermediate disk writes, enabling sub-second query times on terabyte-scale datasets.[31] This vectorized approach leverages dynamic code generation and streaming from data sources, minimizing latency for interactive analytics workloads.[3]

For scalability, Presto supports horizontal scaling by adding worker nodes to the cluster, allowing it to handle massive workloads across distributed environments. At Facebook (now Meta), Presto processes hundreds of petabytes of data and quadrillions of rows daily across thousands of nodes in multiple data centers.[31] This design ensures fault tolerance and elastic resource allocation, supporting both low-latency ad-hoc queries and long-running batch jobs without disrupting ongoing operations.[28]

Key optimizations in Presto include predicate and projection pushdown to data sources, which reduces data transfer by filtering and selecting only necessary columns at the connector level. The cost-based optimizer uses table statistics to evaluate join orders and distribution types, automatically selecting strategies like broadcast or partitioned joins to minimize CPU and network costs.[32] Additionally, history-based query optimization refines estimates for complex queries by learning from past executions, improving accuracy over traditional rule-based methods.[33]

Benchmarks demonstrate Presto's efficiency for ad-hoc workloads. Recent Presto C++ implementations further boost TPC-DS 100TB performance, outperforming alternatives like Databricks Photon in price-performance ratio.[34]
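As a sketch of how these optimizations can be observed in practice, the example below (with hypothetical connection and table names) collects table statistics with ANALYZE, assuming a connector such as Hive that supports the statement, and then prints the distributed plan produced by the cost-based optimizer using EXPLAIN (TYPE DISTRIBUTED).

```python
# Sketch: gather statistics for the optimizer, then inspect the plan it
# produces. Connection details and the "orders" table are hypothetical.
import prestodb

conn = prestodb.dbapi.connect(
    host="localhost", port=8080, user="analyst",
    catalog="hive", schema="sales",
)
cur = conn.cursor()

# ANALYZE collects table and column statistics (row counts, distinct values)
# that the cost-based optimizer uses to choose join order and distribution.
cur.execute("ANALYZE orders")
cur.fetchall()  # drain the result to make sure the statement has completed

# EXPLAIN (TYPE DISTRIBUTED) shows the fragmented plan, including filters
# pushed down into the table scan and the chosen join/exchange strategy.
cur.execute("""
    EXPLAIN (TYPE DISTRIBUTED)
    SELECT region, sum(total) AS revenue
    FROM orders
    WHERE order_date >= DATE '2024-01-01'
    GROUP BY region
""")
for row in cur.fetchall():
    print(row[0])
```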
Architecture
Core Components
Presto operates as a distributed SQL query engine, relying on a cluster of nodes to handle query processing across large datasets. The core components form a master-worker architecture, where one or more coordinators oversee operations and multiple workers perform the actual computation, enabling scalability and fault tolerance.[35][6]

The coordinator node serves as the central point of control in the Presto cluster, responsible for parsing incoming SQL statements, generating optimized query plans, and scheduling tasks across worker nodes. It manages metadata, coordinates worker assignments, and acts as the interface for client connections, using a REST API for communication with workers. In typical deployments, the coordinator runs on dedicated hardware to handle these orchestration duties without participating in data processing, though in single-node setups it can double as a worker. Every Presto cluster requires at least one coordinator to function. For larger deployments with multiple coordinators, a resource manager aggregates data from all coordinators and workers to provide a global view of the cluster, using a Thrift API for communication and supporting coordinated resource allocation.[35][36][37][6]

Worker nodes execute the distributed tasks assigned by the coordinator, processing data in parallel to support high-throughput queries. Each worker fetches data from underlying sources via connectors, performs computations such as filtering, aggregation, and joins, and exchanges intermediate results with other workers as needed. As of 2025, Presto also supports a Native Worker, implemented in C++ as a drop-in replacement for the traditional Java-based worker, to reduce CPU and memory footprint while maintaining compatibility through integration with the Velox library and supporting key connectors like Hive and Iceberg. Workers register themselves with the discovery service upon startup and communicate via REST API, allowing the cluster to scale by adding more workers to handle increased load. In production environments, clusters can comprise hundreds to thousands of workers for petabyte-scale analytics.[35][36][38][6]

The discovery service facilitates dynamic node management by allowing workers to advertise their availability to the coordinator, enabling automatic cluster scaling and fault recovery. Presto includes an embedded discovery server within the coordinator, activated via the discovery-server.enabled=true property, where nodes register upon launch. Alternative configurations use the discovery.uri property to specify the URI of the discovery service, typically pointing to the coordinator's HTTP endpoint, for setups without an embedded server. The embedded option is standard for most PrestoDB clusters.[35][36]
Configuration elements are essential for tuning Presto's behavior and integrating data sources, managed through property files in the installation directory. JVM settings, defined in etc/jvm.config, control memory management and garbage collection to optimize performance; for example, properties like -Xmx16G set maximum heap size, while -XX:+UseG1GC enables the G1 garbage collector to handle large heaps efficiently. Catalog files, located in etc/catalog/, define data sources with properties such as connector.name=hive-hadoop2 to specify the connector type and hive.metastore.uri for metadata access, allowing Presto to interface with diverse storage systems without code changes. These configurations ensure reliable operation and adaptability in distributed environments.[36][39]
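As a hedged sketch of how these files might be laid out for a small cluster, the fragment below shows a coordinator's etc/config.properties together with a Hive catalog definition; host names, ports, memory limits, and the metastore URI are placeholder values, and worker nodes would use a similar config.properties with coordinator=false and discovery.uri pointing at the coordinator.

```properties
# etc/config.properties (coordinator) -- illustrative values
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=8GB
discovery-server.enabled=true
discovery.uri=http://coordinator.example.com:8080

# etc/catalog/hive.properties -- registers a "hive" catalog
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore.example.com:9083
```

A corresponding etc/jvm.config lists one JVM option per line, for example -Xmx16G to cap the heap and -XX:+UseG1GC to select the G1 garbage collector, as noted above.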
