List of column-oriented DBMSes
From Wikipedia
This article is a list of column-oriented database management system software.
| Database name | Language implemented in | Notes |
|---|---|---|
| Apache Doris | Java & C++ | Open source (since 2017), database for high-concurrency point queries and high-throughput analysis. |
| Apache Druid | Java | Started in 2011 for low-latency massive ingestion and queries. Support and extensions available from Imply Data. |
| Apache Kudu | C++ | Released in 2016 to complete the Apache Hadoop ecosystem |
| Apache Pinot | Java | Open sourced in 2015 for real-time low-latency analytics. Support and extensions available from StarTree. |
| Calpont InfiniDB | C++ | |
| ClickHouse | C++ | Released in 2016 to analyze data that is updated in real time |
| CrateDB | Java | |
| C-Store | C++ | The last release of the original code was in 2006; Vertica, a commercial fork, lives on. |
| DuckDB | C++ | An embeddable, in-process, column-oriented SQL OLAP RDBMS |
| Databend | Rust | An elastic and reliable Serverless Data Warehouse |
| InfluxDB | Rust | Time series database |
| Greenplum Database | C | Support and extensions available from VMware. |
| HEAVY.AI | C++ | Formerly MapD |
| MariaDB ColumnStore | C & C++ | Formerly Calpont InfiniDB |
| Metakit | C++ | |
| MonetDB | C | Open-source (since 2004) columnar Relational DBMS pioneer |
| PostgreSQL cstore_fdw,[1] vops[2] | C | cstore_fdw uses the ORC format |
| StarRocks | Java & C++ | Open source, unified analytics platform for batch and real-time analytics. Support and extensions available from CelerData. |
| VictoriaMetrics | Go | Time series database[3] |
Platform as a Service (PaaS)
- Amazon Redshift
- Microsoft Azure Synapse Analytics (formerly Azure SQL Data Warehouse)
- Google BigQuery
- Oracle Autonomous Data Warehouse Cloud (ADWC)
- Snowflake Computing
- MariaDB SkySQL
- Actian Avalanche
- Vertica Accelerator
- CelerData
Proprietary
- Actian Vector (formerly VectorWise)
- Actuate Corporation BIRT Analytics ColumnarDB
- Dimensional Insight
- Endeca
- EXASOL
- EXtremeDB
- Hydrolix
- IBM Db2
- Infobright
- KDB
- kdb+
- MemSQL (now SingleStore)
- Microsoft SQL Server
- Oracle Database (in-memory option)[4]
- SAND CDBMS
- SAP HANA[5]
- SAP IQ
- SenSage
- SQream
- Teradata
- Vertica (developed from open source C-Store)
- Yellowbrick Data
References
- [1] "Columnar store for analytics with PostgreSQL".
- [2] "Vectorized Operations extension for PostgreSQL".
- [3] "Install and Use VictoriaMetrics time-series database on Ubuntu | ComputingForGeeks". 2023-04-18. Retrieved 2025-10-29.
- [4] "Oracle Database 12c: In-Memory Option" (PDF).
- [5] "Home | SAP HANA". hana.sap.com. Retrieved 2016-07-07.
List of column-oriented DBMSes
From Grokipedia
Introduction
Definition and Characteristics
A column-oriented database management system (DBMS), also known as a columnar DBMS, is a relational database system that stores data tables by column rather than by row, optimizing for read-intensive operations on large datasets. This storage approach groups data by attributes, allowing queries to access only the relevant columns without retrieving entire rows, which significantly reduces I/O overhead and enables efficient compression. Unlike traditional row-oriented systems, columnar storage facilitates vectorized query processing, where operations are applied to entire columns as arrays, leveraging modern CPU instructions for faster execution.[2][5]
Key characteristics of column-oriented DBMSes include their columnar storage format, which physically organizes data by column on disk, enabling sequential access patterns that align with analytical workloads. These systems commonly support advanced compression techniques, such as run-length encoding (RLE) for sequences of repeated values and dictionary encoding to map unique values to smaller codes, achieving compression ratios significantly higher than row-based methods, such as 10:1 compared to 3:1. They are particularly suited for online analytical processing (OLAP) workloads, where aggregations and scans over specific attributes dominate, and integrate well with columnar file formats like Apache Parquet and ORC, which further enhance data portability and query performance in distributed environments.[5][2][6][7]
In contrast to row-oriented DBMSes, which store entire rows contiguously to optimize transactional updates and full-row retrievals in online transaction processing (OLTP) scenarios, column-oriented systems prioritize read efficiency by scanning only necessary columns for queries like sums or averages, avoiding the transfer of irrelevant data. Row-oriented systems excel in random access and write-heavy operations but incur higher costs for column-wise analytics due to fragmented access patterns.
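The column layout and the two lightweight encodings named above (run-length and dictionary encoding) can be illustrated with a small sketch. It is a toy model in plain Python, not the on-disk format of any particular system; the sample table and all function names are hypothetical and chosen only for illustration.

```python
from itertools import groupby

# Hypothetical sample table, stored row by row (one tuple per record).
rows = [
    ("DE", "open", 10),
    ("DE", "open", 20),
    ("DE", "closed", 30),
    ("FR", "closed", 40),
    ("FR", "closed", 50),
]


def to_columns(rows, names):
    """Pivot row tuples into a dict of column name -> list of values."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def rle_encode(values):
    """Run-length encode a sequence as [(value, run_length), ...]."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(runs):
    """Expand [(value, run_length), ...] back into the original list."""
    return [value for value, count in runs for _ in range(count)]


def dict_encode(values):
    """Map each distinct value to a small integer code, in first-seen order."""
    dictionary = {}
    codes = [dictionary.setdefault(v, len(dictionary)) for v in values]
    return list(dictionary), codes


def dict_decode(dictionary, codes):
    """Look each integer code up in the dictionary to recover the values."""
    return [dictionary[c] for c in codes]


columns = to_columns(rows, ("country", "status", "amount"))

# An aggregate such as SUM(amount) reads one column and ignores the others.
total = sum(columns["amount"])

# Sorted or repetitive columns collapse into a few runs.
country_runs = rle_encode(columns["country"])

# Low-cardinality columns become a short dictionary plus integer codes.
status_dictionary, status_codes = dict_encode(columns["status"])
```

In this sketch the five-value `country` column shrinks to two runs, and the `status` column is stored as two dictionary entries plus five small integers. Both encodings work because values within one column tend to resemble each other, which is the property the text attributes to columnar storage.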
Storage mechanics in columnar DBMSes involve grouping columns into projections or extents on disk, often with sorting and indexing to support rapid reconstruction of rows when needed, and exploitation of single instruction, multiple data (SIMD) instructions for parallel processing of column vectors.[2][5]
Benefits and Use Cases
Column-oriented DBMSes offer significant advantages in query performance for analytical workloads, particularly those involving aggregations such as SUM and AVG on large datasets, by accessing only the relevant columns and minimizing I/O overhead.[2] This approach can yield substantial speedups; for instance, in TPC-H benchmarks, a column-oriented system demonstrated 164 times faster performance compared to row-oriented systems due to reduced data scanning and efficient processing of compressed columns.[2] Additionally, columnar storage enables higher storage efficiency through compression techniques that exploit data similarity within columns, often achieving up to 10:1 ratios, which reduces disk space requirements by 40-70% relative to row-stores while supporting vectorized query execution for better CPU utilization.[5] Scalability is enhanced in distributed environments, where horizontal partitioning across nodes allows handling of massive datasets in grid-like setups, tolerating failures through redundancy mechanisms.[2]
These systems excel in use cases centered on read-heavy, analytical processing, such as data warehousing for business intelligence reporting, where complex queries aggregate historical data to inform decisions.[2] They are particularly suited to real-time analytics in domains like telecommunications and scientific computing, including processing large-scale observational data from surveys.[8] Time-series analysis in IoT for sensor monitoring and in finance for market data tracking benefits from their ability to efficiently query temporal patterns without loading entire rows.[5] In contrast, column-oriented DBMSes are less ideal for online transaction processing (OLTP) workloads, where frequent inserts and updates predominate, as row-oriented systems better support transactional integrity and point queries.[5]
Column-oriented DBMSes emerged in the late 1990s and early 2000s to address the limitations of row-oriented systems in data warehousing and online analytical processing (OLAP), with pioneering efforts like those at CWI in the Netherlands driving innovations in main-memory columnar architectures around 2004.[8] However, they involve trade-offs, including slower performance for frequent inserts and updates due to the overhead of reorganizing columnar structures and the need for batch processing to maintain efficiency.[5] These systems often employ hybrid strategies, such as write-optimized storage for updates and read-optimized columnar views, to balance analytical speed with occasional modifications.[2]
Open-Source Column-Oriented DBMSes
General-Purpose Systems
General-purpose open-source column-oriented DBMSes are designed for versatile analytical workloads, including data warehousing and online analytical processing (OLAP), supporting high-performance queries on large datasets through columnar storage and parallel execution. These systems emphasize SQL compatibility, scalability, and integration with big data ecosystems, enabling efficient handling of complex aggregations and joins without domain-specific optimizations.
Apache Doris, initially released in 2017 and implemented primarily in Java and C++, is a massively parallel processing (MPP) real-time data warehouse optimized for high-concurrency queries and sub-second analytical responses on large datasets within the Hadoop ecosystem.[9] It supports real-time data ingestion from streams like Kafka and batch sources, leveraging vectorized execution for accelerated performance in OLAP scenarios.[10]
ClickHouse, first released in 2016 and developed in C++, serves as a high-performance OLAP database management system for real-time data analysis using standard SQL queries.[11] Its columnar architecture enables rapid generation of analytical reports on massive volumes of data, with features like distributed query processing and compression techniques that achieve high throughput for aggregations and filtering.[12]
DuckDB, released in 2019 and built in C++, functions as an embeddable OLAP relational database management system tailored for executing complex analytical queries directly on local data files without requiring a separate server process.[13] It offers in-process execution with full SQL support, including window functions and joins, making it suitable for data science workflows and ad-hoc analysis on datasets up to terabyte scale.[14] DuckDB supports interoperability with PostgreSQL via its postgres extension, allowing direct querying of PostgreSQL databases. For example, the following commands install and load the extension, attach a PostgreSQL database in read-only mode, and query one of its tables:
    INSTALL postgres;
    LOAD postgres;
    ATTACH 'dbname=postgres user=postgres host=127.0.0.1' AS db (TYPE postgres, READ_ONLY);
    SELECT * FROM db.uuids;
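Several systems in this section, Apache Doris among them, are described as relying on vectorized execution. The idea is that each operator processes a batch (a "vector") of column values at a time instead of one row at a time. The following sketch shows that control flow in plain Python for readability rather than speed. It does not reproduce any engine's internals; the batch size, column names, and helper functions are hypothetical.

```python
from array import array

BATCH_SIZE = 4  # Hypothetical vector size; real engines use far larger batches.


def batches(column, size=BATCH_SIZE):
    """Yield consecutive slices (vectors) of a column."""
    for start in range(0, len(column), size):
        yield column[start:start + size]


def select_greater_than(vector, threshold):
    """Return a selection vector: positions whose value exceeds threshold."""
    return [i for i, value in enumerate(vector) if value > threshold]


def sum_selected(vector, selection):
    """Sum only the positions listed in the selection vector."""
    return sum(vector[i] for i in selection)


def filtered_sum(filter_column, value_column, threshold):
    """Compute SUM(value) WHERE filter > threshold, one batch at a time."""
    total = 0
    for filter_vec, value_vec in zip(batches(filter_column), batches(value_column)):
        # The filter operator touches only the filter column ...
        selection = select_greater_than(filter_vec, threshold)
        # ... and the aggregate reads only the qualifying positions.
        total += sum_selected(value_vec, selection)
    return total


# Two typed columns of equal length (64-bit signed integers).
quantity = array("q", [5, 12, 7, 30, 1, 18, 9, 25, 14])
price = array("q", [100, 200, 300, 400, 500, 600, 700, 800, 900])

result = filtered_sum(quantity, price, threshold=10)
print(result)  # 2900
```

The filter produces a list of qualifying positions, and the aggregate consumes that list rather than a copy of the matching rows. In a compiled engine, loops of this shape over typed arrays are the ones that can be compiled to SIMD instructions.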
Specialized Systems
Specialized open-source column-oriented DBMSes are designed for niche applications, such as time-series data management or integration with big data ecosystems, offering optimizations like high-ingestion rates and domain-specific querying that distinguish them from broader OLAP systems. These systems leverage columnar storage to enhance compression and query performance on structured event or temporal data, enabling efficient handling of high-velocity inputs in areas like monitoring and analytics.
Apache Druid, initiated in 2011 and implemented in Java, is a distributed database optimized for real-time ingestion and sub-second queries on event-oriented data, such as web analytics or IoT streams, with support from the company Imply for enterprise deployments.[26][27] Its architecture supports low-latency operations through segment-based storage and indexing, achieving ingestion rates exceeding 1 million events per second on commodity hardware.[28]
InfluxDB, launched in 2013 and developed primarily in Go with recent components in Rust, serves as a time-series database tailored for metrics, monitoring, and observability workloads, featuring built-in downsampling and retention policies.[29][30] The system employs a columnar engine based on Apache Arrow for efficient compression and querying, supporting over 100,000 writes per second in production environments.[31]
VictoriaMetrics, started in 2018 and written in Go, provides a scalable time-series database compatible with Prometheus, emphasizing cost-effective storage and high query throughput for monitoring large-scale infrastructures.[32] It uses a custom columnar format for data blocks, enabling up to 10x better compression than alternatives and handling billions of active series with minimal resource usage.[33]
QuestDB, released in 2018 and built in Java, is a relational time-series database that supports standard SQL with extensions for high-speed ingestion and analytics, ideal for financial and industrial applications requiring sub-millisecond query times.[34] Its columnar design facilitates vectorized processing, supporting ingestion rates of over 1 million rows per second via protocols like InfluxDB Line Protocol.[35]
Apache Kudu, introduced in 2016 and implemented in C++, acts as a storage engine complementary to Hadoop ecosystems like Apache Impala and Spark, enabling fast analytics on mutable datasets with ACID transactions.[36] The columnar storage allows efficient scans and updates, with performance benchmarks showing up to 10x faster inserts compared to HBase for analytical workloads.[37]
Databend, founded in 2021 and developed in Rust, functions as a serverless, cloud-native data warehouse supporting SQL analytics on semi-structured data, with features like zero-copy sharing and elastic scaling. Its columnar format, inspired by Apache Parquet, optimizes for vectorized execution, achieving query speeds competitive with proprietary systems on petabyte-scale datasets.[38]
OpenTSDB, created in 2010 and coded in Java, is a distributed time-series database layered on Apache HBase, designed for storing and querying billions of metrics from large-scale monitoring systems without losing temporal granularity.[39] Leveraging HBase's column-family model, it supports horizontal scaling to handle over 1 billion data points daily, with aggregation functions for downsampling historical data.[40]
Proprietary Column-Oriented DBMSes
On-Premises Systems
On-premises proprietary column-oriented DBMSes are designed for deployment on customer-controlled infrastructure, offering enterprises full administrative control over hardware, scaling, and security for high-performance analytics workloads. These systems typically leverage columnar storage to optimize query performance on large datasets, supporting data warehousing, real-time analytics, and complex reporting without relying on cloud-managed services. Key examples include established platforms from major vendors, emphasizing in-memory processing and parallel execution for enterprise-scale operations.
Vertica, founded in 2005 and implemented in C++, is an analytics platform specializing in high-velocity data processing and data warehousing on shared-nothing architectures.[41] It supports massively parallel processing for petabyte-scale analytics, enabling fast ad-hoc queries and integration with machine learning workflows, and has been owned by OpenText since 2023 (previously under Hewlett Packard Enterprise and Micro Focus).[42]
SAP HANA, launched in 2010 and developed in C++, functions as an in-memory columnar database that unifies real-time analytics and transactional processing.[43] It accelerates data-intensive applications through multi-core processing and supports hybrid transactional/analytical workloads, provided by SAP SE.
Exasol, commercially released in 2008 and built in C++, is an in-memory analytics database optimized for executing complex queries at high speed on large volumes of data. It employs massively parallel processing to deliver sub-second response times for business intelligence tasks, developed by Exasol AG.
kdb+, introduced in 2003 using the Q programming language, excels in high-frequency trading and time-series analytics with its column-oriented, in-memory design for handling tick data and real-time streams. It provides vectorized operations for ultra-low latency queries on billions of records, offered by KX Systems.
SAP IQ, originating in the 1990s and implemented in C, serves as a columnar engine primarily for data warehousing and large-scale reporting.[44] It focuses on compression and index optimization to manage petabyte datasets efficiently for analytical queries, maintained by SAP SE.
SingleStore, established in 2011 and coded in C++, offers a unified SQL interface for both transactional and analytical processing in a distributed, row- and column-hybrid store. Formerly known as MemSQL, it supports real-time ingestion and vector search for AI-driven applications, provided by SingleStore Inc.[45]
Actian Vector (formerly VectorWise), released in 2010 and developed in C++, is a high-performance analytics database leveraging vectorized execution for rapid processing of complex queries on voluminous data. It incorporates SIMD instructions for hardware-accelerated performance in business intelligence scenarios, distributed by Actian Corporation.[46]
Cloud-Based PaaS Systems
Cloud-based Platform as a Service (PaaS) systems for column-oriented DBMSes provide managed, scalable analytics solutions that eliminate infrastructure management, enabling organizations to focus on data insights through elastic compute resources and pay-as-you-go pricing. These services leverage columnar storage to optimize query performance on large datasets, supporting petabyte-scale operations with automatic scaling and integration into major cloud ecosystems.
- Amazon Redshift (2012): Launched on Amazon Web Services (AWS), Redshift is a fully managed petabyte-scale data warehouse that uses columnar storage to accelerate analytical queries on structured data, with features like concurrency scaling for handling variable workloads without downtime.[47][48]
- Google BigQuery (2010): Offered as a serverless, fully managed analytics platform on Google Cloud, BigQuery employs columnar storage for real-time querying of massive datasets, supporting multicloud data processing and automatic resource allocation to minimize operational overhead.[49]
- Snowflake (2012): This cloud data platform, available across AWS, Azure, and Google Cloud, separates storage and compute for independent scaling, utilizing columnar storage to enable efficient analytics and data sharing while providing zero-copy cloning for rapid deployment.[50][51]
- Microsoft Azure Synapse Analytics (2019, formerly SQL Data Warehouse): An integrated analytics service on Microsoft Azure, it combines enterprise data warehousing with big data processing using columnar storage in its dedicated SQL pools, offering serverless options for on-demand scaling and seamless integration with Azure Machine Learning.[52][53]
- Oracle Autonomous Data Warehouse (2018): Deployed on Oracle Cloud Infrastructure, this self-driving data warehouse automates tuning, security, and backups with columnar storage optimized for analytics, allowing elastic scaling and pay-per-use pricing to reduce administrative effort.[54]
- Rockset (2016): A real-time search and analytics database on cloud platforms including AWS and Google Cloud, Rockset uses columnar indexing for sub-second queries on semi-structured data, with convergent indexing to support dynamic schemas; it was acquired by OpenAI in 2024 to enhance AI-driven data retrieval.[55]
