List of column-oriented DBMSes

This article is a list of column-oriented database management system software.

Free and open-source software (FOSS)


Database name	Language implemented in	Notes
Apache Doris	Java & C++	Open source (since 2017), database for high-concurrency point queries and high-throughput analysis.
Apache Druid	Java	Started in 2011 for low-latency massive ingestion and queries. Support and extensions available from Imply Data.
Apache Kudu	C++	Released in 2016 to complete the Apache Hadoop ecosystem
Apache Pinot	Java	Open sourced in 2015 for real-time low-latency analytics. Support and extensions available from StarTree.
Calpont InfiniDB	C++
ClickHouse	C++	Released in 2016 to analyze data that is updated in real time
CrateDB	Java
C-Store	C++	The last release of the original code was in 2006; Vertica, a commercial fork, lives on.
DuckDB	C++	An embeddable, in-process, column-oriented SQL OLAP RDBMS
Databend	Rust	An elastic and reliable Serverless Data Warehouse
InfluxDB	Rust	Time series database
Greenplum Database	C	Support and extensions available from VMware.
HEAVY.AI	C++	Formerly MapD
MariaDB ColumnStore	C & C++	Formerly Calpont InfiniDB
Metakit	C++
MonetDB	C	Open-source (since 2004) columnar Relational DBMS pioneer
PostgreSQL cstore_fdw,^[1] vops^[2]	C	cstore_fdw uses ORC format
StarRocks	Java & C++	Open source, unified analytics platform for batch and real-time analytics. Support and extensions available from CelerData.
VictoriaMetrics	Go	Time series database^[3]

Platform as a Service (PaaS)

Amazon Redshift
Microsoft Azure Synapse Analytics (formerly Azure SQL Data Warehouse)
Google BigQuery
Oracle Autonomous Data Warehouse Cloud (ADWC)
Snowflake Computing
MariaDB SkySQL
Actian Avalanche
Vertica Accelerator
CelerData

Proprietary

Actian Vector (formerly VectorWise)
Actuate Corporation BIRT Analytics ColumnarDB
Dimensional Insight
Endeca
EXASOL
EXtremeDB
Hydrolix
IBM Db2
Infobright
KDB
kdb+
MemSQL (now SingleStore)
Microsoft SQL Server
Oracle Database (in-memory option)^[4]
SAND CDBMS
SAP HANA^[5]
SAP IQ
SenSage
SQream
Teradata
Vertica (developed from open source C-Store)
Yellowbrick Data

References

^ "Columnar store for analytics with PostgreSQL".
^ "Vectorized Operations extension for PostgreSQL".
^ "Install and Use VictoriaMetrics time-series database on Ubuntu | ComputingForGeeks". 2023-04-18. Retrieved 2025-10-29.
^ "Oracle Database 12c: In-Memory Option" (PDF).
^ "Home | SAP HANA". hana.sap.com. Retrieved 2016-07-07.

List of column-oriented DBMSes

A column-oriented database management system (DBMS), also known as a columnar database, is a type of DBMS that stores relational data tables by column rather than by row, contrasting with traditional row-oriented systems that group entire records together.^[1] This storage model enables more efficient querying for analytical workloads, as it allows the system to read only the required columns, reducing I/O overhead and leveraging column-specific compression techniques.^[2] Column-oriented DBMSes are particularly suited for online analytical processing (OLAP) tasks, such as data aggregation, reporting, and business intelligence, where scans of large datasets focus on specific attributes rather than full rows.^[3]

The origins of column-oriented storage trace back to the 1970s, with early implementations using transposed files and vertical partitioning to optimize for columnar access in analytical environments.^[4] By the mid-1980s, research highlighted the advantages of the fully decomposed storage model (DSM), which separates attributes into independent column files, over the N-ary storage model (NSM) used in row-oriented systems.^[4] Interest waned temporarily due to the dominance of transactional (OLTP) workloads, but the concept resurged in the 2000s amid growing demands for data warehousing and the widening performance gap between CPUs and disk I/O, prompting innovations in compression and query optimization.^[3] Foundational academic projects like C-Store (2005) and MonetDB demonstrated dramatic speedups, up to 100 times, for read-heavy queries, influencing commercial developments.^[2]

Key advantages of column-oriented DBMSes include superior cache locality, as related values in a column are stored contiguously, minimizing page faults during scans; high compression ratios, often exceeding 10:1 for low-entropy columns like dates or categories; and simplified vectorized query execution, which processes data in batches for better SIMD utilization.^[4] These features make them ideal for big data analytics, scientific computing, and modern cloud environments, though they may underperform for frequent updates or inserts compared to row-oriented systems.^[3]

This list catalogs notable column-oriented DBMSes, including early commercial, academic, open-source, and cloud-based systems.
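The difference between the two layouts can be illustrated with a minimal Python sketch. The three-row table, the column names, and the helper `to_columns` are hypothetical and chosen only for illustration; real systems store columns in compressed on-disk blocks rather than Python lists.

```python
# Hypothetical three-row table used only for illustration.
rows = [
    {"id": 1, "region": "EU", "amount": 10.0},
    {"id": 2, "region": "US", "amount": 25.5},
    {"id": 3, "region": "EU", "amount": 4.5},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict mapping column name to a list of values."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every record is visited in full, although only "amount" is needed.
total_from_rows = sum(row["amount"] for row in rows)

# Column layout: only the "amount" column is touched.
total_from_columns = sum(columns["amount"])
```

Both totals are equal; the point of the sketch is that the columnar version never reads the `id` or `region` values.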

Introduction

Definition and Characteristics

A column-oriented database management system (DBMS), also known as a columnar DBMS, is a relational database system that stores data tables by column rather than by row, optimizing for read-intensive operations on large datasets. This storage approach groups data by attributes, allowing queries to access only the relevant columns without retrieving entire rows, which significantly reduces I/O overhead and enables efficient compression. Unlike traditional row-oriented systems, columnar storage facilitates vectorized query processing, where operations are applied to entire columns as arrays, leveraging modern CPU instructions for faster execution.^[2]^[5]

Key characteristics of column-oriented DBMSes include their columnar storage format, which physically organizes data by column on disk, enabling sequential access patterns that align with analytical workloads. These systems commonly support advanced compression techniques, such as run-length encoding (RLE) for sequences of repeated values and dictionary encoding to map unique values to smaller codes, achieving compression ratios significantly higher than row-based methods, such as 10:1 compared to 3:1. They are particularly suited for online analytical processing (OLAP) workloads, where aggregations and scans over specific attributes dominate, and integrate well with columnar file formats like Apache Parquet and ORC, which further enhance data portability and query performance in distributed environments.^[5]^[2]^[6]^[7]

In contrast to row-oriented DBMSes, which store entire rows contiguously to optimize transactional updates and full-row retrievals in online transaction processing (OLTP) scenarios, column-oriented systems prioritize read efficiency by scanning only necessary columns for queries like sums or averages, avoiding the transfer of irrelevant data. Row-oriented systems excel in random access and write-heavy operations but incur higher costs for column-wise analytics due to fragmented access patterns. Storage mechanics in columnar DBMSes involve grouping columns into projections or extents on disk, often with sorting and indexing to support rapid reconstruction of rows when needed, and exploitation of single instruction, multiple data (SIMD) instructions for parallel processing of column vectors.^[2]^[5]
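Run-length encoding and dictionary encoding, the two compression techniques named above, can be sketched in a few lines of Python. This is a simplified illustration, not the on-disk format of any particular system; the function names and the sample `region` column are assumptions made for the example.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for value, count in runs:
        out.extend([value] * count)
    return out


def dict_encode(values):
    """Replace each value with a small integer code; codes follow first appearance."""
    dictionary = {}
    codes = []
    for value in values:
        codes.append(dictionary.setdefault(value, len(dictionary)))
    # Keys are in insertion order, so the list index of a value equals its code.
    return list(dictionary), codes


def dict_decode(dictionary, codes):
    """Map integer codes back to the original values."""
    return [dictionary[code] for code in codes]


# Hypothetical low-cardinality column, sorted enough to contain runs.
region = ["EU", "EU", "EU", "US", "US", "EU"]
runs = rle_encode(region)
dictionary, codes = dict_encode(region)
```

Both encodings are lossless: decoding `runs` or `codes` reproduces `region`. They work best on columns with few distinct values, which is why sorting a column before encoding it tends to improve the ratio.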

Benefits and Use Cases

Column-oriented DBMSes offer significant advantages in query performance for analytical workloads, particularly those involving aggregations such as SUM and AVG on large datasets, by accessing only the relevant columns and minimizing I/O overhead.^[2] This approach can yield substantial speedups; for instance, in TPC-H benchmarks, a column-oriented system demonstrated 164 times faster performance compared to row-oriented systems due to reduced data scanning and efficient processing of compressed columns.^[2] Additionally, columnar storage enables higher storage efficiency through compression techniques that exploit data similarity within columns, often achieving up to 10:1 ratios, which reduces disk space requirements by 40-70% relative to row-stores while supporting vectorized query execution for better CPU utilization.^[5] Scalability is enhanced in distributed environments, where horizontal partitioning across nodes allows handling of massive datasets in grid-like setups, tolerating failures through redundancy mechanisms.^[2]

These systems excel in use cases centered on read-heavy, analytical processing, such as data warehousing for business intelligence reporting, where complex queries aggregate historical data to inform decisions.^[2] They are particularly suited to real-time analytics in domains like telecommunications and scientific computing, including processing large-scale observational data from surveys.^[8] Time-series analysis in IoT for sensor monitoring and in finance for market data tracking benefits from their ability to efficiently query temporal patterns without loading entire rows.^[5] In contrast, column-oriented DBMSes are less ideal for online transaction processing (OLTP) workloads, where frequent inserts and updates predominate, as row-oriented systems better support transactional integrity and point queries.^[5]

Column-oriented DBMSes emerged in the late 1990s and early 2000s to address the limitations of row-oriented systems in data warehousing and online analytical processing (OLAP), with pioneering efforts like those at CWI in the Netherlands driving innovations in main-memory columnar architectures around 2004.^[8] However, they involve trade-offs, including slower performance for frequent inserts and updates due to the overhead of reorganizing columnar structures and the need for batch processing to maintain efficiency.^[5] These systems often employ hybrid strategies, such as write-optimized storage for updates and read-optimized columnar views, to balance analytical speed with occasional modifications.^[2]
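The hybrid strategy described above, a write-optimized store that is periodically merged into a read-optimized columnar store, can be sketched as follows. The class `HybridStore` and its method names are hypothetical; this is a toy model of the idea under simplifying assumptions (no deletes, no concurrency, no compression), not the design of any specific product.

```python
class HybridStore:
    """Toy model: row-wise write buffer plus column-wise read store."""

    def __init__(self, column_names):
        self.column_names = list(column_names)
        # Read-optimized side: one list per column.
        self.read_store = {name: [] for name in self.column_names}
        # Write-optimized side: new rows are appended whole, which is cheap.
        self.write_store = []

    def insert(self, row):
        """Append a row (a dict) to the write buffer without touching the columns."""
        self.write_store.append(tuple(row[name] for name in self.column_names))

    def merge(self):
        """Move buffered rows into the columnar store in one batch."""
        for row in self.write_store:
            for name, value in zip(self.column_names, row):
                self.read_store[name].append(value)
        self.write_store.clear()

    def column_sum(self, name):
        """Aggregate over both stores so unmerged rows remain visible to queries."""
        position = self.column_names.index(name)
        buffered = sum(row[position] for row in self.write_store)
        return sum(self.read_store[name]) + buffered


store = HybridStore(["id", "amount"])
store.insert({"id": 1, "amount": 10})
store.insert({"id": 2, "amount": 5})
sum_before_merge = store.column_sum("amount")
store.merge()
sum_after_merge = store.column_sum("amount")
```

The aggregate is the same before and after `merge()`; only the location of the data changes. Batching the merge is what keeps individual inserts cheap.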

Open-Source Column-Oriented DBMSes

General-Purpose Systems

General-purpose open-source column-oriented DBMSes are designed for versatile analytical workloads, including data warehousing and online analytical processing (OLAP), supporting high-performance queries on large datasets through columnar storage and parallel execution. These systems emphasize SQL compatibility, scalability, and integration with big data ecosystems, enabling efficient handling of complex aggregations and joins without domain-specific optimizations.

Apache Doris, initially released in 2017 and implemented primarily in Java and C++, is a massively parallel processing (MPP) real-time data warehouse optimized for high-concurrency queries and sub-second analytical responses on large datasets within the Hadoop ecosystem.^[9] It supports real-time data ingestion from streams like Kafka and batch sources, leveraging vectorized execution for accelerated performance in OLAP scenarios.^[10]

ClickHouse, first released in 2016 and developed in C++, serves as a high-performance OLAP database management system for real-time data analysis using standard SQL queries.^[11] Its columnar architecture enables rapid generation of analytical reports on massive volumes of data, with features like distributed query processing and compression techniques that achieve high throughput for aggregations and filtering.^[12]

DuckDB, released in 2019 and built in C++, functions as an embeddable OLAP relational database management system tailored for executing complex analytical queries directly on local data files without requiring a separate server process.^[13] It offers in-process execution with full SQL support, including window functions and joins, making it suitable for data science workflows and ad-hoc analysis on datasets up to terabyte scale.^[14] DuckDB supports interoperability with PostgreSQL via its postgres extension, allowing direct querying of PostgreSQL databases. For example, the extension can be installed and loaded with the following commands:

INSTALL postgres; LOAD postgres;

A PostgreSQL database can then be attached as:

ATTACH 'dbname=postgres user=postgres host=127.0.0.1' AS db (TYPE postgres, READ_ONLY);

And queried with:

SELECT * FROM db.uuids;

This feature enhances DuckDB's utility for hybrid analytical environments.^[15]

MonetDB, first released in 2004 and implemented in C, stands as a pioneering columnar relational database management system focused on analytical processing through its MonetDB Assembly Language (MAL) interpreter for optimized query execution.^[16] It excels in multi-core parallel processing for complex queries on large datasets, providing SQL compliance with extensions for scientific and array-based analytics.^[17]

Apache Pinot, released in 2015 and written in Java, is a distributed OLAP datastore designed for real-time analytics with low-latency data ingestion from streaming sources like Apache Kafka.^[18] It supports high-throughput querying for user-facing applications, incorporating inverted indexes and segment-based storage to handle billions of events with sub-second response times.^[19]

StarRocks, open-sourced in the early 2020s and developed in Java and C++, provides a unified analytics platform that combines real-time and batch processing for multi-dimensional OLAP workloads.^[20] Its cost-based optimizer and vectorized engine deliver sub-second query latency on petabyte-scale data, with native support for federated querying across data lakes like Apache Iceberg.^[21]

CrateDB, first released in 2013 and implemented in Java, operates as a distributed SQL database that integrates search and analytics capabilities for handling structured and unstructured data in real time.^[22] Built on a shared-nothing architecture, it supports full-text search via Lucene, geospatial queries, and scalable aggregations, enabling hybrid workloads with standard SQL interfaces.^[23]

Greenplum Database, originating in the 2000s and primarily written in C as a fork of PostgreSQL, delivers massively parallel processing for large-scale analytics on distributed clusters.^[24] Supported by VMware, it facilitates data warehousing with features like external table access to Hadoop and high availability through mirroring, processing petabyte-scale datasets via SQL-based parallel query execution.^[25]
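Several of the systems above pair columnar storage with secondary structures such as inverted indexes. The following Python sketch shows the general idea of an inverted index over one column being used to filter rows before aggregating another column. It is an illustration only and does not reflect the internals of Apache Pinot or any other listed system; the column names and helper functions are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(column):
    """Map each distinct value in a column to the row positions where it occurs."""
    index = defaultdict(list)
    for row_id, value in enumerate(column):
        index[value].append(row_id)
    return dict(index)


def sum_where(index, filter_value, measure):
    """Sum a measure column over only the rows whose indexed column matches."""
    return sum(measure[row_id] for row_id in index.get(filter_value, []))


# Two hypothetical columns of the same table, aligned by row position.
country = ["DE", "US", "DE", "FR", "US"]
clicks = [3, 5, 2, 7, 1]

country_index = build_inverted_index(country)
clicks_de = sum_where(country_index, "DE", clicks)
clicks_us = sum_where(country_index, "US", clicks)
```

Because columns are aligned by row position, the row identifiers found through the index on `country` can be used directly to fetch values from `clicks` without scanning either column in full.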

Specialized Systems

Specialized open-source column-oriented DBMSes are designed for niche applications, such as time-series data management or integration with big data ecosystems, offering optimizations like high-ingestion rates and domain-specific querying that distinguish them from broader OLAP systems. These systems leverage columnar storage to enhance compression and query performance on structured event or temporal data, enabling efficient handling of high-velocity inputs in areas like monitoring and analytics.

Apache Druid, initiated in 2011 and implemented in Java, is a distributed database optimized for real-time ingestion and sub-second queries on event-oriented data, such as web analytics or IoT streams, with support from the company Imply for enterprise deployments.^[26]^[27] Its architecture supports low-latency operations through segment-based storage and indexing, achieving ingestion rates exceeding 1 million events per second on commodity hardware.^[28]

InfluxDB, launched in 2013 and developed primarily in Go with recent components in Rust, serves as a time-series database tailored for metrics, monitoring, and observability workloads, featuring built-in downsampling and retention policies.^[29]^[30] The system employs a columnar engine based on Apache Arrow for efficient compression and querying, supporting over 100,000 writes per second in production environments.^[31]

VictoriaMetrics, started in 2018 and written in Go, provides a scalable time-series database compatible with Prometheus, emphasizing cost-effective storage and high query throughput for monitoring large-scale infrastructures.^[32] It uses a custom columnar format for data blocks, enabling up to 10x better compression than alternatives and handling billions of active series with minimal resource usage.^[33]

QuestDB, released in 2018 and built in Java, is a relational time-series database that supports standard SQL with extensions for high-speed ingestion and analytics, ideal for financial and industrial applications requiring sub-millisecond query times.^[34] Its columnar design facilitates vectorized processing, supporting ingestion rates of over 1 million rows per second via protocols like InfluxDB Line Protocol.^[35]

Apache Kudu, introduced in 2016 and implemented in C++, acts as a storage engine complementary to Hadoop ecosystems like Apache Impala and Spark, enabling fast analytics on mutable datasets with ACID transactions.^[36] The columnar storage allows efficient scans and updates, with performance benchmarks showing up to 10x faster inserts compared to HBase for analytical workloads.^[37]

Databend, founded in 2021 and developed in Rust, functions as a serverless, cloud-native data warehouse supporting SQL analytics on semi-structured data, with features like zero-copy sharing and elastic scaling. Its columnar format, inspired by Apache Parquet, optimizes for vectorized execution, achieving query speeds competitive with proprietary systems on petabyte-scale datasets.^[38]

OpenTSDB, created in 2010 and coded in Java, is a distributed time-series database layered on Apache HBase, designed for storing and querying billions of metrics from large-scale monitoring systems without losing temporal granularity.^[39] Leveraging HBase's column-family model, it supports horizontal scaling to handle over 1 billion data points daily, with aggregation functions for downsampling historical data.^[40]
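Downsampling, which several of the time-series systems above offer as a built-in feature, reduces a dense series to one aggregated value per time bucket. A minimal Python sketch, assuming integer timestamps in seconds, a fixed bucket width, and mean aggregation; the function name `downsample_mean` and the sample data are hypothetical and not taken from any of the listed systems.

```python
from collections import defaultdict


def downsample_mean(timestamps, values, bucket_seconds):
    """Average values per fixed-width time bucket; returns (bucket_start, mean) pairs."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, value in zip(timestamps, values):
        # Align each timestamp to the start of its bucket.
        bucket = ts - ts % bucket_seconds
        sums[bucket] += value
        counts[bucket] += 1
    return [(bucket, sums[bucket] / counts[bucket]) for bucket in sorted(sums)]


# Timestamp and value are held as two parallel columns, as in a columnar store.
timestamps = [0, 10, 20, 60, 70, 125]
values = [1.0, 2.0, 3.0, 10.0, 20.0, 5.0]

per_minute = downsample_mean(timestamps, values, bucket_seconds=60)
```

Here six raw points are reduced to three one-minute averages. Retention policies typically combine this step with deleting the raw points once they pass a configured age.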

Proprietary Column-Oriented DBMSes

On-Premises Systems

On-premises proprietary column-oriented DBMSes are designed for deployment on customer-controlled infrastructure, offering enterprises full administrative control over hardware, scaling, and security for high-performance analytics workloads. These systems typically leverage columnar storage to optimize query performance on large datasets, supporting data warehousing, real-time analytics, and complex reporting without relying on cloud-managed services. Key examples include established platforms from major vendors, emphasizing in-memory processing and parallel execution for enterprise-scale operations.

Vertica, founded in 2005 and implemented in C++, is an analytics platform specializing in high-velocity data processing and data warehousing on shared-nothing architectures.^[41] It supports massively parallel processing for petabyte-scale analytics, enabling fast ad-hoc queries and integration with machine learning workflows, and has been owned by OpenText since 2023 (previously under Hewlett Packard Enterprise and Micro Focus).^[42]

SAP HANA, launched in 2010 and developed in C++, functions as an in-memory columnar database that unifies real-time analytics and transactional processing.^[43] It accelerates data-intensive applications through multi-core processing and supports hybrid transactional/analytical workloads, provided by SAP SE.

Exasol, commercially released in 2008 and built in C++, is an in-memory analytics database optimized for executing complex queries at high speed on large volumes of data. It employs massively parallel processing to deliver sub-second response times for business intelligence tasks, developed by Exasol AG.

kdb+, introduced in 2003 using the Q programming language, excels in high-frequency trading and time-series analytics with its column-oriented, in-memory design for handling tick data and real-time streams. It provides vectorized operations for ultra-low latency queries on billions of records, offered by KX Systems.

SAP IQ, originating in the 1990s and implemented in C, serves as a columnar engine primarily for data warehousing and large-scale reporting.^[44] It focuses on compression and index optimization to manage petabyte datasets efficiently for analytical queries, maintained by SAP SE.

SingleStore, established in 2011 and coded in C++, offers a unified SQL interface for both transactional and analytical processing in a distributed, hybrid row and column store. Formerly known as MemSQL, it supports real-time ingestion and vector search for AI-driven applications, provided by SingleStore Inc.^[45]

Actian Vector (formerly VectorWise), released in 2010 and developed in C++, is a high-performance analytics database leveraging vectorized execution for rapid processing of complex queries on voluminous data. It incorporates SIMD instructions for hardware-accelerated performance in business intelligence scenarios, distributed by Actian Corporation.^[46]

Cloud-Based PaaS Systems

Cloud-based Platform as a Service (PaaS) systems for column-oriented DBMSes provide managed, scalable analytics solutions that eliminate infrastructure management, enabling organizations to focus on data insights through elastic compute resources and pay-as-you-go pricing. These services leverage columnar storage to optimize query performance on large datasets, supporting petabyte-scale operations with automatic scaling and integration into major cloud ecosystems.

Amazon Redshift (2012): Launched on Amazon Web Services (AWS), Redshift is a fully managed petabyte-scale data warehouse that uses columnar storage to accelerate analytical queries on structured data, with features like concurrency scaling for handling variable workloads without downtime.^[47]^[48]
Google BigQuery (2010): Offered as a serverless, fully managed analytics platform on Google Cloud, BigQuery employs columnar storage for real-time querying of massive datasets, supporting multicloud data processing and automatic resource allocation to minimize operational overhead.^[49]
Snowflake (2012): This cloud data platform, available across AWS, Azure, and Google Cloud, separates storage and compute for independent scaling, utilizing columnar storage to enable efficient analytics and data sharing while providing zero-copy cloning for rapid deployment.^[50]^[51]
Microsoft Azure Synapse Analytics (2019, formerly SQL Data Warehouse): An integrated analytics service on Microsoft Azure, it combines enterprise data warehousing with big data processing using columnar storage in its dedicated SQL pools, offering serverless options for on-demand scaling and seamless integration with Azure Machine Learning.^[52]^[53]
Oracle Autonomous Data Warehouse (2018): Deployed on Oracle Cloud Infrastructure, this self-driving data warehouse automates tuning, security, and backups with columnar storage optimized for analytics, allowing elastic scaling and pay-per-use pricing to reduce administrative effort.^[54]
Rockset (2016): A real-time search and analytics database on cloud platforms including AWS and Google Cloud, Rockset uses columnar indexing for sub-second queries on semi-structured data, with convergent indexing to support dynamic schemas; it was acquired by OpenAI in 2024 to enhance AI-driven data retrieval.^[55]
