List of column-oriented DBMSes
from Wikipedia

This article is a list of column-oriented database management system software.

Free and open-source software (FOSS)

Database name | Language implemented in | Notes
Apache Doris | Java & C++ | Open source (since 2017); database for high-concurrency point queries and high-throughput analysis.
Apache Druid | Java | Started in 2011 for low-latency massive ingestion and queries. Support and extensions available from Imply Data.
Apache Kudu | C++ | Released in 2016 to complete the Apache Hadoop ecosystem.
Apache Pinot | Java | Open sourced in 2015 for real-time low-latency analytics. Support and extensions available from StarTree.
Calpont InfiniDB | C++ |
ClickHouse | C++ | Released in 2016 to analyze data that is updated in real time.
CrateDB | Java |
C-Store | C++ | The last release of the original code was in 2006; Vertica, a commercial fork, lives on.
DuckDB | C++ | An embeddable, in-process, column-oriented SQL OLAP RDBMS.
Databend | Rust | An elastic and reliable serverless data warehouse.
InfluxDB | Rust | Time series database.
Greenplum Database | C | Support and extensions available from VMware.
HEAVY.AI | C++ | Formerly MapD.
MariaDB ColumnStore | C & C++ | Formerly Calpont InfiniDB.
Metakit | C++ |
MonetDB | C | Open source (since 2004); columnar relational DBMS pioneer.
PostgreSQL cstore_fdw,[1] vops[2] | C | cstore_fdw uses the ORC format.
StarRocks | Java & C++ | Open source, unified analytics platform for batch and real-time analytics. Support and extensions available from CelerData.
VictoriaMetrics | Go | Time series database.[3]

Platform as a Service (PaaS)


Proprietary


References

from Grokipedia
A column-oriented database management system (DBMS), also known as a columnar database, is a type of DBMS that stores relational data tables by column rather than by row, in contrast to traditional row-oriented systems that group entire records together. This storage model makes analytical queries more efficient: the system reads only the required columns, which reduces I/O overhead, and it can apply column-specific compression techniques. Column-oriented DBMSes are particularly suited to online analytical processing (OLAP) tasks such as data aggregation, reporting, and business intelligence, where scans of large datasets focus on specific attributes rather than full rows.

The origins of column-oriented storage trace back to the 1970s, with early implementations using transposed files and vertical partitioning to optimize for columnar access in analytical environments. By the mid-1980s, research highlighted the advantages of the fully decomposed storage model (DSM), which separates attributes into independent column files, over the N-ary storage model (NSM) used in row-oriented systems. Interest waned temporarily due to the dominance of transactional (OLTP) workloads, but the concept resurged in the 2000s amid growing demand for data warehousing and a widening performance gap between CPUs and disk I/O, prompting innovations in compression and query optimization. Foundational academic projects such as C-Store (2005) and MonetDB demonstrated dramatic speedups, up to 100 times, for read-heavy queries, and influenced commercial developments.

Key advantages of column-oriented DBMSes include superior cache locality, because the values of a column are stored contiguously, which reduces cache misses during scans; high compression ratios, often exceeding 10:1 for low-entropy columns such as dates or categories; and simpler vectorized query execution, which processes data in batches for better SIMD utilization. These features make them ideal for analytics, scientific computing, and modern environments, though they may underperform row-oriented systems for frequent updates or inserts. This list catalogs notable column-oriented DBMSes, including early commercial, academic, open-source, and cloud-based systems.
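To make the I/O argument concrete, the following minimal Python sketch compares how many stored values a SUM has to visit under each layout. The table, its column names, and the helper functions are hypothetical and exist only for illustration; a real engine works on disk pages and compressed blocks rather than Python lists.

```python
# Hypothetical three-column table, used only for illustration.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 200.0},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    columns = {name: [] for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return columns


def sum_row_store(rows, column):
    """Row layout: the whole record is read to reach a single field."""
    total = 0.0
    values_touched = 0
    for row in rows:
        values_touched += len(row)  # every field of the record is fetched
        total += row[column]
    return total, values_touched


def sum_column_store(columns, column):
    """Column layout: only the requested column is read."""
    values = columns[column]
    return sum(values), len(values)


columns = to_columns(rows)
row_total, row_touched = sum_row_store(rows, "amount")
col_total, col_touched = sum_column_store(columns, "amount")
print(row_total, row_touched)
print(col_total, col_touched)
```

Both layouts return the same total, but the row layout visits all three fields of every record while the column layout visits only the `amount` values. The gap grows with the number of columns in the table.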

Introduction

Definition and Characteristics

A column-oriented database management system (DBMS), also known as a columnar DBMS, stores data tables by column rather than by row, optimizing for read-intensive operations on large datasets. This approach groups data by attribute, so queries can access only the relevant columns without retrieving entire rows, which significantly reduces I/O overhead and enables efficient compression. Unlike traditional row-oriented systems, columnar storage facilitates vectorized query processing, where operations are applied to entire columns as arrays, using modern CPU instructions for faster execution.

Key characteristics of column-oriented DBMSes include a columnar storage format, which physically organizes data by column on disk and enables access patterns that align with analytical workloads. These systems commonly support advanced compression techniques, such as run-length encoding (RLE) for sequences of repeated values and dictionary encoding to map unique values to smaller codes, achieving compression ratios significantly higher than row-based methods, for example 10:1 compared to 3:1. They are particularly suited to online analytical processing (OLAP) workloads, where aggregations and scans over specific attributes dominate, and they integrate well with columnar file formats, which further improve data portability and query performance in distributed environments.

In contrast, row-oriented DBMSes store entire rows contiguously to optimize transactional updates and full-row retrievals in online transaction processing (OLTP) scenarios. Column-oriented systems prioritize read efficiency by scanning only the columns needed for queries such as sums or averages, avoiding the transfer of irrelevant data. Row-oriented systems excel in transactional and write-heavy operations but incur higher costs for column-wise scans due to fragmented access patterns.

Storage mechanics in columnar DBMSes involve grouping columns into projections or extents on disk, often with sorting and indexing to support rapid reconstruction of rows when needed, and exploiting single instruction, multiple data (SIMD) instructions for parallel processing of column vectors.
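The two compression techniques named above can be sketched in a few lines of Python. This is a simplified illustration, not the encoding used by any particular system; the function names and the sample `status` column are assumptions made for the example.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encoding: collapse each run of equal values to (value, count)."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, count) pairs back into the original sequence."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


def dict_encode(values):
    """Dictionary encoding: replace each value with a small integer code."""
    dictionary = {}
    codes = []
    for value in values:
        # A value seen for the first time gets the next unused code.
        codes.append(dictionary.setdefault(value, len(dictionary)))
    # Keys are in insertion order, so the list index equals the code.
    return list(dictionary), codes


def dict_decode(dictionary, codes):
    """Map integer codes back to their original values."""
    return [dictionary[code] for code in codes]


# Hypothetical low-cardinality column, sorted so that runs are long.
status = ["shipped"] * 4 + ["pending"] * 2 + ["shipped"] * 3

runs = rle_encode(status)
dictionary, codes = dict_encode(status)
print(runs)
print(dictionary, codes)
```

RLE pays off when a column is sorted or naturally repetitive, since long runs collapse to a single pair. Dictionary encoding helps whenever the number of distinct values is small relative to the column length, regardless of order.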

Benefits and Use Cases

Column-oriented DBMSes offer significant advantages in query performance for analytical workloads, particularly those involving aggregations such as SUM and AVG on large datasets, because they access only the relevant columns and minimize I/O overhead. The resulting speedups can be substantial; for instance, in TPC-H benchmarks, a column-oriented system ran 164 times faster than row-oriented systems thanks to reduced data scanning and efficient processing of compressed columns. Columnar storage also improves storage efficiency through compression techniques that exploit data similarity within columns, often achieving ratios of up to 10:1, which reduces disk space requirements by 40-70% relative to row stores while supporting vectorized query execution for better CPU utilization. Scalability is enhanced in distributed environments, where horizontal partitioning across nodes allows massive datasets to be handled in grid-like setups that tolerate failures through redundancy mechanisms.

These systems excel in read-heavy analytical use cases, such as data warehousing for reporting, where complex queries aggregate historical data to inform decisions. They are also suited to real-time analytics in domains such as scientific research, including processing large-scale observational data from surveys. Time-series analysis, for example monitoring in IoT and tracking in finance, benefits from their ability to query temporal patterns efficiently without loading entire rows. In contrast, column-oriented DBMSes are less suitable for online transaction processing (OLTP) workloads, where frequent inserts and updates predominate, because row-oriented systems better support transactional integrity and point queries.

Modern column-oriented DBMSes emerged in the late 1990s and early 2000s to address the limitations of row-oriented systems in data warehousing and online analytical processing (OLAP), with pioneering efforts such as those at CWI in the Netherlands driving innovations in main-memory columnar architectures around 2004.

However, they involve trade-offs, including slower performance for frequent inserts and updates, because columnar structures must be reorganized to maintain efficiency. These systems often employ hybrid strategies, such as write-optimized storage for updates combined with read-optimized columnar views, to balance analytical speed with occasional modifications.
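The hybrid strategy of pairing a write-optimized store with a read-optimized columnar store can be sketched as follows. This is a minimal in-memory illustration under simplifying assumptions (no sorting, compression, deletes, or concurrency); the class name `HybridColumnTable` and its methods are hypothetical and do not correspond to any listed system's API.

```python
class HybridColumnTable:
    """Toy table with a row-wise write buffer and a columnar read store."""

    def __init__(self, column_names):
        self.column_names = list(column_names)
        # Read-optimized part: one list per column.
        self.read_store = {name: [] for name in self.column_names}
        # Write-optimized part: rows are appended whole, which is cheap.
        self.write_buffer = []

    def insert(self, row):
        """Append a row to the write buffer without touching the columns."""
        if len(row) != len(self.column_names):
            raise ValueError("row does not match the table schema")
        self.write_buffer.append(tuple(row))

    def merge(self):
        """Move buffered rows into the columnar store in one batch."""
        for row in self.write_buffer:
            for name, value in zip(self.column_names, row):
                self.read_store[name].append(value)
        self.write_buffer.clear()

    def column_sum(self, name):
        """Aggregate over both parts so unmerged rows remain visible."""
        index = self.column_names.index(name)
        merged = sum(self.read_store[name])
        pending = sum(row[index] for row in self.write_buffer)
        return merged + pending


table = HybridColumnTable(["sensor_id", "reading"])
table.insert((1, 10))
table.insert((2, 15))
sum_before_merge = table.column_sum("reading")
table.merge()
table.insert((3, 5))
sum_after_merge = table.column_sum("reading")
print(sum_before_merge, sum_after_merge)
```

Inserts stay cheap because they never reorganize column data, and the cost of that reorganization is paid in batches at merge time. Queries must consult both parts until the buffer is merged, which is the price of keeping recent writes visible.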

Open-Source Column-Oriented DBMSes

General-Purpose Systems

General-purpose open-source column-oriented DBMSes are designed for versatile analytical workloads, including data warehousing and online analytical processing (OLAP), supporting high-performance queries on large datasets through columnar storage and parallel execution. These systems emphasize SQL compatibility, scalability, and integration with big data ecosystems, enabling efficient handling of complex aggregations and joins without domain-specific optimizations.

Apache Doris, initially released in 2017 and implemented primarily in Java and C++, is a massively parallel processing (MPP) real-time data warehouse optimized for high-concurrency queries and sub-second analytical responses on large datasets within the Hadoop ecosystem. It supports ingestion from streams such as Kafka and from batch sources, using vectorized execution to accelerate OLAP workloads.

ClickHouse, first released in 2016 and developed in C++, is a high-performance OLAP database management system for real-time data analysis using standard SQL queries. Its columnar architecture enables rapid generation of analytical reports on massive volumes of data, with features such as distributed query processing and compression techniques that achieve high throughput for aggregations and filtering.

DuckDB, released in 2019 and built in C++, is an embeddable OLAP database management system tailored for executing complex analytical queries directly on local data files without requiring a separate server process. It offers in-process execution with full SQL support, including window functions and joins, making it suitable for analytical workflows and ad hoc analysis on datasets up to terabyte scale. DuckDB interoperates with PostgreSQL via its postgres extension, which allows direct querying of PostgreSQL databases. For example, the extension can be installed and loaded with the following commands:

INSTALL postgres; LOAD postgres;

A PostgreSQL database can then be attached as:

ATTACH 'dbname=postgres user=postgres host=127.0.0.1' AS db (TYPE postgres, READ_ONLY);

And queried with:

SELECT * FROM db.uuids;

This feature enhances DuckDB's utility for hybrid analytical environments.

MonetDB, first released in 2004 and implemented in C, is a pioneering columnar database management system focused on analytical processing through its MonetDB Assembly Language (MAL) interpreter for optimized query execution. It excels in multi-core parallel processing of complex queries on large datasets, providing SQL compliance with extensions for scientific and array-based analytics.

Apache Pinot, released in 2015 and written in Java, is a distributed OLAP datastore designed for real-time analytics with low-latency data ingestion from streaming sources such as Apache Kafka. It supports high-throughput querying for user-facing applications, incorporating inverted indexes and segment-based storage to handle billions of events with sub-second response times.

StarRocks, open-sourced in the early 2020s and developed in Java and C++, provides a unified analytics platform that combines real-time and batch analytics for multi-dimensional OLAP workloads. Its cost-based optimizer and vectorized engine deliver sub-second query latency on petabyte-scale data, with native support for federated querying across data lakes.

CrateDB, first released in 2013 and implemented in Java, is a database that integrates search and analytics capabilities for handling structured and unstructured data in real time. It supports full-text search via Lucene, geospatial queries, and scalable aggregations, enabling hybrid workloads through standard SQL interfaces.

Greenplum Database, originating in the 2000s and primarily written in C on a PostgreSQL base, delivers massively parallel processing for large-scale analytics on distributed clusters. Supported by VMware, it facilitates data warehousing with features such as external table access to Hadoop and high availability through mirroring, processing petabyte-scale datasets via SQL-based parallel query execution.

Specialized Systems

Specialized open-source column-oriented DBMSes are designed for niche applications, such as time-series data management or integration with big data ecosystems, offering optimizations such as high ingestion rates and domain-specific querying that distinguish them from broader OLAP systems. These systems use columnar storage to improve compression and query performance on structured event or temporal data, enabling efficient handling of high-velocity inputs in areas such as monitoring.

Apache Druid, initiated in 2011 and implemented in Java, is a data store optimized for real-time ingestion and sub-second queries on event-oriented data such as IoT streams, with support from the company Imply for enterprise deployments. Its architecture supports low-latency operations through segment-based storage and indexing, achieving ingestion rates exceeding 1 million events per second on commodity hardware.

InfluxDB, launched in 2013 and developed primarily in Go with recent components in Rust, is a time-series database tailored for metrics and monitoring workloads, featuring built-in downsampling and retention policies. The system employs a columnar engine for efficient compression and querying, supporting over 100,000 writes per second in production environments.

VictoriaMetrics, written in Go, provides a scalable time-series database compatible with Prometheus, emphasizing cost-effective storage and high query throughput for monitoring large-scale infrastructures. It uses a custom columnar format for data blocks, enabling up to 10x better compression than alternatives and handling billions of active series with minimal resource usage.

QuestDB, released in 2018, is a relational time-series database that supports standard SQL with extensions for high-speed ingestion and analytics, aimed at financial and industrial applications requiring sub-millisecond query times. Its columnar design facilitates vectorized processing, supporting ingestion rates of over 1 million rows per second via protocols such as the InfluxDB Line Protocol.

Apache Kudu, introduced in 2016 and implemented in C++, is a storage engine that complements Hadoop ecosystem tools such as Spark, enabling fast analytics on mutable datasets with transactions. Its columnar storage allows efficient scans and updates, with performance benchmarks showing up to 10x faster inserts than HBase for analytical workloads.

Databend, founded in 2021 and developed in Rust, is a serverless, cloud-native data warehouse supporting SQL analytics, with features such as data sharing and elastic scaling. Its columnar format is optimized for vectorized execution, achieving query speeds competitive with proprietary systems on petabyte-scale datasets.

OpenTSDB, created in 2010 and written in Java, is a distributed time-series database layered on HBase, designed for storing and querying billions of metrics from large-scale monitoring systems without losing temporal granularity. Using HBase's column-family model, it scales horizontally to handle over 1 billion data points daily, with aggregation functions for downsampling historical data.
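Several of the time-series systems above mention downsampling, which replaces raw points with one aggregate per fixed time bucket. The sketch below shows the general bucket-and-aggregate idea in Python over two parallel columns of timestamps and values. It is a generic illustration, not the implementation used by InfluxDB, OpenTSDB, or any other listed system, and the function names and sample data are assumptions.

```python
def downsample(timestamps, values, bucket_seconds, aggregate):
    """Group points into fixed-width time buckets and aggregate each bucket.

    timestamps and values are parallel columns; timestamps are in seconds.
    Returns a list of (bucket_start, aggregated_value) sorted by time.
    """
    buckets = {}
    for ts, value in zip(timestamps, values):
        bucket_start = ts - ts % bucket_seconds  # align to the bucket boundary
        buckets.setdefault(bucket_start, []).append(value)
    return [(start, aggregate(vals)) for start, vals in sorted(buckets.items())]


def mean(vals):
    """Arithmetic mean of a non-empty list."""
    return sum(vals) / len(vals)


# Hypothetical sensor readings stored as two columns.
timestamps = [0, 20, 40, 60, 80, 130]
readings = [1.0, 3.0, 5.0, 2.0, 4.0, 10.0]

per_minute_mean = downsample(timestamps, readings, 60, mean)
per_minute_max = downsample(timestamps, readings, 60, max)
print(per_minute_mean)
print(per_minute_max)
```

Because timestamps and values are stored as separate columns, the aggregation reads exactly two columns no matter how many other tags or fields each point carries.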

Proprietary Column-Oriented DBMSes

On-Premises Systems

On-premises proprietary column-oriented DBMSes are designed for deployment on customer-controlled infrastructure, giving enterprises full administrative control over hardware and scaling for high-performance workloads. These systems typically use columnar storage to optimize query performance on large datasets, supporting data warehousing, real-time analytics, and complex reporting without relying on cloud-managed services. Key examples include established platforms from major vendors, emphasizing parallel execution for enterprise-scale operations.

Vertica, founded in 2005 and implemented in C++, is an analytics platform specializing in data warehousing on shared-nothing architectures. It supports massively parallel processing for petabyte-scale data, enabling fast ad hoc queries, and has been owned by OpenText since 2023 (previously under Hewlett-Packard and Micro Focus).

SAP HANA, launched in 2010 and developed in C++, is an in-memory columnar database that unifies real-time analytics and transactional processing. It accelerates data-intensive applications through multi-core processing and supports hybrid transactional/analytical workloads. It is provided by SAP SE.

Exasol, commercially released in 2008 and built in C++, is an in-memory analytics database optimized for executing complex queries at high speed on large volumes of data. It employs massively parallel processing to deliver sub-second response times for business intelligence tasks. It is developed by Exasol AG.

kdb+, introduced in 2003 and programmed with the q language, excels in time-series analytics with its column-oriented, in-memory design for handling tick data and real-time streams. It provides vectorized operations for ultra-low-latency queries on billions of records. It is offered by KX Systems.

SAP IQ, originating in the 1990s and implemented in C, is a columnar engine used primarily for data warehousing and large-scale reporting. It focuses on compression and index optimization to manage petabyte datasets efficiently for analytical queries. It is maintained by SAP SE.

SingleStore, established in 2011 and written in C++, offers a unified SQL interface for both transactional and analytical workloads in a distributed store that combines row and column formats. Formerly known as MemSQL, it supports real-time ingestion and vector search for AI-driven applications. It is provided by SingleStore, Inc.

Actian Vector, released in 2010 and developed in C++, is a high-performance analytics database that uses vectorized execution for rapid processing of complex queries on large data volumes. It uses SIMD instructions for hardware-accelerated processing in business intelligence scenarios. Formerly known as VectorWise, it is distributed by Actian Corporation.

Cloud-Based PaaS Systems

Cloud-based platform as a service (PaaS) systems for column-oriented DBMSes provide managed, scalable analytics that remove the need for infrastructure management, letting organizations focus on data insights through elastic compute resources and pay-as-you-go pricing. These services use columnar storage to optimize query performance on large datasets, supporting petabyte-scale operations with automatic scaling and integration into major cloud ecosystems.
  • Amazon Redshift (2012): Launched on Amazon Web Services (AWS), Redshift is a fully managed petabyte-scale data warehouse that uses columnar storage to accelerate analytical queries on structured data, with features such as concurrency scaling for handling variable workloads without downtime.
  • Google BigQuery (2010): Offered as a serverless, fully managed analytics platform on Google Cloud, BigQuery employs columnar storage for real-time querying of massive datasets, supporting multicloud data processing and automatic resource allocation to minimize operational overhead.
  • Snowflake (2012): This cloud data platform, available across AWS, Azure, and Google Cloud, separates storage and compute for independent scaling, using columnar storage for efficient analytics while providing cloning for rapid deployment.
  • Microsoft Azure Synapse Analytics (2019, formerly SQL Data Warehouse): An integrated analytics service on Microsoft Azure, it combines enterprise data warehousing with big data processing, using columnar storage in its dedicated SQL pools and offering serverless options for on-demand scaling.
  • Oracle Autonomous Data Warehouse (2018): Deployed on Oracle Cloud Infrastructure, this self-driving data warehouse automates tuning, security, and backups with columnar storage optimized for analytics, allowing elastic scaling and pay-per-use pricing to reduce administrative effort.
  • Rockset (2016): A real-time search and analytics database on cloud platforms including AWS, Rockset uses columnar indexing for sub-second queries, with convergent indexing to support dynamic schemas; it was acquired by OpenAI in 2024 to enhance AI-driven data retrieval.
