RCFile
Within database management systems, the record columnar file, or RCFile, is a data placement structure that determines how relational tables are stored on computer clusters. It is designed for systems that use the MapReduce framework. The RCFile structure comprises a data storage format, a data compression approach, and optimization techniques for reading data. It is designed to meet four requirements of data placement: (1) fast data loading, (2) fast query processing, (3) highly efficient use of storage space, and (4) strong adaptivity to dynamic data access patterns.
RCFile is the result of research and collaborative efforts from Facebook, The Ohio State University, and the Institute of Computing Technology at the Chinese Academy of Sciences.
For example, consider a table in a database that consists of four columns (c1 to c4).
To serialize the table, RCFile partitions it first horizontally and then vertically, rather than only horizontally as a row-oriented DBMS (row-store) does. The horizontal partitioning splits the table into multiple row groups based on the row-group size, a user-specified value that determines the size of each row group. For example, the table mentioned above can be partitioned into two row groups if the user specifies three rows as the size of each row group.
Then, within every row group, RCFile partitions the data vertically, as a column-store does, so that the values of each column in the group are stored together.
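The two partitioning steps can be illustrated with a minimal Python sketch. It is not the actual RCFile on-disk format: the function names and the sample values are hypothetical, and the row-group size is expressed as a row count, as in the example above.

```python
from typing import Any, List, Sequence

Row = Sequence[Any]


def to_row_groups(rows: List[Row], rows_per_group: int) -> List[List[Row]]:
    """Horizontal step: cut the table into consecutive row groups."""
    if rows_per_group <= 0:
        raise ValueError("rows_per_group must be positive")
    return [rows[i:i + rows_per_group]
            for i in range(0, len(rows), rows_per_group)]


def to_columns(row_group: List[Row]) -> List[List[Any]]:
    """Vertical step: gather the values of each column of one row group."""
    return [list(column) for column in zip(*row_group)]


def rcfile_layout(rows: List[Row], rows_per_group: int) -> List[List[List[Any]]]:
    """Apply the horizontal step, then the vertical step inside each group."""
    return [to_columns(group) for group in to_row_groups(rows, rows_per_group)]


# Hypothetical table with four columns (c1 to c4); values are illustrative only.
table = [
    (11, 12, 13, 14),
    (21, 22, 23, 24),
    (31, 32, 33, 34),
    (41, 42, 43, 44),
    (51, 52, 53, 54),
]

layout = rcfile_layout(table, rows_per_group=3)
for number, group in enumerate(layout, start=1):
    print(f"row group {number}: {group}")
```

With three rows per group, the five sample rows fall into two row groups, and inside each group the values of c1 come first, then c2, and so on.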
Within each row group, columns are compressed to reduce storage space usage. Because the values of a column are stored adjacently, patterns in the column can be detected and a suitable compression algorithm can be selected to achieve a high compression ratio.
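As a rough illustration of per-column compression, the sketch below compresses each column of one row group independently. It uses Python's standard `zlib` module as a stand-in codec; the newline-joined text encoding, the helper names, and the sample columns are assumptions made for this example and do not describe RCFile's real encoding.

```python
import zlib
from typing import Dict, List


def compress_column(values: List[str]) -> bytes:
    """Join one column's values and compress them as a single unit.

    Assumes no value contains a newline (illustrative encoding only).
    """
    raw = "\n".join(values).encode("utf-8")
    return zlib.compress(raw)


def decompress_column(blob: bytes) -> List[str]:
    """Inverse of compress_column."""
    text = zlib.decompress(blob).decode("utf-8")
    return text.split("\n") if text else []


# Hypothetical row group: one highly repetitive column, one varied column.
row_group: Dict[str, List[str]] = {
    "country": ["US"] * 1000,
    "user_id": [str(i) for i in range(1000)],
}

compressed = {name: compress_column(values) for name, values in row_group.items()}

for name, values in row_group.items():
    raw_size = len("\n".join(values).encode("utf-8"))
    print(f"{name}: {raw_size} bytes raw, {len(compressed[name])} bytes compressed")
```

Because each column is compressed separately, a column with a strong pattern (here the repeated country code) is not diluted by unrelated values from other columns.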
A column-store is more efficient when a query requires only a subset of columns, because it reads only the necessary columns from disk, whereas a row-store reads entire rows.
RCFile combines the merits of row-store and column-store through horizontal-vertical partitioning. With horizontal partitioning, RCFile places all columns of a row on a single machine and thus eliminates the extra network cost of constructing a row. With vertical partitioning, RCFile reads only the columns a query needs from disk and thus eliminates unnecessary local I/O. Moreover, within every row group, data can be compressed with the compression algorithms used in column-stores.
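The combined effect can be sketched as a reader that touches only the requested columns of each row group and rebuilds rows locally. This is a simplified in-memory model, not the real RCFile reader: `write_row_group`, `scan`, the dictionary layout, and the sample values are hypothetical, and `zlib` again stands in for the codec.

```python
import zlib
from typing import Dict, Iterator, List, Sequence, Tuple

# One row group: column name -> independently compressed column bytes.
RowGroup = Dict[str, bytes]


def write_row_group(rows: List[Sequence[object]], names: List[str]) -> RowGroup:
    """Store each column of the given rows as its own compressed blob."""
    group: RowGroup = {}
    for index, name in enumerate(names):
        raw = "\n".join(str(row[index]) for row in rows).encode("utf-8")
        group[name] = zlib.compress(raw)
    return group


def scan(groups: List[RowGroup], wanted: List[str]) -> Iterator[Tuple[str, ...]]:
    """Yield rows restricted to `wanted`, decompressing only those columns."""
    for group in groups:
        columns = [
            zlib.decompress(group[name]).decode("utf-8").split("\n")
            for name in wanted  # columns not requested are never decompressed
        ]
        # All columns of a row live in the same row group, so a row is
        # rebuilt locally by zipping the selected columns.
        yield from zip(*columns)


names = ["c1", "c2", "c3", "c4"]
# Hypothetical sample rows; values are illustrative only.
rows = [
    (11, 12, 13, 14),
    (21, 22, 23, 24),
    (31, 32, 33, 34),
    (41, 42, 43, 44),
    (51, 52, 53, 54),
]
groups = [write_row_group(rows[:3], names), write_row_group(rows[3:], names)]

result = list(scan(groups, ["c1", "c3"]))
print(result)
```

In this model a query over c1 and c3 never decompresses c2 or c4, which mirrors the local I/O saving described above, while row reconstruction needs no data from another row group or machine.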