from Wikipedia
Extensible Storage Engine
Other names: JET Blue
Developer: Microsoft
Initial release: 1994
Repository: GitHub (open-sourced in January 2021)
Written in: C++
Operating system: Microsoft Windows
Platform: IA-32, x86-64, ARM and Itanium (and historically DEC Alpha, MIPS, and PowerPC)
Type: Database engine
License: MIT License
Website: docs.microsoft.com/en-us/windows/win32/extensible-storage-engine/extensible-storage-engine

Extensible Storage Engine (ESE), also known as JET Blue, is an ISAM (indexed sequential access method) data storage technology from Microsoft. ESE is the core of Microsoft Exchange Server, Active Directory, and Windows Search. It is also used by a number of Windows components including Windows Update client and Help and Support Center. Its purpose is to allow applications to store and retrieve data via indexed and sequential access.

ESE provides transacted data update and retrieval. A crash recovery mechanism is provided so that data consistency is maintained even in the event of a system crash. Transactions in ESE are highly concurrent making ESE suitable for server applications. ESE caches data intelligently to ensure high performance access to data. In addition, ESE is lightweight making it suitable for auxiliary applications.

The ESE Runtime (ESENT.DLL) has shipped in every Windows release since Windows 2000, with a native x64 version of the ESE runtime shipping with the x64 versions of Windows XP and Windows Server 2003. Microsoft Exchange, up to Exchange 2003, shipped with only the 32-bit edition, as that was the only supported platform. Starting with Exchange 2007, it ships with the 64-bit edition.

Databases


A database is both a physical and logical grouping of data. An ESE database looks like a single file to Windows. Internally the database is a collection of 2, 4, 8, 16, or 32 KB pages (16 and 32 KB page options are only available in Windows 7 and Exchange 2010),[1] arranged in a balanced B-tree structure.[2] These pages contain metadata describing the data contained within the database, the data itself, indexes to persist interesting orders of the data, and other information. This information is intermixed within the database file, but efforts are made to keep data that is used together clustered together within the database. An ESE database may contain up to 2^32 pages, or 16 terabytes of data,[3] for 8-kilobyte pages.

ESE databases are organized into groups called instances. Most applications use a single instance, but an application can also use multiple instances. The significance of the instance is that it associates a single recovery log series with one or more databases. Currently, up to 6 user databases may be attached to an ESE instance at any time. Each separate process using ESE may have up to 1024 ESE instances.

A database is portable in that it can be detached from one running ESE instance and later attached to the same or a different running instance. While detached, a database may be copied using standard Windows utilities. The database cannot be copied while it is being actively used since ESE opens database files exclusively. A database may physically reside on any device supported for directly addressable I/O operations by Windows.
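
The instance-to-database flow can be sketched with the ESENT C API (esent.h, linked against esent.lib). This is a minimal illustration for an ANSI build; the instance name, file name, and abbreviated error handling are placeholders, not part of the article.

```c
#include <windows.h>
#include <esent.h>   /* link against esent.lib */

/* Minimal sketch: create an ESE instance, start a session, and attach a database.
   The instance name and database path are arbitrary examples. */
int open_example_database(void)
{
    JET_INSTANCE instance = 0;
    JET_SESID    sesid    = 0;
    JET_DBID     dbid     = 0;
    JET_ERR      err;

    /* An instance owns the recovery log series shared by its attached databases. */
    err = JetCreateInstance(&instance, "example-instance");
    if (err != JET_errSuccess) return err;

    err = JetInit(&instance);                 /* runs crash recovery if needed */
    if (err != JET_errSuccess) return err;

    err = JetBeginSession(instance, &sesid, NULL, NULL);
    if (err != JET_errSuccess) return err;

    /* Attach an existing database file, then open it for this session. */
    err = JetAttachDatabase(sesid, "example.edb", 0);
    if (err == JET_errSuccess)
        err = JetOpenDatabase(sesid, "example.edb", NULL, &dbid, 0);

    /* ... use the database ... */

    JetCloseDatabase(sesid, dbid, 0);
    JetEndSession(sesid, 0);
    JetTerm(instance);
    return err;
}
```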

Tables


A table is a homogeneous collection of records, where each record has the same set of columns. Each table is identified by a table name, whose scope is local to the database in which the table is contained. The amount of disk space allocated to a table within a database is determined by a parameter given when the table is created with the CreateTable operation. Tables grow automatically in response to data creation.
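
A minimal sketch of table creation with the C API follows; the table name, column names, initial page count, and density are illustrative assumptions.

```c
#include <windows.h>
#include <esent.h>

/* Minimal sketch: create a table with a fixed Long column and a variable Text column. */
JET_ERR create_people_table(JET_SESID sesid, JET_DBID dbid)
{
    JET_TABLEID   tableid;
    JET_COLUMNDEF coldef = { sizeof(JET_COLUMNDEF) };
    JET_COLUMNID  colid;
    JET_ERR       err;

    /* 16 initial pages, 80% page density (space reserved for future inserts). */
    err = JetCreateTable(sesid, dbid, "people", 16, 80, &tableid);
    if (err != JET_errSuccess) return err;

    coldef.coltyp = JET_coltypLong;          /* 4-byte signed integer */
    coldef.grbit  = JET_bitColumnFixed;
    err = JetAddColumn(sesid, tableid, "id", &coldef, NULL, 0, &colid);
    if (err != JET_errSuccess) return err;

    coldef.coltyp = JET_coltypText;          /* variable length, <= 255 bytes */
    coldef.cbMax  = 255;
    coldef.grbit  = 0;
    err = JetAddColumn(sesid, tableid, "name", &coldef, NULL, 0, &colid);

    JetCloseTable(sesid, tableid);
    return err;
}
```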

Tables have one or more indexes. There must be at least one clustered index for record data. When no clustered index is defined by the application, an artificial index is used which orders and clusters records by the chronological order of record insertion. Indexes are defined to persist interesting orders of data, and allow both sequential access to records in index order, and direct access to records by index column values. Clustered indexes in ESE must also be primary, meaning that the index key must be unique.

Clustered and non-clustered indexes are represented using B+ trees. If an insert or update operation causes a page to overflow, the page is split: a new page is allocated and is logically chained in between the two previously adjacent pages. Since this new page is not physically adjacent to its logical neighbors, access to it is not as efficient. ESE has an on-line compaction feature that re-compacts data. If a table is expected to be frequently updated, space may be reserved for future insertions by specifying an appropriate page density when creating a table or index. This allows split operations to be avoided or postponed.

Records and columns


A record is an associated set of column values. Records are inserted and updated via Update operations and can be deleted via Delete operations. Columns are set and retrieved via SetColumns and RetrieveColumns operations, respectively. The maximum size of a record is 8110 bytes for 8 kilobyte pages with the exception of long value columns. Column types of LongText and LongBinary do not contribute significantly to this size limitation, and records can hold data much larger than a database page size when data is stored in long value columns. When a long value reference is stored in a record, only 9 bytes of in-record data are required. These long values may themselves be up to 2 gigabytes (GB) in size.
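
The Update/SetColumns sequence can be sketched as follows with the C API; the table, column ids, and error handling are illustrative assumptions.

```c
#include <string.h>
#include <windows.h>
#include <esent.h>

/* Minimal sketch: insert one record. tableid, colidId and colidName are assumed to
   come from JetOpenTable / JetGetTableColumnInfo; the names are illustrative. */
JET_ERR insert_person(JET_SESID sesid, JET_TABLEID tableid,
                      JET_COLUMNID colidId, JET_COLUMNID colidName,
                      long id, const char *name)
{
    JET_ERR err;

    err = JetBeginTransaction(sesid);
    if (err != JET_errSuccess) return err;

    /* Open the copy buffer for a new record. */
    err = JetPrepareUpdate(sesid, tableid, JET_prepInsert);
    if (err == JET_errSuccess)
        err = JetSetColumn(sesid, tableid, colidId, &id, sizeof(id), 0, NULL);
    if (err == JET_errSuccess)
        err = JetSetColumn(sesid, tableid, colidName, name,
                           (unsigned long)strlen(name), 0, NULL);
    if (err == JET_errSuccess)
        err = JetUpdate(sesid, tableid, NULL, 0, NULL);   /* store the copy buffer */

    if (err == JET_errSuccess)
        err = JetCommitTransaction(sesid, 0);
    else
        JetRollback(sesid, 0);
    return err;
}
```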

Records are typically uniform in that each record has a set of values for the same set of columns. In ESE, it is also possible to define many columns for a table, and yet have any given record contain only a small number of non-NULL column values. In this sense, a table can also be a collection of heterogeneous records.

ESE supports a wide range of column values, ranging in size from 1 bit to 2 GB. Choosing the correct column type is important because the type of a column determines many of its properties, including its ordering for indexes. The following data types are supported by ESE:

Column types

  • Bit: ternary value (NULL, 0, or 1)
  • Unsigned Byte: 1-byte unsigned integer
  • Short: 2-byte signed integer
  • Unsigned Short: 2-byte unsigned integer
  • Long: 4-byte signed integer
  • Unsigned Long: 4-byte unsigned integer
  • LongLong: 8-byte signed integer
  • UnsignedLongLong: 8-byte unsigned integer
  • Currency: 8-byte signed integer
  • IEEE Single: 4-byte floating-point number
  • IEEE Double: 8-byte floating-point number
  • DateTime: 8-byte date-time (integral date, fractional time)
  • GUID: 16-byte unique identifier
  • Binary: binary string, length <= 255 bytes
  • Text: ANSI or Unicode string, length <= 255 bytes
  • Long Binary: large binary string, length < 2 GB
  • Long Text: large ANSI or Unicode string, length < 2 GB

Fixed, variable and tagged columns


Each ESE table can define up to 127 fixed length columns, 128 variable length columns and 64,993 tagged columns.

  • Fixed columns are essentially columns that take up the same amount of space in each record, regardless of their value. A fixed column takes up 1 bit to represent the NULLity of the column value and a fixed amount of space in each record in which that column, or a later-defined fixed column, is set.
  • Variable columns are essentially columns that take up a variable amount of space in each record in which they are set, depending upon the size of the particular column value. A variable column takes up 2 bytes to determine NULLity and size, and a variable amount of space in each record in which that column is set.
  • Tagged columns are columns that take no space whatsoever if they are not set in a record. They may be single-valued but can also be multi-valued; the same tagged column may have multiple values in a single record. When tagged columns are set in a record, each instance of a tagged column takes approximately 4 bytes of space in addition to the size of the tagged column instance value. When the number of instances of a single tagged column is large, the overhead per tagged column instance is approximately 2 bytes. Tagged columns are ideal for sparse columns because they take no space whatsoever if they are not set. If a multi-valued tagged column is indexed, the index will contain one entry for the record for each value of the tagged column.

For a given table, columns fall into one of two categories: those that occur exactly once in each record, possibly with a few NULL values; and those that occur rarely, or that may have multiple occurrences in a single record. Fixed and variable columns belong to the former category, while tagged columns belong to the latter. The internal representation of the two column categories is different, and it is important to understand the trade-offs between them. Fixed and variable columns are typically represented in every record, even when the occurrence has a NULL value, and can be quickly addressed via an offset table. Tagged column occurrences are preceded by a column identifier, and a given column is located by binary searching the set of tagged columns.
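
The three categories correspond to grbit flags on the column definition passed to JetAddColumn. The sketch below is illustrative; the particular column types chosen are assumptions.

```c
#include <string.h>
#include <windows.h>
#include <esent.h>

/* Minimal sketch of the three column categories as JET_COLUMNDEF definitions
   to pass to JetAddColumn. The column type choices are illustrative. */
void fill_column_defs(JET_COLUMNDEF *fixedCol, JET_COLUMNDEF *varCol, JET_COLUMNDEF *taggedCol)
{
    memset(fixedCol,  0, sizeof(*fixedCol));
    memset(varCol,    0, sizeof(*varCol));
    memset(taggedCol, 0, sizeof(*taggedCol));
    fixedCol->cbStruct = varCol->cbStruct = taggedCol->cbStruct = sizeof(JET_COLUMNDEF);

    /* Fixed: same space in every record in which it (or a later fixed column) is set. */
    fixedCol->coltyp = JET_coltypLong;
    fixedCol->grbit  = JET_bitColumnFixed;

    /* Variable: a 2-byte size prefix plus however many bytes the value needs. */
    varCol->coltyp = JET_coltypText;
    varCol->cbMax  = 255;

    /* Tagged and multi-valued: no space at all when not set; may hold several values. */
    taggedCol->coltyp = JET_coltypLongText;
    taggedCol->grbit  = JET_bitColumnTagged | JET_bitColumnMultiValued;
}
```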

Long values


Column types of Long Text and Long Binary are large binary objects. They are stored in a separate B+ tree from the clustered index, keyed by long value ID and byte offset. ESE supports append, byte-range overwrite, and set-size operations for these columns. ESE also has a single-instance store feature in which multiple records may reference the same large binary object, as though each record had its own copy of the information, i.e. without inter-record locking conflicts. The maximum size of a Long Text or Long Binary column value is 2 GB.

Version, auto-increment and escrow columns


Version columns are automatically incremented by ESE each time a record containing this column is modified via an Update operation. This column cannot be set by the application; it can only be read. Version columns can be used, for example, to determine whether an in-memory copy of a given record needs to be refreshed: if the value in the table record is greater than the value in a cached copy, the cached copy is known to be out of date. Version columns must be of type Long.

Auto increment columns are automatically set by ESE such that the value contained in the column is unique for every record in the table. These columns, like version columns, cannot be set by the application. Auto increment columns are read only, and are automatically set when a new record is inserted into a table via an Update operation. The value in the column remains constant for the life of the record, and only one auto increment column is allowed per table. Auto increment columns may be of type Long or type Currency.

Escrow columns can be modified via an EscrowUpdate operation. Escrowed updates are numeric delta operations, such as adding 2 to a value or subtracting 1 from a value. Escrow columns must be of type Long. ESE tracks the change in a value rather than the end value of an update. Multiple sessions may each have outstanding changes made via EscrowUpdate to the same value, because ESE can determine the actual end value regardless of which transactions commit and which roll back. This allows multiple users to concurrently update a column by making numeric delta changes. Optionally, the database engine can delete records whose escrow column value reaches zero. A common use for such an escrow column is a reference counter: many threads increment or decrement the value without locks, and when the counter reaches zero, the record is automatically deleted.
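
A minimal sketch of an escrowed update with the C API, assuming a hypothetical reference-count column of type Long:

```c
#include <windows.h>
#include <esent.h>

/* Minimal sketch: atomically add a delta to an escrow (JET_coltypLong) column.
   colidRefCount is an assumed, illustrative column id. */
JET_ERR add_to_ref_count(JET_SESID sesid, JET_TABLEID tableid,
                         JET_COLUMNID colidRefCount, long delta)
{
    long previous = 0;
    /* No JetPrepareUpdate is needed; the engine records the delta, so concurrent
       sessions can adjust the same value without write conflicts. */
    return JetEscrowUpdate(sesid, tableid, colidRefCount,
                           &delta, sizeof(delta),
                           &previous, sizeof(previous), NULL, 0);
}
```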

Indexes


An index is a persisted ordering of records in a table. Indexes are used for both sequential access to rows in the order defined, and for direct record navigation based on indexed column values. The order defined by an index is described in terms of an array of columns, in precedence order. This array of columns is also called the index key. Each column is called an index segment. Each index segment may be either ascending or descending, in terms of its ordering contribution. Any number of indexes may be defined for a table. ESE provides a rich set of indexing features.

Clustered indexes


One index may be specified as the clustered, or primary, index. In ESE, the clustered index must be unique and is referred to as the primary index. Other indexes are described as non-clustered, or secondary, indexes. Primary indexes are different from secondary indexes in that the index entry is the record itself, rather than a logical pointer to the record. Secondary indexes have primary keys at their leaves to logically link to the record in the primary index. In other words, the table is physically clustered in primary index order. Retrieval of non-indexed record data in primary index order is generally much faster than in secondary index order, because a single disk access can bring into memory multiple records that will be accessed close together in time; the same disk access satisfies multiple record access operations. However, the insertion of a record into the middle of an index, as determined by the primary index order, may be much slower than appending it to the end of the index. Update frequency must therefore be weighed against retrieval patterns during table design. If no primary index is defined for a table, then an implicit primary index, called a database key (DBK) index, is created. The DBK is simply a unique ascending number incremented each time a record is inserted. As a result, the physical order of records in a DBK index is chronological insertion order, and new records are always added at the end of the table. If an application wishes to cluster data on a non-unique index, this is possible by adding an autoincrement column to the end of the non-unique index definition.
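
Declaring a primary index with the C API might look like the following sketch; the index name, key column, and density are illustrative assumptions.

```c
#include <windows.h>
#include <esent.h>

/* Minimal sketch: declare a unique primary (clustered) index over an assumed "id"
   column, ascending. The key string is double null-terminated: "+id\0\0". */
JET_ERR create_primary_index(JET_SESID sesid, JET_TABLEID tableid)
{
    const char key[] = "+id\0";   /* the array literal supplies the second NUL */
    return JetCreateIndex(sesid, tableid, "id_primary",
                          JET_bitIndexPrimary | JET_bitIndexUnique,
                          key, sizeof(key), 100 /* density */);
}
```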

Indexing over multi-valued columns


Indexes can be defined over multi-valued columns. Multiple entries may exist in these indexes for records with multiple values for the indexed column. Multi-valued columns may be indexed in conjunction with single valued columns. When two or more multi-valued columns are indexed together, then the multi-valued property is only honored for the first multi-value column in the index. Lower precedence columns are treated as though they were single valued.

Sparse indexes


Indexes can also be defined to be sparse. A sparse index does not necessarily have an entry for every record in the table. There are a number of options in defining a sparse index: records can be excluded from an index when the entire index key is NULL, when any key segment is NULL, or when just the first key segment is NULL. Indexes can also have conditional columns. These columns never appear within an index but can cause a record not to be indexed when the conditional column is either NULL or non-NULL.

Tuple indexes


Indexes can also be defined to include one entry for each sub-string of a Text or Long Text column. These indexes are called tuple indexes. They are used to speed queries with sub-string matching predicates. Tuple indexes can only be defined for Text columns. For example, if a Text column value is “I love JET Blue”, and the index is configured to have a minimum tuple size of 4 characters and a maximum tuple length of 10 characters, then the following sub-strings will be indexed:

“I love JET”

“ love JET ”
“love JET B”
“ove JET Bl”
“ve JET Blu”
“e JET Blue”
“ JET Blue”
“JET Blue”
“ET Blue”
“T Blue”
“ Blue”
“Blue”

Even though tuple indexes can be very large, they can significantly speed queries of the form: find all records containing “JET Blue”. They can be used for sub-strings longer than the maximum tuple length by dividing the search sub-string into maximum tuple length search strings and intersecting the results. They can be used for exact matches for strings as long as the maximum tuple length or as short as the minimum tuple length, with no index intersection. For more information on performing index intersection in ESE see Index Intersection. Tuple indexes cannot speed queries where the search string is shorter than the minimum tuple length.

Transactions


A transaction is a logical unit of processing delimited by BeginTransaction and CommitTransaction, or Rollback, operations. All updates performed during a transaction are atomic; either they all appear in the database at the same time or none appear. Any subsequent updates by other transactions are invisible to a transaction. However, a transaction can update only data that has not changed in the meantime; otherwise the operation fails at once, without waiting. Read-only transactions never need to wait, and update transactions can interfere only with other updating transactions. Transactions that are terminated by Rollback, or by a system crash, leave no trace on the database. In general, the data state is restored on Rollback to what it was prior to BeginTransaction.

Transactions may be nested up to 7 levels, with one additional level reserved for ESE internal use. This means that part of a transaction may be rolled back without needing to roll back the entire transaction; a CommitTransaction of a nested transaction merely signifies the success of one phase of processing, and the outer transaction may yet fail. Changes are committed to the database only when the outermost transaction is committed; this is known as committing to transaction level 0. When the transaction commits to transaction level 0, data describing the transaction is synchronously flushed to the log to ensure that the transaction will be completed even in the event of a subsequent system crash. Synchronously flushing the log makes ESE transactions durable. However, in some cases applications wish to order their updates without immediately guaranteeing that the changes will be durable. Here, applications can commit changes lazily with JET_bitCommitLazyFlush.

ESE supports a concurrency control mechanism called multi-versioning. In multi-versioning, every transaction queries a consistent view of the entire database as it was at the time the transaction started. The only updates it encounters are those made by it. In this way, each transaction operates as though it was the only active transaction running on the system, except in the case of write conflicts. Since a transaction may make changes based on data read that has already been updated in another transaction, multi-versioning by itself does not guarantee serializable transactions. However, serializability can be achieved when desired by simply using explicit record read locks to lock read data that updates are based upon. Both read and write locks may be explicitly requested with the GetLock operation.
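
A common pattern under snapshot isolation is to retry a transaction when ESE reports a write conflict. The sketch below is illustrative; the callback and retry limit are assumptions, not part of the article.

```c
#include <windows.h>
#include <esent.h>

/* Minimal sketch: wrap an update in a transaction and retry on a write conflict,
   which ESE reports when another snapshot transaction already changed the same
   record. do_update is an assumed application callback. */
JET_ERR run_with_retry(JET_SESID sesid, JET_ERR (*do_update)(JET_SESID))
{
    int attempt;
    for (attempt = 0; attempt < 5; attempt++)
    {
        JET_ERR err = JetBeginTransaction(sesid);
        if (err != JET_errSuccess) return err;

        err = do_update(sesid);
        if (err == JET_errSuccess)
            return JetCommitTransaction(sesid, 0);   /* durable commit at level 0 */

        JetRollback(sesid, 0);
        if (err != JET_errWriteConflict)
            return err;                              /* only conflicts are retried */
    }
    return JET_errWriteConflict;
}
```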

In addition, an advanced concurrency control feature known as escrow locking is supported by ESE. Escrow locking is an extremely concurrent update in which a numeric value is changed in a relative fashion, i.e. by adding or subtracting another numeric value. Escrow updates are non-conflicting even with other concurrent escrow updates to the same datum. This is possible because the operations supported are commutative and can be independently committed or rolled back. As a result, they do not interfere with concurrent update transactions. This feature is often used for maintained aggregations.

ESE also extends transaction semantics from data manipulation operations to data definition operations. It is possible to add an index to a table and have concurrently running transactions update the same table without any transaction lock contention whatsoever. Later, when these transactions are complete, the newly created index is available to all transactions and has entries for record updates made by other transactions that could not perceive the presence of the index when the updates took place. Data definition operations may be performed with all the features expected of the transaction mechanism for record updates. Data definition operations supported in this fashion include AddColumn, DeleteColumn, CreateIndex, DeleteIndex, CreateTable and DeleteTable.

Cursor navigation and the copy buffer


A cursor is a logical pointer within a table index. The cursor may be positioned on a record, before the first record, after the last record or even between records. If a cursor is positioned before or after a record, there is no current record. It is possible to have multiple cursors into the same table index. Many record and column operations are based on the cursor position. Cursor position can be moved sequentially by Move operations or directly using index keys with Seek operations. Cursors can also be moved to a fractional position within an index. In this way, the cursor can be quickly moved to a thumb bar position. This operation is performed with the same speed as a Seek operation. No intervening data must be accessed.

Each cursor has a copy buffer in order to create a new record, or modify an existing record, column by column. This is an internal buffer whose contents can be changed with SetColumns operations. Modifications of the copy buffer do not automatically change the stored data. The contents of the current record can be copied into the copy buffer using the PrepareUpdate operation, and Update operations store the contents of the copy buffer as a record. The copy buffer is implicitly cleared on a transaction commit or rollback, as well as on navigation operations. RetrieveColumns may be used to retrieve column data either from the record or from the copy buffer, if one exists.
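
Cursor positioning by key and sequential scanning can be sketched as follows; the column id and buffer handling are illustrative assumptions.

```c
#include <windows.h>
#include <esent.h>

/* Minimal sketch: position a cursor on the record whose key equals `id` in the
   current index, then walk forward. colidName is an assumed column id. */
JET_ERR seek_and_scan(JET_SESID sesid, JET_TABLEID tableid,
                      JET_COLUMNID colidName, long id)
{
    char          name[256];
    unsigned long cb;
    JET_ERR       err;

    /* Build a normalized key in the cursor's key buffer, then seek on it. */
    err = JetMakeKey(sesid, tableid, &id, sizeof(id), JET_bitNewKey);
    if (err != JET_errSuccess) return err;
    err = JetSeek(sesid, tableid, JET_bitSeekEQ);

    /* Walk the index sequentially from the found position. */
    while (err == JET_errSuccess)
    {
        err = JetRetrieveColumn(sesid, tableid, colidName,
                                name, sizeof(name) - 1, &cb, 0, NULL);
        if (err != JET_errSuccess) break;
        name[cb] = '\0';
        /* ... use the record ... */
        err = JetMove(sesid, tableid, JET_MoveNext, 0);
    }
    return (err == JET_errNoCurrentRecord) ? JET_errSuccess : err;
}
```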

Query processing


ESE applications invariably query their data. This section describes features and techniques for writing query processing logic on ESE.

Sorts and temporary tables


ESE provides a sort capability in the form of temporary tables. The application inserts data records into the sort process one record at a time, and then retrieves them one record at a time in sorted order. Sorting is actually performed between the last record insertion and the first record retrieval. Temporary tables can be used for partial and complete result sets as well. These tables can offer the same features as base tables including the ability to navigate sequentially or directly to rows using index keys matching the sort definition. Temporary tables can also be updatable for computation of complex aggregates. Simple aggregates can be computed automatically with a feature similar to sorting where the desired aggregate is a natural result of the sort process.
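
A sort can be set up as a temporary table with a single key column, as in the sketch below (assuming the JetOpenTempTable call from the ESENT C API); records would then be inserted with the usual PrepareUpdate/SetColumn/Update sequence and read back in key order.

```c
#include <windows.h>
#include <esent.h>

/* Minimal sketch: a temporary table used as a sort. Records are inserted one at
   a time and come back in key order. The single-column layout is illustrative. */
JET_ERR open_long_sort(JET_SESID sesid, JET_TABLEID *ptableid, JET_COLUMNID *pcolid)
{
    JET_COLUMNDEF coldef = { sizeof(JET_COLUMNDEF) };

    coldef.coltyp = JET_coltypLong;
    coldef.grbit  = JET_bitColumnTTKey;      /* this column is the sort key */

    /* One key column; the temp table can then be scrolled like a base table. */
    return JetOpenTempTable(sesid, &coldef, 1, JET_bitTTScrollable,
                            ptableid, pcolid);
}
```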

Covering indexes


Retrieving column data directly from secondary indexes is an important performance optimization. Columns may be retrieved directly from secondary indexes, without accessing the data records, via the RetrieveFromIndex flag on the RetrieveColumns operation. It is much more efficient to retrieve columns from a secondary index, than from the record, when navigating by the index. If the column data were retrieved from the record, then an additional navigation is necessary to locate the record by the primary key. This may result in additional disk accesses. When an index provides all columns needed then it is called a covering index. Note that columns defined in the table primary index are also found in secondary indexes and can be similarly retrieved using JET_bitRetrieveFromPrimaryBookmark.
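
Retrieving a key column straight from the current secondary-index entry can be sketched as follows; the column id is an illustrative assumption.

```c
#include <windows.h>
#include <esent.h>

/* Minimal sketch: while positioned on a secondary index, read an indexed column
   straight out of the index entry instead of fetching the record. */
JET_ERR read_from_index(JET_SESID sesid, JET_TABLEID tableid,
                        JET_COLUMNID colidIndexed, long *pvalue)
{
    unsigned long cb;
    return JetRetrieveColumn(sesid, tableid, colidIndexed,
                             pvalue, sizeof(*pvalue), &cb,
                             JET_bitRetrieveFromIndex, NULL);
}
```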

Index keys are stored in normalized form which can be, in many cases, denormalized to the original column value. Normalization is not always reversible. For example, Text and Long Text column types cannot be denormalized. In addition, index keys may be truncated when column data is very long. In cases where columns cannot be retrieved directly from secondary indexes, the record can always be accessed to retrieve the necessary data.

Index intersection


Queries often involve a combination of restrictions on data. An efficient means of processing a restriction is to use an available index. However, if a query involves multiple restrictions then applications often process the restrictions by walking the full index range of the most restrictive predicate satisfied by a single index. Any remaining predicate, called the residual predicate, is processed by applying the predicate to the record itself. This is a simple method but has the disadvantage of potentially having to perform many disk accesses to bring records into memory to apply the residual predicate.

Index intersection is an important query mechanism in which multiple indexes are used together to process a complex restriction more efficiently. Instead of using only a single index, index ranges on multiple indexes are combined to produce a much smaller set of records on which any residual predicate can be applied. ESE makes this easy by supplying an IntersectIndexes operation. This operation accepts a series of index ranges on indexes from the same table and returns a temporary table of primary keys that can be used to navigate to the base table records that satisfy all index predicates.
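
A sketch of the corresponding C API call, JetIntersectIndexes, follows; it assumes two cursors on secondary indexes of the same table with index ranges already established, which is an illustrative setup.

```c
#include <windows.h>
#include <esent.h>

/* Minimal sketch: intersect two secondary-index ranges. Each cursor is assumed to
   already have an index range set with JetMakeKey/JetSeek/JetSetIndexRange. The
   result is a temporary table of primary-key bookmarks. */
JET_ERR intersect_two_ranges(JET_SESID sesid,
                             JET_TABLEID cursorA, JET_TABLEID cursorB,
                             JET_RECORDLIST *recordlist)
{
    JET_INDEXRANGE ranges[2];

    ranges[0].cbStruct = sizeof(JET_INDEXRANGE);
    ranges[0].tableid  = cursorA;
    ranges[0].grbit    = JET_bitRecordInIndex;

    ranges[1].cbStruct = sizeof(JET_INDEXRANGE);
    ranges[1].tableid  = cursorB;
    ranges[1].grbit    = JET_bitRecordInIndex;

    recordlist->cbStruct = sizeof(JET_RECORDLIST);
    return JetIntersectIndexes(sesid, ranges, 2, recordlist, 0);
}
```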

Pre-joined tables


A join is a common operation on a normalized table design, where logically related data is brought back together for use in an application. Joins can be expensive operations because many data accesses may be needed to bring related data into memory. This effort can be optimized in some cases by defining a single base table that contains data for two or more logical tables. The column set of the base table is the union of the column sets of these logical tables. Tagged columns make this possible because of their good handling of both multi-valued and sparse valued data. Since related data is stored together in the same record, it is accessed together thereby minimizing the number of disk accesses to perform the join. This process can be extended to a large number of logical tables as ESE can support up to 64,993 tagged columns. Since indexes can be defined over multi-valued columns, it is still possible to index ‘interior’ tables. However, some limitations exist and applications should consider pre-joining carefully before employing this technique.

Logging and crash recovery


The logging and recovery feature of ESE supports guaranteed data integrity and consistency in the event of a system crash. Logging is the process of redundantly recording database update operations in a log file. The log file structure is very robust against system crashes. Recovery is the process of using this log to restore databases to a consistent state after a system crash.

Transaction operations are logged and the log is flushed to disk during each commit to transaction level 0. This allows the recovery process to redo updates made by transactions which commit to transaction level 0, and undo changes made by transactions which did not commit to transaction level 0. This type of recovery scheme is often referred to as a ‘roll-forward/roll-backward’ recovery scheme. Logs can be retained until the data is safely copied via a backup process described below, or logs can be reused in a circular fashion as soon as they are no longer needed for recovery from system crash. Circular logging minimizes the amount of disk space needed for the log but has implications on the ability to recreate a data state in the event of a media failure.
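
Logging behaviour is selected per instance before JetInit. The sketch below shows circular logging being enabled with the C API; the log path is an illustrative assumption.

```c
#include <windows.h>
#include <esent.h>

/* Minimal sketch: choose the logging mode before JetInit. Values and paths are
   illustrative; parameters apply to the given instance. */
JET_ERR configure_logging(JET_INSTANCE *pinstance)
{
    JET_ERR err;

    /* Reuse log files in a ring instead of retaining them for backup replay. */
    err = JetSetSystemParameter(pinstance, (JET_SESID)0,
                                JET_paramCircularLog, 1, NULL);
    if (err != JET_errSuccess) return err;

    /* Place log files and the checkpoint in an application-chosen directory. */
    return JetSetSystemParameter(pinstance, (JET_SESID)0,
                                 JET_paramLogFilePath, 0, ".\\logs\\");
}
```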

Backup and restore


Logging and recovery also play a role in protecting data from media failure. ESE supports on-line backup where one or more databases are copied, along with log files in a manner that does not affect database operations. Databases can continue to be queried and updated while the backup is being made. The backup is referred to as a ‘fuzzy backup’ because the recovery process must be run as part of backup restoration to restore a consistent set of databases. Both streaming and shadow copy backup are supported.

Streaming backup is a backup method in which copies of all desired database files and the necessary log files are made during the backup process. File copies may be saved directly to tape or to any other storage device. No quiescing of activity of any kind is required with streaming backups. Both the database and log files are checksummed to ensure that no data corruption exists within the data set during the backup process. Streaming backups may also be incremental backups: only the log files are copied, and these can be restored along with a previous full backup to bring all databases to a recent state.

Shadow copy backups are a new high speed backup method. Shadow copy backups are dramatically faster because the copy is virtually made after a brief period of quiescing an application. As subsequent updates are made to the data, the virtual copy is materialized. In some cases, hardware support for shadow copy backups means that actually saving the virtual copies is unnecessary. Shadow copy backups are always full backups.

Restore can be used to apply a single backup, or it can be used to apply a combination of a single full backup with one or more incremental backups. Further, any existing log files can be replayed as well to recreate an entire data set all the way up to the last transaction logged as committed to transaction level 0. Restoration of a backup can be made to any system capable of supporting the original application. It need not be the same machine, or even the same machine configuration. Location of files can be changed as part of the restoration process.

Backup and restore to different hardware


When an ESENT database is created, the physical disk sector size is stored with the database. The physical sector size is expected to remain consistent between sessions; otherwise, an error is reported. When a physical drive is cloned or restored from a drive image to a drive that uses a different physical sector size (Advanced Format Drives), ESENT will report errors.[4]

This is a known issue and Microsoft has hot fixes available. For Windows Vista or Windows Server 2008 see KB2470478.[5] For Windows 7 or Windows Server 2008 R2 see KB982018.[6]

History


JET Blue was originally developed by Microsoft as a prospective upgrade for the JET Red database engine in Microsoft Access, but was never used in this role. Instead, it went on to be used by Exchange Server, Active Directory, File Replication Service (FRS), Security Configuration Editor, Certificate Services, Windows Internet Name Service (WINS) and a host of other Microsoft services, applications and Windows components.[7] For years, it was a private API used by Microsoft only, but has since become a published API that anyone can use.

Work began on the Data Access Engine (DAE) in March 1989 when Allen Reiter joined Microsoft. Over the next year a team of four developers worked for Allen to largely complete the ISAM. Microsoft already had the BC7 ISAM (JET Red) but began the Data Access Engine (DAE) effort to build a more robust database engine as an entry in the then-new client-server architecture realm. In the spring of 1990, the BC7 ISAM and DAE teams were joined to become the Joint Engine Technology (JET) effort, responsible for producing two engines, a v1 (JET Red) and a v2 (JET Blue), that would conform to the same API specification (JET API). DAE became JET Blue for the color of the flag of Israel; BC7 ISAM became JET Red for the color of the flag of Russia. While JET Blue and JET Red were written to the same API specification, they shared no ISAM code whatsoever. They did both support a common query processor, QJET, which later, together with the BC7 ISAM, became synonymous with JET Red.

JET Blue first shipped in 1994 as an ISAM for WINS, DHCP, and the now defunct RPL services in Windows NT 3.5. It shipped again as the storage engine for Microsoft Exchange in 1996. Additional Windows services chose JET Blue as their storage technology and by 2000 every version of Windows began to ship with JET Blue. JET Blue was used by Active Directory and became part of a special set of Windows code called the Trusted Computing Base (TCB). The number of Microsoft applications using JET Blue continues to grow and the JET Blue API was published in 2005 to facilitate usage by an ever-increasing number of applications and services both within and beyond Windows.

A Microsoft Exchange Web Blog entry[8] stated that developers who have contributed to JET Blue include Cheen Liao, Stephen Hecht, Matthew Bellew, Ian Jose, Edward "Eddie" Gilbert, Kenneth Kin Lum, Balasubramanian Sriram, Jonathan Liem, Andrew Goodsell, Laurion Burchall, Andrei Marinescu, Adam Foxman, Ivan Trindev, Spencer Low and Brett Shirley.

In January 2021 Microsoft open sourced ESE.[9] It was posted to GitHub with the permissive MIT License.

Comparison to JET Red


While they share a common lineage, there are vast differences between JET Red and ESE.

  • JET Red is a file sharing technology, while ESE is designed to be embedded in a server application, and does not share files.
  • JET Red makes best effort file recovery, while ESE has write ahead logging and snapshot isolation for guaranteed crash recovery.
  • JET Red before version 4.0 supports only page-level locking, while ESE and JET Red version 4.0 support record-level locking.
  • JET Red supports a wide variety of query interfaces, including ODBC and OLE DB. ESE does not ship with a query engine but instead relies on applications to write their own queries as C ISAM code.
  • JET Red has a maximum database file size of 2 GiB, while ESE has a maximum database file size of 8 TiB with 4 KiB pages, and 16 TiB with 8 KiB pages.

from Grokipedia
The Extensible Storage Engine (ESE), also known as JET Blue, is an advanced indexed sequential access method (ISAM) database engine developed by Microsoft for storing and retrieving structured data in tables using flat binary files. It operates in user mode, providing high-performance access through the esent.dll library included in Windows, and supports databases ranging from 1 MB to over 1 TB in size, with common implementations exceeding 50 GB. Introduced with Windows NT as a successor to earlier JET technologies, ESE was designed to handle denormalized schemas, including wide tables and sparse or multi-valued columns, distinguishing it from the JET Red engine used in Microsoft Access.

Key features include full ACID transaction compliance, robust crash recovery mechanisms, snapshot isolation for concurrent operations, and efficient data caching to optimize performance in lightweight, embedded scenarios. Its architecture emphasizes local, single-process access without built-in remote capabilities; file sharing via SMB is possible but not recommended for production use. ESE powers critical components in several Microsoft products, serving as the core database engine for Microsoft Exchange Server to manage email and calendar data, for Active Directory Domain Services (AD DS) and Lightweight Directory Services (LDS) for directory storage, and for Windows Search to index files in the Windows.edb database.

In February 2021, Microsoft open-sourced ESE under the MIT License, making its source code available on GitHub to facilitate broader development and contributions while maintaining its role in proprietary Windows ecosystem applications. ESE continues to evolve with Windows updates, including support for 32 KB page databases in Windows Server 2025 to enhance scalability. This release highlighted ESE's maturity and reliability, having evolved over two decades to support high-concurrency workloads in enterprise environments.

Introduction and History

Overview

The Extensible Storage Engine (ESE), also known as JET Blue, is an indexed sequential access method (ISAM) database engine developed by Microsoft for storing and retrieving data from tables in a logical sequence. It serves as a high-performance, transactional storage solution primarily for embedded use within applications, powering critical components such as Microsoft Exchange Server, Active Directory, and Windows Search. Key characteristics of ESE include its embeddable design, which integrates via a lightweight DLL directly into application processes without requiring a separate server, and its multi-threaded design, which enables concurrent access to multiple databases. The engine supports full ACID (Atomicity, Consistency, Isolation, Durability) transactions, ensuring reliable data management even in failure scenarios, while handling large-scale datasets with low overhead, typically scaling to over 50 GB and up to 1 TB in demanding environments. ESE utilizes file-based storage, with primary data housed in .edb database files and transaction logs maintained in .log files to facilitate recovery and consistency. Evolved from Microsoft's original JET engine, it introduces enhanced extensibility, permitting the definition of custom column types and indexes tailored to specific application requirements.

Development History

The Extensible Storage Engine (ESE), originally known as JET Blue, was developed in the early 1990s by Microsoft as a high-performance successor to the JET Red engine, which powered Microsoft Access, though it ultimately found primary use in server applications rather than desktop ones. It first shipped as a Windows component in 1995, providing an embedded indexed sequential access method (ISAM) for data storage and retrieval. ESE saw early adoption in Exchange Server 4.0, released in 1996, where it served as the core database engine for email and messaging data management. This integration continued with Exchange Server 5.0 in 1997, enhancing transactional reliability and scalability for enterprise environments.

By 2000, ESE became integral to Windows 2000's Active Directory, storing directory objects in a robust, multi-user format that supported domain management at scale. Subsequent major versions of Exchange brought significant ESE enhancements; Exchange Server 2003 introduced improved storage optimizations for larger deployments, while Exchange Server 2010 delivered key performance gains, including better I/O efficiency and high-availability features for databases exceeding traditional limits. These updates solidified ESE's role in handling terabyte-scale data across Microsoft products.

A pivotal milestone occurred in February 2021, when Microsoft open-sourced ESE on GitHub under the MIT License, releasing the codebase for community review and contributions after more than 25 years of proprietary development. This move enabled broader adoption via its public repository, allowing third-party developers to integrate ESE into custom applications for embedded, transactional storage needs. As of 2025, ESE continues to evolve, with Windows Server 2025 introducing support for 32 KB database pages in Active Directory, up from the previous 8 KB size, to accommodate larger databases, reduce I/O overhead, and improve query performance in modern infrastructures. These advancements have extended ESE's capabilities to petabyte-scale operations in cloud-based deployments, such as managed Exchange environments.

Core Architecture

Databases

In the Extensible Storage Engine (ESE), a database serves as the primary container for data, consisting of a single main file, typically named with a .edb extension, that encapsulates multiple tables. This file acts as the fundamental unit for database operations, including mounting and unmounting, which are essential for making the database accessible to the engine, as well as defining the scope for transactions to ensure atomicity and consistency. Databases are created using API functions such as JetCreateDatabase or JetCreateDatabase2, which initialize the .edb file and attach it to an ESE instance for immediate use.

Management involves attaching existing databases via JetAttachDatabase or JetAttachDatabase2, allowing up to six databases to be simultaneously attached to a single instance for concurrent access by multiple sessions within the same process. ESE supports multi-instance configurations through JetEnableMultiInstance, enabling scalability by running multiple independent database engines in parallel, which is particularly useful for high-throughput server applications. All subsequent data operations, such as creating tables or inserting records, must be performed within the context of an attached and opened database, obtained via JetOpenDatabase after attachment.

The database structure relies on several key file components for integrity and recovery: the primary .edb file stores the persistent data pages, transaction log files with a .log extension record all changes for durability, and checkpoint files (.chk) mark recovery points to facilitate crash recovery by indicating the last consistent state. In modern versions of ESE, each database supports a maximum size of approximately 16 terabytes, providing substantial capacity for large-scale data storage while maintaining performance through efficient page management. Transactional consistency is enforced at the database level, ensuring that operations across tables remain isolated and durable.

Tables

In the Extensible Storage Engine (ESE), tables serve as the primary organizational units within a database, consisting of collections of records structured according to a predefined schema that specifies column definitions and constraints. Tables are created using API calls such as JetCreateTable or the more comprehensive JetCreateTableColumnIndex3, which uses the JET_TABLECREATE3 structure to define the table's properties, including an array of column creations (rgcolumncreate) and initial indexes (rgindexcreate). During creation, the table is opened exclusively for the calling session, returning a JET_TABLEID handle for subsequent read or write access, with modes specified via grbit flags to control behaviors such as allowing simultaneous updates or fixed DDL enforcement.

Tables support both fixed and variable record layouts, determined by the types of columns defined: fixed-length columns result in records of uniform size, while variable-length columns (such as those for text or binary data) allow records to vary in size to optimize storage efficiency. A key requirement for every table is the presence of exactly one primary index, which serves as a unique clustered index organizing the records in a B+ tree structure and must be declared before the first data update; if omitted during creation, ESE automatically generates a sequential primary index based on insertion order. Tables also accommodate multiple secondary indexes, created via JET_INDEXCREATE structures, which provide alternative orderings and fast lookups by pointing to records using the primary key, without inherent uniqueness enforcement unless specified.

An ESE database can house one or more user-defined tables, with support for schemas containing a large number, potentially hundreds, of tables, managed through instance parameters like JET_paramCachedClosedTables to cache table schemas for efficient access. While ESE does not provide built-in foreign key constraints, inter-table relationships are maintained through application logic, often leveraging secondary indexes that reference primary keys across tables to enforce referential integrity. From a performance perspective, ESE employs table-level locking granularity to manage concurrency, ensuring that operations on a table are isolated during transactions while allowing finer-grained access within the table via indexes. Space allocation for tables occurs in fixed-size pages, with the initial number of pages configurable via the ulPages field in JET_TABLECREATE3 (values greater than 1 help reduce fragmentation), and density controlled by ulDensity (ranging from 20-100%, defaulting to 80% for balanced space utilization). The database page size is set at the instance level, typically defaulting to 8 KB but configurable up to 32 KB to accommodate larger records and improve I/O efficiency in high-throughput scenarios.

Data Model

Records and Columns

In the Extensible Storage Engine (ESE), data is organized into records within tables, where each record represents a row consisting of a tuple of column values, providing a row-based structure for storing related information. Records can have fixed-length or variable-length formats depending on the column definitions, and they are physically stored in pages organized by B+ trees keyed on the primary index to facilitate efficient access.

Insertion and modification of records are performed through the ESE API using cursor-based operations. To insert a new record, an application calls JetPrepareUpdate with the JET_prepInsert option to prepare the operation, optionally sets column values via JetSetColumn or JetSetColumns, and finalizes with JetUpdate, which assigns default values to unset columns and generates values for auto-increment columns if defined. To update an existing record, JetPrepareUpdate is invoked with JET_prepReplace on the current cursor position, followed by setting the desired column values and calling JetUpdate to apply the changes. Retrieval of record data occurs via JetRetrieveColumn or JetRetrieveColumns on the positioned cursor.

Columns in ESE store atomic values of specific types, forming the basic units of data within records. Each column supports null values, which occupy no storage space in tagged columns, and default values that are applied automatically during record insertion if not explicitly set. Record navigation supports both sequential access, following the logical order defined by the current index or insertion sequence, and indexed access using seek operations on primary or secondary indexes via cursors. For concurrency control, ESE optionally includes a version column per table, which is automatically incremented during updates to detect conflicts in multi-user scenarios. Space management for records involves allocating pages within the B+ tree structure, with large records or long column values extending into additional overflow pages when exceeding the standard page size. Defragmentation to reclaim space and optimize layout is performed offline using the JetDefragment API function, which reorganizes data without allowing concurrent access. Advanced column variants, such as multi-valued or long-text types, build on these basics but are handled separately.

Column Types and Variants

The Extensible Storage Engine (ESE) supports a range of column types defined by the JET_COLTYP enumeration, enabling storage of diverse data from simple integers to large binary objects. Basic types include JET_coltypBit for boolean values (true, false, or NULL, stored in 1 byte), JET_coltypShort for 16-bit signed integers (-32,768 to 32,767), JET_coltypLong for 32-bit signed integers (-2,147,483,648 to 2,147,483,647), and JET_coltypLongLong for 64-bit signed integers (-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807). Floating-point data is handled by JET_coltypIEEESingle (4-byte single-precision) and JET_coltypIEEEDouble (8-byte double-precision), while JET_coltypDateTime stores dates and times as 8-byte floats representing fractional days since January 1, 1900. For binary and text data, JET_coltypBinary accommodates up to 255 bytes of binary content, and JET_coltypText supports up to 255 ASCII characters or 127 Unicode characters, with sorting behaviors that are case-insensitive for ASCII and customizable for Unicode.

Columns in ESE are categorized as fixed-length, variable-length, or tagged, each with distinct storage implications to optimize space and access performance. Fixed-length columns, such as those using JET_coltypBit, JET_coltypShort, JET_coltypLong, JET_coltypLongLong, JET_coltypIEEESingle, JET_coltypIEEEDouble, and JET_coltypDateTime, allocate a predictable amount of space in every record (up to 127 such columns per table), making them suitable for numeric and temporal data where sizes are constant; they require only 1 bit for NULL indication and are stored first in the record layout. Variable-length columns, including JET_coltypBinary and JET_coltypText, use 2 bytes to prefix the data length (plus NULL indication), allowing dynamic sizing up to the type's limit (up to 128 such columns per table); these follow fixed columns in the record and are ideal for strings or binaries of varying sizes. The JET_bitColumnFixed flag can force certain variable types to behave as fixed, but they default to variable for flexibility. Tagged columns represent a sparse storage mechanism, where data is absent from a record unless explicitly set (up to 64,993 per table), reducing overhead for optional or infrequently used fields; they can be either fixed or variable in nature but are always stored last in the record layout, with presence indicated by flags in a compact format. This format supports conditional inclusion based on record flags, making tagged columns efficient for wide tables with many nullable attributes. Multi-valued columns must be tagged and enable multiple values per record, influencing secondary indexing strategies as detailed elsewhere.

For handling large datasets, ESE employs long-value columns via JET_coltypLongBinary (binary data) and JET_coltypLongText (text data), each supporting up to 2 GB - 1 byte (or 1,073,741,823 Unicode characters for text). These exceed the typical 8 KB database page size and are stored separately in dedicated B+ trees when larger than 1,024 bytes or when inclusion would overflow the host record's page; otherwise, they may be embedded inline. Storage uses a long value ID (LID) reference (9 bytes in the record) linking to the external tree, with access via byte offsets for streaming operations like appends or partial overwrites; flags such as JET_bitSetSeparateLV force separation, while JET_bitSetIntrinsicLV embeds smaller values. This linked-page approach ensures efficient management of oversized data without fragmenting primary records.
ESE provides specialized column variants for advanced functionality, including version columns (marked with JET_bitColumnVersion on JET_coltypLong types) that automatically increment on record modifications to support multi-version concurrency control (MVCC) by tracking change history for conflict detection and record refresh. Auto-increment columns (via JET_bitColumnAutoincrement on JET_coltypLong or JET_coltypLongLong) generate unique, sequential identifiers upon insertion, ensuring non-duplicate values across sessions though not necessarily contiguous or gap-free, with reuse possible after deletions. Escrow columns (using JET_bitColumnEscrowUpdate on JET_coltypLong with a default value) facilitate atomic delta updates for counters or accumulators, avoiding write conflicts in multi-session environments through the JetEscrowUpdate operation; additional flags like JET_bitColumnFinalize (to lock after final update) or JET_bitColumnDeleteOnZero (to nullify on zero value) enhance control for such scenarios. These variants are defined during column creation and integrate seamlessly with ESE's transactional model.

Indexing System

Clustered and Primary Indexes

In the Extensible Storage Engine (ESE), every table requires exactly one primary index, which serves as the foundational structure for organizing and accessing records. This primary index is mandatory; if one is not specified, the engine will transparently create one. It is created using the JetCreateIndex function, where the JET_bitIndexPrimary flag is set in the grbit parameter to designate it as the primary index, typically based on a key composed of one or more columns. The primary index defines the logical order of all records in the table, establishing a persistent sorting that governs how data is inserted, retrieved, and maintained.

The primary index is inherently clustered, meaning that records are physically stored on disk in a B+ tree structure that mirrors the order specified by the index key, enabling logarithmic-time, O(log n), key-based lookups and range scans. This clustered organization co-locates related records, such as those in sequential key ranges, thereby minimizing disk I/O operations during queries that access contiguous data blocks. To specify the key, developers provide a double null-terminated string in the szKey parameter of JetCreateIndex, listing up to 16 columns in precedence order (with JET_cckeyMost set to 16 on recent Windows versions), prefixed by '+' for ascending sort order or '-' for descending; sorting is case-sensitive by default. The total key size is limited to a fixed maximum, typically 255 bytes under JET_cbKeyMost for normalized data, though larger limits (up to 2000 bytes) are supported with larger page sizes in modern Windows versions to balance efficiency and functionality.

Uniqueness is enforced on the primary index to prevent duplicate key values, with insertion attempts violating this rule triggering a JET_errKeyDuplicate error. This requirement ensures that secondary indexes, which reference records via primary key values, remain reliable and efficient. ESE also supports conditional uniqueness through conditional indexes, where uniqueness constraints apply only to records meeting specified criteria, such as non-null values in certain columns, allowing flexible rules while maintaining the primary index's role in overall table organization.

Secondary and Specialized Indexes

Secondary indexes in the Extensible Storage Engine (ESE) provide non-clustered access paths to table data using alternate keys distinct from the primary index. Each secondary index is implemented as a separate B+ tree that stores only logical record identifiers pointing to the actual data organized by the primary index, enabling efficient lookups without duplicating the full record content. These indexes can be defined on any set of columns and do not inherently enforce uniqueness unless explicitly configured during creation via the JET_INDEXCREATE structure.

ESE supports sparse secondary indexes, which omit index entries for records where all key columns contain null or default values, thereby reducing storage overhead for tables with many optional fields. This sparse behavior is particularly beneficial for wide tables with numerous nullable columns, as it minimizes index bloat while maintaining query performance on populated values. Sparse indexes align with ESE's support for denormalized schemas, allowing applications to index only relevant non-empty values without performance penalties from empty entries.

For multi-valued columns, ESE enables specialized indexing to handle arrays or tagged multi-values within a single record. Tagged columns, which support multiple values per record, are indexed by default on the first value only; however, columns flagged with the JET_bitColumnMultiValued option generate separate index entries for each individual value, allowing comprehensive searches across all elements in the array or list. Multi-valued sparse columns, a variant of tagged columns, further optimize storage by consuming no space for null or unused values and support extensive indexing over all populated elements when the multi-valued flag is set. To index combinations across multiple multi-valued columns, applications can enable cross-product indexing, which expands the index to include entries for every combination of values from the involved columns, though primary indexes prohibit multi-valued keys to preserve record uniqueness.

Tuple indexes represent a specialized secondary index type in ESE, tailored for text or binary columns containing long strings, such as paths or identifiers. Unlike standard indexes, tuple indexes decompose the column value into overlapping substrings (tuples) of configurable minimum and maximum lengths, typically ranging from 2 to 255 characters, and create entries for each, facilitating efficient prefix, substring, or other partial-match queries without full scans. These indexes operate on a single column and incorporate parameters like starting offset and increment to control tuple generation, with limits on the source length (up to 32,767 characters by default) to balance index size and query utility. Tuple indexes support sorting based on the extracted substrings and filtering via range operations, making them suitable for applications requiring flexible string-based retrieval.

Composite indexes, a form of secondary index using multiple columns as the key, enable sorting and filtering based on combined column values, with key definitions supporting up to the maximum allowable size determined by page boundaries (e.g., 1000 bytes for 4 KB pages). The JET_INDEXCREATE structure allows specification of column precedence in the key array, ensuring ordered access paths for multi-column queries. All secondary and specialized indexes in ESE are automatically updated during transactional inserts, updates, and deletes to maintain data consistency without application intervention.
However, repeated modifications can fragment index pages, increasing I/O and degrading performance; in such cases the JetDefragment function can be invoked to reorganize index pages while the database remains online, reclaiming space and restoring page density. These indexes contribute to query optimization by providing diverse access paths for efficient record retrieval and intersection operations.
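A comparable sketch for a sparse secondary index, again with hypothetical table and column names, using the JET_INDEXCREATE structure with JetCreateIndex2 (same headers and ANSI-API assumptions as the earlier primary-index sketch):

    // Sketch: a non-unique, sparse secondary index over a hypothetical
    // "LastName" column. JET_bitIndexIgnoreAnyNull omits entries for records
    // whose key column is NULL, keeping the index small for optional fields.
    JET_ERR CreateSecondaryIndex(JET_SESID sesid, JET_TABLEID tableid)
    {
        const char szKey[] = "+LastName\0";          // double-null-terminated key description

        JET_INDEXCREATE idx = { 0 };
        idx.cbStruct    = sizeof(idx);
        idx.szIndexName = const_cast<char*>("ByLastName");
        idx.szKey       = const_cast<char*>(szKey);
        idx.cbKey       = sizeof(szKey);
        idx.grbit       = JET_bitIndexIgnoreAnyNull; // sparse behavior
        idx.ulDensity   = 90;                        // page density in percent

        return JetCreateIndex2(sesid, tableid, &idx, 1);
        // idx.err additionally reports the per-index result after the call.
    }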

Transaction Management

Transactions

Transactions in the Extensible Storage Engine (ESE) are scoped to individual sessions and initiated with the JetBeginTransaction function, which creates a new save point; calls may be nested up to seven levels deep. Only the outermost commit, via JetCommitTransaction, persists changes to the database, while inner transactions enable partial rollbacks to earlier save points. ESE transactions adhere to the ACID properties: atomicity is ensured through full rollback of all changes if a transaction fails; durability is achieved via write-ahead logging, which records committed changes for recovery (detailed further under logging mechanisms); consistency is maintained by engine-enforced constraints such as unique indexes together with application-defined rules; and isolation is provided through a snapshot model, in which each transaction views the database state as it existed at the transaction's start, hiding uncommitted changes made by other sessions.

Concurrency is managed via multi-version concurrency control (MVCC), which keeps record versions in the version store so that readers can access consistent snapshots without blocking writers. At the record level, reads are non-blocking against these snapshots, while a write may fail with JET_errWriteConflict if a concurrent session has already modified the record; escrow updates via JetEscrowUpdate allow atomic concurrent adjustments to shared numeric values such as counters. Savepoint-like behavior within a transaction is achieved through nested transactions, allowing selective rollback of inner levels without aborting the entire transaction.

Long-running transactions can cause growth of the version store, the in-memory area that retains record versions for MVCC and rollback, potentially exhausting the available version buckets (limited by factors such as CPU architecture, e.g., approximately 408 MB on single-core 64-bit systems) and halting updates. Best practices include committing or rolling back frequently to minimize version retention, monitoring the "Version buckets allocated" performance counter, and avoiding prolonged use of temporary tables (created via JetOpenTemporaryTable) inside such transactions to prevent unnecessary version-store bloat.
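The retry pattern this model implies can be sketched as follows; doWork stands in for whatever record changes the application performs inside the transaction, and the retry count and back-off are illustrative (same headers as the earlier sketches):

    // Sketch: optimistic-concurrency retry loop. A write conflict aborts the
    // transaction, waits briefly, and tries the whole unit of work again.
    JET_ERR UpdateWithRetry(JET_SESID sesid, JET_ERR (*doWork)(JET_SESID))
    {
        for (int attempt = 0; attempt < 5; ++attempt)
        {
            JET_ERR err = JetBeginTransaction(sesid);      // opens a new save point
            if (err < JET_errSuccess)
                return err;

            err = doWork(sesid);                           // application's record changes
            if (err >= JET_errSuccess)
                return JetCommitTransaction(sesid, 0);     // outermost commit persists changes

            JetRollback(sesid, 0);                         // undo the work at this save point
            if (err != JET_errWriteConflict)
                return err;                                // only conflicts are retried
            Sleep(10 * (attempt + 1));                     // brief back-off before retrying
        }
        return JET_errWriteConflict;
    }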

Logging and Crash Recovery

The Extensible Storage Engine (ESE) employs a write-ahead logging (WAL) mechanism to ensure data durability: all database modifications are recorded in transaction log files before being applied to the database pages. These log files, typically named with a base prefix followed by a generation number (e.g., EDB00001.LOG), capture operations such as inserts, updates, and deletes in a binary format, allowing the system to maintain atomicity and consistency even if a crash occurs mid-transaction.

ESE supports two primary logging modes: circular logging and full logging. In circular logging, enabled via the JET_paramCircularLog parameter, the engine automatically truncates and reuses log files once they are no longer needed for recovery (specifically, logs older than the current checkpoint are discarded), reducing storage overhead but limiting recovery to the last checkpoint, which may result in some data loss. Full logging, the default mode when circular logging is disabled, retains all log files until a full backup is performed, enabling recovery with no loss of committed work since the last backup, though it requires more disk space and periodic backups to truncate the accumulated logs.

Checkpointing periodically advances the recovery point by updating a checkpoint file (e.g., EDB.CHK) that marks the state up to which all prior log operations have been durably written to the database, minimizing the volume of logs that must be replayed during recovery. The checkpoint file records the generation number of the oldest log file still needed for recovery, and its advancement is influenced by parameters such as JET_paramCheckpointDepthMax, which bounds how much logged data may remain unflushed before a checkpoint is forced, balancing performance against recovery time.

Upon startup after a crash, ESE initiates soft recovery through the JetInit function, which detects the inconsistency left by a "dirty shutdown" and performs a two-phase process to restore the database to a consistent state. In the redo phase, committed transactions since the last checkpoint are replayed from the surviving log files to reapply changes that may not have reached the database file. The undo phase then rolls back uncommitted transactions and partial changes, using the log records to reverse operations so that ACID properties hold without losing committed work. This ARIES-style recovery (enabled by default via JET_paramRecovery) can be resource-intensive if many log generations have accumulated, but it guarantees consistency even after abrupt failures.

Log file management uses sequential generation numbering: each filled log file (default size 5 MB, configurable via JET_paramLogFileSize) is renamed and a new one created, with temporary logs pre-generated asynchronously to avoid delays under load. Log truncation occurs during full backups or automatically when circular logging is enabled, deleting obsolete files (controlled by JET_paramDeleteOldLogs), while reserved log files (e.g., RES00001.JRS) are maintained so the engine can shut down cleanly during temporary disk-space shortages. These mechanisms allow continuous operation without manual intervention in most cases.

Performance tuning for logging centers on parameters such as JET_paramLogBuffers, which allocates in-memory log buffer space (by default roughly 80-126 buffers of 4 KB each) to absorb log writes and reduce I/O latency during high-throughput updates, and on the choice between circular and full logging, where circular mode improves space efficiency for non-critical applications at the cost of recovery flexibility.
Asynchronous log file creation (JET_paramLogFileCreateAsynch) further smooths log switching under heavy workloads, though full logging is recommended for environments requiring comprehensive crash recovery and backup integration.
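The following sketch shows how such parameters might be set on an instance before JetInit; the specific values and paths are illustrative choices, not recommendations, and the ANSI parameter variants are assumed:

    // Sketch: configure logging for an ESE instance prior to initialization.
    // JET_paramLogFileSize is expressed in kilobytes and
    // JET_paramCheckpointDepthMax in bytes.
    void ConfigureLogging(JET_INSTANCE* pinst)
    {
        // Full (non-circular) logging: every log generation is kept until a
        // backup truncates it.
        JetSetSystemParameter(pinst, 0, JET_paramCircularLog, 0, NULL);

        // 5 MB log files (the historical default size).
        JetSetSystemParameter(pinst, 0, JET_paramLogFileSize, 5 * 1024, NULL);

        // Cap how far the checkpoint may lag behind the log tail, trading
        // write coalescing against crash-recovery replay time.
        JetSetSystemParameter(pinst, 0, JET_paramCheckpointDepthMax,
                              20 * 1024 * 1024, NULL);

        // Directory that receives the .log and .chk files (illustrative path).
        JetSetSystemParameter(pinst, 0, JET_paramLogFilePath, 0, "C:\\MyApp\\logs\\");

        // JetInit replays any surviving logs (soft recovery) and starts the instance.
        JetInit(pinst);
    }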

Operational Features

Cursor Navigation and Copy Buffer

In the Extensible Storage Engine (ESE), cursors serve as session-bound handles that enable applications to navigate and manipulate records within tables. A cursor is created with the JetOpenTable function, which opens a cursor on a specified table within a database session identified by JET_SESID and JET_DBID; the resulting JET_TABLEID handle is tied exclusively to that session and cannot be shared across sessions. Multiple cursors may be open on the same table to support concurrent operations, subject to resource limits governed by JET_paramMaxCursors, and they are typically closed with JetCloseTable unless released automatically when the owning session ends. These handles provide record access without locking the entire table, promoting efficient concurrent use.

Cursor navigation supports three primary modes: sequential traversal, indexed seeks, and approximate (fractional) positioning. Sequential navigation uses the JetMove function, which repositions the cursor relative to the current index entry using the cRow parameter (e.g., JET_MoveFirst for the first entry, JET_MoveLast for the last, JET_MoveNext for forward traversal, JET_MovePrevious for backward) and grbit options to skip duplicates or respect index ranges established with JetSetIndexRange. This mode is ideal for iterating records in index order and allows arbitrary offsets, such as moving forward by 1000 entries for bulk scanning. Indexed seeks instead use JetSeek after constructing a search key with JetMakeKey; the function positions the cursor at the nearest matching entry according to grbit flags such as JET_bitSeekGE (greater than or equal) or JET_bitSeekLE (less than or equal), enabling targeted access without full scans. For fractional positioning, JetGetRecordPosition retrieves the approximate location of the current record within the index as a JET_RECPOS structure, which can later be passed to JetGotoPosition to jump directly to that fraction (e.g., 0.5 for the midpoint), supporting operations such as resuming from a prior point in large datasets.

The copy buffer is an in-memory staging area associated with each cursor that holds temporary modifications, allowing changes to be assembled without immediately locking or altering the original record in the database. Accessed via JET_bitRetrieveCopy in functions such as JetRetrieveColumn, the buffer holds pending changes during insert or update preparation, so modified column values (e.g., auto-increment IDs) can be read back before commitment. To stage an update, an application first calls JetPrepareUpdate to initialize the buffer, then uses JetSetColumn to modify specific columns, overwriting values, appending to multi-valued columns, or handling long binary or text data with options such as JET_bitSetAppendLV, without affecting the database until finalization. This deferred approach minimizes locking duration because the buffer isolates changes for validation or bulk assembly.

Update semantics in ESE leverage the copy buffer for deferred commits, preserving atomicity and concurrency. After the buffer is prepared and populated, JetUpdate finalizes the operation by writing the changes to the database and updating the indexes; on success it returns JET_errSuccess and a bookmark for the new or modified record, while failures (e.g., due to space constraints) leave the buffer intact so the operation can be retried without a partial commit. This enables navigation-heavy workflows, such as bulk inserts, to stage modifications in the buffer before a single commit, reducing contention in multi-user environments.
For concurrent navigation, ESE employs error handling focused on write conflicts rather than traditional deadlocks, as its snapshot isolation model avoids blocking waits. During cursor movements or updates (e.g., via JetMove or JetUpdate), conflicts arise if another session modifies the same record, returning JET_errWriteConflict; applications must implement retry logic by aborting the transaction with JetRollback, introducing a delay, and reattempting the operation after the conflicting transaction completes. This first-writer-wins policy, combined with session-level single-threading during transactions, ensures prompt detection and encourages optimistic concurrency patterns for cursor-based access.
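A condensed sketch of this seek-then-update flow, with hypothetical table, index, and column identifiers and most error handling omitted (requires <windows.h>, <esent.h>, and <cstring>):

    // Sketch: locate a record by key on the cursor's current (primary) index,
    // then stage and commit a column change through the copy buffer.
    JET_ERR RenamePerson(JET_SESID sesid, JET_DBID dbid,
                         JET_COLUMNID colidName, long personId,
                         const char* newName)
    {
        JET_TABLEID tableid;
        JET_ERR err = JetOpenTable(sesid, dbid, "People", NULL, 0, 0, &tableid);
        if (err < JET_errSuccess)
            return err;

        // Build a search key from the native column value and seek to it.
        JetMakeKey(sesid, tableid, &personId, sizeof(personId), JET_bitNewKey);
        err = JetSeek(sesid, tableid, JET_bitSeekEQ);
        if (err >= JET_errSuccess)
        {
            // Stage the change in the cursor's copy buffer ...
            JetPrepareUpdate(sesid, tableid, JET_prepReplace);
            JetSetColumn(sesid, tableid, colidName,
                         newName, (unsigned long)strlen(newName), 0, NULL);
            // ... then write it to the database; indexes are updated here.
            err = JetUpdate(sesid, tableid, NULL, 0, NULL);
        }

        JetCloseTable(sesid, tableid);
        return err;
    }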

Query Processing Techniques

The Extensible Storage Engine (ESE) employs API-driven query execution, in which applications build retrieval strategies from cursors, index seeks, and sequential scans rather than declarative SQL queries. This low-level approach allows precise control over data access patterns, leveraging the engine's indexed sequential access method (ISAM) architecture to optimize performance in embedded scenarios. Query processing emphasizes efficient index utilization to minimize I/O, with the engine supporting seeks on primary and secondary indexes to locate records without unnecessary full table scans.

Sorting is handled through temporary tables, which support both in-memory and disk-based mechanisms for operations such as ORDER BY in application logic or index key sorting. The JetOpenTemporaryTable function creates a volatile, single-indexed temporary table optimized for record storage and retrieval during sorting. For small or simple datasets, an in-memory implementation offers the fastest performance by keeping data in RAM. Larger datasets use disk-based sorting with forward-only iterators, enabling efficient duplicate removal and reduced I/O through streaming algorithms. Alternatively, a B+ tree-based materialized approach offers greater flexibility for subsequent operations but incurs higher overhead than pure sorting. These temporary tables live in a dedicated temporary database whose path can be tuned via JetSetSystemParameter with the JET_paramTempPath parameter to take advantage of high-performance storage devices.

Temporary tables also facilitate intermediate result storage in multi-step query workflows, allowing applications to materialize subsets of data for further filtering, aggregation, or joining without touching persistent tables. This is particularly useful for complex retrievals where direct cursor operations on the main tables would be inefficient: the volatile temporary tables do not persist beyond the session and support rapid creation and population via calls such as JetPrepareUpdate and JetUpdate. By staging intermediate results in temporary structures, applications avoid repeated scans of large base tables, improving overall query throughput in memory-constrained environments.

ESE supports covering queries through its primary index structure, in which all table columns are stored directly in the B+ tree leaf pages alongside the primary key, enabling index-only scans that retrieve the selected data without additional lookups into the table body. This inherently covers queries that filter on and retrieve only primary-key-related or co-located columns, reducing latency by eliminating secondary fetches. Secondary indexes, by contrast, typically contain only the indexed key values and record identifiers, so retrieving non-key columns requires a subsequent seek on the primary index, adding an extra I/O step unless the query is limited to the index keys alone. Applications can mitigate this by designing secondary indexes with conditional columns that closely match query needs, though full coverage requires alignment with the primary index. Index intersection is achieved by the application coordinating multiple cursors on different secondary indexes and combining their results for selective filters, such as AND conditions across non-overlapping attributes.
By performing parallel seeks on each index and intersecting the resulting sets of record identifiers, often using temporary tables for merging, this approach avoids full scans on large tables while exploiting the selectivity of each individual index. The technique is particularly effective when no single composite index covers the filter, since index ranges can be combined dynamically without maintaining a dedicated composite index. The engine's efficient seek operations, bounded by key-size limits (255 bytes standard, extendable to 2,000 bytes on larger pages), keep the cost low for moderately selective queries.

ESE lacks native SQL join support, relying instead on application-implemented joins via coordinated cursor navigation across related tables, or on pre-joined denormalized structures that simulate relational operations. For instance, an application can seek on an index in one table to match records in another, using temporary tables to buffer join results and avoid repeated cross-table seeks. This cursor-based approach handles one-to-many and many-to-one relationships efficiently, though it requires careful buffer management to avoid excessive memory use during large joins. Denormalization, in which related data is stored in a single table with multi-valued columns, goes further by eliminating runtime joins altogether, playing to ESE's strength in handling hierarchical or tagged data.

Optimization heuristics guide index usage and access-path choices toward low-cost paths, such as preferring seeks over scans on large tables whenever a matching index exists. During JetSeek operations the engine applies heuristics such as the JET_bitCheckUniqueness flag (available in newer Windows versions) to verify single-match conditions cheaply, returning JET_wrnUniqueKey to short-circuit further retrieval when applicable. Search key construction via JetMakeKey influences cost by enabling precise inequality bounds (e.g., JET_bitSeekGE for greater-than-or-equal), allowing the engine to prune irrelevant index ranges and avoid exhaustive traversals. While internal cost models are not exposed, these heuristics provide adaptive selection of index paths based on key specificity and table size, promoting scalability for high-volume access patterns. Queries are ultimately implemented with the cursor primitives for navigation and buffering described in the Cursor Navigation and Copy Buffer section.
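An illustrative range-scan sketch over a hypothetical secondary index named "ByLastName", showing how JetMakeKey, JetSeek, and JetSetIndexRange bound a traversal (same headers and assumptions as the earlier sketches):

    // Sketch: visit every record whose LastName is >= "Smith" and < "Smyth".
    void ScanRange(JET_SESID sesid, JET_TABLEID tableid)
    {
        JetSetCurrentIndex(sesid, tableid, "ByLastName");

        // Position the cursor at the first entry >= "Smith".
        JetMakeKey(sesid, tableid, "Smith", 5, JET_bitNewKey);
        if (JetSeek(sesid, tableid, JET_bitSeekGE) < JET_errSuccess)
            return;                                        // no matching entries

        // Establish an exclusive upper bound so JetMove stops before "Smyth".
        JetMakeKey(sesid, tableid, "Smyth", 5, JET_bitNewKey);
        JetSetIndexRange(sesid, tableid, JET_bitRangeUpperLimit);

        JET_ERR err = JET_errSuccess;
        while (err >= JET_errSuccess)
        {
            // ... retrieve columns for the current record with JetRetrieveColumn ...
            err = JetMove(sesid, tableid, JET_MoveNext, 0);   // JET_errNoCurrentRecord ends the range
        }
    }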

Backup and Recovery

Backup and Restore

The Extensible Storage Engine (ESE) supports both online and offline methods for backing up databases, enabling data protection without necessarily interrupting application access. Online backups, facilitated by the JetBackup API, create consistent copies while the database remains active and transactions continue, primarily through streaming mechanisms that copy the database files (.edb) in 64 KB chunks with verification to ensure integrity. Full online backups capture the primary database file together with the active log files from the current checkpoint onward, providing a point-in-time snapshot that can be restored to a specific recovery point. Incremental backups are achieved via log shipping, copying only the transaction logs generated since the last full backup, minimizing storage needs while maintaining recoverability.

For offline maintenance windows on large databases, ESE offers defragmentation via the JetDefragment function, which optimizes database organization by reclaiming internal space, operating in place rather than producing a separate copy. For streaming backups, JetBeginExternalBackup initiates the process by flushing dirty pages and suspending checkpoint advancement so the files remain consistent while they are copied. API integration such as JetGetAttachInfo coordinates backups by enumerating the attached database names and states, allowing a backup application to identify and handle all relevant files dynamically. Additionally, ESE integrates with the Volume Shadow Copy Service (VSS) for application-consistent snapshots, in which the product's VSS writer (for example, the Exchange writer) briefly freezes I/O to create shadow copies without involving the streaming API.

The restore process begins by mounting the backed-up database files with JetExternalRestore, which takes the paths of the checkpoint and log files and initiates recovery. This API orchestrates log replay across the backup's log range (defined by generation numbers genLow to genHigh), rolling forward committed transactions and optionally undoing uncommitted ones to reach a consistent state at the desired recovery point. After replay, the database can be reattached via JetAttachDatabase, with all changes from the backup period applied. For large-scale restores, streaming techniques mirror those used for backup to handle the data volume efficiently. Best practices recommend scheduling full backups according to transaction-log retention policies (typically daily in high-activity environments), combining them with frequent log backups to balance recovery-time objectives against storage costs, and verifying backups in isolated environments to confirm restorability.
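As an illustration, the simplest online path uses JetBackupInstance, which streams the databases and required logs of an initialized instance to a target directory; the paths are hypothetical, the incremental call assumes a prior full backup, and error handling is trimmed:

    // Sketch: online backup of a running ESE instance (same headers as above).
    void BackupNow(JET_INSTANCE inst)
    {
        // Full online backup; logs that are no longer needed are truncated afterward.
        JetBackupInstance(inst, "D:\\backups\\full", 0, NULL);

        // Later: incremental backup copying only the logs generated since the
        // previous full backup.
        JetBackupInstance(inst, "D:\\backups\\incr", JET_bitBackupIncremental, NULL);
    }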

Cross-Hardware Backup and Restore

Migrating Extensible Storage Engine (ESE) databases across different hardware platforms presents several portability challenges, stemming primarily from the nature of the .edb file format. ESE files use little-endian byte ordering for their on-disk structures, including UTF-16 encoded strings stored without a byte order mark, which suits x86 and x64 Windows architectures but is incompatible with big-endian systems outside the Windows ecosystem. Page size variations, such as the traditional 8 KB pages versus the 32 KB pages introduced with Windows Server 2025 for Active Directory Domain Services (AD DS), can cause compatibility issues during direct file transfers, as older formats may require compatibility modes or upgrades to mount successfully on newer systems. Version mismatches between ESE implementations complicate migrations further, since databases from earlier product versions (e.g., Exchange Server 2016) cannot be attached directly to later ones without risking errors or service failures.

To address these challenges, migration typically involves either exporting data or copying the database files and then running recovery. For hardware transitions within the same Windows and ESE version, administrators copy the .edb file along with its transaction log (.log) and checkpoint (.chk) files to the target system, then perform soft recovery with the eseutil utility to replay the logs and restore transactional consistency (e.g., eseutil /r with the three-character log base name). Database patching for format upgrades occurs during mounting; for instance, attaching an 8 KB page .edb on Windows Server 2025 upgrades it to a compatible version via the ESE engine (esent.dll), though this can inadvertently alter the format and prevent loading on legacy systems unless performed in a controlled environment. Third-party tools such as the libesedb library enable low-level access for custom export scripts, but official migrations rely on utilities like eseutil for defragmentation and recovery (eseutil /d or /r).

ESE remains inherently limited to Windows environments, with no native support for other platforms owing to its tight integration with Windows and its little-endian on-disk format. However, the ESE API (esent.dll) allows data to be exported to neutral formats such as CSV or XML through application-level reads, providing indirect portability for analysis or integration in cross-platform scenarios. In virtualized setups, ESE databases can be migrated between hosts running on diverse underlying hardware by treating the virtual machines as portable units, provided the guest OS version matches and the virtualization layer delivers consistent I/O for the log files.

Windows Server 2025 introduces enhanced support for page-size migrations in ESE-based databases such as AD DS, allowing legacy 8 KB formats to operate in an 8 KB simulation mode on 32 KB-capable systems during in-place upgrades or file copies, which expands scalability for multi-valued attributes without immediate data reformatting. This mode preserves compatibility but requires all domain controllers to support 32 KB pages before the larger format can be fully enabled, and 8 KB backups become obsolete after the migration.

Post-migration verification is essential to confirm database integrity and consistency. Administrators use eseutil checksum validation (eseutil /k) to detect corruption in the .edb file and replay any remaining logs (eseutil /r) to apply pending transactions, ensuring the database state matches the backup point. These steps, performed after standard backups, help mitigate risks arising from hardware differences, with successful mounting and client access (e.g., via Outlook) serving as final confirmation.

Comparisons and Modern Usage

Comparison to JET Red

The Extensible Storage Engine (ESE), also known as JET Blue, marks a substantial departure from JET Red in architectural design. JET Red operates as a single-threaded engine optimized for file-sharing scenarios in desktop applications such as Microsoft Access, limiting its suitability for high-concurrency environments. In contrast, ESE adopts a multi-threaded, scalable design tailored for server-grade operation, handling concurrent transactions robustly through features such as record-level locking and write-ahead logging. ESE introduces advanced features absent in JET Red, including multi-version concurrency control (MVCC) for snapshot isolation, long-value columns up to approximately 2 GB in size, sparse indexes for efficient storage of null-heavy data, and tagged multi-valued columns for complex data structures. JET Red, by comparison, lacks these capabilities, relying on simpler column types without native multi-valued or sparse support, which restricts its flexibility for diverse data models.

In terms of scale, ESE manages databases exceeding 1 TB with strong concurrency for multi-user access, making it suitable for enterprise applications such as Microsoft Exchange Server. JET Red is constrained to databases of at most 2 GB, aligning it with lightweight, single-user or limited multi-user Access deployments where scalability demands are modest. The ESE API builds on the JET interface but extends it with additional calls, such as JetGetColumnInfo for detailed schema introspection, while remaining conceptually compatible for core operations; the two engines are nevertheless not interoperable, owing to their distinct implementations. Following its introduction in 1994, ESE emerged as the successor for server workloads, effectively supplanting JET Red in high-performance scenarios by the early 2000s, while JET Red continued to serve desktop contexts.

Current Implementations and Extensions

The Extensible Storage Engine (ESE) serves as the core database engine for several key Microsoft products, enabling efficient, scalable data management in enterprise environments. In Active Directory Domain Services (AD DS), ESE powers the NTDS.dit database file, which stores the directory objects, attributes, and security principals essential for identity management across Windows networks. ESE likewise underlies the mailbox databases in Microsoft Exchange Server, providing transactional integrity and high-availability storage for email, calendars, and attachments in on-premises and hybrid deployments. For Windows Search, ESE manages the indexing database (typically Windows.edb), organizing metadata and content into indexed tables to enable fast searches across files, email, and applications.

With Windows Server 2025, ESE integration has advanced to support larger-scale deployments, particularly for Active Directory Domain Services (AD DS) and Active Directory Lightweight Directory Services (AD LDS). New installations of AD DS and AD LDS now default to a 32 KB page size for their ESE databases, a significant upgrade from the 8 KB pages used in earlier releases, improving object scalability by supporting multi-valued attributes with up to approximately 3,200 values each through 64-bit long-value IDs (LIDs). The feature requires all domain controllers to run Windows Server 2025 and an elevated forest functional level, enabling better handling of expansive datasets in hybrid cloud scenarios where on-premises Active Directory synchronizes with Azure-based identity services. The enhanced format thus improves performance and capacity for hybrid environments, reducing overhead in cross-cloud identity operations without compromising compatibility for existing 8 KB databases during upgrades.

Since its open-sourcing in 2021, ESE has benefited from community-driven contributions through the official GitHub repository, including unit tests and integration with Azure DevOps pipelines for continuous validation. Post-2021 updates have focused on maintaining compatibility and adding reusable sub-components such as synchronization libraries and cache managers, though specific performance patches remain part of Microsoft's internal development cycle rather than public releases.

The JET API, ESE's primary programmatic interface, offers considerable extensibility for developers to customize database structures beyond the standard implementations. It supports specialized column handling and conditional indexing via functions such as JetCreateTableColumnIndex, enabling tailored schemas for application needs such as variable-length fields or computed values. This flexibility has extended ESE's adoption to non-Microsoft software, notably backup and recovery tools that use ESE databases (also known as JET Blue) to store protected data in their repositories, providing efficient, transactionally consistent backups of cloud workloads. Looking ahead as of 2025, ESE's evolution emphasizes deeper ties to Microsoft's cloud ecosystem, with Windows Server 2025's hybrid enhancements positioning it for seamless Azure integration in multi-cloud identity management.
