Apache Lucene
| Lucene | |
|---|---|
| Developer | Apache Software Foundation |
| Initial release | 1999 |
| Stable release | 10.3.1 / October 6, 2025[1] |
| Written in | Java |
| Operating system | Cross-platform |
| Type | Search and index |
| License | Apache License 2.0 |
| Website | lucene.apache.org |
Apache Lucene is a free and open-source search engine software library, originally written in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License. Lucene is widely used as a standard foundation for production search applications.[2][3][4]
Lucene has been ported to other programming languages including Object Pascal, Perl, C#, C++, Python, Ruby and PHP.[5]
History
Doug Cutting originally wrote Lucene in 1999.[6] Lucene was his fifth search engine. He had previously written two while at Xerox PARC, one at Apple, and a fourth at Excite.[7] It was initially available for download from its home at the SourceForge web site. It joined the Apache Software Foundation's Jakarta family of open-source Java products in September 2001 and became its own top-level Apache project in February 2005. The name Lucene is Doug Cutting's wife's middle name and her maternal grandmother's first name.[8]
Lucene formerly included a number of sub-projects, such as Lucene.NET, Mahout, Tika and Nutch. These are now independent top-level projects.
In March 2010, the Apache Solr search server joined as a Lucene sub-project, merging the developer communities.
Version 4.0 was released on October 12, 2012.[9]
In March 2021, Lucene changed its logo, and Apache Solr became a top level Apache project again, independent from Lucene.
Features and common use
While suitable for any application that requires full text indexing and searching capability, Lucene is recognized for its utility in the implementation of Internet search engines and local, single-site searching.[10][11]
Lucene includes a feature to perform a fuzzy search based on edit distance.[12]
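Such a query can be expressed through the query-parser syntax (for example, lucene~1) or programmatically. A minimal sketch, assuming an illustrative field named "body":

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;

// Matches terms within edit distance 1 of "lucene" (e.g. "lucent").
// Equivalent to the query-parser expression: body:lucene~1
FuzzyQuery fuzzy = new FuzzyQuery(new Term("body", "lucene"), 1);
```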
Lucene has also been used to implement recommendation systems.[13] For example, Lucene's 'MoreLikeThis' Class can generate recommendations for similar documents. In a comparison of the term vector-based similarity approach of 'MoreLikeThis' with citation-based document similarity measures, such as co-citation and co-citation proximity analysis, Lucene's approach excelled at recommending documents with very similar structural characteristics and more narrow relatedness.[14] In contrast, citation-based document similarity measures tended to be more suitable for recommending more broadly related documents,[14] meaning citation-based approaches may be more suitable for generating serendipitous recommendations, as long as documents to be recommended contain in-text citations.
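As an illustration of the 'MoreLikeThis' class, a hedged sketch of its typical usage; the reader, analyzer, and document id are assumed stand-ins, not taken from the cited studies:

```java
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.Query;

// Build a "find similar documents" query from an already indexed document.
MoreLikeThis mlt = new MoreLikeThis(reader);  // reader: an open IndexReader (assumed)
mlt.setFieldNames(new String[] {"body"});     // field(s) to mine for salient terms (assumed)
mlt.setAnalyzer(analyzer);                    // analyzer matching index time (assumed)
Query similar = mlt.like(42);                 // 42: docid of the source document (illustrative)
```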
Lucene-based projects
Lucene itself is just an indexing and search library and does not contain crawling and HTML parsing functionality. However, several projects extend Lucene's capability:
- Apache Nutch – provides web crawling and HTML parsing[citation needed]
- Apache Solr – an enterprise search server
- CrateDB – open source, distributed SQL database built on Lucene[15]
- DocFetcher – a multiplatform desktop search application[citation needed]
- Elasticsearch – an enterprise search server released in 2010[16][17]
- Kinosearch – a search engine written in Perl and C[18] and a loose port of Lucene.[19] The Socialtext wiki software uses this search engine,[18] and so does the MojoMojo wiki.[20] It is also used by the Human Metabolome Database (HMDB)[21] and the Toxin and Toxin-Target Database (T3DB).[22]
- MongoDB Atlas Search – a cloud-native enterprise search application based on MongoDB and Apache Lucene
- OpenSearch – an open source enterprise search server based on a fork of Elasticsearch 7
- Swiftype – an enterprise search startup based on Lucene
References
[edit]- ^ "Welcome to Apache Lucene". Lucene™ News section. Archived from the original on 12 February 2021. Retrieved 12 February 2020.
- ^ Kamphuis, Chris; de Vries, Arjen P.; Boytsov, Leonid; Lin, Jimmy (2020), "Which BM25 do You Mean? A Large-Scale Reproducibility Study of Scoring Variants", in Jose, Joemon M.; Yilmaz, Emine; Magalhães, João; Castells, Pablo (eds.), Advances in Information Retrieval, Lecture Notes in Computer Science, vol. 12036, Cham: Springer International Publishing, pp. 28–34, doi:10.1007/978-3-030-45442-5_4, ISBN 978-3-030-45441-8, PMC 7148026
- ^ Grand, Adrien; Muir, Robert; Ferenczi, Jim; Lin, Jimmy (2020), "From MAXSCORE to Block-Max Wand: The Story of How Lucene Significantly Improved Query Evaluation Performance", in Jose, Joemon M.; Yilmaz, Emine; Magalhães, João; Castells, Pablo (eds.), Advances in Information Retrieval, Lecture Notes in Computer Science, vol. 12036, Cham: Springer International Publishing, pp. 20–27, doi:10.1007/978-3-030-45442-5_3, ISBN 978-3-030-45441-8, PMC 7148045
- ^ Azzopardi, Leif; Moshfeghi, Yashar; Halvey, Martin; Alkhawaldeh, Rami S.; Balog, Krisztian; Di Buccio, Emanuele; Ceccarelli, Diego; Fernández-Luna, Juan M.; Hull, Charlie; Mannix, Jake; Palchowdhury, Sauparna (2017-02-14). "Lucene4IR: Developing Information Retrieval Evaluation Resources using Lucene". ACM SIGIR Forum. 50 (2): 58–75. doi:10.1145/3053408.3053421. ISSN 0163-5840. S2CID 212416159.
- ^ "LuceneImplementations". apache.org. Retrieved 2025-03-25.
- ^ "Better Search with Apache Lucene and Solr" (PDF). 19 November 2007. Archived from the original (PDF) on 31 January 2012.
- ^ Cutting, Doug (2019-06-07). "I wrote a couple of search engines at Xerox PARC, then V-Twin at Apple, then re-wrote Excite's search, then Lucene. So, Lucene might be considered V-Twin 3.0? Almost 25 years later, V-Twin still lives on as Mac OS X Search Kit!". @cutting. Retrieved 2019-06-19.
- ^ Barker, Deane (2016). Web Content Management. O'Reilly. p. 233. ISBN 978-1491908105.
- ^ "Apache Lucene - Welcome to Apache Lucene". apache.org. Archived from the original on 4 February 2016. Retrieved 4 February 2016.
- ^ McCandless, Michael; Hatcher, Erik; Gospodnetić, Otis (2010). Lucene in Action, Second Edition. Manning. p. 8. ISBN 978-1933988177.
- ^ "GNU/Linux Semantic Storage System" (PDF). glscube.org. Archived from the original (PDF) on 2010-06-01.
- ^ "Apache Lucene - Query Parser Syntax". lucene.apache.org. Archived from the original on 2017-05-02.
- ^ J. Beel, S. Langer, and B. Gipp, “The Architecture and Datasets of Docear’s Research Paper Recommender System,” in Proceedings of the 3rd International Workshop on Mining Scientific Publications (WOSP 2014) at the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2014), London, UK, 2014
- ^ a b M. Schwarzer, M. Schubotz, N. Meuschke, C. Breitinger, V. Markl, and B. Gipp, "Evaluating Link-based Recommendations for Wikipedia", in Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), New York, NY, USA, 2016, pp. 191–200. https://www.gipp.com/wp-content/papercite-data/pdf/schwarzer2016.pdf
- ^ Wayner, Peter. "11 cutting-edge databases worth exploring now". InfoWorld. Archived from the original on 21 September 2015. Retrieved 21 September 2015.
- ^ "Elasticsearch: RESTful, Distributed Search & Analytics - Elastic". elastic.co. Archived from the original on 8 October 2015. Retrieved 23 September 2015.
- ^ "The Future of Compass & Elasticsearch". the dude abides. Archived from the original on 2015-10-15. Retrieved 2015-10-14.
- ^ a b Natividad, Angela. "Socialtext Updates Search, Goes Kino". CMS Wire. Archived from the original on 2012-09-29. Retrieved 2011-05-31.
- ^ Marvin Humphrey. "KinoSearch - Search engine library. - metacpan.org". p3rl.org. Retrieved 23 September 2015.
- ^ Diment, Kieren; Trout, Matt S (2009). "Catalyst Cookbook". The Definitive Guide to Catalyst. Apress. p. 280. ISBN 978-1-4302-2365-8.
- ^ Wishart, D. S.; et al. (January 2009). "HMDB: a knowledgebase for the human metabolome". Nucleic Acids Res. 37 (Database issue): D603–10. doi:10.1093/nar/gkn810. PMC 2686599. PMID 18953024.
- ^ Lim, Emilia; Pon, Allison; Djoumbou, Yannick; Knox, Craig; Shrivastava, Savita; Guo, An Chi; Neveu, Vanessa; Wishart, David S. (January 2010). "T3DB: a comprehensively annotated database of common toxins and their targets". Nucleic Acids Res. 38 (Database issue): D781–6. doi:10.1093/nar/gkp934. PMC 2808899. PMID 19897546.
Bibliography
- Gospodnetic, Otis; Erik Hatcher; Michael McCandless (28 June 2009). Lucene in Action (2nd ed.). Manning Publications. ISBN 978-1-9339-8817-7.
- Gospodnetic, Otis; Erik Hatcher (1 December 2004). Lucene in Action (1st ed.). Manning Publications. ISBN 978-1-9323-9428-3.
Apache Lucene

Overview
Introduction
Apache Lucene is a free and open-source information retrieval library written in the Java programming language, providing high-performance full-text indexing and search capabilities suitable for nearly any application requiring structured or unstructured data retrieval.[3] Designed primarily to enable efficient searching over large volumes of data, Lucene focuses on core search functionality without encompassing user interfaces, deployment, or complete search engine features.[6] This library-based approach distinguishes it from standalone server applications, allowing developers to embed robust search directly into custom software stacks for scalable, customizable information retrieval.[7]

Originally developed by Doug Cutting in 1997 during his spare time, Lucene emerged as a Java-based toolkit to address the need for advanced text indexing and querying in applications.[2] It began as part of the Jakarta Apache project, reflecting early efforts to build open-source tools for web and enterprise search under the Apache umbrella.[8] Cutting's development drew from prior experience in search technologies, laying the foundation for what would become a cornerstone of modern information retrieval.[9]

By 2025, Apache Lucene has evolved into a mature top-level project of the Apache Software Foundation, with version 10 released in late 2024, the latest stable release being version 10.3.1 as of October 2025, and ongoing community-driven enhancements ensuring its relevance in contemporary search ecosystems.[10][11] The project now includes official bindings for other languages, such as Lucene.NET for .NET environments and PyLucene for Python integration, broadening its accessibility beyond Java-based applications.[6] These extensions, alongside its role as the underlying engine for projects like Apache Solr and Elasticsearch, underscore Lucene's enduring impact on scalable search infrastructure.[6]

Core Components
Apache Lucene's index is fundamentally an inverted index, structured to enable efficient full-text search by mapping terms to the documents containing them. It consists of one or more segments, where each segment represents a self-contained subset of the entire document collection and serves as a complete searchable unit in its own right.[12] Documents within segments are assigned sequential 32-bit identifiers (docids), and each document comprises multiple fields holding diverse data types, such as text or numerics.[12] Fields contribute to various index structures, including postings lists that form the inverted index for term-to-document lookups, stored fields for retrieving original content, and term vectors for advanced similarity computations.[12]

Terms, derived from field content, are organized in a per-field term dictionary (stored in .tim and .tip files), which contains terms and pointers (offsets or deltas) to the start of each term's postings list in separate postings files (e.g., .doc for document IDs and frequencies, .pos for positions, .pay for payloads and offsets). Postings lists are not stored adjacent to term dictionary entries on disk but are accessed via pointers from the dictionary, facilitating rapid access to the associated postings.[12][13]

Analyzers play a crucial role in processing text during indexing and querying by tokenizing raw input into manageable units and applying normalization to ensure consistent matching.[14] Tokenization breaks text into tokens, such as words separated by whitespace via classes like WhitespaceTokenizer, while subsequent token filters refine these tokens: stemming reduces variants like "running" to "run," and stop-word filters eliminate common terms like "the" or "a" to reduce index size and improve relevance.[14] An Analyzer orchestrates this pipeline, often incorporating CharFilters for pre-tokenization adjustments that preserve character offsets, ensuring the processed tokens are suitable for Lucene's inverted index.[14]

The Directory abstraction manages the physical storage of index files, providing a unified interface for input/output operations across different backends.[15] Implementations include RAMDirectory (superseded by ByteBuffersDirectory in Lucene 9 and later), which holds the entire index in memory for high-speed access in low-volume scenarios, and FSDirectory, which persists data to the file system for durable, larger-scale storage.[15] IndexWriter, in turn, uses a Directory to create and maintain the index, handling document additions, deletions, and updates through methods like addDocument, deleteDocuments (by term or query), and updateDocument for atomic replacements.[16] It buffers changes in RAM, defaulting to a 16 MB limit, before flushing to segments in the Directory, with configurable modes for creating new indexes or appending to existing ones, and employs locking to ensure thread-safe operations.[16]

For querying, the IndexSearcher class enables searches over an opened index via an IndexReader, executing queries to return ranked results through methods like search(Query, int).[17] It coordinates the scoring mechanism using Weight and Scorer components to compute relevance scores based on query-document matches.[17] Results are encapsulated in TopDocs collections, where each hit is a ScoreDoc object containing the document's ID and its computed relevance score, allowing applications to retrieve and sort documents by importance.[17]

Lucene documents are composed of fields, which can be configured as stored, indexed, or both, to balance searchability with retrieval needs.[18] Stored fields preserve the original content for direct access in search results without re-indexing, such as a document's title or metadata, while indexed fields analyze and add content to the inverted index for querying but may omit storage to save space.[18] Examples include TextField for full-text indexing of string content like article bodies, numeric fields such as IntField or DoubleField for range queries and sorting on values like prices or dates, and stored binary fields (e.g., StoredField with byte-array values) for opaque data such as images or serialized objects, while specialized types like InetAddressPoint handle structured binary data such as IP addresses.[18]
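These pieces compose into a short index-then-search cycle. The following is a minimal sketch against the recent (9.5+) Java API, not a production setup; the in-memory directory choice and the field names "id" and "body" are illustrative:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();  // in-memory Directory
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());

        // Index one document: a stored identifier plus an analyzed body field.
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
            Document doc = new Document();
            doc.add(new StringField("id", "doc-1", Field.Store.YES)); // indexed, not tokenized
            doc.add(new TextField("body", "Lucene is a search library", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search the inverted index for the term "search" in the body field.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new TermQuery(new Term("body", "search")), 10);
            for (ScoreDoc hit : hits.scoreDocs) {
                Document found = searcher.storedFields().document(hit.doc);
                System.out.println(found.get("id") + " score=" + hit.score);
            }
        }
    }
}
```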
History

Origins and Early Development
Apache Lucene originated as a personal project by software engineer Doug Cutting, who began developing it in 1997 while seeking to create a marketable tool amid job uncertainty, leveraging the emerging popularity of Java for full-text search capabilities. Cutting's motivation stemmed from the need for an efficient search engine to index and query content on his own website, addressing limitations in existing tools for handling unstructured text data.[19] The initial version was released on SourceForge in April 2000, establishing Lucene as an open-source Java library focused on high-performance indexing and retrieval for single-machine environments.[20]

In September 2001, Lucene joined the Apache Software Foundation as a subproject under the Jakarta initiative, marking its transition into a collaborative open-source effort and aligning it with other Java-based Apache projects.[2] The first official Apache release, version 1.2 RC1, arrived in October 2001, with packages renamed to the org.apache.lucene namespace and the license updated to the Apache License.[21] By 2004, version 1.4 provided enhanced stability, including improvements in query parsing, indexing efficiency, and support for analyzers, solidifying its role as a robust text search foundation.[22] Early development emphasized single-node performance optimizations, such as efficient inverted indexing and relevance scoring, but lacked native distributed processing, limiting scalability for large-scale applications like web crawling.[23]

A pivotal early influence was Lucene's integration into the Nutch project, an open-source web crawler and search engine co-founded by Cutting and Mike Cafarella in 2002.[24] Nutch adopted Lucene for its indexing backend starting around 2003, enabling the system to handle full-text search over crawled web content; by June 2003, this combination powered a demonstration indexing over 100 million pages, showcasing Lucene's potential despite its single-node constraints.[25] These efforts highlighted Lucene's strengths in modular design and extensibility, while underscoring challenges in distributed fault tolerance that would later inspire related projects.

In February 2005, Lucene graduated from the Jakarta subproject to become a top-level Apache project, granting it greater autonomy and community governance.[2] This transition coincided with increasing external contributions, including from Yahoo!, which hired Cutting in early 2006 and began investing in Lucene enhancements to support its search infrastructure needs.[26] The move established Lucene as a cornerstone of open-source search technology, fostering broader adoption in enterprise and research applications.

Major Releases and Evolution
Apache Lucene 4.0, released on October 12, 2012, represented a major rewrite emphasizing improved modularity and extensibility. This version introduced the codec API, enabling pluggable storage formats that allowed developers to customize index structures for specific use cases, such as optimizing for compression or speed. The redesign also streamlined the indexing pipeline, reducing complexity while enhancing overall performance and maintainability.

Lucene 5.0, released on February 20, 2015, advanced near-real-time search capabilities by integrating more efficient segment merging and reader management.[27] A key change was the removal of the deprecated FieldCache, replaced by more robust doc values for faceting and sorting, which improved memory usage and query speed.[27] These updates laid the groundwork for handling larger-scale, dynamic indexes with minimal latency.

In March 2019, Lucene 8.0 established Java 8 as the minimum baseline, enabling the use of modern language features for better concurrency and garbage collection.[28] This release included optimizations in postings formats and query parsers that contributed to higher throughput in multi-threaded environments.

Lucene 9.0, released on December 7, 2021, prioritized stability and long-term maintenance with extensive deprecation cleanups and API stabilizations.[29] It incorporated Unicode 13.0 support for improved internationalization in tokenization and analysis modules, along with the introduction of the VectorValues API for dense vector indexing and similarity computations essential for machine learning applications.[29] The version also refined index formats for backward compatibility, ensuring seamless upgrades while addressing accumulated technical debt from prior iterations.[30]

Lucene 10.0, released on October 14, 2024, focused on hardware efficiency, requiring JDK 21 and adding new APIs such as IndexInput#prefetch for optimized I/O parallelism.[3] It introduced sparse indexing for doc values, reducing CPU and storage overhead in scenarios with irregular data distributions. These changes enhanced search parallelism, yielding up to 40% speedups in benchmarked top-k queries compared to previous versions.

Building on this, Lucene 10.2, released on April 10, 2025, further boosted query performance, with up to 5x faster execution in certain workloads and 3.5x improvements in pre-filtered vector searches. Enhancements included better integration of seeded KNN queries and reciprocal rank fusion for result reranking, alongside refined I/O handling to minimize latency in distributed setups. Subsequent releases continued this cadence: Lucene 10.3.0, released on September 13, 2025, introduced vectorized lexical search with up to 40% speedups and new multi-vector reranking capabilities, while 10.3.1 (October 6, 2025) and 10.3.2 (November 17, 2025) provided bug fixes and further optimizations.

By late 2025, Lucene's evolution continued toward deeper AI and machine learning integrations, exemplified by expanded dense vector search capabilities for semantic similarity tasks, reflecting its adaptation to modern data processing demands.[3]

Technical Architecture
Indexing Process
The indexing process in Apache Lucene begins with document preparation, where raw data is structured into Document objects, each representing a searchable unit such as a web page or record. Developers create a Document instance and add IndexableField objects to it, specifying field names, values (e.g., strings, binaries, or numeric types), and attributes like whether the field should be stored for retrieval, indexed for searching, or both.[31] Indexing options include enabling norms, which are byte-sized normalization factors computed per field to account for document length during scoring, typically to penalize longer fields in term frequency calculations; norms can be omitted for fields where length normalization is unnecessary, to save space.[32] Additionally, term vector generation can be enabled via the storeTermVectors() option on fields, storing positional and offset information for tokens to support advanced features like highlighting, though this requires the field to be indexed.[32]
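A hedged sketch of this per-field configuration using FieldType; the field name and the specific option choices are illustrative:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;

// Indexed with positions, stored, with norms kept and term vectors recorded.
FieldType type = new FieldType();
type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
type.setTokenized(true);           // run the field value through the analyzer
type.setStored(true);              // keep the original value for retrieval
type.setOmitNorms(false);          // keep length-normalization factors for scoring
type.setStoreTermVectors(true);    // per-document token list, e.g. for highlighting
type.freeze();                     // make the configuration immutable

Document doc = new Document();
doc.add(new Field("body", "some analyzed and stored text", type));
```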
Once prepared, documents pass through the analysis pipeline before being indexed, transforming text into tokens suitable for the inverted index. An Analyzer processes each field's text by first tokenizing it into a stream of terms using a Tokenizer (e.g., breaking on whitespace or punctuation), followed by a chain of filters that modify the stream—such as lowercasing, removing stop words, stemming, or applying custom transformations.[33] The resulting tokens, along with their positions and payloads if needed, form the indexed terms; during this phase, term vectors are generated if enabled, capturing the token list per field for later use.[32] This pipeline ensures language- and domain-specific preprocessing, with the IndexWriter's analyzer handling the conversion atomically per document addition.[16]
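As a sketch, a custom Analyzer chaining a standard tokenizer with lowercasing, stop-word removal, and Porter stemming might look like this (classes from lucene-core and lucene-analysis-common; the exact chain is an illustrative choice, not a Lucene default):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();        // split input into tokens
        TokenStream stream = new LowerCaseFilter(source);  // normalize case
        stream = new StopFilter(stream, EnglishAnalyzer.ENGLISH_STOP_WORDS_SET); // drop "the", "a", ...
        stream = new PorterStemFilter(stream);             // "running" -> "run"
        return new TokenStreamComponents(source, stream);
    }
};
```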
Segments are created and managed by the IndexWriter class, which buffers incoming documents in memory until a threshold is reached (either a configurable RAM limit, 16 MB by default, or a maximum number of buffered documents), then flushes them to disk as immutable segment files in a Directory.[16] Each new segment contains an inverted index of terms to document postings, stored fields, norms, and other structures, written in a codec-specific format for efficiency. To maintain performance, Lucene merges small segments into larger ones in the background according to a configurable MergePolicy: the default TieredMergePolicy picks merges of roughly equal-sized segments to keep the segment count bounded, while the older LogByteSizeMergePolicy groups segments into exponentially larger levels; both aim to reduce the number of segments over time while minimizing write amplification. Merges run concurrently in background threads managed by a MergeScheduler, ensuring indexing throughput without blocking additions.[16]
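A sketch of tuning these thresholds through IndexWriterConfig; the buffer size, tier setting, and index path are illustrative values, not recommendations:

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.FSDirectory;

IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
config.setRAMBufferSizeMB(64.0);   // flush a new segment after ~64 MB (default 16 MB)
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);

TieredMergePolicy mergePolicy = new TieredMergePolicy();
mergePolicy.setSegmentsPerTier(10.0);  // tolerate ~10 segments per size tier before merging
config.setMergePolicy(mergePolicy);

IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("/tmp/idx")), config);
```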
Updates and deletes are handled efficiently without rewriting entire segments, using soft mechanisms to mark changes. Deletes—whether by term, query, or document ID—are buffered and applied as bitsets (LiveDocs) per segment, logically excluding documents from searches without physical removal until a merge occurs; this "soft delete" approach avoids immediate I/O costs.[16] Updates, such as modifying a field, are atomic: the IndexWriter first deletes the old version by a unique term (e.g., an ID field), then adds the revised document, ensuring consistency even across failures.[16] For partial updates, soft deletes can leverage doc values fields to filter documents virtually during reads.[16]
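Continuing the writer sketch above, deletes and atomic updates are keyed by a term, here an assumed unique "id" field:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.Term;

// Delete: marks matching docs in the segment's live-docs bitset; no segment rewrite.
writer.deleteDocuments(new Term("id", "doc-42"));

// Update: atomically delete-by-term, then add the replacement document.
Document revised = new Document();
revised.add(new StringField("id", "doc-7", Field.Store.YES));
// ... add the remaining fields of the new version here ...
writer.updateDocument(new Term("id", "doc-7"), revised);
```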
Commits and refreshes enable control over index durability and visibility, supporting near-real-time (NRT) search. A full commit, invoked via commit(), flushes all buffered changes, writes new segments, applies deletes, and syncs files to storage for crash recovery, creating a new index generation.[16] However, for low-latency applications, NRT searchers—opened via DirectoryReader.open(IndexWriter)—periodically refresh to include unflushed segments and buffered deletes without a full commit, typically every second or on demand, balancing freshness with overhead.[16] This decouples indexing from search, allowing continuous ingestion while queries see recent additions promptly.[16]
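A sketch of the NRT pattern, reusing the writer from the sketches above:

```java
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;

// NRT view: sees the writer's flushed and buffered changes without a full commit.
DirectoryReader reader = DirectoryReader.open(writer);
IndexSearcher searcher = new IndexSearcher(reader);

// Periodic refresh: openIfChanged returns null when nothing is new.
DirectoryReader newer = DirectoryReader.openIfChanged(reader);
if (newer != null) {
    reader.close();
    reader = newer;
    searcher = new IndexSearcher(reader);
}

writer.commit();   // durability point: fsyncs segment files and commit metadata
```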
Query Processing and Search
Apache Lucene processes search queries by first parsing user input into structured Query objects, enabling efficient retrieval from the inverted index. The QueryParser, implemented using JavaCC, interprets query strings into clauses, supporting operators such as plus (+) for required terms, minus (-) for prohibited terms, and parentheses for grouping.[34] Analyzers play a crucial role by tokenizing and normalizing query terms, ensuring consistency with the indexed data through processes like stemming and stopword removal. This results in Query subclasses tailored to specific needs; for instance, BooleanQuery combines multiple subqueries with logical operators like MUST (AND), SHOULD (OR), and MUST_NOT (NOT) to express complex conditions. Similarly, PhraseQuery matches sequences of terms within a specified proximity, allowing slop factors for approximate phrase matching.

Since version 9.0 (released in 2021), Lucene supports dense vector indexing and approximate nearest-neighbor (ANN) search for semantic and similarity-based retrieval. During indexing, developers add KnnFloatVectorField or KnnByteVectorField to documents, specifying vector dimensions (up to 4096 in recent codecs) and an index strategy such as Hierarchical Navigable Small World (HNSW) graphs for efficient querying. These vectors are stored in a separate structure alongside the inverted index, with codecs handling compression and merging. In query processing, a KnnFloatVectorQuery retrieves the top-k nearest vectors using metrics like cosine similarity or Euclidean distance, integrating with filters and reranking via hybrid search that combines text and vector scores. This enables applications like recommendation systems and semantic search, with optimizations for multi-segment indexes and pre-filtering.[35][36]
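A sketch of building such queries programmatically; the field names "body" and "embedding" and the vector dimensionality are assumptions:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;

// Roughly equivalent to the parsed query: +body:lucene "search library"~2
BooleanQuery bq = new BooleanQuery.Builder()
    .add(new TermQuery(new Term("body", "lucene")), BooleanClause.Occur.MUST)
    .add(new PhraseQuery(2, "body", "search", "library"), BooleanClause.Occur.SHOULD)
    .build();

// Top-10 approximate nearest neighbours over a float vector field.
float[] queryVector = new float[768];   // stand-in for an embedding model's output
KnnFloatVectorQuery knn = new KnnFloatVectorQuery("embedding", queryVector, 10);
```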
Search execution begins with an IndexSearcher instance, which operates on an opened IndexReader to access the index segments. The searcher traverses the inverted index's postings lists—structured data from the indexing phase that map terms to document occurrences—and evaluates the Query against them to identify matching documents.[37] It collects candidate hits and returns a TopDocs object containing the top-scoring results up to a specified limit, including ScoreDoc arrays with document IDs and relevance scores. This process supports concurrent execution across index segments using an ExecutorService for improved performance on multi-core systems.[38]
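A sketch of executing a query with a per-segment thread pool; the reader and the query bq carry over from the previous sketches:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

ExecutorService pool = Executors.newFixedThreadPool(4);
IndexSearcher searcher = new IndexSearcher(reader, pool);  // segments searched concurrently
TopDocs top = searcher.search(bq, 10);                     // top-10 hits by score
for (ScoreDoc sd : top.scoreDocs) {
    System.out.println("docid=" + sd.doc + " score=" + sd.score);
}
```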
Relevance scoring in Lucene determines document ranking based on query-term matches, with the default model being BM25Similarity, an optimized implementation of the Okapi BM25 algorithm introduced as the standard in Lucene 6.0.[39] BM25 computes scores using inverse document frequency (IDF) for term rarity, a non-linear term frequency (TF) saturation controlled by parameter k1 (default 1.2), and document length normalization via parameter b (default 0.75), balancing precision and recall in information retrieval.[39] Developers can configure alternative models, such as ClassicSimilarity for the traditional vector space model using cosine similarity on TF-IDF vectors, by setting a custom Similarity implementation on the IndexSearcher.
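Swapping the model is a one-line configuration on the searcher; for index-time norms to match, the same Similarity is usually set on the IndexWriterConfig as well. A minimal sketch with the default BM25 parameters spelled out:

```java
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.ClassicSimilarity;

// Tune BM25's term-frequency saturation (k1) and length normalization (b) ...
searcher.setSimilarity(new BM25Similarity(1.2f, 0.75f));

// ... or revert to the classic TF-IDF vector space model.
searcher.setSimilarity(new ClassicSimilarity());
```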
Filtering refines search results without affecting scores. In current releases this is expressed as a BooleanQuery clause with Occur.FILTER, which constrains matches but contributes nothing to the score; older releases used dedicated Filter classes such as QueryWrapperFilter for wrapping another Query and NumericRangeFilter for date or numeric constraints, since removed from the API. Sorting extends beyond relevance scores using a Sort object, which specifies fields like document ID, numeric values, or custom comparators; for example, sorting by publication date descending and secondarily by score. The IndexSearcher applies these through overloads such as search(Query, int, Sort) to reorder the TopDocs accordingly.
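A sketch combining a scored clause, a non-scoring range filter, and field sorting; the "published" field is an assumption and, to be sortable, would also need to be indexed with doc values:

```java
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// Scored text clause plus a date-range filter that must match but adds no score.
BooleanQuery filtered = new BooleanQuery.Builder()
    .add(new TermQuery(new Term("body", "lucene")), BooleanClause.Occur.MUST)
    .add(LongPoint.newRangeQuery("published", 20200101L, 20251231L),
         BooleanClause.Occur.FILTER)
    .build();

// Sort by date descending, then by relevance score.
Sort byDate = new Sort(
    new SortField("published", SortField.Type.LONG, true),
    SortField.FIELD_SCORE);
TopDocs results = searcher.search(filtered, 10, byDate);
```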
For handling large result sets, Lucene supports pagination through the search(Query, int) method, which limits output to the top n hits for shallow pages, and deeper pagination via IndexSearcher.searchAfter, which resumes from a previous ScoreDoc without recollecting the earlier hits.[37] Custom Collector implementations, such as TopScoreDocCollector, enable fine-grained control over result gathering and pagination. Basic faceting is provided in the lucene-facet module, where FacetField categorizes documents during indexing, and FacetsCollector computes counts and hierarchies at search time for drill-down navigation, such as term frequencies in fields like categories or price ranges.[40]
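A sketch of cursor-style pagination with searchAfter, reusing the filtered query from the previous sketch:

```java
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

// First page of ten hits.
TopDocs page1 = searcher.search(filtered, 10);
ScoreDoc last = page1.scoreDocs[page1.scoreDocs.length - 1];

// Next page resumes after the last hit instead of recollecting from rank 0.
TopDocs page2 = searcher.searchAfter(last, filtered, 10);
```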
