Apache Lucene 9.8 is a high-performance, full-featured search engine library written in Java. It provides powerful indexing and search capabilities for applications requiring structured search, full-text search, faceting, nearest-neighbor vector search, spell correction, and query suggestions.
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-core</artifactId>
<version>9.8.0</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-analysis-common</artifactId>
<version>9.8.0</version>
</dependency>
<!-- For query parsing (optional) -->
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-queryparser</artifactId>
<version>9.8.0</version>
</dependency>
Directory Type | Best For | Avoid When | Key Advantages | Key Limitations |
---|---|---|---|---|
ByteBuffersDirectory | • Short-lived, transient indices • Testing and demos • In-memory operations without persistence • Small indices (less than a few hundred MB) | • Large indices • Production systems requiring persistence • Limited heap memory environments | • Fast in-memory operation • No disk I/O • Better multi-threading than old RAMDirectory • No file system dependency | • Uses Java heap memory • Limited by available RAM • Higher GC pressure • Data is lost when JVM restarts |
MMapDirectory | • Production environments • Larger indices • Performance-critical applications • Systems with sufficient virtual memory | • Very memory-constrained 32-bit systems | • Uses OS memory-mapping for optimal performance • Takes advantage of OS page cache • Persists data to disk • Lower heap usage | • Requires file system access • Initial I/O overhead when data not in OS cache • Can trigger SIGSEGV errors if misused with concurrent access |
FSDirectory | • Basic filesystem storage • When Lucene should auto-select best implementation | • When you specifically need memory-mapping features | • Automatically selects best implementation • Works across all platforms | • May not be optimized for specific use cases |
NIOFSDirectory | • Fallback when MMapDirectory is not available | • When better alternatives are available | • Uses java.nio for file access • Better than SimpleFSDirectory | • Not as performant as MMapDirectory |
Field Type | Best For | Index Behavior | Store Behavior | When To Choose |
---|---|---|---|---|
TextField | • Long text content • Fields requiring full-text search • Description fields • Document bodies | Analyzed, tokenized, and indexed | Optional (use Field.Store.YES to store) | When you need full-text search with word analysis, stemming, stop words, etc. |
StringField | • IDs • Codes • URLs • Exact match fields • Enum values | Indexed as a single token without analysis | Optional (use Field.Store.YES to store) | When you need exact matching without text analysis |
StoredField | • Large fields only needed for retrieval • Data not used for searching • Binary content | Not indexed | Always stored | When you only need to retrieve the value but never search on it |
IntPoint, LongPoint, etc. | • Numeric values for range queries • Dates • Prices • Coordinates | Indexed for range queries | Not stored by default | When you need efficient numeric range queries; add a separate StoredField if you need to retrieve values |
SortedDocValuesField | • Fields used for sorting • Faceting fields | Not directly searchable | Stored in column-oriented format | When you need efficient sorting or faceting on a field |
NumericDocValuesField | • Numeric values for efficient sorting • Fields for function queries | Not directly searchable | Stored in column-oriented format | When you need to sort on numeric values or use them in function queries |
KnnFloatVectorField | • Vector embeddings • Semantic search vectors • ML feature vectors | Indexed for vector similarity search | Not stored | When you need to perform nearest-neighbor (similarity) searches with 32-bit floating point precision |
KnnByteVectorField | • Quantized vector embeddings • Memory-efficient vector storage | Indexed for vector similarity search | Not stored | When you need vector search with reduced memory footprint and can accept slight precision loss |
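The "slight precision loss" listed for KnnByteVectorField comes from mapping each float component to a single byte before indexing. The sketch below shows one simple way to do that mapping in plain Java. It is an illustration under stated assumptions, not something Lucene provides: the class and method names (ByteQuantizationSketch, quantize, dequantize) are hypothetical, and it assumes every component already lies in [-1, 1], as is typical for unit-normalized embeddings.

```java
/** Hypothetical helper: linear float-to-byte quantization done by the application. */
final class ByteQuantizationSketch {

    /** Maps each component from [-1, 1] to a byte in [-127, 127]; out-of-range values are clamped. */
    static byte[] quantize(float[] v) {
        byte[] q = new byte[v.length];
        for (int i = 0; i < v.length; i++) {
            float clamped = Math.max(-1f, Math.min(1f, v[i]));
            q[i] = (byte) Math.round(clamped * 127f);
        }
        return q;
    }

    /** Approximate inverse of quantize; the rounding error is at most 0.5 / 127 per component. */
    static float[] dequantize(byte[] q) {
        float[] v = new float[q.length];
        for (int i = 0; i < q.length; i++) {
            v[i] = q[i] / 127f;
        }
        return v;
    }

    public static void main(String[] args) {
        float[] original = {0.1f, 0.2f, 0.3f, 0.4f};
        float[] roundTrip = dequantize(quantize(original));
        for (int i = 0; i < original.length; i++) {
            System.out.println(original[i] + " -> " + roundTrip[i]);
        }
    }
}
```

The resulting byte[] is what would be passed to a KnnByteVectorField, at one byte per dimension instead of four.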
Query Type | Best For | Performance | Limitations | When To Choose |
---|---|---|---|---|
TermQuery | • Exact matches • ID lookups • Code lookups | Very fast | • No text analysis • Case sensitive | When you need exact, verbatim matches on a specific term |
BooleanQuery | • Combining multiple criteria • Complex logical conditions | Depends on clauses | • Default limit of 1024 clauses | When you need to combine multiple query conditions with AND, OR, NOT logic |
PhraseQuery | • Exact phrase matching • Multi-word expressions | Moderate | • Performance degrades with high slop values | When word order and proximity matter (e.g., “big apple” exactly) |
FuzzyQuery | • Typo-tolerant search • Spelling variations | Slow (high CPU) | • Can be very expensive • Can match too broadly | When you need to handle typos and minor spelling variations |
WildcardQuery | • Pattern matching • Prefix/suffix matching | Very slow on large indices | • Can be extremely expensive • Leading wildcards (e.g. *term) particularly slow | When you need pattern matching and the field is selective |
PrefixQuery | • Autocomplete • Begins-with queries | Moderate | • Less flexible than wildcards | When you need prefix matching without full wildcard capabilities |
Range queries (IntPoint.newRangeQuery, TermRangeQuery) | • Date ranges • Numeric ranges • Alphabetical ranges | Fast for *Point fields | • Performance varies by field type | When you need to find values within a specific range |
RegexpQuery | • Complex pattern matching | Very slow | • Can be extremely expensive | When you need sophisticated pattern matching and performance is not critical |
KnnFloatVectorQuery | • Semantic search • Similarity search | Fast with proper indexing | • Requires vector fields • Approximate results | When you need to find documents with vectors similar to a query vector |
Function | Best For | Normalization | Considerations | When To Choose |
---|---|---|---|---|
EUCLIDEAN | • General distance measurement • When magnitude matters • Default when no similarity function is passed to the vector field | Not required | • Sensitive to magnitude differences | When the absolute distance between vectors matters and vector magnitudes have meaning, or when you want a general-purpose function with no special requirements |
DOT_PRODUCT | • Optimized cosine similarity • Recommendation systems | Required (unit vectors) | • All vectors must be normalized to unit length | When vectors are normalized and you want maximum performance for cosine similarity |
COSINE | • Semantic similarity • When vectors cannot be normalized in advance | Not required | • Less efficient than DOT_PRODUCT | When you need cosine similarity but cannot pre-normalize your vectors |
MAXIMUM_INNER_PRODUCT | • Recommendation systems • When higher values in dimensions correlate to preference | Not required | • Not normalized to [0,1] range • Raw inner products can be negative (Lucene rescales them into non-negative scores) | When inner product aligns with your use case (e.g., user preferences represented as dimensional values) |
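The link between COSINE and DOT_PRODUCT in the table above can be checked with plain arithmetic: once vectors are scaled to unit length, their dot product equals their cosine similarity, while Euclidean distance still reacts to magnitude. The following is a minimal sketch in plain Java with no Lucene classes; the class and method names (SimilaritySketch, dot, cosine, squaredEuclidean, normalize) are made up for illustration and do not reflect how Lucene computes or scales its scores.

```java
/** Plain-Java arithmetic for comparing similarity functions; not Lucene's implementation. */
final class SimilaritySketch {

    /** Inner product of two vectors of equal dimension. */
    static double dot(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (double) a[i] * b[i];
        }
        return sum;
    }

    /** Cosine of the angle between a and b; vector magnitude is divided out. */
    static double cosine(float[] a, float[] b) {
        return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
    }

    /** Squared L2 distance; grows with magnitude differences. */
    static double squaredEuclidean(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = (double) a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** Returns a copy of v scaled to unit length, as DOT_PRODUCT expects. */
    static float[] normalize(float[] v) {
        double norm = Math.sqrt(dot(v, v));
        if (norm == 0.0) {
            throw new IllegalArgumentException("cannot normalize a zero vector");
        }
        float[] unit = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            unit[i] = (float) (v[i] / norm);
        }
        return unit;
    }

    public static void main(String[] args) {
        float[] a = {3f, 4f};
        float[] b = {6f, 8f}; // same direction as a, twice the length
        System.out.println("cosine            = " + cosine(a, b));
        System.out.println("dot on unit vecs  = " + dot(normalize(a), normalize(b)));
        System.out.println("squared euclidean = " + squaredEuclidean(a, b));
    }
}
```

For these two vectors the cosine and the unit-vector dot product agree (up to float rounding), whereas the Euclidean distance is non-zero because the magnitudes differ. That is the practical reason to normalize once at index time and use DOT_PRODUCT when only direction matters.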
Operator | Effect on Results | Effect on Score | Use Case |
---|---|---|---|
MUST (AND) | Document must match this clause | Score contributes positively | Required criteria that must be present |
SHOULD (OR) | Document can match this clause | Score contributes positively if matched | Optional criteria that improves relevance |
MUST_NOT (NOT) | Document must not match this clause | No effect on score | Exclusionary criteria |
FILTER | Document must match this clause | No effect on score | Required criteria where relevance scoring doesn’t matter |
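To make the operator table concrete, the toy model below evaluates a set of clauses against a single document and reports whether the document matches and what score it receives. It is a sketch under assumptions rather than Lucene code: OccurSketch, ClauseResult, and evaluate are hypothetical names, per-clause scores are supplied by hand, and the default behavior is assumed (no minimum-should-match setting), where SHOULD clauses are optional only if a MUST or FILTER clause is present.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalDouble;

/** Toy model of how clause operators combine; not Lucene's implementation. */
final class OccurSketch {

    enum Occur { MUST, SHOULD, MUST_NOT, FILTER }

    /** Outcome of one clause for one document: did it match, and with what score. */
    static final class ClauseResult {
        final Occur occur;
        final boolean matched;
        final float score;

        ClauseResult(Occur occur, boolean matched, float score) {
            this.occur = occur;
            this.matched = matched;
            this.score = score;
        }
    }

    /** Returns the combined score, or empty if the document does not match. */
    static OptionalDouble evaluate(List<ClauseResult> clauses) {
        boolean hasRequiredClause = false;
        boolean shouldMatched = false;
        double score = 0.0;
        for (ClauseResult c : clauses) {
            switch (c.occur) {
                case MUST: // required, and adds to the score
                    hasRequiredClause = true;
                    if (!c.matched) return OptionalDouble.empty();
                    score += c.score;
                    break;
                case FILTER: // required, but its score is ignored
                    hasRequiredClause = true;
                    if (!c.matched) return OptionalDouble.empty();
                    break;
                case MUST_NOT: // excludes the document when it matches
                    if (c.matched) return OptionalDouble.empty();
                    break;
                case SHOULD: // optional, adds to the score when it matches
                    if (c.matched) {
                        shouldMatched = true;
                        score += c.score;
                    }
                    break;
            }
        }
        // With no MUST or FILTER clause, at least one SHOULD clause has to match.
        if (!hasRequiredClause && !shouldMatched) return OptionalDouble.empty();
        return OptionalDouble.of(score);
    }

    public static void main(String[] args) {
        List<ClauseResult> clauses = Arrays.asList(
            new ClauseResult(Occur.MUST, true, 1.5f),
            new ClauseResult(Occur.FILTER, true, 9.0f), // matched, score ignored
            new ClauseResult(Occur.SHOULD, true, 0.5f),
            new ClauseResult(Occur.MUST_NOT, false, 0f));
        System.out.println(evaluate(clauses));
    }
}
```

In this model the FILTER clause narrows the result set exactly like MUST but leaves the score untouched, which is why FILTER suits criteria such as category or date restrictions.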
Use in-memory indices (ByteBuffersDirectory) when:
Use on-disk indices (MMapDirectory) when:
Specific recommendations for vector search:
ToParentBlockJoin[Float|Byte]KnnVectorQuery for joining child vector documents with their parent documents.
KnnByteVectorField and KnnByteVectorQuery, alongside the float vector equivalents.
A fix for TermsEnum#seekCeil on doc values terms enums that caused IndexOutOfBoundsException.
Lucene indexing involves three key concepts:
(Field.Store.YES)
// Create an index writer configuration
IndexWriterConfig config = new IndexWriterConfig(analyzer);
// OpenMode options:
config.setOpenMode(OpenMode.CREATE); // New index, deleting existing
config.setOpenMode(OpenMode.APPEND); // Add to existing index
config.setOpenMode(OpenMode.CREATE_OR_APPEND); // Create if missing, else append
// Optional performance settings
config.setRAMBufferSizeMB(256.0); // Default is 16MB
config.setMaxBufferedDocs(10000); // Default is IndexWriterConfig.DISABLE_AUTO_FLUSH
// Create an index writer
IndexWriter writer = new IndexWriter(directory, config);
// Document examples showing different field combinations:
// Example 1: Basic searchable and retrievable text
Document doc1 = new Document();
doc1.add(new TextField("title", "Searchable Title", Field.Store.YES)); // Analyzed, stored
doc1.add(new StringField("id", "doc1", Field.Store.YES)); // Not analyzed, stored
// Example 2: Numeric field with search and sort capabilities
Document doc2 = new Document();
doc2.add(new IntPoint("price", 1000)); // For range queries
doc2.add(new StoredField("price", 1000)); // For value retrieval
doc2.add(new NumericDocValuesField("price", 1000)); // For sorting/faceting
// Example 3: Text field with different options
Document doc3 = new Document();
// Analyzed and stored (full-text search + retrieval)
doc3.add(new TextField("description", "Product description", Field.Store.YES));
// Analyzed but not stored (searchable only)
doc3.add(new TextField("keywords", "searchable terms", Field.Store.NO));
// Not analyzed but stored (exact match + retrieval)
doc3.add(new StringField("sku", "12345", Field.Store.YES));
// Indexing options
writer.addDocument(doc1); // Add new document
writer.updateDocument( // Update existing document
new Term("id", "doc1"), // Unique identifier
doc2 // New document
);
writer.deleteDocuments(new Term("id", "doc1")); // Delete by term
writer.deleteAll(); // Clear the index
// Commit options
writer.commit(); // Standard commit
writer.flush(); // Flush buffer without commit
writer.forceMerge(1); // Optimize index (expensive)
// Always close when done
writer.close();
Document doc = new Document();
// Example 1: Field that is both indexed and stored
doc.add(new StringField("id", "doc123", Field.Store.YES)); // Can search and retrieve
// Example 2: Field that is only indexed (for searching) but not stored
doc.add(new IntPoint("price", 1000)); // Can search but can't retrieve original value
// Need an additional StoredField to retrieve the value:
doc.add(new StoredField("price", 1000));
// Example 3: Field for sorting/faceting with value retrieval
doc.add(new NumericDocValuesField("rating", 4)); // For sorting
doc.add(new StoredField("rating", 4)); // For retrieval
// Example 4: Field that is only stored (no searching/sorting)
doc.add(new StoredField("metadata", "some value")); // Can retrieve but can't search
// Later, when retrieving:
Document retrievedDoc = searcher.doc(docId);
String id = retrievedDoc.get("id"); // Works - field was stored
// get() returns a String; read numeric stored fields through numericValue()
int rating = retrievedDoc.getField("rating").numericValue().intValue(); // Works - added StoredField
String metadata = retrievedDoc.get("metadata"); // Works - field was stored
// retrievedDoc.getField("price") would return null if only IntPoint was used
// Create an index reader
IndexReader reader = DirectoryReader.open(directory);
// Create an index searcher
IndexSearcher searcher = new IndexSearcher(reader);
// Create a query (term query example)
Query query = new TermQuery(new Term("title", "lucene"));
// Execute the search
TopDocs results = searcher.search(query, 10);
// Process results
for (ScoreDoc scoreDoc : results.scoreDocs) {
Document doc = searcher.doc(scoreDoc.doc);
System.out.println(doc.get("title") + " (Score: " + scoreDoc.score + ")");
}
// Close the reader when done
reader.close();
Query query = new TermQuery(new Term("field", "term"));
BooleanQuery.Builder builder = new BooleanQuery.Builder();
builder.add(new TermQuery(new Term("field", "term1")), BooleanClause.Occur.MUST);
builder.add(new TermQuery(new Term("field", "term2")), BooleanClause.Occur.SHOULD);
builder.add(new TermQuery(new Term("field", "term3")), BooleanClause.Occur.MUST_NOT);
Query query = builder.build();
PhraseQuery.Builder builder = new PhraseQuery.Builder();
builder.add(new Term("field", "term1"));
builder.add(new Term("field", "term2"));
builder.setSlop(1); // Allow terms to be 1 position apart
Query query = builder.build();
Query query = new WildcardQuery(new Term("field", "te*m"));
Query query = IntPoint.newRangeQuery("year", 2020, 2025);
Query query = new FuzzyQuery(new Term("field", "term"), 2); // Edit distance of 2
QueryParser parser = new QueryParser("defaultField", analyzer);
Query query = parser.parse("title:lucene OR content:search");
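The FuzzyQuery example above passes a maximum edit distance of 2, which is also the largest value FuzzyQuery accepts. An "edit" here is a single-character insertion, deletion, or substitution, and by default a swap of two adjacent characters also counts as one edit. The sketch below spells out that definition as a small dynamic-programming routine. It is only a definition aid: Lucene matches terms with automata rather than a table like this and counts Unicode code points, whereas this hypothetical EditDistanceSketch class works on Java chars.

```java
/** Edit distance with adjacent transpositions (optimal string alignment); illustrative only. */
final class EditDistanceSketch {

    /** Minimum number of insertions, deletions, substitutions, or adjacent swaps turning a into b. */
    static int distance(String a, String b) {
        int n = a.length();
        int m = b.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i; // delete all of a's prefix
        for (int j = 0; j <= m; j++) d[0][j] = j; // insert all of b's prefix
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = Math.min(
                    Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), // deletion, insertion
                    d[i - 1][j - 1] + cost);                    // match or substitution
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    best = Math.min(best, d[i - 2][j - 2] + 1); // adjacent transposition
                }
                d[i][j] = best;
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        System.out.println(distance("term", "team"));     // one substitution
        System.out.println(distance("lucene", "lucnee")); // one adjacent swap
        System.out.println(distance("kitten", "sitting"));
    }
}
```

Under this definition a query term such as "lucnee" is within distance 1 of "lucene", so it would fall inside the range of a fuzzy search with a maximum of 1 or 2 edits, while "kitten" and "sitting" are too far apart to match each other.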
Lucene 9.8 uses ByteBuffersDirectory for in-memory operations. It replaces RAMDirectory, which was deprecated in Lucene 8 and removed in Lucene 9.0:
// Create in-memory directory
Directory directory = new ByteBuffersDirectory();
// Configure analyzer and writer
Analyzer analyzer = new StandardAnalyzer();
IndexWriterConfig config = new IndexWriterConfig(analyzer);
config.setOpenMode(OpenMode.CREATE); // For a new index
IndexWriter writer = new IndexWriter(directory, config);
// Create document with vector field
Document doc = new Document();
// Add metadata fields
doc.add(new StringField("id", "doc1", Field.Store.YES));
doc.add(new TextField("title", "Example Document", Field.Store.YES));
// Add vector fields - KnnFloatVectorField is recommended over deprecated KnnVectorField
float[] vector = new float[] {0.1f, 0.2f, 0.3f, 0.4f};
// Use COSINE similarity for semantic search applications
doc.add(new KnnFloatVectorField("embedding", vector, VectorSimilarityFunction.COSINE));
// Or default to Euclidean distance (L2)
doc.add(new KnnFloatVectorField("vector_l2", vector));
// Index the document
writer.addDocument(doc);
writer.commit();
// Update a document with a new vector
// First, find and delete the old document
Term idTerm = new Term("id", "doc1");
writer.deleteDocuments(idTerm);
// Create new document with same ID but updated vector
Document updatedDoc = new Document();
updatedDoc.add(new StringField("id", "doc1", Field.Store.YES));
updatedDoc.add(new TextField("title", "Updated Document", Field.Store.YES));
// Updated vector
float[] updatedVector = new float[] {0.15f, 0.25f, 0.35f, 0.45f};
updatedDoc.add(new KnnFloatVectorField("embedding", updatedVector, VectorSimilarityFunction.COSINE));
// Add the updated document
writer.addDocument(updatedDoc);
writer.commit();
// Create a reader and searcher
IndexReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
// Vector similarity search (KNN)
float[] queryVector = new float[] {0.12f, 0.22f, 0.32f, 0.42f};
Query knnQuery = new KnnFloatVectorQuery("embedding", queryVector, 10);
TopDocs results = searcher.search(knnQuery, 10);
// Process results
for (ScoreDoc scoreDoc : results.scoreDocs) {
Document doc = searcher.doc(scoreDoc.doc);
String id = doc.get("id");
String title = doc.get("title");
float score = scoreDoc.score;
System.out.println("Document ID: " + id + ", Title: " + title + ", Score: " + score);
}
// Create a filter (e.g., category = "electronics")
Query filter = new TermQuery(new Term("category", "electronics"));
// Vector search with filter
Query filteredKnnQuery = new KnnFloatVectorQuery("embedding", queryVector, 10, filter);
TopDocs filteredResults = searcher.search(filteredKnnQuery, 10);
// Delete a document by ID
Term idTerm = new Term("id", "doc1");
writer.deleteDocuments(idTerm);
writer.commit();
// Or delete documents matching a query
Query deleteQuery = new TermQuery(new Term("category", "obsolete"));
writer.deleteDocuments(deleteQuery);
writer.commit();
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.index.IndexWriterConfig.OpenMode; // nested class, not covered by index.*
import org.apache.lucene.search.*;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class InMemoryVectorStore {
    private Directory directory;
    private StandardAnalyzer analyzer;
    private IndexWriter writer;

    public InMemoryVectorStore() throws Exception {
        // Initialize in-memory store
        directory = new ByteBuffersDirectory();
        analyzer = new StandardAnalyzer();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        config.setOpenMode(OpenMode.CREATE_OR_APPEND);
        writer = new IndexWriter(directory, config);
    }

    public void addDocument(String id, String text, float[] vector) throws Exception {
        writer.addDocument(buildDocument(id, text, vector));
        writer.commit();
    }

    public void updateDocument(String id, String text, float[] vector) throws Exception {
        // Atomically deletes any document with this id and adds the new version
        writer.updateDocument(new Term("id", id), buildDocument(id, text, vector));
        writer.commit();
    }

    public List<Document> searchByVector(float[] queryVector, int k) throws Exception {
        List<Document> results = new ArrayList<>();
        // try-with-resources closes the reader even if the search throws
        try (IndexReader reader = DirectoryReader.open(directory)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new KnnFloatVectorQuery("vector", queryVector, k);
            TopDocs topDocs = searcher.search(query, k);
            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                results.add(searcher.doc(scoreDoc.doc));
            }
        }
        return results;
    }

    public void deleteDocument(String id) throws Exception {
        writer.deleteDocuments(new Term("id", id));
        writer.commit();
    }

    public void close() throws Exception {
        writer.close();
        directory.close();
    }

    // Shared by addDocument and updateDocument: id (exact match), text (full-text), vector (kNN)
    private static Document buildDocument(String id, String text, float[] vector) {
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));
        doc.add(new TextField("text", text, Field.Store.YES));
        doc.add(new KnnFloatVectorField("vector", vector, VectorSimilarityFunction.COSINE));
        return doc;
    }
}
// Index a float vector
Document doc = new Document();
float[] vector = new float[] {0.1f, 0.2f, 0.3f, 0.4f};
doc.add(new KnnFloatVectorField("vector_field", vector));
writer.addDocument(doc);
// Search for similar vectors
float[] queryVector = new float[] {0.15f, 0.25f, 0.35f, 0.45f};
Query knnQuery = new KnnFloatVectorQuery("vector_field", queryVector, 10);
TopDocs results = searcher.search(knnQuery, 10);
// Using a filter with KNN search
Query filter = new TermQuery(new Term("category", "electronics"));
Query filteredKnnQuery = new KnnFloatVectorQuery("vector_field", queryVector, 10, filter); // distinct name: knnQuery is already declared above
// Standard analyzer
Analyzer analyzer = new StandardAnalyzer();
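// Custom analyzer: standard tokenization, lowercasing, then English stop-word removal
Analyzer customAnalyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new StandardTokenizer();
        TokenStream filter = new LowerCaseFilter(tokenizer);
        // StandardAnalyzer has no built-in stop set in Lucene 9; the English list
        // lives in EnglishAnalyzer (org.apache.lucene.analysis.en, lucene-analysis-common)
        filter = new StopFilter(filter, EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new TokenStreamComponents(tokenizer, filter);
    }
};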
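The custom analyzer above is a three-stage chain: split text into tokens, lowercase each token, then drop stop words. The plain-Java sketch below imitates that chain so the effect on a piece of text is easy to see without an index. It is a rough stand-in under stated assumptions: AnalysisChainSketch and analyze are hypothetical names, the regex split only approximates StandardTokenizer (which follows Unicode text segmentation rules), and the stop-word set is supplied by the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy imitation of a tokenize, lowercase, stop-filter chain; not Lucene's analysis code. */
final class AnalysisChainSketch {

    /** Returns the tokens that would remain after the three stages. */
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        // Stage 1: split on runs of characters that are neither letters nor digits
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // a leading delimiter produces one empty piece
            }
            // Stage 2: lowercase
            String lower = raw.toLowerCase(Locale.ROOT);
            // Stage 3: drop stop words
            if (!stopWords.contains(lower)) {
                tokens.add(lower);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a", "of"));
        System.out.println(analyze("The Quick, brown fox!", stopWords));
    }
}
```

Order matters in such a chain: because lowercasing runs before the stop filter, "The" is removed even though the stop list only contains lowercase entries. The same ordering applies to the LowerCaseFilter and StopFilter in the Lucene analyzer.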