lucene-java-commits mailing list archives

From Apache Wiki <>
Subject [Lucene-java Wiki] Update of "ReleaseNote40" by RobertMuir
Date Wed, 26 Sep 2012 02:20:47 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The "ReleaseNote40" page has been changed by RobertMuir:

first stab (can we make this more concise?, the links will look better on the site of course)

  See the CHANGES.txt file included with the release for a full list of
- Highlights of changes since 4.0-beta:
+ Lucene 4.0 Release Highlights:
-   * TODO
+  * The index formats for terms, postings lists, stored fields, term vectors, etc.
+    are pluggable via the Codec API. You can select from the provided
+    implementations or customize the index format with your own Codec to meet your needs.
+  * Similarity has been decoupled from the vector space model (TF/IDF). Additional models
+    such as BM25, Divergence from Randomness, Language Models, and Information-based models
+    are provided (see
+  * Added support for per-document values (DocValues). DocValues can be used for custom 
+    scoring factors (accessible via Similarity), for pre-sorted Sort values, and more.
+  * When indexing via multiple threads, each IndexWriter thread now flushes its own segment
+    to disk concurrently, resulting in substantial performance improvements
+    (see
+  * Per-document normalization factors ("norms") are no longer limited to a single byte.
+    Similarity implementations can use any DocValues type to store norms.
+  * Added index statistics such as the number of tokens for a term or field, number of postings
+    for a field, and number of documents with a posting for a field: these support additional
+    scoring models (see
+  * Implemented a new default term dictionary/index (BlockTree) that indexes shared prefixes
+    instead of every nth term. This is not only more time- and space-efficient, but can
+    also sometimes avoid going to disk at all for terms that do not exist. Alternative term
+    dictionary implementations are provided and pluggable via the Codec API.
+  * Indexed terms are no longer UTF-16 char sequences; instead, terms can be any binary
+    value encoded as byte arrays. By default, text terms are now encoded as UTF-8
+    bytes. Sort order of terms is now defined by their binary value, which is identical
+    to UTF-8 sort order.
+  * Substantially faster performance when using a Filter during searching.
+  * File-system based directories can rate-limit the IO (MB/sec) of merge
+    threads, to reduce IO contention between merging and searching threads.
+  * Added a number of alternative Codecs and components for different use-cases: "Appending"
+    works with append-only filesystems (such as Hadoop DFS), "Memory" writes the entire
+    terms+postings as an FST read into RAM (see
+    "Pulsing" inlines the postings for low-frequency terms into the term dictionary (see
+    "SimpleText" writes all files in plain-text for easy debugging/transparency (see
+    "Bloom" uses a bloom filter to sometimes avoid disk seeks when looking up terms,
+    "Direct" holds all postings as simple byte[] and int[] for very fast performance at the
+    cost of very high RAM consumption, "Block" uses a new index layout and compression scheme
+    for improved performance, among others.
+  * Term offsets can be optionally encoded into the postings lists and can be retrieved
+    per-position.
+  * A new AutomatonQuery returns all documents containing any term matching a provided
+    finite-state automaton (see
+  * FuzzyQuery is 100-200 times faster than in past releases (see
+  * A new spell checker, DirectSpellChecker, finds possible corrections directly against
+    the main search index without requiring a separate index.
+  * Various in-memory data structures such as the term dictionary and FieldCache are represented
+    more efficiently with less object overhead (see
+  * All search logic is now required to work per segment; IndexReader was therefore refactored
+    to differentiate between atomic and composite readers
+    (see
+  * Lucene 4.0 provides a modular API, consolidating components such as Analyzers and Queries
+    that were previously scattered across Lucene core, contrib, and Solr. These modules also
+    include additional functionality such as UIMA analyzer integration and a completely reworked
+    spatial search implementation.
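
The change in term sort order noted above (binary value, i.e. UTF-8 byte order, rather than UTF-16 char order) only becomes visible for characters outside the Basic Multilingual Plane. The following is a minimal sketch using only the Java standard library, not Lucene itself; `TermOrderDemo`, `compareUnsignedBytes`, `UTF8_ORDER`, and `describe` are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;

// Illustrative only: TermOrderDemo and its members are hypothetical names,
// not part of the Lucene API.
class TermOrderDemo {

    // Lexicographic comparison of two byte arrays, treating each byte as unsigned.
    static int compareUnsignedBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Orders strings by the binary value of their UTF-8 encoding.
    static final Comparator<String> UTF8_ORDER = (x, y) -> compareUnsignedBytes(
            x.getBytes(StandardCharsets.UTF_8),
            y.getBytes(StandardCharsets.UTF_8));

    // Renders the first code point of a term, for example "U+0061".
    static String describe(String term) {
        return String.format("U+%04X", term.codePointAt(0));
    }

    public static void main(String[] args) {
        String clef = new String(Character.toChars(0x1D11E)); // outside the BMP
        String tilde = "\uFF5E";                               // inside the BMP
        String[] terms = {tilde, clef, "a"};

        String[] utf16 = terms.clone();
        Arrays.sort(utf16);             // String.compareTo: UTF-16 code unit order
        String[] utf8 = terms.clone();
        Arrays.sort(utf8, UTF8_ORDER);  // binary (UTF-8 byte) order

        for (String t : utf16) {
            System.out.println("UTF-16 order: " + describe(t));
        }
        for (String t : utf8) {
            System.out.println("UTF-8 order:  " + describe(t));
        }
    }
}
```

In UTF-16 order the supplementary character U+1D11E sorts before U+FF5E, because its leading surrogate (0xD834) is smaller than 0xFF5E; in UTF-8 byte order it sorts after, because its first byte (0xF0) is larger than 0xEF. Code that relied on the old ordering of such terms may need to be reviewed when upgrading.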
+ Noteworthy changes since 4.0-BETA:
+  * A new "Block" PostingsFormat offering improved search performance and index compression.
+    This will likely become the default format in a future release.
+    (see
+  * All non-default codec implementations were moved to a separate codecs module. Just add
+    lucene-codecs-4.0.0.jar to your classpath to test these out.
+  * Payloads can be optionally stored on the term vectors.
+  * Many bugfixes and optimizations.
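
The "Bloom" component listed in the highlights is described as using a bloom filter to sometimes avoid disk seeks during term lookup. The sketch below illustrates that general technique with the Java standard library only; it is not Lucene's implementation. `TermBloomFilter` and `GuardedTermDictionary` are hypothetical classes, the in-memory `HashSet` merely stands in for an expensive on-disk term dictionary, and the seeded FNV-1a variant with double hashing is an arbitrary choice made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Illustrative only: these classes are hypothetical and not part of Lucene.
class TermBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    TermBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // 32-bit FNV-1a style hash, perturbed by a seed so two hashes can be derived.
    private static int fnv1a(byte[] data, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xFF);
            h *= 16777619;
        }
        return h;
    }

    // Derives the i-th bit position by double hashing: h1 + i * h2.
    private int position(byte[] term, int i) {
        int h1 = fnv1a(term, 0);
        int h2 = fnv1a(term, 0x9E3779B9);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(String term) {
        byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(bytes, i));
        }
    }

    // false means "definitely absent"; true means "possibly present".
    boolean mightContain(String term) {
        byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(bytes, i))) {
                return false;
            }
        }
        return true;
    }
}

class GuardedTermDictionary {
    // Stand-in for an on-disk term dictionary whose lookups are expensive.
    private final Set<String> slowStore = new HashSet<>();
    private final TermBloomFilter filter = new TermBloomFilter(1 << 16, 3);
    int slowLookups = 0;

    void add(String term) {
        slowStore.add(term);
        filter.add(term);
    }

    boolean contains(String term) {
        if (!filter.mightContain(term)) {
            return false; // answered from the filter, no slow lookup needed
        }
        slowLookups++;
        return slowStore.contains(term);
    }
}
```

A bloom filter never reports a stored term as absent, so skipping the slow lookup on a negative answer is safe; a positive answer can be a false positive, which is why the dictionary is still consulted in that case.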
  Please read CHANGES.txt and MIGRATE.txt for a full list of new features and notes on upgrading.

  Particularly, the new apis are not compatible with previous versions of Lucene, however,
