lucene-dev mailing list archives

From Qiang Zhou <>
Subject gsoc proposal
Date Fri, 06 Apr 2012 02:37:39 GMT
I'm planning to work on LUCENE-2335 as a GSoC project.
This is my proposal. Some parts reference Toke Eskildsen's blog. Please feel free to comment.

Background knowledge:
Given an ordinal, the term is returned by querying the index. This is just a logical mapping
and requires practically no memory.
The ordinals are sorted, typically with respect to a locale, and the sorted list is called
the indirects list. If one entry's position in the indirects list is lower than another's, its corresponding
term comes before the other entry's term in sort order. We always need
to sort in order to build the indirects, even if the terms in the segments are already in order.
     For each document id, a list of corresponding indirects is kept. By following the
indirects through the ordinals, the corresponding terms can be resolved. Memory-wise, this
requires a list of integers as long as the number of documents plus a list of integers as
long as the total number of indirects for all documents.
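
To make the layout above concrete, here is a minimal sketch in plain Java with no Lucene dependency. Everything in it is an assumption for illustration: the class name, the sample terms, and the in-memory array that stands in for "querying the index". It also stores one extra sentinel entry in the per-document offsets to keep the loop bounds simple.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class IndirectsSketch {
    // Stand-in for the index: ordinal -> term. A real implementation would ask
    // the index for the term instead of holding the strings in memory.
    static final String[] TERMS_BY_ORDINAL = {"pear", "apple", "fig"};

    // One entry per document plus a sentinel: where each document's entries start.
    static final int[] DOC_OFFSETS = {0, 1, 3, 3};
    // Flattened per-document entries; each value is a position in the indirects list.
    static final int[] DOC_INDIRECTS = {1, 0, 2};

    /** Returns indirects where indirects[i] is the ordinal of the i-th term in locale order. */
    static int[] buildIndirects(String[] termsByOrdinal, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        Integer[] ords = new Integer[termsByOrdinal.length];
        for (int i = 0; i < ords.length; i++) {
            ords[i] = i;
        }
        // Sort the ordinals by the collation order of the terms they point to.
        Arrays.sort(ords, (a, b) -> collator.compare(termsByOrdinal[a], termsByOrdinal[b]));
        int[] indirects = new int[ords.length];
        for (int i = 0; i < ords.length; i++) {
            indirects[i] = ords[i];
        }
        return indirects;
    }

    /** Resolves a document's terms: document -> indirect position -> ordinal -> term. */
    static List<String> termsForDoc(int doc, int[] indirects) {
        List<String> terms = new ArrayList<>();
        for (int i = DOC_OFFSETS[doc]; i < DOC_OFFSETS[doc + 1]; i++) {
            int ordinal = indirects[DOC_INDIRECTS[i]];
            terms.add(TERMS_BY_ORDINAL[ordinal]);
        }
        return terms;
    }

    public static void main(String[] args) {
        int[] indirects = buildIndirects(TERMS_BY_ORDINAL, Locale.ENGLISH);
        for (int doc = 0; doc < DOC_OFFSETS.length - 1; doc++) {
            System.out.println("doc " + doc + " -> " + termsForDoc(doc, indirects));
        }
    }
}
```

Comparing two documents only needs the integer positions in DOC_INDIRECTS; the strings are touched only when a term has to be shown.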
     Currently, Lucene loads ordinals and strings together, whether they are in the cache or not. However,
there is one case where the strings are not needed: the index has only one segment and
the search does not require fields to be filled with term strings. It would save some memory
if Lucene only loads strings when necessary without losing the ordinal information already
     Classes involved:
     FieldCacheImpl implements the FieldCache interface. StringIndexCache.createValue(...)
is where StringIndex objects are created.
     StringIndex contains two related fields, String[] lookup and int[] order. The
former contains "All the term values, in natural order". The latter is "For each document,
an index into the lookup array."
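
As an illustration of how the two fields work together, here is a simplified stand-in written in plain Java. It is not the Lucene class: SimpleStringIndex, compareDocs and termFor are hypothetical names, and the sample data is made up. The point is only that comparing documents reads order, while lookup is needed just to produce the term string.

```java
final class StringIndexSketch {
    /** Simplified stand-in for the two StringIndex fields; not the Lucene class. */
    static final class SimpleStringIndex {
        final String[] lookup; // all term values, in natural order
        final int[] order;     // for each document, an index into lookup

        SimpleStringIndex(String[] lookup, int[] order) {
            this.lookup = lookup;
            this.order = order;
        }
    }

    /** Sorting only needs the ordinals; no string is touched here. */
    static int compareDocs(SimpleStringIndex index, int docA, int docB) {
        return Integer.compare(index.order[docA], index.order[docB]);
    }

    /** Filling a field value is the step that needs the term string. */
    static String termFor(SimpleStringIndex index, int doc) {
        return index.lookup[index.order[doc]];
    }

    public static void main(String[] args) {
        SimpleStringIndex index = new SimpleStringIndex(
                new String[] {"apple", "fig", "pear"},
                new int[] {2, 0, 1, 0}); // four documents
        System.out.println(compareDocs(index, 0, 1) > 0); // doc 0 sorts after doc 1
        System.out.println(termFor(index, 2));
    }
}
```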
     StringOrdValComparator.setNextReader(...) calls FieldCache.getStringIndex(...).
     When a StringOrdValComparator is initialized, it allocates space for ords and values.
     FieldValueHitQueue is a priority queue that should hold values.
     So the idea is to add a condition before values are copied into the cache. A StringIndex
could be created in two ways: one resolves term.text and the other does not. That might
work. This is my initial thought. I haven't yet worked out how to handle sharing ords across
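
A rough sketch of the "resolve or don't resolve" idea, in plain Java and independent of Lucene. LazyStringIndex, its resolver callback and stringsLoaded() are all hypothetical names invented for this illustration; how the real term text would be fetched, and how this fits into FieldCacheImpl, is not addressed here.

```java
import java.util.function.IntFunction;

final class LazyLookupSketch {
    /** Hypothetical ord-first index: ordinals are always present, strings are resolved on demand. */
    static final class LazyStringIndex {
        final int[] order;                          // for each document, a term ordinal
        private final int termCount;
        private final IntFunction<String> resolver; // hypothetical: ordinal -> term text
        private String[] lookup;                    // stays null until a string is requested

        LazyStringIndex(int[] order, int termCount, IntFunction<String> resolver) {
            this.order = order;
            this.termCount = termCount;
            this.resolver = resolver;
        }

        boolean stringsLoaded() {
            return lookup != null;
        }

        /** Resolves the term for a document, caching each resolved string by ordinal. */
        String term(int doc) {
            if (lookup == null) {
                lookup = new String[termCount];
            }
            int ord = order[doc];
            if (lookup[ord] == null) {
                lookup[ord] = resolver.apply(ord);
            }
            return lookup[ord];
        }
    }

    public static void main(String[] args) {
        String[] terms = {"apple", "fig", "pear"}; // stand-in for term text stored in the index
        LazyStringIndex index = new LazyStringIndex(new int[] {2, 0, 1, 0}, terms.length, ord -> terms[ord]);
        // Comparing by ordinal does not allocate or resolve any strings.
        System.out.println(Integer.compare(index.order[0], index.order[1]) > 0);
        System.out.println(index.stringsLoaded());
        // Strings are resolved only when a field value is actually requested.
        System.out.println(index.term(1));
        System.out.println(index.stringsLoaded());
    }
}
```

In this sketch a search that sorts purely by ordinal never allocates the lookup array, which is the memory saving the proposal is after.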