lucene-dev mailing list archives

From "John Wang (JIRA)" <>
Subject [jira] Commented: (LUCENE-2252) stored field retrieve slow
Date Sun, 07 Feb 2010 03:18:30 GMT


John Wang commented on LUCENE-2252:

bq. I do not understand, I think the fdx index is the raw offset into fdt for some doc, and
must remain a long if you have more than 2GB total across all docs.

As stated earlier, assuming we are not storing 2GB of data per doc, you don't need to keep
a long per doc. There are many ways of representing this without paying much of a performance
penalty. Off the top of my head, this would work:

Since the offsets are always positive, you can use the sign bit to indicate whether MAX_INT
has been exceeded; if it is set, add MAX_INT back to the masked bits when decoding. You get
away with an int per doc.
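
A minimal sketch of that encoding (illustrative only, not wired into the actual fdx code):

{code:java}
// Sketch: pack a stored-field offset (< 2 * Integer.MAX_VALUE) into a single int
// by using the sign bit as an "add MAX_INT" flag.
public final class PackedOffset {

  static int encode(long offset) {
    if (offset < 0 || offset >= 2L * Integer.MAX_VALUE) {
      throw new IllegalArgumentException("offset out of range: " + offset);
    }
    if (offset < Integer.MAX_VALUE) {
      return (int) offset;                                           // sign bit clear: value as-is
    }
    return (int) (offset - Integer.MAX_VALUE) | Integer.MIN_VALUE;   // sign bit set
  }

  static long decode(int packed) {
    if (packed >= 0) {
      return packed;                                                 // sign bit clear
    }
    return (long) (packed & Integer.MAX_VALUE) + Integer.MAX_VALUE;  // add MAX_INT back
  }
}
{code}

For the 5M doc index mentioned in the issue, that is 20MB instead of 40MB if the whole field
index is held in memory.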

I am sure there are tons of other neat tricks for this that the Mikes or Yonik can come up
with :)

bq. John, do you have a specific use case where this is the bottleneck, or are you just looking
for places to optimize in general?

Hi Yonik, I understand this may not be a common use case. I am trying to use Lucene as a storage
solution, e.g. supporting just get()/put() operations as a content store (see the sketch after
the two points below). We wrote something simple in-house and compared it against Lucene, and
the difference was dramatic. After profiling (posted earlier), it seems this is an area with a
lot of room for improvement.

1) In our current setup, the content is stored outside of the search cluster. Being able to
fetch the data for rendering/highlighting within our search cluster would be good.
2) If the index contains the original data, changing the indexing schema, e.g. reindexing,
can be done within each partition/node. Getting the data from our authoritative datastore is
expensive.
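
Here is the kind of get()/put() usage I have in mind, against the 3.x API (field names, analyzer,
and class name are placeholders, not our in-house code):

{code:java}
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.store.Directory;

public class LuceneContentStore {

  private final IndexWriter writer;

  public LuceneContentStore(Directory dir) throws Exception {
    writer = new IndexWriter(dir, new KeywordAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
  }

  // put(): one stored document per key, replaced in place via updateDocument()
  public void put(String key, String content) throws Exception {
    Document doc = new Document();
    doc.add(new Field("id", key, Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("content", content, Field.Store.YES, Field.Index.NO));
    writer.updateDocument(new Term("id", key), doc);
  }

  // get(): resolve the key to a doc id, then load the stored field --
  // IndexReader.document() is where the profile shows the time going
  public String get(IndexReader reader, String key) throws Exception {
    TermDocs td = reader.termDocs(new Term("id", key));
    try {
      return td.next() ? reader.document(td.doc()).get("content") : null;
    } finally {
      td.close();
    }
  }
}
{code}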

Perhaps LUCENE-1912 is the right way to go rather than "fixing" stored fields. If you agree,
I can just dup this issue over.



> stored field retrieve slow
> --------------------------
>                 Key: LUCENE-2252
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 3.0
>            Reporter: John Wang
> IndexReader.document() on a stored field is rather slow. I did a simple multi-threaded
> test and profiled it:
> 40+% of the time is spent getting the offset from the index file
> 30+% of the time is spent reading the count (e.g. the number of fields to load)
> I ran it on my laptop where the disk isn't that great, but there still seems to be much
> room for improvement, e.g. loading the field index file into memory (for a 5M doc index,
> the extra memory footprint is 20MB, peanuts compared to the other stuff being loaded)
> On a related note, are there plans to have custom segments as part of flexible indexing?

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
