lucene-dev mailing list archives

From robert engels <>
Subject Re: Realtime Search
Date Wed, 24 Dec 2008 18:02:24 GMT
As I pointed out in another email, I understand the benefits of  
compression (compressed disks vs. uncompressed, etc.). PFOR is  
definitely a winner !
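
For reference, the core idea behind that family of codecs - frame-of-reference packing of small integers into a fixed bit width per block - can be sketched in a few lines. This is a simplified illustration that omits the exception "patching" the P in PFOR adds, and it is not Lucene's actual postings code:

```java
// Simplified frame-of-reference encoding of a block of docid deltas:
// store the block minimum, then each value's offset from it in the
// fewest bits that fit the largest offset. Illustration only.
public class ForBlock {

    /** Bits needed per value once the block minimum is subtracted out. */
    public static int bitsNeeded(int[] deltas, int min) {
        int max = 0;
        for (int v : deltas) max = Math.max(max, v - min);
        return 32 - Integer.numberOfLeadingZeros(max | 1);
    }

    /** Packs (delta - min) values into consecutive b-bit slots. */
    public static long[] pack(int[] deltas, int min, int b) {
        long[] out = new long[(deltas.length * b + 63) / 64];
        for (int i = 0; i < deltas.length; i++) {
            long v = deltas[i] - min;
            int bit = i * b;
            out[bit / 64] |= v << (bit % 64);
            if (bit % 64 + b > 64)                     // value straddles two words
                out[bit / 64 + 1] |= v >>> (64 - bit % 64);
        }
        return out;
    }

    /** Reads back the i-th value; random access needs no prior decoding. */
    public static int unpack(long[] packed, int min, int b, int i) {
        int bit = i * b;
        long v = packed[bit / 64] >>> (bit % 64);
        if (bit % 64 + b > 64)
            v |= packed[bit / 64 + 1] << (64 - bit % 64);
        return (int) (v & ((1L << b) - 1)) + min;
    }
}
```

The decode loop is branch-free per value, which is exactly what makes this family of codecs CPU-cache friendly.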

As I understood this discussion though, it was an attempt to remove  
the in memory 'skip to' index, to avoid the reading of this during  
index open/reopen.

I was attempting to point out that this in-memory index is still  
needed, but there are ways to improve the current process.
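
For concreteness, the kind of structure I mean is roughly the following - every Nth term kept in memory with its offset into the on-disk term dictionary, so a lookup binary-searches in memory and then scans at most one interval of the file. Names here are illustrative, not the actual Lucene classes:

```java
import java.util.Arrays;

// Hypothetical sketch of the in-memory "skip to" term index.
public class TermIndexSketch {
    private final String[] indexedTerms; // every Nth term, sorted
    private final long[] fileOffsets;    // file offset of each indexed term's block

    public TermIndexSketch(String[] indexedTerms, long[] fileOffsets) {
        this.indexedTerms = indexedTerms;
        this.fileOffsets = fileOffsets;
    }

    /** Returns the file offset of the block that may contain {@code term}. */
    public long seekOffset(String term) {
        int pos = Arrays.binarySearch(indexedTerms, term);
        if (pos < 0) pos = -pos - 2;   // block whose first term precedes it
        if (pos < 0) pos = 0;          // before the first indexed term
        return fileOffsets[pos];
    }
}
```

The whole array pair has to be materialized before the first lookup, which is the reopen cost under discussion.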

I don't think a mapped file for the term index is going to work for a  
variety of reasons. Mapped files are designed as a programming  
simplification - mainly for older systems that use line-delimited  
files - rather than having to create "page/section" caches when  
processing very large files (when only a small portion is used at any  
given time - i.e. the data visible on the screen). When you end up  
visiting a large portion of the file anyway (to do a full  
repagination), an in-process intelligent cache is going to be far  
more efficient.

My review of the Java Buffer-related classes does not give me the  
impression it is going to be faster - in fact it will be slower - than  
a single copy into user space, with processing/decompression done there.  
The Buffer system is suitable when you perform little inspection of the  
data and then copy it directly to another buffer (think reading from a  
file and sending out on a socket). If you end up inspecting the buffer  
byte by byte, it is going to be very slow.
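
The contrast I have in mind looks like this - per-element access through the Buffer API versus one bulk transfer into a byte[] followed by plain array arithmetic. A minimal sketch (the relative cost will vary with JIT and heap vs. direct buffers, so treat this as an illustration of the access patterns, not a benchmark):

```java
import java.nio.ByteBuffer;

public class BufferRead {

    /** Per-byte inspection through the Buffer API: position bookkeeping
     *  and a bounds check on every get() call. */
    public static int sumViaBuffer(ByteBuffer buf) {
        buf.rewind();
        int sum = 0;
        while (buf.hasRemaining()) sum += buf.get() & 0xFF;
        return sum;
    }

    /** One bulk copy into user space, then plain array iteration. */
    public static int sumViaArray(ByteBuffer buf) {
        buf.rewind();
        byte[] bytes = new byte[buf.remaining()];
        buf.get(bytes);                 // single bulk transfer
        int sum = 0;
        for (byte b : bytes) sum += b & 0xFF;
        return sum;
    }
}
```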

On Dec 24, 2008, at 11:33 AM, Paul Elschot wrote:

> On Wednesday 24 December 2008 17:51:04, robert engels wrote:
>> Thinking about this some more, you could use fixed length pages for
>> the term index, with a page header containing a count of entries, and
>> use key compression (to avoid the constant entry size).
>> The problem with this is that you still have to decode the entries
>> (slowing the processing - since a simple binary search on the page is
>> not possible).
> The cache between the pages and the CPU is also a bottleneck nowadays.
> See here:
> Super-Scalar RAM-CPU Cache Compression
> M. Zukowski, S. Heman, N. Nes, P. Boncz (ICDE 2006)
> currently available from this link:
> request=pdfgz&key=ZuHeNeBo:ICDE:06
> Also, some preliminary results on lucene indexes
> are available at LUCENE-1410.
> Regards,
> Paul Elschot
>> But, if you also add a 'least term and greatest term' to the page
>> header (you can avoid the duplicate storage of these entries as
>> well), you can perform a binary search of the term index much faster.
>> You only need to decode the index page containing (maybe) the desired
>> entry.
>> If you are doing a prefix/range search, you will still end up
>> decoding lots of pages...
>> This is why a database has their own page cache, and usually caches
>> the decoded form (for index pages) for faster processing - at the
>> expense of higher memory usage. Usually data pages are not cached in
>> the decoded/uncompressed form. In most cases the database vendor will
>> recommend removing the OS page cache on the database server, and
>> allocating all of the memory to the database process.
>> You may be able to avoid some of the warm-up of an index using memory
>> mapped files, but with proper ordering of the writing of the index,
>> it probably isn't necessary. Beyond that, processing the term index
>> directly using NIO does not appear that it will be faster than using
>> an in-process cache of the term index (similar to the skip-to memory
>> index now).
>> The BEST approach is probably to have the index writer build the
>> memory "skip to" structure as it writes the segment, and then include
>> this in the segment during the reopen - no warming required! As
>> long as the reader and writer are in the same process, it will be a
>> winner!
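
The paged layout quoted above - fixed-size pages whose headers carry a least and greatest term - could be searched roughly as follows, so that only the one page whose range covers the sought term ever needs decoding. Names are illustrative, not an actual Lucene API:

```java
public class PagedTermIndex {

    /** Header data for one fixed-size page of the term index. */
    public static class PageHeader {
        public final String leastTerm, greatestTerm;
        public PageHeader(String least, String greatest) {
            leastTerm = least;
            greatestTerm = greatest;
        }
    }

    /** Binary search over page headers only; returns the index of the
     *  single page that may contain {@code term}, or -1 if the term falls
     *  outside every page's [least, greatest] range (definitely absent). */
    public static int findPage(PageHeader[] pages, String term) {
        int lo = 0, hi = pages.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (term.compareTo(pages[mid].leastTerm) < 0) hi = mid - 1;
            else if (term.compareTo(pages[mid].greatestTerm) > 0) lo = mid + 1;
            else return mid;   // only this page needs to be decoded
        }
        return -1;
    }
}
```

Note that the search touches only the headers; the compressed page bodies stay untouched until a range match is found.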
>> On Dec 23, 2008, at 11:02 PM, robert engels wrote:
>>> Seems doubtful you will be able to do this without increasing the
>>> index size dramatically, since it will need to be stored
>>> "unpacked" (in order to have random access), yet the terms are
>>> variable length - leading to padding every term to a fixed
>>> (maximum = minimum) size.
>>> In the end I highly doubt it will make much difference in speed -
>>> here's several reasons why...
>>> 1. with "fixed" size terms, the additional IO (larger pages)
>>> probably offsets a lot of the random access benefit. This is why
>>> "compressed" disks on a fast machine (CPU) are often faster than
>>> "uncompressed" - more data is read during every IO access.
>>> 2. with a reopen, only new segments are "read", and since it is a
>>> new segment, it is most likely already in the disk cache (from the
>>> write), so the reopen penalty is negligible (especially if the term
>>> index "skip to" is written last).
>>> 3. If the reopen is after an optimize - when the OS cache has
>>> probably been obliterated, then the warm up time is going to be
>>> similar in most cases anyway, since the "index" pages will also not
>>> be in core (in the case of memory mapped files). Again, writing the
>>> "skip to" last can help with this.
>>> Just because a file is memory mapped does not mean its pages will
>>> have a greater likelihood of being in the cache. Locality of
>>> reference is going to control this, just as the most often accessed
>>> pages control it in the OS disk cache.  Also, most OSs will take real
>>> memory from the virtual address space and add it to the disk cache
>>> if the process is doing lots of IO.
>>> If you have a memory mapped "term index", you are still going to
>>> need to perform a binary search to find the correct term "page",
>>> and after an optimize the visited pages will not be in the cache
>>> (or in core).
>>> On Dec 23, 2008, at 9:20 PM, Marvin Humphrey wrote:
>>>> On Tue, Dec 23, 2008 at 08:36:24PM -0600, robert engels wrote:
>>>>> Is there something that I am missing?
>>>> Yes.
>>>>> I see lots of references to  using "memory mapped" files to
>>>>> "dramatically"
>>>>> improve performance.
>>>> There have been substantial discussions about this design in JIRA,
>>>> notably LUCENE-1458.
>>>> The "dramatic" improvement is WRT opening/reopening an
>>>> IndexReader.
>>>> Presently in both KS and Lucene, certain data structures have to
>>>> be read at IndexReader startup and unpacked into process memory -
>>>> in particular, the term dictionary index and sort caches.  If
>>>> those data structures can be represented by a memory mapped file
>>>> rather than built up from scratch, we save big.
>>>> Marvin Humphrey
>>>> ------------------------------------------------------------------
>>>> --- To unsubscribe, e-mail:
>>>> For additional commands, e-mail:
>>> -------------------------------------------------------------------
>>> -- To unsubscribe, e-mail:
>>> For additional commands, e-mail:
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
