lucene-dev mailing list archives

From "Burton-West, Tom" <tburt...@umich.edu>
Subject RE: Flex indexing : Hybrid index maintenance for faster indexing
Date Tue, 05 Oct 2010 15:12:58 GMT
Thanks Mike,

I suspected the approach might require architectural changes beyond flex, but since our indexes
are so huge and disk I/O is our main bottleneck both for searching and indexing, I'm always
looking for ways to deal with very large postings and positions lists that might reduce I/O.

I haven't looked in detail at PFOR, Simple9, and some of the other new encodings, but
my understanding is that they trade compression ratio for decompression speed, i.e. they take
up a bit more space but are more efficient to decompress. In our case, where we have underutilized
CPU, mostly because the processors are waiting on disk I/O, I'll be curious to find out whether
the slight increase in disk I/O time due to lower compression is still outweighed by the increase
in decompression speed. (Don't know if we'll find the time to try flex for a while, though. :)
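
One rough way to frame that question is a back-of-envelope model in which the time to process a postings list is read time plus decode time. The sketch below is only that: every number in it (disk throughput, decode rate, bits per posting) is a made-up placeholder, not a measurement of PFOR, Simple9, or any other encoding, and the function name is hypothetical.

```python
def read_and_decode_seconds(num_postings, bits_per_posting,
                            disk_mb_per_s, decode_mpostings_per_s):
    """Estimate seconds to read and then decode one postings list.

    Assumes I/O and decoding do not overlap, which is pessimistic.
    """
    size_bytes = num_postings * bits_per_posting / 8
    io_seconds = size_bytes / (disk_mb_per_s * 1_000_000)
    cpu_seconds = num_postings / (decode_mpostings_per_s * 1_000_000)
    return io_seconds + cpu_seconds


# Hypothetical figures, chosen only to illustrate the trade-off.
NUM_POSTINGS = 100_000_000


def compare(disk_mb_per_s):
    """Return (dense_time, fast_time) for one assumed disk throughput."""
    # "dense": fewer bits per posting, slower to decode (placeholder values).
    dense = read_and_decode_seconds(NUM_POSTINGS, 8, disk_mb_per_s, 300)
    # "fast": more bits per posting, quicker to decode (placeholder values).
    fast = read_and_decode_seconds(NUM_POSTINGS, 10, disk_mb_per_s, 1000)
    return dense, fast
```

With these placeholder inputs, `compare(100)` puts the denser encoding slightly ahead and `compare(500)` puts the faster-to-decode encoding ahead, so the answer depends on how I/O-bound the machine really is; only measurements on real indexes would settle it.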


BTW: have you seen this paper looking at 64-bit words?
 "Index Compression Using 64-Bit Words", Anh, Moffat. Software -- Practice & Experience,
40(2):131-148, February 2010


Tom 
-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com] 
Sent: Tuesday, October 05, 2010 6:21 AM
To: dev@lucene.apache.org
Subject: Re: Flex indexing : Hybrid index maintenance for faster indexing

Nice paper!

It's a neat trick to index the large postings as separate files, i.e.
let the filesystem handle the growth as new postings are appended
over time.

But, unfortunately, we can't easily do this in Lucene, since Lucene
assumes index files are write-once, and derives its transactional
semantics from this approach. I.e., this would require sizable changes,
beyond just swapping in a different Codec.

Still, the idea that small/big postings lists should be handled
differently is something we can take advantage of in a Codec, and I
think we should. I think we will likely switch to a default codec
that uses pulsing (storing a term's postings directly in the terms dict) for
very low freq terms, maybe vInt for medium freq terms, and FOR/PFOR
for high freq terms.
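
To make the low/medium/high split concrete, here is a minimal Python sketch. It is not Lucene's codec API: the two thresholds and the function names are hypothetical, and pulsing and FOR/PFOR appear only as labels. The vInt part follows the usual scheme of 7 data bits per byte, low-order bits first, with the high bit set on every byte except the last, applied to the gaps between sorted doc IDs.

```python
# Hypothetical cut-offs; real values would have to come from benchmarking.
PULSING_MAX_DOC_FREQ = 1
BLOCK_MIN_DOC_FREQ = 128


def choose_encoding(doc_freq):
    """Pick an encoding label for a term based on its document frequency."""
    if doc_freq <= PULSING_MAX_DOC_FREQ:
        return "pulsing"   # postings stored inline in the terms dict
    if doc_freq < BLOCK_MIN_DOC_FREQ:
        return "vint"      # byte-aligned variable-length ints
    return "for"           # block-based FOR/PFOR


def encode_vint(value):
    """Encode a non-negative int, 7 bits per byte, low-order bits first."""
    if value < 0:
        raise ValueError("vInt cannot encode negative values")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # high bit: more bytes follow
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_postings_vint(doc_ids):
    """Delta-encode sorted doc IDs and vInt-encode each gap."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        out += encode_vint(doc_id - previous)
        previous = doc_id
    return bytes(out)


def decode_postings_vint(data):
    """Inverse of encode_postings_vint: rebuild doc IDs from vInt gaps."""
    doc_ids = []
    previous = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            previous += gap
            doc_ids.append(previous)
            gap = 0
            shift = 0
    return doc_ids
```

Small gaps cost one byte each under this scheme, which is why it suits medium-frequency terms; a block-based encoding for the high-frequency case would replace the per-value loop with fixed-width packing of whole blocks.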

Mike

On Mon, Oct 4, 2010 at 6:42 PM, Burton-West, Tom <tburtonw@umich.edu> wrote:
> Hi all,
>
> Would it be possible to implement something like this in Flex?
>
>
> Büttcher, S., & Clarke, C. L. A. (2008). Hybrid index maintenance for contiguous
> inverted lists. Information Retrieval, 11(3), 175-207. doi:10.1007/s10791-007-9042-8
>
> The approach takes advantage of having a different policy for large postings lists (i.e.
> frequent terms) versus small postings lists for flushing the buffer and writing to disk.
>
>
> Tom Burton-West
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
