lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Flex indexing : Hybrid index maintnenance for faster indexing
Date Tue, 05 Oct 2010 23:46:59 GMT
I'm also very curious how your index will be affected by upgrading to flex.

I know the terms index gets alot smaller :)  But I'm curious about the
query performance...

I haven't seen that paper -- it sounds very interesting!  I'll add to
the list to read...

Mike

On Tue, Oct 5, 2010 at 11:12 AM, Burton-West, Tom <tburtonw@umich.edu> wrote:
> Thanks Mike,
>
> I suspected the approach might require architectural changes beyond flex, but since our
indexes are so huge and disk I/O is our main bottleneck both for searching and indexing, I'm
always looking for ways to deal with very large postings and positions lists that might reduce
I/O.
>
> I haven't looked in detail into PFOR and Simple9 and some of the other new encodings,
but my understanding is that they trade off compression for decompression speed. i.e. they
take up a bit more space, but are more efficient to decompress.   In our case, where we have
underutilized CPU, mostly because the processors are waiting on disk I/O, I'll be curious
to find out whether the slight increase in disk I/O time due to lower compression is still
outweighed by the increase in decompression speed. (Don't know if we'll find the time to try
flex for a while though:)
>
>
> BTW: have you seen this paper looking at 64-bit words?
>  "Index Compression Using 64-Bit Words", Anh, Moffat. Software -- Practice & Experience,
40(2):131-148, February 2010
>
>
> Tom
> -----Original Message-----
> From: Michael McCandless [mailto:lucene@mikemccandless.com]
> Sent: Tuesday, October 05, 2010 6:21 AM
> To: dev@lucene.apache.org
> Subject: Re: Flex indexing : Hybrid index maintnenance for faster indexing
>
> Nice paper!
>
> It's a neat trick to index the large postings as separate files, ie
> let the fileystem handle the growth as new postings are appended
> over time.
>
> But, unfortunately, we can't easily do this in Lucene, since Lucene
> assumes index files are write once, and derives its transactional
> semantics from this approach.  Ie, this would require sizable changes,
> beyond just swapping in a different Codec.
>
> Still, the idea that small/big postings lists should be handled
> differently is something we can take advantage of in a Codec, and I
> think we should.  I think likely we will switch to a default codec
> that uses pulsing (storing term's postiugs directly in terms dict) for
> very low freq terms, maybe vInt for medium freq terms, and FOR/PFOR
> for high freq terms.
>
> Mike
>
> On Mon, Oct 4, 2010 at 6:42 PM, Burton-West, Tom <tburtonw@umich.edu> wrote:
>> Hi all,
>>
>> Would it be possible to implement something like this in Flex?
>>
>>
>> Büttcher, S., & Clarke, C. L. A. (2008). Hybrid index maintenance for contiguous
inverted lists. Information Retrieval, 11(3), 175-207. doi:10.1007/s10791-007-9042-8
>>
>> The approach takes advantage of having a different policy for large postings lists
(ie frequent terms)  versus small postings lists for flushing the buffer and writing to disk.
>>
>>
>> Tom Burton-West
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message