lucene-dev mailing list archives

From Antonio Gulli <gu...@di.unipi.it>
Subject Re: mg4j - Managing Gigabyte for Java
Date Thu, 16 Sep 2004 16:15:25 GMT
David Spencer wrote:

> Anson Lau wrote:
>
>> Hi All,
>>
>> Has anyone seen the project MG4J (Managing Gigabytes for Java)
>> http://mg4j.dsi.unimi.it/ ?  Does anybody know enough about both
>> Lucene and MG4J to comment on how the two compare?
>
>
> I've wondered if Lucene does comparable (key/index) compression to 
> what the related book (Managing Gigabytes, excellent BTW) describes...

Just a question: in my personal experience with a commercial engine I
partly developed, the "continuation bit" scheme (aka the AltaVista
solution) is a good and efficient alternative to gamma codes, delta
codes, and the other codes used for variable-length integer
representation (see MG).

Given an integer n, the continuation-bit scheme treats each byte as 7
bits of payload plus 1 bit that says whether the next byte is also used
to represent n.
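
As a rough illustration of the scheme just described, here is a minimal
Java sketch. The class and method names (VByteSketch, encode, decode) are
made up for this example, and the choices of emitting the low-order 7 bits
first and of using the high bit as the continuation flag are assumptions;
other byte orders and flag conventions work equally well.

```java
// Sketch of a continuation-bit (variable-byte) codec. Names are illustrative.
final class VByteSketch {

    // Encodes a non-negative int, low-order 7 bits first (an assumed layout).
    // The high bit of a byte is set when another byte of the same value follows.
    static byte[] encode(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative");
        }
        byte[] buf = new byte[5];              // 31 payload bits need at most 5 bytes
        int len = 0;
        while (n >= 0x80) {
            buf[len++] = (byte) ((n & 0x7F) | 0x80);   // 7 payload bits + continuation bit
            n >>>= 7;
        }
        buf[len++] = (byte) n;                 // last byte: continuation bit clear
        return java.util.Arrays.copyOf(buf, len);
    }

    // Decodes one value starting at the given offset.
    static int decode(byte[] bytes, int offset) {
        int value = 0;
        int shift = 0;
        int b;
        do {
            b = bytes[offset++] & 0xFF;
            value |= (b & 0x7F) << shift;      // append the next 7 payload bits
            shift += 7;
        } while ((b & 0x80) != 0);             // stop when the continuation bit is clear
        return value;
    }

    public static void main(String[] args) {
        int[] samples = {0, 127, 128, 300, 16384, Integer.MAX_VALUE};
        for (int n : samples) {
            byte[] enc = encode(n);
            System.out.println(n + " -> " + enc.length + " byte(s), decoded " + decode(enc, 0));
        }
    }
}
```

Both loops work on whole bytes with a mask and a shift, so no bit-level
cursor into the stream has to be maintained.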

On average you will lose some bits on small gaps between contiguous
integers in the posting list, but not that many, since on large
collections gaps are large. In exchange you can operate on
machine-oriented units (whole bytes) instead of doing bit-level
operations, which are much more expensive.
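
To make the space trade-off concrete, the sketch below compares the
number of bits needed per gap by an Elias gamma code (2*floor(log2 g) + 1
bits for g >= 1) and by the continuation-bit scheme (8 bits per 7 payload
bits). The class and method names are hypothetical, and the comparison
covers gamma codes only; the effect on a real index depends on the actual
gap distribution.

```java
// Sketch: per-gap cost in bits, gamma code vs. continuation-bit bytes.
final class GapCostSketch {

    // Elias gamma code length for g >= 1: 2 * floor(log2 g) + 1 bits.
    static int gammaBits(int g) {
        if (g < 1) {
            throw new IllegalArgumentException("gamma codes need g >= 1");
        }
        int floorLog2 = 31 - Integer.numberOfLeadingZeros(g);
        return 2 * floorLog2 + 1;
    }

    // Continuation-bit length for g >= 1: one byte per 7 significant bits.
    static int vbyteBits(int g) {
        if (g < 1) {
            throw new IllegalArgumentException("this sketch assumes g >= 1");
        }
        int significantBits = 32 - Integer.numberOfLeadingZeros(g);
        return 8 * ((significantBits + 6) / 7);   // ceil(significantBits / 7) bytes
    }

    public static void main(String[] args) {
        int[] gaps = {1, 5, 15, 16, 100, 1000, 100000};
        for (int g : gaps) {
            System.out.println("gap " + g + ": gamma " + gammaBits(g)
                    + " bits, continuation-bit " + vbyteBits(g) + " bits");
        }
    }
}
```

Under these formulas the gamma code is shorter only for gaps below 16
(at most 7 bits versus 8); from a gap of 16 upward the byte-aligned
encoding is no longer than the gamma code.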

I saw a small increase in index size, but a big saving in query time.
Has anyone had a similar or opposite experience?

-- 
"We have no credible evidence that Iraq and Al Qaeda 
cooperated on attacks against the United States."
Staff report of the commission investigating the Sept. 
11 attacks.



