lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doug Cutting <cutt...@apache.org>
Subject Re: int vs long and document ids on 64bit machines.
Date Thu, 11 Mar 2004 17:36:09 GMT
Kevin A. Burton wrote:
> A discussion I had a while back had someone note (Doug?) that the 
> decision to go with 32bit ints for document IDs was that on 32 bit 
> machines that 64bits weren't threadsafe.

Somone, not me, perhaps provided that rationalization, which isn't a bad 
one.  In fact, the situation was more that, in 1997, when I started 
Lucene, 2 billion documents seemed like a lot for a Java-based search 
engine which was designed to scale to perhaps millions of documents, but 
probably not to the world.  Java was slow then, remember?

> Does anyone know how JDK 1.4.2 works on Itanium, Opteron (AMD64)?
> How hard would it be to build a lucene64 that used 64bit document 
> handles (longs) for 64bit procesors?!  Is it just a recompile?  Will the 
> file format break and need updating?!

I think the file format is 64-bit safe.  But the code changes would be 
quite numerous.  No doubt we should make this change someday.  Do you 
anticipate more than 2 billion documents in your Lucene index sometime 
soon, e.g., this year?

Also, with Java, it's not just a recompile, it's a lot of code changes.

> Also ... what are the symptoms of a Lucene build using 64bit ints on 
> 32bit processors.  Right now we're personally stuck on 32bit machines 
> but I would like to see us migrate to 64 bit boxes over the next 6 
> months...

Java's int datatype is defined as 32 bit.  So there are no 64-bit ints. 
  There are longs.  I doubt longs are much slower than ints to deal with 
on most JVMs today.  However a long[] is twice as big as an int[], and 
an array may only be indexed by an int.  Currently Lucene uses a byte[] 
indexed by document number to store normalization factors.  This would 
not work if document numbers are longs.  Filters index bit vectors with 
document numbers, and that also would not work if document numbers were 
longs.  Working around these will not only take some code, it may also 
impact performance a bit.

I suspect that Java will soon evolve to better embrace 64-bit machines. 
  Someday assignment of longs will be atomic.  (This is hinted at in the 
language spec.)  Someday arrays will probably be indexable by longs. 
I'd prefer to wait until these changes happen before changing Lucene's 
document numbers to longs.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message