From: Doug Cutting
Date: Thu, 11 Mar 2004 09:36:09 -0800
To: Lucene Users List
Subject: Re: int vs long and document ids on 64bit machines.
Message-ID: <4050A389.4080204@apache.org>
In-Reply-To: <4050450A.1080400@newsmonster.org>

Kevin A. Burton wrote:

> A discussion I had a while back had someone note (Doug?) that the
> decision to go with 32bit ints for document IDs was that on 32 bit
> machines that 64bits weren't threadsafe.

Someone, not me, perhaps provided that rationalization, which isn't a bad one. In fact, the situation was more that, in 1997, when I started Lucene, 2 billion documents seemed like a lot for a Java-based search engine which was designed to scale to perhaps millions of documents, but probably not to the world. Java was slow then, remember?

> Does anyone know how JDK 1.4.2 works on Itanium, Opteron (AMD64)?
> How hard would it be to build a lucene64 that used 64bit document
> handles (longs) for 64bit processors?! Is it just a recompile? Will the
> file format break and need updating?!

I think the file format is 64-bit safe. But the code changes would be quite numerous. No doubt we should make this change someday. Do you anticipate more than 2 billion documents in your Lucene index sometime soon, e.g., this year? Also, with Java, it's not just a recompile, it's a lot of code changes.

> Also ... what are the symptoms of a Lucene build using 64bit ints on
> 32bit processors. Right now we're personally stuck on 32bit machines
> but I would like to see us migrate to 64 bit boxes over the next 6
> months...

Java's int datatype is defined as 32 bit. So there are no 64-bit ints. There are longs. I doubt longs are much slower than ints to deal with on most JVMs today. However, a long[] is twice as big as an int[], and an array may only be indexed by an int.

Currently Lucene uses a byte[] indexed by document number to store normalization factors. This would not work if document numbers were longs. Filters index bit vectors with document numbers, and that also would not work if document numbers were longs. Working around these will not only take some code, it may also impact performance a bit.

I suspect that Java will soon evolve to better embrace 64-bit machines. Someday assignment of longs will be atomic. (This is hinted at in the language spec.) Someday arrays will probably be indexable by longs.
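
To make the array-indexing constraint concrete, here is a minimal sketch. It is not Lucene code: the class names DocIdSketch and PagedBytes, the page size, and the sample values are all assumptions made up for illustration. The first part shows per-document data addressed by an int, using a plain byte[] and java.util.BitSet. The second part shows one possible workaround for long document numbers, splitting the data across several int-indexed pages, which costs an extra shift, mask, and array lookup on every access.

```java
import java.util.BitSet;

// Illustrative sketch only; none of these names come from Lucene.
class DocIdSketch {

    // Hypothetical byte store addressed by a long, built from int-indexed pages.
    static final class PagedBytes {
        private static final int PAGE_BITS = 16;            // 64 KiB pages (arbitrary choice)
        private static final int PAGE_SIZE = 1 << PAGE_BITS;
        private static final int PAGE_MASK = PAGE_SIZE - 1;

        private final byte[][] pages;
        private final long length;

        PagedBytes(long length) {
            if (length < 0) {
                throw new IllegalArgumentException("negative length: " + length);
            }
            this.length = length;
            // Round up so the last, partially used page is still allocated.
            int pageCount = (int) ((length + PAGE_SIZE - 1) >>> PAGE_BITS);
            pages = new byte[pageCount][PAGE_SIZE];
        }

        byte get(long index) {
            check(index);
            return pages[(int) (index >>> PAGE_BITS)][(int) (index & PAGE_MASK)];
        }

        void set(long index, byte value) {
            check(index);
            pages[(int) (index >>> PAGE_BITS)][(int) (index & PAGE_MASK)] = value;
        }

        long length() {
            return length;
        }

        private void check(long index) {
            if (index < 0 || index >= length) {
                throw new IndexOutOfBoundsException("index " + index + ", length " + length);
            }
        }
    }

    public static void main(String[] args) {
        int maxDoc = 1000;

        // One byte per document, indexed directly by an int document number.
        byte[] norms = new byte[maxDoc];
        norms[42] = 7;

        // A bit per document; BitSet.set and BitSet.get take an int index.
        BitSet filter = new BitSet(maxDoc);
        filter.set(42);

        // With a long document number, norms[docId] would not compile,
        // because a Java array index must be an int. The paged store
        // accepts a long at the cost of extra work per access.
        long docId = 150_000L;
        PagedBytes bigNorms = new PagedBytes(200_000L);
        bigNorms.set(docId, (byte) 7);

        System.out.println(norms[42] + " " + filter.get(42) + " " + bigNorms.get(docId));
    }
}
```

The extra indirection in get and set is the kind of per-access overhead that such a workaround would add.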

I'd prefer to wait until these changes happen before changing Lucene's document numbers to longs.

Doug