From: "hui"
To: "'Lucene Users List'"
Subject: RE: int vs long and document ids on 64bit machines.
Date: Thu, 11 Mar 2004 14:32:17 -0500

If the document id is going to be changed, is it possible to define an interface so that users could provide their own implementation to replace the default one?
For example, the document's unique timestamp or other fields could be used, as long as they are longs. That way, we could easily get another sorting option besides the default score sorting. I am still not sure whether it is a good idea, though.

Regards,
hui

-----Original Message-----
From: Kevin A. Burton [mailto:burton@newsmonster.org]
Sent: Thursday, March 11, 2004 2:14 PM
To: Lucene Users List
Subject: Re: int vs long and document ids on 64bit machines.

Doug Cutting wrote:

> Someone, not me, perhaps provided that rationalization, which isn't a
> bad one. In fact, the situation was more that, in 1997, when I
> started Lucene, 2 billion documents seemed like a lot for a Java-based
> search engine which was designed to scale to perhaps millions of
> documents, but probably not to the world. Java was slow then, remember?

Yes... agreed.

>> Does anyone know how JDK 1.4.2 works on Itanium, Opteron (AMD64)?
>> How hard would it be to build a lucene64 that used 64bit document
>> handles (longs) for 64bit processors?! Is it just a recompile? Will
>> the file format break and need updating?!
>
> I think the file format is 64-bit safe. But the code changes would be
> quite numerous. No doubt we should make this change someday. Do you
> anticipate more than 2 billion documents in your Lucene index sometime
> soon, e.g., this year?
>
> Also, with Java, it's not just a recompile, it's a lot of code changes.

Well ... the refactor should at LEAST be pretty easy... just start changing int->long and follow up until the code compiles. Not sure if it's that easy.

>> Also ... what are the symptoms of a Lucene build using 64bit ints on
>> 32bit processors? Right now we're personally stuck on 32bit machines
>> but I would like to see us migrate to 64 bit boxes over the next 6
>> months...
>
> Java's int datatype is defined as 32 bit. So there are no 64-bit
> ints. There are longs. I doubt longs are much slower than ints to
> deal with on most JVMs today.
> However, a long[] is twice as big as an
> int[], and an array may only be indexed by an int. Currently Lucene
> uses a byte[] indexed by document number to store normalization
> factors. This would not work if document numbers are longs. Filters
> index bit vectors with document numbers, and that also would not work
> if document numbers were longs. Working around these will not only
> take some code, it may also impact performance a bit.
>
> I suspect that Java will soon evolve to better embrace 64-bit
> machines. Someday assignment of longs will be atomic. (This is
> hinted at in the language spec.) Someday arrays will probably be
> indexable by longs. I'd prefer to wait until these changes happen
> before changing Lucene's document numbers to longs.

At some point I might take a look at the code and see how hard it would be...

Thanks for your notes... I'll probably use these in the future.

The main problem is that with indexes that have lots of SMALL documents you could see yourself running out of ints.

Kevin
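
To make the point about arrays concrete: a Java array can only be indexed by an int, so a per-document byte[] (such as the one used for normalization factors) cannot be addressed directly by a long document number. One possible workaround, sketched here under assumptions, is to split the storage into fixed-size chunks and compute the chunk and offset from the long. The class name LongIndexedBytes and the chunk size are hypothetical and chosen only for illustration; this is not Lucene code, and the extra shift, mask and double array lookup is the kind of overhead that "may also impact performance a bit".

```java
// A minimal sketch of a byte store addressed by long indexes.
// Hypothetical class, not part of Lucene.
final class LongIndexedBytes {
    // Assumed chunk size: 2^20 bytes. A power of two keeps the
    // index math to a shift and a mask.
    private static final int CHUNK_BITS = 20;
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private static final long CHUNK_MASK = CHUNK_SIZE - 1L;

    private final byte[][] chunks;
    private final long length;

    LongIndexedBytes(long length) {
        if (length < 0) {
            throw new IllegalArgumentException("negative length: " + length);
        }
        this.length = length;
        // Round up so a partial final chunk is still allocated.
        int chunkCount = (int) ((length + CHUNK_SIZE - 1) >>> CHUNK_BITS);
        chunks = new byte[chunkCount][];
        long remaining = length;
        for (int i = 0; i < chunkCount; i++) {
            chunks[i] = new byte[(int) Math.min(CHUNK_SIZE, remaining)];
            remaining -= chunks[i].length;
        }
    }

    long length() {
        return length;
    }

    byte get(long index) {
        checkIndex(index);
        // High bits select the chunk, low bits select the offset in it.
        return chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)];
    }

    void set(long index, byte value) {
        checkIndex(index);
        chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)] = value;
    }

    private void checkIndex(long index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index + ", length " + length);
        }
    }
}
```

A caller would write something like `norms.set(docId, value)` and `norms.get(docId)` with a long docId, in place of `norms[docId]` with an int.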