Subject: RE: kamikaze
Date: Wed, 29 Apr 2009 17:22:53 -0400
From: Michael Mastroianni
Reply-To: java-user@lucene.apache.org
Message-ID: <9008E1E80EE8A340BACCA3DFF41B21562A5017@GLGNYEXBE02.glgroup.com>
In-Reply-To: <9008E1E80EE8A340BACCA3DFF41B21562A5007@GLGNYEXBE02.glgroup.com>

Hi Anmol--

Sorry, there was a typo in the main function of my unit test. Here is a
corrected version (the utility functions remain the same).

    public void testMultipleIntersections() {
        // Assumes AndDocIdSet takes a List<DocIdSet>.
        ArrayList<OpenBitSet> obs = new ArrayList<OpenBitSet>();
        ArrayList<DocIdSet> docs = new ArrayList<DocIdSet>();
        Random rand = new Random(System.currentTimeMillis());
        int maxDoc = 350000;
        // Build three random doc id lists, each stored both ways.
        for (int i = 0; i < 3; ++i) {
            int numdocs = rand.nextInt(maxDoc);
            ArrayList<Integer> nums = new ArrayList<Integer>();
            HashSet<Integer> seen = new HashSet<Integer>();
            for (int j = 0; j < numdocs; j++) {
                int nextDoc = rand.nextInt(maxDoc);
                // Redraw until the id is unique.
                while (seen.contains(nextDoc)) {
                    nextDoc = rand.nextInt(maxDoc);
                }
                nums.add(nextDoc);
                seen.add(nextDoc);
            }
            // addDoc expects ids in sorted order.
            Collections.sort(nums);
            obs.add(createObs(nums, maxDoc));
            docs.add(createDocSet(nums));
        }
        // Reference result: in-place AND of the bit sets.
        OpenBitSet base = obs.get(0);
        for (int i = 1; i < obs.size(); ++i) {
            base.intersect(obs.get(i));
        }
        AndDocIdSet ands = new AndDocIdSet(docs);
        long card1 = base.cardinality();
        long card2 = ands.size();
        assertEquals(card1, card2);
    }

-----Original Message-----
From: Michael Mastroianni [mailto:MMastroianni@glgroup.com]
Sent: Wednesday, April 29, 2009 4:28 PM
To: java-user@lucene.apache.org
Subject: RE: kamikaze

Hi Anmol--

I think I may have found a problem in AndDocIdSet. I got it to pass some
simple tests and was in the process of integrating it when some of my
tests started to fail. The failures began right after I replaced a bunch
of OpenBitSet intersections with a list of P4DDocIdSets wrapped in an
AndDocIdSet.

I've created a unit test which I think illustrates the problem:

regards,
Michael

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.Random;

import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.OpenBitSet;

import com.kamikaze.docidset.api.DocSet;
import com.kamikaze.docidset.impl.AndDocIdSet;
import com.kamikaze.docidset.impl.OrDocIdSet;
import com.kamikaze.docidset.impl.P4DDocIdSet;
import com.kamikaze.docidset.utils.DocSetFactory;

import junit.framework.TestCase;

public class KamikazeTest extends TestCase {

    public static OpenBitSet createObs(ArrayList<Integer> ints, int maxDoc) {
        OpenBitSet obs = new OpenBitSet(maxDoc);
        for (Integer integer : ints) {
            obs.set(integer);
        }
        return obs;
    }

    public static DocSet createDocSet(ArrayList<Integer> ints) {
        DocSet ret = new P4DDocIdSet();
        for (Integer integer : ints) {
            ret.addDoc(integer);
        }
        return ret;
    }

    public void testMultipleIntersections() {
        // Assumes AndDocIdSet takes a List<DocIdSet>.
        ArrayList<OpenBitSet> obs = new ArrayList<OpenBitSet>();
        ArrayList<DocIdSet> docs = new ArrayList<DocIdSet>();
        Random rand = new Random(System.currentTimeMillis());
        int maxDoc = 350000;
        {
            int numdocs = rand.nextInt(maxDoc);
            ArrayList<Integer> nums = new ArrayList<Integer>();
            HashSet<Integer> seen = new HashSet<Integer>();
            for (int j = 0; j < numdocs; j++) {
                int nextDoc = rand.nextInt(maxDoc);
                if (seen.contains(nextDoc)) {
                    while (seen.contains(nextDoc)) {
                        nextDoc = rand.nextInt(maxDoc);
                    }
                }
                nums.add(nextDoc);
            }
            Collections.sort(nums);
            obs.add(createObs(nums, maxDoc));
            docs.add(createDocSet(nums));
        }
        OpenBitSet base = obs.get(0);
        for (int i = 1; i < obs.size(); ++i) {
            base.intersect(obs.get(i));
        }
        AndDocIdSet ands = new AndDocIdSet(docs);
        long card1 = base.cardinality();
        long card2 = ands.size();
        assertTrue(card1 == card2);
        // When I run this, it fails every time, where I would expect
        // OpenBitSet and AndDocIdSet to produce the same cardinalities
        // from this run of intersections.
        // One example run got card1=101946 and card2=120384.
        // card2 was larger on all runs I did.
    }
}

-----Original Message-----
From: molz [mailto:anmol.bhasin@gmail.com]
Sent: Tuesday, April 28, 2009 9:00 PM
To: java-user@lucene.apache.org
Subject: RE: kamikaze

Hi Michael,

Thanks for trying out Kamikaze, for starters. I think there are a few
issues here:

1. getDocSetInstance(int min, max, count, DocSetFactory.FOCUS) assumes
   that count < max. That is an API check we should add anyway to improve
   usability. That is not to say it will not work if count > max, but we
   have not done the due diligence on that case.

2. The way you are inserting the elements is not quite right. The addDoc
   method assumes you insert the elements in sorted order. Calling
   doc.addDoc(rand.nextInt(maxDoc)) does not ensure you are loading the
   docSet in sorted order. This matters especially in the BitSet and P4D
   set cases, as P4D encodes only the delta values between consecutive
   integers.

3. I would recommend using FOCUS.OPTIMAL for the best performance/space
   tradeoff, although SPACE should work too. If you find any issues with
   that, let us know and we will be glad to fix them.

4. Finally, I believe you want to just get a plain vanilla docSet from
   one of the OR/AND sets. This would be cool to do, but the idea with
   Boolean sets is that they are never really materialized; they are
   iterated over on the fly. I believe we could add an enhancement to
   construct the docSet on the fly while iterating the Boolean DocSet,
   but as of now there is no established way of doing that.

Hope I covered all your concerns.

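To make point 2 concrete, here is a minimal sketch of delta encoding in plain Java. It is not Kamikaze's P4D implementation: the class name DeltaSketch and its encode/decode methods are made up for illustration, and throwing on out-of-order input is an assumption of this sketch (the real addDoc may behave differently). It only shows why ids must arrive in sorted order when the stored values are gaps between consecutive ids.

```java
import java.util.Arrays;

/** Illustrative delta (gap) encoding; hypothetical, not Kamikaze's P4D codec. */
final class DeltaSketch {

    /** Stores the first id as-is, then the gap between each id and the one before it. */
    static int[] encode(int[] sortedDocIds) {
        int[] deltas = new int[sortedDocIds.length];
        int previous = 0;
        for (int i = 0; i < sortedDocIds.length; i++) {
            int delta = sortedDocIds[i] - previous;
            // A negative gap means the ids were not supplied in ascending order.
            if (delta < 0) {
                throw new IllegalArgumentException(
                        "doc ids must be ascending: " + sortedDocIds[i] + " after " + previous);
            }
            deltas[i] = delta;
            previous = sortedDocIds[i];
        }
        return deltas;
    }

    /** Rebuilds the original ids with a running sum over the gaps. */
    static int[] decode(int[] deltas) {
        int[] docIds = new int[deltas.length];
        int current = 0;
        for (int i = 0; i < deltas.length; i++) {
            current += deltas[i];
            docIds[i] = current;
        }
        return docIds;
    }

    public static void main(String[] args) {
        int[] deltas = encode(new int[] {3, 7, 8, 20});
        System.out.println(Arrays.toString(deltas));         // [3, 4, 1, 12]
        System.out.println(Arrays.toString(decode(deltas))); // [3, 7, 8, 20]
        try {
            encode(new int[] {7, 3}); // unsorted input
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Sorted input keeps every gap small and non-negative, which is what lets a delta scheme pack values into few bits. Random-order input such as rand.nextInt(maxDoc) breaks that assumption.
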
I rewrote and ran your test case like this:

public class KamikazeTest extends TestCase {

    public void testGrowingP4() {
        DocSet doc = DocSetFactory.getDocSetInstance(0, 35000000, 200000,
                DocSetFactory.FOCUS.SPACE);
        Random rand = new Random(System.currentTimeMillis());
        // int maxDoc = 3500000;
        // doc.addDoc(0);
        int i = 0;
        try {
            // Add ids in non-decreasing order; the step is random in [0, 50).
            while (i < 500000) {
                int nextDoc = i;
                doc.addDoc(nextDoc);
                i += rand.nextInt(50);
            }
        } catch (Exception e) {
            e.printStackTrace();
            return;
        }
        assertTrue(true);
    }
}

Thanks,
Anmol

Software Engineer
Anmol Bhasin
www.linkedin.com

Michael Mastroianni wrote:
>
> Hi--
>
> I just got kamikaze somewhat integrated into a project of mine. I'm
> having problems growing the DocIdSets, though. Up to the point where the
> first regrow happens, everything is fine. Once the regrow happens, I get
> an ArrayIndexOutOfBoundsException. The following unit test will exhibit
> this behavior. If I change the third param of getDocSetInstance to be
> something lower, I get a p4Doc; if I leave it as is, I get an OpenBitSet
> doc. In either case, I get the same crash. Do I need to initialize the
> docs in some way other than just creating them?
>
> regards,
> Michael
>
> import java.util.Random;
>
> import org.apache.lucene.search.DocIdSet;
> import org.apache.lucene.util.OpenBitSet;
>
> import com.kamikaze.docidset.api.DocSet;
> import com.kamikaze.docidset.impl.AndDocIdSet;
> import com.kamikaze.docidset.impl.OrDocIdSet;
> import com.kamikaze.docidset.utils.DocSetFactory;
>
> import junit.framework.TestCase;
>
> public class KamikazeTest extends TestCase
> {
>     public void testGrowingP4()
>     {
>         DocSet doc =
>             DocSetFactory.getDocSetInstance(0, 350000, 3000000,
>                 DocSetFactory.FOCUS.SPACE);
>         Random rand = new Random(System.currentTimeMillis());
>         int maxDoc = 350000;
>         doc.addDoc(rand.nextInt(maxDoc));
>         int i = 0;
>         try
>         {
>             while(i < 256)
>             {
>                 int nextDoc = rand.nextInt(maxDoc);
>                 doc.addDoc(nextDoc);
>                 ++i;
>             }
>         }
>         catch(Exception e)
>         {
>             return;
>         }
>         assertTrue(false);
>     }
> }
>
> -----Original Message-----
> From: John Wang [mailto:john.wang@gmail.com]
> Sent: Friday, April 24, 2009 7:50 PM
> To: java-user@lucene.apache.org
> Subject: Re: kamikaze
>
> Hi Michael:
>
> We are using it internally here at LinkedIn for both our search engine
> and our social graph engine, and we have a team actively developing it.
> Let us know how we can help you.
>
> -John
>
> On Fri, Apr 24, 2009 at 1:56 PM, Michael Mastroianni <
> MMastroianni@glgroup.com> wrote:
>
>> Hi--
>>
>> Has anyone here used kamikaze much? I'm interested in using it in
>> situations where I'll have several docidsets of >2M, plus several in
>> the 10s of thousands.
>>
>> On a prototype basis, I got something running nicely using OpenBitSet,
>> but I can't use that much memory for my real application.
>>
>> regards,
>>
>> Michael Mastroianni
>>
>> This e-mail message, and any attachments, is intended only for the use
>> of the individual or entity identified in the alias address of this
>> message and may contain information that is confidential, privileged
>> and subject to legal restrictions and penalties regarding its
>> unauthorized disclosure and use. Any unauthorized review, copying,
>> disclosure, use or distribution is strictly prohibited. If you have
>> received this e-mail message in error, please notify the sender
>> immediately by reply e-mail and delete this message, and any
>> attachments, from your system. Thank you.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

--
View this message in context: http://www.nabble.com/kamikaze-tp23224760p23288825.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.