Return-Path: Delivered-To: apmail-lucene-java-user-archive@www.apache.org Received: (qmail 86685 invoked from network); 18 Dec 2008 10:04:27 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 18 Dec 2008 10:04:27 -0000 Received: (qmail 97572 invoked by uid 500); 18 Dec 2008 10:04:32 -0000 Delivered-To: apmail-lucene-java-user-archive@lucene.apache.org Received: (qmail 97543 invoked by uid 500); 18 Dec 2008 10:04:32 -0000 Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-user@lucene.apache.org Delivered-To: mailing list java-user@lucene.apache.org Received: (qmail 97532 invoked by uid 99); 18 Dec 2008 10:04:32 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 18 Dec 2008 02:04:32 -0800 X-ASF-Spam-Status: No, hits=-4.0 required=10.0 tests=RCVD_IN_DNSWL_MED,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: local policy) Received: from [171.71.176.117] (HELO sj-iport-6.cisco.com) (171.71.176.117) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 18 Dec 2008 10:04:09 +0000 X-IronPort-AV: E=Sophos;i="4.36,242,1228089600"; d="scan'208";a="215375317" Received: from sj-dkim-4.cisco.com ([171.71.179.196]) by sj-iport-6.cisco.com with ESMTP; 18 Dec 2008 10:03:47 +0000 Received: from india-core-1.cisco.com (india-core-1.cisco.com [64.104.129.221]) by sj-dkim-4.cisco.com (8.12.11/8.12.11) with ESMTP id mBIA3jBk000960 for ; Thu, 18 Dec 2008 02:03:46 -0800 Received: from xbh-blr-411.apac.cisco.com (xbh-blr-411.cisco.com [64.104.140.150]) by india-core-1.cisco.com (8.13.8/8.13.8) with ESMTP id mBIA3cgX018637 for ; Thu, 18 Dec 2008 10:03:42 GMT Received: from xfe-blr-411.apac.cisco.com ([64.104.140.151]) by xbh-blr-411.apac.cisco.com with Microsoft SMTPSVC(6.0.3790.1830); Thu, 18 Dec 2008 15:33:44 +0530 Received: from [10.65.78.248] ([10.65.78.248]) by xfe-blr-411.apac.cisco.com with Microsoft SMTPSVC(6.0.3790.1830); Thu, 18 Dec 2008 15:33:43 +0530 Message-ID: <494A1FE2.8010101@cisco.com> Date: Thu, 18 Dec 2008 15:33:14 +0530 From: Preetham Kajekar User-Agent: Thunderbird 2.0.0.6 (Windows/20070728) MIME-Version: 1.0 To: java-user@lucene.apache.org Subject: Re: Combining results of multiple indexes References: <494894D2.9010901@cisco.com> <5B9C9A9A-FFA3-409B-9114-B916D9B0C1AA@apache.org> <49490F4B.1030607@cisco.com> <359a92830812170720m8fc2d0bw4ad837bd6a4f051e@mail.gmail.com> <49491EED.4070403@cisco.com> <359a92830812171021o183713c9h5b0914539660383a@mail.gmail.com> <4949D6D6.7000402@cisco.com> In-Reply-To: <4949D6D6.7000402@cisco.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-OriginalArrivalTime: 18 Dec 2008 10:03:43.0414 (UTC) FILETIME=[E8B51560:01C960F7] DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; l=10900; t=1229594626; x=1230458626; c=relaxed/simple; s=sjdkim4002; h=Content-Type:From:Subject:Content-Transfer-Encoding:MIME-Version; d=cisco.com; i=preetham@cisco.com; z=From:=20Preetham=20Kajekar=20 |Subject:=20Re=3A=20Combining=20results=20of=20multiple=20i ndexes |Sender:=20; bh=+8qkGGxwwRCLuRHCF1vApBXuV/n+IP8/7gRpIFXejsY=; b=TVlLW8cOl81vZismGRTYHPkqb5f57EqLVFZLNiWFaDNWdvH90RfgW75bYg 05TEdTh8He5l43EOUEd3W27nep9mPRkiBl+DjnCUCpkOFdvjCQml0PCaMsmV ixs86Qmjmu; Authentication-Results: sj-dkim-4; header.From=preetham@cisco.com; dkim=pass ( sig from cisco.com/sjdkim4002 verified; ); X-Virus-Checked: Checked by ClamAV on apache.org Hi, I tried out a single IndexWriter used by two threads to index different fields. It is slower than using two separate IndexWriters. These are my findings All Fields (9) using 1 IndexWriter 1 Thread - 38,000 object per sec 5 Fields using 1 IndexWriter 1 Thread - 62,000 object per sec All Fields (9) using 1 IndexWriter 2 Thread - 29,000 object per sec All Fields (9) using 2 IndexWriter 2 Thread - 55,000 object per sec So, it looks like I will have figure how to combine results of multiple indexes. Thanks, ~preetham Preetham Kajekar wrote: > Thanks Erick and Michael. > I will try out these suggestions and post my findings. > > ~preetham > > Erick Erickson wrote: >> Well, maybe if I'd read the original post more carefully I'd have >> figured >> that out, >> sorry 'bout that. >> >> I *think* I remember reading somewhere on the email lists that your >> indexing >> speed goes up pretty linearly as the number of indexing tasks approaches >> the number of CPUs. Are you, perhaps, on a dual-core machine? But do >> search >> the mail archives because my memory may not be accurate. >> >> You can easily combine indexes by IndexWriter.addIndexes BTW. Personally >> I prefer fewer indexes if you can get away with it. But I'd only try >> this >> after >> Michael's suggestion of using multiple threads on a single underlying >> writer. >> >> You could even think about using N machines to create M fragments then >> combining them all afterwards if your logs are static enough to make >> that >> reasonable. Combining indexes may take a while though..... >> >> Best >> Erick >> >> On Wed, Dec 17, 2008 at 10:46 AM, Preetham Kajekar >> wrote: >> >> >>> Hi Erick, >>> Thanks for the response. Replies inline. >>> >>> Erick Erickson wrote: >>> >>> >>>> The very first question is always "are you opening a new searcher >>>> each time you query"? But you've looked at the Wiki so I assume not. >>>> This question is closely tied to what kind of latency you can >>>> tolerate. >>>> >>>> A few more details, please. What's slow? Queries? Indexing? >>>> >>>> >>>> >>> Indexing. Again, it is not slow. It is just faster with two separate >>> indexers in two threads. >>> >>> >>>> How slow? 100ms? 100s? What are your target times and >>>> what are you seeing? >>>> >>>> >>>> >>> With a single indexer in a single thread, I can index about 20,000 >>> event >>> objects per second. With 2 thread and 2 indexers, it is close to >>> 50,000. :-) >>> >>> >>>> How big is your index? 100M? 100G? What kind of VM >>>> parameters are you specifying? >>>> >>>> >>>> >>> The index will have about 20mil entries. The size of the index lands up >>> being about 500M. >>> I start the VM with 1G of heap. No other options for GC etc is used. >>> >>> >>>> As an aside, do note that there's no requirement in Lucene that >>>> each document have the same fields, so it's unclear why you >>>> need two indexes, but perhaps some of the answers to the above >>>> will help us understand. >>>> >>>> >>>> >>> Like I mentioned, Lucene does the job much faster with two indexes. >>> >>> >>>> Also, be very very careful what you measure when you measure >>>> queries. You absolutely *have* to put some instrumentation in >>>> the code since "slow queries" can result from things other than >>>> searching. For instance, iterating over a Hits object for 100s of >>>> documents.... >>>> >>>> >>>> >>> The Query speeds are much faster than what I need :-) So no >>> complains here. >>> >>> >>>> Show the code, man ! >>>> >>>> >>>> >>> Code below. EvIndexer is the base class. There are two subclasses which >>> implement addEvFieldsToIndexDoc() (template pattern) to add >>> different fields >>> to the index. that code is also pasted below >>> >>> --Code --- >>> >>> BaseClass >>> >>> public EvIndexer(String indexName) throws Exception { >>> this.name = indexName; >>> a = new KeywordAnalyzer(); >>> INDEX_PATH = System.getProperty(StoreManager.PROP_DB_DB_LOC, >>> "./index/"); >>> FSDirectory directory = FSDirectory.getDirectory(INDEX_PATH + >>> File.separatorChar + indexName, NoLockFactory.getNoLockFactory()); >>> indexWriter = new IndexWriter(directory, a, >>> IndexWriter.MaxFieldLength.LIMITED); >>> //indexWriter.setUseCompoundFile(false); >>> //indexWriter.setRAMBufferSizeMB(256); >>> } >>> /** Method implemented by extending classes to add data into the >>> index document for the >>> * given event >>> * >>> * @param d >>> */ >>> protected abstract void addEvFieldsToIndexDoc(Document d, Ev event); >>> public void addToIndex(Ev ev) throws Exception { >>> noOfEventsIndexed++; >>> Document d = new Document(); addEvFieldsToIndexDoc(d, >>> ev); >>> indexWriter.addDocument(d); >>> if ((noOfEventsIndexed % COMMIT_INTERVAL) == 0) { >>> System.out.println(name + " indexed " + >>> NumberFormat.getInstance().format(noOfEventsIndexed) + " Commiting >>> them"); >>> commit(); >>> } } >>> >>> DerievdClass1 >>> protected void addEvFieldsToIndexDoc(Document d, Ev ev) { >>> //noOfEventsIndexed++; >>> Field id = new Field(EV_ID, Long.toString(ev.getId()), >>> Field.Store.YES, Field.Index.NO); >>> Field src = new Field(EV_SRC, Long.toString(ev.getSrcId()), >>> Field.Store.NO, Field.Index.NOT_ANALYZED); >>> Field type = new Field(EV_TYPE, >>> Integer.toString(ev.getEventTypeId()), Field.Store.NO, >>> Field.Index.NOT_ANALYZED); >>> Field pri = new Field(EV_PRI, Short.toString(ev.getPriority()) , >>> Field.Store.NO, Field.Index.NOT_ANALYZED); >>> Field time = new Field(EV_TIME, getHexString(ev.getRecvTime()) , >>> Field.Store.NO, Field.Index.NOT_ANALYZED); >>> d.add(id); >>> d.add(src); >>> d.add(type); >>> d.add(pri); >>> d.add(time); >>> //noOfFieldsIndexed += 4; >>> } >>> >>> >>> >>> >>> Thanks for the support. >>> ~preetham >>> >>> >>> Best >>> >>>> Erick >>>> >>>> >>>> On Wed, Dec 17, 2008 at 9:40 AM, Preetham Kajekar >>> >>>>> wrote: >>>>> >>>> >>>> >>>>> Hi Grant, >>>>> Thanks four response. Replies inline. >>>>> >>>>> Grant Ingersoll wrote: >>>>> >>>>> >>>>> >>>>> >>>>>> On Dec 17, 2008, at 12:57 AM, Preetham Kajekar wrote: >>>>>> >>>>>> Hi, >>>>>> >>>>>> >>>>>> >>>>>>> I am new to Lucene. I am not using it as a pure text indexer. >>>>>>> >>>>>>> I am trying to index a Java object which has about 10 fields >>>>>>> (like id, >>>>>>> time, srcIp, dstIp) - most of them being numerical values. >>>>>>> In order to speed up indexing, I figured that having two separate >>>>>>> indexers, each of them indexing different set of fields works >>>>>>> great. So >>>>>>> I >>>>>>> have the first 5 fields in index1 and the remaining in index2. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> Can you explain this a bit more? Are those two fields really >>>>>> large org >>>>>> something? How are you obtaining them? How are you correlating the >>>>>> documents between the two indexes? Did you actually try a single >>>>>> index >>>>>> and >>>>>> it was too slow? >>>>>> >>>>>> >>>>>> >>>>>> >>>>> I have a java object which has about 10 fields. However, the >>>>> fields are >>>>> not >>>>> fixed. The java object is essentially a representation of Syslogs >>>>> from >>>>> network devices. So different syslogs have different fields. Each >>>>> field >>>>> has >>>>> a unique id and a value (mostly numeric types, so i convert it to >>>>> string). >>>>> There are some fixed fields. So the object is a list of fields >>>>> which is >>>>> produced by a parser. >>>>> I am trying to index using two indexers in two separate threads- >>>>> one for >>>>> fixed and another for the non-fixed fields. Except for a unique >>>>> id, I do >>>>> not >>>>> store the fields in Lucene - i just index them. From the index, i >>>>> get the >>>>> unique id which is all I care about. (the objects are stored >>>>> elsewhere >>>>> and >>>>> can be looked up based on this unique id). >>>>> I did try using a single indexer, but things were quite slow. Getting >>>>> high >>>>> throughput is crucial and having two indexers seemed to do very well. >>>>> (more >>>>> than twice as fast) >>>>> >>>>> Further, the index will never be modified and I can have just one >>>>> thread >>>>> writing to the index. If there are any other performance tips >>>>> would be >>>>> very >>>>> helpful. I have already looked at the wiki link regarding >>>>> performance and >>>>> using some of them. >>>>> >>>>> Thanks, >>>>> ~preetham >>>>> >>>>> >>>>> >>>>> >>>>> >>>>>> Now, I want to have boolean AND query's looking for values in both >>>>>> >>>>>>> indexes. Like f1=1234 AND f7=ABCD.f1 and f7 and present in two >>>>>>> separate >>>>>>> indexes. Would using the MultiIndexReader help ? Since I am >>>>>>> doing an >>>>>>> AND, I >>>>>>> dont expect that it would work. >>>>>>> >>>>>>> Thanks, >>>>>>> ~preetham >>>>>>> >>>>>>> --------------------------------------------------------------------- >>>>>>> >>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org >>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> -------------------------- >>>>>> Grant Ingersoll >>>>>> >>>>>> Lucene Helpful Hints: >>>>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance >>>>>> http://wiki.apache.org/lucene-java/LuceneFAQ >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> --------------------------------------------------------------------- >>>>>> >>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org >>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>> --------------------------------------------------------------------- >>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org >>>>> For additional commands, e-mail: java-user-help@lucene.apache.org >>>>> >>>>> >>>>> >>>>> >>>>> >>>> >>>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org >>> For additional commands, e-mail: java-user-help@lucene.apache.org >>> >>> >>> >> >> > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For additional commands, e-mail: java-user-help@lucene.apache.org