Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (athena.apache.org: local policy)
Message-ID: <494A1FE2.8010101@cisco.com>
Date: Thu, 18 Dec 2008 15:33:14 +0530
From: Preetham Kajekar <preetham@cisco.com>
User-Agent: Thunderbird 2.0.0.6 (Windows/20070728)
MIME-Version: 1.0
To: java-user@lucene.apache.org
Subject: Re: Combining results of multiple indexes
References: <494894D2.9010901@cisco.com>
	 <5B9C9A9A-FFA3-409B-9114-B916D9B0C1AA@apache.org>
	 <49490F4B.1030607@cisco.com>
	 <359a92830812170720m8fc2d0bw4ad837bd6a4f051e@mail.gmail.com>
	 <49491EED.4070403@cisco.com>
 <359a92830812171021o183713c9h5b0914539660383a@mail.gmail.com>
 <4949D6D6.7000402@cisco.com>
In-Reply-To: <4949D6D6.7000402@cisco.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Hi,
 I tried out a single IndexWriter used by two threads to index different 
fields. It is slower than using two separate IndexWriters. These are my 
findings:

All Fields (9) using 1 IndexWriter,  1 Thread  - 38,000 objects per sec
5 Fields       using 1 IndexWriter,  1 Thread  - 62,000 objects per sec
All Fields (9) using 1 IndexWriter,  2 Threads - 29,000 objects per sec
All Fields (9) using 2 IndexWriters, 2 Threads - 55,000 objects per sec

So, it looks like I will have to figure out how to combine the results of
multiple indexes.
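
One way this might work, given that each index stores only the unique id:
run one sub-query against each index, collect the ids from each, and
intersect them to emulate a cross-index AND such as f1=1234 AND f7=ABCD.
A minimal sketch, assuming the ids from each search have already been
collected into lists. The class and method names (IdIntersection, and) are
hypothetical, and no Lucene calls are shown:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: emulates "queryOnIndex1 AND queryOnIndex2" by
// intersecting the unique ids returned by two independent searches.
class IdIntersection {

    /** Returns the ids present in both lists, in the order of the first list. */
    static List<Long> and(List<Long> idsFromIndex1, List<Long> idsFromIndex2) {
        // Hash the second result set so each membership test is O(1) on average.
        Set<Long> second = new HashSet<Long>(idsFromIndex2);
        List<Long> both = new ArrayList<Long>();
        for (Long id : idsFromIndex1) {
            if (second.contains(id)) {
                both.add(id);
            }
        }
        return both;
    }
}
```

The cost is that both sub-queries have to be fully enumerated before the
intersection, so a very unselective clause on either index could make
this expensive.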

Thanks,
 ~preetham

Preetham Kajekar wrote:
> Thanks Erick and Michael.
> I will try out these suggestions and post my findings.
>
> ~preetham
>
> Erick Erickson wrote:
>> Well, maybe if I'd read the original post more carefully I'd have
>> figured that out, sorry 'bout that.
>>
>> I *think* I remember reading somewhere on the email lists that your
>> indexing speed goes up pretty linearly as the number of indexing tasks
>> approaches the number of CPUs. Are you, perhaps, on a dual-core machine?
>> But do search the mail archives because my memory may not be accurate.
>>
>> You can easily combine indexes with IndexWriter.addIndexes, BTW.
>> Personally I prefer fewer indexes if you can get away with it. But I'd
>> only try this after Michael's suggestion of using multiple threads on a
>> single underlying writer.
>>
>> You could even think about using N machines to create M fragments, then
>> combining them all afterwards, if your logs are static enough to make
>> that reasonable. Combining indexes may take a while though.....
>>
>> Best
>> Erick
>>
>> On Wed, Dec 17, 2008 at 10:46 AM, Preetham Kajekar <preetham@cisco.com> wrote:
>>
>>  
>>> Hi Erick,
>>> Thanks for the response. Replies inline.
>>>
>>> Erick Erickson wrote:
>>>
>>>    
>>>> The very first question is always "are you opening a new searcher
>>>> each time you query"? But you've looked at the Wiki so I assume not.
>>>> This question is closely tied to what kind of latency you can 
>>>> tolerate.
>>>>
>>>> A few more details, please. What's slow? Queries? Indexing?
>>>>
>>>>
>>>>       
>>> Indexing. Again, it is not slow. It is just faster with two separate
>>> indexers in two threads.
>>>
>>>    
>>>> How slow? 100ms? 100s? What are your target times and
>>>> what are you seeing?
>>>>
>>>>
>>>>       
>>> With a single indexer in a single thread, I can index about 20,000
>>> event objects per second. With 2 threads and 2 indexers, it is close
>>> to 50,000. :-)
>>>
>>>    
>>>> How big is your index? 100M? 100G? What kind of VM
>>>> parameters are you specifying?
>>>>
>>>>
>>>>       
>>> The index will have about 20 million entries. The size of the index
>>> ends up being about 500M.
>>> I start the VM with 1G of heap. No other options (for GC etc.) are used.
>>>
>>>    
>>>> As an aside, do note that there's no requirement in Lucene that
>>>> each document have the same fields, so it's unclear why you
>>>> need two indexes, but perhaps some of the answers to the above
>>>> will help us understand.
>>>>
>>>>
>>>>       
>>> Like I mentioned, Lucene does the job much faster with two indexes.
>>>
>>>    
>>>> Also, be very very careful what you measure when you measure
>>>> queries. You absolutely *have* to put some instrumentation in
>>>> the code since "slow queries" can result from things other than
>>>> searching. For instance, iterating over a Hits object for 100s of
>>>> documents....
>>>>
>>>>
>>>>       
>>> The query speeds are much faster than what I need :-) So no
>>> complaints here.
>>>
>>>    
>>>> Show the code, man <G>!
>>>>
>>>>
>>>>       
>>> Code below. EvIndexer is the base class. There are two subclasses which
>>> implement addEvFieldsToIndexDoc() (template pattern) to add different
>>> fields to the index. That code is also pasted below.
>>>
>>> --Code ---
>>>
>>> BaseClass
>>>
>>>   public EvIndexer(String indexName) throws Exception {
>>>       this.name = indexName;
>>>       a = new KeywordAnalyzer();
>>>       INDEX_PATH = System.getProperty(StoreManager.PROP_DB_DB_LOC, "./index/");
>>>       FSDirectory directory = FSDirectory.getDirectory(
>>>           INDEX_PATH + File.separatorChar + indexName,
>>>           NoLockFactory.getNoLockFactory());
>>>       indexWriter = new IndexWriter(directory, a,
>>>           IndexWriter.MaxFieldLength.LIMITED);
>>>       //indexWriter.setUseCompoundFile(false);
>>>       //indexWriter.setRAMBufferSizeMB(256);
>>>   }
>>>
>>>   /**
>>>    * Method implemented by extending classes to add data into the
>>>    * index document for the given event.
>>>    *
>>>    * @param d
>>>    */
>>>   protected abstract void addEvFieldsToIndexDoc(Document d, Ev event);
>>>
>>>   public void addToIndex(Ev ev) throws Exception {
>>>       noOfEventsIndexed++;
>>>       Document d = new Document();
>>>       addEvFieldsToIndexDoc(d, ev);
>>>       indexWriter.addDocument(d);
>>>       if ((noOfEventsIndexed % COMMIT_INTERVAL) == 0) {
>>>           System.out.println(name + " indexed "
>>>               + NumberFormat.getInstance().format(noOfEventsIndexed)
>>>               + " Committing them");
>>>           commit();
>>>       }
>>>   }
>>>
>>> DerivedClass1
>>>
>>>   protected void addEvFieldsToIndexDoc(Document d, Ev ev) {
>>>       //noOfEventsIndexed++;
>>>       Field id = new Field(EV_ID, Long.toString(ev.getId()),
>>>           Field.Store.YES, Field.Index.NO);
>>>       Field src = new Field(EV_SRC, Long.toString(ev.getSrcId()),
>>>           Field.Store.NO, Field.Index.NOT_ANALYZED);
>>>       Field type = new Field(EV_TYPE, Integer.toString(ev.getEventTypeId()),
>>>           Field.Store.NO, Field.Index.NOT_ANALYZED);
>>>       Field pri = new Field(EV_PRI, Short.toString(ev.getPriority()),
>>>           Field.Store.NO, Field.Index.NOT_ANALYZED);
>>>       Field time = new Field(EV_TIME, getHexString(ev.getRecvTime()),
>>>           Field.Store.NO, Field.Index.NOT_ANALYZED);
>>>       d.add(id);
>>>       d.add(src);
>>>       d.add(type);
>>>       d.add(pri);
>>>       d.add(time);
>>>       //noOfFieldsIndexed +=  4;
>>>   }
>>>
>>>
>>>
>>>
>>> Thanks for the support.
>>> ~preetham
>>>
>>>
>>>> Best
>>>> Erick
>>>>
>>>>
>>>> On Wed, Dec 17, 2008 at 9:40 AM, Preetham Kajekar <preetham@cisco.com> wrote:
>>>>
>>>>> Hi Grant,
>>>>> Thanks for the response. Replies inline.
>>>>>
>>>>> Grant Ingersoll wrote:
>>>>>
>>>>>
>>>>>
>>>>>        
>>>>>> On Dec 17, 2008, at 12:57 AM, Preetham Kajekar wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> I am new to Lucene. I am not using it as a pure text indexer.
>>>>>>>
>>>>>>> I am trying to index a Java object which has about 10 fields (like
>>>>>>> id, time, srcIp, dstIp), most of them being numerical values.
>>>>>>> In order to speed up indexing, I figured that having two separate
>>>>>>> indexers, each of them indexing a different set of fields, works
>>>>>>> great. So I have the first 5 fields in index1 and the remaining in
>>>>>>> index2.
>>>>>>>
>>>>>> Can you explain this a bit more?  Are those two fields really large
>>>>>> or something?  How are you obtaining them?  How are you correlating
>>>>>> the documents between the two indexes?  Did you actually try a
>>>>>> single index and it was too slow?
>>>>>>
>>>>>>
>>>>>>
>>>>>>           
>>>>> I have a Java object which has about 10 fields. However, the fields
>>>>> are not fixed. The Java object is essentially a representation of
>>>>> syslogs from network devices, so different syslogs have different
>>>>> fields. Each field has a unique id and a value (mostly numeric types,
>>>>> so I convert it to a string). There are some fixed fields. So the
>>>>> object is a list of fields which is produced by a parser.
>>>>> I am trying to index using two indexers in two separate threads: one
>>>>> for the fixed and another for the non-fixed fields. Except for a
>>>>> unique id, I do not store the fields in Lucene; I just index them.
>>>>> From the index, I get the unique id, which is all I care about (the
>>>>> objects are stored elsewhere and can be looked up based on this
>>>>> unique id).
>>>>> I did try using a single indexer, but things were quite slow. Getting
>>>>> high throughput is crucial, and having two indexers seemed to do very
>>>>> well (more than twice as fast).
>>>>>
>>>>> Further, the index will never be modified and I can have just one
>>>>> thread writing to the index. Any other performance tips would be very
>>>>> helpful. I have already looked at the wiki link regarding performance
>>>>> and am using some of them.
>>>>>
>>>>> Thanks,
>>>>> ~preetham
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>        
>>>>>>> Now, I want to have boolean AND queries looking for values in both
>>>>>>> indexes, like f1=1234 AND f7=ABCD. f1 and f7 are present in two
>>>>>>> separate indexes. Would using the MultiIndexReader help? Since I am
>>>>>>> doing an AND, I don't expect that it would work.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> ~preetham
>>>>>>>
>>>>>>> --------------------------------------------------------------------- 
>>>>>>>
>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>             
>>>>>> --------------------------
>>>>>> Grant Ingersoll
>>>>>>
>>>>>> Lucene Helpful Hints:
>>>>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>>>>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>>>>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org