lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Combining results of multiple indexes
Date Thu, 18 Dec 2008 17:59:16 GMT

These results are surprising.

I'd expect a single IndexWriter with 2 threads to do better than a
single thread, but in your test two threads are significantly worse
than one.

Is it possible there's a bottleneck outside of Lucene in sourcing the  
documents?

How many segments are produced at the end of your indexing run?

Your use case is somewhat unusual because (I think?) you have no  
tokenized fields; I would expect the "overhead" of handling the  
document to be higher in this case, so maybe you are hitting some  
concurrency problem.

Mike
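
A rough way to rule the document source in or out is to time the same producer threads against a sink that does nothing, then compare that rate with the indexing rate. The following is only a sketch in plain Java with no Lucene calls; `SourceRateCheck`, `itemsPerSecond`, and the stand-in parser are hypothetical names, and the real event parser and indexing call would replace the supplier and sink.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;
import java.util.function.Supplier;

public class SourceRateCheck {

    /**
     * Pulls `total` items from `source` on `threads` worker threads, hands each
     * item to `sink`, and returns the observed throughput in items per second.
     */
    static <T> double itemsPerSecond(Supplier<T> source, Consumer<T> sink,
                                     int threads, long total) {
        AtomicLong remaining = new AtomicLong(total);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                // Each successful decrement claims exactly one item.
                while (remaining.getAndDecrement() > 0) {
                    sink.accept(source.get());
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while measuring", e);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return total / seconds;
    }

    public static void main(String[] args) {
        AtomicLong ids = new AtomicLong();
        // Hypothetical stand-in for the real event parser.
        Supplier<String> parser = () -> Long.toString(ids.incrementAndGet());
        // No-op sink: measures sourcing cost alone, with no indexing work.
        Consumer<String> noOpSink = s -> { };
        for (int threads = 1; threads <= 2; threads++) {
            double rate = itemsPerSecond(parser, noOpSink, threads, 1_000_000L);
            System.out.printf("%d thread(s): %.0f items/sec%n", threads, rate);
        }
    }
}
```

If the no-op rate with two threads is not comfortably above the indexing rates reported below, that would suggest the limit is in sourcing the documents rather than in the IndexWriter.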

Preetham Kajekar wrote:

> Hi,
> I tried out a single IndexWriter used by two threads to index  
> different fields. It is slower than using two separate IndexWriters.  
> These are my findings
>
> All Fields (9) using 1 IndexWriter,  1 thread  - 38,000 objects per sec
> 5 Fields       using 1 IndexWriter,  1 thread  - 62,000 objects per sec
> All Fields (9) using 1 IndexWriter,  2 threads - 29,000 objects per sec
> All Fields (9) using 2 IndexWriters, 2 threads - 55,000 objects per sec
>
> So, it looks like I will have to figure out how to combine results of
> multiple indexes.
>
> Thanks,
> ~preetham
>
> Preetham Kajekar wrote:
>> Thanks Erick and Michael.
>> I will try out these suggestions and post my findings.
>>
>> ~preetham
>>
>> Erick Erickson wrote:
>>> Well, maybe if I'd read the original post more carefully I'd have
>>> figured that out, sorry 'bout that.
>>>
>>> I *think* I remember reading somewhere on the email lists that your
>>> indexing speed goes up pretty linearly as the number of indexing tasks
>>> approaches the number of CPUs. Are you, perhaps, on a dual-core machine?
>>> But do search the mail archives because my memory may not be accurate.
>>>
>>> You can easily combine indexes by IndexWriter.addIndexes BTW. Personally
>>> I prefer fewer indexes if you can get away with it. But I'd only try this
>>> after Michael's suggestion of using multiple threads on a single
>>> underlying writer.
>>>
>>> You could even think about using N machines to create M fragments, then
>>> combining them all afterwards if your logs are static enough to make that
>>> reasonable. Combining indexes may take a while though...
>>>
>>> Best
>>> Erick
>>>
>>> On Wed, Dec 17, 2008 at 10:46 AM, Preetham Kajekar <preetham@cisco.com> wrote:
>>>
>>>
>>>> Hi Erick,
>>>> Thanks for the response. Replies inline.
>>>>
>>>> Erick Erickson wrote:
>>>>
>>>>
>>>>> The very first question is always "are you opening a new searcher
>>>>> each time you query"? But you've looked at the Wiki so I assume not.
>>>>> This question is closely tied to what kind of latency you can tolerate.
>>>>>
>>>>> A few more details, please. What's slow? Queries? Indexing?
>>>>>
>>>>>
>>>>>
>>>> Indexing. Again, it is not slow. It is just faster with two separate
>>>> indexers in two threads.
>>>>
>>>>
>>>>> How slow? 100ms? 100s? What are your target times and
>>>>> what are you seeing?
>>>>>
>>>>>
>>>>>
>>>> With a single indexer in a single thread, I can index about 20,000 event
>>>> objects per second. With 2 threads and 2 indexers, it is close to 50,000. :-)
>>>>
>>>>
>>>>> How big is your index? 100M? 100G? What kind of VM
>>>>> parameters are you specifying?
>>>>>
>>>>>
>>>>>
>>>> The index will have about 20 million entries. The size of the index ends up
>>>> being about 500M.
>>>> I start the VM with 1G of heap. No other options for GC etc. are used.
>>>>
>>>>
>>>>> As an aside, do note that there's no requirement in Lucene that
>>>>> each document have the same fields, so it's unclear why you
>>>>> need two indexes, but perhaps some of the answers to the above
>>>>> will help us understand.
>>>>>
>>>>>
>>>>>
>>>> Like I mentioned, Lucene does the job much faster with two indexes.
>>>>
>>>>
>>>>> Also, be very very careful what you measure when you measure
>>>>> queries. You absolutely *have* to put some instrumentation in
>>>>> the code since "slow queries" can result from things other than
>>>>> searching. For instance, iterating over a Hits object for 100s of
>>>>> documents....
>>>>>
>>>>>
>>>>>
>>>> The query speeds are much faster than what I need :-) So no complaints here.
>>>>
>>>>
>>>>> Show the code, man <G>!
>>>>>
>>>>>
>>>>>
>>>> Code below. EvIndexer is the base class. There are two subclasses which
>>>> implement addEvFieldsToIndexDoc() (template pattern) to add different fields
>>>> to the index. That code is also pasted below.
>>>>
>>>> --Code ---
>>>>
>>>> BaseClass
>>>>
>>>>  public EvIndexer(String indexName) throws Exception {
>>>>      this.name = indexName;
>>>>      a = new KeywordAnalyzer();
>>>>      INDEX_PATH = System.getProperty(StoreManager.PROP_DB_DB_LOC, "./index/");
>>>>      FSDirectory directory = FSDirectory.getDirectory(
>>>>          INDEX_PATH + File.separatorChar + indexName,
>>>>          NoLockFactory.getNoLockFactory());
>>>>      indexWriter = new IndexWriter(directory, a,
>>>>          IndexWriter.MaxFieldLength.LIMITED);
>>>>      //indexWriter.setUseCompoundFile(false);
>>>>      //indexWriter.setRAMBufferSizeMB(256);
>>>>  }
>>>>
>>>>  /**
>>>>   * Method implemented by extending classes to add data into the
>>>>   * index document for the given event.
>>>>   *
>>>>   * @param d
>>>>   */
>>>>  protected abstract void addEvFieldsToIndexDoc(Document d, Ev event);
>>>>
>>>>  public void addToIndex(Ev ev) throws Exception {
>>>>      noOfEventsIndexed++;
>>>>      Document d = new Document();
>>>>      addEvFieldsToIndexDoc(d, ev);
>>>>      indexWriter.addDocument(d);
>>>>      if ((noOfEventsIndexed % COMMIT_INTERVAL) == 0) {
>>>>          System.out.println(name + " indexed "
>>>>              + NumberFormat.getInstance().format(noOfEventsIndexed)
>>>>              + " Committing them");
>>>>          commit();
>>>>      }
>>>>  }
>>>>
>>>> DerivedClass1
>>>>
>>>>  protected void addEvFieldsToIndexDoc(Document d, Ev ev) {
>>>>      //noOfEventsIndexed++;
>>>>      Field id = new Field(EV_ID, Long.toString(ev.getId()),
>>>>          Field.Store.YES, Field.Index.NO);
>>>>      Field src = new Field(EV_SRC, Long.toString(ev.getSrcId()),
>>>>          Field.Store.NO, Field.Index.NOT_ANALYZED);
>>>>      Field type = new Field(EV_TYPE, Integer.toString(ev.getEventTypeId()),
>>>>          Field.Store.NO, Field.Index.NOT_ANALYZED);
>>>>      Field pri = new Field(EV_PRI, Short.toString(ev.getPriority()),
>>>>          Field.Store.NO, Field.Index.NOT_ANALYZED);
>>>>      Field time = new Field(EV_TIME, getHexString(ev.getRecvTime()),
>>>>          Field.Store.NO, Field.Index.NOT_ANALYZED);
>>>>      d.add(id);
>>>>      d.add(src);
>>>>      d.add(type);
>>>>      d.add(pri);
>>>>      d.add(time);
>>>>      //noOfFieldsIndexed += 4;
>>>>  }
>>>>
>>>>
>>>>
>>>>
>>>> Thanks for the support.
>>>> ~preetham
>>>>
>>>>
>>>>> Best
>>>>> Erick
>>>>>
>>>>>
>>>>> On Wed, Dec 17, 2008 at 9:40 AM, Preetham Kajekar <preetham@cisco.com> wrote:
>>>>>>
>>>>>
>>>>>
>>>>>> Hi Grant,
>>>>>> Thanks for the response. Replies inline.
>>>>>>
>>>>>> Grant Ingersoll wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Dec 17, 2008, at 12:57 AM, Preetham Kajekar wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> I am new to Lucene. I am not using it as a pure text indexer.
>>>>>>>>
>>>>>>>> I am trying to index a Java object which has about 10 fields (like id,
>>>>>>>> time, srcIp, dstIp) - most of them being numerical values.
>>>>>>>> In order to speed up indexing, I figured that having two separate
>>>>>>>> indexers, each of them indexing a different set of fields, works great.
>>>>>>>> So I have the first 5 fields in index1 and the remaining in index2.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> Can you explain this a bit more?  Are those two fields really large or
>>>>>>> something?  How are you obtaining them?  How are you correlating the
>>>>>>> documents between the two indexes?  Did you actually try a single index
>>>>>>> and it was too slow?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> I have a Java object which has about 10 fields. However, the fields are
>>>>>> not fixed. The Java object is essentially a representation of syslogs from
>>>>>> network devices, so different syslogs have different fields. Each field
>>>>>> has a unique id and a value (mostly numeric types, so I convert it to a
>>>>>> string). There are some fixed fields. So the object is a list of fields
>>>>>> which is produced by a parser.
>>>>>> I am trying to index using two indexers in two separate threads - one for
>>>>>> fixed and another for the non-fixed fields. Except for a unique id, I do
>>>>>> not store the fields in Lucene - I just index them. From the index, I get
>>>>>> the unique id, which is all I care about. (The objects are stored elsewhere
>>>>>> and can be looked up based on this unique id.)
>>>>>> I did try using a single indexer, but things were quite slow. Getting high
>>>>>> throughput is crucial and having two indexers seemed to do very well
>>>>>> (more than twice as fast).
>>>>>>
>>>>>> Further, the index will never be modified and I can have just one thread
>>>>>> writing to the index. Any other performance tips would be very
>>>>>> helpful. I have already looked at the wiki link regarding performance and
>>>>>> am using some of the suggestions.
>>>>>>
>>>>>> Thanks,
>>>>>> ~preetham
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>> Now, I want to have boolean AND queries looking for values in both
>>>>>>>> indexes, like f1=1234 AND f7=ABCD. f1 and f7 are present in two separate
>>>>>>>> indexes. Would using the MultiIndexReader help? Since I am doing an
>>>>>>>> AND, I don't expect that it would work.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> ~preetham
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> --------------------------
>>>>>>> Grant Ingersoll
>>>>>>>
>>>>>>> Lucene Helpful Hints:
>>>>>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>>>>>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>
>
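
For the original question of an AND across fields that live in two separate indexes, one option is to run each clause against its own index, collect the stored unique ids from each, and intersect them in application code. Below is a minimal sketch in plain Java, assuming both indexes store the same unique id per event (the code posted in the thread shows EV_ID stored in only one subclass) and that each result set is small enough to hold in memory. `CrossIndexAnd` and `intersect` are hypothetical names, not Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CrossIndexAnd {

    /**
     * Returns the ids found in both hit collections, in the order they
     * appear in the first one. Each id is reported at most once.
     */
    static List<Long> intersect(Collection<Long> hitsFromIndex1,
                                Collection<Long> hitsFromIndex2) {
        Set<Long> second = new HashSet<>(hitsFromIndex2);
        List<Long> both = new ArrayList<>();
        for (Long id : hitsFromIndex1) {
            // remove() returns true only the first time an id is matched.
            if (second.remove(id)) {
                both.add(id);
            }
        }
        return both;
    }

    public static void main(String[] args) {
        // Hypothetical ids returned by f1=1234 on index1 and f7=ABCD on index2.
        List<Long> fromIndex1 = Arrays.asList(1L, 2L, 3L, 4L);
        List<Long> fromIndex2 = Arrays.asList(4L, 2L, 9L);
        System.out.println(intersect(fromIndex1, fromIndex2)); // prints [2, 4]
    }
}
```

The cost of this approach grows with the number of hits per clause, since every matching id has to be pulled out of each index before the intersection can be taken.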



