lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Combining results of multiple indexes
Date Thu, 18 Dec 2008 13:19:10 GMT
You will be stunned at how easy it is. The merging code should be
a dozen lines (and that only if you are merging 6 or so indexes)....

See IndexWriter.addIndexes or
IndexWriter.addIndexesNoOptimize

Best
Erick

On Thu, Dec 18, 2008 at 5:03 AM, Preetham Kajekar <preetham@cisco.com>wrote:

> Hi,
> I tried out a single IndexWriter used by two threads to index different
> fields. It is slower than using two separate IndexWriters. These are my
> findings
>
> All Fields (9) using 1 IndexWriter 1 Thread - 38,000 object per sec
> 5 Fields       using 1 IndexWriter 1 Thread - 62,000 object per sec
> All Fields (9) using 1 IndexWriter 2 Thread - 29,000 object per sec
> All Fields (9) using 2 IndexWriter 2 Thread - 55,000 object per sec
>
> So, it looks like I will have figure how to combine results of multiple
> indexes.
>
> Thanks,
> ~preetham
>
>
> Preetham Kajekar wrote:
>
>> Thanks Erick and Michael.
>> I will try out these suggestions and post my findings.
>>
>> ~preetham
>>
>> Erick Erickson wrote:
>>
>>> Well, maybe if I'd read the original post more carefully I'd have figured
>>> that out,
>>> sorry 'bout that.
>>>
>>> I *think* I remember reading somewhere on the email lists that your
>>> indexing
>>> speed goes up pretty linearly as the number of indexing tasks approaches
>>> the number of CPUs. Are you, perhaps, on a dual-core machine? But do
>>> search
>>> the mail archives because my memory may not be accurate.
>>>
>>> You can easily combine indexes by IndexWriter.addIndexes BTW. Personally
>>> I prefer fewer indexes if you can get away with it. But I'd only try this
>>> after
>>> Michael's suggestion of using multiple threads on a single underlying
>>> writer.
>>>
>>> You could even think about using N machines to create M fragments then
>>> combining them all afterwards if your logs are static enough to make that
>>> reasonable. Combining indexes may take a while though.....
>>>
>>> Best
>>> Erick
>>>
>>> On Wed, Dec 17, 2008 at 10:46 AM, Preetham Kajekar <preetham@cisco.com
>>> >wrote:
>>>
>>>
>>>
>>>> Hi Erick,
>>>> Thanks for the response. Replies inline.
>>>>
>>>> Erick Erickson wrote:
>>>>
>>>>
>>>>
>>>>> The very first question is always "are you opening a new searcher
>>>>> each time you query"? But you've looked at the Wiki so I assume not.
>>>>> This question is closely tied to what kind of latency you can tolerate.
>>>>>
>>>>> A few more details, please. What's slow? Queries? Indexing?
>>>>>
>>>>>
>>>>>
>>>>>
>>>> Indexing. Again, it is not slow. It is just faster with two separate
>>>> indexers in two threads.
>>>>
>>>>
>>>>
>>>>> How slow? 100ms? 100s? What are your target times and
>>>>> what are you seeing?
>>>>>
>>>>>
>>>>>
>>>>>
>>>> With a single indexer in a single thread, I can index about 20,000 event
>>>> objects per second. With 2 thread and 2 indexers, it is close to 50,000.
>>>> :-)
>>>>
>>>>
>>>>
>>>>> How big is your index? 100M? 100G? What kind of VM
>>>>> parameters are you specifying?
>>>>>
>>>>>
>>>>>
>>>>>
>>>> The index will have about 20mil entries. The size of the index lands up
>>>> being about 500M.
>>>> I start the VM with 1G of heap. No other options for GC etc is used.
>>>>
>>>>
>>>>
>>>>> As an aside, do note that there's no requirement in Lucene that
>>>>> each document have the same fields, so it's unclear why you
>>>>> need two indexes, but perhaps some of the answers to the above
>>>>> will help us understand.
>>>>>
>>>>>
>>>>>
>>>>>
>>>> Like I mentioned, Lucene does the job much faster with two indexes.
>>>>
>>>>
>>>>
>>>>> Also, be very very careful what you measure when you measure
>>>>> queries. You absolutely *have* to put some instrumentation in
>>>>> the code since "slow queries" can result from things other than
>>>>> searching. For instance, iterating over a Hits object for 100s of
>>>>> documents....
>>>>>
>>>>>
>>>>>
>>>>>
>>>> The Query speeds are much faster than what I need :-) So no complains
>>>> here.
>>>>
>>>>
>>>>
>>>>> Show the code, man <G>!
>>>>>
>>>>>
>>>>>
>>>>>
>>>> Code below. EvIndexer is the base class. There are two subclasses which
>>>> implement addEvFieldsToIndexDoc() (template pattern) to add different
>>>> fields
>>>> to the index. that code is also pasted below
>>>>
>>>> --Code ---
>>>>
>>>> BaseClass
>>>>
>>>>  public EvIndexer(String indexName) throws Exception {
>>>>      this.name = indexName;
>>>>      a = new KeywordAnalyzer();
>>>>      INDEX_PATH = System.getProperty(StoreManager.PROP_DB_DB_LOC,
>>>> "./index/");
>>>>      FSDirectory directory = FSDirectory.getDirectory(INDEX_PATH +
>>>> File.separatorChar + indexName, NoLockFactory.getNoLockFactory());
>>>>      indexWriter = new IndexWriter(directory, a,
>>>> IndexWriter.MaxFieldLength.LIMITED);
>>>> //indexWriter.setUseCompoundFile(false);
>>>>      //indexWriter.setRAMBufferSizeMB(256);
>>>>        }
>>>>      /** Method implemented by extending classes to add data into the
>>>> index document for the
>>>>   *  given event
>>>>   *
>>>>   * @param d
>>>>   */
>>>>  protected abstract void addEvFieldsToIndexDoc(Document d, Ev event);
>>>>    public void addToIndex(Ev ev) throws Exception {
>>>>      noOfEventsIndexed++;
>>>>      Document d = new Document();             addEvFieldsToIndexDoc(d,
>>>> ev);
>>>>      indexWriter.addDocument(d);
>>>>            if ((noOfEventsIndexed % COMMIT_INTERVAL) == 0) {
>>>>          System.out.println(name + " indexed " +
>>>> NumberFormat.getInstance().format(noOfEventsIndexed) + " Commiting
>>>> them");
>>>>          commit();
>>>>      }                   }
>>>>
>>>> DerievdClass1
>>>>  protected void addEvFieldsToIndexDoc(Document d, Ev ev) {
>>>>      //noOfEventsIndexed++;
>>>>            Field id = new Field(EV_ID, Long.toString(ev.getId()),
>>>> Field.Store.YES, Field.Index.NO);
>>>>      Field src = new Field(EV_SRC, Long.toString(ev.getSrcId()),
>>>> Field.Store.NO, Field.Index.NOT_ANALYZED);
>>>>      Field type = new Field(EV_TYPE,
>>>> Integer.toString(ev.getEventTypeId()), Field.Store.NO,
>>>> Field.Index.NOT_ANALYZED);
>>>>      Field pri = new Field(EV_PRI, Short.toString(ev.getPriority()) ,
>>>> Field.Store.NO, Field.Index.NOT_ANALYZED);
>>>>      Field time = new Field(EV_TIME, getHexString(ev.getRecvTime()) ,
>>>> Field.Store.NO, Field.Index.NOT_ANALYZED);
>>>>      d.add(id);
>>>>      d.add(src);
>>>>      d.add(type);
>>>>      d.add(pri);
>>>>      d.add(time);
>>>>      //noOfFieldsIndexed +=  4;
>>>>                  }
>>>>
>>>>
>>>>
>>>>
>>>> Thanks for the support.
>>>> ~preetham
>>>>
>>>>
>>>>  Best
>>>>
>>>>
>>>>> Erick
>>>>>
>>>>>
>>>>> On Wed, Dec 17, 2008 at 9:40 AM, Preetham Kajekar <preetham@cisco.com
>>>>>
>>>>>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Hi Grant,
>>>>>> Thanks four response. Replies inline.
>>>>>>
>>>>>> Grant Ingersoll wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Dec 17, 2008, at 12:57 AM, Preetham Kajekar wrote:
>>>>>>>
>>>>>>>  Hi,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> I am new to Lucene. I am not using it as a pure text indexer.
>>>>>>>>
>>>>>>>> I am trying to index a Java object which has about 10 fields
(like
>>>>>>>> id,
>>>>>>>> time, srcIp, dstIp) - most of them being numerical values.
>>>>>>>> In order to speed up indexing, I figured that having two
separate
>>>>>>>> indexers, each of them indexing different set of fields works
great.
>>>>>>>> So
>>>>>>>> I
>>>>>>>> have the first 5 fields in index1 and the remaining in index2.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> Can you explain this a bit more?  Are those two fields really
large
>>>>>>> org
>>>>>>> something?  How are you obtaining them?  How are you correlating
the
>>>>>>> documents between the two indexes?  Did you actually try a single
>>>>>>> index
>>>>>>> and
>>>>>>> it was too slow?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> I have a java object which has about 10 fields. However, the fields
>>>>>> are
>>>>>> not
>>>>>> fixed. The java object is essentially a representation of Syslogs
from
>>>>>> network devices. So different syslogs have different fields. Each
>>>>>> field
>>>>>> has
>>>>>> a unique id and a value (mostly numeric types, so i convert it to
>>>>>> string).
>>>>>> There are some fixed fields. So the object is a list of fields which
>>>>>> is
>>>>>> produced by a parser.
>>>>>> I am trying to index using two indexers in two separate threads-
one
>>>>>> for
>>>>>> fixed and another for the non-fixed fields. Except for a unique id,
I
>>>>>> do
>>>>>> not
>>>>>> store the fields in Lucene - i just index them. From the index, i
get
>>>>>> the
>>>>>> unique id which is all I care about. (the objects are stored elsewhere
>>>>>> and
>>>>>> can be looked up based on this unique id).
>>>>>> I did try using a single indexer, but things were quite slow. Getting
>>>>>> high
>>>>>> throughput is crucial and having two indexers seemed to do very well.
>>>>>> (more
>>>>>> than twice as fast)
>>>>>>
>>>>>> Further, the index will never be modified and I can have just one
>>>>>> thread
>>>>>> writing to the index. If there are any other performance tips would
be
>>>>>> very
>>>>>> helpful. I have already looked at the wiki link regarding performance
>>>>>> and
>>>>>> using some of them.
>>>>>>
>>>>>> Thanks,
>>>>>> ~preetham
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Now, I want to have boolean AND query's looking for values in
both
>>>>>>>
>>>>>>>
>>>>>>>> indexes. Like f1=1234 AND f7=ABCD.f1 and f7 and present in
two
>>>>>>>> separate
>>>>>>>> indexes. Would using the MultiIndexReader help ? Since I
am doing an
>>>>>>>> AND, I
>>>>>>>> dont expect that it would work.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> ~preetham
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>
>>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> --------------------------
>>>>>>> Grant Ingersoll
>>>>>>>
>>>>>>> Lucene Helpful Hints:
>>>>>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>>>>>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>>
>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message