lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erick Erickson <erickerick...@gmail.com>
Subject Re: Indexing slower in trunk
Date Fri, 17 Jun 2011 01:27:24 GMT
OK, I tried using separate machines for indexing and running Solr, connected
my work and personal Macs with an ethernet cable. Poor-man's network...

I also changed things a bit to parse the Wiki data and store 1.5M docs in memory
then then try to send them to Solr in various-sized batches, thus removing
all the work associated with reading/parsing the XML from the timings.

And the results are...ambiguous. So I re-read some of the blog posts by
Simon and Mark, and think that where I'm missing out is the phrase
"... computers with highly concurrent hardware ...". I don't have that, and
what I'm seeing is that DWPT doesn't seem to make much difference in
this situation. Of course my situation is probably totally irrelevant,
since I've
got to believe that people indexing *serious* data will have, shall we say,
more impressive hardware than I have.

Or perhaps I should say that whatever I do, I can get trunk and 3x to
perform pretty equivalently. Do note that what I'm really looking for is
the time until I can search the last document sent to Solr, so included
in here is a "commit" step. If I take that out, I'm seeing very substantial
gains in trunk. So presumably with a run that lasted longer than just
a couple of minutes I'd see impressive speedups.

I suspect that I just don't have enough hardware to consistently
encounter the situations where DWPT really shines.

It's also possible that I'm doing something stupid, but until some kind
person sets me up with sufficient hardware I'm afraid I'll have to drop
it <G>....

Best
Erick

On Thu, Jun 16, 2011 at 12:05 PM, Erick Erickson
<erickerickson@gmail.com> wrote:
> OK, after more tests I'm pretty sure that my personal machine
> that I'm testing on is just resource-constrained, leading to the
> results I mentioned before. After all, I'm running my Solr
> instance, the indexing program, etc on a Macbook
> with 1 CPU and 2 cores. The indexing program is parsing the
> XML.
>
> On a proper setup, where the indexing machine was separate
> from the machine(s) feeding the index process I suspect this would
> be a different story. Hmmmm, I may try that sometime too....
>
> Best
> Erick
>
> On Tue, Jun 14, 2011 at 9:25 AM, Uwe Schindler <uwe@thetaphi.de> wrote:
>> For simple removing deletes, there is also IW.expungeDeletes(), which is
>> less intensive! Not sure if solr support this, too, but as far as I know
>> there is an issue open.
>>
>> Also please note: As soon as one segment is selected for merging (the merge
>> policy may also do this dependent on the number of deletes in a segment), it
>> will reclaim all deleted ressources - that's what merging does. So expunging
>> deletes once per week is a good idea, if your index consists of very old and
>> large segments that are rarely merged anymore and lots of documents are
>> deleted from them.
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: uwe@thetaphi.de
>>
>>
>>> -----Original Message-----
>>> From: Erick Erickson [mailto:erickerickson@gmail.com]
>>> Sent: Tuesday, June 14, 2011 3:19 PM
>>> To: dev@lucene.apache.org
>>> Subject: Re: Indexing slower in trunk
>>>
>>> Optimization used to have a very noticeable impact on search speed prior
>> to
>>> some index format changes from quite a while ago.
>>>
>>> At this point the effect is much less noticeable, but the thing optimize
>> does
>>> do is reclaim resources from deleted documents. If you have lots of
>>> deletions, it's a good idea to periodically optimize, but in that case
>> it's often
>>> done pretty infrequently (once a
>>> day/week/month) rather than as part of any ongoing indexing process.
>>>
>>> Best
>>> Erick
>>>
>>> 2011/6/14 Yury Kats <yurykats@yahoo.com>:
>>> > On 6/14/2011 4:28 AM, Uwe Schindler wrote:
>>> >> indexing and optimizing was only a
>>> >> good idea pre Lucene-2.9, now it's mostly obsolete)
>>> >
>>> > Could you please elaborate on this? Is optimizing obsolete in general
>>> > or after indexing new documents? Is it obsolete after deletions? And
>>> > what it "mostly"?
>>> >
>>> > Thanks!
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For
>>> > additional commands, e-mail: dev-help@lucene.apache.org
>>> >
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For additional
>>> commands, e-mail: dev-help@lucene.apache.org
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message