lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: indexing performance problems
Date Mon, 08 Jun 2009 11:27:47 GMT

This isn't normal.

A mergeFactor of 150 is way too high; I'd put that back to 10 and see
if the problem persists.  Also make sure you're using
autoCommit=false, and try the suggestions here:

    http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

You're sure the JRE's heap size is big enough?
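
One rough way to check this from inside the indexing JVM is sketched below. The class name HeapCheck, the 50% safety margin, and the 400 MB estimate (the 300 MB RAM buffer plus the 100 MB read buffer from the quoted message) are assumptions for illustration only; the real requirement depends on the analyzer chain, document sizes, and how the read buffer is actually allocated.

```java
final class HeapCheck {
    private static final long MB = 1024L * 1024L;

    /** Maximum heap the JVM will attempt to use, in bytes (governed by -Xmx). */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /**
     * Returns true if maxHeapBytes covers requiredMb plus a safety margin.
     * marginPercent is a hypothetical fudge factor, not a Lucene setting.
     */
    static boolean hasHeadroom(long maxHeapBytes, long requiredMb, int marginPercent) {
        long neededBytes = requiredMb * MB * (100 + marginPercent) / 100;
        return maxHeapBytes >= neededBytes;
    }

    public static void main(String[] args) {
        // Assumed estimate: 300 MB IndexWriter RAM buffer + 100 MB read buffer.
        long requiredMb = 300 + 100;
        long max = maxHeapBytes();
        System.out.println("max heap: " + (max / MB) + " MB");
        System.out.println("room for ~" + requiredMb + " MB of buffers plus 50%: "
                + hasHeadroom(max, requiredMb, 50));
    }
}
```

If this reports too little room, raising -Xmx or lowering the RAM buffer size would be the first things to try.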

If the problem persists... can you turn on IndexWriter's infoStream
and post the resulting output leading up to the 100% CPU?  You might
also try "kill -QUIT" when the 100% CPU problem is happening, to catch
the stack trace of all threads, and post that too...
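
If sending a signal to the process is inconvenient, a similar snapshot can be taken from inside the JVM. This is only a sketch of an alternative, not what "kill -QUIT" itself prints: the class name ThreadDump and the output format are made up for illustration, and it relies on the standard Thread.getAllStackTraces() call (Java 5 and later).

```java
import java.util.Map;

final class ThreadDump {
    /** Formats the name, state, and stack trace of every live thread. */
    static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = entry.getKey();
            sb.append('"').append(t.getName()).append("\" state=")
              .append(t.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
    }
}
```

Calling dumpAllThreads() from a separate watchdog thread while the indexer appears stuck should show whether the busy thread is inside a merge, a flush, or the analyzer.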

Mike

On Mon, Jun 8, 2009 at 6:23 AM, Mateusz Berezecki<mateuszb@gmail.com> wrote:
> Hi list,
>
> I'm having trouble achieving good performance when indexing an XML
> Wikipedia dump.
> The indexing process works as follows:
>
> 1. setup FSDirectory
> 2. setup IndexWriter
> 3. setup custom analyzer chaining wikipediatokenizer, lowercasefilter,
> porterstemmer, stopfilter and lengthfilter
> 4. create XMLStreamReader that reads from XML file
> 5. run the parser and get <text> tag contents as well as <title>
> contents and insert them into Document
> 6. add document to the index
>
> The options for the writer are:
> - compound file is turned off
> - merge factor set to 150
> - ram buffer size is set to 300 MB
>
> In addition, the XML stream is read using bufferedfilereader
> with a buffer size of 100 MB.
>
> This all works well for the first couple of minutes, indexing extracted
> articles very quickly, but later on some problems start to show. The
> symptoms are:
> - the CPU is at 100% and the stream reading and indexing seem to stop
> - the application seems to be dead
> - it resumes after some time (anywhere from 1 to 40 minutes)
>
> I've double-checked my code for any problems and even rewritten it a
> couple of times, which makes me think that there's some problem in
> Lucene itself. The problem persists in both the 2.4.1 and 2.9-dev
> versions.
>
> Is there any known bug related to long-running batch indexing
> processes that operate on large documents? In my case the single XML
> file is 20 GB, and I'm surprised by how quickly the performance of
> the indexer degrades.
>
> Do you have any suggestions?
>
> best,
> Mateusz Berezecki
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
