lucene-java-user mailing list archives

From Toke Eskildsen <...@statsbiblioteket.dk>
Subject RE: Best practices for searcher memory usage?
Date Thu, 15 Jul 2010 08:23:05 GMT
On Wed, 2010-07-14 at 20:28 +0200, Christopher Condit wrote:

[Toke: No frequent updates]

> Correct - in fact there are no updates and no deletions. We index 
> everything offline when necessary and just swap the new index in...

So everything is rebuilt from scratch each time? Or do you mean that
you're only adding new documents, not changing old ones?

Either way, optimizing down to a single 140GB segment is heavy. Ignoring
the relatively light processing of the data, the I/O for merging still
requires, at the very minimum, reading and writing the full 140GB. Even
at 100MB/sec that is close to an hour of raw I/O (2 x 140GB / 100MB/sec
is roughly 48 minutes). This is of course less of a concern if you're
fine with a nightly batch job.

> By more segments do you mean not call optimize() at index time?

Either that or calling it with maxNumSegments = 10, where 10 is just a
wild guess. Your mileage will vary:
http://lucene.apache.org/java/3_0_0/api/all/org/apache/lucene/index/IndexWriter.html#optimize%28int%29
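For reference, a minimal sketch of a partial optimize against the 3.0
API; the index path and the segment count are illustrative only:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class PartialOptimize {
        public static void main(String[] args) throws Exception {
            // Open the existing index (path is hypothetical).
            IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("/path/to/index")),
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);
            writer.optimize(10); // merge down to at most 10 segments
            writer.close();
        }
    }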

[Toke: 10GB is a lot for such an index. What about sorting & faceting?]

> No faceting and no sorting (other than score) for this index...

[Toke: What queries?]

> Typically just TermQueries or BooleanQueries: (Chip OR Nacho OR Foo) 
> AND (Salsa OR Sauce) AND (This OR That)
> The latter is most typical.
> 
> With a single keyword it will execute in < 1 second. In a case where 
> there are 10 clauses it becomes much slower (which I understand, 
> just looking for ways to speed it up)...

As Erick Erickson recently wrote: "Since it doesn't make sense to me,
that must mean I don't understand the problem very thoroughly".

Your queries seem simple enough, and I would expect response times well
under a second with a warmed index and conventional local hard drives.
Together with the unexpectedly high memory requirement, my guess is that
something is going on with your terms. If you open the index with Luke,
it will tell you the number of terms. If that number is very high for
the fields you search on, it would explain the memory usage.
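If you'd rather check programmatically than through Luke, a small
sketch like this counts the unique terms per field with the 3.0 API
(the index path is hypothetical):

    import java.io.File;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.store.FSDirectory;

    public class TermCounter {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(
                FSDirectory.open(new File("/path/to/index")), true);
            Map<String, Long> counts = new HashMap<String, Long>();
            TermEnum terms = reader.terms(); // iterates all fields
            while (terms.next()) {
                String field = terms.term().field();
                Long c = counts.get(field);
                counts.put(field, c == null ? 1L : c + 1L);
            }
            terms.close();
            reader.close();
            System.out.println(counts); // unique term count per field
        }
    }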

You can also take a look at the document frequency of the most common
terms. If it is very high, that would explain the long execution times
for compound queries that use one or more of those terms. A stopword
filter would help in this case, if such a filter is acceptable to you.
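As a sketch, StandardAnalyzer in 3.0 accepts a custom stopword set. The
words below are made up; you would derive yours from the high-docFreq
terms you find in the inspection above:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    public class StopwordSetup {
        public static void main(String[] args) {
            // Hypothetical stopword set; use your own high-docFreq terms.
            Set<String> stopWords = new HashSet<String>(
                Arrays.asList("the", "of", "and"));
            StandardAnalyzer analyzer =
                new StandardAnalyzer(Version.LUCENE_30, stopWords);
            // Use the same analyzer at both index time and query time,
            // which implies rebuilding the index after changing the set.
        }
    }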

Regards,
Toke Eskildsen



