lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "NearRealtimeSearchTuning" by Peter
Date Mon, 15 Nov 2010 20:30:31 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "NearRealtimeSearchTuning" page has been changed by Peter.
http://wiki.apache.org/solr/NearRealtimeSearchTuning?action=diff&rev1=3&rev2=4

--------------------------------------------------

  Original text of this wiki page is from Peter Sturge and can be found in [[http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201009.mbox/<AANLkTinCgekJLbxe_BSaAhLCt_hLr_KwUxM5ZxOvt_GJ@mail.gmail.com>|this thread]].
  
  Go to NearRealtimeSearch for more information about this topic.
  
  == Solr with frequent commits ==
  Example environment: commit every 30 seconds, large index (>20 million docs), heavy use of facets, Solr 1.4.1 or branch_3x.
  
  Tuning an index where you are adding new data, never changing existing data.
  
  === Solution ===
  Here are some setup steps we've used that allow lots of faceting (we typically search with
at least 20-35 different facet fields, and date faceting/sorting) on large indexes, and still
keep decent search performance:
  
   1. Firstly, you should consider using the enum method for facet searches (facet.method=enum) unless you've got A LOT of memory on your machine. In our tests, this method uses a lot less memory and autowarms more quickly than fc. (Note: I've not tried the new segment-based 'fcs' option, as I can't find support for it in branch_3x; it looks nice for 4.x though.) Admittedly, for our data, enum is not quite as fast for searching as fc, but short of purchasing a Taiwanese RAM factory, it's a worthwhile tradeoff. If you do have access to LOTS of memory, and you can guarantee that the index won't grow beyond the memory capacity (i.e. you have some sort of deletion policy in place), fc can be a lot faster than enum when searching with lots of facets across many terms. (An example enum facet query is shown after this list.)
   1. Secondly, we've found that LRUCache is faster at autowarming than FastLRUCache - in
our tests, about 20% faster. Maybe this is just our environment - your mileage may vary. So,
our filterCache section in solrconfig.xml looks like this: <filterCache class="solr.LRUCache"
size="3600" initialSize="1400" autowarmCount="3600"/>
   1. For a 28GB index running in a quad-core x64 VMWare instance with 30 warmed facet fields, Solr runs at ~4GB. The filterCache size shown on the stats page is usually in the region of ~2400.
   1. It's also a good idea to have some firstSearcher/newSearcher event listener queries to allow new data to populate the caches. Of course, what you put in these depends on the facets you need/use. We've found a good combination is a firstSearcher with as many facets in the search as your environment can handle, then a subset of the most common facets for the newSearcher (see the sketch after this list).
   1. We also set <useColdSearcher>true</useColdSearcher>, just in case.
   1. Another key area for search performance with high commit rates is to use two Solr instances: one for the high-commit-rate indexing, and one for searching. The read-only searching instance can be a remote replica, or a local read-only instance that reads the same core as the indexing instance (for the latter, you'll need something that periodically refreshes it, i.e. runs commit()). This way, you can tune the indexing instance for write performance and the searching instance, as above, for maximum read performance.
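
  To make the points above concrete, here is a minimal sketch of the relevant <query> section of solrconfig.xml, plus an example facet query. The facet field names (type, user, host), the host/port, and the query values are hypothetical placeholders; only facet.method=enum, the filterCache settings, the firstSearcher/newSearcher listeners, and useColdSearcher come from the list above.

{{{
<!-- solrconfig.xml, <query> section (sketch; facet field names are placeholders) -->
<query>
  <!-- In our tests, LRUCache autowarmed ~20% faster than FastLRUCache -->
  <filterCache class="solr.LRUCache"
               size="3600"
               initialSize="1400"
               autowarmCount="3600"/>

  <!-- firstSearcher: warm as many facets as the environment can handle -->
  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="rows">0</str>
        <str name="facet">true</str>
        <str name="facet.method">enum</str>
        <str name="facet.field">type</str>
        <str name="facet.field">user</str>
        <str name="facet.field">host</str>
        <!-- ...more facet fields as needed... -->
      </lst>
    </arr>
  </listener>

  <!-- newSearcher: a subset of the most common facets, re-run after each commit -->
  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="rows">0</str>
        <str name="facet">true</str>
        <str name="facet.method">enum</str>
        <str name="facet.field">type</str>
      </lst>
    </arr>
  </listener>

  <!-- Serve searches even before the first searcher has finished warming -->
  <useColdSearcher>true</useColdSearcher>
</query>
}}}

  A search with enum faceting then looks something like this (again, field names and host are placeholders):

{{{
http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.method=enum&facet.field=type&facet.field=user&facet.field=host
}}}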
  
  Using the setup above, we get fantastic searching speed for small facet sets (well under 1 sec), and really good searching for large facet sets (a couple of seconds, depending on index size, number of facets, unique terms, etc.), even when searching against largeish indexes (>20 million docs). We have yet to see any OOM or GC errors using the techniques above, even in low-memory conditions.
  
  === Notes ===
   1. Regarding the point about running two Solr instances, one for the high-commit-rate indexing and one for searching (read-only == RO):
  
    * You can run multiple Solr instances in separate JVMs, with both having their solr.xml configured to use the same index folder. You need to be careful that one and only one of these instances will ever update the index at a time. The best way to ensure this is to use one for writing only, while the other is RO and never writes to the index. The RO instance is the one to tune for high search performance. Even though the RO instance doesn't write to the index, it still needs periodic (albeit empty) commits to kick off autowarming/cache refresh (see the sketch after this list).
    * Depending on your needs, you might not need two separate instances. We need them because the 'write' instance is also doing a lot of metadata pre-write operations in the same JVM as Solr, and so has its own memory requirements.
    * We use sharding all the time, and it works just fine with this scenario, as the RO instance is simply another shard in the pack.
   2. Note that the techniques/setup described in this thread don't fix the underlying potential for OutOfMemory errors: there can always be an index large enough to ask its JVM for more memory than is available for cache. These techniques do, however, mitigate the risk and provide an efficient balance between memory use and search performance. There are some interesting discussions going on for both Lucene and Solr regarding the '2 pounds of baloney into a 1 pound bag' problem of unbounded caches, with a number of interesting strategies. One strategy that I like, but haven't found in the discussion lists, is auto-limiting cache size/warming based on available resources (similar to the way file system caches use free memory). This would allow caches to adjust to their memory environment as indexes grow.
   3. Use 'simple' for lockType.
   4. Set maxWarmingSearchers to 1 as a way of minimizing the number of onDeckSearchers (both settings are shown in the sketch after this list).
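
  As a rough sketch of notes 1, 3 and 4: the RO instance can be nudged with an empty commit from cron (or any scheduler), and the lock/warming settings live in solrconfig.xml. The host/port and the schedule are hypothetical placeholders; the settings themselves come from the notes above.

{{{
# Issue a periodic empty commit against the read-only instance so it reopens
# its searcher and kicks off autowarming (e.g. from cron, every 30 seconds or so).
curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' --data-binary '<commit/>'
}}}

{{{
<!-- solrconfig.xml (Solr 1.4.x / branch_3x), relevant bits only -->
<mainIndex>
  <!-- Note 3: 'simple' = lock-file based locking -->
  <lockType>simple</lockType>
</mainIndex>

<query>
  <!-- Note 4: only one warming searcher at a time, keeping onDeckSearchers down -->
  <maxWarmingSearchers>1</maxWarmingSearchers>
</query>
}}}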
  
