lucene-commits mailing list archives

From "David Buttler (Confluence)" <>
Subject [CONF] Apache Solr Reference Guide > Running Solr on HDFS
Date Thu, 19 Sep 2013 16:52:00 GMT
Space: Apache Solr Reference Guide
Page: Running Solr on HDFS

Comment added by David Buttler:

My main concern when I initially looked at this is that standard Solr uses memory-mapped
files for performance reasons, which takes a lot of the burden off Java heap allocation. Does
the HDFS block cache Directory perform the same type of function as the memory-mapped files?
In other words, if I am currently using 200GB of virtual memory (as reported by top) in each
of my Solr instances, should I set -XX:MaxDirectMemorySize=200g? (I am currently allocating
a 32GB heap so that I have 16GB for a searcher, with enough room to start a new searcher
when necessary.) If so, I think a note describing how to transition from memory-mapped
files to HDFS would be useful. For example: "To determine the amount of direct memory needed
after the transition, check the amount of virtual memory your Solr processes are currently using."

Other than that, this seems fantastic.

In reply to a comment by Mark Miller:
There are a variety of reasons you might want to put Solr indexes into HDFS.

As Greg mentions above, one of those reasons might be the ease of dealing with disk space
if you are already using HDFS or intend to.

It also allows you to make different trade-offs in terms of fault tolerance. This HDFS
integration is just the beginning: once you can work with a shared filesystem, it becomes
easy to reassign indexes to new or existing nodes without standard recovery. In that case
you could count on HDFS for fault tolerance, which is much more hardened than the standard
SolrCloud replication fault tolerance at this point.

There are other synergies as well. If you are building indexes with MapReduce, it can really
make things nice and simple to just write the indexes to HDFS and then serve them from HDFS.

It's really just another storage option to consider, especially if you are already using HDFS,
and we hope that it is just the start.

In terms of performance, as in most cases with Hadoop, data will favor being local, where you
can use things like HDFS short-circuit local reads and avoid a network trip. In terms of writing, even with a
network trip, if your pipe is large enough, HDFS is not really the bottleneck. For reads,
the HDFS block cache Directory impl does a pretty good job of taking over for the local filesystem
cache. In addition, many HDFS nodes are outfitted with multiple drives, which brings its
own benefits that local filesystem options cannot easily match without setting up some sort
of RAID system.
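The local-read path mentioned above is configured on the Hadoop side. A minimal hdfs-site.xml sketch, using the standard Hadoop short-circuit read properties (the socket path is a placeholder and must exist on each node):

```xml
<configuration>
  <!-- Let clients co-located with a DataNode read block files
       directly from disk instead of over a TCP connection. -->
  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
  </property>
  <!-- UNIX domain socket shared by the DataNode and its local
       clients; path here is illustrative only. -->
  <property>
    <name>dfs.domain.socket.path</name>
    <value>/var/lib/hadoop-hdfs/dn_socket</value>
  </property>
</configuration>
```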

We have not focused on performance yet, so I'm sure there are many improvements to come, but
initial one-off comparison benchmarks are not bad at all.
