lucene-solr-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ted Dunning (JIRA)" <>
Subject [jira] Commented: (SOLR-1301) Solr + Hadoop
Date Wed, 03 Feb 2010 05:47:19 GMT


Ted Dunning commented on SOLR-1301:

Based on these observation, I have few questions. (I am a beginner to the Hadoop & Solr
world. So, please forgive me if my questions are silly):
1. As per above observation, SOLR-1045 patch is functionally better (performance I have not
verified yet ). Can anyone tell me, whats the actual advantage SOLR-1301 patch offers over
SOLR-1045 patch?
2. If both the jira issues are trying to solve the same problem, do we really need 2 separate

In the katta community, the recommended practice started with SOLR-1045 (what I call map-side
indexing) behavior, but I think that the consensus now is that SOLR-1301 behavior (what I
call reduce side indexing) is much, much better.  This is not necessarily the obvious result
given your observations.  There are some operational differences between katta and SOLR that
might make the conclusions different, but what I have observed is the following:

a) index merging is a really bad idea that seems very attractive to begin with because it
is actually pretty expensive and doesn't solve the real problems of bad document distribution
across shards.  It is much better to simply have lots of shards per machine (aka micro-sharding)
and use one reducer per shard.  For large indexes, this gives entirely acceptable performance.
 On a pretty small cluster, we can index 50-100 million large documents in multiple ways in
2-3 hours.  Index merging gives you no benefit compared to reduce side indexing and just increases
code complexity.

b) map-side indexing leaves you with indexes that are heavily skewed by being composed of
of documents from a single input split.  At retrieval time, this means that different shards
have very different term frequency profiles and very different numbers of relevant documents.
 This makes lots of statistics very difficult including term frequency computation, term weighting
and determining the number of documents to retrieve.  Map-side merge virtually guarantees
that you have to do two cluster queries, one to gather term frequency statistics and another
to do the actual query.  With reduce side indexing, you can provide strong probabilistic bounds
on how different the statistics in each shard can be so you can use local term statistics
and you can depend on the score distribution being this same which radically decreases the
number of documents you need to retrieve from each shard.

c) reduce-side indexing improves the balance of computation during retrieval.  If (as is the
rule) some document subset is hotter than other document subset due, say to data-source boosting
or recency boosting, you will have very bad cluster utilization with skewed shards from map-side
indexing while all shards will cost about the same for any query leading to good cluster utilization
and faster queries with reduce-side indexing.

d) with reduce-side indexing has properties that can be mathematically stated and proved.
 Map-side indexing only has comparable properties if you make unrealistic assumptions about
your original data.

e) micro-sharding allows very simple and very effective use of multiple cores on multiple
machines in a search cluster.  This can be very difficult to do with large shards or a single

Now, as you say, these advantages may evaporate if you are looking to produce a single output
index.  That seems, however, to contradict the whole point of scaling.   If you need to scale
indexing, presumably you also need to scale search speed and throughput.  As such you probably
want to have many shards rather than few.  Conversely, if you can stand to search a single
index, then you probably can stand to index on a single machine. 

Another thing to think about is the fact SOLR doesn't yet do micro-sharding or clustering
very well and, in particular, doesn't handle multiple shards per core.  That will be changing
before long, however, and it is very dangerous to design for the past rather than the future.

In case, you didn't notice, I strongly suggest you stick with reduce-side indexing.

> Solr + Hadoop
> -------------
>                 Key: SOLR-1301
>                 URL:
>             Project: Solr
>          Issue Type: Improvement
>    Affects Versions: 1.4
>            Reporter: Andrzej Bialecki 
>             Fix For: 1.5
>         Attachments: commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar,
hadoop.patch, log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
> This patch contains  a contrib module that provides distributed indexing (using Hadoop)
to Solr EmbeddedSolrServer. The idea behind this module is twofold:
> * provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat
> * avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat
consumes data produced by reduce tasks directly, without storing it in intermediate files.
Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts
as there are reducers, and the data to be indexed is not sent over the network.
> Design
> ----------
> Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn
uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer,
and it also instantiates an implementation of SolrDocumentConverter, which is responsible
for turning Hadoop (key, value) into a SolrInputDocument. This data is then added to a batch,
which is periodically submitted to EmbeddedSolrServer. When reduce task completes, and the
OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer.
> The API provides facilities to specify an arbitrary existing solr.home directory, from
which the conf/ and lib/ files will be taken.
> This process results in the creation of as many partial Solr home directories as there
were reduce tasks. The output shards are placed in the output directory on the default filesystem
(e.g. HDFS). Such part-NNNNN directories can be used to run N shard servers. Additionally,
users can specify the number of reduce tasks, in particular 1 reduce task, in which case the
output will consist of a single shard.
> An example application is provided that processes large CSV files and uses this API.
It uses a custom CSV processing to avoid (de)serialization overhead.
> This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue, you should
put it in contrib/hadoop/lib.
> Note: the development of this patch was sponsored by an anonymous contributor and approved
for release under Apache License.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message