hadoop-user mailing list archives

From Ted Dunning <tdunn...@maprtech.com>
Subject Re: Hadoop/Lucene + Solr architecture suggestions?
Date Fri, 12 Oct 2012 05:23:18 GMT
Your mileage will vary.  I find that this exact mechanism is what makes
reduce-side indexing very attractive, since it allows precise sharding
control.
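
A minimal sketch of that sharding control, assuming the map output key is
the document id and one reducer per shard (the class and types here are
illustrative, not from the thread): the partitioner is the distribution
function, so reducer i builds exactly shard i.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Routes every document to a fixed shard; the shard count equals the
    // number of reducers.
    public class ShardPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text docId, Text docBody, int numShards) {
            // A stable hash of the doc id keeps a document on the same
            // shard across job runs.
            return (docId.hashCode() & Integer.MAX_VALUE) % numShards;
        }
    }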

On Thu, Oct 11, 2012 at 10:20 PM, Lance Norskog <goksron@gmail.com> wrote:

> The problem is finding the documents in the search cluster. Making the
> indexes in the reducer means the final m/r pass has to segregate the
> documents according to the distribution function. This uses network
> bandwidth for Hadoop coordination, while using SolrCloud only pushes
> each document once. It's a disk bandwidth vs. network bandwidth
> problem, although a combiner in the final m/r pass would really speed
> up the Hadoop shuffle.
>
> Lance
>
>     From: "Ted Dunning" <tdunning@maprtech.com>
>     To: user@hadoop.apache.org
>     Sent: Wednesday, October 10, 2012 11:13:36 PM
>     Subject: Re: Hadoop/Lucene + Solr architecture suggestions?
>
>     Having SolrCloud do the indexing instead of having map-reduce do
> the indexing causes substantial write amplification.
>
>     The issue is that cores are replicated in SolrCloud.  To keep the
> cores in sync, all of the replicas index the same documents, leading to
> amplification.  Further amplification occurs when each replica's
> transaction log gets a copy of the document as well.
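
(A rough worked example of that arithmetic, assuming a replication factor
of 3: SolrCloud analyzes and indexes each document 3 times, once per
replica, and writes it to 3 transaction logs, on the order of 6 writes
per document; a single reduce-side indexer analyzes it once and writes
one copy.)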
>
>     Indexing in map-reduce has considerably higher latency, but
> it drops the write amplification dramatically because only one copy of
> the index needs to be created.  For MapR, this index is created
> directly in the DFS; in vanilla Hadoop you would typically create the
> index on local disk and copy it to HDFS.  The major expense, however,
> is the indexing itself, which is nice to do only once.  Reduce-side
> indexing has the advantage that your indexing bandwidth naturally
> increases with your cluster size.
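
A hedged sketch of that vanilla-Hadoop flow (field names and the output
path are made up for illustration): the reducer builds its shard with
Lucene on local disk, then copies the finished index to HDFS exactly once
in cleanup().

    import java.io.File;
    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class IndexingReducer extends Reducer<Text, Text, Text, Text> {
        private IndexWriter writer;
        private File localDir;

        @Override
        protected void setup(Context ctx) throws IOException {
            // One local index directory per reducer == one shard.
            localDir = new File("shard-"
                    + ctx.getTaskAttemptID().getTaskID().getId());
            writer = new IndexWriter(FSDirectory.open(localDir.toPath()),
                    new IndexWriterConfig(new StandardAnalyzer()));
        }

        @Override
        protected void reduce(Text docId, Iterable<Text> bodies, Context ctx)
                throws IOException {
            for (Text body : bodies) {
                Document doc = new Document();
                doc.add(new StringField("id", docId.toString(), Field.Store.YES));
                doc.add(new TextField("body", body.toString(), Field.Store.NO));
                writer.addDocument(doc);   // the only indexing pass
            }
        }

        @Override
        protected void cleanup(Context ctx) throws IOException {
            writer.close();
            // The single write-amplification step: one copy into HDFS.
            FileSystem fs = FileSystem.get(ctx.getConfiguration());
            fs.copyFromLocalFile(new Path(localDir.getAbsolutePath()),
                    new Path("/indexes/" + localDir.getName()));
        }
    }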
>
>     Deployment of indexes can be done by copying from HDFS, or by
> deploying directly over NFS from MapR.  Either way, all of the shard
> replicas appear under the live Solr.  Obviously, if you copy from
> HDFS, you have some issues with making sure that things appear
> correctly.  One way to deal with this is to split the copy into two
> steps: copy from HDFS to a staging area, then do a differential copy
> into the live directory that leaves as many files unchanged as
> possible.  You want that so that as much of the memory image as
> possible stays live.  With NFS and transactional updates, that isn't
> an issue, of course.
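
One possible shape for that differential copy, as a sketch (the directory
layout is assumed; deleting stale files is omitted): Lucene segment files
are write-once, so comparing name and length is usually enough to decide
what changed, and skipping unchanged files leaves their pages warm in
memory.

    import java.io.IOException;
    import java.nio.file.*;

    public class DifferentialCopy {
        // staging: the fresh shard fetched from HDFS; live: the directory
        // the searcher has open.
        public static void sync(Path staging, Path live) throws IOException {
            try (DirectoryStream<Path> files = Files.newDirectoryStream(staging)) {
                for (Path src : files) {
                    Path dst = live.resolve(src.getFileName());
                    boolean unchanged = Files.exists(dst)
                            && Files.size(dst) == Files.size(src);
                    if (!unchanged) {
                        // Copy only new or rewritten files (e.g. segments_N).
                        Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
                    }
                }
            }
        }
    }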
>
>     On the extreme side, you can host all of the searchers on
> NFS-hosted index shards, with one Solr instance devoted to indexing
> each shard.  As the indexers update the shards, each searching Solr
> instance will detect the changes after a few seconds.  This gives
> near-real-time search with high isolation between indexing and search
> loads.
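
At the Lucene level, that change detection can look like the following
sketch (the poll interval and path handling are placeholders; Solr's own
reopen machinery is more involved): DirectoryReader.openIfChanged() hands
back a new reader only when the shared index has actually changed.

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.store.FSDirectory;

    public class ReaderPoller {
        public static void main(String[] args) throws Exception {
            DirectoryReader reader =
                    DirectoryReader.open(FSDirectory.open(Paths.get(args[0])));
            while (true) {
                Thread.sleep(5000);  // "after a few seconds"
                DirectoryReader fresh = DirectoryReader.openIfChanged(reader);
                if (fresh != null) {     // null means nothing new on disk
                    reader.close();
                    reader = fresh;      // searches now see the new segments
                }
            }
        }
    }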
>
>
>     On Wed, Oct 10, 2012 at 10:38 PM, JAY <jayunit100@gmail.com> wrote:
>
>         How can you store Solr shards in Hadoop? Is each data node
> running a Solr server?  If so, is the reducer doing a trick to write
> to a local fs?
>
>         Sent from my iPad
>
>         On Oct 11, 2012, at 12:04 AM, "M. C. Srivas" <mcsrivas@gmail.com>
> wrote:
>
>             Interestingly, a few MapR customers have gone the other
> way, deliberately having the indexer put the Solr shards directly
> into MapR and letting it distribute them. That has made index
> management a cinch.
>
>             Otherwise they do run into what Tim alludes to.
>
>             On Wed, Oct 10, 2012 at 7:27 PM, Tim Williams
> <williamstw@gmail.com> wrote:
>
>                 On Wed, Oct 10, 2012 at 10:15 PM, Lance Norskog
> <goksron@gmail.com> wrote:
>                 > In the LucidWorks Big Data product, we handle this
> with a reducer that sends documents to a SolrCloud cluster. This way
> the index files are not managed by Hadoop.
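
For contrast, a rough sketch of the reducer-to-SolrCloud push Lance
describes, using SolrJ (the ZooKeeper address and collection name are
placeholders, and the Builder API differs between SolrJ versions):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrPushReducer extends Reducer<Text, Text, Text, Text> {
        private CloudSolrClient solr;

        @Override
        protected void setup(Context ctx) {
            solr = new CloudSolrClient.Builder()
                    .withZkHost("zk1:2181,zk2:2181/solr")  // placeholder
                    .build();
            solr.setDefaultCollection("docs");             // placeholder
        }

        @Override
        protected void reduce(Text docId, Iterable<Text> bodies, Context ctx)
                throws IOException {
            try {
                for (Text body : bodies) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", docId.toString());
                    doc.addField("body", body.toString());
                    solr.add(doc);  // every replica re-indexes this document
                }
            } catch (Exception e) {
                throw new IOException(e);
            }
        }

        @Override
        protected void cleanup(Context ctx) throws IOException {
            try {
                solr.commit();
                solr.close();
            } catch (Exception e) {
                throw new IOException(e);
            }
        }
    }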
>
>                 Hi Lance,
>                 I'm curious if you've gotten that to work with a
> decent-sized (e.g. >
>                 250 node) cluster?  Even a trivial cluster seemed to
> crush SolrCloud
>                 as of a few months ago, at least...
>
>                 Thanks,
>                 --tim
>
>                 > ----- Original Message -----
>                 > | From: "Ted Dunning" <tdunning@maprtech.com>
>                 > | To: user@hadoop.apache.org
>                 > | Cc: "Hadoop User" <user@hadoop.apache.org>
>                 > | Sent: Wednesday, October 10, 2012 7:58:57 AM
>                 > | Subject: Re: Hadoop/Lucene + Solr architecture
> suggestions?
>                 > |
>                 > | I prefer to create indexes in the reducer personally.
>                 > |
>                 > | Also you can avoid the copies if you use an
> advanced hadoop-derived
>                 > | distro. Email me off list for details.
>                 > |
>                 > | Sent from my iPhone
>                 > |
>                 > | On Oct 9, 2012, at 7:47 PM, Mark Kerzner
> <mark.kerzner@shmsoft.com>
>                 > | wrote:
>                 > |
>                 > | > Hi,
>                 > | >
>                 > | > if I create a Lucene index in each mapper,
> locally, then copy them
>                 > | > to under /jobid/mapid1, /jobid/mapid2, and then
> in the reducers
>                 > | > copy them to some Solr machine (perhaps even
> merging), does such
>                 > | > an architecture make sense for creating a
> searchable index with
>                 > | > Hadoop?
>                 > | >
>                 > | > Are there links for similar architectures and
> questions?
>                 > | >
>                 > | > Thank you. Sincerely,
>                 > | > Mark
>                 > |
>
