lucene-solr-user mailing list archives

From Amey - codeinventory <>
Subject Re: Moving to HDFS, How to merge indices from 8 servers ?‏‏
Date Mon, 15 Sep 2014 17:29:13 GMT
Well, I have 8 m1.large EC2 instances, each with 2 cores, 7GB RAM, and a 1TB EBS volume attached for the index.

In my case I don't expect the index to be stored in RAM, nor do I need quick responses, as it's not a real-time
application. I just want fault tolerance in the application and availability of the full data.

Is it better to use HDFS over a normal SolrCloud setup?


--- Original Message ---

From: "Michael Della Bitta" <>
Sent: September 15, 2014 9:26 PM
Subject: Re: Moving to HDFS, How to merge indices from 8 servers ?‏‏

There's not much about Solr Cloud or HDFS indexes that suggests you should
only have one logical shard. If your goal is better uptime with a sharded
index, you should add more replicas.
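For example, replicas can be added to an existing SolrCloud collection with the Collections API ADDREPLICA call. A hedged sketch; the hosts, ports, and shard names below are placeholders, not taken from this thread:

```shell
# Add a second copy of each shard so a single node failure no longer
# makes part of the index unavailable. Adjust host, collection, and
# shard names to your cluster.
curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=collection1&shard=shard1&node=ec2-host-2:8983_solr"

# Repeat for the remaining shards, spreading replicas across machines
# so no shard has its leader and replica on the same node.
curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=collection1&shard=shard2&node=ec2-host-3:8983_solr"
```

Once the replicas are active, SolrCloud routes queries around a downed node automatically.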

If your collection is small enough that one machine can serve one query
with acceptable performance, but you want to scale to many queries, then
just adding mirrors of a single-sharded collection is fine. But that's a
big "if."

Switching to HDFS is an option if you have enough RAM for your whole
collection, and have a lot of existing storage devoted to HDFS, or if you
want to batch create indexes. It's not really aimed at preserving uptime as
far as I know.
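To illustrate the trade-off: Solr on HDFS is enabled via the HdfsDirectoryFactory, and it still depends on an in-memory block cache for read performance, which is why RAM matters even when the index lives in HDFS. A minimal startup sketch; the NameNode host, port, and path are assumptions:

```shell
# Start Solr with its indexes stored in HDFS instead of local disk.
# hdfs://namenode:8020/solr is a placeholder for your HDFS home path.
java -Dsolr.directoryFactory=HdfsDirectoryFactory \
     -Dsolr.lock.type=hdfs \
     -Dsolr.hdfs.home=hdfs://namenode:8020/solr \
     -jar start.jar
```

Hot index blocks are served from the HdfsDirectoryFactory block cache, so an undersized cache means every miss goes back to HDFS and query latency suffers.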

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017


On Mon, Sep 15, 2014 at 11:23 AM, Amey Jadiye <>

> Thanks for the reply, Erick.
> I think I have some confusion about how Solr works with HDFS, and the
> solution I am thinking of could be corrected by the user community :)
> Here is the actual situation and the solution as implemented by me.
> *Use case*: I need a Google-like search engine that works in a
> distributed and fault-tolerant mode. We are collecting health-related
> URLs from a third-party system in large volume, approx 1 million/hour. We
> want to build an inventory that contains all of their details. Currently I
> fetch that URL data, break it into H1, P, DIV and similar tags with the
> help of the Jsoup library, and put it in Solr as documents with different
> boosts for different fields.
> After putting in this data, I have a custom program with which we
> categorize it all. For example, for all the cancer-related pages, I
> query Solr and fetch all URLs related to cancer with CursorMark,
> putting them in a file for further use by our system.
> *Old solution*: For this I built 8 Solr servers with 3
> ZooKeepers on individual AWS EC2 instances, with one collection of 8 shards.
> The problem with this solution is that whenever any instance goes down, I
> lose that data for a while. (link of current solution)
> *New (possibly faulty) solution*: I am thinking that HDFS, which is
> virtually a single file system, would be better: if one server goes down,
> the data is still available through another server. Below are the steps I
> am thinking of taking.
> 1. Merge all the 8 servers' indices into one.
> 2. Set up HDFS on the same 8 servers.
> 3. Put the merged index folder in HDFS so it is physically distributed
> across the 8 servers itself.
> 4. Restart the 8 servers, pointing each instance to HDFS.
> 5. Now I am ready to put data on the 8 servers and fetch through any one
> Solr instance; if that one is down, choose another, so I am guaranteed to
> get all the data.
> So does this solution sound good, or would you suggest another, better
> solution?
> Regards,
> Amey
> > Date: Thu, 11 Sep 2014 14:41:48 -0700
> > Subject: Re: Moving to HDFS, How to merge indices from 8 servers ?‏‏
> > From:
> > To:
> >
> > Um, I really think this is pretty likely to not be a great solution.
> > When you say "merge indexes", I'm thinking you want to go from 8
> > shards to 1 shard. Now, this can be done with the "merge indexes" core
> > admin API, see:
> >
> >
> > BUT.
> > 1>  This will break all things SolrCloud-ish assuming you created your
> > 8 shards under SolrCloud.
> > 2> Solr is usually limited by memory, so trying to fit enough of your
> > single huge index into memory may be problematic.
> >
> > This feels like an XY problem, _why_ are you asking about this? What
> > is the use-case you want to handle by this?
> >
> > Best,
> > Erick
> >
> > On Thu, Sep 11, 2014 at 7:44 AM, Amey Jadiye
> > <> wrote:
> > > FYI, I searched Google for this problem but didn't find any
> > > satisfactory answer.
> > > Here is the current situation: I have 8 shards in my SolrCloud backed
> > > by 3 ZooKeepers, all set up on AWS EC2 instances; all 8 are leaders
> > > with no replicas. I have only 1 collection, say collection1, divided
> > > into 8 shards. I have configured the index and tlog folders on each
> > > server to point to a 1TB EBS disk attached to each server; each of the
> > > 8 servers has around 100GB in its index folder, so in total I have
> > > ~800GB of index files. Now I want to move all the data to HDFS, so I
> > > am going to:
> > > 1. Set up HDFS on all 8 servers.
> > > 2. Merge all the indexes from the 8 servers.
> > > 3. Put them in HDFS.
> > > 4. Stop and start all my Solr servers on HDFS to access that common
> > > index data, setting the parameters below and a few more:
> > >   -Dsolr.directoryFactory=HdfsDirectoryFactory
> > >   -Dsolr.lock.type=hdfs
> > >   -Dsolr.updatelog=hdfs://host:port/path -jar
> > > Now could you tell me: is this the correct approach? If yes, how can I
> > > merge all the indices from the 8 servers?
> > > Regards,
> > > Amey
