lucene-solr-user mailing list archives

From "Ephraim Ofir" <Ephra...@icq.com>
Subject RE: Very very large scale Solr Deployment = how to do (Expert Question)?
Date Thu, 07 Apr 2011 07:32:48 GMT
You can't view it online, but you should be able to download it from:
https://docs.google.com/leaf?id=0BwOEbnJ7oeOrNmU5ZThjODUtYzM5MS00YjRlLWI2OTktZTEzNDk1YmVmOWU4&hl=en&authkey=COGel4gP

Enjoy,
Ephraim Ofir


-----Original Message-----
From: Jens Mueller [mailto:supidupi007@googlemail.com] 
Sent: Thursday, April 07, 2011 8:30 AM
To: solr-user@lucene.apache.org
Subject: Re: Very very large scale Solr Deployment = how to do (Expert Question)?

Hello Ephraim, hello Lance, hello Walter,

thanks for your replies:

Ephraim, thanks very much for the further detailed explanation. I will
try to set up a demo system in the next few days and use your advice.
Load balancers are an important aspect of your design. Can you recommend
one LB specifically? (I would be using haproxy.1wt.eu.) I think the idea
of uploading your document is very good. However, Google Docs did not
seem to work (at least for me, with the docx format), but maybe you can
simply export the document as PDF; I think Google Docs would then work,
so all the others can also have a look at your concept. The best
approach would be to upload your advice directly to the Solr wiki, as it
is really helpful. I have found some other documents in the meantime,
but yours is much clearer and more complete, with the LBs and the
aggregators (
http://lucene-eurocon.org/slides/Solr-In-The-Cloud_Mark-Miller.pdf)

Lance, thanks, I will have a look at what LinkedIn is doing.

Walter, thanks for the advice. You are right to mention Google. My
question was also meant to understand how large systems like
Google/Facebook actually work, so my numbers are just theoretical and
made up. My system will be smaller, but I would be very happy to
understand how such large systems are built, and I think the approach
Ephraim showed should work quite well at large scale. If you know a good
document (besides the Bigtable research paper, which I already know)
that technically describes how Google works in detail, that would be of
great interest. You seem to be working for a company that handles large
datasets. Does Google use this approach, sharding the index across N
writers, with the produced index then replicated to N "read-only
searchers"?
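
To make the question concrete, here is a minimal Python sketch of that
layout: documents are hashed to one of N writer shards, and each query
picks one read-only searcher per shard. The shard count, the host names
and both function names are assumptions made up for illustration; this
is not Ephraim's design and not a Solr API.

```python
import hashlib
import random

def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a writer shard using a stable hash."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def pick_searchers(replicas: dict[int, list[str]], rng: random.Random) -> list[str]:
    """Choose one read-only searcher per shard for a single query."""
    return [rng.choice(hosts) for _, hosts in sorted(replicas.items())]

# Hypothetical topology: 4 writer shards, each replicated to 3 searchers.
NUM_SHARDS = 4
replicas = {
    shard: [f"searcher-{shard}-{r}" for r in range(3)]
    for shard in range(NUM_SHARDS)
}

if __name__ == "__main__":
    print("doc-42 is indexed on shard", shard_for("doc-42", NUM_SHARDS))
    print("query fans out to", pick_searchers(replicas, random.Random()))
```

Updates touch only the one writer that owns the document, while a query
has to reach every shard, which is why read capacity is added by
replicating each shard rather than by adding shards.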

thank you all.
best regards
jens



2011/4/7 Walter Underwood <wunder@wunderwood.org>

> The bigger answer is that you cannot get to this size by just configuring
> Solr. You may have to invent a lot of stuff. Like all of Google.
>
> Where did you get these numbers? The proposed query rate is twice as big as
> Google (Feb 2010 estimate, 34K qps).
>
> I work at MarkLogic, and we scale to 100's of terabytes, with fast update
> and query rates. If you want a real system that handles that, you might want
> to look at our product.
>
> wunder
>
> On Apr 6, 2011, at 8:06 PM, Lance Norskog wrote:
>
> > I would not use replication. LinkedIn consumer search is a flat system
> > where one process indexes new entries and does queries simultaneously.
> > It's a custom Lucene app called Zoie. Their stuff is on GitHub.
> >
> > I would get documents to indexers via a multicast IP-based queueing
> > system. This scales very well and there's a lot of hardware support.
> >
> > The problem with distributed search is that it is a) inherently slower
> > and b) has inherently more and longer jitter. The "airplane wing"
> > distribution of query times becomes longer and flatter.
> >
> > This is going to have to be a "federated" system, where the front-end
> > app aggregates results rather than Solr.
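
A minimal sketch of what such front-end aggregation could look like,
assuming each shard returns its hits already sorted by descending score
and that scores are comparable across shards. The `Hit` type and
`merge_top_k` are hypothetical names, not part of Solr or Zoie.

```python
import heapq
import itertools
from typing import Iterable

Hit = tuple[float, str]  # (score, doc_id); a made-up result shape

def merge_top_k(shard_results: Iterable[list[Hit]], k: int) -> list[Hit]:
    """Merge per-shard hit lists (each sorted by descending score)
    into the k best hits overall."""
    # heapq.merge expects ascending inputs, so order by negated score.
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
    return list(itertools.islice(merged, k))

if __name__ == "__main__":
    shard_a = [(0.9, "a1"), (0.4, "a2")]
    shard_b = [(0.8, "b1"), (0.7, "b2")]
    print(merge_top_k([shard_a, shard_b], k=3))
```

The merge is lazy, so the front end only reads as many hits from each
shard's list as it needs to fill the top k.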
> >
> > On Mon, Apr 4, 2011 at 6:25 PM, Jens Mueller <supidupi007@googlemail.com> wrote:
> >> Hello Experts,
> >>
> >>
> >>
> >> I am a Solr newbie but have read quite a lot of docs. I still do not
> >> understand what would be the best way to set up very large scale
> >> deployments:
> >>
> >>
> >>
> >> Goal (theoretical):
> >>
> >>  A) Index size: 1 petabyte (1 document is about 5 KB in size)
> >>
> >>  B) Queries: 100000 queries per second
> >>
> >>  C) Updates: 100000 updates per second
> >>
> >>
> >>
> >>
> >> Solr offers:
> >>
> >> 1.)    Replication => Scales well for B) BUT A) and C) are not
> >> satisfied
> >>
> >>
> >> 2.)    Sharding => Scales well for A) BUT B) and C) are not satisfied
> >> (=> As I understand the sharding approach, everything goes through a
> >> central server that dispatches the updates and assembles the query
> >> results retrieved from the different shards. But this central server
> >> also has capacity limits...)
> >>
> >>
> >>
> >>
> >> What is the right approach to handle such large deployments? I would be
> >> thankful for just a rough sketch of the concepts so I can
> >> experiment/search further...
> >>
> >>
> >> Maybe I am missing something very trivial, as I think some of the "Solr
> >> Users/Use Cases" on the homepage are that kind of large deployment. How
> >> are they implemented?
> >>
> >>
> >>
> >> Thank you very much!!!
> >>
> >> Jens
> >>
> >
>
>
>
>
>
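
For a sense of scale, the goals in the original question can be turned
into rough node counts. Only the three targets and the 5 KB document
size come from the thread; the per-shard index size and the per-searcher
query rate below are hypothetical figures picked for illustration, and
decimal units (1 PB = 10^15 bytes) are assumed.

```python
# Targets stated in the original question.
INDEX_BYTES = 10**15       # 1 petabyte (decimal units assumed)
DOC_BYTES = 5 * 10**3      # about 5 KB per document
TARGET_QPS = 100_000       # queries per second
TARGET_UPS = 100_000       # updates per second

# Hypothetical per-node capacities, for illustration only.
SHARD_BYTES = 500 * 10**9  # index bytes one shard can hold
QPS_PER_SEARCHER = 50      # queries per second one searcher sustains

def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)

num_docs = INDEX_BYTES // DOC_BYTES
num_shards = ceil_div(INDEX_BYTES, SHARD_BYTES)
# A query must reach every shard, so each shard sees the full query rate.
replicas_per_shard = ceil_div(TARGET_QPS, QPS_PER_SEARCHER)
total_searchers = num_shards * replicas_per_shard
# An update goes to exactly one shard, so the update load divides evenly.
updates_per_shard = TARGET_UPS // num_shards

if __name__ == "__main__":
    print(f"documents:          {num_docs:,}")
    print(f"shards:             {num_shards:,}")
    print(f"replicas per shard: {replicas_per_shard:,}")
    print(f"total searchers:    {total_searchers:,}")
    print(f"updates/s per shard: {updates_per_shard}")
```

Under these made-up capacities the update rate per shard is modest, but
the searcher count grows as shards times replicas, which is the cost of
combining sharding for A) with replication for B).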
