lucene-solr-user mailing list archives

From Charlie Hull
Subject Re: [scottchu] What kind of configuration to use for this size of news data?
Date Wed, 11 May 2016 12:05:44 GMT
On 11/05/2016 10:55, scott.chu wrote:
> I just found that the mailing list does not seem to accept colored
> fonts (I received my own message from the list and the blue color was
> gone!). I am using rows of asterisks to highlight my questions and
> sending this again.

Answers inline below.

> ----- Original Message ----- From: scott (myself) To: solr-user Date:
> 2016/5/11 (Wed) 17:34 Subject: Re: [scottchu] What kind of
> configuration to use for this size of news data?
> Hi, Charlie,
> Thanks first for your concrete answer. I have further questions,
> written in blue below.
> scott.chu, 2016/5/11 (Wed) ----- Original
> Message ----- From: Charlie Hull
> Date: 2016/5/11 (Wed) 16:21 Subject: Re: [scottchu] What kind of
> configuration to use for this size of news data?
> On 11/05/2016 04:27, scott.chu wrote:
>> Fixed some typos, added some words and resending the same question =>
>> I want to build a Solr engine for news articles spanning over 60
>> years. My requirements are (I use Solr 5.4.1):
> Hi Scott,
> We've actually done something very similar for our client NLA
> Media Access in the UK, who handle licensing of most UK newspaper
> content. They have over 45m docs going back to 2006.
>> 1> Currently over 10M docs.
>> 2> Currently over 60GB total data size.
>> 3> The number of docs and the data size will keep growing at a rate
>>    of 1000 docs (or 8MB) per day.
>> 4> There are 5-6 different newspaper types in total.
>> My questions are:
>> 1> Is it workable enough just to use the master-slave model? Or
>>    should I turn to SolrCloud? (I ask this because our system
>>    management group has never managed a distributed system before
>>    and they have no knowledge of Zookeeper, shards, etc. They also
>>    don't know how to back up/restore distributed data.)
> Workable yes, advisable no. You should get much better reliability &
> performance with SolrCloud once it's set up. Also, if you have
> replication set up correctly the need for backup/restore will be
> significantly reduced and may be unnecessary.
> We used master-slave for News UK's Solr setup (articles from The
> Times and other papers) but this was before SolrCloud had properly
> arrived. We'd only use master-slave rarely now.
> *************************************************************************************************************************************************************
If I use SolrCloud, I know I have to set up Zookeeper. I know there is
something called a 'quorum' or 'ensemble' in Zookeeper terminology. I
also know there is a need for (2n+1) Zookeeper nodes per n SolrCloud
nodes. In your case, are you running one SolrCloud node per machine
(whether PM or VM)? In your experience, how many nodes, including both
SolrCloud's and Zookeeper's, do I need to set up? Is replication in
SolrCloud as easy to set up as in the old version? (I set up replication
in solrconfig.xml and use a file to set up/switch roles on each
Solr node, rather than defining the role directly in solrconfig.xml.)
> *************************************************************************************************************************************************************

You need at least 3 ZK nodes to form a quorum. How many SolrClouds you 
need will depend on how you decide to shard and replicate your data. 
There is no single answer to this - it depends on various factors 
including query load, query complexity, source data size and indexing. 
You should read this page.
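
To make the quorum point concrete, here is a rough sketch of the 
arithmetic in Python. It is only an illustration, not ZooKeeper code, 
and the function name zk_quorum is made up for this example. It assumes 
the usual majority rule: an ensemble stays available while more than 
half of its nodes are up.

```python
def zk_quorum(ensemble_size):
    """Return (majority needed, node failures tolerated) for an ensemble.

    Illustrative helper only; the name is hypothetical, not a ZooKeeper API.
    """
    if ensemble_size < 1:
        raise ValueError("an ensemble needs at least one node")
    majority = ensemble_size // 2 + 1   # strictly more than half
    return majority, ensemble_size - majority


for n in (1, 2, 3, 4, 5):
    majority, tolerated = zk_quorum(n)
    print(f"{n} ZK node(s): quorum={majority}, tolerates {tolerated} failure(s)")
```

So 2n+1 ZK nodes are what you need to survive n ZK failures; the figure 
is not tied to the number of SolrCloud nodes. Three ZK nodes tolerate 
one failure, and a fourth adds nothing over three, which is why odd 
ensemble sizes are the norm.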

You can run more than one Solr node per machine, but if that machine 
dies then your failover setup must be able to cope.

The *only* sensible way to figure out how many nodes you need is to try 
out a prototype system. I would guesstimate it will be less than 10 
nodes but don't hold me to that! Doing this will also teach you a lot 
about ZK and SolrCloud - you're not going to be able to avoid some 
learning here. Don't avoid looking at SolrCloud just because it involves 
ZK, the advantages outweigh the learning curve IMO.

>> 3> If I wish to create another Solr engine with one or two
>> particular paper types, is it possible to copy their index data
>> directly from the big central Solr engine? Or do I have to rebuild
>> the index from the raw article data? (Our business may need
>> this.)
> Yes, I guess so, but why copy it when you could just search it with
> a filter for the paper types?
> *************************************************************************************************************************************************************
> We have a special biz case called 'buyout newspaper search service'.
> Customers buy an intranet license to use the search service for
> articles of some newspaper types and some range of publish dates,
> e.g. paper type 'A' for 2010-2012 and paper type 'B' for 2015. The
> buyout means we have to install the whole search service at the
> customer site, and the customer can only use it within their
> enterprise intranet environment. So, you see, I have to build a
> dedicated Solr server for each such customer. Your idea of filtering
> is very much like ElasticSearch's multitenancy, and neither fits our
> buyout biz model. Do you have any suggestions for building a Solr
> server under these conditions?
> *************************************************************************************************************************************************************

You could use Solr's API to extract the subset of articles for 
papers/dates for reindexing into a new Solr core.
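
A minimal sketch of that extract-and-reindex idea in Python, using only 
the standard library. Several things here are assumptions, not details 
from your setup: the field names paper_type and publish_date, "id" as 
the uniqueKey, and the two core URLs passed in by the caller. It pages 
through the source core with cursorMark and posts each page to the new 
core. It can only copy fields that are stored in the source index.

```python
import json
import urllib.parse
import urllib.request


def build_export_params(paper_type, start, end, cursor="*", rows=500):
    """Query parameters for one page of a filtered, cursor-paged export."""
    return {
        "q": "*:*",
        "fq": [
            f"paper_type:{paper_type}",          # hypothetical field name
            f"publish_date:[{start} TO {end}]",  # hypothetical field name
        ],
        "sort": "id asc",  # cursorMark requires a sort on the uniqueKey
        "rows": rows,
        "cursorMark": cursor,
        "wt": "json",
    }


def copy_subset(source_core_url, target_core_url, paper_type, start, end):
    """Copy matching docs from one core to another, page by page."""
    cursor = "*"
    while True:
        params = build_export_params(paper_type, start, end, cursor)
        query = urllib.parse.urlencode(params, doseq=True)
        with urllib.request.urlopen(f"{source_core_url}/select?{query}") as resp:
            data = json.load(resp)

        docs = data["response"]["docs"]
        for doc in docs:
            # Drop the source core's version stamp so the new core
            # treats each doc as a plain add.
            doc.pop("_version_", None)

        if docs:
            request = urllib.request.Request(
                f"{target_core_url}/update?commitWithin=10000",
                data=json.dumps(docs).encode("utf-8"),
                headers={"Content-Type": "application/json"},
            )
            urllib.request.urlopen(request).close()

        next_cursor = data["nextCursorMark"]
        if next_cursor == cursor:  # an unchanged cursor means no more results
            break
        cursor = next_cursor
```

For the 2010-2012 example, the call would look something like 
copy_subset(source_url, target_url, "A", "2010-01-01T00:00:00Z", 
"2012-12-31T23:59:59Z"), run once per paper type and date range the 
customer has bought.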


>> I'd like to hear some good suggestions and experiences.
>> Thanks in advance and best regards.
>> Scott Chu @ 2016/5/11 11:26 GMT+8
> Hope this helps!
> Cheers
> Charlie

Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
