lucene-solr-user mailing list archives

From Bill Bell <billnb...@gmail.com>
Subject Re: SolrCloud Scale Struggle
Date Sat, 02 Aug 2014 17:11:46 GMT
This seems like way overkill. Are you using /get at all? If you need the docs available right
away, why? How about after 30 seconds? How many docs get added per second during peak? Even
Google has a delay when you do AdWords.
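
As a rough sketch of the two options above (relaxing the commit window to about 30 seconds, and using /get for by-id lookups of documents that are not yet committed), the request URLs could be built like this. The host, port, and collection name `cars` are placeholders and not from the original thread. `commitWithin` is a standard Solr update parameter, and /get is the real-time get handler, which looks documents up by unique key only and needs the update log enabled.

```python
from urllib.parse import urlencode

# Hypothetical base URL; host, port and collection name are assumptions.
SOLR_BASE = "http://localhost:8983/solr/cars"


def update_url(commit_within_ms=30000):
    """URL for posting documents; asks Solr to commit within the given window (ms)."""
    return SOLR_BASE + "/update?" + urlencode({"commitWithin": commit_within_ms})


def realtime_get_url(doc_id):
    """URL for the real-time get handler, which looks a document up by its unique key."""
    return SOLR_BASE + "/get?" + urlencode({"id": doc_id})
```

Posting to `update_url()` instead of committing after every document lets Solr batch commits, while `realtime_get_url(...)` can cover the "does this doc already exist" check only if that check can be expressed as a unique-key lookup.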

One idea is to have an empty core that you insert into and then add as a shard in your queries. So
one core would be called newdocs, and you would add this core to your query. There are
a couple of issues with scoring in this setup, but it works nicely. I would not even use SolrCloud
for that core.
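
A minimal sketch of how the newdocs core could be added to a query, assuming it is done with Solr's `shards` request parameter (a comma-separated list of `host:port/solr/core` entries). The host names and the `cars_shard1` / `cars_shard2` core names are made up for illustration, and the sketch assumes the newdocs core shares the schema and unique key of the main index.

```python
from urllib.parse import urlencode

# Hypothetical shard addresses; only the name "newdocs" comes from the thread.
MAIN_SHARDS = ["host1:8983/solr/cars_shard1", "host2:8983/solr/cars_shard2"]
NEWDOCS_CORE = "host1:8983/solr/newdocs"


def select_url(query, entry_point="http://host1:8983/solr/cars_shard1"):
    """Build a distributed query that searches the main shards plus the newdocs core."""
    params = {"q": query, "shards": ",".join(MAIN_SHARDS + [NEWDOCS_CORE])}
    # Keep ':', '/' and ',' unescaped so the shard list stays readable.
    return entry_point + "/select?" + urlencode(params, safe=":/,")
```

For example, `select_url("make:Honda")` yields a single request that fans out to both main shards and to newdocs, so freshly inserted documents show up in results without forcing frequent commits on the large index.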

Try to reduce the number of Java processes running. Reduce memory and use one JVM per machine.

Then, if you need faster availability of docs, you really need to ask why. Why not later? Is it for
search, or just for showing the user the info? If it is for showing, maybe query a non-indexed table
for the few docs not yet indexed? Or just store them in a db to show the user the info and index later?
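
The "store in a db and index later" idea could look roughly like the following, using SQLite purely as a stand-in for whatever database is available. The table name `pending`, the function names, and the `search_index` callback are all hypothetical. Documents are written to the table first, reads check it before falling back to the search index, and rows are cleared once the batch has been indexed.

```python
import sqlite3


def open_pending_store(path=":memory:"):
    """Open (or create) the table that holds documents not yet indexed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pending (id TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    return conn


def save_pending(conn, doc_id, body):
    """Record a new or updated document so it can be shown before it is indexed."""
    conn.execute(
        "INSERT OR REPLACE INTO pending (id, body) VALUES (?, ?)", (doc_id, body)
    )
    conn.commit()


def lookup(conn, doc_id, search_index):
    """Return the pending copy if there is one, otherwise ask the search index."""
    row = conn.execute("SELECT body FROM pending WHERE id = ?", (doc_id,)).fetchone()
    if row is not None:
        return row[0]
    return search_index(doc_id)


def mark_indexed(conn, doc_ids):
    """Drop rows once the corresponding documents have been indexed and committed."""
    conn.executemany("DELETE FROM pending WHERE id = ?", [(d,) for d in doc_ids])
    conn.commit()
```

The design choice here is that the database only ever holds the small set of not-yet-indexed documents, so indexing can be batched and committed on a relaxed schedule.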

Bill Bell


> On Aug 1, 2014, at 4:19 AM, "anand.mahajan" <anand@zerebral.co.in> wrote:
> 
> Hello all,
> 
> Struggling to get this going with SolrCloud - 
> 
> Requirement in brief :
> - Ingest about 4M Used Cars listings a day and track all unique cars for
> changes
> - 4M automated searches a day (during the ingestion phase to check if a doc
> exists in the index (based on values of 4-5 key fields) or it is a new one
> or an updated version)
> - Of the 4 M - About 3M Updates to existing docs (for every non-key value
> change)
> - About 1M inserts a day (I'm assuming this many new listings come in
> every day)
> - Daily Bulk CSV exports of inserts / updates in last 24 hours of various
> snapshots of the data to various clients
> 
> My current deployment : 
> i) I'm using Solr 4.8 and have set up a SolrCloud with 6 dedicated machines
> - 24 Core + 96 GB RAM each.
> ii) There are over 190M docs in the SolrCloud at the moment (for all
> replicas it's consuming 2340 GB of disk overall, which implies each doc is
> about 5-8 KB in size.)
> iii) The docs are split into 36 shards, with 3 replicas per shard (in all
> 108 Solr Jetty processes split over 6 servers, leaving about 18 Jetty JVMs
> running on each host)
> iv) There are 60 fields per doc and all fields are stored at the moment  :( 
> (The backend is only Solr at the moment)
> v) The current shard/routing key is a combination of Car Year, Make and
> some other car level attributes that help classify the cars
> vi) We are mostly using the default Solr config as of now - no heavy caching
> as the search is pretty random in nature 
> vii) Autocommit is on - with maxDocs = 1
> 
> Current throughput & Issues :
> With the above mentioned deployment the daily throughput is only at about
> 1.5M on average (Inserts + Updates) - falling way short of what is required.
> Search is slow - Some queries take about 15 seconds to return - and since
> insert is dependent on at least one Search that degrades the write
> throughput too. (This is not a Solr issue - but the app demands it so)
> 
> Questions :
> 
> 1. Autocommit with maxDocs = 1 - is that a goof up and could that be slowing
> down indexing? It's a requirement that all docs are available as soon as
> indexed.
> 
> 2. Should I have been better served had I deployed a Single Jetty Solr
> instance per server with multiple cores running inside? The servers do start
> to swap out after a couple of days of Solr uptime - right now we reboot the
> entire cluster every 4 days.
> 
> 3. The routing key is not able to effectively balance the docs on available
> shards - There are a few shards with just about 2M docs - and others over
> 11M docs. Shall I split the larger shards? But I do not have more nodes /
> hardware to allocate to this deployment. In such case would splitting up the
> large shards give better read-write throughput? 
> 
> 4. To remain with the current hardware - would it help if I remove 1 replica
> each from a shard? But that would mean even when just 1 node goes down for a
> shard there would be only 1 live node left that would not serve the write
> requests.
> 
> 5. Also, is there a way to control where the Split Shard replicas would go?
> Is there a pattern / rule that Solr follows when it creates replicas for
> split shards?
> 
> 6. I read somewhere that creating a Core would cost the OS one thread and a
> file handle. Since a core represents an index in its entirety, would it not be
> allocated the configured number of write threads? (The default is 8.)
> 
> 7. The Zookeeper cluster is deployed on the same boxes as the Solr instance
> - Would separating the ZK cluster out help?
> 
> Sorry for the long thread - I thought of asking these all at once rather
> than posting separate ones.
> 
> Thanks,
> Anand
> 
> 
> 
> --
> View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-Scale-Struggle-tp4150592.html
> Sent from the Solr - User mailing list archive at Nabble.com.
