lucene-solr-user mailing list archives

From Stephen Weiss <Steve.We...@wgsn.com>
Subject Re: SolrCloud replicas consistently out of sync
Date Mon, 16 May 2016 22:52:22 GMT
Each node has one JVM with 16GB of RAM.  Are you suggesting we would put each shard into a
separate JVM (something like 32 nodes)?

We aren't encountering any OOMs.  We are testing this in a separate cloud that no one is
even using; the only activity is this very small amount of indexing, and still we see this
problem.  In the logs, there are no errors at all.  It's almost as if none of the recovery
features that people say are in Solr are actually there.  I can't find any evidence
that Solr is even attempting to keep the shards together.

There are no real errors in the solr log.  I do see some warnings at system startup:

http://pastie.org/private/thz0fbzcxgdreeeune8w

These lines in particular look interesting:

16925 INFO  (recoveryExecutor-3-thread-4-processing-n:172.20.140.173:8983_solr x:instock_shard15_replica1
s:shard15 c:instock r:core_node31) [c:instock s:shard15 r:core_node31 x:instock_shard15_replica1]
o.a.s.u.PeerSync PeerSync: core=instock_shard15_replica1 url=http://172.20.140.173:8983/solr
 Received 0 versions from http://172.20.140.172:8983/solr/instock_shard15_replica2/ fingerprint:{maxVersionSpecified=9223372036854775807,
maxVersionEncountered=1534492620385943552, maxInHash=1534492620385943552, versionsHash=-6845461210912808581,
numVersions=30888332, numDocs=30888332, maxDoc=37699007}
16925 INFO  (recoveryExecutor-3-thread-4-processing-n:172.20.140.173:8983_solr x:instock_shard15_replica1
s:shard15 c:instock r:core_node31) [c:instock s:shard15 r:core_node31 x:instock_shard15_replica1]
o.a.s.u.PeerSync PeerSync: core=instock_shard15_replica1 url=http://172.20.140.173:8983/solr
DONE. sync failed
16925 INFO  (recoveryExecutor-3-thread-4-processing-n:172.20.140.173:8983_solr x:instock_shard15_replica1
s:shard15 c:instock r:core_node31) [c:instock s:shard15 r:core_node31 x:instock_shard15_replica1]
o.a.s.c.RecoveryStrategy PeerSync Recovery was not successful - trying replication.

This is the first node to start up, so most of the other shards are not there yet.
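
As a rough sketch (not something that ships with Solr), the fingerprint fields in a line like the one above can be pulled out and compared between replicas. The function names are made up for illustration, and the parsing assumes the `fingerprint:{key=value, ...}` layout seen in this log, with one such line collected per replica.

```python
import re

# Matches the "fingerprint:{k=v, k=v, ...}" portion of a PeerSync log entry.
FINGERPRINT_RE = re.compile(r"fingerprint:\{([^}]*)\}")


def parse_fingerprint(log_entry: str) -> dict:
    """Return the fingerprint fields of one PeerSync log entry as ints."""
    # Entries may be wrapped across several lines, so collapse whitespace first.
    flat = " ".join(log_entry.split())
    match = FINGERPRINT_RE.search(flat)
    if match is None:
        raise ValueError("no fingerprint found in log entry")
    fields = {}
    for pair in match.group(1).split(","):
        key, _, value = pair.strip().partition("=")
        fields[key] = int(value)
    return fields


def diff_fingerprints(a: dict, b: dict) -> dict:
    """Return {field: (a_value, b_value)} for every field that differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```

An empty result from `diff_fingerprints` would mean the two fingerprints agree on every field; a difference in `numDocs` or `versionsHash` would point at the kind of divergence described in this thread.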

On another node (the last node to start up), it looks similar but a little different:

http://pastie.org/private/xjw0ruljcurdt4xpzqk6da

74090 INFO  (recoveryExecutor-3-thread-1-processing-n:172.20.140.177:8983_solr x:instock_shard25_replica2
s:shard25 c:instock r:core_node60) [c:instock s:shard25 r:core_node60 x:instock_shard25_replica2]
o.a.s.c.RecoveryStrategy Attempting to PeerSync from [http://172.20.140.170:8983/solr/instock_shard25_replica1/]
- recoveringAfterStartup=[true]
74091 INFO  (recoveryExecutor-3-thread-1-processing-n:172.20.140.177:8983_solr x:instock_shard25_replica2
s:shard25 c:instock r:core_node60) [c:instock s:shard25 r:core_node60 x:instock_shard25_replica2]
o.a.s.u.PeerSync PeerSync: core=instock_shard25_replica2 url=http://172.20.140.177:8983/solr
START replicas=[http://172.20.140.170:8983/solr/instock_shard25_replica1/] nUpdates=100
74091 WARN  (recoveryExecutor-3-thread-1-processing-n:172.20.140.177:8983_solr x:instock_shard25_replica2
s:shard25 c:instock r:core_node60) [c:instock s:shard25 r:core_node60 x:instock_shard25_replica2]
o.a.s.u.PeerSync no frame of reference to tell if we've missed updates
74091 INFO  (recoveryExecutor-3-thread-1-processing-n:172.20.140.177:8983_solr x:instock_shard25_replica2
s:shard25 c:instock r:core_node60) [c:instock s:shard25 r:core_node60 x:instock_shard25_replica2]
o.a.s.c.RecoveryStrategy PeerSync Recovery was not successful - trying replication.

Every single replica shows messages like these (one pattern or the other).

I should add, beyond the block joins / nested children & grandchildren, there's really
nothing unusual about this cloud at all.  It's a very basic collection (simple enough that it can
be created in the GUI) and a dist installation of Solr 6.  There are 3 independent ZooKeeper
servers (again, vanilla from dist), and there don't appear to be any ZooKeeper issues.

--
Steve

On Mon, May 16, 2016 at 12:02 PM, Erick Erickson <erickerickson@gmail.com> wrote:
8 nodes, 4 shards apiece? All in the same JVM? People have gotten by
the GC pain by running in separate JVMs with less Java memory each on
big beefy machines.... That's not a recommendation as much as an
observation.

That aside, unless you have some very strange stuff going on this is
totally weird. Are you hitting OOM errors at any time you have this
problem? Once you hit an OOM error, all bets are off about how Java
behaves. If you are hitting those, you can't hope for stability until
you fix that issue. In your writeup there's some evidence for this
when you say that if you index multiple docs at a time you get
failures.

Do your Solr logs show any anomalies? My guess is that you'll see
exceptions in your Solr logs that will shed light on the issue.

Best,
Erick

On Mon, May 16, 2016 at 8:03 AM, Stephen Weiss <Steve.Weiss@wgsn.com> wrote:
> Hi everyone,
>
> I'm running into a problem with SolrCloud replicas and thought I would ask the list to
> see if anyone else has seen this / gotten past it.
>
> Right now, we are running with only one replica per shard.  This is obviously a problem
> because if one node goes down anywhere, the whole collection goes offline, and due to garbage
> collection issues, this happens about once or twice a week, causing a great deal of instability.
> If we try to increase to 2 replicas per shard, once we index new documents and the shards
> autocommit, the shards all get out of sync with each other, with different numbers of documents,
> different numbers of documents deleted, different facet counts - pretty much totally divergent
> indexes.  Shards always show green and available, and never go into recovery or any other
> state that would indicate there's a mismatch.  There are also no errors in the logs to indicate
> anything is going wrong.  Even long after indexing has finished, the replicas never come back
> into sync.  The only way to get consistency again is to delete one set of replicas and then
> add them back in.  Unfortunately, when we do this, we invariably discover that many documents
> (2-3%) are missing from the index.
>
> We have tried setting the min_rf parameter, and have found that when setting min_rf=2,
> we almost never get back rf=2.  We almost always get rf=1, resend the request, and it basically
> just goes into an infinite loop.  The only way to get rf=2 to come back is to only index one
> document at a time.  Unfortunately, we have to update millions of documents a day and it isn't
> really feasible to index this way, and even when indexing one document at a time, we still
> occasionally find ourselves in an infinite loop.  This doesn't appear to be related to the
> documents we are indexing - if we stop the index process and bounce Solr, the exact same document
> will go through fine the next time, until indexing gets stuck on another random document.
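
The resend loop described above could at least be bounded so it fails loudly instead of spinning. A minimal sketch, assuming a hypothetical `send_batch(docs, min_rf)` callable that posts the batch with `min_rf` set and returns the achieved `rf` that Solr reports; no real client API is implied, and the attempt count and backoff are arbitrary:

```python
import time


def send_with_min_rf(send_batch, docs, min_rf=2, max_attempts=5, backoff_s=0.5):
    """Resend docs until the achieved rf reaches min_rf, or give up.

    send_batch is a hypothetical callable: send_batch(docs, min_rf) -> achieved rf.
    Returns (attempts_used, achieved_rf) on success.
    """
    achieved = 0
    for attempt in range(1, max_attempts + 1):
        achieved = send_batch(docs, min_rf)
        if achieved >= min_rf:
            return attempt, achieved
        if attempt < max_attempts:
            # Linear backoff between attempts; no sleep after the final failure.
            time.sleep(backoff_s * attempt)
    raise RuntimeError(
        f"achieved rf={achieved} after {max_attempts} attempts, wanted min_rf={min_rf}"
    )
```

Raising after a fixed number of attempts at least turns the silent loop into a visible failure that can be logged along with the batch that triggered it.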
>
> We have 8 nodes, with 4 shards apiece, all running one collection with about 900M documents.
> An important note is that we have a block join system with 3 tiers of documents (products
> -> skus -> sku_history).  During indexing, we are forced to delete all documents for
> a product prior to adding the product back into the index, in order to avoid orphaned children
> / grandchildren.  All documents are consistently indexed with the top-level product ID so
> that we can delete all child/grandchild documents prior to updating the document.  So, for
> each updated document, we are sending through a delete call followed by an add call.  We have
> tried putting both the delete and add in the same update request with the same results.
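
For reference, the delete-then-add pair sent as one update request might look roughly like this in Solr's JSON update format. This is a sketch only: the `product_id` field name and the document shape are hypothetical (the actual schema isn't shown in this thread), and nested children are assumed to be carried under `_childDocuments_`.

```python
import json


def build_replace_request(product_id, product_doc, id_field="product_id"):
    """Build one JSON update body: delete the whole block, then re-add it.

    id_field is a hypothetical field holding the top-level product ID on
    every document in the block (parent, children and grandchildren).
    """
    # Escape backslashes and quotes so the ID is safe inside a quoted term.
    escaped = product_id.replace("\\", "\\\\").replace('"', '\\"')
    commands = {
        # Python dicts keep insertion order, so "delete" is serialized first.
        "delete": {"query": f'{id_field}:"{escaped}"'},
        "add": {"doc": product_doc},
    }
    return json.dumps(commands)
```

Usage, with made-up IDs:

```python
body = build_replace_request(
    "P-1",
    {
        "id": "P-1",
        "product_id": "P-1",
        "_childDocuments_": [{"id": "S-1", "product_id": "P-1"}],
    },
)
```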
>
> All we see out there on Google is that none of what we're seeing should be happening.
>
> We are currently running Solr 6.0 with ZooKeeper 3.4.6.  We experienced the same behavior
> on 5.4 as well.
>
> --
> Steve
>
> ________________________________
>
> WGSN is a global foresight business. Our experts provide deep insight and analysis of
consumer, fashion and design trends. We inspire our clients to plan and trade their range
with unparalleled confidence and accuracy. Together, we Create Tomorrow.
>
> WGSN<http://www.wgsn.com/> is part of WGSN Limited, comprising of market-leading
products including WGSN.com<http://www.wgsn.com>, WGSN Lifestyle & Interiors<http://www.wgsn.com/en/lifestyle-interiors>,
WGSN INstock<http://www.wgsninstock.com/>, WGSN StyleTrial<http://www.wgsn.com/en/styletrial/>
and WGSN Mindset<http://www.wgsn.com/en/services/consultancy/>, our bespoke consultancy
services.
>
> The information in or attached to this email is confidential and may be legally privileged.
If you are not the intended recipient of this message, any use, disclosure, copying, distribution
or any action taken in reliance on it is prohibited and may be unlawful. If you have received
this message in error, please notify the sender immediately by return email and delete this
message and any copies from your computer and network. WGSN does not warrant that this email
and any attachments are free from viruses and accepts no liability for any loss resulting
from infected email transmissions.
>
> WGSN reserves the right to monitor all email through its networks. Any views expressed
may be those of the originator and not necessarily of WGSN. WGSN is powered by Ascential plc<http://www.ascential.com>,
which transforms knowledge businesses to deliver exceptional performance.
>
> Please be advised all phone calls may be recorded for training and quality purposes and
by accepting and/or making calls from and/or to us you acknowledge and agree to calls being
recorded.
>
> WGSN Limited, Company number 4858491
>
> registered address:
>
> Ascential plc, The Prow, 1 Wilder Walk, London W1B 5AP
>
> WGSN Inc., tax ID 04-3851246, registered office c/o National Registered Agents, Inc.,
160 Greentree Drive, Suite 101, Dover DE 19904, United States
>
> 4C Serviços de Informação Ltda., CNPJ/MF (Taxpayer's Register): 15.536.968/0001-04,
Address: Avenida Cidade Jardim, 377, 7˚ andar CEP 01453-000, Itaim Bibi, São Paulo
>
> 4C Business Information Consulting (Shanghai) Co., Ltd, 富新商务信息咨询(上海)有限公司,
registered address Unit 4810/4811, 48/F Tower 1, Grand Gateway, 1 Hong Qiao Road, Xuhui District,
Shanghai

