lucene-dev mailing list archives

From "Mark Miller (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-9824) Documents indexed in bulk are replicated using too many HTTP requests
Date Wed, 28 Dec 2016 12:43:58 GMT

    [ https://issues.apache.org/jira/browse/SOLR-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15782818#comment-15782818
] 

Mark Miller commented on SOLR-9824:
-----------------------------------

bq. put that into an else branch. 

I'll do that.

bq. there's a race due to inPoll just being a volatile variable and so it might be false and
we might not interrupt when we actually wanted to, or vice versa... but I suppose it may not
be a big issue since the queue is poll'ed with timeouts that don't take forever. Adding comments
to this effect would be good.

Yeah, I don't think it's an issue. Distributed updates do use a very large timeout, but
our use of blockUntilFinished will loop and interrupt again. We should not technically need
this right now, but I like that it makes it safe for future code additions. For standard use
it's really just a best effort to cut off any wait. I've done extensive testing with
various update rates and update thread counts and have not seen an issue yet.
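
For readers following along, here is a minimal sketch of the pattern being discussed: a runner that polls a queue with a long timeout, a volatile inPoll flag, and a blockUntilFinished that keeps looping and interrupting. It is not the ConcurrentUpdateSolrClient source. The class name PollingRunner, the fields, and the 50 ms join interval are all assumptions made for illustration; only the names inPoll and blockUntilFinished come from the discussion above.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a queue runner; NOT the ConcurrentUpdateSolrClient source.
class PollingRunner implements Runnable {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Thread thread = new Thread(this, "runner");
    private final long pollMillis;
    // volatile gives visibility only; "read flag, then interrupt" is not atomic.
    private volatile boolean inPoll = false;
    private volatile boolean closing = false;
    final List<String> sent = new CopyOnWriteArrayList<>();

    PollingRunner(long pollMillis) { this.pollMillis = pollMillis; }

    void add(String doc) { queue.add(doc); }

    void start() {
        thread.setDaemon(true);
        thread.start();
    }

    @Override
    public void run() {
        while (true) {
            String doc = null;
            try {
                inPoll = true;
                doc = queue.poll(pollMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                // Woken early, or hit by a stale interrupt: fall through and re-check state.
            } finally {
                inPoll = false;
            }
            if (doc != null) {
                sent.add(doc);            // stand-in for streaming the doc
            } else if (closing && queue.isEmpty()) {
                return;                   // nothing left to send
            }
        }
    }

    // Best effort: the runner may leave poll() right after we read true,
    // or enter poll() right after we read false.
    private void interruptPoll() {
        if (inPoll) {
            thread.interrupt();
        }
    }

    // Looping makes a missed interrupt harmless: the next pass tries again.
    void blockUntilFinished() {
        closing = true;
        while (thread.isAlive()) {
            interruptPoll();
            try {
                thread.join(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    public static void main(String[] args) {
        PollingRunner runner = new PollingRunner(10_000); // deliberately long poll timeout
        runner.add("a");
        runner.add("b");
        runner.start();
        runner.blockUntilFinished();
        System.out.println(runner.sent); // [a, b]
    }
}
```

In this sketch a mistimed interrupt costs at most one extra pass through the loop, and a missed one is retried on the next join timeout, which is why the race on the flag is tolerable as long as the caller loops.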

bq. CUSC

Yonik did almost a rewrite of it not too long ago to fix some bugs, and I don't have much
appetite to rework it. There are tons of subtle things that can go wrong. It's complex, but
I think the way it was written, it kind of is what it is. I think if we want a simpler model,
we should probably create a new class with a different streaming design.

I think the queue synchronization is really simple, and the runners as well. That is fairly simple
multithreaded code. I think the complication is in other parts of the design myself.

This class is a bit advanced for sure though. You have to be willing to spend some time to
have confidence changing it.


> Documents indexed in bulk are replicated using too many HTTP requests
> ---------------------------------------------------------------------
>
>                 Key: SOLR-9824
>                 URL: https://issues.apache.org/jira/browse/SOLR-9824
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: 6.3
>            Reporter: David Smiley
>            Assignee: Mark Miller
>         Attachments: SOLR-9824.patch, SOLR-9824.patch, SOLR-9824.patch, SOLR-9824.patch,
SOLR-9824.patch, SOLR-9824.patch
>
>
> This takes a while to explain; bear with me. While working on bulk indexing small documents,
I looked at the logs of my SolrCloud nodes.  I noticed that shards would see an /update log
message every ~6ms which is *way* too much.  These are requests from one shard (that isn't
a leader/replica for these docs but the recipient from my client) to the target shard leader
(no additional replicas).  One might ask why I'm not sending docs to the right shard in the
first place; I have a reason but it's beside the point -- there's a real Solr perf problem
here and this probably applies equally to replicationFactor>1 situations too.  I could
turn off the logs but that would hide useful stuff, and it's disconcerting to me that so many
short-lived HTTP requests are happening, somehow at the behest of DistributedUpdateProcessor.
 After lots of analysis and debugging and hair pulling, I finally figured it out.  
> SOLR-7333 ([~tpot]) introduced an optimization called {{UpdateRequest.isLastDocInBatch()}}
in which ConcurrentUpdateSolrClient will poll with a '0' timeout to the internal queue, so
that it can close the connection without it hanging around any longer than needed.  This part
makes sense to me.  Currently the only spot that has the smarts to set this flag is {{JavaBinUpdateRequestCodec.unmarshal.readOuterMostDocIterator()}}
at the last document.  So if a shard received docs in a javabin stream (but not other formats)
one would expect the _last_ document to have this flag.  There's even a test.  Docs without
this flag get the default poll time; for javabin it's 25ms.  Okay.
> I _suspect_ that if someone used CloudSolrClient or HttpSolrClient to send javabin data
in a batch, the intended efficiencies of SOLR-7333 would apply.  I didn't try. In my case,
I'm using ConcurrentUpdateSolrClient (and BTW DistributedUpdateProcessor uses CUSC too). 
CUSC uses the RequestWriter (defaulting to javabin) to send each document separately without
any leading marker or trailing marker.  For the XML format by comparison, there is a leading
and trailing marker (<stream> ... </stream>).  Since there's no outer container
for the javabin unmarshalling to detect the last document, it marks _every_ document as {{req.lastDocInBatch()}}!
 Ouch!
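
To make the effect described above concrete, here is a deliberately simplified model, not Solr code. It assumes a single runner, treats sending a document as instantaneous, and says a request stays open only if the next document arrives within the poll timeout chosen after the previous one: 0 ms when that document carried the last-in-batch flag, otherwise the 25 ms default quoted in the description. The class name PollTimeoutModel, the method countRequests, and the 6 ms arrival spacing (borrowed from the "every ~6ms" log observation) are illustrative assumptions.

```java
import java.util.Arrays;

// Hypothetical, simplified model of the poll-timeout behavior; NOT Solr code.
class PollTimeoutModel {
    // Default poll time the issue description quotes for javabin.
    static final long DEFAULT_POLL_MS = 25;

    /**
     * Counts how many HTTP requests a single runner would open.
     * arrivalMs[i] is when doc i reaches the queue; lastDocInBatch[i] is its flag.
     */
    static int countRequests(long[] arrivalMs, boolean[] lastDocInBatch) {
        if (arrivalMs.length == 0) {
            return 0;
        }
        int requests = 1; // the first doc always opens a request
        for (int i = 0; i + 1 < arrivalMs.length; i++) {
            // A flagged doc makes the runner poll with a 0 timeout.
            long pollMs = lastDocInBatch[i] ? 0 : DEFAULT_POLL_MS;
            long gapMs = arrivalMs[i + 1] - arrivalMs[i];
            if (gapMs > pollMs) {
                requests++; // poll came back empty: close and reopen
            }
        }
        return requests;
    }

    public static void main(String[] args) {
        int n = 100;
        long[] arrivals = new long[n];
        for (int i = 0; i < n; i++) {
            arrivals[i] = 6L * i; // one doc every 6 ms
        }

        boolean[] onlyLastFlagged = new boolean[n];
        onlyLastFlagged[n - 1] = true;   // the intended SOLR-7333 behavior

        boolean[] everyDocFlagged = new boolean[n];
        Arrays.fill(everyDocFlagged, true); // what the description reports

        System.out.println(countRequests(arrivals, onlyLastFlagged)); // 1
        System.out.println(countRequests(arrivals, everyDocFlagged)); // 100
    }
}
```

Under these assumptions, flagging only the final document keeps all 100 documents on one request, while flagging every document yields one request per document, which matches the flood of /update log lines reported above.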



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

