Mailing-List: contact solr-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: solr-user@lucene.apache.org
MIME-Version: 1.0
In-Reply-To: <CAN4YXvcd0RwbLDE11UmUwv65+doX0qQWgprzOENyKVNZbf+7nA@mail.gmail.com>
References: <CAA70BoXCYskAf4dXGdRwmwKnB5SKPgpb43foSDPz9LK2yKmUZA@mail.gmail.com>
 <0e04d6c8-a3cc-3824-542d-e45024d7876a@elyograg.org> <C36D72F2-925B-491D-9FEA-03BF34D0613D@wunderwood.org>
 <CAA70BoXOJQ2Cywaw6_aBu6NnmgWHnZ6dOk+wPZa2vnSYuegaFg@mail.gmail.com>
 <CAF8TkC6URbxpn33Nm8N0n6DYSAB5=xF4etceP=ShHFNg5P-i-Q@mail.gmail.com>
 <CAA70BoWLuhcu0VTk=-Ua4jkN=GNsZZn1ZVUrk2Yj9ZEFW0rxcg@mail.gmail.com>
 <CAN4YXvf_NWEjqbmzwxgKj-oQ1bcQUajVVXe==snZJ7iKv6SHDA@mail.gmail.com>
 <CAA70BoVR9wiN2Zd4w+U=Yy=NzqEtatVMB-g3CSDLyERp_vfR-A@mail.gmail.com>
 <a9fc1473-a4ac-84f0-99ca-90e02aa97e40@elyograg.org> <CAN4YXvcd0RwbLDE11UmUwv65+doX0qQWgprzOENyKVNZbf+7nA@mail.gmail.com>
From: Chetas Joshi <chetas.joshi@gmail.com>
Date: Thu, 13 Apr 2017 10:51:19 -0700
Message-ID: <CAA70BoWS-rb24hZW3gRgLC70BMc_CtNBB6XkYtC_X+-o-JdBZA@mail.gmail.com>
Subject: Re: Long GC pauses while reading Solr docs using Cursor approach
To: solr-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=001a11472176f74d8f054d0ff8c7
archived-at: Thu, 13 Apr 2017 17:51:28 -0000

--001a11472176f74d8f054d0ff8c7
Content-Type: text/plain; charset=UTF-8

Hi Shawn,

Thanks for the insights into the memory requirements. It looks like the cursor
approach is going to require a lot of memory for millions of documents.
If I run a query that returns only 500K documents, still keeping 100K docs
per page, I don't see long GC pauses. So the problem is not really the number
of rows per page but the overall number of matching docs. Maybe I can reduce
the document cache and the field cache. What do you think?
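
For reference, the back-of-the-envelope numbers from earlier in this thread
can be written out as a small Python sketch. It compares the raw 40 bytes per
entry (32-byte String plus 8-byte Long) with the roughly 120 bytes per id that
results from Shawn's figures. The 56-byte String overhead is taken from his
message and is not verified here, and the function and constant names are made
up for illustration. It also ignores Lucene/Solr bookkeeping and the response
itself, so it is a lower bound at best.

```python
# Rough estimate of what the aggregating node holds while merging one page.
# All names here are illustrative; the byte figures come from the thread.

ROWS_PER_PAGE = 100_000   # rows requested per cursor page
SHARDS = 80               # shards in the collection


def merge_bytes(rows_per_page, shards, bytes_per_entry):
    """Bytes needed if every shard returns rows_per_page sort entries."""
    return rows_per_page * shards * bytes_per_entry


# Raw payload only: 32-byte String + 8-byte Long.
raw = merge_bytes(ROWS_PER_PAGE, SHARDS, 32 + 8)

# Shawn's estimate for the id alone: 56 bytes of object overhead
# (his figure, unverified) plus 2 bytes per character (UTF-16) * 32 chars.
id_with_overhead = merge_bytes(ROWS_PER_PAGE, SHARDS, 56 + 2 * 32)

print(raw // 10**6, "MB raw")                       # 320 MB raw
print(id_with_overhead // 10**6, "MB for ids only")  # 960 MB for ids only
```

So even before counting the Long, the score, or any Solr-side structures, the
ids alone would be about three times my original 320 MB figure.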

Erick,

I was using the streaming approach to get results back from Solr, but I was
running into some runtime exceptions. That bug has been fixed in Solr 6.0.
However, for various reasons I won't be able to move to Java 8 (which Solr 6
requires), so I will have to stick with Solr 5.5.0. That is why I had to
switch to the cursor approach.
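
To make clear what "cursor approach" means here, below is a minimal Python
sketch of a cursorMark loop against Solr's HTTP API. It is not my actual
client code. It relies only on the documented cursor behaviour: start with
cursorMark=*, sort on fields that include the uniqueKey, send back the
nextCursorMark from each response, and stop when the returned mark equals the
one that was sent. The function names, the base URL and the injectable
`fetch` parameter are assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request


def fetch_page(base_url, params):
    """Send one request to a Solr /select URL and return the parsed JSON."""
    url = base_url + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def iter_docs(base_url, query, sort, rows, fetch=fetch_page):
    """Yield every matching document, one cursor page at a time.

    `sort` must include the uniqueKey field (e.g. "id asc"), which Solr
    requires for cursorMark. `fetch` can be swapped out for testing.
    """
    cursor = "*"  # initial cursor mark
    while True:
        params = {
            "q": query,
            "sort": sort,
            "rows": rows,
            "cursorMark": cursor,
            "wt": "json",
        }
        data = fetch(base_url, params)
        for doc in data["response"]["docs"]:
            yield doc
        next_cursor = data["nextCursorMark"]
        if next_cursor == cursor:  # unchanged mark: no more results
            break
        cursor = next_cursor
```

Usage would look something like
`for doc in iter_docs("http://localhost:8983/solr/mycollection/select", "*:*", "id asc", 100000): ...`
where the host and collection name are placeholders. Each page is a separate
distributed query, so the aggregating node repeats the per-shard merge for
every page.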

Thanks!

On Wed, Apr 12, 2017 at 8:37 PM, Erick Erickson <erickerickson@gmail.com>
wrote:

> You're missing the point of my comment. Since they already are
> docValues, you can use the /export functionality to get the results
> back as a _stream_ and avoid all of the overhead of the aggregator
> node doing a merge sort and all of that.
>
> You'll have to do this from SolrJ, but see CloudSolrStream. You can
> see examples of its usage in StreamingTest.java.
>
> This should
> 1> complete much, much faster. The design goal is 400K rows/second but YMMV
> 2> use vastly less memory on your Solr instances.
> 3> only require _one_ query
>
> Best,
> Erick
>
> On Wed, Apr 12, 2017 at 7:36 PM, Shawn Heisey <apache@elyograg.org> wrote:
> > On 4/12/2017 5:19 PM, Chetas Joshi wrote:
> >> I am getting back 100K results per page.
> >> The fields have docValues enabled and I am getting sorted results based
> >> on "id" and 2 more fields (String: 32 Bytes and Long: 8 Bytes).
> >>
> >> I have a solr Cloud of 80 nodes. There will be one shard that will get
> >> top 100K docs from each shard and apply merge sort. So, the max memory
> >> usage of any shard could be 40 bytes * 100K * 80 = 320 MB. Why would heap
> >> memory usage shoot up from 8 GB to 17 GB?
> >
> > From what I understand, Java overhead for a String object is 56 bytes
> > above the actual byte size of the string itself.  And each character in
> > the string will be two bytes -- Java uses UTF-16 for character
> > representation internally.  If I'm right about these numbers, it means
> > that each of those id values will take 120 bytes -- and that doesn't
> > include the size of the actual response (xml, json, etc).
> >
> > I don't know what the overhead for a long is, but you can be sure that
> > it's going to take more than eight bytes total memory usage for each one.
> >
> > Then there is overhead for all the Lucene memory structures required to
> > execute the query and gather results, plus Solr memory structures to
> > keep track of everything.  I have absolutely no idea how much memory
> > Lucene and Solr use to accomplish a query, but it's not going to be
> > small when you have 200 million documents per shard.
> >
> > Speaking of Solr memory requirements, under normal query circumstances
> > the aggregating node is going to receive at least 100K results from
> > *every* shard in the collection, which it will condense down to the
> > final result with 100K entries.  The behavior during a cursor-based
> > request may be more memory-efficient than what I have described, but I
> > am unsure whether that is the case.
> >
> > If the cursor behavior is not more efficient, then each entry in those
> > results will contain the uniqueKey value and the score.  That's going to
> > be many megabytes for every shard.  If there are 80 shards, it would
> > probably be over a gigabyte for one request.
> >
> > Thanks,
> > Shawn
> >
>

--001a11472176f74d8f054d0ff8c7--