lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chetas Joshi <chetas.jo...@gmail.com>
Subject Re: Long GC pauses while reading Solr docs using Cursor approach
Date Thu, 13 Apr 2017 17:51:19 GMT
Hi Shawn,

Thanks for the insights into the memory requirements. Looks like cursor
approach is going to require a lot of memory for millions of documents.
If I run a query that returns only 500K documents still keeping 100K docs
per page, I don't see long GC pauses. So it is not really the number of
rows per page but the overall number of docs. May be I can reduce the
document cache and the field cache. What do you think?

Erick,

I was using the streaming approach to get back results from Solr but I was
running into some run time exceptions. That bug has been fixed in solr 6.0.
But because of some reasons, I won't be able to move to Java 8 and hence I
will have to stick to solr 5.5.0. That is the reason I had to switch to the
cursor approach.

Thanks!

On Wed, Apr 12, 2017 at 8:37 PM, Erick Erickson <erickerickson@gmail.com>
wrote:

> You're missing the point of my comment. Since they already are
> docValues, you can use the /export functionality to get the results
> back as a _stream_ and avoid all of the overhead of the aggregator
> node doing a merge sort and all of that.
>
> You'll have to do this from SolrJ, but see CloudSolrStream. You can
> see examples of its usage in StreamingTest.java.
>
> this should
> 1> complete much, much faster. The design goal is 400K rows/second but YMMV
> 2> use vastly less memory on your Solr instances.
> 3> only require _one_ query
>
> Best,
> Erick
>
> On Wed, Apr 12, 2017 at 7:36 PM, Shawn Heisey <apache@elyograg.org> wrote:
> > On 4/12/2017 5:19 PM, Chetas Joshi wrote:
> >> I am getting back 100K results per page.
> >> The fields have docValues enabled and I am getting sorted results based
> on "id" and 2 more fields (String: 32 Bytes and Long: 8 Bytes).
> >>
> >> I have a solr Cloud of 80 nodes. There will be one shard that will get
> top 100K docs from each shard and apply merge sort. So, the max memory
> usage of any shard could be 40 bytes * 100K * 80 = 320 MB. Why would heap
> memory usage shoot up from 8 GB to 17 GB?
> >
> > From what I understand, Java overhead for a String object is 56 bytes
> > above the actual byte size of the string itself.  And each character in
> > the string will be two bytes -- Java uses UTF-16 for character
> > representation internally.  If I'm right about these numbers, it means
> > that each of those id values will take 120 bytes -- and that doesn't
> > include the size the actual response (xml, json, etc).
> >
> > I don't know what the overhead for a long is, but you can be sure that
> > it's going to take more than eight bytes total memory usage for each one.
> >
> > Then there is overhead for all the Lucene memory structures required to
> > execute the query and gather results, plus Solr memory structures to
> > keep track of everything.  I have absolutely no idea how much memory
> > Lucene and Solr use to accomplish a query, but it's not going to be
> > small when you have 200 million documents per shard.
> >
> > Speaking of Solr memory requirements, under normal query circumstances
> > the aggregating node is going to receive at least 100K results from
> > *every* shard in the collection, which it will condense down to the
> > final result with 100K entries.  The behavior during a cursor-based
> > request may be more memory-efficient than what I have described, but I
> > am unsure whether that is the case.
> >
> > If the cursor behavior is not more efficient, then each entry in those
> > results will contain the uniqueKey value and the score.  That's going to
> > be many megabytes for every shard.  If there are 80 shards, it would
> > probably be over a gigabyte for one request.
> >
> > Thanks,
> > Shawn
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message