lucene-solr-user mailing list archives

From: Shawn Heisey <apa...@elyograg.org>
Subject: Re: Long GC pauses while reading Solr docs using Cursor approach
Date: Thu, 13 Apr 2017 02:36:25 GMT
On 4/12/2017 5:19 PM, Chetas Joshi wrote:
> I am getting back 100K results per page.
> The fields have docValues enabled and I am getting sorted results
> based on "id" and 2 more fields (String: 32 Bytes and Long: 8 Bytes).
>
> I have a SolrCloud of 80 nodes. There will be one shard that will get
> the top 100K docs from each shard and apply a merge sort. So, the max
> memory usage of any shard could be 40 bytes * 100K * 80 = 320 MB. Why
> would heap memory usage shoot up from 8 GB to 17 GB?

From what I understand, the Java overhead for a String object is 56
bytes on top of the actual byte size of the string itself.  And each
character in the string will be two bytes -- Java uses UTF-16 for
character representation internally.  If I'm right about these numbers,
it means that each of those id values will take 120 bytes -- and that
doesn't include the size of the actual response (xml, json, etc).

I don't know exactly what the overhead for a long is, but you can be
sure that it's going to take more than eight bytes of total memory for
each one.
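To put rough numbers on that, here's a back-of-envelope sketch in Java.
The 56-byte String overhead is the figure from above; the 16-byte boxed
Long is my assumption for a typical 64-bit JVM, so treat both as
estimates rather than measurements:

// Rough per-document heap estimate for the sort fields discussed above.
public class SortFieldEstimate {
    static final int STRING_OVERHEAD = 56; // assumed String + array header overhead
    static final int BYTES_PER_CHAR = 2;   // Java strings are UTF-16 internally
    static final int BOXED_LONG = 16;      // assumed: object header + 8-byte value

    public static void main(String[] args) {
        int fieldChars = 32;  // the 32-byte string field from the question
        long perDoc = (STRING_OVERHEAD + fieldChars * BYTES_PER_CHAR) // 120 bytes
                + BOXED_LONG;  // 136 bytes total; the id string would add more

        long total = perDoc * 100_000L * 80;  // 100K docs from each of 80 shards
        // Prints about 1037 MB -- around a gigabyte before counting any
        // Lucene/Solr bookkeeping or the response formatting.
        System.out.printf("per doc: %d bytes, total: %d MB%n",
                perDoc, total / (1024 * 1024));
    }
}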

Then there is overhead for all the Lucene memory structures required to
execute the query and gather results, plus Solr memory structures to
keep track of everything.  I have absolutely no idea how much memory
Lucene and Solr use to accomplish a query, but it's not going to be
small when you have 200 million documents per shard.

Speaking of Solr memory requirements, under normal query circumstances
the aggregating node is going to receive at least 100K results from
*every* shard in the collection, which it will condense down to the
final result with 100K entries.  The behavior during a cursor-based
request may be more memory-efficient than what I have described, but I
am unsure whether that is the case.

If the cursor behavior is not more efficient, then each entry in those
results will contain the uniqueKey value and the score.  At more than a
hundred bytes per entry, 100K entries comes to many megabytes for every
shard -- and with 80 shards, that would probably be over a gigabyte for
one request.
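If the per-request footprint really does scale like that, it may be
worth trying smaller cursor pages so each request holds fewer results
in memory at once.  Here's a rough SolrJ sketch of the cursor loop (the
URL, collection name, and the 10000 page size are placeholders, not
recommendations):

import java.io.IOException;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorPager {
    public static void main(String[] args)
            throws SolrServerException, IOException {
        try (SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build()) {
            SolrQuery query = new SolrQuery("*:*");
            query.setRows(10_000);  // smaller pages = less heap per request
            // Cursors require a total ordering that includes the uniqueKey.
            query.addSort("id", SolrQuery.ORDER.asc);

            String cursorMark = CursorMarkParams.CURSOR_MARK_START;
            boolean done = false;
            while (!done) {
                query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
                QueryResponse rsp = client.query(query);
                for (SolrDocument doc : rsp.getResults()) {
                    // process each document here
                }
                String next = rsp.getNextCursorMark();
                done = cursorMark.equals(next);  // done when the mark repeats
                cursorMark = next;
            }
        }
    }
}

Smaller pages mean more round trips, but each response and each
per-shard merge stays proportionally smaller.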

Thanks,
Shawn

