lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Long GC pauses while reading Solr docs using Cursor approach
Date Thu, 13 Apr 2017 03:37:13 GMT
You're missing the point of my comment. Since those fields already have
docValues, you can use the /export functionality to get the results
back as a _stream_ and avoid all of the overhead of the aggregator
node doing a merge sort and all of that.

You'll have to do this from SolrJ, but see CloudSolrStream. You can
see examples of its usage in StreamingTest.java.
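
A minimal SolrJ sketch of that approach, assuming a collection named
"myCollection" with placeholder ZooKeeper hosts and field names, and the
CloudSolrStream constructor that takes SolrParams (the exact signature varies
across Solr versions, and some versions want a StreamContext with a
SolrClientCache set before open()):

import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.CloudSolrStream;
import org.apache.solr.common.params.ModifiableSolrParams;

public class ExportStreamExample {
  public static void main(String[] args) throws Exception {
    String zkHost = "zk1:2181,zk2:2181,zk3:2181";  // placeholder ZK ensemble

    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("q", "*:*");
    params.set("qt", "/export");                   // stream from the export handler
    params.set("fl", "id,strField,longField");     // all fl/sort fields need docValues
    params.set("sort", "id asc");

    CloudSolrStream stream = new CloudSolrStream(zkHost, "myCollection", params);
    try {
      stream.open();
      while (true) {
        Tuple tuple = stream.read();
        if (tuple.EOF) {
          break;
        }
        // Tuples arrive one at a time; nothing is paged or buffered client side.
        System.out.println(tuple.getString("id"));
      }
    } finally {
      stream.close();
    }
  }
}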

This should:
1> complete much, much faster. The design goal is 400K rows/second but YMMV
2> use vastly less memory on your Solr instances.
3> only require _one_ query

Best,
Erick

On Wed, Apr 12, 2017 at 7:36 PM, Shawn Heisey <apache@elyograg.org> wrote:
> On 4/12/2017 5:19 PM, Chetas Joshi wrote:
>> I am getting back 100K results per page.
>> The fields have docValues enabled and I am getting sorted results based on "id"
>> and 2 more fields (String: 32 bytes and Long: 8 bytes).
>>
>> I have a SolrCloud of 80 nodes. There will be one shard (the aggregator) that
>> will get the top 100K docs from each shard and apply a merge sort. So, the max
>> memory usage of any shard could be 40 bytes * 100K * 80 = 320 MB. Why would heap
>> memory usage shoot up from 8 GB to 17 GB?
>
> From what I understand, Java overhead for a String object is 56 bytes
> above the actual byte size of the string itself.  And each character in
> the string will be two bytes -- Java uses UTF-16 for character
> representation internally.  If I'm right about these numbers, it means
> that each of those id values will take 120 bytes -- and that doesn't
> include the size of the actual response (XML, JSON, etc.).
>
> I don't know what the overhead for a long is, but you can be sure that
> it's going to take more than eight bytes total memory usage for each one.
>
> Then there is overhead for all the Lucene memory structures required to
> execute the query and gather results, plus Solr memory structures to
> keep track of everything.  I have absolutely no idea how much memory
> Lucene and Solr use to accomplish a query, but it's not going to be
> small when you have 200 million documents per shard.
>
> Speaking of Solr memory requirements, under normal query circumstances
> the aggregating node is going to receive at least 100K results from
> *every* shard in the collection, which it will condense down to the
> final result with 100K entries.  The behavior during a cursor-based
> request may be more memory-efficient than what I have described, but I
> am unsure whether that is the case.
>
> If the cursor behavior is not more efficient, then each entry in those
> results will contain the uniqueKey value and the score.  That's going to
> be many megabytes for every shard.  If there are 80 shards, it would
> probably be over a gigabyte for one request.
>
> Thanks,
> Shawn
>
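
For a rough sense of these numbers, here is a back-of-the-envelope sketch
(assuming the figures quoted above: 80 shards, 100K rows per page, 32-character
ids, a 56-byte String overhead, and 2-byte UTF-16 characters; decimal MB is
used to match the thread's arithmetic):

public class MergeMemoryEstimate {
  public static void main(String[] args) {
    long shards = 80;
    long rowsPerShard = 100_000;            // rows per cursor page
    long idChars = 32;                      // 32-character string id

    long perIdBytes = 56 + 2 * idChars;     // ~120 bytes per id String
    long perEntryBytes = perIdBytes + 8;    // plus at least a long sort value

    long naiveBytes = 40L * rowsPerShard * shards;           // the 320 MB estimate
    long overheadBytes = perEntryBytes * rowsPerShard * shards;

    System.out.printf("naive estimate:       %d MB%n", naiveBytes / 1_000_000);
    System.out.printf("with String overhead: %d MB%n", overheadBytes / 1_000_000);
    // The real footprint is higher still: object headers for boxed values,
    // response (XML/JSON) buffers, and Lucene/Solr per-query bookkeeping.
  }
}

With those assumptions the per-entry cost is roughly 128 bytes rather than 40,
which already puts a single aggregated page around a gigabyte before any
Lucene/Solr bookkeeping is counted.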
