lucene-solr-user mailing list archives

From Shawn Heisey <apa...@elyograg.org>
Subject Re: top 10 query overall vs shard
Date Fri, 22 Jun 2018 14:12:58 GMT
On 6/22/2018 6:50 AM, Arturas Mazeika wrote:
> I grabbed the 2.7.1 version of solr, created a 4 core setup with
> replication factor 2 on windows using [1], I've restarted the setup with
> 2GB for each node [2], inserted the html docs from the german wikipedia
> archive [3], and obtained top 10 terms for the whole collection vs one
> specific shard:
> http://localhost:9999/solr/de_wiki_all/terms?terms.limit=10&terms.fl=text&wt=json
> {
>    "responseHeader":{
>      "zkConnected":true,
>      "status":0,
>      "QTime":5287},
>    "terms":{
>      "text":[
>        "8",670564,
>        "application",670564,
>        "articles",670564,
>        "charset",670564,
>        "de",670564,
>        "f",670564,
>        "utf",670564,
>        "wiki",670564,
>        "xhtml",670564,
>        "xml",670564]}}
>
> http://localhost:9999/solr/de_wiki_all/terms?terms.limit=10&terms.fl=text&wt=json&shards=localhost:9999/solr/de_wiki_all_shard1_replica_n1&shards.qt=de_wiki_all_shard1_replica_n1
>
> {
>    "responseHeader":{
>      "zkConnected":true,
>      "status":0,
>      "QTime":20274},
>    "terms":{
>      "text":{
>        "8":671396,
>        "application":671396,
>        "articles":671396,
>        "charset":671396,
>        "de":671396,
>        "f":671396,
>        "utf":671396,
>        "wiki":671396,
>        "xhtml":671396,
>        "xml":671396}}}

The value of 'shards.qt' should be /terms, not the name of a core.  
Here's something you might want to try instead for the second query, so 
you won't need shards.qt at all:

http://localhost:9999/solr/de_wiki_all_shard1_replica_n1/terms?terms.limit=10&terms.fl=text&wt=json&distrib=false

You might actually want to add shards.qt=/terms to the first query, or 
even to the definition of the /terms handler in solrconfig.xml so that 
all distributed queries are sent to the same handler instead of going to 
/select.
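
As a rough illustration of the two requests described above, here is a small Python sketch that builds both URLs. The host, port, collection, and core names are copied from the queries in this thread; the helper name terms_url is made up for the example, and nothing here contacts Solr.

```python
from urllib.parse import urlencode

# Host, port, and names are taken from the queries quoted in this thread.
BASE = "http://localhost:9999/solr"


def terms_url(target, distributed):
    """Build a /terms URL for a collection or a single core (hypothetical helper)."""
    params = {"terms.limit": 10, "terms.fl": "text", "wt": "json"}
    if distributed:
        # Send the per-shard sub-requests to /terms instead of /select.
        params["shards.qt"] = "/terms"
    else:
        # Ask only the addressed core; do not fan out to other shards.
        params["distrib"] = "false"
    return f"{BASE}/{target}/terms?{urlencode(params)}"


whole_index = terms_url("de_wiki_all", distributed=True)
one_shard = terms_url("de_wiki_all_shard1_replica_n1", distributed=False)
print(whole_index)
print(one_shard)
```

Note that urlencode percent-encodes the slash in /terms as %2Fterms, which is the normal encoding for a query-string value.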

> reveals:
> (1) querying one shard takes 20 secs vs 5 secs for the whole index

That is strange.  With the shards.qt parameter set to a core name, I'm 
surprised you got anything at all on the second query, but maybe when it 
couldn't find a handler with that name, it just defaulted to /select 
like it would if you didn't include the parameter.  I wonder if the 
invalid handler name contributed to the slow response.

> (2) the counts for one shard are higher than for the whole index

If you're not changing the index between the requests, and it doesn't 
sound like you are, I have no idea why that might happen.

> (3) the f: hard drive is samsung SSD 850 evo 4TB (CrystalDiskMark shows
> ~500MB/s seq and ~300MB/s random read/writes), CPU:i7-6400 @3.4GHz. Querying
> for 20 secs shows that java process is neither being pushed on the CPU nor
> on the SSD side to the limits. What is the bottleneck in this computation?

If the amount of memory in the system (NOT talking about heap size here) 
is not sufficient to effectively cache the index, then Solr must 
actually hit the disk to satisfy a query.  Even an SSD is not as fast as 
memory.  You haven't indicated how much disk space is being consumed by 
the eight index cores or how much total memory the system has.  A little 
more than 8GB of the system's memory is being taken up by the four Solr 
processes.  Because you've asked for two replicas, there are two 
complete copies of the index on the system, and both copies will count 
in the total amount of resources that are required.

If there *is* sufficient memory for effective index caching, then the 
disk will barely see any usage during queries, because Solr will get 
most of the data it needs from the OS disk cache (system memory).  This 
will also reduce the impact on the CPU, because it will not be waiting 
for I/O.
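
To make the memory arithmetic concrete, here is a back-of-the-envelope sketch in Python. The 2GB heap and four nodes come from this thread; the total RAM and on-disk index size are NOT known from the thread, so the values used below are purely hypothetical placeholders, and the function names are invented for the example.

```python
def page_cache_headroom_gb(total_ram_gb, heap_per_node_gb, nodes, other_gb=0.0):
    """Memory left for the OS disk cache once the Solr heaps are subtracted.

    This is a lower bound on what Solr takes: each JVM uses somewhat more
    than its configured heap, which other_gb can roughly account for.
    """
    return total_ram_gb - heap_per_node_gb * nodes - other_gb


def fits_in_cache(index_size_gb, headroom_gb):
    """True if the whole on-disk index (all replicas) could be cached."""
    return headroom_gb >= index_size_gb


# 2GB heap and 4 nodes are from the thread; 16GB of RAM and a 30GB index
# (both replicas combined) are hypothetical numbers, not measurements.
headroom = page_cache_headroom_gb(total_ram_gb=16, heap_per_node_gb=2, nodes=4)
print(headroom, fits_in_cache(index_size_gb=30, headroom_gb=headroom))
```

With those made-up numbers only about 8GB would remain for caching a 30GB index, so many queries would have to go to disk. Substitute the real figures from the machine to see where it stands.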

Running a query is not going to read the entire index.  If it did, Solr 
would not be fast.

> (4) the output format is slightly different (compare ',' vs ':' and vector
> vs list). I wonder why

That I cannot explain.  The first response doesn't look right to me.  It 
passes RFC 4627 validation, but the software parsing the response would 
have to be very different for each of the output formats.
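
For what it's worth, a client could cope with both shapes by normalizing them before use. The sketch below assumes the two shapes are exactly the ones quoted earlier in this thread (a flat list alternating term and count, or a map of term to count); normalize_terms is a hypothetical helper, not part of Solr or any client library.

```python
def normalize_terms(field_terms):
    """Return {term: count} from either response shape seen in this thread."""
    if isinstance(field_terms, dict):
        # Map form: {"8": 671396, "application": 671396, ...}
        return dict(field_terms)
    # Flat form: ["8", 670564, "application", 670564, ...]
    if len(field_terms) % 2 != 0:
        raise ValueError("flat term list must have an even number of entries")
    return dict(zip(field_terms[0::2], field_terms[1::2]))


flat = ["8", 670564, "application", 670564]
mapped = {"8": 671396, "application": 671396}
print(normalize_terms(flat))
print(normalize_terms(mapped))
```

Usage would be something like normalize_terms(response["terms"]["text"]) on the parsed JSON of either query.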

Thanks,
Shawn

