lucene-dev mailing list archives

From "Dan Rosher (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-2592) Pluggable shard lookup mechanism for SolrCloud
Date Mon, 17 Sep 2012 10:30:08 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456927#comment-13456927 ]

Dan Rosher commented on SOLR-2592:
----------------------------------

The shard.key idea is what I implemented in the supplied patch, e.g.

<shardPartitioner name="ShardPartitioner" class="org.apache.solr.cloud.NamedShardPartitioner">
  <str name="shardField">date</str>
</shardPartitioner>

We could use any field here, though (region, date, etc.). It's NOT specifically about date
partitioning; the choice of field is at the user's discretion.

The default is a HashPartition:

hash(id) % num_shards 
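
To make the difference between the two schemes concrete, here is a minimal Python sketch. It is
not code from the patch: the function names are hypothetical, and crc32 is only an assumed
stand-in for whatever hash function the default partitioner actually uses.

```python
import zlib

def hash_partition(doc_id: str, num_shards: int) -> int:
    # Default scheme from the comment: hash(id) % num_shards.
    # crc32 is an assumption used here for a deterministic hash.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def named_partition(doc: dict, shard_field: str) -> str:
    # Field-based scheme: the shard name is the value of the configured
    # shardField (e.g. "date" or "region").
    return str(doc[shard_field])

doc = {"id": "doc-1", "date": "Sep2012", "region": "EU"}
print(hash_partition(doc["id"], 4))   # some shard index in 0..3
print(named_partition(doc, "date"))   # the doc's date value names its shard
```

With the hash scheme the shard can be computed from the id alone; with the field-based scheme
it can be computed from the document (or from a known field value), but not from the id.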

Michael - Your suggestion of 15/Sep/12 02:36 still wouldn't address our need to know exactly
which shard a doc lives on. For our app (and I'd guess for most), the majority of queries are
searches, which would normally have to be sent to every shard. In our app, though, I already
know in advance what subset of the index I need to search, so to speed the query up I'd want
to index docs that way too, so that I ONLY need to query a particular shard. If I know the
subset in advance, querying every shard and narrowing with fq=... seems wasteful to me.

The downside of my implementation is that deletes and RealTimeGets would be slower, since the
id alone is not enough to determine shard membership and the request therefore needs to be
sent everywhere. I suspect that in most applications this is a welcome compromise, as most
queries will be search ones.

Perhaps shard membership could be stored efficiently in a distributed Bloom filter or
something similar, to speed that up?
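
A minimal sketch of that idea, in Python, might look like the following. It assumes one Bloom
filter per shard holding the ids indexed there; the class, its sizes, and candidate_shards are
all hypothetical, and nothing here says how the filters would be distributed or kept in sync.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely not here"; True means "possibly here".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

def candidate_shards(filters, doc_id):
    # Shards that still have to be asked about doc_id.
    return [name for name, bf in filters.items() if bf.might_contain(doc_id)]

filters = {"Aug2012": BloomFilter(), "Sep2012": BloomFilter()}
filters["Aug2012"].add("doc-7")
filters["Sep2012"].add("doc-1")
print(candidate_shards(filters, "doc-1"))
```

A delete or RealTimeGet by id would then only go to the candidate shards. False positives
cost an extra request, but a shard that holds the id is never skipped.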

All this aside, as a compromise I've thought that for us we can take this one level higher,
i.e. instead of collections=docs and shard=Aug2012,Sep2012, etc., we can do
collections=docs_Aug2012,docs_Sep2012. Then if we need to search across multiple dates, we can
do this today, and still have hash-based sharding, by using
collection=docs_Aug2012,docs_Sep2012,... in the query.
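
For example, the client could build that parameter from a date range. The Python sketch below
assumes the docs_<Mon><Year> naming used above; the helper names are hypothetical, and only the
collection= parameter itself comes from the discussion.

```python
from datetime import date

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

def month_collections(prefix, start, end):
    # One collection name per calendar month from start to end, inclusive.
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}_{MONTHS[month - 1]}{year}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names

def collection_param(names):
    # Comma-separated list, as in collection=docs_Aug2012,docs_Sep2012
    return "collection=" + ",".join(names)

names = month_collections("docs", date(2012, 8, 1), date(2012, 9, 30))
print(collection_param(names))  # collection=docs_Aug2012,docs_Sep2012
```

Each monthly collection keeps the default hashed sharding internally, so deletes and gets by id
within a collection stay cheap.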

Others might find this idea useful too.

                
> Pluggable shard lookup mechanism for SolrCloud
> ----------------------------------------------
>
>                 Key: SOLR-2592
>                 URL: https://issues.apache.org/jira/browse/SOLR-2592
>             Project: Solr
>          Issue Type: New Feature
>          Components: SolrCloud
>    Affects Versions: 4.0-ALPHA
>            Reporter: Noble Paul
>            Assignee: Mark Miller
>         Attachments: dbq_fix.patch, pluggable_sharding.patch, pluggable_sharding_V2.patch,
>                      SOLR-2592.patch, SOLR-2592_r1373086.patch, SOLR-2592_r1384367.patch,
>                      SOLR-2592_rev_2.patch, SOLR_2592_solr_4_0_0_BETA_ShardPartitioner.patch
>
>
> If the data in a cloud can be partitioned on some criteria (say range, hash, attribute
> value etc) It will be easy to narrow down the search to a smaller subset of shards and in
> effect can achieve more efficient search.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

