lucene-dev mailing list archives

From "Andy Laird (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-2592) Pluggable shard lookup mechanism for SolrCloud
Date Sat, 23 Jun 2012 12:37:44 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399933#comment-13399933
] 

Andy Laird commented on SOLR-2592:
----------------------------------

I've been using code very similar to Michael's latest patch for a few weeks now and am
liking it less and less for our use case.  As I described above, we use this patch to
ensure that all docs with the same value for a specific field end up on the same shard,
so that field collapse counting works for distributed searches; otherwise the returned
counts are only an upper bound.

The problems we've encountered all stem from our need to update the value of the field
we're collapsing on.  Our approach, conceptually similar to the CompositeIdShardKeyParserFactory
in Michael's latest patch, involved creating a new schema field, indexId, that combines
what used to be our uniqueKey with the field that we collapse on:

*Original schema*
{code:xml}
<field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true" />
<field name="xyz" type="string" indexed="true" stored="true" multiValued="false" required="true" />
...
<uniqueKey>id</uniqueKey>
{code}

*Modified schema*
{code:xml}
<field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true" />
<field name="xyz" type="string" indexed="true" stored="true" multiValued="false" required="true" />
<field name="indexId" type="string" indexed="true" stored="true" multiValued="false" required="true" />
...
<uniqueKey>indexId</uniqueKey>
{code}

During indexing we populate the extra "indexId" field in the form "id:xyz".  Our custom ShardKeyParser
extracts the xyz portion of the uniqueKey and returns it as the hash value for shard
selection.  Everything works great in terms of field collapse, counts, etc.

The problems begin when we need to change the value of the field xyz.  Suppose our
document starts out with these values for the three fields above:
{quote}
id=123
xyz=456
indexId=123:456
{quote}

We then want to change xyz to, say, 789.  In other words, we want to end up with...
{quote}
id=123
xyz=789
indexId=123:789
{quote}

...so that the doc lives on the same shard as the other docs that have xyz=789, keeping
the field collapse counts correct (we use those counts to drive paging).

Before any of this we could simply pass in a new document and all would be well, since we weren't
changing the uniqueKey.  Now, however, we need to delete the old document (with the old uniqueKey)
or we'll end up with duplicates.  Since we don't know whether a given update changes the value of
xyz, and we don't know what the old value of xyz was (without doing an additional lookup),
we must include an extra delete along with every change:

*Before*
{code:xml}
<add>
  <doc>
    <field name="id">123</field>
    <field name="xyz">789</field>
  </doc>
</add>
{code}

*Now*
{code:xml}
<delete>
  <query>id:123 AND NOT xyz:789</query>
</delete>
<add>
  <doc>
    <field name="id">123</field>
    <field name="xyz">789</field>
    <field name="indexId">123:789</field>    <!-- old indexId was 123:456 -->
  </doc>
</add>
{code}
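The coupling shows up even in trivial helper code.  Here is a sketch of the per-update assembly we now have to do (the field names id, xyz and indexId are from our schema above; the class and methods are hypothetical string helpers, not the Solr API):

```java
// Hypothetical sketch of the extra bookkeeping every update now needs:
// build the guard delete query and recompute the composite indexId.
public class UpdateGuard {

    // Delete any stale copy of this doc whose xyz differs from the new
    // value; sent before every add because we don't know the old xyz.
    public static String deleteQuery(String id, String newXyz) {
        return "id:" + id + " AND NOT xyz:" + newXyz;
    }

    // The composite uniqueKey that must be (re)computed on every add.
    public static String indexId(String id, String xyz) {
        return id + ":" + xyz;
    }
}
```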

So in addition to the "unsavory coupling" between id and xyz there is a significant performance
hit with this approach (we are doing all of this in an NRT context).  The fundamental issue,
of course, is that only the uniqueKey value (id) and score are available in the first phase of
distributed search; we really need the other field that we are using for shard ownership,
too.

One idea is to have another standard schema field similar to uniqueKey that is used for the
purposes of shard distribution:
{code:xml}
<uniqueKey>id</uniqueKey>
<shardKey>xyz</shardKey>
{code}
Then, as standard procedure, the first phase of distributed search would return uniqueKey,
shardKey and score.  Perhaps the ShardKeyParser gets both the uniqueKey and shardKey data for
maximum flexibility.  Besides solving our issue with field collapse counts, date-based
sharding could be done by setting the shardKey to a date field and doing the appropriate
slicing in the ShardKeyParser.
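For example, the date-based case could slice along month boundaries, something like this (a sketch only, assuming shardKey is a date field; the bucketing scheme, class and method names are all made up for illustration):

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical date-based slicing for a ShardKeyParser: all docs whose
// shardKey falls in the same calendar month land on the same shard.
public class DateShardSlicer {

    // Months elapsed since the epoch serve as the slice number.
    public static long monthBucket(LocalDate date) {
        return ChronoUnit.MONTHS.between(LocalDate.of(1970, 1, 1),
                                         date.withDayOfMonth(1));
    }

    // Map the slice onto the available shards.
    public static int shard(LocalDate date, int numShards) {
        return (int) Math.floorMod(monthBucket(date), numShards);
    }
}
```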

                
> Pluggable shard lookup mechanism for SolrCloud
> ----------------------------------------------
>
>                 Key: SOLR-2592
>                 URL: https://issues.apache.org/jira/browse/SOLR-2592
>             Project: Solr
>          Issue Type: New Feature
>          Components: SolrCloud
>    Affects Versions: 4.0
>            Reporter: Noble Paul
>         Attachments: SOLR-2592.patch, dbq_fix.patch, pluggable_sharding.patch, pluggable_sharding_V2.patch
>
>
> If the data in a cloud can be partitioned on some criteria (say range, hash, attribute
> value etc.) it will be easy to narrow down the search to a smaller subset of shards and
> in effect can achieve more efficient search.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

