lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: optimal shard assignment with low shard key cardinality using compositeId to enable shard splitting
Date Thu, 21 May 2015 15:30:30 GMT
I question your base assumption:

bq: So shard by document producer seems a good choice

Because what this _also_ does is force all of the work for a query
onto one node, and likewise all indexing for a particular producer. It
will also force you to manually monitor your shards to see whether some
of them grow out of proportion to others. And....

I think it would be much less hassle to just let Solr distribute the
docs as it may based on the uniqueKey and forget about it, unless you
want, say, to do joins etc.... There will, of course, be some overhead
that you pay here, but unless you can measure it and it's a pain I
wouldn't add the complexity you're talking about, especially at the
volumes you're talking about.
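
If you want to convince yourself, here's a quick sketch (illustrative
only; it approximates Solr's contiguous hash-range assignment) showing
how evenly default routing spreads a large number of unique ids:

import mmh3  # pip install mmh3; the same MurmurHash3 family Solr uses

num_docs = 100000
num_shards = 16
counts = [0] * num_shards

for i in range(num_docs):
    # default routing hashes the whole uniqueKey over the full 32-bit space
    h = mmh3.hash("doc-%d" % i) & 0xFFFFFFFF  # as unsigned 32-bit
    # map the hash onto one of 16 equal contiguous ranges
    counts[h * num_shards // 2 ** 32] += 1

print(counts)  # every bin lands close to num_docs/num_shards = 6250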

Best,
Erick

On Thu, May 21, 2015 at 3:20 AM, Matteo Grolla <matteo.grolla@gmail.com> wrote:
> Hi,
> I'd like some feedback on how I plan to solve the following sharding problem.
>
>
> I have a collection that will eventually become big
>
> Average document size is 1.5KB
> Every year 30 million documents will be indexed
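> (back of the envelope: 30 million docs/year × 1.5KB ≈ 45GB of raw document data per year)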
>
> Data come from different document producers (a person, the owner of his documents), and
> queries are almost always performed by a document producer, who can only query his own
> documents. So sharding by document producer seems a good choice.
>
> there are 3 types of document producer:
> type A
> cardinality 105 (there are 105 producers of this type)
> produces 17M docs/year (the aggregated production of all type A producers)
> type B
> cardinality ~10k
> produces 4M docs/year
> type C
> cardinality ~10M
> produces 9M docs/year
>
> I'm thinking about using compositeId (solrDocId = producerId!docId) to send all docs of
> the same producer to the same shard. When a shard becomes too large I can use shard
> splitting.
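>
> For example (a minimal sketch; the collection name and host are made up, and any client
> would do in place of plain HTTP):
>
> import requests  # pip install requests
>
> SOLR = "http://localhost:8983/solr/mycollection"  # hypothetical collection
>
> # indexing: prefixing the id with "producerId!" makes Solr route all of a
> # producer's docs to the same shard (compositeId routing)
> doc = {"id": "producerA1!doc-0001", "body": "some text"}
> requests.post(SOLR + "/update?commit=true", json=[doc])
>
> # searching: _route_ tells Solr to query only the shard(s) owning that key
> params = {"q": "body:text", "_route_": "producerA1!"}
> print(requests.get(SOLR + "/select", params=params).json())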
>
> problems:
> - documents from type A producers could be unevenly distributed among the shards, because
> hashing doesn't balance well over a small number of keys (105); see the Appendix
>
> As a solution, when a new type A producer (producerA1) arrives, I could do this (see the
> sketch after this list):
>
> 1) client app: generate a producer code
> 2) client app: simulate murmur hashing and shard assignment
> 3) client app: check that the assignment is optimal (the producer code lands on the shard
> with the fewest type A producers); otherwise go to 1) and try another code
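>
> Roughly like this (a sketch; it assumes the client keeps the per-shard counts of type A
> producers, and the code format is just an example):
>
> import uuid
>
> import mmh3  # pip install mmh3
>
> num_shards = 16
> type_a_counts = [0] * num_shards  # type A producers already on each shard
>
> def shard_of(producer_code):
>     # simulate compositeId routing: top 16 bits of the key's murmur hash,
>     # mapped (approximately) onto the shards by modulo
>     return (mmh3.hash(producer_code) >> 16) % num_shards
>
> def new_producer_code():
>     # 1) generate a candidate code  2) simulate its shard assignment
>     # 3) accept only if it lands on a least-loaded shard, else retry
>     least_loaded = min(type_a_counts)
>     while True:
>         code = uuid.uuid4().hex[:8]
>         shard = shard_of(code)
>         if type_a_counts[shard] == least_loaded:
>             type_a_counts[shard] += 1
>             return code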
>
> when I add documents or perform searches for producerA1, I use its producer code in the
> compositeId (for indexing) or in the _route_ parameter (for searches), respectively.
> What do you think?
>
>
> -----------Appendix: murmurhash shard assignment simulation-----------------------
>
> import mmh3  # pip install mmh3; the same MurmurHash3 family Solr uses for routing
>
> # hash each of the 105 producer ids and keep the top 16 bits, since
> # compositeId routing puts the route key's hash in the upper 16 bits
> hashes = [mmh3.hash(str(i)) >> 16 for i in range(105)]
>
> num_shards = 16
> shards = [0] * num_shards
>
> # count how many producer keys land on each shard (approximating Solr's
> # contiguous hash ranges with a simple modulo)
> for h in hashes:
>     idx = h % num_shards
>     shards[idx] += 1
>
> print(shards)
> print(sum(shards))
>
> -------------
>
> result: [4, 10, 6, 7, 8, 6, 7, 8, 11, 1, 8, 5, 6, 5, 5, 8]
>
> so with 16 shards and 105 shard keys I can end up with some shards holding only 1 key
> and others holding 11 keys
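> (which is roughly what the statistics predict: 105 keys over 16 shards gives a mean of
> ~6.6 keys/shard with a standard deviation of ~2.5, so a spread from 1 to 11 is ordinary
> balls-in-bins variance)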
>
