lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Matteo Grolla <matteo.gro...@gmail.com>
Subject optimal shard assignment with low shard key cardinality using compositeId to enable shard splitting
Date Thu, 21 May 2015 10:20:49 GMT
Hi
I'd like some feedback on how I'd like to solve the following sharding problem


I have a collection that will eventually become big

Average document size is 1.5kb
Every year 30 Million documents will be indexed

Data come from different document producers (a person, owner of his documents) and queries
are almost always performed by a document producer who can only query his own document. So
shard by document producer seems a good choice

there are 3 types of doc producer
type A, 
cardinality 105 (there are 105 producers of this type)
produce 17M docs/year (the aggregated production af all type A producers)
type B
cardinality ~10k
produce 4M docs/year
type C
cardinality ~10M
produce 9M docs/year

I'm thinking about 
use compositeId ( solrDocId = producerId!docId ) to send all docs of the same producer to
the same shards. When a shard becomes too large I can use shard splitting.

problems
-documents from type A producers could be oddly distributed among shards, because hashing
doesn't work well on small numbers (105) see Appendix

As a solution I could do this when a new typeA producer (producerA1) arrives:

1) client app: generate a producer code
2) client app: simulate murmurhashing and shard assignment
3) client app: check shard assignment is optimal (producer code is assigned to the shard with
the least type A producers) otherwise goto 1) and try with another code

when I add documents or perform searches for producerA1 I use it's producer code respectively
in the compositeId or in the route parameter
What do you think?


-----------Appendix: murmurhash shard assignment simulation-----------------------

import mmh3

hashes = [mmh3.hash(str(i))>>16 for i in xrange(105)]

num_shards = 16
shards = [0]*num_shards

for hash in hashes:
    idx = hash % num_shards
    shards[idx] += 1

print shards
print sum(shards)

-------------

result: [4, 10, 6, 7, 8, 6, 7, 8, 11, 1, 8, 5, 6, 5, 5, 8]

so with 16 shards and 105 shard keys I can have
shards with 1 key
shards with 11 keys


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message