lucy-dev mailing list archives

From Dan Markham <>
Subject Re: [lucy-dev] ClusterSearcher
Date Sun, 27 Nov 2011 09:45:28 GMT

On Nov 26, 2011, at 1:48 AM, Nathan Kurz wrote:

> Dan --
> I took a glance.  Sounds promising.  Could you talk a bit about the use
> case you are anticipating?  
> What are you indexing?  

The best way to describe what I plan to do, and currently do: 50% name/value pairs and 50%
full-text search.
Both have rather large documents and lots of sort fields. I truly do abuse the crap out of KinoSearch
currently. Highlighting is about the only thing I don't use heavily.

> How fast is it
> changing?
I'm thinking the average number of changes will be about 15 a second.
During bulkier changes, I hope much faster.

>  Do the shards fit in memory?  
Yes and no...
Some servers with low query requirements will be overloaded and spill to disk.
High-profile indexes with low-latency search SLAs, yes.

> What's a ballpark for the searches
> per second you'd like to handle?

1k/second (name/value style searches) with the 98th-percentile search under 30ms.
1k/second (full text with nasty OR queries with large posting files) with the 98th-percentile search
under 300ms.

> My first thought is that you may be able to trade off some latency for
> increased throughput by sticking with partially serialized requests if you
> were able to pass a threshold score along to each node/shard so you could
> speed past low scoring results.

More detail!
I'm thinking you're talking about the top_docs call getting to use a hinted low watermark in
its priority queue?

If so, I was chatting with Marvin about this the other day. I was scared of creating a
cluster with 100 nodes.
One reason was the sheer number of docs I would need to push over the network with num_wanted
=> 10, offset => 200.
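
To put a number on that: without any hint, each node has to return its own top
(offset + num_wanted) hits, because the coordinator cannot know in advance which node
holds the hits for the requested page. The sketch below is illustrative Python, not Lucy
code; docs_over_network and merge_page are hypothetical helper names.

```python
import heapq
from itertools import islice

def docs_over_network(num_nodes, num_wanted, offset):
    """Hits shipped to the coordinator when every node returns a full page prefix."""
    per_node = offset + num_wanted
    return per_node * num_nodes

def merge_page(per_node_hits, num_wanted, offset):
    """Merge per-node hit lists (each sorted best first) and slice out one page.

    Each hit is a (score, doc_id) tuple.
    """
    merged = heapq.merge(*per_node_hits, key=lambda hit: -hit[0])
    return list(islice(merged, offset, offset + num_wanted))

# 100 nodes, num_wanted => 10, offset => 200
total = docs_over_network(100, 10, 200)
```

With these numbers the coordinator receives 21,000 hits in order to show 10.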

The thing that killed the idea for me...
How do I generate the low watermark we pass to nodes without first getting data back from one node?
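
One possible reading of the "partially serialized" suggestion is that the watermark does
come from earlier nodes: shards are queried one after another, and the score of the weakest
hit kept so far is passed along as the hint. The sketch below is illustrative Python under
that assumption; the min_score parameter and both function names are hypothetical and are
not part of the Lucy or KinoSearch API. Shards are modeled as lists of (doc_id, score).

```python
import heapq

def top_docs(scored_docs, num_wanted, min_score=float("-inf")):
    """Return up to num_wanted (score, doc_id) pairs, best first.

    min_score is the hinted low watermark: hits scoring below it are
    skipped before they ever reach the priority queue.
    """
    heap = []  # min-heap; heap[0] is the weakest hit currently kept
    for doc_id, score in scored_docs:
        if score < min_score:
            continue
        if len(heap) < num_wanted:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)

def search_serialized(shards, num_wanted):
    """Query shards in sequence, tightening the watermark after each one."""
    best = []
    watermark = float("-inf")
    for shard in shards:
        hits = top_docs(shard, num_wanted, min_score=watermark)
        best = sorted(best + hits, reverse=True)[:num_wanted]
        if len(best) == num_wanted:
            watermark = best[-1][0]  # weakest hit that still makes the cut
    return best
```

In this toy version every document is still scored, so the saving is only in queue work and
network traffic; a real matcher would presumably use the hint to skip scoring work as well.
The cost is latency, since later shards wait on earlier ones.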

So the way I fixed the 100-shard problem (in my head) was to build a pyramid of MultiSearchers.
This doesn't really work either, and I now think it makes things worse. I'm thinking that by the
time I start worrying about 100+ nodes, sampling and early termination will be a must.

Another crazy idea while I have you this far off track: top_docs requests go into a "pool".
All 100 nodes try to run the queries in the pool, and each places the score of its lowest-scoring
doc into the response pool for that node. The top_docs query submitter can then decide how long
to wait for responses, or how many responses to wait for, and knows which nodes he will need
top_docs from.
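
A minimal sketch of how the submitter could use such a response pool, assuming each node
reports the lowest score within its local top k (or nothing useful if it has fewer than k
hits). The largest reported low is then a safe watermark, because that one node alone already
holds k hits at or above it. All names here are hypothetical, and nodes are modeled as plain
lists of scores.

```python
def local_low(scores, k):
    """Lowest score in a node's local top k, or None if it has fewer than k hits."""
    top = sorted(scores, reverse=True)[:k]
    return top[-1] if len(top) == k else None

def pool_watermark(lows):
    """Largest reported low among nodes that filled their local top k."""
    full = [low for low in lows.values() if low is not None]
    return max(full) if full else float("-inf")

def fetch_above(scores, k, watermark):
    """Second phase: a node ships only the local top-k hits at or above the watermark."""
    return [s for s in sorted(scores, reverse=True)[:k] if s >= watermark]

nodes = {"a": [0.9, 0.8, 0.1], "b": [0.5, 0.4, 0.3], "c": [0.85]}
k = 2
lows = {name: local_low(scores, k) for name, scores in nodes.items()}
mark = pool_watermark(lows)
shipped = {name: fetch_above(scores, k, mark) for name, scores in nodes.items()}
top = sorted((s for hits in shipped.values() for s in hits), reverse=True)[:k]
```

Here node "b" ships nothing at all, which is the kind of saving the pool is after. If the
submitter stops waiting early, the watermark is only looser, never wrong.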

So what I'm doing in lucy_cluster is not trying to solve the 100-node issue just yet, and
keeping the number of nodes small (< 10). But at the same time I'm keeping the number of shards
at about 30, mainly so I can rebalance nodes by just moving shards, and so nodes can search
more than one shard locally with a MultiSearcher.
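
The appeal of having more shards than nodes can be sketched as follows: adding a node means
moving a handful of whole shards rather than re-indexing. This is illustrative Python, not
lucy_cluster code; assign and add_node are hypothetical names, and the layout is just a dict
of node name to shard list.

```python
def assign(shards, nodes):
    """Spread shards over nodes round-robin."""
    layout = {node: [] for node in nodes}
    for i, shard in enumerate(shards):
        layout[nodes[i % len(nodes)]].append(shard)
    return layout

def add_node(layout, new_node):
    """Move shards from the fullest nodes until new_node holds a fair share.

    Returns the new layout and the list of (shard, donor_node) moves.
    """
    layout = {node: list(held) for node, held in layout.items()}
    layout[new_node] = []
    total = sum(len(held) for held in layout.values())
    target = total // len(layout)
    moved = []
    while len(layout[new_node]) < target:
        donor = max((n for n in layout if n != new_node),
                    key=lambda n: len(layout[n]))
        shard = layout[donor].pop()
        layout[new_node].append(shard)
        moved.append((shard, donor))
    return layout, moved

shards = [f"shard{i:02d}" for i in range(30)]
layout = assign(shards, ["node1", "node2", "node3", "node4", "node5"])
layout, moved = add_node(layout, "node6")
```

Going from 5 to 6 nodes with 30 shards moves 5 shards, one from each existing node, and
leaves every node holding 5.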

>  But this brings up Marvin's points about
> how to handle distributed TF/IDF...

This is *easy* to solve on a per-index basis with insider knowledge about the index and how
it's segmented. Doing it perfectly for everyone, and fast, sounds hard. Spreading out the cost
of caching/updating the TF/IDF is, I think, key.
I like the idea of sampling a node or two to get the cache started (to service the search) and
then finishing the cache out of band to get a better, more complete picture. Unless you're
adding/updating to an index with an all-new term mix quickly, I don't think the TF/IDF cache
needs to move quickly.
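
A rough sketch of such a cache, assuming each shard can report its document count and
per-term document frequencies. The IdfCache class is hypothetical, and the IDF formula used
is the classic Lucene-style 1 + ln(N / (1 + df)), which is an assumption here rather than a
statement about what lucy_cluster does.

```python
import math

class IdfCache:
    """Accumulates doc counts and doc freqs shard by shard."""

    def __init__(self):
        self.doc_count = 0
        self.doc_freq = {}
        self.seen = set()  # shards already folded into the totals

    def add_shard_stats(self, shard_id, doc_count, doc_freq):
        """Fold one shard's statistics into the cache (idempotent per shard)."""
        if shard_id in self.seen:
            return
        self.seen.add(shard_id)
        self.doc_count += doc_count
        for term, df in doc_freq.items():
            self.doc_freq[term] = self.doc_freq.get(term, 0) + df

    def idf(self, term):
        """IDF from whatever has been sampled so far."""
        if self.doc_count == 0:
            return 1.0
        df = self.doc_freq.get(term, 0)
        return 1.0 + math.log(self.doc_count / (1 + df))

cache = IdfCache()
cache.add_shard_stats("shard00", 100, {"lucy": 9})   # sampled up front
first_estimate = cache.idf("lucy")
cache.add_shard_stats("shard01", 100, {"lucy": 29})  # filled in out of band
better_estimate = cache.idf("lucy")
```

The first estimate is enough to service the search; the estimate after more shards report in
is closer to the cluster-wide value, and it only drifts as fast as the term mix changes.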

