From: Nathan Kurz
Date: Mon, 5 Dec 2011 18:29:25 -0800
To: lucy-dev@incubator.apache.org
Subject: Re: [lucy-dev] ClusterSearcher

On Sun, Nov 27, 2011 at 1:45 AM, Dan Markham wrote:
> Best way to describe what i plan to and currently do:  50% name/value
> key pair searches, and 50% full text search.

That sounds close to my intended use.  I'm particularly interested in
hybrids between recommendations, full text search, and filtered
categories.  I'd be searching within a particular domain (such as
movies, music, or academic papers) and want to return results that
meet the search criteria but are ordered in a highly personalized way.

>> How fast is it changing?
>
> I'm thinking the avg. number of changes will be about ~15 a second.

OK, so manageable on the changes, but you want to make sure that
updates are immediately viewable by the updater.

>> What's a ballpark for the searches per second you'd like to handle?
>
> 1k/second (name/value style searches) with the 98th percentile search
> under 30ms.
> 1k/second (full text with nasty OR queries w/ large posting files)
> with the 98th percentile search under 300ms.

Do you really require such a low latency on the simple searches, or
are you basing this on back-calculating from requests per second?  I
think you'd get better throughput (requests/sec) if you could relax
this requirement and allow for some serialized access to nodes.

>> Do the shards fit in memory?
>
> Yes and no...
> Will have some servers with low query requirements overloaded to disk.
> High profile indexes with low search SLAs, yes.

The hope is that the mmap() approach should degrade gracefully up to a
point, so this should work, as long as loads truly are light -- and as
long as the tail is short: only short posting lists are going to be
small enough to be read from disk in the amount of time you're talking
about.
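As a rough sanity check on that, here's a quick Python sketch with
assumed (not measured) numbers for a single spinning disk -- say ~8 ms
per seek and ~100 MB/s sequential throughput:

    # Back-of-envelope: how many bytes of posting list can one
    # seek-plus-read deliver inside a given latency budget?
    # All numbers are assumptions, not benchmarks.
    SEEK_MS = 8.0            # average seek + rotational latency (assumed)
    THROUGHPUT_MBS = 100.0   # sequential read throughput in MB/s (assumed)

    def max_posting_bytes(budget_ms):
        """Bytes readable after one seek within budget_ms milliseconds."""
        transfer_ms = max(budget_ms - SEEK_MS, 0.0)
        return transfer_ms / 1000.0 * THROUGHPUT_MBS * 1e6

    for budget_ms in (30, 300):
        print("%d ms budget: ~%.1f MB" % (budget_ms,
                                          max_posting_bytes(budget_ms) / 1e6))
    # => 30 ms leaves room for only ~2.2 MB, 300 ms for ~29 MB -- and
    #    that's a single uncontended read, before any merging or scoring.

So at the 30 ms SLA there is very little headroom for cache misses
that go all the way to disk; the 300 ms tier is more forgiving.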
> I'm thinking you're talking about the top_docs call getting to use a
> hinted low watermark in its priority queue?

Yes, I'm thinking that a "low watermark" would help to establish some
known minimum, as well as a "high watermark" to allow for offsets in
results.  This would help when trying to get (for example) results
1000-1100 without having to save or send the top 1100 from each node.
In addition, one could save network traffic by adding a roundtrip
between the central node and the shards, where the high and low
watermarks are returned first and the central node then sends a
request for the details after the preliminary responses are tabulated.

I also reached the conclusion at some point that the "score" returned
should be allowed to include a string, so that results can be arranged
alphabetically based on the value of a given field in all matching
records: FULLTEXT MATCHES "X" SORT BY FIELD "Y".

>> how to handle distributed TF/IDF...
>
> This is *easy* to solve on a per-index basis with insider knowledge
> about the index and how it's segmented.  Doing it perfectly for
> everyone, and fast, sounds hard.  Spreading out the cost of caching
> and updating the TF/IDF is, i think, key.
> I like the idea of sampling a node or 2 to get the cache started
> (service the search) and then finishing the cache out of band to get
> a better, more complete picture.  Unless you're adding/updating an
> index with an all-new term mix quickly, i don't think the TF/IDF
> cache needs to move quickly.

I'm strongly (perhaps even belligerently?) of the opinion that the
specific values for TF need to be part of the query that is sent to
the nodes, rather than something local to each node's scoring
mechanism.  Term frequency should be converted to a weight for each
clause of the query, and that weight (boost) should be used by the
scorer.  This boost can be equivalent to local, global, or approximate
term frequency as desired, but get it out of the core!

With this in mind, if you truly need an exact term frequency, and are
unable to assume that any single index is a reasonable approximation
of the entire corpus, I think the only solution is to have a
standalone TF index.  These will be small enough that they should be
easy to manage per node if necessary.  Every few seconds the TF
updates are broadcast from the indexing machine and replicated as
necessary.

> So the way i fixed the 100-shard problem (in my head) is i built a
> pyramid of MultiSearchers.  This doesn't really work either, and i
> think now makes it worse.

My instinct is that this would not be a good architecture.  While the
recursive approach is superficially appealing (and very OO), I think
it would be a performance nightmare.

I may be teaching my grandmother to suck eggs, but I'm pretty sure
that the limiting factor for full text search at the speeds you desire
is going to be memory bandwidth: how fast can you sift through a bunch
of RAM?  I'd love it if someone could check my numbers, but my
impression is that current servers can plow through on the order of
10 GB/second from RAM, which means that each core can do several
integer ops per multi-byte position read.  Multiply by 4, 8, or 16
cores, and we are quickly memory bound.  I'm not sure where the
crossover is, but adding more shards per node is quickly going to hurt
rather than help.
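To make the TF/IDF point above concrete, here's a minimal sketch of
"weight per clause, computed centrally" -- Python pseudocode with
hypothetical names, not Lucy's actual API:

    import math

    # Hypothetical global stats, maintained by the standalone TF index:
    # term -> document frequency across the whole corpus.
    GLOBAL_DF = {"apache": 120000, "lucy": 3500}
    GLOBAL_DOC_COUNT = 10000000

    def weighted_clauses(terms):
        """Build (term, boost) clauses with IDF folded into the boost.

        Each node's scorer just multiplies by the supplied boost; it
        never consults its own shard-local document counts.
        """
        clauses = []
        for term in terms:
            df = GLOBAL_DF.get(term, 1)
            boost = math.log(1.0 + float(GLOBAL_DOC_COUNT) / df)
            clauses.append((term, boost))
        return clauses

    # The coordinator sends the same weighted clauses to every shard,
    # so scores stay comparable no matter how the corpus is segmented.
    print(weighted_clauses(["apache", "lucy"]))

Whether the boost comes from exact global counts, a sampled node or
two, or a slowly-refreshed cache is then a policy decision at the
coordinator, not something baked into each node's scorer.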
> How do i generate the low watermark we pass to nodes without getting
> data back from one node?

I think you've given the best and possibly only answer to your
question: make another round trip.  Or, for maximum performance, make
multiple round trips.  Do you really need a response in 30 ms?

--nate