From: Nathan Kurz
Date: Mon, 5 Dec 2011 18:29:25 -0800
To: lucy-dev@incubator.apache.org
Subject: Re: [lucy-dev] ClusterSearcher

On Sun, Nov 27, 2011 at 1:45 AM, Dan Markham wrote:
> Best way to describe what i plan to and currently do:  50% name/value
> key pair searches, and 50% full text search.

That sounds close to my intended use.  I'm particularly interested in
hybrids between recommendations, full text search, and filtered
categories.  I'd be searching within a particular domain (such as
movies, music, or academic papers) and want to return results that
meet the search criteria but are ordered in a highly personalized way.

>> How fast is it changing?
>
> I'm thinking the avg. number of changes will be about ~15 a second.

OK, so manageable on the changes, but you want to make sure that
updates are immediately viewable by the updater.

>> What's a ballpark for the searches per second you'd like to handle?
>
> 1k/second (name/value style searches) with the 98th percentile search
> under 30ms.
> 1k/second (full text with nasty OR queries w/ large posting files)
> with the 98th percentile search under 300ms.

Do you really require such a low latency on the simple searches, or
are you basing this on back-calculating from requests per second?  I
think you'd get better throughput (requests/sec) if you could relax
this requirement and allow for some serialized access to nodes.

>> Do the shards fit in memory?
>
> Yes and no...
> Will have some servers with low query requirements overloaded to disk.
> High profile indexes with low search SLAs, yes.

The hope is that the mmap() approach should degrade gracefully up to a
point, so this should work, as long as loads truly are light -- and as
long as the tail is short: only short posting lists are going to be
small enough to be read from disk in the amount of time you're talking
about.
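As a rough sanity check on that, here's a quick Python sketch with
assumed (not measured) numbers for a single spinning disk -- say ~8 ms
per seek and ~100 MB/s sequential throughput:

    # Back-of-envelope: how many bytes of posting list can one
    # seek-plus-read deliver inside a given latency budget?
    # All numbers are assumptions, not benchmarks.
    SEEK_MS = 8.0            # average seek + rotational latency (assumed)
    THROUGHPUT_MBS = 100.0   # sequential read throughput in MB/s (assumed)

    def max_posting_bytes(budget_ms):
        """Bytes readable after one seek within budget_ms milliseconds."""
        transfer_ms = max(budget_ms - SEEK_MS, 0.0)
        return transfer_ms / 1000.0 * THROUGHPUT_MBS * 1e6

    for budget_ms in (30, 300):
        print("%d ms budget: ~%.1f MB" % (budget_ms,
                                          max_posting_bytes(budget_ms) / 1e6))
    # => 30 ms leaves room for only ~2.2 MB, 300 ms for ~29 MB -- and
    #    that's a single uncontended read, before any merging or scoring.

So at the 30 ms SLA there is very little headroom for cache misses
that go all the way to disk; the 300 ms tier is more forgiving.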
> I'm thinking you're talking about the top_docs call getting to use a
> hinted low watermark in its priority queue?

Yes, I'm thinking that a "low watermark" would help to establish some
known minimum, as well as a "high watermark" to allow for offsets in
results.  This would help when trying to get (for example) results
1000-1100 without having to save or send the top 1100 from each node.
In addition, one could save network traffic by adding a roundtrip
between the central node and the shards, where the high and low
watermarks are returned first and the central node then sends a
request for the details after the preliminary responses are tabulated.

I also reached the conclusion at some point that the "score" returned
should be allowed to include a string, so that results can be arranged
alphabetically based on the value of a given field in all matching
records: FULLTEXT MATCHES "X" SORT BY FIELD "Y".

>> how to handle distributed TF/IDF...
>
> This is *easy* to solve on a per-index basis with insider knowledge
> about the index and how it's segmented.  Doing it perfectly for
> everyone, and fast, sounds hard.  Spreading out the cost of caching
> and updating the TF/IDF is, i think, key.
> I like the idea of sampling a node or 2 to get the cache started
> (service the search) and then finishing the cache out of band to get
> a better, more complete picture.  Unless you're adding/updating an
> index with an all-new term mix quickly, i don't think the TF/IDF
> cache needs to move quickly.

I'm strongly (perhaps even belligerently?) of the opinion that the
specific values for TF need to be part of the query that is sent to
the nodes, rather than something local to each node's scoring
mechanism.  Term frequency should be converted to a weight for each
clause of the query, and that weight (boost) should be used by the
scorer.  This boost can be equivalent to local, global, or approximate
term frequency as desired, but get it out of the core!

With this in mind, if you truly need an exact term frequency, and are
unable to assume that any single index is a reasonable approximation
of the entire corpus, I think the only solution is to have a
standalone TF index.  These will be small enough that they should be
easy to manage per node if necessary.  Every few seconds the TF
updates are broadcast from the indexing machine and replicated as
necessary.

> So the way i fixed the 100-shard problem (in my head) is i built a
> pyramid of MultiSearchers.  This doesn't really work either, and i
> think now makes it worse.

My instinct is that this would not be a good architecture.  While the
recursive approach is superficially appealing (and very OO), I think
it would be a performance nightmare.

I may be teaching my grandmother to suck eggs, but I'm pretty sure
that the limiting factor for full text search at the speeds you desire
is going to be memory bandwidth: how fast can you sift through a bunch
of RAM?  I'd love it if someone could check my numbers, but my
impression is that current servers can plow through on the order of
10 GB/second from RAM, which means that each core can do several
integer ops per multi-byte position read.  Multiply by 4, 8, or 16
cores, and we are quickly memory bound.  I'm not sure where the
crossover is, but adding more shards per node is quickly going to hurt
rather than help.
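To make the TF/IDF point above concrete, here's a minimal sketch of
"weight per clause, computed centrally" -- Python pseudocode with
hypothetical names, not Lucy's actual API:

    import math

    # Hypothetical global stats, maintained by the standalone TF index:
    # term -> document frequency across the whole corpus.
    GLOBAL_DF = {"apache": 120000, "lucy": 3500}
    GLOBAL_DOC_COUNT = 10000000

    def weighted_clauses(terms):
        """Build (term, boost) clauses with IDF folded into the boost.

        Each node's scorer just multiplies by the supplied boost; it
        never consults its own shard-local document counts.
        """
        clauses = []
        for term in terms:
            df = GLOBAL_DF.get(term, 1)
            boost = math.log(1.0 + float(GLOBAL_DOC_COUNT) / df)
            clauses.append((term, boost))
        return clauses

    # The coordinator sends the same weighted clauses to every shard,
    # so scores stay comparable no matter how the corpus is segmented.
    print(weighted_clauses(["apache", "lucy"]))

Whether the boost comes from exact global counts, a sampled node or
two, or a slowly-refreshed cache is then a policy decision at the
coordinator, not something baked into each node's scorer.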
> How do i generate the low watermark we pass to nodes without getting
> data back from one node?

I think you've given the best and possibly only answer to your
question: make another round trip.  Or, for maximum performance, make
multiple round trips.  Do you really need a response in 30 ms?

--nate