From: "J. Delgado"
To: dev@lucene.apache.org
Date: Mon, 5 Mar 2012 10:05:53 -0800
Subject: Re: Indexing Boolean Expressions
In-Reply-To: <4F44183D.8030002@gmail.com>
References: <4F44183D.8030002@gmail.com>

I looked at LUCENE-2987 and its work on the query side (changes to the accepted syntax to accept lower case 'or' and 'and'), which isn't really related to my proposal.

What I'm proposing is to be able to index complex boolean expressions using Lucene. This can be viewed as the opposite of the regular search task. The objective here is to find the set of relevant queries given a document (an assignment of values to fields).

This by itself may not sound that interesting, but it's a key piece of efficiently implementing any MATCHING system, which is effectively a two-way search where constraints are defined both ways. Examples of this would be:

1) Job matching: Potential employers define their "job posting" as a document along with complex boolean expressions used to narrow potential candidates. Job searchers upload their "profile" and may formulate complex queries when executing a search. Once a search is initiated from either side, constraints need to be satisfied both ways.
2) Advertising: Publishers define constraints on the type of advertisers/ads they are willing to show on their sites. On the other hand, advertisers define constraints (typically at the campaign level) on the publisher sites they want their ads to show on, as well as on the user audiences they are targeting. While some attribute values are known at definition time, others are only instantiated once the user visits a given page, which triggers a matching request that must be satisfied in a few milliseconds to select "valid" ads, which are then scored based on "relevance".

So in a matching system, a MATCH QUERY is considered to be a tuple that consists of a value assignment to attributes/fields (doc) + a boolean expression (query), and it goes against a double index also built on tuples that simultaneously hold boolean expressions and associated documents.
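
To make the two-way constraint concrete, here is a minimal sketch in Python (not Lucene code). The `Party` class, the attribute names and the lambda predicates are all made up for illustration; a real system would store parsed boolean expressions rather than Python callables, and would not check pairs one by one.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Attributes = Dict[str, object]


@dataclass
class Party:
    """One side of a match: a value assignment (doc) plus a constraint (query)."""
    attributes: Attributes
    constraint: Callable[[Attributes], bool]


def two_way_match(a: Party, b: Party) -> bool:
    """True only if each side's constraint accepts the other side's attributes."""
    return a.constraint(b.attributes) and b.constraint(a.attributes)


# Hypothetical job posting: wants 3+ years of experience and Lucene skills.
job = Party(
    attributes={"title": "search engineer", "location": "SF"},
    constraint=lambda c: c.get("years", 0) >= 3 and "lucene" in c.get("skills", ()),
)
# Hypothetical candidate: only accepts jobs in SF or remote.
candidate = Party(
    attributes={"years": 5, "skills": ("java", "lucene")},
    constraint=lambda j: j.get("location") in {"SF", "remote"},
)
# Hypothetical candidate with no constraints but too little experience.
junior = Party(
    attributes={"years": 1, "skills": ("lucene",)},
    constraint=lambda j: True,
)
```

Here `two_way_match(job, candidate)` holds because both constraints are satisfied, while `two_way_match(job, junior)` fails on the employer's side even though the candidate accepts any job.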

To do this efficiently we need to be able to build indexes on Boolean expressions (Lucene queries) and retrieve the set of matching expressions given a doc (typically a few attributes with values assigned), which is the core of what is described in this paper: "Indexing Boolean Expressions" (see http://www.vldb.org/pvldb/2/vldb09-83.pdf)
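
As a rough illustration of what "indexing expressions" means, here is a small Python sketch. It is a plain counting scheme of my own choosing, not the algorithm from the paper and not a Lucene API: it only handles conjunctions of equality predicates, and `ConjunctionIndex`, the ad ids and the field names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Predicate = Tuple[str, str]  # (field, required value)


class ConjunctionIndex:
    """Inverted index from (field, value) predicates to stored expressions."""

    def __init__(self) -> None:
        self._postings: Dict[Predicate, Set[str]] = defaultdict(set)
        self._sizes: Dict[str, int] = {}

    def add(self, expr_id: str, conjunction: Dict[str, str]) -> None:
        """Store one expression: every listed field must equal the given value."""
        if not conjunction:
            raise ValueError("empty conjunctions are not supported in this sketch")
        self._sizes[expr_id] = len(conjunction)
        for predicate in conjunction.items():
            self._postings[predicate].add(expr_id)

    def match(self, doc: Dict[str, str]) -> List[str]:
        """Return ids of expressions whose predicates are all satisfied by doc."""
        hits: Dict[str, int] = defaultdict(int)
        for assignment in doc.items():
            # .get avoids creating empty posting lists for unseen predicates
            for expr_id in self._postings.get(assignment, ()):
                hits[expr_id] += 1
        # An expression matches when every one of its predicates was hit.
        return sorted(e for e, n in hits.items() if n == self._sizes[e])


index = ConjunctionIndex()
index.add("ad-1", {"country": "US", "section": "sports"})
index.add("ad-2", {"country": "US"})
index.add("ad-3", {"country": "DE", "section": "sports"})

matched = index.match({"country": "US", "section": "sports", "device": "mobile"})
```

With these three stored expressions, `matched` contains "ad-1" and "ad-2". The lookup only touches the posting lists for the document's own attribute values, so its cost does not grow with the number of stored expressions that share nothing with the document.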

-- J


So to effectively resolve the problem of realtime matching one can

On Tue, Feb 21, 2012 at 2:18 PM, Joe Cabrera <calcmaster16@gmail.com> wrote:

> On 02/21/2012 12:15 PM, Aayush Kothari wrote:
>
>> So if Aayush Kothari is interested in working on this as a Student, all
>> we need is a formal mentor (I can be the informal one).
>>
>> Anyone up for the task?
>>
>> Completely interested in working for and learning about the
>> aforementioned subject/project. +1.
>
> This may be related to the work I'm doing with LUCENE-2987.
> Basically changing the grammar to accept conjunctions AND and OR in the
> query text.
> I would be interested in working with you on some of the details.
>
> However, I too am not a formal committer.
>
> --
> Joe Cabrera
> eminorlabs.com