lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "J. Delgado" <joaquin.delg...@gmail.com>
Subject Re: Indexing Boolean Expressions
Date Mon, 26 Mar 2012 17:15:46 GMT
BTW, the idea of indexing Boolean Expressions inside a text indexing engine
is not new. For example Oracle Text provides the CTXRULE index and the
MATCHES operator within their indexing stack, which is primarily used for
Rule-based text classification.

See:

http://docs.oracle.com/cd/B28359_01/text.111/b28303/query.htm#autoId8

http://docs.oracle.com/cd/B28359_01/text.111/b28303/classify.htm#g1011013

-- J

On Mon, Mar 26, 2012 at 10:07 AM, J. Delgado <joaquin.delgado@gmail.com>wrote:

> In full dislosure, there is a patent application that Yahoo! has filed for
> the use of inverted indexes for using complex  predicates for matching
> contracts and opportunities in advertising:
>
> http://www.google.com/patents/US20110016109?printsec=abstract#v=onepage&q&f=false
>
> However I believe there are many more applications that can benefit from
> similar matching techniques (i.e. recommender systems,
> e-commerce, recruiting,etc) to make it worthwhile implementing the ideas
> exposed in the original VLDB'09 paper (which is public) in Lucene.
>
> As a Yahoo! employee, I might not be able to directly contribute to this
> project but will be happy to point to any publicly available pointer that
> can help.
>
> Cheers,
>
> -- Joaquin
>
>
> On Sun, Mar 25, 2012 at 11:44 PM, Mikhail Khludnev <
> mkhludnev@griddynamics.com> wrote:
>
>> Hello Joaquin,
>>
>> I looked through the paper several times, and see no problem to implement
>> it in Lucene (the trivial case at least):
>>
>> Let's index conjunctive condition as
>>  {fieldA:valA,fieldB:valB,fieldC:valC,numClauses:3}
>>
>> then, form query from the incoming fact (event):
>> fieldA:valA OR fieldB:valB OR fieldC:valC OR fieldD:valD
>>
>> to enforce overlap between condition and event, wrap the query above into
>> own query whose scorer will check that numClauses for the matched doc is
>> equal to number of matched clauses.
>> To get "numClauses for the matched doc" you can use FieldCache that's
>> damn fast; and "number of matched clauses" can be obtained from
>> DisjunctionSumScorer.nrMatchers()
>>
>> Negative clauses, and multivalue can be covered also, I believe.
>>
>> WDYT?
>>
>>
>> On Mon, Mar 5, 2012 at 10:05 PM, J. Delgado <joaquin.delgado@gmail.com>wrote:
>>
>>> I looked at LUCENE-2987 and its work on the query side (changes to the
>>> accepted syntax to accept lower case 'or' and 'and'), which isn't really
>>> related to my proposal.
>>>
>>> What I'm proposing is to be able to index complex boolean expressions
>>> using Lucene. This can be viewed as the opposite of the regular search
>>> task. The objective here is find a set of relevant queries given a document
>>> (assignment of values to fields).
>>>
>>> This by itself may not sound that interesting but its a key piece
>>> to efficiently implementing any MATCHING system which is effectively a
>>> two-way search where constraints are defined both-ways. An example of this
>>> would be:
>>>
>>> 1) Job matching: Potential employers define their "job posting" as a
>>> documents along with complex boolean expressions used to narrow potential
>>> candidates. Job searchers upload their "profile" and may formulate complex
>>> queries when executing a search. Once a is search initiated from any of the
>>> sides constraints need to satisfied both ways.
>>> 2) Advertising: Publishers define constraints on the type of
>>> advertisers/ads they are willing to show in their sites. On the other hand,
>>> advertisers define constraints (typically at the campaign level) on
>>> publisher sites they want their ads to show at as well as on the user
>>> audiences they are targeting to. While some attribute values are known at
>>> definition time, others are only instantiated once the user visits a given
>>> page which triggers a matching request that must be satisfied in
>>> few milliseconds to select "valid" ads and then scored based on "relevance".
>>>
>>> So in a matching system a MATCH QUERY is considered to to be a tuple
>>> that consists of a value assignment to attributes/fields (doc) + a boolean
>>> expression (query) that goes against a double index also built on tuples
>>> that  simultaneously boolean expressions and associated documents.
>>>
>>> To do this efficiently we need to be able to build indexes on Boolean
>>> expressions (Lucene queries) and retrieve the set of matching expressions
>>> given a doc (typically few attributes with values assigned), which is the
>>> core of what is described in this paper: "Indexing Boolean Expressions"
>>> (See http://www.vldb.org/pvldb/2/vldb09-83.pdf)
>>>
>>> -- J
>>>
>>>
>>> So to effectively resolve the problem of realtime matching one can
>>>
>>> On Tue, Feb 21, 2012 at 2:18 PM, Joe Cabrera <calcmaster16@gmail.com>wrote:
>>>
>>>>  On 02/21/2012 12:15 PM, Aayush Kothari wrote:
>>>>
>>>>
>>>>
>>>>
>>>>>  So if Aayush Kothari is interested in working on this as a Student,
>>>>> all we need is a formal mentor (I can be the informal one).
>>>>>
>>>>>  Anyone up for the task?
>>>>>
>>>>>
>>>>>   Completely interested in working for and learning about the
>>>> aforementioned subject/project. +1.
>>>>
>>>> This may be related to the work I'm doing with LUCENE-2987
>>>> Basically changing the grammar to accepts conjunctions AND and OR in
>>>> the query text.
>>>> I would be interested in working with you on some of the details.
>>>>
>>>> However, I too am not a formal committer.
>>>>
>>>> --
>>>> Joe Cabreraeminorlabs.com
>>>>
>>>>
>>>
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>> Lucid Certified
>> Apache Lucene/Solr Developer
>> Grid Dynamics
>>
>> <http://www.griddynamics.com>
>>  <mkhludnev@griddynamics.com>
>>
>>
>

Mime
View raw message