lucene-dev mailing list archives

From Tanguy Moal
Subject Re: looking for a BooleanMatcher instead of BooleanScorer
Date Fri, 01 Jun 2012 08:58:10 GMT

I'm just sharing my thoughts, they might be off-topic...

Take the first example quoted from GitHub: the user wants to find all
nodes whose facebookId is in a given, quite long list (a friends list;
be aware that some Facebook users have 1500+ friends!).

The application first had the facebookId for a user (say id=someId),
requested the Facebook graph with that id, and got a quite long list of
facebookIds back, right?
At that point, I think the application should not try to enumerate its neo4j
graph using an OR-ed list of facebookIds.
Instead, it should make sure that each neo4j node in the friends list has a
multivalued "friendOf" attribute, and that this attribute contains
the facebookId someId for each involved node. Then trigger an update request
for those updated nodes.
You could make your application wait for that update to complete if it
really needs to be synchronous with Facebook.
That moves the problem to handling update requests smartly, which might be
easier sometimes.
Here you will eventually want to store a hash of the user's friends list
somewhere in the user's node, so you know in advance whether that user's
friends list has changed and whether you need to trigger the update process again (just
When your user uses the application for the first time, or every time after
she has updated her friends list, an update job will be fired for that user.
You may want to wait for the update request to complete only the first time (if
you don't need your app to be 100% synchronized with Facebook), and have
the subsequent jobs queued to something that handles these updates
efficiently. That could stress the storage system with intensive writes
from time to time, especially at the beginning, but it will converge to
a mainly read-based application once most active users have used the
application once. New friendships aren't that frequent (IMHO).
Maybe NRT developments could be used in this scenario... I don't know much
more. I don't know anything about how Neo4J works, I used it once, that's
Anyway, if you hit write issues, congratulations: your application is being
used widely, go buy SSD disks :)
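
To make the friends-list hash idea concrete, here is a minimal sketch in
Python. The names friends_hash, needs_update and the "friendsHash" attribute
are made up for illustration, and the user node is modelled as a plain dict
rather than a real neo4j node. It only shows how a stored digest could tell
you whether the update job needs to run again.

```python
import hashlib


def friends_hash(friend_ids):
    """Order-independent digest of a friends list (hypothetical helper)."""
    canonical = ",".join(sorted(str(i) for i in friend_ids))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_update(user_node, friend_ids):
    """True if the list differs from the one last indexed, or none was indexed."""
    return user_node.get("friendsHash") != friends_hash(friend_ids)


# The user node is a plain dict here; "friendsHash" is an assumed attribute name.
user_node = {"facebookId": "someId"}
friends = ["42", "7", "1500"]

first_visit = needs_update(user_node, friends)            # no hash stored yet
user_node["friendsHash"] = friends_hash(friends)          # after the update job ran
unchanged = needs_update(user_node, ["7", "1500", "42"])  # same friends, other order
changed = needs_update(user_node, friends + ["99"])       # one new friendship
```

Sorting before hashing means a reordered list from Facebook does not trigger
a useless update job.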

Finally, you will then enumerate your nodes with a very quick and efficient
query: friendOf:"someId".
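
The difference between the two designs can be sketched with a toy in-memory
inverted index in Python. This is not Lucene or neo4j code: TinyIndex and
mark_friends are hypothetical names, and the index is just a dict of sets. It
only illustrates paying once at write time so that the read becomes a single
term lookup instead of a long OR.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: (field, value) -> set of node ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.nodes = {}

    def put(self, node_id, fields):
        # Re-indexing a node: drop its old postings first, then add the new ones.
        for field, values in self.nodes.get(node_id, {}).items():
            for value in values:
                self.postings[(field, value)].discard(node_id)
        self.nodes[node_id] = fields
        for field, values in fields.items():
            for value in values:
                self.postings[(field, value)].add(node_id)

    def term(self, field, value):
        return set(self.postings.get((field, value), ()))


def mark_friends(index, user_id, friend_ids):
    """Write side: add user_id to the friendOf attribute of each friend node."""
    for fid in friend_ids:
        for node_id in index.term("facebookId", fid):
            fields = {k: set(v) for k, v in index.nodes[node_id].items()}
            fields.setdefault("friendOf", set()).add(user_id)
            index.put(node_id, fields)


index = TinyIndex()
for node_id, fb_id in enumerate(["a", "b", "c", "d"]):
    index.put(node_id, {"facebookId": {fb_id}})

friends_of_some_id = ["a", "c", "d"]

# Read side, first design: one lookup per friend, OR-ed together.
ored = set()
for fid in friends_of_some_id:
    ored |= index.term("facebookId", fid)

# Second design: write once, then read with a single term.
mark_friends(index, "someId", friends_of_some_id)
single = index.term("friendOf", "someId")
```

Both reads return the same nodes. The sketch leaves out removing someId from
nodes that are no longer friends, which a real update job would also need.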

What I meant is that if your application really needs to perform
queries made of many, many, many, ... really many OR-ed terms,
then there might exist (but not always) a different design of
your data model that would still let you fit the use case of a search

This applies to 1 and maybe to 2 too. ( :p 2-2-2 -- never mind )

I don't really understand 3, which seems to be a MinShouldMatch issue.
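
For what "minimum should match" means here, a minimal Python sketch, assuming
whitespace tokenization and a fixed threshold of 2 (both are assumptions, and
the matches function is hypothetical, not Lucene's or Solr's implementation):
a document matches only when at least that many of the optional query terms
occur in the field.

```python
def matches(field_text, query_text, min_should_match):
    """True if at least min_should_match distinct query terms occur in the field."""
    field_terms = set(field_text.lower().split())
    query_terms = set(query_text.lower().split())
    hits = len(field_terms & query_terms)
    return hits >= min_should_match


title = "dust in the wind"
full = matches(title, "dust in wind", 2)   # three query terms hit the title
short = matches(title, "dust", 2)          # only one query term hits
```

With that threshold, "dust in wind" still matches the title while "dust"
alone no longer does.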

As I said in the beginning, I'm simply sharing my thoughts! I hope this


2012/6/1 Li Li

> hi all,
>    I am looking for a 'BooleanMatcher' in lucene. for many
> applications, we don't need to order matched documents by relevance score.
> we just want the boolean query. But BooleanScorer/BooleanScorer2
> is a little bit heavy because it is built for relevance scoring.
>    one use case is: we have some fields which have a very small number
> of tokens (usually only one word), such as id, tag or something else.
>    But we need queries like this: id in (1,3,5.....). if using
> booleanQuery (id:1 id:3 id:5 ...), BooleanScorer can only apply to 31
> terms. BooleanScorer2 uses a priority queue to know how many terms are
> matched (Coord).
>    Filters may help, but it can be a very complicated query (or else,
> it is itself still using BooleanQuery, and there is a recursion problem).
>    we may divide the current BooleanScorer into a BooleanMatcher and a
> Ranker. if we need to score the hit docs, we ask the BooleanScorer for
> not only the hit ids but also tf/idf, coord or anything we need to use in
> ranking. but sometimes we only need docIds. then the BooleanMatcher
> can optimize its implementation. for the case of many disjunction
> terms, we can do it like Filter or BooleanScorer instead of
> BooleanScorer2.
>    is it possible?
>    the following are some user demands I found on the mailing list. the
> first one is my own requirement.
>    1.
>    2. mail to lucene (May 6)
> Hi,
>      I have a problem with how to search many keywords in about
> 5,000,000 documents. For example, the query may be like "(a1 or a2 or a3
> ....a200) and (b1 or b2 or b3 or b4 ..... b400)". I found it takes a
> very long time (40 seconds) to get the answer on only one field (Title
> field), and the JVM throws an OutOfMemoryError with more fields (title field
> plus content field). Any suggestions or good ideas to solve this
> problem? thanks in advance.
>    3. mail to lucene (Chris Book, Apr 11, to solr-user)
> Hello, I have a solr index running that is working very well as a search.
>  But I want to add the ability (if possible) to use it to do matching.  The
> problem is that by default it is only looking for all the input terms to be
> present, and it doesn't give me any indication as to how many terms in the
> target field were not specified by the input.
> For example, if I'm trying to match to the song title "dust in the wind",
> I'm correctly getting a match if the input query is "dust in wind".  But I
> don't want to get a match if the input is just "dust".  Although as a
> search "dust" should return this result, I'm looking for some way to filter
> this out based on some indication that the input isn't close enough to the
> output.  Perhaps if I could get information that the number of input
> terms is much less than the number of terms in the field.  Or something
> else along those lines?
> I realize that this isn't the typical use case for a search, but I'm just
> looking for some suggestions as to how I could improve the above example a
> bit.
> Thanks,
> Chris
