lucene-dev mailing list archives

From Li Li <>
Subject Re: looking for a BooleanMatcher instead of BooleanScorer
Date Fri, 01 Jun 2012 09:15:50 GMT
sorry, the first problem is not mine.

On Fri, Jun 1, 2012 at 4:58 PM, Tanguy Moal <> wrote:
> Hello,
> I'm just sharing my thoughts, they might be off-topic...
> Take the first example quoted from GitHub: the user wants to find all nodes
> having their facebookId in a given, quite long list (a friends list; be
> aware that some Facebook users have 1500+ friends!).
> The application first had the facebookId for a user (say id=someId),
> requested the Facebook graph with that id and got a quite long list of
> facebookIds back, right?
> At that point, I think the application should not try to enumerate its neo4j
> graph using an OR-ed facebookIds list.
> It should make sure that each neo4j node in the set of the friends list has a
> "friendOf" attribute and ensure that this multivalued attribute contains the
> facebookId someId for each involved node. Then trigger an update request for
> those updated nodes.
> You could make your application wait for that update to complete if it
> really needs to be synchronous with facebook.
> That moves the problem to handling update requests smartly, which might be
> easier sometimes.
> Here you will eventually want to store a hash of the user's friends list
> somewhere in the user's node, so you know in advance whether that user's
> friends list has changed and whether you need to trigger the update process
> again (just thinking).
> When your user uses the application for the first time, or every time after
> she has updated her friends list, an update job will be fired for that user.
> You may want to wait for the update request to complete only the first time
> (if you don't need your app to be 100% synchronized with Facebook), and have
> the subsequent jobs queued to something that handles these updates
> efficiently. That could stress the storage system with intensive writes
> from time to time, especially at the beginning, but it will converge to a
> mainly read-based application once most active users have used the
> application once. New friendships aren't that frequent (IMHO).
> Maybe NRT developments could be used in this scenario... I don't know much
> more. I don't know anything about how Neo4j works; I used it once, that's
> all.
> Anyway, if you hit write issues, congratulations: your application is being
> used widely, go buy SSD disks :)
> Finally, you will then enumerate your nodes with a very quick and efficient
> query friendOf:"someId" .
> What I meant is that if your application really needs to perform
> queries made of many, many, many, ... really many terms that are OR-ed, then
> there might exist (but it's not always true) a different design of your data
> model that would still let you fit the use case of a search engine.
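
The remodelling suggested above can be sketched roughly as follows. This is plain Python, not Neo4j or Lucene code: the `postings` dictionary, the helper functions and the sample ids are all hypothetical stand-ins, and SHA-1 is just one possible choice for the friends-list hash.

```python
import hashlib
from collections import defaultdict

# Hypothetical in-memory stand-in for the node index:
# (field, value) -> set of node ids.
postings = defaultdict(set)


def add(node_id, field, value):
    """Index one value of a (possibly multivalued) field for a node."""
    postings[(field, value)].add(node_id)


def term_query(field, value):
    """Single-term lookup, e.g. friendOf:"someId"."""
    return set(postings.get((field, value), ()))


def or_query(field, values):
    """The approach being avoided: one OR-ed clause per value."""
    hits = set()
    for value in values:
        hits |= postings.get((field, value), set())
    return hits


def friends_hash(friend_ids):
    """Order-independent hash used to detect a changed friends list."""
    joined = ",".join(sorted(friend_ids))
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


# Four nodes, each indexed by its own facebookId.
for node_id, facebook_id in enumerate(["fb1", "fb2", "fb3", "fb4"]):
    add(node_id, "facebookId", facebook_id)

# Friends list returned for the user someId (sample data).
friends = ["fb3", "fb1", "fb2"]

# Update job: resolve the friends once, tag each node, remember the hash.
or_hits = or_query("facebookId", friends)
for node_id in or_hits:
    add(node_id, "friendOf", "someId")
stored_hash = friends_hash(friends)

# Later reads need a single term instead of a long OR-ed list.
term_hits = term_query("friendOf", "someId")
print(sorted(term_hits))  # [0, 1, 2]
```

The long OR-ed lookup still happens, but only inside the update job; it runs again only when `friends_hash` of a freshly fetched list differs from `stored_hash`.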

I agree. Lucene/Solr may need to support many other types of queries used
in traditional databases.
For now, we usually store structured data in an RDBMS and full text in
Lucene/Solr, but synchronizing the data is a nightmare. We would like to
use just one full-featured solution instead of integrating many solutions.

> This applies to 1 and maybe to 2 too. ( :p 2-2-2 -- never mind )
> I don't really understand 3, which seems to be a MinShouldMatch issue.
> As I said in the beginning, I'm simply sharing my thoughts! I hope this
> helps...
> --
> Tanguy
> 2012/6/1 Li Li <>
>> hi all,
>>    I am looking for a 'BooleanMatcher' in Lucene. For many
>> applications, we don't need to order matched documents by relevance
>> score; we just want the boolean query. But BooleanScorer/BooleanScorer2
>> is a little heavy because it is built for relevance scoring.
>>    One use case: we have some fields with a very small number
>> of tokens (usually only one word), such as id, tag or something else.
>>    But we need queries like this: id in (1,3,5.....). If we use a
>> BooleanQuery (id:1 id:3 id:5 ...), BooleanScorer can only apply to 31
>> terms, and BooleanScorer2 uses a priority queue to know how many terms
>> matched (coord).
>>    Filters may help, but the query can be very complicated (or else
>> the filter itself still uses BooleanQuery, which is a recursive problem).
>>    We could divide the current BooleanScorer into a BooleanMatcher and a
>> Ranker. If we need to score the hit docs, we ask the BooleanScorer for
>> not only the hit ids but also tf/idf, coord or anything else we need for
>> ranking. But sometimes we only need docIds; then the BooleanMatcher
>> can optimize its implementation. For the case of many disjunction
>> terms, we can do it like Filter or BooleanScorer instead of
>> BooleanScorer2.
>>    Is it possible?
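
To make the matcher/ranker split concrete, here is a minimal sketch of the match-only half. It is plain Python and does not use any Lucene class: `match_any`, `doc_ids` and the sample `postings` are hypothetical, and a Python integer stands in for a doc-id bitset.

```python
def match_any(postings, field, values):
    """Return a bitset (a Python int) with bit d set when doc d has any value.

    No scores, no coord and no priority queue: each postings list is
    simply OR-ed into the bitset.
    """
    bits = 0
    for value in values:
        for doc_id in postings.get((field, value), ()):
            bits |= 1 << doc_id
    return bits


def doc_ids(bits):
    """Expand a bitset into a sorted list of doc ids."""
    ids = []
    doc_id = 0
    while bits:
        if bits & 1:
            ids.append(doc_id)
        bits >>= 1
        doc_id += 1
    return ids


# Sample postings: (field, term) -> doc ids containing that term.
postings = {
    ("id", "1"): [0],
    ("id", "3"): [2],
    ("id", "5"): [4],
    ("tag", "news"): [0, 1, 4],
}

# id in (1, 3, 5, 7): a pure disjunction over single-token fields.
id_hits = match_any(postings, "id", ["1", "3", "5", "7"])
print(doc_ids(id_hits))  # [0, 2, 4]

# (id in (...)) AND (tag in (...)): intersect the two bitsets.
both = id_hits & match_any(postings, "tag", ["news"])
print(doc_ids(both))  # [0, 4]
```

The cost of a disjunction here grows with the total length of the postings lists visited, and a conjunction of disjunctions is one bitwise AND per group. A ranker would be a separate step that runs only when scores are actually needed.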
>>    The following are some user demands I found on the mailing list. The
>> first one is my own requirement.
>>    1.
>>    2. mail to lucene (May 6, to lucene):
>> Hi,
>>      I have a problem searching for many keywords in about
>> 5,000,000 documents. For example, the query may be like "(a1 or a2 or a3
>> ....a200) and (b1 or b2 or b3 or b4 ..... b400)". I found it takes a
>> very long time (40 seconds) to get the answer in only one field (Title
>> field), and the JVM throws an OutOfMemoryError with more fields (title
>> field plus content field). Any suggestions or good ideas to solve this
>> problem? Thanks in advance.
>>    3. mail to lucene (Chris Book, Apr 11, to solr-user):
>> Hello, I have a Solr index running that is working very well as a search.
>> But I want to add the ability (if possible) to use it to do matching. The
>> problem is that by default it is only looking for all the input terms to
>> be present, and it doesn't give me any indication as to how many terms in
>> the target field were not specified by the input.
>> For example, if I'm trying to match to the song title "dust in the wind",
>> I'm correctly getting a match if the input query is "dust in wind".  But I
>> don't want to get a match if the input is just "dust".  Although as a
>> search "dust" should return this result, I'm looking for some way to
>> filter this out based on some indication that the input isn't close enough
>> to the output.  Perhaps if I could get information that the number of
>> input terms is much less than the number of terms in the field.  Or
>> something else along those lines?
>> I realize that this isn't the typical use case for a search, but I'm just
>> looking for some suggestions as to how I could improve the above example a
>> bit.
>> Thanks,
>> Chris
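
One way to read the request above is as a coverage check applied after the search has returned candidates. The sketch below is plain Python, not a Solr feature: `coverage`, `is_match`, the whitespace tokenization and the 0.6 threshold are all assumptions made for illustration.

```python
def coverage(query, field_text):
    """Fraction of the field's terms that also appear in the query."""
    field_terms = field_text.lower().split()
    query_terms = set(query.lower().split())
    if not field_terms:
        return 0.0
    matched = sum(1 for term in field_terms if term in query_terms)
    return matched / len(field_terms)


def is_match(query, field_text, min_coverage=0.6):
    """Accept a hit only if the query covers enough of the field.

    The 0.6 default is an arbitrary illustration, not a recommended value.
    """
    return coverage(query, field_text) >= min_coverage


title = "dust in the wind"
print(coverage("dust in wind", title))  # 0.75
print(is_match("dust in wind", title))  # True
print(coverage("dust", title))          # 0.25
print(is_match("dust", title))          # False
```

With this rule "dust in wind" covers three of the four title terms and is kept, while "dust" covers one of four and is filtered out.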
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
