lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Adrian Banks <>
Subject Combining hits from multiple documents into a single hit
Date Tue, 08 Sep 2009 17:30:25 GMT
[I originally posted this to the mailing list,but it was  
suggested that I might have more luck here]

I am trying to get a particular search to work and it is proving  
problematic. The actual source data is quite complex but can be  
summarised by the following example:

I have articles that are indexed so that they can be searched. Each  
article also has multiple properties associated with it which are also  
indexed and searchable. When users search, they can get hits in either  
the main article or the associated properties. Regardless of where a  
hit is achieved, the article is returned as a search hit (ie. the  
properties are never a hit in their own right).

Now for the complexity:

Each property has security on it, which means that for any given user,  
they may or may not be able to see the property. If a user cannot see  
a property, they obviously do not get a search hit in it. This  
security check is proprietary and cannot be done using the typical  
mechanism of storing a role in the index alongside the other fields in  
the document.

I currently have a index that contains the articles and properties  
indexed separately (ie. an article is indexed as a document, and each  
property has its own document). When a search happens, a hit in  
article A or a hit in any of the properties of article A should be  
classed as hit for article A alone, with the scores combined.

Whether or not a user can see a property is not based on the property  
itself, but on the value of the property. I cannot therefore put the  
extra security conditions into the query upfront as I don't know the  
value to filter by.

As an example:

| Article | Property 1 | Property 2 |
|    A    |     X      |     J      |
|    B    |     Y      |     K      |
|    C    |     Z      |     L      |

If a user can see everything, then searching for "B and Y" will return  
a single search result for article B.

If another user cannot see a property if its value contains Y, then  
searching for "B and Y" will return no hits.

I have no way of knowing what values a user can and cannot see  
upfront. They only way to tell is to perform the security check  
(currently done at the time of filtering a hit from a field in the  
document), which I obviously cannot do for every possible data value  
for each user.

To achieve this originally, Lucene v1.3 was modified to allow this to  
happen by changing BooleanQuery to have a custom Scorer that could  
apply the logic of the security check and the combination of two hits  
in different documents being classed as a hit in a single document. I  
am trying to upgrade this version to the latest (v2.3.2 - I am using  
Lucene.Net), but ideally without having to modify Lucene in any way.

An additional problem occurs if I do an AND search. If an article  
contains the word foo and one of its properties contains the word bar,  
then searching for "foo AND bar" will return the article as a hit. My  
current code deals with this inside the custom Scorer.

Any ideas how/if this can be done?

I am thinking along the lines of using a custom HitCollector and  
passing that into the search, but when doing the boolean search "foo  
AND bar", execution never reaches my HitCollector as the  
ConjunctionScorer filters out all of the results from the sub-queries  
before getting there.



To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message