lucene-java-user mailing list archives

From Mark Harwood <>
Subject Re: DuplicateFilter question
Date Mon, 31 May 2010 07:30:42 GMT
The DuplicateFilter passed to the searcher has no visibility of the text query and is
therefore evaluated independently of all other criteria.
It sounds like the behaviour you want is to get the last duplicate that also matches your criteria.
That seems like a fairly common need, but unfortunately it is not something DuplicateFilter
will help with. For this requirement you would need a new de-duping query that
wraps a child query and takes the latest match for a given field. Unfortunately, if the documents
are not sequenced in URL order, this will involve either a lot of expensive disk seeks
or a lot of RAM to evaluate efficiently.
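
To make the ordering problem concrete, here is a minimal sketch in plain Java that does not use Lucene at all. The class name, method names and sample data are all made up for illustration: doc ids are array indexes, urls[] stands in for the key field and texts[] for the searchable text. It only contrasts "de-dup first, then match" with "match first, then de-dup"; it is not how DuplicateFilter is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model (hypothetical): doc id = array index, urls[i] = de-dup key, texts[i] = text. */
final class DedupOrderDemo {

    /** Keeps only the last doc id per key, without looking at any query. */
    static Set<Integer> lastDocPerKey(String[] urls) {
        Map<String, Integer> last = new HashMap<>();
        for (int doc = 0; doc < urls.length; doc++) {
            last.put(urls[doc], doc); // later docs overwrite earlier ones
        }
        return new HashSet<>(last.values());
    }

    static boolean containsWord(String text, String word) {
        return Arrays.asList(text.split("\\s+")).contains(word);
    }

    /** De-dup independently of the query, then apply the text criterion. */
    static List<Integer> dedupThenMatch(String[] urls, String[] texts, String word) {
        Set<Integer> survivors = lastDocPerKey(urls);
        List<Integer> hits = new ArrayList<>();
        for (int doc = 0; doc < urls.length; doc++) {
            if (survivors.contains(doc) && containsWord(texts[doc], word)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    /** Apply the text criterion first, then keep the last matching doc per key. */
    static List<Integer> matchThenDedup(String[] urls, String[] texts, String word) {
        // One map entry per distinct key: this is the memory cost when docs are not in key order.
        Map<String, Integer> lastMatch = new HashMap<>();
        for (int doc = 0; doc < urls.length; doc++) {
            if (containsWord(texts[doc], word)) {
                lastMatch.put(urls[doc], doc);
            }
        }
        List<Integer> hits = new ArrayList<>(lastMatch.values());
        Collections.sort(hits);
        return hits;
    }

    public static void main(String[] args) {
        String[] urls = {"a", "a", "a", "a"};
        String[] texts = {"lucene is", "lucene is", "lucene now", "lucene out"};
        System.out.println(dedupThenMatch(urls, texts, "now")); // doc 2 was already de-duped away
        System.out.println(matchThenDedup(urls, texts, "now")); // doc 2 is the last match for "a"
    }
}
```

With this sample data, the word "out" is found either way because it sits in the last duplicate, while "now" is only found when matching happens before de-duping.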

If your documents are stored in URL order (i.e. the URL is just the host part and all docs from
a site are held together) you could look at the PerParentLimitingQuery I created as part of
the NestedDocumentQuery package in LUCENE-2454. It is designed to return the top N docs for
a given parent (in this case, site). With some small modification it could return the last
child for a parent. Take a look at the JUnit example that gets the best n chapters for each
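
As a rough illustration of why key-ordered storage helps, the sketch below picks the last matching doc of each group in a single sequential pass with constant extra state. It is plain Java with hypothetical names (LastChildPerParentDemo, parents[], matches[]) and is not the PerParentLimitingQuery code; it assumes all docs sharing a parent key are contiguous.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: last matching doc per contiguous parent group. */
final class LastChildPerParentDemo {

    /**
     * parents[doc] is the grouping key (e.g. the site); docs of one parent are assumed contiguous.
     * matches[doc] says whether the doc satisfies the wrapped child query.
     */
    static List<Integer> lastMatchPerParent(String[] parents, boolean[] matches) {
        List<Integer> result = new ArrayList<>();
        int candidate = -1; // last matching doc seen in the current group, or -1
        for (int doc = 0; doc < parents.length; doc++) {
            boolean newGroup = doc > 0 && !parents[doc].equals(parents[doc - 1]);
            if (newGroup) {
                if (candidate >= 0) {
                    result.add(candidate); // emit the previous group's last match
                }
                candidate = -1;
            }
            if (matches[doc]) {
                candidate = doc;
            }
        }
        if (candidate >= 0) {
            result.add(candidate); // emit the final group's last match
        }
        return result;
    }
}
```

Because only one candidate is tracked at a time, no per-key map is needed, which is the saving that key-ordered documents make possible.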

On 31 May 2010, at 08:15, Паша Минченков <> wrote:

df (DuplicateFilter) is the second parameter in the method.
ScoreDoc[] hits = searcher.search(q, df, 1000).scoreDocs;

These variants don't return any hits either:
ScoreDoc[] hits = searcher.search(new FilteredQuery(tq, df), new
QueryWrapperFilter(new TermQuery(new Term("text", "now"))),
1000).scoreDocs;
ScoreDoc[] hits = searcher.search(new FilteredQuery(tq, new
QueryWrapperFilter(new TermQuery(new Term("text", "now")))), df,
1000).scoreDocs;

2010/5/31, Uwe Schindler <>:
Where is df (the DuplicateFilter) used in your code?

Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen

-----Original Message-----
From: Паша Минченков []
Sent: Monday, May 31, 2010 8:27 AM
Subject: DuplicateFilter question


Why doesn't DuplicateFilter work together with other filters? For example,
with a small modification of the DuplicateFilterTest test, I get the impression
that the filter is not combined with the other filters and trims the results first:

public void testKeepsLastFilter() throws Throwable {
	DuplicateFilter df = new DuplicateFilter(KEY_FIELD);

	// AND the original term query together with an extra text criterion.
	Query q = new ConstantScoreQuery(new ChainedFilter(new Filter[]{
		new QueryWrapperFilter(tq),
		// new QueryWrapperFilter(new TermQuery(new Term("text", "out"))), // works right, it is the last document.
		new QueryWrapperFilter(new TermQuery(new Term("text", "now"))) // why doesn't this work? It is the third document.
	}, ChainedFilter.AND));

	ScoreDoc[] hits = searcher.search(q, df, 1000).scoreDocs;

	assertTrue("Filtered searching should have found some matches", hits.length > 0);
	for (int i = 0; i < hits.length; i++) {
		Document d = searcher.doc(hits[i].doc);
		String url = d.get(KEY_FIELD);
		// Walk all docs sharing this URL to find the last one in the index.
		TermDocs td = reader.termDocs(new Term(KEY_FIELD, url));
		int lastDoc = 0;
		while (td.next()) {
			lastDoc = td.doc();
		}
		assertEquals("Duplicate urls should return last doc", lastDoc, hits[i].doc);
	}
}
Best regards,
Минченков Павел

