lucene-solr-user mailing list archives

From "Chris Morley" <>
Subject Near Duplicate Documents, "authorization"? tf/idf implications, spamming the index?
Date Mon, 15 Feb 2016 23:42:02 GMT
Hey Solr people:
 Suppose that we did not want to break up our document set into separate 
indexes, but had certain cases where many versions of a document were not 
relevant for certain searches.
 I guess this could be thought of as an "authorization" class of problem, 
although it is not that for us.  We have a few other fields that determine 
relevancy to the current query, based on what page the query is coming 
from.  It's kind of like authorization, but not really.
 Anyway, I think whatever solves this for authorization would solve it for 
our case too.
 So suppose you had 99 users and 100 documents.  Everybody sees the same 
Document 1, but each of the other 99 documents is a slightly different 
version of one document, unique to one of the 99 users, but not "very" 
unique.  Suppose, for instance, that the only difference in the text of 
the 99 versions is that each is watermarked with its user's name.  Aren't 
you spamming your tf/idf at that point?  Is there a way around this?  Is 
there a way to say, hey, group these 99 documents together and only count 
1 of them for tf/idf purposes?
 When doing queries, each user would only ever see 2 documents: Document 1, 
plus whichever other document they specifically own.
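 To make the concern concrete, here is a minimal sketch of the effect in 
Python.  It is not Solr code.  The document texts, the "group" field, and 
the smoothed idf formula (ln((N + 1) / (df + 1)) + 1) are all assumptions 
made up for illustration.  It compares the idf of a term computed over all 
100 documents with the idf computed after counting only 1 document per 
group.

```python
import math


def idf(term, docs):
    """Smoothed idf over a list of token sets (formula assumed for illustration)."""
    n = len(docs)
    df = sum(1 for tokens in docs if term in tokens)
    return math.log((n + 1) / (df + 1)) + 1.0


def tokens(doc):
    """Split a document's text into a set of whitespace-separated terms."""
    return set(doc["text"].split())


# Hypothetical corpus: Document 1 is shared; the other 99 documents are
# near duplicates that differ only in the watermarked user name.
shared = {"group": "doc1", "text": "shared handbook overview"}
variants = [
    {"group": "doc2", "text": f"quarterly report watermark user{i}"}
    for i in range(1, 100)
]
corpus = [shared] + variants

# Statistics over every document, near duplicates included.
raw = [tokens(d) for d in corpus]

# Statistics with one representative per group (first one seen wins).
representatives = {}
for d in corpus:
    representatives.setdefault(d["group"], d)
collapsed = [tokens(d) for d in representatives.values()]

# "report" occurs in all 99 near duplicates, "handbook" only in Document 1.
for term in ("report", "handbook"):
    print(term, round(idf(term, raw), 3), round(idf(term, collapsed), 3))
```

 In this toy setup "report" looks almost worthless as a search term when 
all 99 copies are counted, while "handbook" looks far rarer than it is 
from any single user's point of view.  Counting one document per group 
puts the two terms on equal footing.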
 If there are web pages or book chapters I can read or re-read that address 
this class of problem, those references would be great.
