lucene-solr-user mailing list archives

From Matt Mitchell <goodie...@gmail.com>
Subject Re: using score to find high confidence duplicates
Date Wed, 13 Oct 2010 20:42:23 GMT
No, this isn't MLT, just the standard query parser for now. I did
try the heuristic approach, and I might actually stick with that. I ran
the process on known duplicates and collected all of the resulting
scores, which let me see how well the query worked. The scores
clustered in one range, which is promising.
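
A minimal sketch of that heuristic, assuming the raw scores for one
target query are already available as a dict of document id to score.
The function names and the "keep 95% of known duplicates" rule are
illustrative assumptions, not something taken from Solr or from the
actual code used here:

```python
import math
from typing import Dict, List


def normalize_scores(raw_scores: Dict[str, float], target_id: str) -> Dict[str, float]:
    """Scale raw scores to 0-100, where the target matching itself is 100."""
    max_score = raw_scores[target_id]
    if max_score <= 0:
        raise ValueError("target self-match score must be positive")
    return {doc_id: 100.0 * score / max_score for doc_id, score in raw_scores.items()}


def threshold_from_known_duplicates(dup_scores: List[float],
                                    keep_fraction: float = 0.95) -> float:
    """Return a cutoff that at least `keep_fraction` of known duplicates meet.

    `dup_scores` are normalized (0-100) scores collected from pairs that
    are already known to be duplicates. The rule is a hypothetical
    heuristic; it has to be validated against known non-duplicates too.
    """
    if not dup_scores:
        raise ValueError("need at least one known-duplicate score")
    ordered = sorted(dup_scores, reverse=True)
    # Index of the lowest score that is still accepted.
    cutoff_index = max(0, math.ceil(keep_fraction * len(ordered)) - 1)
    return ordered[cutoff_index]


def accept_duplicates(normalized: Dict[str, float], target_id: str,
                      threshold: float) -> List[str]:
    """Return ids of similar documents at or above the threshold, excluding the target."""
    return [doc_id for doc_id, score in normalized.items()
            if doc_id != target_id and score >= threshold]
```

The cutoff only means something while the query structure, fields, and
boosts stay fixed, so it would need to be recomputed whenever any of
those change.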

I totally forgot about the de-duper; I'll have a look at that and see
if I can get it to work.

Thanks for your help,
Matt

On Wed, Oct 13, 2010 at 3:00 PM, Peter Karich <peathal@yahoo.de> wrote:
> Hi,
>
> Are you using MoreLikeThis for that feature?
> I have no suggestion for a reliable threshold. I think this depends
> on the domain you are operating in and is IMO only solvable with a heuristic.
> It also depends on fields, boosts, ...
> It could be that there is a 'score gap' between duplicates and
> non-duplicates
> that you can try to find, but I don't know.
>
> BTW: did you check: http://wiki.apache.org/solr/Deduplication
>
> If you need deduplication while querying, you could determine
> a hash value from the procedure above and index it into a separate field.
> Then you can use the collapse feature on that field to remove duplicates.
>
> Regards,
> Peter.
>
>> I have a Solr index full of documents that contain lots of duplicates.
>> The duplicates are not exact duplicates though. Each may vary slightly
>> in content.
>>
>> After indexing, I have a bit of code that loops through the entire
>> index just to get what I'm calling "target" documents. For each target
>> document, I then send another query to find similar documents to the
>> "target". This similarity query includes a clause to match the target
>> to itself, so I can have a normalized max score. This was the only way
>> I could figure out how to reasonably fix the scoring range. The
>> response always includes the target at the top, and similar documents
>> afterward. So I take the scores and scale to 0-100, where 100 is
>> always the target matching itself. So far so good...
>>
>> What I want to do is create a confidence score threshold, so I can
>> automatically accept similar documents that have a score above the
>> threshold. If my query *structure* never changes, but only the values
>> in the query change... is it possible to produce a reliable
>> "threshold" score that I could use?
>>
>> Hope this makes sense :)
>>
>> Matt
>>
>
>
> --
> http://jetwick.com twitter search prototype
>
>
