manifoldcf-user mailing list archives

From Koji Sekiguchi <k...@r.email.ne.jp>
Subject Re: Duplicate documents and MCF
Date Sun, 03 Jul 2011 00:25:15 GMT
> If I understand ManifoldCF correctly, a unique document is a document
> with a distinct URL such as
> http://www.example.org/foo/index.html
>
> Therefore I guess that MCF treats the following document as different
> compared to the example above:
> http://www.example.org/foo/
>
> After I did a huge crawl, I now have a lot of duplicate documents in my
> Solr index, and I'm not quite sure how to cope with this problem. I
> guess I have several options:
> 1) Give root urls a higher score. Then duplicates such as the first
> example above will be listed further down in the search result list.
> 2) Filter out index.html documents, but then I have no guarantee that
> the root URL has been indexed (in case links to the documents were
> only pointing to index.html).
> 3) Store a hashed value generated out of the documents' content in order
> to give them a unique id.

For 3), Solr has such a feature:

http://wiki.apache.org/solr/Deduplication
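
To illustrate the idea behind option 3, here is a minimal sketch of content-based deduplication in Python. It is not Solr's implementation (Solr computes the signature at index time through its SignatureUpdateProcessorFactory); the function names, the whitespace/case normalization, and the choice of MD5 are assumptions made for the example only.

```python
import hashlib
import re


def content_signature(text: str) -> str:
    """Return an MD5 hex digest of the text after simple normalization.

    Normalization here (collapse whitespace, lowercase) is an assumption
    for illustration; it only catches exact or near-exact copies.
    """
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def deduplicate(docs):
    """Keep the first (url, body) pair seen for each content signature."""
    seen = {}
    for url, body in docs:
        sig = content_signature(body)
        # setdefault leaves the first document stored for this signature
        seen.setdefault(sig, (url, body))
    return list(seen.values())


# Hypothetical crawl results: the first two URLs serve the same page.
docs = [
    ("http://www.example.org/foo/index.html", "Hello  World"),
    ("http://www.example.org/foo/", "hello world"),
    ("http://www.example.org/bar/", "Another page"),
]
unique_docs = deduplicate(docs)
```

With the signature used as the unique key, /foo/ and /foo/index.html collapse into a single entry as long as their content is identical after normalization. Which of the two URLs survives depends on processing order, so this does not by itself guarantee that the root URL is the one kept.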

koji
-- 
http://www.rondhuit.com/en/
