manifoldcf-user mailing list archives

From Koji Sekiguchi
Subject Re: Duplicate documents and MCF
Date Sun, 03 Jul 2011 00:25:15 GMT
> If I understand ManifoldCF correctly, a unique document is a document
> with a distinct URL such as
> Therefore I guess that MCF treats the following document as different
> compared to the example above:
> After I did a huge crawl, I now have a lot of duplicate documents in my
> Solr index, and I'm not quite sure how to cope with this problem. I
> guess I have several options:
> 1) Give root urls a higher score. Then duplicates such as the first
> example above will be listed further down in the search result list.
> 2) Filter out index.html documents, but then I do not have any guarantee
> that the root url has been indexed (in case links to the documents were
> only pointing to index.html).
> 3) Store a hashed value generated out of the documents' content in order
> to give them a unique id.

For 3), Solr has such a feature:


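To illustrate the idea behind option 3, here is a minimal sketch in Python that derives an id from a hash of each document's content, so that two URLs serving the same page collapse into one entry. This is not Solr's or ManifoldCF's implementation. The function names, the normalization step (collapse whitespace, lowercase), the choice of SHA-1, and the example.com URLs are all assumptions made for the example.

```python
import hashlib


def content_signature(text: str) -> str:
    """Return a hex digest of the content (hypothetical helper).

    Whitespace is collapsed and case is folded first, so trivially
    different copies of the same page get the same signature.
    """
    normalized = " ".join(text.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def deduplicate(docs):
    """Map each content signature to the first URL seen with that content.

    docs is an iterable of (url, content) pairs.
    """
    seen = {}
    for url, content in docs:
        # setdefault keeps the first URL and ignores later duplicates.
        seen.setdefault(content_signature(content), url)
    return seen


# Hypothetical crawl results: the first two URLs serve identical content.
crawled = [
    ("http://example.com/", "<html>Welcome</html>"),
    ("http://example.com/index.html", "<html>Welcome</html>"),
    ("http://example.com/about.html", "<html>About us</html>"),
]
unique = deduplicate(crawled)
```

Here the root URL wins only because it happens to come first in the list; a real setup would need its own rule for which duplicate to keep. An exact hash also only catches identical content after normalization, not near-duplicates.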