manifoldcf-user mailing list archives

From Arcadius Ahouansou <arcad...@menelic.com>
Subject Re: Content filltering/exclusion with MCF
Date Wed, 29 Apr 2015 21:01:27 GMT
Hello Karl.

I have checked the Simple History and I could see deletions.

I recently migrated my config to MCF 2.0.2 without migrating all of the
crawled data. That may be why Solr contains documents that lead to a
404.

Clearing my Solr index and resetting the crawler may help solve my problem.

On the other hand, some of the pages I am crawling display friendly messages
such as "The document you are looking for has expired" with a 200 HTTP
status instead of a 404.
How feasible would it be to exclude documents from the index based on
their content?
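
A minimal sketch of such a content-based check, under stated assumptions: SoftErrorFilter, isSoftError and shouldIndex are hypothetical names and not part of ManifoldCF, and the marker phrases are only the examples quoted in this thread. It shows the decision logic alone (status 200 plus a body free of "expired"-style messages), not how it would be wired into a connector or transformation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class SoftErrorFilter {
    // Hypothetical marker phrases, stored in lower case; adjust to the sites being crawled.
    private static final List<String> MARKERS = Arrays.asList(
        "the document you are looking for has expired",
        "this page has expired and has been removed",
        "not found");

    /** Returns true if the body contains any marker phrase, ignoring case. */
    public static boolean isSoftError(String body) {
        if (body == null) {
            return false;
        }
        String lower = body.toLowerCase(Locale.ROOT);
        for (String marker : MARKERS) {
            if (lower.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    /** Index only when the status is 200 and the body is not a soft error page. */
    public static boolean shouldIndex(int responseCode, String body) {
        return responseCode == 200 && !isSoftError(body);
    }

    public static void main(String[] args) {
        System.out.println(shouldIndex(200, "<h1>The document you are looking for has expired</h1>")); // false
        System.out.println(shouldIndex(200, "<p>Product catalogue</p>")); // true
        System.out.println(shouldIndex(404, "<p>Product catalogue</p>")); // false
    }
}
```

A plain substring match is the simplest option; a short phrase such as "not found" could also match legitimate pages, so longer, site-specific phrases are probably safer.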

Thank you very much.

Arcadius.



On 28 April 2015 at 12:18, Karl Wright <daddywri@gmail.com> wrote:

> Hi Arcadius,
>
> So, to be clear, the repository connection you are using is a web
> connection type?
>
> The web connector has the following code which should prevent indexing of
> any content that was received with a response code other than 200:
>
>       int responseCode = cache.getResponseCode(documentIdentifier);
>       if (responseCode != 200)
>       {
>         if (Logging.connectors.isDebugEnabled())
>           Logging.connectors.debug("Web: For document '"+documentIdentifier+"', not indexing because response code not indexable: "+responseCode);
>         errorCode = "RESPONSECODENOTINDEXABLE";
>         errorDesc = "HTTP response code not indexable ("+responseCode+")";
>         activities.noDocument(documentIdentifier,versionString);
>         return;
>       }
>
>
> You should indeed see these cases logged in the simple history and no
> document sent to Solr.  Is this not what you are seeing?
>
> Karl
>
>
> On Tue, Apr 28, 2015 at 7:01 AM, Arcadius Ahouansou <arcadius@menelic.com>
> wrote:
>
>>
>> Hello.
>>
>> I am using MCF 2.0.2 for crawling the web and ingesting data into Solr.
>>
>> MCF has ingested into Solr documents that returned an HTTP error, say
>> 401, 403 or 404, or that have certain content such as "this page has
>> expired and has been removed".
>>
>> The question is:
>> is there a way to tell MCF to ingest
>> - only documents not containing certain content such as "Not Found", or
>> - only documents whose response code is not 401, 403, 404, 500, ...
>>
>> Thank you very much.
>>
>> Arcadius.
>>
>
>


-- 
Arcadius Ahouansou
Menelic Ltd | Information is Power
M: 07908761999
W: www.menelic.com