manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Content filltering/exclusion with MCF
Date Wed, 29 Apr 2015 21:50:31 GMT
Hi Arcadius,

A feature like this is possible but could be very slow, since there's no
definite limit on the size of an HTML page.
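
To make the cost concern concrete, here is a minimal sketch of a content
check that reads only a capped prefix of the page body, so the work per
document stays bounded however large the page is. It is plain Java and not
part of the ManifoldCF API: the class name SoftErrorFilter, the method
looksLikeSoftError, the byte cap and the phrase list are all hypothetical,
UTF-8 is assumed, and where such a check would hook into a connector or
pipeline is not shown.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not a ManifoldCF class.
final class SoftErrorFilter {
  private SoftErrorFilter() {}

  /**
   * Returns true if any of the given phrases occurs (case-insensitively)
   * within the first maxBytes bytes of the stream. Bytes beyond the cap
   * are never read, which bounds the cost per document.
   */
  static boolean looksLikeSoftError(InputStream in, int maxBytes, List<String> phrases) {
    byte[] buf = new byte[maxBytes];
    int total = 0;
    try {
      // Fill the buffer until the cap is reached or the stream ends.
      while (total < maxBytes) {
        int n = in.read(buf, total, maxBytes - total);
        if (n < 0) {
          break;
        }
        total += n;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    // Assumption: the page is UTF-8 encoded; a real check would honour
    // the charset declared in the response.
    String head = new String(buf, 0, total, StandardCharsets.UTF_8).toLowerCase(Locale.ROOT);
    for (String phrase : phrases) {
      if (head.contains(phrase.toLowerCase(Locale.ROOT))) {
        return true;
      }
    }
    return false;
  }
}
```

The trade-off is that a marker phrase appearing after the cap is missed, so
the cap has to be chosen with the target site's page layout in mind.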

Karl


On Wed, Apr 29, 2015 at 5:01 PM, Arcadius Ahouansou <arcadius@menelic.com>
wrote:

>
> Hello Karl.
>
> I have checked the Simple History and I could see deletions.
>
> I have recently migrated my config to MCF 2.0.2 without migrating all
> crawled data. That may be the reason why I have documents in Solr that
> lead to 404.
>
> Clearing my Solr index and resetting the crawler may help solve my problem.
>
> On the other hand, some of the pages I am crawling display friendly
> messages such as "The document you are looking for has expired" with a 200
> HTTP status code instead of 404.
> How feasible would it be to exclude documents from the index based on the
> content of the document?
>
> Thank you very much.
>
> Arcadius.
>
>
>
> On 28 April 2015 at 12:18, Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Arcadius,
>>
>> So, to be clear, the repository connection you are using is a web
>> connection type?
>>
>> The web connector has the following code, which should prevent indexing of
>> any content that was received with a response code other than 200:
>>
>>       int responseCode = cache.getResponseCode(documentIdentifier);
>>       if (responseCode != 200)
>>       {
>>         if (Logging.connectors.isDebugEnabled())
>>           Logging.connectors.debug("Web: For document '"+documentIdentifier+"', not indexing because response code not indexable: "+responseCode);
>>         errorCode = "RESPONSECODENOTINDEXABLE";
>>         errorDesc = "HTTP response code not indexable ("+responseCode+")";
>>         activities.noDocument(documentIdentifier,versionString);
>>         return;
>>       }
>>
>>
>> You should indeed see these cases logged in the simple history and no
>> document sent to Solr.  Is this not what you are seeing?
>>
>> Karl
>>
>>
>> On Tue, Apr 28, 2015 at 7:01 AM, Arcadius Ahouansou <arcadius@menelic.com> wrote:
>>
>>>
>>> Hello.
>>>
>>> I am using MCF 2.0.2 for crawling the web and ingesting data into Solr.
>>>
>>> MCF has ingested into Solr documents that returned an HTTP error, say
>>> 401, 403 or 404, or that have certain content such as "this page has
>>> expired and has been removed".
>>>
>>> The question is:
>>> is there a way to tell MCF to ingest
>>> - only documents not containing certain content such as "Not Found", or
>>> - only documents excluding those with response code 401, 403, 404, 500, ...
>>>
>>> Thank you very much.
>>>
>>> Arcadius.
>>>
>>
>>
>
>
> --
> Arcadius Ahouansou
> Menelic Ltd | Information is Power
> M: 07908761999
> W: www.menelic.com
> ---
>
