manifoldcf-user mailing list archives

From Shinichiro Abe <shinichiro.ab...@gmail.com>
Subject Re: Treatment of protected files
Date Wed, 27 Apr 2011 10:00:03 GMT
Hi. Thank you for your reply.

It seems that Solr.ExtractingRequestHandler returns the same HTTP response
(SERVER_ERROR (500)) whenever an error occurs, whatever the cause.
I'll try to open a ticket for Solr.

Is it correct that MCF retries crawling a document when it receives a 500-level
response, but not when it receives a 400-level response?
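
As a rough illustration of the idea discussed in this thread (this is not ManifoldCF's actual connector code), the retry decision could be sketched as below. The class name, the Outcome enum, and the "TikaException" marker text are all hypothetical, and treating 4xx responses as "do not retry" is only the assumption behind the question above, not confirmed behavior.

```java
import java.util.Locale;

/**
 * Hypothetical sketch: decide whether a failed ingestion should be retried.
 * None of these names come from ManifoldCF or Solr.
 */
final class IngestionResponseClassifier {

    /** Hypothetical outcomes; not ManifoldCF's real status constants. */
    enum Outcome { ACCEPTED, REJECTED_NO_RETRY, RETRY_LATER }

    // Assumption: the Solr error body mentions the exception class name.
    private static final String TIKA_MARKER = "tikaexception";

    static Outcome classify(int httpStatus, String responseBody) {
        if (httpStatus >= 200 && httpStatus < 300) {
            return Outcome.ACCEPTED;
        }
        String body = responseBody == null
            ? ""
            : responseBody.toLowerCase(Locale.ROOT);
        // A 500 caused by an unreadable (e.g. password-protected) document
        // will fail the same way every time, so skip it instead of retrying.
        if (httpStatus >= 500 && body.contains(TIKA_MARKER)) {
            return Outcome.REJECTED_NO_RETRY;
        }
        // Assumption: client errors are not worth retrying.
        if (httpStatus >= 400 && httpStatus < 500) {
            return Outcome.REJECTED_NO_RETRY;
        }
        // Everything else is treated as a possibly transient failure.
        return Outcome.RETRY_LATER;
    }
}
```

With this sketch, a 500 response whose body mentions a TikaException would be skipped at once, while any other 500 would still be retried as a possible service interruption.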

Thank you.
Shinichiro Abe


On 2011/04/27, at 14:45, Karl Wright wrote:

> So the 500 error is occurring because Solr is throwing an exception at
> indexing time, is that correct?
> 
> If this is correct, then here's my take.  (1) A 500 error is a nasty
> error that Solr should not be returning under normal conditions.  (2)
> A password-protected PDF is not what I would consider exceptional, so
> Tika should not be throwing an exception when it sees it, merely (at
> worst) logging an error and continuing.  However, having said that,
> output connectors in ManifoldCF can make the decision to never retry
> the document, by returning a certain status, provided the connector
> can figure out that the error warrants this treatment.
> 
> My suggestion is therefore the following.  First, we should open a
> ticket for Solr about this.  Second, if you can see the error output
> from the Simple History for a TikaException being thrown in Solr, we
> can look for that text in the response from Solr and perhaps modify
> the Solr Connector to detect the case.  If you could open a ManifoldCF
> ticket and include that text I'd be very grateful.
> 
> Thanks!
> Karl
> 
> On Tue, Apr 26, 2011 at 10:53 PM, Shinichiro Abe
> <shinichiro.abe.1@gmail.com> wrote:
>> Hello.
>> 
>> There are PDF and Office files that are protected by a read password.
>> We do not need to read those files if we do not know their passwords.
>> 
>> Now, an MCF job crawls the filesystem repository and posts to Solr.
>> Ingestion of non-protected files succeeds,
>> but a protected file is not ingested successfully, and the job keeps
>> retrying it until the Retry Limit is exceeded.
>> During that time, a 500 result code is logged in the Simple History.
>> (Solr throws a TikaException, caused by PDFBox or Apache POI, because it
>> cannot read protected documents.)
>> 
>> When I ran that test with continuous crawling rather than a single crawl,
>> the job stopped partway through and logged the following:
>> Error: Repeated service interruptions - failure processing document: Ingestion HTTP error code 500
>> The job tried to crawl those files many times.
>> 
>> It seems that a job spends a lot of time and resources handling protected files.
>> So I want to find a way to skip those files quickly.
>> 
>> In my survey:
>> Hop filters are not relevant (right?).
>> Tika, PDFBox, and POI have mechanisms to decrypt protected files,
>> but each throws a different exception when given an invalid password.
>> One idea, whether or not it is feasible, is for Solr to return a different
>> result code when protected files are posted.
>> 
>> Do you have any ideas?
>> 
>> Regards,
>> Shinichiro Abe

