manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Treatment of protected files
Date Thu, 19 May 2011 10:00:10 GMT
I believe Shinichiro created a Solr ticket.

The question is whether the Solr 500 response has anything in its body
that could help ManifoldCF recognize a Tika exception.  If not, there
is little the Solr connector can do to detect this case.  The problem
is that you need to look in the Simple History to see what the
response actually is, and I don't think Shinichiro did that.

Karl

On Thu, May 19, 2011 at 4:42 AM, Erlend Garåsen <e.f.garasen@usit.uio.no> wrote:
>
> Do we have an MCF ticket for this issue yet? Or is it rather a Solr issue?
>
> I agree with Karl. We should look for a TikaException and then tell MCF to
> skip the affected documents. But maybe this should just be a temporary fix
> until it has been fixed in Solr Cell.
>
> Exactly the same happens if Tika cannot parse a document which it does not
> support. Solr/Solr Cell returns a 500 server error, causing MCF to retry
> over and over again:
> [2011-05-18 17:39:34.104] [] webapp=/solr path=/update/extract
> params={literal.id=http://foreninger.uio.no/akademikerne/Tillitsvalgte_i_akademikerforeninger_files/themedata.thmx}
> status=500 QTime=5
> [2011-05-18 17:39:39.102] {} 0 4
> [2011-05-18 17:39:39.103] org.apache.solr.common.SolrException:
> org.apache.tika.exception.TikaException: TIKA-418: RuntimeException while
> getting content for thmx and xps file types
>
> And finally, the job just aborts:
> Exception tossed: Repeated service interruptions - failure processing
> document: Ingestion HTTP error code 500
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service
> interruptions - failure processing document: Ingestion HTTP error code 500
>        at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:630)
> Caused by: org.apache.manifoldcf.core.interfaces.ManifoldCFException:
> Ingestion HTTP error code 500
>        at
> org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:1362)
>
> I guess I can find a workaround since I have created my own
> ExtractingRequestHandler in order to support language detection etc., but I
> think MCF should act differently when the underlying cause is a
> TikaException.
>
> Erlend
>
>
> On 27.04.11 12.25, Karl Wright wrote:
>>
>> If I recall, it treats the 400 response as meaning "this document
>> should be skipped", and it treats the 500 response as meaning "this
>> document should be retried because I have absolutely no idea what
>> happened".  However, we could modify the code for the 500 response to
>> look at the content of the response as well, and look for a string in
>> it that would give us a clue, such as "TikaException".  If we see a
>> TikaException, we could have it conclude "this document should be
>> skipped".  That was what I was thinking.
>>
>> Karl
>>
>> On Wed, Apr 27, 2011 at 6:00 AM, Shinichiro Abe
>> <shinichiro.abe.1@gmail.com>  wrote:
>>>
>>> Hi. Thank you for your reply.
>>>
>>> It seems that Solr.ExtractingRequestHandler returns the same HTTP
>>> response (SERVER_ERROR (500)) whenever any error occurs.
>>> I'll try to open a ticket for Solr.
>>>
>>> Is it correct that MCF retries crawling when it receives a 500-level
>>> response, but not when it receives a 400-level response?
>>>
>>> Thank you.
>>> Shinichiro Abe
>>>
>>>
>>> On 2011/04/27, at 14:45, Karl Wright wrote:
>>>
>>>> So the 500 error is occurring because Solr is throwing an exception at
>>>> indexing time, is that correct?
>>>>
>>>> If this is correct, then here's my take.  (1) A 500 error is a nasty
>>>> error that Solr should not be returning under normal conditions.  (2)
>>>> A password-protected PDF is not what I would consider exceptional, so
>>>> Tika should not be throwing an exception when it sees it, merely (at
>>>> worst) logging an error and continuing.  However, having said that,
>>>> output connectors in ManifoldCF can make the decision to never retry
>>>> the document, by returning a certain status, provided the connector
>>>> can figure out that the error warrants this treatment.
>>>>
>>>> My suggestion is therefore the following.  First, we should open a
>>>> ticket for Solr about this.  Second, if you can see the error output
>>>> from the Simple History for a TikaException being thrown in Solr, we
>>>> can look for that text in the response from Solr and perhaps modify
>>>> the Solr Connector to detect the case.  If you could open a ManifoldCF
>>>> ticket and include that text I'd be very grateful.
>>>>
>>>> Thanks!
>>>> Karl
>>>>
>>>> On Tue, Apr 26, 2011 at 10:53 PM, Shinichiro Abe
>>>> <shinichiro.abe.1@gmail.com>  wrote:
>>>>>
>>>>> Hello.
>>>>>
>>>>> There are PDF and Office files that are protected by a read password.
>>>>> We do not need to read those files if we do not know their
>>>>> passwords.
>>>>>
>>>>> Now, an MCF job starts to crawl the filesystem repository and posts to
>>>>> Solr.
>>>>> Document ingestion of non-protected files completes successfully,
>>>>> but ingestion of a protected file does not succeed, and the job keeps
>>>>> processing it until the Retry Limit is exceeded.
>>>>> During that time, it logs a 500 result code in the Simple History.
>>>>> (Solr throws a TikaException caused by PDFBox or Apache POI because
>>>>> it cannot read protected documents.)
>>>>>
>>>>> When I ran that test with continuous crawling, rather than a single
>>>>> crawl,
>>>>> the job stopped halfway and logged the following:
>>>>> Error: Repeated service interruptions - failure processing document:
>>>>> Ingestion HTTP error code 500
>>>>> The job tried to crawl those files many times.
>>>>>
>>>>> It seems that a job spends a lot of time and resources handling
>>>>> protected files.
>>>>> So I want to find a way to skip those files quickly.
>>>>>
>>>>> In my survey:
>>>>> Hop filters are not relevant (right?).
>>>>> Tika, PDFBox, and POI have mechanisms to decrypt protected
>>>>> files,
>>>>> but each throws a different exception when given an invalid
>>>>> password.
>>>>> One idea, leaving aside whether it is feasible, is for Solr to
>>>>> return a different result code when protected files are posted.
>>>>>
>>>>> Do you have any ideas?
>>>>>
>>>>> Regards,
>>>>> Shinichiro Abe
>>>
>>>
>
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>
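The handling proposed in this thread (skip on a 400 response, retry on a 500
response unless the response body mentions a TikaException) could be sketched
roughly as follows. This is only an illustration, not the Solr connector's
actual code: the class IngestionResponseClassifier, the Action enum, and the
classify method are hypothetical names. It also assumes that the body of the
500 response really contains the text "TikaException", which nobody in the
thread has yet confirmed from the Simple History.

```java
/** Hypothetical helper for illustration; not part of ManifoldCF. */
final class IngestionResponseClassifier {

    /** Hypothetical outcomes an output connector might act on. */
    enum Action { ACCEPT, SKIP, RETRY }

    /**
     * Marker text suggested in the thread. Whether Solr actually puts it
     * in the 500 response body is an unverified assumption.
     */
    static final String TIKA_MARKER = "TikaException";

    /** Decide what to do with a document from Solr's status code and body. */
    static Action classify(int statusCode, String responseBody) {
        if (statusCode >= 200 && statusCode < 300) {
            // Assumption: any 2xx response means the document was indexed.
            return Action.ACCEPT;
        }
        if (statusCode == 400) {
            // Per the thread: 400 means "this document should be skipped".
            return Action.SKIP;
        }
        if (statusCode == 500
                && responseBody != null
                && responseBody.contains(TIKA_MARKER)) {
            // Proposed change: a parse failure will not go away on retry.
            return Action.SKIP;
        }
        // Anything else is treated as a possibly transient failure.
        return Action.RETRY;
    }

    public static void main(String[] args) {
        String tikaBody = "org.apache.solr.common.SolrException: "
                + "org.apache.tika.exception.TikaException: TIKA-418";
        System.out.println(classify(500, tikaBody));
        System.out.println(classify(500, "Internal Server Error"));
        System.out.println(classify(400, ""));
    }
}
```

With this arrangement a 500 response without the marker still leads to a
retry, so genuine service interruptions would be handled as they are today.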
