manifoldcf-user mailing list archives

From Erlend Garåsen <e.f.gara...@usit.uio.no>
Subject Re: Treatment of protected files
Date Thu, 19 May 2011 08:42:51 GMT

Do we have an MCF ticket for this issue yet? Or is it rather a Solr issue?

I agree with Karl. We should look for a TikaException and then tell MCF 
to skip the affected documents. But maybe this should just be a temporary 
fix until it has been fixed in Solr Cell.

Exactly the same happens if Tika is given a document in a format it does 
not support. Solr/Solr Cell returns a 500 server error, causing MCF to 
retry over and over again:
[2011-05-18 17:39:34.104] [] webapp=/solr path=/update/extract 
params={literal.id=http://foreninger.uio.no/akademikerne/Tillitsvalgte_i_akademikerforeninger_files/themedata.thmx} status=500 QTime=5
[2011-05-18 17:39:39.102] {} 0 4
[2011-05-18 17:39:39.103] org.apache.solr.common.SolrException: 
org.apache.tika.exception.TikaException: TIKA-418: RuntimeException 
while getting content for thmx and xps file types

And finally, the job just aborts:
Exception tossed: Repeated service interruptions - failure processing 
document: Ingestion HTTP error code 500
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated 
service interruptions - failure processing document: Ingestion HTTP 
error code 500
	at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:630)
Caused by: org.apache.manifoldcf.core.interfaces.ManifoldCFException: 
Ingestion HTTP error code 500
	at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:1362)

I guess I can find a workaround since I have created my own 
ExtractingRequestHandler in order to support language detection etc., 
but I think MCF should act differently when the underlying cause is a 
TikaException.
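
To make the proposal concrete, here is a minimal sketch of the decision 
logic discussed in this thread. It is not the actual Solr connector code: 
the class IngestResponseClassifier, the Action enum, and the classify 
method are hypothetical names made up for illustration, and the rule that 
every 4xx response means "skip" is an assumption based on Karl's 
recollection quoted below.

```java
public class IngestResponseClassifier {

    /** What the output connector should do with the document (hypothetical type). */
    public enum Action { ACCEPT, SKIP, RETRY }

    // Assumption: the exception class name appears in the body of the
    // 500 response, as it does in the Solr log excerpt above.
    private static final String TIKA_MARKER = "TikaException";

    /**
     * Maps an HTTP status and response body to an action.
     * 2xx: document accepted. 4xx: skip the document.
     * 5xx mentioning a TikaException: skip, since retrying cannot help.
     * Anything else: retry, because the cause is unknown.
     */
    public static Action classify(int httpStatus, String responseBody) {
        if (httpStatus >= 200 && httpStatus < 300) {
            return Action.ACCEPT;
        }
        if (httpStatus >= 400 && httpStatus < 500) {
            return Action.SKIP;
        }
        if (httpStatus >= 500 && responseBody != null
                && responseBody.contains(TIKA_MARKER)) {
            return Action.SKIP;
        }
        return Action.RETRY;
    }

    public static void main(String[] args) {
        String body = "org.apache.solr.common.SolrException: "
                + "org.apache.tika.exception.TikaException: TIKA-418";
        // A 500 caused by Tika is skipped; an unexplained 500 is retried.
        System.out.println(classify(500, body));
        System.out.println(classify(500, "Service temporarily unavailable"));
    }
}
```

Matching on a string in the response body is fragile, which is why I 
would still treat it as a temporary measure until Solr Cell returns a 
more specific status for parse failures.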

Erlend


On 27.04.11 12.25, Karl Wright wrote:
> If I recall, it treats the 400 response as meaning "this document
> should be skipped", and it treats the 500 response as meaning "this
> document should be retried because I have absolutely no idea what
> happened".  However, we could modify the code for the 500 response to
> look at the content of the response as well, and look for a string in
> it that would give us a clue, such as "TikaException".  If we see a
> TikaException, we could have it conclude "this document should be
> skipped".  That was what I was thinking.
>
> Karl
>
> On Wed, Apr 27, 2011 at 6:00 AM, Shinichiro Abe
> <shinichiro.abe.1@gmail.com>  wrote:
>> Hi. Thank you for your reply.
>>
>> It seems that Solr's ExtractingRequestHandler returns the same HTTP response (SERVER_ERROR (500)) whenever an error occurs.
>> I'll try to open a ticket for Solr.
>>
>> Is it correct that MCF retries crawling when it receives a 500-level response, but not a 400-level response?
>>
>> Thank you.
>> Shinichiro Abe
>>
>>
>> On 2011/04/27, at 14:45, Karl Wright wrote:
>>
>>> So the 500 error is occurring because Solr is throwing an exception at
>>> indexing time, is that correct?
>>>
>>> If this is correct, then here's my take.  (1) A 500 error is a nasty
>>> error that Solr should not be returning under normal conditions.  (2)
>>> A password-protected PDF is not what I would consider exceptional, so
>>> Tika should not be throwing an exception when it sees it, merely (at
>>> worst) logging an error and continuing.  However, having said that,
>>> output connectors in ManifoldCF can make the decision to never retry
>>> the document, by returning a certain status, provided the connector
>>> can figure out that the error warrants this treatment.
>>>
>>> My suggestion is therefore the following.  First, we should open a
>>> ticket for Solr about this.  Second, if you can see the error output
>>> from the Simple History for a TikaException being thrown in Solr, we
>>> can look for that text in the response from Solr and perhaps modify
>>> the Solr Connector to detect the case.  If you could open a ManifoldCF
>>> ticket and include that text I'd be very grateful.
>>>
>>> Thanks!
>>> Karl
>>>
>>> On Tue, Apr 26, 2011 at 10:53 PM, Shinichiro Abe
>>> <shinichiro.abe.1@gmail.com>  wrote:
>>>> Hello.
>>>>
>>>> There are PDF and Office files that are protected by a read password.
>>>> We do not need to read those files if we do not know their passwords.
>>>>
>>>> Now, an MCF job starts to crawl the filesystem repository and post to Solr.
>>>> Document ingestion of non-protected files completes successfully,
>>>> but a protected file is never ingested successfully, and the job keeps processing it until it goes beyond the Retry Limit.
>>>> During that time, it logs a 500 result code in the Simple History.
>>>> (Solr throws a TikaException caused by PDFBox or Apache POI because it cannot read protected documents.)
>>>>
>>>> When I ran that test with continuous crawling, not a simple one-time crawl,
>>>> the job stopped halfway and logged the following:
>>>> Error: Repeated service interruptions - failure processing document: Ingestion HTTP error code 500
>>>> The job tried to crawl those files many times.
>>>>
>>>> It seems that a job spends a lot of time and resources on handling protected files.
>>>> So I want to find a way to skip those files quickly.
>>>>
>>>> In my survey:
>>>> Hop filters are not relevant. (Right?)
>>>> Tika, PDFBox, and POI have mechanisms to decrypt protected files,
>>>> but each throws a different exception when given an invalid password.
>>>> One idea that occurs to me is for Solr to return a different result code when protected files are posted,
>>>> setting aside whether that is feasible.
>>>>
>>>> Do you have any ideas?
>>>>
>>>> Regards,
>>>> Shinichiro Abe
>>
>>


-- 
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
