Message-ID: <4DD52227.7080502@usit.uio.no>
Date: Thu, 19 May 2011 15:59:03 +0200
From: Erlend Garåsen
To: connectors-user@incubator.apache.org
Subject: Re: Treatment of protected files

Sure, I can test it tomorrow, unfortunately not right now. I'm leaving
my office in 20 minutes, but I have plenty of time tomorrow.

Erlend

On 19.05.11 14.39, Karl Wright wrote:
> I've also checked in the proposed change, if you care to try it.
> We're having network issues here this morning so I can't seem to
> update the ticket though.
>
> Karl
>
> On Thu, May 19, 2011 at 8:35 AM, Karl Wright wrote:
>> CONNECTORS-200 is the ticket.
>> Karl
>>
>> On Thu, May 19, 2011 at 8:04 AM, Karl Wright wrote:
>>> This should be enough.
>>>
>>> I'll open a ticket. The changes to the Solr connector are trivial; I
>>> can do them and check them in, if someone is willing to try it out
>>> for real.
>>>
>>> Karl
>>>
>>> On Thu, May 19, 2011 at 6:11 AM, Erlend Garåsen wrote:
>>>>
>>>> Here's what I found in my simple history logs:
>>>> org.apache.tika.exception.TikaException: TIKA-418: RuntimeException
>>>> while getting content for thmx and xps file types
>>>>
>>>> So, yes, Tika exceptions are stored in the MCF logs, so I guess it
>>>> should be possible to find a workaround for this.
>>>>
>>>> Erlend
>>>>
>>>> On 19.05.11 12.00, Karl Wright wrote:
>>>>>
>>>>> There was a Solr ticket created, I believe by Shinichiro.
>>>>>
>>>>> The question is whether the Solr 500 response has anything in its
>>>>> body that could help ManifoldCF recognize a Tika exception. If not,
>>>>> there is little the Solr connector can do to detect this case. The
>>>>> problem is that you need to look in the Simple History to see what
>>>>> the response actually is, and I don't think Shinichiro did that.
>>>>>
>>>>> Karl
>>>>>
>>>>> On Thu, May 19, 2011 at 4:42 AM, Erlend Garåsen wrote:
>>>>>>
>>>>>> Do we have an MCF ticket for this issue yet? Or is it rather a
>>>>>> Solr issue?
>>>>>>
>>>>>> I agree with Karl. We should look for a TikaException and then
>>>>>> tell MCF to skip the affected documents. But maybe this should
>>>>>> just be a temporary fix until it has been fixed in Solr Cell.
>>>>>>
>>>>>> Exactly the same happens if Tika cannot parse a document which it
>>>>>> does not support.
>>>>>> Solr/Solr Cell returns a 500 server error, causing MCF to retry
>>>>>> over and over again:
>>>>>>
>>>>>> [2011-05-18 17:39:34.104] [] webapp=/solr path=/update/extract
>>>>>> params={literal.id=http://foreninger.uio.no/akademikerne/Tillitsvalgte_i_akademikerforeninger_files/themedata.thmx}
>>>>>> status=500 QTime=5
>>>>>> [2011-05-18 17:39:39.102] {} 0 4
>>>>>> [2011-05-18 17:39:39.103] org.apache.solr.common.SolrException:
>>>>>> org.apache.tika.exception.TikaException: TIKA-418: RuntimeException
>>>>>> while getting content for thmx and xps file types
>>>>>>
>>>>>> And finally, the job just aborts:
>>>>>> Exception tossed: Repeated service interruptions - failure
>>>>>> processing document: Ingestion HTTP error code 500
>>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated
>>>>>> service interruptions - failure processing document: Ingestion HTTP
>>>>>> error code 500
>>>>>>   at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:630)
>>>>>> Caused by: org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>>>>>> Ingestion HTTP error code 500
>>>>>>   at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:1362)
>>>>>>
>>>>>> I guess I can find a workaround since I have created my own
>>>>>> ExtractingRequestHandler in order to support language detection
>>>>>> etc., but I think MCF should act differently when the underlying
>>>>>> cause is a TikaException.
>>>>>>
>>>>>> Erlend
>>>>>>
>>>>>> On 27.04.11 12.25, Karl Wright wrote:
>>>>>>>
>>>>>>> If I recall, it treats the 400 response as meaning "this document
>>>>>>> should be skipped", and it treats the 500 response as meaning
>>>>>>> "this document should be retried because I have absolutely no
>>>>>>> idea what happened".
>>>>>>> However, we could modify the code for the 500 response to look
>>>>>>> at the content of the response as well, and look for a string in
>>>>>>> it that would give us a clue, such as "TikaException". If we see
>>>>>>> a TikaException, we could have it conclude "this document should
>>>>>>> be skipped". That was what I was thinking.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Wed, Apr 27, 2011 at 6:00 AM, Shinichiro Abe wrote:
>>>>>>>>
>>>>>>>> Hi. Thank you for your reply.
>>>>>>>>
>>>>>>>> It seems that Solr's ExtractingRequestHandler returns the same
>>>>>>>> HTTP response (SERVER_ERROR, 500) any time an error occurs.
>>>>>>>> I'll try to open a ticket for Solr.
>>>>>>>>
>>>>>>>> Is it correct that MCF retries crawling when it receives a
>>>>>>>> 500-level response, but not a 400-level response?
>>>>>>>>
>>>>>>>> Thank you.
>>>>>>>> Shinichiro Abe
>>>>>>>>
>>>>>>>> On 2011/04/27, at 14:45, Karl Wright wrote:
>>>>>>>>
>>>>>>>>> So the 500 error is occurring because Solr is throwing an
>>>>>>>>> exception at indexing time, is that correct?
>>>>>>>>>
>>>>>>>>> If this is correct, then here's my take. (1) A 500 error is a
>>>>>>>>> nasty error that Solr should not be returning under normal
>>>>>>>>> conditions. (2) A password-protected PDF is not what I would
>>>>>>>>> consider exceptional, so Tika should not be throwing an
>>>>>>>>> exception when it sees it, merely (at worst) logging an error
>>>>>>>>> and continuing. However, having said that, output connectors in
>>>>>>>>> ManifoldCF can make the decision to never retry the document,
>>>>>>>>> by returning a certain status, provided the connector can
>>>>>>>>> figure out that the error warrants this treatment.
>>>>>>>>>
>>>>>>>>> My suggestion is therefore the following. First, we should open
>>>>>>>>> a ticket for Solr about this.
>>>>>>>>> Second, if you can see the error output from the Simple History
>>>>>>>>> for a TikaException being thrown in Solr, we can look for that
>>>>>>>>> text in the response from Solr and perhaps modify the Solr
>>>>>>>>> connector to detect the case. If you could open a ManifoldCF
>>>>>>>>> ticket and include that text I'd be very grateful.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Tue, Apr 26, 2011 at 10:53 PM, Shinichiro Abe wrote:
>>>>>>>>>>
>>>>>>>>>> Hello.
>>>>>>>>>>
>>>>>>>>>> There are PDF and Office files that are protected by a read
>>>>>>>>>> password. We cannot read those files if we do not know their
>>>>>>>>>> password.
>>>>>>>>>>
>>>>>>>>>> Now, an MCF job starts to crawl the filesystem repository and
>>>>>>>>>> posts to Solr. Ingestion of the non-protected files succeeds,
>>>>>>>>>> but ingestion of a protected file does not; the job keeps
>>>>>>>>>> processing it until the retry limit is exceeded.
>>>>>>>>>> During that time, it logs a 500 result code in the Simple
>>>>>>>>>> History. (Solr throws a TikaException, caused by PDFBox or
>>>>>>>>>> Apache POI, because it cannot read protected documents.)
>>>>>>>>>>
>>>>>>>>>> When I ran that test with continuous crawling, not with a
>>>>>>>>>> simple one-time crawl, the job stopped halfway and logged the
>>>>>>>>>> following:
>>>>>>>>>> Error: Repeated service interruptions - failure processing
>>>>>>>>>> document: Ingestion HTTP error code 500
>>>>>>>>>> The job tried to crawl those files many times.
>>>>>>>>>>
>>>>>>>>>> It seems that a job spends a lot of time and resources on
>>>>>>>>>> protected files, so I want to find a way to skip them quickly.
>>>>>>>>>>
>>>>>>>>>> In my survey:
>>>>>>>>>> Hop filters are not relevant (right?).
>>>>>>>>>> Then Tika, PDFBox, and POI each have a mechanism to decrypt
>>>>>>>>>> protected files, but each throws a different exception when
>>>>>>>>>> given an invalid password.
>>>>>>>>>> One idea, whether or not it is feasible: Solr could return a
>>>>>>>>>> different result code when protected files are posted.
>>>>>>>>>>
>>>>>>>>>> Do you have any ideas?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Shinichiro Abe
>>>>>>
>>>>>> --
>>>>>> Erlend Garåsen
>>>>>> Center for Information Technology Services
>>>>>> University of Oslo
>>>>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>>>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
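[Archive note] Karl's proposed workaround in the thread above, namely inspecting the body of a Solr 500 response for the text "TikaException" and treating a match as a permanent skip rather than a retryable error, can be sketched roughly as follows. This is an illustrative sketch only: the class name `IngestDecision` and method `shouldSkip` are hypothetical and are not the actual ManifoldCF Solr connector API (the real change was tracked as CONNECTORS-200).

```java
// Hypothetical sketch of the skip-vs-retry decision discussed in the
// thread. Names here are illustrative, not real ManifoldCF APIs.
public class IngestDecision {

    /**
     * Decide whether a failed Solr ingestion should be skipped
     * permanently instead of retried.
     */
    public static boolean shouldSkip(int httpStatus, String responseBody) {
        // A 4xx response already means "skip this document" in the
        // behavior Karl describes above.
        if (httpStatus >= 400 && httpStatus < 500) {
            return true;
        }
        // A 500 normally means "retry", but if the body reveals a Tika
        // parsing failure (e.g. a password-protected PDF), retrying can
        // never succeed, so skip instead.
        if (httpStatus == 500 && responseBody != null
                && responseBody.contains("TikaException")) {
            return true;
        }
        // Anything else: fall back to the existing retry behavior.
        return false;
    }
}
```

With this policy, the TIKA-418 failure quoted in Erlend's log (status 500 with `org.apache.tika.exception.TikaException` in the body) would be skipped on the first attempt instead of aborting the job after repeated service interruptions.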