Message-ID: <4DD52227.7080502@usit.uio.no>
Date: Thu, 19 May 2011 15:59:03 +0200
From: Erlend Garåsen
To: connectors-user@incubator.apache.org
Subject: Re: Treatment of protected files

Sure, I can test it tomorrow, unfortunately not right now. I'm leaving
my office in 20 minutes, but I have plenty of time tomorrow.

Erlend

On 19.05.11 14.39, Karl Wright wrote:
> I've also checked in the proposed change, if you care to try it.
> We're having network issues here this morning so I can't seem to
> update the ticket though.
>
> Karl
>
> On Thu, May 19, 2011 at 8:35 AM, Karl Wright wrote:
>> CONNECTORS-200 is the ticket.
>> Karl
>>
>> On Thu, May 19, 2011 at 8:04 AM, Karl Wright wrote:
>>> This should be enough.
>>>
>>> I'll open a ticket. The changes to the Solr connector are trivial; I
>>> can do them and check them in, if someone is willing to try it out
>>> for real.
>>>
>>> Karl
>>>
>>> On Thu, May 19, 2011 at 6:11 AM, Erlend Garåsen wrote:
>>>>
>>>> Here's what I found in my simple history logs:
>>>> org.apache.tika.exception.TikaException: TIKA-418: RuntimeException
>>>> while getting content for thmx and xps file types
>>>>
>>>> So, yes, Tika exceptions are stored in the MCF logs, so I guess it
>>>> should be possible to find a workaround for this.
>>>>
>>>> Erlend
>>>>
>>>> On 19.05.11 12.00, Karl Wright wrote:
>>>>>
>>>>> There was a Solr ticket created, I believe by Shinichiro.
>>>>>
>>>>> The question is whether the Solr 500 response has anything in its
>>>>> body that could help ManifoldCF recognize a Tika exception. If not,
>>>>> there is little the Solr connector can do to detect this case. The
>>>>> problem is that you need to look in the Simple History to see what
>>>>> the response actually is, and I don't think Shinichiro did that.
>>>>>
>>>>> Karl
>>>>>
>>>>> On Thu, May 19, 2011 at 4:42 AM, Erlend Garåsen wrote:
>>>>>>
>>>>>> Do we have an MCF ticket for this issue yet? Or is it rather a
>>>>>> Solr issue?
>>>>>>
>>>>>> I agree with Karl. We should look for a TikaException and then
>>>>>> tell MCF to skip the affected documents. But maybe this should
>>>>>> just be a temporary fix until it has been fixed in Solr Cell.
>>>>>>
>>>>>> Exactly the same happens if Tika cannot parse a document which it
>>>>>> does not support.
>>>>>> Solr/Solr Cell returns a 500 server error, causing MCF to retry
>>>>>> over and over again:
>>>>>>
>>>>>> [2011-05-18 17:39:34.104] [] webapp=/solr path=/update/extract
>>>>>> params={literal.id=http://foreninger.uio.no/akademikerne/Tillitsvalgte_i_akademikerforeninger_files/themedata.thmx}
>>>>>> status=500 QTime=5
>>>>>> [2011-05-18 17:39:39.102] {} 0 4
>>>>>> [2011-05-18 17:39:39.103] org.apache.solr.common.SolrException:
>>>>>> org.apache.tika.exception.TikaException: TIKA-418: RuntimeException
>>>>>> while getting content for thmx and xps file types
>>>>>>
>>>>>> And finally, the job just aborts:
>>>>>> Exception tossed: Repeated service interruptions - failure
>>>>>> processing document: Ingestion HTTP error code 500
>>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated
>>>>>> service interruptions - failure processing document: Ingestion HTTP
>>>>>> error code 500
>>>>>>   at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:630)
>>>>>> Caused by: org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>>>>>> Ingestion HTTP error code 500
>>>>>>   at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:1362)
>>>>>>
>>>>>> I guess I can find a workaround since I have created my own
>>>>>> ExtractingRequestHandler in order to support language detection
>>>>>> etc., but I think MCF should act differently when the underlying
>>>>>> cause is a TikaException.
>>>>>>
>>>>>> Erlend
>>>>>>
>>>>>> On 27.04.11 12.25, Karl Wright wrote:
>>>>>>>
>>>>>>> If I recall, it treats the 400 response as meaning "this document
>>>>>>> should be skipped", and it treats the 500 response as meaning
>>>>>>> "this document should be retried because I have absolutely no
>>>>>>> idea what happened".
>>>>>>> However, we could modify the code for the 500 response to look
>>>>>>> at the content of the response as well, and look for a string in
>>>>>>> it that would give us a clue, such as "TikaException". If we see
>>>>>>> a TikaException, we could have it conclude "this document should
>>>>>>> be skipped". That was what I was thinking.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Wed, Apr 27, 2011 at 6:00 AM, Shinichiro Abe wrote:
>>>>>>>>
>>>>>>>> Hi. Thank you for your reply.
>>>>>>>>
>>>>>>>> It seems that Solr's ExtractingRequestHandler returns the same
>>>>>>>> HTTP response (SERVER_ERROR, 500) any time an error occurs.
>>>>>>>> I'll try to open a ticket for Solr.
>>>>>>>>
>>>>>>>> Is it correct that MCF retries crawling when it receives a
>>>>>>>> 500-level response, but not a 400-level response?
>>>>>>>>
>>>>>>>> Thank you.
>>>>>>>> Shinichiro Abe
>>>>>>>>
>>>>>>>> On 2011/04/27, at 14:45, Karl Wright wrote:
>>>>>>>>
>>>>>>>>> So the 500 error is occurring because Solr is throwing an
>>>>>>>>> exception at indexing time, is that correct?
>>>>>>>>>
>>>>>>>>> If this is correct, then here's my take. (1) A 500 error is a
>>>>>>>>> nasty error that Solr should not be returning under normal
>>>>>>>>> conditions. (2) A password-protected PDF is not what I would
>>>>>>>>> consider exceptional, so Tika should not be throwing an
>>>>>>>>> exception when it sees it, merely (at worst) logging an error
>>>>>>>>> and continuing. However, having said that, output connectors in
>>>>>>>>> ManifoldCF can make the decision to never retry the document,
>>>>>>>>> by returning a certain status, provided the connector can
>>>>>>>>> figure out that the error warrants this treatment.
>>>>>>>>>
>>>>>>>>> My suggestion is therefore the following. First, we should open
>>>>>>>>> a ticket for Solr about this.
>>>>>>>>> Second, if you can see the error output from the Simple History
>>>>>>>>> for a TikaException being thrown in Solr, we can look for that
>>>>>>>>> text in the response from Solr and perhaps modify the Solr
>>>>>>>>> connector to detect the case. If you could open a ManifoldCF
>>>>>>>>> ticket and include that text I'd be very grateful.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Tue, Apr 26, 2011 at 10:53 PM, Shinichiro Abe wrote:
>>>>>>>>>>
>>>>>>>>>> Hello.
>>>>>>>>>>
>>>>>>>>>> There are PDF and Office files that are protected by a read
>>>>>>>>>> password. We cannot read those files if we do not know their
>>>>>>>>>> password.
>>>>>>>>>>
>>>>>>>>>> Now, an MCF job starts to crawl the filesystem repository and
>>>>>>>>>> posts to Solr. Ingestion of the non-protected files succeeds,
>>>>>>>>>> but ingestion of a protected file does not; the job keeps
>>>>>>>>>> processing it until the retry limit is exceeded.
>>>>>>>>>> During that time, it logs a 500 result code in the Simple
>>>>>>>>>> History. (Solr throws a TikaException, caused by PDFBox or
>>>>>>>>>> Apache POI, because it cannot read protected documents.)
>>>>>>>>>>
>>>>>>>>>> When I ran that test with continuous crawling, not with a
>>>>>>>>>> simple one-time crawl, the job stopped halfway and logged the
>>>>>>>>>> following:
>>>>>>>>>> Error: Repeated service interruptions - failure processing
>>>>>>>>>> document: Ingestion HTTP error code 500
>>>>>>>>>> The job tried to crawl those files many times.
>>>>>>>>>>
>>>>>>>>>> It seems that a job spends a lot of time and resources on
>>>>>>>>>> protected files, so I want to find a way to skip them quickly.
>>>>>>>>>>
>>>>>>>>>> In my survey:
>>>>>>>>>> Hop filters are not relevant (right?).
>>>>>>>>>> Then Tika, PDFBox, and POI each have a mechanism to decrypt
>>>>>>>>>> protected files, but each throws a different exception when
>>>>>>>>>> given an invalid password.
>>>>>>>>>> One idea, whether or not it is feasible: Solr could return a
>>>>>>>>>> different result code when protected files are posted.
>>>>>>>>>>
>>>>>>>>>> Do you have any ideas?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Shinichiro Abe
>>>>>>
>>>>>> --
>>>>>> Erlend Garåsen
>>>>>> Center for Information Technology Services
>>>>>> University of Oslo
>>>>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>>>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
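[Archive note] Karl's proposed workaround in the thread above, namely inspecting the body of a Solr 500 response for the text "TikaException" and treating a match as a permanent skip rather than a retryable error, can be sketched roughly as follows. This is an illustrative sketch only: the class name `IngestDecision` and method `shouldSkip` are hypothetical and are not the actual ManifoldCF Solr connector API (the real change was tracked as CONNECTORS-200).

```java
// Hypothetical sketch of the skip-vs-retry decision discussed in the
// thread. Names here are illustrative, not real ManifoldCF APIs.
public class IngestDecision {

    /**
     * Decide whether a failed Solr ingestion should be skipped
     * permanently instead of retried.
     */
    public static boolean shouldSkip(int httpStatus, String responseBody) {
        // A 4xx response already means "skip this document" in the
        // behavior Karl describes above.
        if (httpStatus >= 400 && httpStatus < 500) {
            return true;
        }
        // A 500 normally means "retry", but if the body reveals a Tika
        // parsing failure (e.g. a password-protected PDF), retrying can
        // never succeed, so skip instead.
        if (httpStatus == 500 && responseBody != null
                && responseBody.contains("TikaException")) {
            return true;
        }
        // Anything else: fall back to the existing retry behavior.
        return false;
    }
}
```

With this policy, the TIKA-418 failure quoted in Erlend's log (status 500 with `org.apache.tika.exception.TikaException` in the body) would be skipped on the first attempt instead of aborting the job after repeated service interruptions.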