lucene-dev mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: [jira] [Commented] (SOLR-2480) Text extraction of password protected files
Date Thu, 28 Apr 2011 12:24:30 GMT
Hmmm, I'm not sure whether this fits into SOLR-445 or not. Could you add
this comment to that patch discussion so we at least look at it?

Thanks,
Erick

On Thu, Apr 28, 2011 at 2:03 AM, Shinichiro Abe (JIRA) <jira@apache.org> wrote:
>
>    [ https://issues.apache.org/jira/browse/SOLR-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026137#comment-13026137 ]
>
> Shinichiro Abe commented on SOLR-2480:
> --------------------------------------
>
> Improvement ideas:
> 1. Always ignore TikaException and index only the metadata.
> 2. Add a new parameter, "ignoreTikaException".
> If it is true, Solr returns a 200 response; if it is false, Solr throws TikaException.
> 3. If Solr can catch the internal exception caused by encryption, it changes the return
> code depending on the exception.
> If it can identify poi.EncryptedDocumentException, pdfbox.exceptions.CryptographyException,
> etc., it returns 200 or another response code; for any other exception it throws
> TikaException.
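
A minimal sketch of how ideas 2 and 3 above might fit together, in plain Java. It is not taken from any patch on this issue: the class ExtractionPolicy, the Outcome enum, and the stand-in exception classes are all hypothetical, and matching on simple class names is only one assumed way to recognise the POI/PDFBox encryption errors without depending on those libraries.

```java
import java.util.Set;

// Hypothetical helper, not part of Solr or Tika.
class ExtractionPolicy {

    /** What to do with a document whose text extraction failed. */
    enum Outcome { INDEX_METADATA_ONLY, FAIL }

    // Stand-ins so the sketch compiles without Tika, POI, or PDFBox on the classpath.
    static class TikaException extends Exception {
        TikaException(String msg, Throwable cause) { super(msg, cause); }
    }
    static class EncryptedDocumentException extends RuntimeException {
        EncryptedDocumentException(String msg) { super(msg); }
    }

    // Assumption: these simple class names identify "password protected" failures.
    static final Set<String> ENCRYPTION_ERRORS =
            Set.of("EncryptedDocumentException", "CryptographyException");

    /** Walks the cause chain looking for a known encryption-related exception. */
    static boolean isEncryptionError(Throwable error) {
        for (Throwable c = error; c != null; c = c.getCause()) {
            if (ENCRYPTION_ERRORS.contains(c.getClass().getSimpleName())) {
                return true;
            }
        }
        return false;
    }

    /**
     * Idea 2: when ignoreTikaException is true, never fail the request.
     * Idea 3: otherwise, tolerate only encryption errors and fail on the rest.
     */
    static Outcome decide(Throwable error, boolean ignoreTikaException) {
        if (ignoreTikaException || isEncryptionError(error)) {
            return Outcome.INDEX_METADATA_ONLY; // would map to a 200 response
        }
        return Outcome.FAIL;                    // would rethrow, giving a 500
    }

    public static void main(String[] args) {
        Throwable encrypted = new TikaException("parse failed",
                new EncryptedDocumentException("password required"));
        Throwable unknown = new TikaException("parse failed",
                new IllegalStateException("unknown cause"));
        System.out.println(decide(encrypted, false));
        System.out.println(decide(unknown, false));
        System.out.println(decide(unknown, true));
    }
}
```

With a policy like this, a crawler such as ManifoldCF would see a non-500 response for password-protected files and would have no reason to retry them.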
>
>> Text extraction of password protected files
>> -------------------------------------------
>>
>>                 Key: SOLR-2480
>>                 URL: https://issues.apache.org/jira/browse/SOLR-2480
>>             Project: Solr
>>          Issue Type: Improvement
>>          Components: contrib - Solr Cell (Tika extraction)
>>    Affects Versions: 3.1
>>            Reporter: Shinichiro Abe
>>            Priority: Minor
>>
>> Proposal:
>> There are password-protected files: PDF and Office documents in 2007 format/97 format.
>> These files are posted using SolrCell.
>> We do not need to read these files if we do not know their reading password.
>> So, text may not be extracted from these files.
>> My requirement is that these files should be processed normally, without extracting
>> text and without throwing an exception.
>> Background:
>> Now, when you post a password-protected file, Solr returns a 500 server error.
>> Solr catches the error in ExtractingDocumentLoader and throws TikaException.
>> I use ManifoldCF.
>> If the Solr server responds with 500, ManifoldCF's judgment is that "this
>> document should be retried because I have absolutely no idea what
>> happened".
>> So it retries posting many times without ever getting the password.
>> In another case, my customer posts files with embedded images.
>> Sometimes Solr seems to throw a TikaException of unknown cause.
>> He wants to post just the metadata without extracting text, but the exception
>> stops him from posting.
>
> --
> This message is automatically generated by JIRA.
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
