lucene-dev mailing list archives

From "Koji Sekiguchi (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-2480) Text extraction of password protected files
Date Sat, 14 May 2011 04:20:48 GMT

     [ https://issues.apache.org/jira/browse/SOLR-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated SOLR-2480:
---------------------------------

    Attachment: password-is-solrcell.docx
                SOLR-2480.patch

Attached the next patch and the password-protected Word file that is used for the test.

I added test cases for both ignoreTikaException=true and ignoreTikaException=false.

I think this is ready to commit.
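
For readers who want to try the parameter, here is a minimal sketch of building a Solr Cell request URL that carries ignoreTikaException. The class name ExtractUrlSketch, the helper buildExtractUrl, the base URL and the document id are assumptions made for illustration; /update/extract is the usual handler path but depends on your solrconfig.xml.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Solr: it only assembles a request URL.
final class ExtractUrlSketch {

    // Builds <baseUrl>/update/extract?literal.id=...&ignoreTikaException=...
    // The handler path is an assumption; check your solrconfig.xml.
    static String buildExtractUrl(String baseUrl, String docId, boolean ignoreTikaException) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal.id", docId);
        params.put("ignoreTikaException", Boolean.toString(ignoreTikaException));

        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            // URL-encode both key and value so ids with spaces etc. stay valid.
            query.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return baseUrl + "/update/extract?" + query;
    }
}
```

The file itself (for example the attached password-is-solrcell.docx) would then be sent as the body of a POST to that URL with whatever HTTP client you prefer.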

> Text extraction of password protected files
> -------------------------------------------
>
>                 Key: SOLR-2480
>                 URL: https://issues.apache.org/jira/browse/SOLR-2480
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 1.4.1, 3.1
>            Reporter: Shinichiro Abe
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: SOLR-2480-idea1.patch, SOLR-2480.patch, SOLR-2480.patch, password-is-solrcell.docx
>
>
> Proposal:
> Some files are password protected, for example PDF files and Office documents in 2007 format or 97 format.
> These files are posted using SolrCell.
> We do not need to read these files if we do not know their password,
> so it is acceptable for no text to be extracted from them.
> My requirement is that these files should be processed normally, without extracting text
> and without throwing an exception.
> Background:
> Currently, when you post a password-protected file, Solr returns a 500 server error.
> Solr catches the error in ExtractingDocumentLoader and throws a TikaException.
> I use ManifoldCF.
> If the Solr server responds with 500, ManifoldCF judges that "this
> document should be retried because I have absolutely no idea what
> happened".
> It then retries the post many times, without ever obtaining the password.
> In another case, my customer posts files with embedded images.
> Sometimes Solr seems to throw a TikaException of unknown cause.
> He wants to post just the metadata without extracting text, but the exception stops him
> from posting.
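
The behaviour requested in the description can be sketched as follows. This is not the code in SOLR-2480.patch: ExtractionException is a hypothetical stand-in for Tika's TikaException, and TextExtractor, buildDocument and the "text" field name are invented for illustration so the example runs without Solr or Tika on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: models "index metadata, skip text when extraction fails".
final class IgnoreExtractionErrorSketch {

    // Hypothetical stand-in for org.apache.tika.exception.TikaException.
    static class ExtractionException extends Exception {
        ExtractionException(String message) { super(message); }
    }

    // Hypothetical extractor; a real one would delegate to a Tika parser.
    interface TextExtractor {
        String extract(byte[] content) throws ExtractionException;
    }

    // Returns the fields to index. When ignoreException is true and extraction
    // fails, the document keeps its metadata and simply has no "text" field.
    // When ignoreException is false, the failure propagates to the caller
    // (which, in the scenario described above, surfaces as a 500 response).
    static Map<String, String> buildDocument(byte[] content,
                                             Map<String, String> metadata,
                                             TextExtractor extractor,
                                             boolean ignoreException)
            throws ExtractionException {
        Map<String, String> doc = new LinkedHashMap<>(metadata);
        try {
            doc.put("text", extractor.extract(content));
        } catch (ExtractionException e) {
            if (!ignoreException) {
                throw e;
            }
            // Swallow the failure: metadata-only document.
        }
        return doc;
    }
}
```

With the flag set, a client such as ManifoldCF would see a successful response for a password-protected file and would have no reason to retry it.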

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

