lucene-dev mailing list archives

From "Koji Sekiguchi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-2480) Text extraction of password protected files
Date Sat, 14 May 2011 03:09:47 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033429#comment-13033429 ]

Koji Sekiguchi commented on SOLR-2480:
--------------------------------------

bq. And I think SOLR-445 can resolve improvement ideas(2).

No. You should consider the difference between this issue and SOLR-445. (see my comment above)

As I understand it, the requirement in the Description is quite similar to SOLR-2512, which has
been resolved, so I'll try a patch that adds an ignoreErrors flag for TikaException.

In SOLR-2512 I added the ability to ignore exceptions when trying to extract metadata from text,
i.e. Solr indexes the text but gives up the metadata. The ignore flag in this ticket, on the
other hand, is for giving up the text but indexing the metadata. It cannot be resolved by SOLR-445.
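
A minimal sketch of the flag described above, not the actual patch. It assumes a
hypothetical Extractor interface and an ExtractionException that stands in for Tika's
TikaException, so that it compiles without Tika or Solr on the classpath. The
buildDocument method and the "content" field name are also illustrative assumptions.
With ignoreErrors set, a failed extraction drops the text but keeps whatever metadata
was collected before the failure.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class IgnoreErrorsSketch {

    // Stand-in for Tika's TikaException (which is a checked exception in Tika).
    // It is unchecked here only to keep the sketch short.
    static class ExtractionException extends RuntimeException {
        ExtractionException(String message) {
            super(message);
        }
    }

    // Hypothetical parser abstraction: fills in metadata, then returns the
    // body text or throws ExtractionException.
    interface Extractor {
        String extract(byte[] content, Map<String, String> metadata);
    }

    // Builds the fields to index. When ignoreErrors is true, an extraction
    // failure is swallowed and only the metadata gathered so far is indexed.
    static Map<String, String> buildDocument(Extractor extractor, byte[] content,
                                             boolean ignoreErrors) {
        Map<String, String> metadata = new LinkedHashMap<>();
        Map<String, String> doc = new LinkedHashMap<>();
        try {
            // "content" is an assumed field name for the extracted text.
            doc.put("content", extractor.extract(content, metadata));
        } catch (ExtractionException e) {
            if (!ignoreErrors) {
                throw e; // previous behavior: the whole request fails
            }
            // Flag set: give up the text, continue with metadata only.
        }
        doc.putAll(metadata);
        return doc;
    }

    public static void main(String[] args) {
        // Simulates a password-protected file: metadata is readable, text is not.
        Extractor passwordProtected = (content, metadata) -> {
            metadata.put("title", "secret.pdf");
            throw new ExtractionException("document is encrypted");
        };
        System.out.println(buildDocument(passwordProtected, new byte[0], true));
    }
}
```

With ignoreErrors set to false the same call rethrows the exception, which corresponds
to the 500 response described in the issue.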

> Text extraction of password protected files
> -------------------------------------------
>
>                 Key: SOLR-2480
>                 URL: https://issues.apache.org/jira/browse/SOLR-2480
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 1.4.1, 3.1
>            Reporter: Shinichiro Abe
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: SOLR-2480-idea1.patch
>
>
> Proposal:
> Some files are password-protected, such as PDFs and Office documents in 2007 or 97 format.
> These files are posted using SolrCell.
> We do not have to read these files if we do not know their read password,
> so their text may be left unextracted.
> My requirement is that these files be processed normally, without extracting text
> and without throwing an exception.
> Background:
> Currently, when you post a password-protected file, Solr returns a 500 server error.
> Solr catches the error in ExtractingDocumentLoader and throws TikaException.
> I use ManifoldCF.
> If the Solr server responds with 500, ManifoldCF's judgment is that "this
> document should be retried because I have absolutely no idea what
> happened".
> It then retries posting many times without ever getting the password.
> In another case, my customer posts files with embedded images.
> Sometimes Solr seems to throw a TikaException of unknown cause.
> He wants to post just the metadata without extracting text, but the exception
> stops him from posting.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


