manifoldcf-dev mailing list archives

From "Shinichiro Abe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-984) Give Tika's metadata some hints
Date Wed, 23 Jul 2014 04:23:39 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071349#comment-14071349 ]

Shinichiro Abe commented on CONNECTORS-984:
-------------------------------------------

Sorry, I won't be able to write the code in time for the next release, so I'd like to ask you to do it.
Otherwise, I'd like to postpone this issue.

A quick-fix idea for addOrReplaceDocumentWithException():
{code}

Metadata metadata = new Metadata();
+ metadata.add(TikaMetadataKeys.RESOURCE_NAME_KEY, document.getFileName());
+ metadata.add(HttpHeaders.CONTENT_TYPE, document.getMimeType());
+ metadata.add("stream_name", document.getFileName());
+ // Metadata.add() takes String values, so the length has to be converted.
+ metadata.add("stream_size", String.valueOf(ds.getBinaryLength()));
 :
 :
 :
            try
            {
              parser.parse(document.getBinaryStream(), handler, metadata, pc);
            }
            catch (TikaException e)
            {
+             if (ignoreTikaException) { // ignoreTikaException needs to be configurable somewhere.
+               // If true, I'd like to continue with the next processing step rather than skip it,
+               // i.e. I don't think DOCUMENTSTATUS_REJECTED should be returned.
+               // Please see: http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/ExtractingDocumentLoader.java?view=markup#l219
+             } else {
              resultCode = "TIKAEXCEPTION";
              description = e.getMessage();
              return handleTikaException(e);
+             }
            }

{code}
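To make the metadata part of the idea above concrete, here is a minimal, dependency-free sketch of which hints would be collected before calling the parser. It uses a plain Map instead of Tika's Metadata class so it compiles without Tika on the classpath. The class name TikaHintsSketch, the method buildHints() and the null checks are assumptions for illustration only, not existing ManifoldCF code. The two constants mirror what I believe are the literal values of TikaMetadataKeys.RESOURCE_NAME_KEY and HttpHeaders.CONTENT_TYPE.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of ManifoldCF or Tika.
final class TikaHintsSketch {
  // Assumed literal values of Tika's TikaMetadataKeys.RESOURCE_NAME_KEY
  // and HttpHeaders.CONTENT_TYPE constants.
  static final String RESOURCE_NAME_KEY = "resourceName";
  static final String CONTENT_TYPE = "Content-Type";

  // Collects the hints proposed in the quick fix. Every value is a String,
  // because Tika's Metadata.add() only accepts String values.
  static Map<String, String> buildHints(String fileName, String mimeType, long binaryLength) {
    Map<String, String> hints = new LinkedHashMap<>();
    if (fileName != null) {
      // The file name helps name-based (extension) detection.
      hints.put(RESOURCE_NAME_KEY, fileName);
      hints.put("stream_name", fileName);
    }
    if (mimeType != null) {
      hints.put(CONTENT_TYPE, mimeType);
    }
    // The numeric length is converted explicitly.
    hints.put("stream_size", String.valueOf(binaryLength));
    return hints;
  }

  public static void main(String[] args) {
    // Prints the collected hints in insertion order.
    System.out.println(buildHints("report.pdf", "application/pdf", 1024L));
  }
}
```

In the connector, each entry would then be copied into the real Metadata object with metadata.add(key, value) before parser.parse() is called. Skipping null values is a defensive choice in this sketch, not something the original snippet does.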

Thanks.

> Give Tika's metadata some hints
> -------------------------------
>
>                 Key: CONNECTORS-984
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-984
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Amazon CloudSearch output connector
>    Affects Versions: ManifoldCF 1.7
>            Reporter: Shinichiro Abe
>            Assignee: Shinichiro Abe
>             Fix For: ManifoldCF 1.7
>
>
> Component: Tika connector
> Currently, the trunk code doesn't set any data in Tika's metadata object.
> We probably need to give the metadata some hints to help Tika detect and extract the document content.
> * resourceName
> * ContentType
> * stream size
> * charset(new feature)
> * Password handling(new feature)
> Also, when a TikaException (e.g. a parsing error in PDFBox/POI) is thrown, we need to decide
> whether or not to ignore it for the document being parsed. Solr Cell has an 'ignoreTikaException' param.
> When a TikaException is thrown and the param is true, only the metadata is indexed; when it is false,
> Solr responds with a server error and the document is not indexed.
> Reference-->Solr Cell:
> http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/ExtractingDocumentLoader.java?view=markup#l142
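The ignoreTikaException behaviour described in the issue can be sketched as a small decision function. This is only an illustration of the two outcomes, assuming a boolean flag: ParseFailure stands in for TikaException so that the sketch compiles without Tika, and the names IgnoreParseErrorSketch, Parser, Status and process() are hypothetical, not ManifoldCF or Solr API.

```java
// Hypothetical model of the ignoreTikaException decision; not ManifoldCF code.
final class IgnoreParseErrorSketch {
  // Stand-in for org.apache.tika.exception.TikaException.
  static class ParseFailure extends Exception {
    ParseFailure(String message) {
      super(message);
    }
  }

  // Stand-in for the call to parser.parse(...).
  interface Parser {
    void parse() throws ParseFailure;
  }

  enum Status { INDEXED_WITH_CONTENT, INDEXED_METADATA_ONLY, REJECTED }

  // If parsing fails and the flag is true, processing continues with the
  // metadata only; if the flag is false, the document is rejected.
  static Status process(Parser parser, boolean ignoreParseFailure) {
    try {
      parser.parse();
      return Status.INDEXED_WITH_CONTENT;
    } catch (ParseFailure e) {
      return ignoreParseFailure ? Status.INDEXED_METADATA_ONLY : Status.REJECTED;
    }
  }

  public static void main(String[] args) {
    Parser failing = () -> { throw new ParseFailure("corrupt PDF"); };
    System.out.println(process(failing, true));
    System.out.println(process(failing, false));
  }
}
```

Where the flag would be configured (connector configuration or job specification) is left open here, as it is in the comment above.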



--
This message was sent by Atlassian JIRA
(v6.2#6252)
