manifoldcf-dev mailing list archives

From "Subasini Rath (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
Date Sat, 12 Jan 2019 04:43:00 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740992#comment-16740992
] 

Subasini Rath commented on CONNECTORS-1563:
-------------------------------------------

Thanks, Karl.  I have one more question.  I need to pass one custom field and value from
ManifoldCF and have it show up in the Solr index.  That is why I used the metadata
transformer, where I can set the custom field on the job's Metadata Adjuster tab.
If I use only the Tika extractor, is there any way to pass a custom field so that it gets
indexed in Solr?

On 11-Jan-2019 11:17 PM, "Karl Wright (JIRA)" <jira@apache.org> wrote:

    [ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740587#comment-16740587
]

Karl Wright commented on CONNECTORS-1563:
-----------------------------------------

The metadata extractor can go anywhere in your pipeline, after Tika extraction.  There is
absolutely no point in having *two* Tika extractions though -- and that's what you're trying
to do with the setup you've got.

What I'd recommend is that you use only the ManifoldCF-side Tika extractor, and inject content
into Solr using the /update handler, not the /update/extract handler.  There's also a checkbox
you'd need to uncheck in the Solr connection configuration. It's all covered in the ManifoldCF
end user documentation.
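
As a rough illustration of what the /update handler expects (as opposed to
/update/extract, which takes a raw binary and runs Solr Cell's Tika on it), the
sketch below builds a JSON update request that carries text which has already been
extracted, plus one custom field. It is not how ManifoldCF itself talks to Solr;
the core URL, the field names "content" and "source_system", and the helper
build_update_request are assumptions made for the example, and the fields would
have to exist in (or be permitted by) the Solr schema.

```python
import json
import urllib.request


def build_update_request(solr_core_url, doc_id, text, extra_fields):
    """Build (but do not send) a POST to Solr's /update handler.

    Hypothetical helper: the field names used here are assumptions and
    must match whatever the target Solr schema defines.
    """
    # The document already contains extracted text, so Solr does not
    # need to run Tika on it.
    doc = {"id": doc_id, "content": text}
    # Custom fields travel as ordinary document fields.
    doc.update(extra_fields)
    body = json.dumps([doc]).encode("utf-8")
    return urllib.request.Request(
        url=solr_core_url.rstrip("/") + "/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local core URL and example values, for illustration only.
req = build_update_request(
    "http://localhost:8983/solr/mycore",
    "doc-1",
    "Text already extracted by Tika",
    {"source_system": "manifoldcf"},
)
# urllib.request.urlopen(req) would send it to a running Solr instance.
```

Because the text arrives as a plain field, an empty document no longer reaches
Solr's Tika as a zero-byte stream, and any extra metadata is indexed like any
other field.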






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


> SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
> -----------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1563
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1563
>             Project: ManifoldCF
>          Issue Type: Task
>          Components: Lucene/SOLR connector
>            Reporter: Sneha
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: managed-schema, solrconfig.xml
>
>
> I am encountering this problem:
> When I check the "Use the Extract Update Handler:" parameter, I get the following error on
> Solr: null:org.apache.solr.common.SolrException: org.apache.tika.exception.ZeroByteFileException:
> InputStream must have > 0 bytes
> If I ignore the Tika exception, my documents get indexed but don't have a content field in Solr.
> I am using Solr 7.3.1 and ManifoldCF 2.8.1.
> I am using Solr Cell and have therefore not configured the external Tika extractor in the
> ManifoldCF pipeline.
> Please help me with this problem.
> Thanks in advance.


