manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
Date Tue, 19 Feb 2019 07:50:00 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771663#comment-16771663 ]

Karl Wright commented on CONNECTORS-1563:
-----------------------------------------

Hi Subasini,

Are you now Tika-extracting in ManifoldCF, or in Solr?
The text field looks like it contains properly extracted content, along with other stuff you
do not want.  Is this correct?

If the extraction is happening in Solr, then I have no idea where this is coming from.  If
the extraction is happening in ManifoldCF, and you have placed a Metadata Adjuster transformer
in the pipeline between the Tika Extractor and the Solr Output Connector, then I'd say you have
set it up to concatenate many fields together into a text field.  The Metadata Adjuster has
that ability.

The mapping of metadata (or content) fields to the Solr schema is set up in your
Solr output connection configuration.  The Tika extraction basically replaces a binary input
document with a character-sequence output document plus metadata fields.  That character-sequence
document must then be sent to Solr through the standard update handler, not the extracting
update handler, so the handler should be changed from /update/extract to just /update,
and the "Use extracting update handler" option should be turned off.  The actual field name used
for the extracted content body can also be changed, if desired, in the "Schema" part of the
configuration.  But what is there by default works with Solr as it's set up by default.
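
To illustrate the difference, here is a rough sketch (not what the ManifoldCF connector
actually does internally) of what "already extracted text sent to the standard handler" looks
like as a plain HTTP request.  The core name "mycore", the "id" and "content" field names, and
the helper function are all assumptions made up for this example; substitute whatever your
schema and output connection really use.  The request is only built, not sent.

```python
import json
import urllib.request


def build_update_request(solr_core_url, doc_id, extracted_text, content_field="content"):
    """Build (but do not send) a POST to Solr's standard /update handler.

    The text is assumed to have been extracted already (for example by a
    Tika step earlier in the pipeline), so it goes to /update as a plain
    field value rather than to /update/extract as a binary stream.
    """
    # Hypothetical field names: "id" and the content field depend on your schema.
    doc = {"id": doc_id, content_field: extracted_text}
    body = json.dumps([doc]).encode("utf-8")
    return urllib.request.Request(
        url=solr_core_url.rstrip("/") + "/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical core URL; nothing is sent over the network here.
req = build_update_request(
    "http://localhost:8983/solr/mycore", "doc-1", "Already extracted text"
)
print(req.full_url)
```

If binary content were posted to /update/extract instead, Solr would run its own Tika
extraction on whatever stream it received, which is the step that fails when the stream is empty.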





> SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
> -----------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1563
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1563
>             Project: ManifoldCF
>          Issue Type: Task
>          Components: Lucene/SOLR connector
>            Reporter: Sneha
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: Document simple history.docx, managed-schema, manifold settings.docx, manifoldcf.log, solr.log, solrconfig.xml
>
>
> I am encountering this problem:
> I have checked "Use the Extract Update Handler:" param then I am getting an error on
> Solr i.e. null:org.apache.solr.common.SolrException: org.apache.tika.exception.ZeroByteFileException:
> InputStream must have > 0 bytes
> If I ignore tika exception, my documents get indexed but dont have content field on Solr.
> I am using Solr 7.3.1 and manifoldCF 2.8.1
> I am using solr cell and hence not configured external tika extractor in manifoldCF pipeline
> Please help me with this problem
> Thanks in advance



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
