manifoldcf-dev mailing list archives

From "Subasini Rath (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
Date Tue, 19 Feb 2019 06:29:00 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771607#comment-16771607
] 

Subasini Rath commented on CONNECTORS-1563:
-------------------------------------------

Hi Karl,
    Could you please tell me which field ManifoldCF writes the actual text content of the
document to?

Currently I am using the _text_ field, but I have found that _text_ does not contain only the
document text: some extra values are added in front of the actual content.

In my managed-schema:

<field name="_text_" type="text_general" multiValued="true" indexed="true" stored="true"/>

After indexing in Solr, the value looks like this (the first 4 lines are inserted before the
content of the file):

"title":["NETWORK PLANNING\u0000"],
        "_text_":[" \n \n stream_size 34070  \n X-Parsed-By org.apache.tika.parser.DefaultParser
 \n X-Parsed-By org.apache.tika.parser.txt.TXTParser  \n stream_content_type application/pdf
 \n stream_name cs.exe?bmsdocid=9.2.1&func=eebms.docdownload  \n stream_source_info cs.exe?bmsdocid=9.2.1&func=eebms.docdownload
 \n Content-Encoding UTF-8  \n resourceName cs.exe?bmsdocid=9.2.1&func=eebms.docdownload
 \n Content-Type text/plain; charset=UTF-8  \n  \n \n  9.2.1 UNCONTROLLED IF PRINTED Page
1 of 13\nCompany Policy\nNETWORK\nDocument No Amendment No Approved By Approval Date Review
Date\n: : : : :\n9.2.1 9 CEO 23/05/2016 23/05/2019\n9.2.1 NETWORK PLANNING\n1.0 POLICY STATEMENT\nThe
company will plan the expansion and augmentation of its electrical network to achieve levels
of safety, reliability and quality of supply commensurate with community, regulator, customer
and shareholder expectations.\nThe company will coordinate its planning with the NSW transmission
utility Transgrid and neighbouring distribution utilities to develop effective solutions to
satisfy load growth within the company’s supply area and in adjacent franchise areas where
the company’s network has influence.\n2.0 PURPOSE\nTo provide principles for planning network
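
For illustration only, the shape of the value above (a run of "key value" metadata lines, then the document text) can be sketched as a small post-processing step. This is a minimal sketch, not a ManifoldCF or Solr feature: the function name split_preamble and the KNOWN_KEYS list are hypothetical, the key list is taken only from the sample above, and the sample string is a shortened copy of that output.

```python
# Metadata keys seen in the sample above. This list is an assumption taken
# from that one sample, not an exhaustive set of what Tika or Solr may emit.
KNOWN_KEYS = (
    "stream_size", "X-Parsed-By", "stream_content_type", "stream_name",
    "stream_source_info", "Content-Encoding", "resourceName", "Content-Type",
)


def split_preamble(text):
    """Split a _text_ value shaped like the sample into (metadata, body).

    Leading lines that are blank or start with a known metadata key are
    treated as the preamble; everything from the first other line on is
    treated as the document body.
    """
    lines = text.split("\n")
    metadata = []
    body_start = len(lines)  # stays here if every line is preamble
    for index, line in enumerate(lines):
        stripped = line.strip()
        if not stripped:
            continue  # blank separator line inside the preamble
        if stripped.split(" ", 1)[0] in KNOWN_KEYS:
            metadata.append(stripped)
        else:
            body_start = index
            break
    body = "\n".join(lines[body_start:]).strip()
    return metadata, body


# Shortened copy of the indexed value shown above.
sample = (
    " \n \n stream_size 34070  \n X-Parsed-By org.apache.tika.parser.DefaultParser"
    "  \n X-Parsed-By org.apache.tika.parser.txt.TXTParser"
    "  \n stream_content_type application/pdf"
    "  \n Content-Type text/plain; charset=UTF-8  \n  \n \n"
    "  9.2.1 UNCONTROLLED IF PRINTED Page 1 of 13\nCompany Policy\nNETWORK"
)

metadata, body = split_preamble(sample)
print(metadata)
print(body)
```

The heuristic would misfire if the document text itself began with one of these keys, so it only illustrates the structure of the value and is not a fix for the indexing behaviour.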



Thanks & Regards,
Subasini Rath
O: +91-33 6636-8889 
M: +91 983-1234-341
Email: Subasini.Rath@endeavourenergy.com.au



> SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
> -----------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1563
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1563
>             Project: ManifoldCF
>          Issue Type: Task
>          Components: Lucene/SOLR connector
>            Reporter: Sneha
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: Document simple history.docx, managed-schema, manifold settings.docx, manifoldcf.log, solr.log, solrconfig.xml
>
>
> I am encountering this problem:
> When I check the "Use the Extract Update Handler:" parameter, I get this error on Solr: null:org.apache.solr.common.SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
> If I ignore the Tika exception, my documents get indexed but don't have a content field in Solr.
> I am using Solr 7.3.1 and ManifoldCF 2.8.1.
> I am using Solr Cell and hence have not configured the external Tika extractor in the ManifoldCF pipeline.
> Please help me with this problem.
> Thanks in advance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
