manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CONNECTORS-1008) Tika Extractor doesn't seem to handle special characters correctly
Date Tue, 12 Aug 2014 14:23:13 GMT
Karl Wright created CONNECTORS-1008:
---------------------------------------

             Summary: Tika Extractor doesn't seem to handle special characters correctly
                 Key: CONNECTORS-1008
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1008
             Project: ManifoldCF
          Issue Type: Bug
    Affects Versions: ManifoldCF 1.7
            Reporter: Karl Wright
            Assignee: Karl Wright
             Fix For: ManifoldCF 1.7


The Tika extractor, when extracting content from a PDF (specifically, the en_US end-user-documentation
PDF), does not handle anything other than Latin-1 characters properly.  For example, it does
not convert the copyright symbol, or any CJK characters, to UTF-8.
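
As an illustration only, and not a diagnosis of this issue, the sketch below shows two common ways text like this gets damaged in Java: decoding UTF-8 bytes with a Latin-1 decoder, and encoding to Latin-1 a string that contains characters Latin-1 cannot represent. The class and method names (EncodingSketch, decodeUtf8BytesAs, throughLatin1) are hypothetical and do not come from ManifoldCF or Tika; only JDK classes are used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSketch {
    // Hypothetical helper: encode text as UTF-8, then decode the bytes
    // with the given charset. Only UTF-8 gives back the original text.
    static String decodeUtf8BytesAs(String text, Charset decodeWith) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, decodeWith);
    }

    // Hypothetical helper: push text through ISO-8859-1 (Latin-1) and back.
    // Characters outside Latin-1 are replaced with '?' during encoding.
    static String throughLatin1(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Copyright sign, a space, then three CJK characters.
        String sample = "\u00A9 \u65E5\u672C\u8A9E";

        String good = decodeUtf8BytesAs(sample, StandardCharsets.UTF_8);
        String bad = decodeUtf8BytesAs(sample, StandardCharsets.ISO_8859_1);
        String lossy = throughLatin1(sample);

        // Print lengths and comparisons rather than the strings themselves,
        // so the output does not depend on the console encoding.
        System.out.println("original length: " + sample.length());
        System.out.println("UTF-8 round trip intact: " + good.equals(sample));
        System.out.println("Latin-1 decode length: " + bad.length());
        System.out.println("Latin-1 encode intact: " + lossy.equals(sample));
    }
}
```

In this sketch the Latin-1 decode turns each UTF-8 byte into its own character, so the five-character sample becomes twelve characters, while the Latin-1 encode keeps the copyright sign but replaces each CJK character with a question mark.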




--
This message was sent by Atlassian JIRA
(v6.2#6252)
