manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1008) Solr connector via /update handler doesn't seem to handle special characters correctly
Date Tue, 12 Aug 2014 15:23:12 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094158#comment-14094158 ]

Karl Wright commented on CONNECTORS-1008:
-----------------------------------------

The problem is definitely in SolrJ.  The content is added as follows:

{code}
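        // Debug only: append the extracted text to a local log file as UTF-8
        File f = new File("outputlog.log");
        OutputStream os = new FileOutputStream(f,true);
        Writer w = new OutputStreamWriter(os,"utf-8");
        w.write(sb.toString());
        w.flush();
        os.close();
        // Hand the very same string to SolrJ as the content field
        outputDoc.addField( contentAttributeName, sb.toString() );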
{code}
... where outputDoc is a SolrInputDocument.

The outputlog.log file is written with perfect utf-8, but by the time the SolrInputDocument
has been handled by SolrJ and unpacked on the Solr side, the special characters are corrupted.
It looks like a separate Solr ticket should be created for this.
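
For anyone who wants to repeat the comparison described above, here is a minimal sketch of
the kind of check involved. It is not code from the connector: the class name Utf8Check and
both helper methods are hypothetical and use only the JDK. One helper confirms that a string
survives a UTF-8 encode/decode round trip on the ManifoldCF side; the other prints the code
points so they can be compared by eye with whatever Solr returns for the same field.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic helper, not part of ManifoldCF or SolrJ.
public class Utf8Check {

    // True if the string is unchanged after encoding to UTF-8 bytes and decoding back.
    public static boolean roundTripsUtf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8).equals(s);
    }

    // Renders each code point as U+XXXX, separated by spaces.
    public static String codePoints(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample text with one non-ASCII character (e with acute accent).
        String sample = "caf\u00e9";
        System.out.println(roundTripsUtf8(sample));
        System.out.println(codePoints(sample));
    }
}
```

If the code points printed here differ from those of the value stored in Solr, the string
was altered somewhere after it was added to the SolrInputDocument.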

The workaround is to use the extracting update handler even with Tika output.  In this case
everything should work properly.

> Solr connector via /update handler doesn't seem to handle special characters correctly
> --------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1008
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1008
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Lucene/SOLR connector
>    Affects Versions: ManifoldCF 1.7
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>            Priority: Critical
>             Fix For: ManifoldCF 1.7
>
>
> The Solr Connector, when receiving documents that have special characters, does not manage
> to send these appropriately to Solr for indexing.  I have verified that the upstream Tika
> extractor does its job perfectly, so the problem is either in the Solr Connector or in SolrJ.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
