lucene-solr-user mailing list archives

From vybe3142 <>
Subject Re: Can SOLR Index UTF-16 Text
Date Wed, 03 Oct 2012 16:29:37 GMT
Thanks for all the responses. The problem is partially solved (see below).

1. In a sense, my question is theoretical, since the input to our SOLR server
is (currently) UTF-8 files produced by a third-party text extraction utility
(not Tika). On the server side, we read and index the text via a custom data
handler. Last week, I tried a UTF-16 file to see what would happen, and it
wasn't handled correctly, as explained in my original question.

2. The file is UTF-16.

3. We can either (a) stream the data to SOLR in the call or (b) use the
stream.file parameter to provide the file path to the SOLR handler.

Assuming case (a)

Here's how the SOLRJ request is constructed (code edited for conciseness)

If I replace the last line with

things work!

What would I need to do in case (b), where the raw file is loaded
remotely, i.e. my handler reads the file directly?

In this case, how can I control what the content type is?
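
To make case (b) concrete, here is a minimal sketch of what "the handler reads the file directly" could look like if the handler picked the charset itself by inspecting the byte order mark (BOM). This is not the code from our handler: the class name BomCharsetSketch and its methods are hypothetical, the fallback to UTF-8 when no BOM is present is an assumption, and UTF-32 is not handled. A UTF-16 file written without a BOM would still be misread by this approach.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: chooses a charset from the BOM before decoding raw file bytes.
public class BomCharsetSketch {

    // Returns the BOM length in bytes (0 if none is recognised).
    public static int bomLength(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) {
            return 3; // UTF-8 BOM
        }
        if (b.length >= 2 && (((b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF)
                || ((b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE))) {
            return 2; // UTF-16 BOM, either byte order
        }
        return 0;
    }

    // Picks the charset implied by the BOM; assumes UTF-8 when there is none.
    public static Charset detect(byte[] b) {
        if (bomLength(b) == 2) {
            return (b[0] & 0xFF) == 0xFE
                    ? StandardCharsets.UTF_16BE
                    : StandardCharsets.UTF_16LE;
        }
        return StandardCharsets.UTF_8;
    }

    // Decodes the bytes with the detected charset, skipping the BOM itself.
    public static String decode(byte[] raw) {
        int skip = bomLength(raw);
        return new String(raw, skip, raw.length - skip, detect(raw));
    }

    public static void main(String[] args) {
        // In-memory stand-in for a small UTF-16LE file with a BOM.
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 0x68, 0x00, 0x69, 0x00};
        System.out.println(detect(sample) + " -> " + decode(sample));
    }
}
```

In a real handler the byte array would come from the path passed in stream.file; whether SOLR offers a way to declare the charset for that parameter is exactly what I am asking above.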

