lucene-solr-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: Can SOLR Index UTF-16 Text
Date Tue, 02 Oct 2012 23:07:21 GMT
If it is a simple text file, does that text file start with the UTF-16 "BOM" marker?
http://unicode.org/faq/utf_bom.html
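
To check for the BOM, something like the following Python sketch would do. It is not anything Solr runs; the function names detect_bom and to_utf8 are made up for illustration, and it assumes the whole file fits in memory. It inspects the first bytes of the file and, if a BOM is present, re-encodes the content as UTF-8.

```python
import codecs

# Longer BOMs are listed first because the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
# Caveat: a UTF-16 LE file whose first character is U+0000 would be
# misread as UTF-32 LE; that is rare for ordinary text files.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return (encoding_name, bom_length), or None if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None


def to_utf8(data: bytes) -> bytes:
    """Decode BOM-marked bytes and re-encode them as UTF-8 without a BOM."""
    found = detect_bom(data)
    if found is None:
        raise ValueError("no BOM found; the encoding must be known some other way")
    name, bom_len = found
    return data[bom_len:].decode(name).encode("utf-8")
```

If detect_bom returns None for the file, the receiving side has no in-band hint about the encoding and has to rely on the declared charset.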

Also, do UTF-8 files work? If not, then your setup has a basic encoding problem.
And when you post such a text file (for example, with curl), declare the UTF-16 charset in the Content-Type MIME type.
I think it is "text/plain; charset=utf-16".
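
As a rough illustration of declaring the charset on the request, here is a Python sketch using only the standard library. The helper name build_post_request is hypothetical, and the URL you pass in is whatever update handler your setup actually exposes; nothing here is specific to Solr. The function only builds the request object and does not send it.

```python
import urllib.request


def build_post_request(url: str, payload: bytes, charset: str = "utf-16"):
    """Build (but do not send) a POST whose Content-Type declares the charset.

    `url` is an assumption supplied by the caller; `payload` is the raw
    file content, left in its original encoding.
    """
    headers = {"Content-Type": "text/plain; charset=" + charset}
    return urllib.request.Request(url, data=payload, headers=headers, method="POST")
```

Passing the result to urllib.request.urlopen would send it. The point is only that the bytes go out unchanged and the header tells the server how to decode them.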


----- Original Message -----
| From: "Chris Hostetter" <hossman_lucene@fucit.org>
| To: solr-user@lucene.apache.org
| Sent: Friday, September 28, 2012 5:17:15 PM
| Subject: Re: Can SOLR Index UTF-16 Text
| 
| 
| : Our SOLR setup (4.0.BETA on Tomcat 6) works as expected when indexing UTF-8
| : files. Recently, however, we noticed that it has issues with indexing
| : certain text files, e.g. UTF-16 files. See attachment for an example
| : (tarred+zipped)
| :
| : tesla-utf16.txt
| : <http://lucene.472066.n3.nabble.com/file/n4010834/tesla-utf16.txt>
| 
| No attachment came through to the list, and the URL nabble seems to have
| provided when you posted your message leads to a 404.
| 
| In general, the question of "is indexing a UTF-16 file supported" largely
| depends on *how* you are indexing this file -- if it's plain text, are you
| parsing it yourself using some client code, and then sending it to Solr?
| Are you using DIH to read it from disk? Are you using
| ExtractingRequestHandler?
| 
| Those are all very different ways to index data in Solr -- and what you
| are doing determines how/where the encoding of that file is processed.
| 
| 
| -Hoss
| 
