lucene-java-user mailing list archives

From Robert Muir
Subject Re: which unicode version is supported with lucene
Date Fri, 25 Feb 2011 14:31:11 GMT
On Fri, Feb 25, 2011 at 9:16 AM, Yonik Seeley wrote:
> On Fri, Feb 25, 2011 at 9:09 AM, Bernd Fehling wrote:
> > Hi Yonik,
> >
> > good point, yes we are using Jetty.
> > Do you know if Tomcat has this limitation?
> Tomcat's defaults are worse - you need to configure it to use UTF-8 by
> default for URLs.
> Once you do, it passes all those tests (last I checked).  Those tests
> are really about UTF-8 working in GET/POST query arguments.  Solr may
> still be able to handle indexing and returning full UTF-8, but you
> wouldn't be able to query for it w/o using surrogates if you're using
> Jetty.
> It would be good to test though - does anyone know how to add a char
> above the BMP to utf8-example.xml?

I tried the patch below, then searched for this character (U+29B05 /
UTF-8: [f0 a9 ac 85]) with Jetty and got no results.
I also checked analysis.jsp as a quick test and noted that Jetty
treats it as if it were U+9B05 / UTF-8: [e9 ac 85].
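
The numbers above are consistent with the code point being cut down to
its low 16 bits somewhere in the request handling. I have not looked at
Jetty's decoder, so the truncation below is only an assumption that
happens to reproduce the observed bytes, and the class and method names
are made up for illustration:

```java
import java.nio.charset.StandardCharsets;

class SupplementaryCharDemo {
    // Hex dump of the UTF-8 encoding of a single code point.
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Assumption: mimics a decoder that keeps only the low 16 bits.
    static int truncateTo16Bits(int codePoint) {
        return codePoint & 0xFFFF;
    }

    public static void main(String[] args) {
        int cp = 0x29B05; // the supplementary character from the patch
        System.out.println(utf8Hex(cp));                   // f0 a9 ac 85
        System.out.println(Character.charCount(cp));       // 2 (surrogate pair)
        int cut = truncateTo16Bits(cp);
        System.out.printf("U+%04X%n", cut);                // U+9B05
        System.out.println(utf8Hex(cut));                  // e9 ac 85
    }
}
```

So a character above the BMP needs two Java chars, and dropping the
bits above 16 yields exactly the BMP character that analysis.jsp showed.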

Then I searched on 'range' via the admin GUI to retrieve this
document, and Chrome blew up with "This page contains the following
errors: error on line 17 at column 306: Encoding error"

I didn't try Tomcat.

Index: utf8-example.xml
--- utf8-example.xml (revision 1074125)
+++ utf8-example.xml (working copy)
@@ -34,6 +34,7 @@
     <field name="features">eaiou with umlauts: ëäïöü</field>
     <field name="features">tag with escaped chars: &lt;nicetag/&gt;</field>
     <field name="features">escaped ampersand: Bonnie &amp; Clyde</field>
+    <field name="features">full unicode range (supplementary char): 𩬅</field>
     <field name="price">0</field>
     <!-- no popularity, get the default from schema.xml -->
     <field name="inStock">true</field>
