lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From KK <dioxide.softw...@gmail.com>
Subject How to post date encoded in NCR(decimal) to lucene indexer?
Date Mon, 01 Jun 2009 08:48:53 GMT
Hi All,
I'm trying to index data to lucene index in unicode utf-8 format. All my
search queries are of the form \uxxxx and its working fine. But the problem
is in some cases, when the document[actually a webpage content] contains
Numeric Character Reference[decimal], these are getting indexed as such. For
example I've the following data[some telugu language data],

&#3105;&#3134;&#3093;&#3149;&#3103;&#3120;&#3149;

When I index this they get indexed as such and querying using \uxxxx
format doesnot give any result. so I want to know is there any way
where we can configure lucene to take
care of such things by itself, or I've to convert the same to \uxxxx
format[this is just replace &# with \u and replace the 4-dig number
with its hex equivalent]. This manual

method doesnot sound good to me. If there is any standard way to doing
the same, please someone let ke know. Thank you.

--KK.

One question?
Is it mandatory that the data to be indexed by lucene has to in \uxxxx
format for unicode utf-8 encoded data?

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message