lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From KK <>
Subject How to post date encoded in NCR(decimal) to lucene indexer?
Date Mon, 01 Jun 2009 08:48:53 GMT
Hi All,
I'm trying to index data to lucene index in unicode utf-8 format. All my
search queries are of the form \uxxxx and its working fine. But the problem
is in some cases, when the document[actually a webpage content] contains
Numeric Character Reference[decimal], these are getting indexed as such. For
example I've the following data[some telugu language data],


When I index this they get indexed as such and querying using \uxxxx
format doesnot give any result. so I want to know is there any way
where we can configure lucene to take
care of such things by itself, or I've to convert the same to \uxxxx
format[this is just replace &# with \u and replace the 4-dig number
with its hex equivalent]. This manual

method doesnot sound good to me. If there is any standard way to doing
the same, please someone let ke know. Thank you.


One question?
Is it mandatory that the data to be indexed by lucene has to in \uxxxx
format for unicode utf-8 encoded data?

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message