lucene-java-user mailing list archives

From Brandon Mintern <mint...@easyesi.com>
Subject Re: Problems Indexing/Parsing Tibetan Text
Date Fri, 30 Mar 2012 18:11:48 GMT
Another good reference is UAX #29 (Unicode Text Segmentation):
http://unicode.org/reports/tr29/

Since the latest Lucene uses it as the basis of its text
segmentation, it's worth getting familiar with it.
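
For a quick local check of the same kind of per-character properties, here
is a minimal sketch that uses only the JDK. It prints the general category
and name of each code point in U+0F10..U+0F19, the range used in the
list-unicodeset link quoted below. The class name TibetanProps and the
helpers categoryName and describe are made up for this illustration, and
the values reflect whatever Unicode version your JDK ships, which may
differ from the one Lucene uses. It does not show the UAX #29 Word_Break
property itself; for that, the Unicode utilities are still the place to
look.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Lucene.
class TibetanProps {

    // Map a java.lang.Character general-category constant to the usual
    // two-letter Unicode abbreviation (only a few categories are spelled out).
    public static String categoryName(int type) {
        switch (type) {
            case Character.OTHER_PUNCTUATION:      return "Po";
            case Character.OTHER_SYMBOL:           return "So";
            case Character.NON_SPACING_MARK:       return "Mn";
            case Character.COMBINING_SPACING_MARK: return "Mc";
            case Character.OTHER_LETTER:           return "Lo";
            case Character.DECIMAL_DIGIT_NUMBER:   return "Nd";
            case Character.UNASSIGNED:             return "Cn";
            default:                               return "other(" + type + ")";
        }
    }

    // One row per code point: "U+XXXX <category> <name>".
    // Character.getName returns null for unassigned code points.
    public static List<String> describe(int first, int last) {
        List<String> rows = new ArrayList<>();
        for (int cp = first; cp <= last; cp++) {
            rows.add(String.format("U+%04X %s %s",
                    cp, categoryName(Character.getType(cp)), Character.getName(cp)));
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String row : describe(0x0F10, 0x0F19)) {
            System.out.println(row);
        }
    }
}
```

Running main prints ten rows, one per code point. Characters that report as
punctuation or symbols rather than letters are the ones most likely to be
treated as boundaries or dropped by a tokenizer.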

On Fri, Mar 30, 2012 at 10:09 AM, Robert Muir <rcmuir@gmail.com> wrote:
> On Fri, Mar 30, 2012 at 1:03 PM, Denis Brodeur <denisbrodeur@gmail.com> wrote:
>> Thanks Robert.  That makes sense.  Do you have a link handy where I can
>> find this information? i.e. word boundary/punctuation for any unicode
>> character set?
>>
>
> yeah, usually i use
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[\u0f10-\u0f19]&g=
>
> you can then click on a character and see all of its properties easily.
>
> (site seems to have some issues today)
>
> --
> lucidimagination.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


