lucene-dev mailing list archives

From "Burton-West, Tom" <tburt...@umich.edu>
Subject RE: Shingle filter that reads the script attribute from ICUTokenizer and LUCENE-2906
Date Sat, 17 Dec 2011 15:13:25 GMT
Thanks Robert,



>>Another idea apart from your solution would be to add a tailoring for
>>Tibetan that sets some special attribute indicating 'word-final
>>syllable'. Then this information is not 'lost' and downstream can do
>>the right thing.

>>...So essentially before doing anything like that, it would be
>>best to know 'the rules of the game' before thinking about any design.

So the ICUTokenizer would have to add that word-final syllable attribute based on some rules,
and then a downstream filter could use the attribute to construct bigrams without creating
"stupid" bigrams.
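
To make the idea concrete, here is a rough sketch in plain Java, not tied to the real Lucene
TokenStream or attribute APIs. The Syllable class, its wordFinal flag, and the bigrams method are
hypothetical names invented for illustration, and the sketch assumes a "stupid" bigram is one
that pairs the last syllable of one word with the first syllable of the next.

```java
import java.util.ArrayList;
import java.util.List;

final class SyllableBigrams {

    /** Hypothetical token: a syllable plus the flag the tokenizer would set. */
    static final class Syllable {
        final String text;
        final boolean wordFinal; // assumed to come from the tokenizer's rules

        Syllable(String text, boolean wordFinal) {
            this.text = text;
            this.wordFinal = wordFinal;
        }
    }

    /** Builds bigrams of adjacent syllables, skipping pairs that cross a word boundary. */
    static List<String> bigrams(List<Syllable> syllables) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < syllables.size(); i++) {
            Syllable left = syllables.get(i);
            if (left.wordFinal) {
                continue; // the next syllable starts a new word, so do not pair them
            }
            out.add(left.text + syllables.get(i + 1).text);
        }
        return out;
    }

    public static void main(String[] args) {
        // Placeholder syllables forming three words: "a b", "c", "d e".
        List<Syllable> input = List.of(
                new Syllable("a", false), new Syllable("b", true),
                new Syllable("c", true),
                new Syllable("d", false), new Syllable("e", true));
        System.out.println(bigrams(input)); // prints [ab, de]
    }
}
```

Note that a single-syllable word (the "c" above) produces no bigram in this sketch, so a real filter
would probably also need to emit it as a unigram.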

If we end up doing the project, we will be working with people who have expertise in Tibetan,
and hopefully they would be able to tell us the "rules of the game".

Tom

_______________________________________


Another idea apart from your solution would be to add a tailoring for
Tibetan that sets some special attribute indicating 'word-final
syllable'. Then this information is not 'lost' and downstream can do
the right thing.
It's not a difficult thing to do for the tokenizer, but we would need
more details: a quick glance at some stuff on Tibetan punctuation
indicates it's not 'this simple': for some syllables sometimes the
punctuation is omitted. Honestly I don't know why this is, maybe it
means there are some syllables that only appear in word-final
position? If so, such important clues should also trigger this
attribute. So essentially before doing anything like that, it would be
best to know 'the rules of the game' before thinking about any design.

