lucene-java-user mailing list archives

From Paul Taylor <paul_t...@fastmail.fm>
Subject Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)
Date Wed, 01 Oct 2014 08:04:32 GMT
On 01/10/2014 08:08, Dawid Weiss wrote:
> Hi Steve,
>
> I have to admit I also find it frequently useful to include
> punctuation as tokens (even if it's filtered out by subsequent token
> filters for indexing, it's a useful to-have for other NLP tasks). Do
> you think it'd be possible (read: relatively easy) to create an
> analyzer (or a modification of the standard one's lexer) so that
> punctuation is returned as a separate token type?
>
> Dawid
>
>
> On Wed, Oct 1, 2014 at 7:01 AM, Steve Rowe <sarowe@gmail.com> wrote:
>> Hi Paul,
>>
>> StandardTokenizer implements the Word Boundaries rules in the Unicode Text Segmentation
Standard Annex UAX#29 - here’s the relevant section for Unicode 6.1.0, which is the version
supported by Lucene 4.1.0: <http://www.unicode.org/reports/tr29/tr29-19.html#Word_Boundaries>.
>>
>> Only those sequences between boundaries that contain letters and/or digits are returned
as tokens; all other sequences between boundaries are skipped over and not returned as tokens.
>>
>> Steve
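
[The behaviour Steve describes can be seen without Lucene at all: the JDK's java.text.BreakIterator also implements UAX#29 word boundaries. The sketch below is an illustration of the rule "only segments containing letters and/or digits become tokens", not Lucene's actual StandardTokenizer code.]

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBoundaryDemo {

    // Split text at UAX#29 word boundaries, then — like StandardTokenizer —
    // keep only the segments that contain at least one letter or digit.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator wb = BreakIterator.getWordInstance(Locale.ROOT);
        wb.setText(text);
        int start = wb.first();
        for (int end = wb.next(); end != BreakIterator.DONE; start = end, end = wb.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Punctuation and whitespace segments are skipped, not returned.
        System.out.println(tokens("Hello, world 42!"));
    }
}
```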
Yep, I need punctuation; in fact the only thing I usually want removed is 
whitespace. Still, I would like to take advantage of the fact that the new 
tokenizer can recognise some word boundaries that are not based on 
whitespace (in the case of some non-western languages). I have modified 
the tokenizer before but found it very difficult to understand; is it 
possible/advisable to construct a tokenizer based on pure Java 
code rather than derived from a JFlex definition?
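
[A pure-Java segmenter along the lines Paul and Dawid describe could be built on BreakIterator, dropping only whitespace and labelling punctuation with its own token type. This is a sketch of the approach, not Lucene code: a real version would subclass org.apache.lucene.analysis.Tokenizer and set CharTermAttribute/TypeAttribute instead of the hypothetical Token record used here, and the "<PUNCT>" type name is made up for illustration.]

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class PunctAwareSegmenter {

    // Hypothetical stand-in for Lucene's attribute-based token stream.
    record Token(String text, String type) {}

    static List<Token> segment(String text) {
        List<Token> out = new ArrayList<>();
        BreakIterator wb = BreakIterator.getWordInstance(Locale.ROOT);
        wb.setText(text);
        int start = wb.first();
        for (int end = wb.next(); end != BreakIterator.DONE; start = end, end = wb.next()) {
            String seg = text.substring(start, end);
            if (seg.isBlank()) continue; // drop only whitespace, keep everything else
            boolean wordlike = seg.codePoints().anyMatch(Character::isLetterOrDigit);
            out.add(new Token(seg, wordlike ? "<ALPHANUM>" : "<PUNCT>"));
        }
        return out;
    }

    public static void main(String[] args) {
        segment("Hello, world!").forEach(t -> System.out.println(t.text() + "\t" + t.type()));
    }
}
```

A downstream filter can then discard or keep the "<PUNCT>" tokens per use case, which is essentially Dawid's suggestion of returning punctuation as a separate token type.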

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

