lucene-general mailing list archives

From "K. M. McCormick" <kyliemccorm...@gmail.com>
Subject Tokenizer, TokenStream, Token Filters
Date Mon, 10 Aug 2009 22:52:14 GMT
Hello Again:

I'm trying to figure out exactly what the tokenizer and filters do to terms in Lucene, specifically:

StandardTokenizer
StandardFilter

While these are usually 'enough' for my work, I need to know specifically
what happens to the tokens as they pass through: how the text is split, what
gets stripped, and so on. I need this to make sure my indexed terms match my
queries, which are being parsed and modified very precisely. I was tempted to
write my own filter (say, MyCrazyFilter), but I hesitate to throw away the
standard components for no good reason.
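
For what it's worth, the quickest way I've found to see what an analyzer
actually emits is to loop over its TokenStream and print each term. A minimal
sketch, assuming a recent Lucene (5+) with the attribute-based API; older
releases used a next(Token) method instead:

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class DumpTokens {
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new StandardAnalyzer();
        // Print every token the analyzer produces for a sample string.
        try (TokenStream ts = analyzer.tokenStream("field", "Émile's U.S.A. e-mail, etc.")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();
        }
    }
}

Running that against a few problem strings shows exactly where
StandardTokenizer splits and what StandardFilter strips.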

Also, I have had a hard time finding information about writing your own
Tokenizers and TokenFilters, beyond the bare fact that it is possible. Most
of what I want to do is fairly simple, but I can't find much on how Lucene
itself does it.
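
For concreteness, here is the kind of skeleton I have in mind for
MyCrazyFilter; the class name and the punctuation set are just placeholders
of mine, and again this uses the attribute API from recent Lucene versions:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/** Strips trailing punctuation and drops tokens that are punctuation only. */
public final class MyCrazyFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public MyCrazyFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        while (input.incrementToken()) {
            char[] buf = termAtt.buffer();
            int len = termAtt.length();
            // Truncate trailing sentence punctuation such as '.', ',', '!', '?'.
            while (len > 0 && isPunctuation(buf[len - 1])) {
                len--;
            }
            if (len > 0) {
                termAtt.setLength(len);   // keep the shortened token
                return true;
            }
            // The token was punctuation only: skip it and pull the next one.
        }
        return false;
    }

    private static boolean isPunctuation(char c) {
        return ".,:;!?'\"()".indexOf(c) >= 0;
    }
}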

I specifically want to ensure the following (a rough combined sketch follows
the list):
- tokens are split at whitespace only, not at any other kind of mark
- tokens have no accents (I use a normalizer for this)
- tokens never consist solely of punctuation (I use a simple function for this)
- tokens retain no 'oddball' leftovers, such as the punctuation at the end of
a sentence (I truncate this).
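
Putting those four points together, I imagine a chain roughly like the
following, with WhitespaceTokenizer handling the first point,
ASCIIFoldingFilter folding the accents, and the MyCrazyFilter sketch above
covering the punctuation rules. The signatures follow Lucene 5+; older
versions pass a Version and a Reader:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;

public final class WhitespaceOnlyAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();      // split on whitespace only
        TokenStream sink = new ASCIIFoldingFilter(source); // fold accented chars to ASCII
        sink = new MyCrazyFilter(sink);                    // punctuation handling, per the sketch above
        return new TokenStreamComponents(source, sink);
    }
}

Does that look like a sane way to structure it, or is there a standard
combination I'm overlooking?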

Thanks,
drago
