lucene-dev mailing list archives

From Chris Hostetter <>
Subject Re: TokenFilter question
Date Wed, 19 Mar 2008 03:55:04 GMT

: I was trying to apply both
: org.apache.solr.analysis.WordDelimiterFilter and 
: org.apache.lucene.analysis.ngram.NGramTokenFilter.
: Can I achieve this with lucene's TokenStream?

Sure ... you just have to pick an ordering and wrap one around the other.  
Solr does this any time you define an <analyzer> using a <tokenizer> and a 
list of <filter>s.
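
As a rough illustration of "wrap one around the other", here is a minimal
sketch in Python. It does not use the Lucene or Solr classes: the three
functions are hypothetical, heavily simplified stand-ins for a tokenizer,
a word-delimiter filter, and an n-gram filter, and the default gram sizes
are assumptions. The only point is that each filter consumes the stream
produced by the one it wraps, so the ordering you pick changes the output.

```python
import re


def whitespace_tokenizer(text):
    """Yield whitespace-separated tokens (stand-in for a tokenizer)."""
    yield from text.split()


def word_delimiter_filter(tokens):
    """Split each token on non-alphanumeric characters.

    A much simplified stand-in for what a word-delimiter filter does.
    """
    for token in tokens:
        for part in re.split(r"[^A-Za-z0-9]+", token):
            if part:  # drop empty pieces left by leading/trailing delimiters
                yield part


def ngram_filter(tokens, min_gram=2, max_gram=3):
    """Emit the character n-grams of each token, shortest grams first."""
    for token in tokens:
        for n in range(min_gram, max_gram + 1):
            for i in range(len(token) - n + 1):
                yield token[i:i + n]


def analyze(text):
    # Ordering chosen here: split on delimiters first, then n-gram each part.
    return list(ngram_filter(word_delimiter_filter(whitespace_tokenizer(text))))


def analyze_reversed(text):
    # The opposite ordering: n-gram the raw token, then split each gram.
    return list(word_delimiter_filter(ngram_filter(whitespace_tokenizer(text))))
```

With these stand-ins, analyze("wi-fi") yields only "wi" and "fi", while 
analyze_reversed("wi-fi") also yields the single-character fragments left 
over from grams that straddled the hyphen.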

: While thinking about TokenFilters, I came to an idea that 
: the TokenStream should have a structured representation. 

I've thought about that once or twice over the years as well... it would 
make things like multiword synonyms a lot easier to deal with if instead 
of a TokenStream we could have a directed TokenGraph with a single start 
and a single end (i.e., only one node with no incoming links and only one 
node with no outgoing links).
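
To make the idea concrete, here is a small sketch of what such a structure
might look like, in Python. Nothing like this exists in Lucene as far as
this message is concerned: the TokenGraph class, its method names, and the
example sentence are all hypothetical. Tokens sit on the edges, the graph
is assumed to be acyclic, and the "single start, single end" rule is a
simple check on nodes with no incoming or no outgoing links.

```python
from collections import defaultdict


class TokenGraph:
    """Hypothetical directed acyclic graph whose edges carry token text."""

    def __init__(self):
        self.edges = defaultdict(list)  # node -> [(token, next_node), ...]
        self.nodes = set()

    def add_token(self, src, token, dst):
        self.edges[src].append((token, dst))
        self.nodes.update((src, dst))

    def starts(self):
        """Nodes with no incoming links."""
        targets = {dst for outs in self.edges.values() for _, dst in outs}
        return sorted(self.nodes - targets)

    def ends(self):
        """Nodes with no outgoing links."""
        return sorted(n for n in self.nodes if not self.edges.get(n))

    def is_well_formed(self):
        return len(self.starts()) == 1 and len(self.ends()) == 1

    def paths(self, node=None):
        """Yield every token sequence from the start node to an end node."""
        if node is None:
            node = self.starts()[0]
        if not self.edges.get(node):
            yield []
            return
        for token, dst in self.edges[node]:
            for rest in self.paths(dst):
                yield [token] + rest


def example_graph():
    """'dns' as a one-token alternative to 'domain name system'."""
    g = TokenGraph()
    g.add_token(0, "the", 1)
    g.add_token(1, "domain", 2)
    g.add_token(2, "name", 3)
    g.add_token(3, "system", 4)
    g.add_token(1, "dns", 4)  # the synonym spans nodes 1 -> 4 in one hop
    g.add_token(4, "server", 5)
    return g
```

In the example graph the multiword phrase and its synonym are two routes 
between the same pair of nodes, which is exactly what a flat stream of 
tokens cannot express directly.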

But even if you had a graph-based API for Analyzers to express the set of 
tokens found, what would the end product look like?  What would the 
format be of an index that stored Term position information as graph 
connections (essentially 3 dimensional info) instead of a simple 
numeric position (1 dimensional)?  Could it be searched as quickly?

Most of the time, things that I think would be easier with a TokenGraph 
are still feasible with judicious use of positionIncrement, slop, and 
artificial "marker tokens" ... with Payloads, even more complex things 
should move into the realm of "practical" (but it's likely I'm putting 
Payloads on too much of a pedestal ... I've never actually tried using 
them for anything).
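
For the positionIncrement part, a minimal sketch in Python of how
increments turn into positions. The function names and the sample tokens
are made up for illustration, and the phrase check is a naive exact match
that ignores slop entirely. The one behaviour it relies on is that a token
with an increment of 0 lands on the same position as the token before it,
which is how a single-word synonym gets stacked in a flat stream.

```python
def absolute_positions(tokens):
    """Turn (text, position_increment) pairs into (text, position) pairs.

    The counter starts at -1 so that a first token with increment 1
    lands on position 0.
    """
    position = -1
    result = []
    for text, increment in tokens:
        position += increment
        result.append((text, position))
    return result


def matches_exact_phrase(tokens, phrase):
    """Naive exact phrase match: each word must sit one position after the last."""
    positions = {}
    for text, pos in absolute_positions(tokens):
        positions.setdefault(text, set()).add(pos)
    return any(
        all(start + i in positions.get(word, set()) for i, word in enumerate(phrase))
        for start in positions.get(phrase[0], set())
    )


# "fast" is injected as a synonym of "quick" with an increment of 0.
SAMPLE = [("the", 1), ("quick", 1), ("fast", 0), ("fox", 1)]
```

With SAMPLE, "quick" and "fast" share a position, so both "quick fox" and 
"fast fox" match as phrases while "quick fast" does not.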

