lucene-java-user mailing list archives

From "Beard, Brian" <Brian.Be...@mybir.com>
Subject Tokenization / Analyzer question
Date Thu, 19 Aug 2010 18:01:44 GMT
I'm using Lucene 2.9.1.

I'm indexing documents that each correspond to an ID.
Each field in the ID document is made up of data from all of its subIds.
(It's a requirement that searches work across all subIds within an ID.)

The fields will be indexed and stored in a format similar to:

subId0Value0 subId0Value1 DELIMITER subId1Value0 subId1Value1 DELIMITER ...

This is so the stored value strings can be matched up with the correct
subId for display purposes. In addition, there may be more than one type
of delimiter, to denote the type of data within a subId's data.
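
For illustration, a minimal sketch of how such a field gets added at index
time (the field name "subData" and the DELIMA/DELIMB markers are just
placeholders for this example):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

static void addDoc(IndexWriter writer) throws IOException {
  Document doc = new Document();
  // Stored for display, analyzed for search across all subIds.
  doc.add(new Field("subData",
      "subId0Value0 subId0Value1 DELIMA subId1Value0 subId1Value1 DELIMB",
      Field.Store.YES, Field.Index.ANALYZED));
  writer.addDocument(doc);
}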

Moreover, for some of the fields I'd like to be able to tell which
subId the search terms correspond to.

What I'd like to do is encode the delimiter type into a payload, but use
the same token text for all delimiter types. The delimiter tokens
shouldn't affect searches, because the query-time analyzer will filter
them out. The point is that in a post-processing step on the returned
results, all payloads can be extracted at once from a single term,
instead of having to loop through them term by term (I don't see a way
to get payloads on a per-document basis - otherwise I could add a
payload to the first term whenever the subId changes during indexing).
The search term positions could then be checked against the delimiter
positions to determine the record and the type of data.
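
For example, that post-processing step would look something like this
(the field name "subData" and the normalized token "DELIM" are my
placeholders again):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;

static void collectDelimiterPayloads(IndexReader reader) throws IOException {
  // One pass over the postings of the single delimiter term yields every
  // delimiter position and payload, document by document, in one shot.
  TermPositions tp = reader.termPositions(new Term("subData", "DELIM"));
  while (tp.next()) {                       // advances per matching document
    int doc = tp.doc();
    for (int i = 0; i < tp.freq(); i++) {
      int pos = tp.nextPosition();          // token position of this delimiter
      if (tp.isPayloadAvailable()) {
        byte[] payload = tp.getPayload(new byte[tp.getPayloadLength()], 0);
        // payload[0] holds the delimiter type encoded at index time;
        // compare pos against the hit's term positions afterwards.
      }
    }
  }
  tp.close();
}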

I've got an example performing the tokenization and payload encoding
correctly, but the stored values are not the ones I want.

What I've done is write a BoundaryTokenFilter class that is always
applied *after* the StandardAnalyzer, WhitespaceAnalyzer, etc. - e.g.
new BoundaryTokenFilter(new StandardAnalyzer(Version.LUCENE_29).tokenStream(fieldName, reader)).

In order to get the delimiter tokens down to BoundaryTokenFilter intact
through the other tokenizers/TokenFilters higher up the chain, I'm using
character-based delimiter markers (a different one for each delimiter
type, which also carries some additional encoded information for the
payload). Inside BoundaryTokenFilter, the character-based marker is
decoded, the token text is changed over to the single delimiter value,
and the payload is added.
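
A stripped-down version of the filter using the 2.9 attribute API (the
"DELIMx" marker scheme, with the type as a trailing character, is a
stand-in for my real encoding):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.index.Payload;

public final class BoundaryTokenFilter extends TokenFilter {
  private final TermAttribute termAtt = addAttribute(TermAttribute.class);
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

  public BoundaryTokenFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) return false;
    String term = termAtt.term();
    if (term.startsWith("DELIM") && term.length() > 5) {
      // Decode the delimiter type from the marker's trailing character...
      byte type = (byte) term.charAt(5);
      // ...normalize every marker to the same single token text...
      termAtt.setTermBuffer("DELIM");
      // ...and stash the type in the payload.
      payloadAtt.setPayload(new Payload(new byte[] { type }));
    }
    return true;
  }
}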

This part works, but the problem I'm having is that I would like the
delimiter values stored in the index to be the same as the single
tokenized delimiter value.

Is there any way to do this - i.e., transform the stored values to match
the tokenized values, but only for select tokens?

If that isn't possible, the other alternative I can think of is to wrap
StandardAnalyzer with indexing and search modes, so that in indexing
mode it lets the delimiter tokens pass through and flags them with a
TypeAttribute, while in search mode it drops them. For that, though, I
would most likely end up having to use different delimiter values.
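
Roughly what I have in mind (ModalAnalyzer is just a made-up name, and
DelimiterStripFilter would be a trivial filter that discards the marker
tokens):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

public class ModalAnalyzer extends Analyzer {
  private final boolean indexMode;
  private final Analyzer delegate = new StandardAnalyzer(Version.LUCENE_29);

  public ModalAnalyzer(boolean indexMode) {
    this.indexMode = indexMode;
  }

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = delegate.tokenStream(fieldName, reader);
    // Index mode keeps the delimiters (normalized and payloaded);
    // search mode strips them so they can't affect queries.
    return indexMode ? new BoundaryTokenFilter(ts)
                     : new DelimiterStripFilter(ts);
  }
}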

Any help is appreciated,

Brian Beard



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

