lucene-java-user mailing list archives

From karl wettin <>
Subject Re: Using Lucene for searching tokens, not storing them.
Date Sun, 16 Apr 2006 17:18:59 GMT

On 15 Apr 2006, at 21:32, Paul Elschot wrote:
>> implements TermPositions {
>>          public int nextPosition() throws IOException {
> This enumerates all positions of the Term in the document
> as returned by the Tokenizer used by the Analyzer

Aha. And I didn't see the TermPositionVector until now.

This leads me to a new question. How are multiple fields with the same
name treated? Are the positions concatenated, or do they sit on a
"z-axis" where each instance starts over at 0? I see SpanQuery trouble
with both.

Concatenation renders SpanFirst unusable on field instances n > 0:
	[hello,0] [world,1] [foo,2] [bar,3]

A "z-axis" messes up SpanNear, as "hello bar" would count as a match:
	[hello,0] [world,1]
	[foo,0] [bar,1]

Hmm.. (with double semantics, as this would mean I can't use the term
positions to train my hidden Markov models).
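
The two layouts above can be sketched without touching Lucene at all.
This is only a toy model under stated assumptions: the class
FieldPositions and the helpers first() and adjacent() are hypothetical
stand-ins for SpanFirst and a zero-slop ordered SpanNear, not the
Lucene classes themselves, and it says nothing about which layout
Lucene actually uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FieldPositions {
    // Concatenated layout: positions keep counting across field instances.
    static Map<String, List<Integer>> concatenated(List<List<String>> instances) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int pos = 0;
        for (List<String> instance : instances) {
            for (String term : instance) {
                positions.computeIfAbsent(term, t -> new ArrayList<>()).add(pos++);
            }
        }
        return positions;
    }

    // "Z-axis" layout: positions restart at 0 for every field instance.
    static Map<String, List<Integer>> zAxis(List<List<String>> instances) {
        Map<String, List<Integer>> positions = new HashMap<>();
        for (List<String> instance : instances) {
            int pos = 0;
            for (String term : instance) {
                positions.computeIfAbsent(term, t -> new ArrayList<>()).add(pos++);
            }
        }
        return positions;
    }

    // Toy SpanFirst: does the term occur at a position below 'end'?
    static boolean first(Map<String, List<Integer>> positions, String term, int end) {
        for (int p : positions.getOrDefault(term, Collections.emptyList())) {
            if (p < end) return true;
        }
        return false;
    }

    // Toy ordered SpanNear with slop 0: does b occur directly after a?
    static boolean adjacent(Map<String, List<Integer>> positions, String a, String b) {
        List<Integer> bPositions = positions.getOrDefault(b, Collections.emptyList());
        for (int p : positions.getOrDefault(a, Collections.emptyList())) {
            if (bPositions.contains(p + 1)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<List<String>> instances = Arrays.asList(
                Arrays.asList("hello", "world"),
                Arrays.asList("foo", "bar"));
        Map<String, List<Integer>> concat = concatenated(instances);
        Map<String, List<Integer>> z = zAxis(instances);
        System.out.println("concatenated, foo first:   " + first(concat, "foo", 1));
        System.out.println("z-axis, foo first:         " + first(z, "foo", 1));
        System.out.println("concatenated, 'hello bar': " + adjacent(concat, "hello", "bar"));
        System.out.println("z-axis, 'hello bar':       " + adjacent(z, "hello", "bar"));
    }
}
```

In this model "foo" opens the second instance but only counts as
first under the z-axis layout, while "hello bar" only looks adjacent
under the z-axis layout, which is the trade-off described above.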

Thanks for explaining!

For any interested party, I do this because I have a fairly small
corpus under very heavy load. I think there is a lot to gain by not
creating new instances of whatnot, seeking in the file-centric
Directory, parsing pseudo-UTF8, etc. at query time. I simply store
all instances of everything (the index) in a bunch of Lists and Maps.
Bits are cheaper than ticks.
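
A minimal sketch of what "the index in a bunch of Lists and Maps"
might look like, assuming a plain term -> document -> positions
layout. The class InMemoryIndex and its methods are hypothetical
names for illustration only; the actual structures are not shown in
this message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class InMemoryIndex {
    // term -> (document number -> positions of the term in that document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    // Index an already tokenized document; positions are the token offsets.
    void add(int doc, List<String> tokens) {
        for (int pos = 0; pos < tokens.size(); pos++) {
            postings.computeIfAbsent(tokens.get(pos), t -> new TreeMap<>())
                    .computeIfAbsent(doc, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Document numbers containing the term, in ascending order.
    List<Integer> docs(String term) {
        Map<Integer, List<Integer>> byDoc =
                postings.getOrDefault(term, Collections.emptyMap());
        return new ArrayList<>(byDoc.keySet());
    }

    // Positions of the term in one document; empty if it does not occur.
    List<Integer> positions(String term, int doc) {
        Map<Integer, List<Integer>> byDoc =
                postings.getOrDefault(term, Collections.emptyMap());
        return byDoc.getOrDefault(doc, Collections.emptyList());
    }

    public static void main(String[] args) {
        InMemoryIndex index = new InMemoryIndex();
        index.add(0, Arrays.asList("hello", "world"));
        index.add(1, Arrays.asList("world", "hello", "world"));
        System.out.println(index.docs("world"));
        System.out.println(index.positions("world", 1));
    }
}
```

Lookups here are plain map reads on objects that already exist, so
nothing is seeked, decoded or instantiated per query beyond the
result lists; the price is keeping the whole index on the heap.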
