lucene-java-user mailing list archives

From Erik Hatcher
Subject Re: Type information on Tokens?
Date Thu, 24 Apr 2003 17:37:46 GMT
Here's another vote for keeping the type information in the index.  I 
suspect where this would break down is when a field has two tokens that 
are textually the same but are considered different types, so maybe 
it's not a good idea technically.  I'd love to hear more about the 
pros and cons of keeping this information, as it seems quite useful 
during searching.
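
To make that concern concrete, here is a rough sketch in plain Java. It
does not use Lucene at all: the toy index(), the tagged() helper, and the
part-of-speech labels are all made up for illustration. It only shows what
I'd expect to happen when postings are keyed on term text alone, compared
with the tagged-token idea ("<noun>patient") from Dan's message quoted
below.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model only: NOT Lucene code. All names here are hypothetical.
class TypedTermSketch {

    // Assumed encoding from the quoted message: "<type>text".
    public static String tagged(String type, String text) {
        return "<" + type + ">" + text;
    }

    // Minimal inverted index: term text -> positions where it occurs.
    public static Map<String, SortedSet<Integer>> index(List<String> terms) {
        Map<String, SortedSet<Integer>> postings = new TreeMap<>();
        for (int pos = 0; pos < terms.size(); pos++) {
            postings.computeIfAbsent(terms.get(pos), k -> new TreeSet<>()).add(pos);
        }
        return postings;
    }

    public static void main(String[] args) {
        // "the patient was patient": same text, two parts of speech.
        List<String> plain = List.of("the", "patient", "was", "patient");
        List<String> typed = List.of(
                tagged("det", "the"), tagged("noun", "patient"),
                tagged("verb", "was"), tagged("adj", "patient"));

        // Keyed on text alone, both occurrences share one term entry,
        // so there is nowhere to record that their types differ.
        System.out.println(index(plain).get("patient"));        // [1, 3]

        // With the type folded into the term text they stay distinct...
        System.out.println(index(typed).get("<noun>patient"));  // [1]

        // ...but an untagged lookup no longer matches anything.
        System.out.println(index(typed).get("patient"));        // null
    }
}
```

If this toy model is a fair picture, the last lookup hints at one likely
pitfall of the tagged-token approach: queries would have to carry the same
tags, or the untagged text would have to be indexed separately as well.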


On Thursday, April 24, 2003, at 01:09  PM, Armbrust, Daniel C. wrote:
> If I wanted to build an index where all of the words were tagged with 
> part of speech information, it seems that the type field of the Token 
> would be the place to put this.
> But, as I understand it, Lucene does not keep track of the type fields 
> that are assigned during tokenizing, and therefore doesn't use them 
> while searching.
> How could I go about keeping track of part of speech information in my 
> index?
> So far, I can only think of two ways to accomplish this.  The first is 
> to build it into my tokens, i.e. my tokens would look something like 
> "<noun>patient".  I'm afraid there may be some pitfalls with this 
> approach that I haven't identified yet, since I haven't tried it out.
> The second is to make Lucene use the type field in its index.  But 
> would I be correct in assuming this would not be a trivial change?  I 
> have looked over the source a bit, but I don't yet have a full grasp of 
> how hits are found and scored.
> Thanks,
> Dan
