lucene-dev mailing list archives

From Dave Kor <>
Subject Re: TermVector retrieval implementation questions
Date Mon, 15 Oct 2001 10:14:14 GMT

--- Dmitry Serebrennikov <>
> Well, I think unindexed fields can't really be used for tvs because
> there are no term ids for them. I mean we could invent them, but then
> there is no facility to compare them and see that two documents use
> the same term. And if there was, well, that's just like a keyword
> field. So I think tvs only apply to indexed fields, period.

That's something new. Unindexed fields such as keyword
fields won't have term ids? I hope you can clarify.

> Right. That's how it works. Well... Actually, right now the files
> will be created no matter what for new segments. However, old
> segments that do not have these files work also. I agree with your
> point in general though.

Hmm.. will there be a way we can convert/add
vectorization to the old segments? The users may want
some kind of migration path to the new format other
than reindexing the entire index. 

> Ok, so you are voting for vectorization flag per index. And if set,
> it applies to all indexed fields (tokenized and keyword). This could
> work. Right now, I have it on a per-field basis (trying to change
> your mind after a field is first used in any document causes an
> exception, just as it currently does with the isIndexed flag). Like
> you said, this is only the initial version. Let's see what other
> ideas happen.

OK, let's wait and see :)
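
For readers following along, the per-field behaviour described in the
quote above could be sketched roughly as below. This is only an
illustration, not the actual patch: the class FieldFlagRegistry and its
methods are hypothetical names, and the only thing taken from the
thread is the rule that the flag is fixed the first time a field is
used and that a later change raises an exception.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry, NOT Lucene's real field-info code.
// It remembers the vectorization flag chosen on a field's first use.
class FieldFlagRegistry {
    private final Map<String, Boolean> vectorizedByField = new HashMap<>();

    // Records the flag the first time a field name is seen.
    // Later calls must pass the same value, otherwise an exception is thrown.
    void register(String fieldName, boolean vectorized) {
        Boolean previous = vectorizedByField.putIfAbsent(fieldName, vectorized);
        if (previous != null && previous != vectorized) {
            throw new IllegalStateException(
                "Field '" + fieldName + "' was first used with vectorized=" + previous);
        }
    }

    // Fields that were never registered are reported as not vectorized.
    boolean isVectorized(String fieldName) {
        return vectorizedByField.getOrDefault(fieldName, false);
    }
}
```

A per-index flag, as suggested earlier in the thread, would replace the
map with a single boolean checked for every indexed field.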

> Well, my app needs them to be exact. It never occurred to me that
> term ids could be non-unique and still be useful. Lucky that I'm the
> one building it! :)

I agree with you. Why settle for lossy when we can
have high fidelity! I guess we'd stick with accuracy
and leave it to the developers who use tvs to loosen
accuracy if they really need to. Lucky that this is an
open source development effort! :)

The only caveat is that I would prefer that the unique
term id generation be computationally fast and low on
storage requirements. (Yes, I know this part only
affects indexing. I'm just trying to stick to
Lucene's goal of fast searching and fast indexing.)

> Interesting! Live and learn! :)
> Meaning that this goes beyond what Lucene does with stemming? For
> example, two absolutely unrelated words (like "cat" and "semaphore")
> might get mapped to the same id? I suppose statistically this might
> still work out to a pretty good clustering.

This is getting off topic (my fault), but sadly that's
more or less the current state of human language
technologies, especially in information retrieval.
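
For contrast with exact ids, one way non-unique ids can come about is
by hashing terms into a fixed number of buckets. This is purely an
assumption for illustration: the thread does not say which lossy
technique was meant, and HashedTermIds is a hypothetical class, not
Lucene code.

```java
// Hypothetical illustration, NOT Lucene code.
// Terms are hashed into a fixed number of buckets, so unrelated terms
// can end up sharing an id. No dictionary has to be stored at all.
class HashedTermIds {
    private final int buckets;

    HashedTermIds(int buckets) {
        if (buckets <= 0) {
            throw new IllegalArgumentException("buckets must be positive");
        }
        this.buckets = buckets;
    }

    // Maps a term to an id in the range [0, buckets).
    // floorMod keeps the result non-negative even for negative hash codes.
    int idFor(String term) {
        return Math.floorMod(term.hashCode(), buckets);
    }
}
```

The fewer buckets there are, the more often unrelated words such as
"cat" and "semaphore" collide; that trades accuracy for a smaller,
fixed-size id space.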
