lucene-dev mailing list archives

From Dave Kor <dave...@yahoo.com>
Subject Re: TermVector retrieval implementation questions
Date Mon, 15 Oct 2001 10:14:14 GMT

--- Dmitry Serebrennikov <dmitrys@earthlink.net> wrote:
> Well, I think unindexed field can't really be used for tvs because
> there are not term ids for them. I mean we could invent them but then
> there is no facility to compare them and see that two documents use
> the same term. And if there was, well that's just like a keyword
> field. So I think tvs only apply to indexed fields, period.

That's something new. Unindexed fields such as keyword
fields won't have term ids? I hope you can clarify
further...

> Right. That's how it works. Well... Actually, right now the files
> will be created no matter what for new segments. However, old
> segments that do not have these files work also. I agree with your
> point in general though.

Hmm... will there be a way to convert the old segments, i.e. add
vectorization to them? Users may want some kind of migration path
to the new format other than reindexing the entire index.

> Ok, so you are voting for vectorization flag per index. And if set,
> it applies to all indexed fields (tokenized and keyword). This could
> work. Right now, I have it on a per-field basis (trying to change
> your mind after a field is first used in any document causes an
> exception, just as it currently does with the isIndexed flag). Like
> you said, this is only the initial version. Let's see what other
> ideas happen.

OK, let's wait and see :)
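
To make the per-field behaviour quoted above concrete, here is a minimal
Java sketch. The class and method names (FieldFlagRegistry, register,
isVectorized) are made up for illustration and are not Lucene's actual
classes. The only thing taken from the thread is the rule that changing
the flag after a field's first use causes an exception.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry for illustration only; not Lucene's real field bookkeeping.
class FieldFlagRegistry {
    // Field name -> whether term vectors are stored for that field.
    private final Map<String, Boolean> vectorizedByField = new HashMap<>();

    // Records the flag the first time a field is seen.
    // A later call with a different value is rejected.
    void register(String field, boolean vectorized) {
        Boolean previous = vectorizedByField.putIfAbsent(field, vectorized);
        if (previous != null && previous != vectorized) {
            throw new IllegalStateException(
                "vectorization flag for field '" + field + "' cannot change after first use");
        }
    }

    // Fields that were never registered are treated as not vectorized.
    boolean isVectorized(String field) {
        return vectorizedByField.getOrDefault(field, false);
    }

    public static void main(String[] args) {
        FieldFlagRegistry registry = new FieldFlagRegistry();
        registry.register("body", true);   // first use fixes the flag
        registry.register("body", true);   // same value again is fine
        try {
            registry.register("body", false); // changing your mind fails
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A per-index flag would replace the map with a single boolean that is
fixed when the index is created.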


> Well, my app needs them to be exact. It never occurred to me that
> term ids could be non-unique and still be useful. Lucky that I'm the
> one building it! :)

I agree with you. Why settle for lossy when we can
have high fidelity! I guess we'd stick with accuracy
and leave it to the developers who use tvs to loosen
accuracy if they really need to. Lucky that this is an
open source development effort! :)

The only caveat is that I would prefer that the unique
term id generation be computationally fast and low on
storage requirements. (Yes, I know this part only
affects indexing. I'm just trying to stick to
Lucene's goal of fast searching and fast indexing.)
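
A minimal sketch of one cheap way to get exact term ids, assuming an
in-memory dictionary that hands out sequential integers. The names
TermIdDictionary, idFor and termFor are hypothetical and are not part of
Lucene or of the implementation discussed in this thread. The cost is
one hash lookup per term and one int per distinct term.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical term dictionary for illustration only.
class TermIdDictionary {
    private final Map<String, Integer> idByTerm = new HashMap<>();
    private final List<String> termById = new ArrayList<>();

    // Returns the existing id for a term, or assigns the next free one.
    // Ids are dense (0, 1, 2, ...) and unique per distinct term.
    int idFor(String term) {
        Integer id = idByTerm.get(term);
        if (id == null) {
            id = termById.size();
            idByTerm.put(term, id);
            termById.add(term);
        }
        return id;
    }

    // Reverse lookup; throws IndexOutOfBoundsException for unknown ids.
    String termFor(int id) {
        return termById.get(id);
    }

    int size() {
        return termById.size();
    }

    public static void main(String[] args) {
        TermIdDictionary dict = new TermIdDictionary();
        System.out.println(dict.idFor("cat"));
        System.out.println(dict.idFor("semaphore"));
        System.out.println(dict.idFor("cat")); // same id as the first call
    }
}
```

Because equal terms always get the same id, two documents' vectors can
be compared by id alone.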


> Interesting! Live and learn! :)
> Meaning that this goes beyond what Lucene does with stemming? For
> example, two absolutely unrelated words (like "cat" and "semaphore")
> might get mapped to the same id? I suppose statistically this might
> still work out to a pretty good clustering.

This is getting off topic (my fault), but sadly that's
more or less the current state of human language
technologies, especially in information retrieval.
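
As a rough illustration of how non-unique term ids can arise, the sketch
below hashes each term into a fixed number of buckets, so unrelated
words may end up sharing an id. This is an assumed scheme for
illustration only (HashedTermIds and bucketFor are hypothetical names);
it is not necessarily the technique referred to above.

```java
// Hypothetical lossy id scheme for illustration only.
class HashedTermIds {
    // Maps a term to one of `buckets` ids. Equal terms always get the
    // same id, but different terms may collide; fewer buckets means
    // more collisions and less storage.
    static int bucketFor(String term, int buckets) {
        if (buckets <= 0) {
            throw new IllegalArgumentException("buckets must be positive");
        }
        // floorMod keeps the result in [0, buckets) even for negative hash codes.
        return Math.floorMod(term.hashCode(), buckets);
    }

    public static void main(String[] args) {
        int buckets = 16; // arbitrary size chosen for the demo
        System.out.println("cat -> " + bucketFor("cat", buckets));
        System.out.println("semaphore -> " + bucketFor("semaphore", buckets));
    }
}
```

With such ids there is no reverse lookup from id to term, which is the
fidelity that exact ids keep.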



