--- Dmitry Serebrennikov wrote:
> Well, I think unindexed fields can't really be used for tvs because
> there are no term ids for them. I mean we could invent them, but then
> there is no facility to compare them and see that two documents use
> the same term. And if there was, well, that's just like a keyword
> field. So I think tvs only apply to indexed fields, period.

That's something new. Unindexed fields such as keyword fields won't have term ids? I hope you can clarify further...

> Right. That's how it works. Well... Actually, right now the files
> will be created no matter what for new segments. However, old
> segments that do not have these files work also. I agree with your
> point in general though.

Hmm.. will there be a way to convert the old segments or add vectorization to them? Users may want some kind of migration path to the new format other than reindexing the entire index.

> Ok, so you are voting for a vectorization flag per index. And if set,
> it applies to all indexed fields (tokenized and keyword). This could
> work. Right now, I have it on a per-field basis (trying to change
> your mind after a field is first used in any document causes an
> exception, just as it currently does with the isIndexed flag). Like
> you said, this is only the initial version. Let's see what other
> ideas happen.

OK, let's wait and see :)

> Well, my app needs them to be exact. It never occurred to me that
> term ids could be non-unique and still be useful. Lucky that I'm the
> one building it! :)

I agree with you. Why settle for lossy when we can have high fidelity! I guess we'd stick with accuracy and leave it to the developers who use tvs to loosen accuracy if they really need to. Lucky that this is an open source development effort! :)

The only caveat is that I would prefer the unique term id generation to be computationally fast and low on storage requirements. (Yes, I know this part only affects indexing.. I'm just trying to stick to Lucene's goal of fast searching and fast indexing.)

> Interesting! Live and learn! :)
> Meaning that this goes beyond what Lucene does with stemming? For
> example, two absolutely unrelated words (like "cat" and "semaphore")
> might get mapped to the same id? I suppose statistically this might
> still work out to a pretty good clustering.

This is getting off topic (my fault), but sadly that's more or less the current state of human language technologies, especially in information retrieval.
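
Going back to the exact versus lossy term id point, here is a minimal sketch of the two options side by side. This is not Lucene code and not how the term vector patch does it; the names `TermIdSketch`, `ExactTermIds` and `lossyIdFor` are made up purely for illustration. The exact scheme keeps a dictionary so every distinct term gets its own id, while the lossy scheme hashes terms into a fixed number of buckets, so unrelated terms may end up sharing an id.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: hypothetical names, not part of Lucene.
class TermIdSketch {

    /** Exact scheme: each distinct term gets the next unused integer. */
    static final class ExactTermIds {
        private final Map<String, Integer> ids = new HashMap<>();

        int idFor(String term) {
            Integer id = ids.get(term);
            if (id == null) {
                id = ids.size();      // next free id
                ids.put(term, id);    // costs storage: one entry per distinct term
            }
            return id;
        }
    }

    /**
     * Lossy scheme: no dictionary at all, the id is a hash bucket.
     * Different terms can collide, and more often as buckets shrinks.
     */
    static int lossyIdFor(String term, int buckets) {
        return Math.floorMod(term.hashCode(), buckets);
    }

    public static void main(String[] args) {
        ExactTermIds exact = new ExactTermIds();
        String[] terms = {"cat", "semaphore", "cat"};
        for (String t : terms) {
            System.out.println(t + " exact=" + exact.idFor(t)
                    + " lossy=" + lossyIdFor(t, 16));
        }
    }
}
```

The exact scheme pays for uniqueness with a dictionary lookup and storage that grows with the vocabulary, which is the indexing cost mentioned above. The lossy scheme is cheap and needs no stored dictionary, but anyone comparing ids can no longer tell whether two documents really share a term.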