lucene-java-user mailing list archives

From "Gregor Heinrich" <Gregor.Heinr...@igd.fhg.de>
Subject RE: Multiple fields identical terms.
Date Wed, 30 Jul 2003 14:55:43 GMT
Hi.

Thanks for your suggestion; I think the storage overhead is bearable.

Actually, I am building a forward index in addition to the inverted index.
The result will be a meta-search engine that combines the Lucene IR process
proper with an aspect model similar to Latent Semantic Analysis. Storing the
forward index requires a term-document matrix in which every term is unique
regardless of the field. This kind of vector space indexing could also be
useful for other purposes, such as document classification.
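
A minimal sketch of such a term-document matrix, assuming each document
arrives as a map from field name to already-tokenized terms. The class
ForwardIndex and its methods are hypothetical and not part of Lucene; the
point is only that the vocabulary is keyed on term text alone, so the same
text in two fields shares one column.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical forward index: one sparse row per document, one column per unique term text. */
public class ForwardIndex {
    // Term text -> column id. The field name is deliberately not part of the key.
    private final Map<String, Integer> termIds = new LinkedHashMap<>();
    // One sparse row per document: column id -> term frequency summed over all fields.
    private final List<Map<Integer, Integer>> rows = new ArrayList<>();

    /** Adds a document (field name -> tokenized terms) and returns its row number. */
    public int addDocument(Map<String, List<String>> fields) {
        Map<Integer, Integer> row = new LinkedHashMap<>();
        for (List<String> terms : fields.values()) {
            for (String term : terms) {
                Integer id = termIds.get(term);
                if (id == null) {
                    // First time this term text is seen in any field: new column.
                    id = termIds.size();
                    termIds.put(term, id);
                }
                row.merge(id, 1, Integer::sum);
            }
        }
        rows.add(row);
        return rows.size() - 1;
    }

    /** Number of unique term texts across all fields and documents. */
    public int vocabularySize() {
        return termIds.size();
    }

    /** Frequency of a term text in a document, counted across all of its fields. */
    public int frequency(int doc, String term) {
        Integer id = termIds.get(term);
        if (id == null) {
            return 0;
        }
        return rows.get(doc).getOrDefault(id, 0);
    }
}
```

With this layout, a term that occurs once in the title and once in the
abstract contributes a single column with frequency 2.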

One idea is to maintain an additional Hashtable that checks for uniqueness
and attaches additional information to a term, such as its phonetic encoding
or its cataloguing key. But I wanted to use as much of the existing
infrastructure as possible and stay compatible with it.
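
As a rough sketch of the Hashtable idea, the following keeps one record per
term text and lets extra attributes be attached to it. TermRegistry, TermInfo
and the attribute names are assumptions made for illustration; nothing here
is Lucene API, and the phonetic encoder itself is left out.

```java
import java.util.Hashtable;
import java.util.Map;

/** Hypothetical side table that keeps term texts unique across fields. */
public class TermRegistry {

    /** Hypothetical per-term record; the attribute names are illustrative only. */
    public static final class TermInfo {
        public final int id;
        public String phoneticCode; // set by whatever phonetic encoder is used
        public String catalogKey;   // set by the cataloguing scheme in use

        TermInfo(int id) {
            this.id = id;
        }
    }

    private final Map<String, TermInfo> terms = new Hashtable<>();

    /** Returns the existing record for this term text, or registers a new one. */
    public synchronized TermInfo register(String termText) {
        TermInfo info = terms.get(termText);
        if (info == null) {
            info = new TermInfo(terms.size());
            terms.put(termText, info);
        }
        return info;
    }

    /** Number of unique term texts registered so far. */
    public int size() {
        return terms.size();
    }
}
```

Calling register with the same text from different fields returns the same
record, so any attached information is stored once per term text.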

I also thought of changing the way fields and terms are associated with
each other, i.e., allowing a list of fields in each Term object and thus
making term texts unique. But this would require a substantial redesign of
the index file and access structure...

Gregor



-----Original Message-----
From: Erik Hatcher [mailto:lists@ehatchersolutions.com]
Sent: Wednesday, July 30, 2003 2:40 PM
To: Lucene Users List
Subject: Re: Multiple fields identical terms.


On Wednesday, July 30, 2003, at 06:16  AM, Gregor Heinrich wrote:
> I would like to have unique term texts in my term enumeration. That is,
> across all fields there should be no duplicate term text.
>
> An easy solution would be to only use one field.
>
> But does someone know an alternative way with multiple fields?

What about putting both abstract and title together into a single new
field called "keywords"?  Leave title and abstract there as well, but
just append the two strings together (with a space in the middle to
tokenize properly! :).

Is that a reasonable alternative?  What are you trying to accomplish?

	Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


