Karl,
This might work for you:
https://issues.apache.org/jira/browse/LUCENE-293
Regards,
Paul Elschot
On Friday 14 December 2007 18:06:01 Karl Wettin wrote:
> I have an index that contains three sorts of documents:
>
> Car brand
> Tire brand
> Tire pressure
>
> (Please bear with me, the real index has nothing to do with cars. I
> just try to explain the problem in an alternative domain to avoid NDA
> conflicts.)
>
> There is a heirarchial composite relationship between these sort of
> documents. A document describing "tire pressure" also contains "tire
> brand" and "car brand". A document describing "tire brand" also
> contains information about "car brand". A document describing "car
> brand" contains only that.
>
> The requirement is that the consumer of the API should not have to
> specify what fields they are searching in. There is no time (nor
> training data) to implement a hidden markov model (HMM) tokenizer or
> something along that path in order to extract possible attributes from
> the query string. Instead the query string is tokenized once per field
> and assebled to one huge query. Normally this works fairly well.
>
> Here are some example documents:
>
> Volvo
> Volvo, Michelin
> Volvo, Nokian
> Volvo, Nokian, 2.2 bars
> Volvo, Firestone, 2.4 bars
>
> Saab
> Saab, Michelin
> Saab, Nokian
> Saab, Nokian, 2.1 bars
> Saab, Firestone
> Saab, Firestone, 2.4 bars
> Saab, Firestone, 2.5 bars
>
> If I search for Saab the top result will be the document representing
> the car brand "Saab". The query would look like this: "car:saab
> tire:saab preasure:saab"
>
> But lets say Saab starts manufacturing tires too:
>
> Saab
> Saab, Saab tires
> Saab, Saab tires, 1.9 bars
> Saab, Saab tires, 1.8 bars
>
> If I search for "Saab" I still want the top result to be Saab the car
> brand. But it not longer is, the match for "Saab, Saab tires" now
> have a greater score than "Saab", of course.
>
> My idea is to work along the line of indexing "Saab" in the tire brand
> and tire pressure field too. Now searching for Saab will yeild a
> result where the car brand "Saab" is the top result.
>
> However, this will not work as I have different tokenization
> strategies for each field (stemming and what not). Tokenizing the
> query string Saab for the field "tire brand" in Swedish might end up
> as "saa" and will thus not find the token Saab inserted for the
> document describing the car brand Saab.
>
> I have a couple of experiments in my head I need to try out, starting
> with tokezining query strings per field and using the tokens generated
> for the field car brand as query in the tire brand and tire pressure
> too. And vice versus.
>
> Any brilliant ideas that might work? Hacky solutions are OK.
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
|