lucene-java-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: stemmed search and exact match on "same" field
Date Mon, 14 Aug 2006 19:08:33 GMT

There's a lot of information in your email, and a lot of questions that
relate to similar topics and address different ways of accomplishing
similar but distinct things ... too much for me to digest
all at once, so lemme start by seeing if i can summarize your goal, and
then give you my suggestion based on the goal as i see it...

You want simple term matches to be "stemmed" but you want phrase queries to
be "unstemmed"

so if a user queries for the word...
	jumped
...you want that to match any of the words: jump, jumps, jumped, etc...

if a user queries for...
	"the dogs"
...you want that to only match the exact phrase and not something with the
tokens "the dog"

you want these ideas to work, even if phrases and terms are mixed in
the user's query...
	foo:jumped bar:"the dogs"

My first thought is that you keep using two versions of the field (one
stemmed and one unstemmed) and you then subclass QueryParser and override
the getFieldQuery(String field, String queryText) method ... if the second
arg looks like a phrase to you (ie: it has spaces or what not) then return
super.getFieldQuery(field, queryText).  If it's not a phrase, then call
super.getFieldQuery(field + "_STEMMED", queryText).
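
To make that concrete, here is a minimal Java sketch of just the routing
decision.  StemRouting, looksLikePhrase and targetField are made-up names
for illustration (they are not Lucene API), and the "contains whitespace"
test is only one guess at what "looks like a phrase" could mean.  It also
assumes the analyzer handed to the parser stems only the "_STEMMED" field,
e.g. via PerFieldAnalyzerWrapper.

```java
// Hypothetical helper, not part of Lucene: decides which index field a
// piece of query text should be sent to.
final class StemRouting {
    // Suffix of the stemmed copy of each field (naming taken from the text above).
    static final String STEMMED_SUFFIX = "_STEMMED";

    // Crude phrase test (an assumption): whitespace inside the trimmed text
    // means the user typed more than one word.
    static boolean looksLikePhrase(String queryText) {
        return queryText.trim().chars().anyMatch(Character::isWhitespace);
    }

    // Phrases go to the plain (unstemmed) field, single terms to the stemmed one.
    static String targetField(String field, String queryText) {
        return looksLikePhrase(queryText) ? field : field + STEMMED_SUFFIX;
    }
}

// Inside a QueryParser subclass the override would then look roughly like
// this (check the exact signature against your Lucene version):
//
//   protected Query getFieldQuery(String field, String queryText)
//           throws ParseException {
//       return super.getFieldQuery(
//               StemRouting.targetField(field, queryText), queryText);
//   }
```

With this, targetField("foo", "jumped") picks "foo_STEMMED" while
targetField("bar", "the dogs") stays on "bar".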

where this breaks down is if you want the non-stemmed behavior even if the
user's "phrase" only contains one word, ie...
	foo:jumped bar:"dogs"
...because the information that "dogs" was in quotes is lost by the time
getFieldQuery is called.  You'd have to write a lot more QueryParsing code
to get that behavior.
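
As a rough idea of what that extra code has to do, the sketch below scans
the raw query string itself, before any QueryParser sees it, and records
whether each value was written in quotes.  QuoteAwareClauses and Clause are
hypothetical names, and the regex only understands simple field:value and
field:"value" clauses, nowhere near the full QueryParser syntax (no boolean
operators, grouping, escapes, etc).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-pass, not part of Lucene: keeps the "was it quoted?"
// bit that is otherwise gone by the time getFieldQuery is called.
final class QuoteAwareClauses {

    // One field:value clause plus whether the value was in quotes.
    static final class Clause {
        final String field;
        final String text;
        final boolean quoted;

        Clause(String field, String text, boolean quoted) {
            this.field = field;
            this.text = text;
            this.quoted = quoted;
        }

        // Quoted values go to the unstemmed field, bare words to the stemmed one.
        String targetField() {
            return quoted ? field : field + "_STEMMED";
        }
    }

    // Matches  field:"quoted text"  (group 2)  or  field:bareword  (group 3).
    private static final Pattern CLAUSE =
            Pattern.compile("(\\w+):(?:\"([^\"]*)\"|(\\S+))");

    static List<Clause> split(String rawQuery) {
        List<Clause> clauses = new ArrayList<>();
        Matcher m = CLAUSE.matcher(rawQuery);
        while (m.find()) {
            boolean quoted = m.group(2) != null;
            String text = quoted ? m.group(2) : m.group(3);
            clauses.add(new Clause(m.group(1), text, quoted));
        }
        return clauses;
    }
}
```

For foo:jumped bar:"dogs" this yields two clauses: "jumped" aimed at
foo_STEMMED, and "dogs" (flagged as quoted) aimed at the unstemmed bar.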


In general, for your goal, i would not attempt to put both the stemmed and
unstemmed tokens in the same field -- because as i think you mentioned,
there is no way to tell them apart at query time.



: Date: Mon, 14 Aug 2006 10:53:36 -0400 (EDT)
: From: Robert Watkins <rwatkins@foo-bar.org>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: stemmed search and exact match on "same" field
:
: I've been puzzling this one for a while now, and can't figure it out.
: The idea is to allow stemmed searches and exact matches (tokenized, but
: unstemmed phrase searches) on the same field. The subject of this email
: had "same" in quotes, because it's from the search-client perspective
: that the same field is being searched, whereas the implementation may be
: different.
:
: I have actually implemented a solution whereby content that will require
: searching with both stemmed and unstemmed queries is put into two
: separate fields, one named (e.g.) "field" and the other
: "UNSTEMMED_field". What this requires, however, is a custom query parser
: that can pick out the phrase portions of a query (arbitrarily complex)
: and shunt them to the "UNSTEMMED_" version of the required field (with
: checks that they exist, etc.), the rest of the query being applied to
: the stemmed version of the field.
:
: What I would like, however, is to be able to allow the search client to
: use QueryParser, but I can't see how that's possible, given that a mixed
: query (including term and phrase portions) can be passed to the parser
: and only one Analyzer can be applied.
:
: Assuming that the search client, in building a query, can pull out the
: phrase portions of a more complex query, and apply a different (i.e.
: non-stemming analyzer) to those portions, the question of the field would
: remain: unless I use the separate field method outlined above, the
: field in the index is going to have stemmed tokens, unstemmed tokens
: or both. The latter I tried as an experiment, which seemed interesting,
: but turned out to be a brick wall. There may be a way through the wall,
: but I can't see it.
:
: Using the very useful "The quick brown fox jumps over the lazy dogs.",
: I created an index of one document and indexed the content in a single
: field.  The text is passed through the StandardTokenizer, StandardFilter
: and LowerCaseFilter.  Then, a custom filter creates a stemmed version of
: each Token (using SpellFilter and PorterStemFilter), and adds that at
: the same position as the unstemmed Token, with a token type of STEMMED;
: I also played with the start and end offsets. The result is:
:
: 1: [the:0->3:<ALPHANUM>] [the:0->0:STEMMED]
: 2: [quick:4->9:<ALPHANUM>] [quick:0->0:STEMMED]
: 3: [brown:10->15:<ALPHANUM>] [brown:0->0:STEMMED]
: 4: [fox:16->19:<ALPHANUM>] [fox:0->0:STEMMED]
: 5: [jumps:20->25:<ALPHANUM>] [jump:0->0:STEMMED]
: 6: [over:26->30:<ALPHANUM>] [over:0->0:STEMMED]
: 7: [the:31->34:<ALPHANUM>] [the:0->0:STEMMED]
: 8: [lazy:35->39:<ALPHANUM>] [lazi:0->0:STEMMED]
: 9: [dogs:40->44:<ALPHANUM>] [dog:0->0:STEMMED]
:
: But as I poke around it appears that there's no way for me to use this
: information from the index when searching (or using something like
: a HitCollector) to either restrict a search only to the tokens in the
: first position (i.e. the unstemmed ones) or to ignore the tokens of type
: STEMMED. Or am I missing something obvious? (I am also concerned that
: this will skew the scoring.)
:
: While I would like the queries
:
:      +fox +dog
:      "jumps over the lazy dogs"
:
: to match, the following should not match:
:
:      "jump over the lazy dog"
:
: because in my world, the quotes demand an exact match.
:
: Any ideas would be appreciated.
: -- Robert
:
:
: ---------------------------------------------------------------------
: To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
: For additional commands, e-mail: java-user-help@lucene.apache.org
:



-Hoss



