lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mark Miller <markrmil...@gmail.com>
Subject Re: Phrase search using quotes -- special Tokenizer
Date Mon, 04 Sep 2006 13:10:53 GMT
More to consider:
perhaps there is some way to get what you want by overriding 
getFieldQuery(String, String) instead. I have not been able to come up 
with anything simple off the top of my head, but overriding 
getFieldQuery would free you from having to make that line change on 
every Lucene update. Perhaps you could scan the string in getFieldQuery 
and if you find a space, skip the analyzing and return a term query, 
putting the quotes back on -- good old boy might come in and you pump 
out "good old boy" as a term query. If there is no space in the query 
part then analyze like normal.

- Mark
Philip Brown wrote:
> Yeah, they are more complex than the "exactish" match -- basically, there are
> more fields involved -- combined sometimes with AND and sometimes with OR,
> and sometimes negated field values, sometimes groupings, etc.  These other
> field values are all single words (no spaces), and a search might involve a
> wildcard on them.  Hope that helps.
>
> Thanks.
>
>
> Chris Hostetter wrote:
>   
>> : Thanks for your input.  I'm sure I could do as you suggest (and maybe
>> that
>> : will end up being my best option), but I had hoped to use a string for
>> : creating the query object, particularly as some of my queries are a bit
>> : complex.
>>
>> you have to clarify what you mean by "use a string for creating the query
>> object" ... there's nothing in what i suggested that implies you can't do
>> that, that's exactly what i'm suggesting you do...
>>
>>    String input = ...;
>>    Analyzer a = new YourCustomAnalyzer();
>>    // because you know your analyzer allways produces exactly one token...
>>    Token t = a.tokenStream("yourField", new StringReader(input)).next();
>>    Query yourQuery = new TermQuery("yourField", t.termText());
>>
>> ...if your queries are more complex then just the "exactish" matching you
>> described before, then that's a seperate issue -- what you described
>> didn't sound like it required any special input processing -- you said you
>> had a "string" and you wanted to find exact matches on that string (with
>> some normalization) ... but that you didn't want your input split on
>> whitespace, or hyphens, or any of the "special" characters QueryParser
>> uses.
>>
>> If you want other things then that certainly makes things more
>> complicated, but the basic idea is still the same ... so what exactly do
>> you mean when you say it's more complicated?
>>
>>
>> : > I haven't really been following this thread, but it's gotten so long
>> : > i got interested.
>> : >
>> : > from whta i can tell skimming the discussion so far, it seems like the
>> : > biggest confusion is about the definition of a "phrase" and what
>> analyzers
>> : > do with "quote" characters and what the QueryParser does with "quote"
>> : > charcters -- when ultimately you don't seem to really care about
>> "phrases"
>> : > in a textual searching sense; nor do you seem to care about any of the
>> : > "features" of the QueryParser.
>> : >
>> : > it seems that what you care about is:
>> : >
>> : >  1) making documents, and adding a list of "text chunks" to those
>> : >     documents (what you've been calling phrases)
>> : >  2) you then want to be able to search for "almost-exact" matches on
>> those
>> : >     "text chunks" ... these matches should be "exactish" because you
>> don't
>> : >     want partial matches based on white spaces, or splitting on
>> hyphens,
>> : >     but they shouldn't be truely exact because you want some simple
>> : >     normalization...
>> : >
>> : > : actually would like to "normalize" a phrase (spaces) or a hyphenated
>> : > word or
>> : > : an underscored word to the same value -- e.g. MS-WORD or ms_WORd or
>> "MS
>> : > : Word" --> ms_word.
>> : >
>> : > ...in which case, you should:
>> : >  a) write yourself an analyzer which does no "tokenizing" (ie: each
>> input
>> : >     Field value generates a single token) but does the normalization
>> you
>> : >     want.
>> : >  b) use this Analyzer when you add the fields to your documents, even
>> : >     though you don't want *real* tokenization, add make the field type
>> : >     TOKENIZED so your analyzer gets used.
>> : >  c) when you get some text input to serach on, pass it to the same
>> : >     Analyzer, take the Token you get back and manualy construct a
>> : >     TermQuery out of it for the neccessary field.
>> : >
>> : > ...that's it.  that's all she wrote -- don't even look in
>> QueryParser's
>> : > general direction, at all.
>> : >
>> : >
>> : >
>> : > -Hoss
>> : >
>> : >
>> : > ---------------------------------------------------------------------
>> : > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> : > For additional commands, e-mail: java-user-help@lucene.apache.org
>> : >
>> : >
>> : >
>> :
>> : --
>> : View this message in context:
>> http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6128827
>> : Sent from the Lucene - Java Users forum at Nabble.com.
>> :
>> :
>> : ---------------------------------------------------------------------
>> : To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> : For additional commands, e-mail: java-user-help@lucene.apache.org
>> :
>>
>>
>>
>> -Hoss
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
>>     
>
>   

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message