lucene-java-user mailing list archives

From Philip Brown <...@us.ibm.com>
Subject Re: Phrase search using quotes -- special Tokenizer
Date Sat, 02 Sep 2006 18:31:18 GMT

I tend to agree with Mark.  I tried a query like so...

   // look up the whole phrase as a single, un-analyzed term
   TermQuery query = new TermQuery(new Term("keywordField", "phrase test"));
   IndexSearcher searcher = new IndexSearcher(activeIdx);
   Hits hits = searcher.search(query);

And this produced the expected results.  When building the index, I did NOT
enclose the keywords in quotes -- just added them as UN_TOKENIZED.
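
For reference, the indexing side looks roughly like this -- a minimal
sketch, where the field name, values, and writer setup are mine (untested):

   // index the whole phrase as one un-analyzed term; the analyzer is
   // irrelevant for UN_TOKENIZED fields, but the writer still wants one
   Document doc = new Document();
   doc.add(new Field("keywordField", "phrase test",
                     Field.Store.YES, Field.Index.UN_TOKENIZED));
   IndexWriter writer = new IndexWriter(activeIdx, new StandardAnalyzer(), true);
   writer.addDocument(doc);
   writer.close();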

Philip


Mark Miller-5 wrote:
> 
> I think if he wants to use the queryparser to parse his search strings 
> that he has no choice but to modify it. It will eat any pair of quotes 
> going through it no matter what analyzer is used.
> 
> - Mark
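
For what it's worth, Mark's point is easy to demonstrate -- a minimal
sketch against the 2.0 QueryParser (field name made up):

   QueryParser parser = new QueryParser("keywordField", new StandardAnalyzer());
   Query parsed = parser.parse("\"phrase test\"");
   // parsed is a PhraseQuery over the analyzed tokens "phrase" and "test",
   // not a single term -- the quote characters never reach the query terms

That's why the TermQuery above works where the parsed query didn't.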
>> Well, you're flying blind. Is the behavior rooted in the indexing or
>> querying? Since you can't answer that, you're reduced to trying random
>> things hoping that one of them works. A little like voodoo. I've wasted
>> faaaaarrrrrr too much time trying to solve what I was *sure* was the 
>> problem
>> only to find it was somewhere else (the last place I look, of course) 
>> <G>...
>>
>> Using Luke on a RAMDir? No, I don't know how to, but it should be a
>> simple thing to write the index to an FSDir at the same time you create
>> your RAMDir and use Luke then. This is debugging, after all.
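
Something like this should do for the FSDir trick -- a rough sketch from
memory against the 1.9/2.0 store API (unverified; the path and the ramDir
variable are made up):

   // mirror the finished in-memory index to disk so Luke can open it
   Directory onDisk = FSDirectory.getDirectory("/tmp/debug-index", true);
   Directory.copy(ramDir, onDisk, false);   // false = don't close ramDir
   onDisk.close();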
>>
>> I'd be really, really, really reluctant to modify the query parser and/or
>> the tokenizer, since whenever I've been tempted it's usually because I
>> don't understand the tools already provided. Then I have to maintain my
>> custom code. Which sucks. Although it sure feels more productive to hack
>> a bunch of code and get something that works 90% of the time, then spend
>> weeks making the other 10% work, than to take two days to find the 3
>> lines you *really* need <G>.
>>
>> Have you thought of a PatternAnalyzer? It takes a regular expression as
>> the tokenizer. From the Javadoc:
>>
>> <<< Efficient Lucene analyzer/tokenizer that preferably operates on a
>> String rather than a Reader, that can flexibly separate text into terms
>> via a regular expression Pattern (with behaviour identical to
>> String.split(String)), and that combines the functionality of
>> LetterTokenizer, LowerCaseTokenizer, WhitespaceTokenizer, and StopFilter
>> into a single efficient multi-purpose class. >>>
>>
>> One word of caution: the regular expression consists of expressions that
>> *break* tokens, not expressions that *form* words, which threw me at
>> first. Just like the doc says, it works like String.split <G>.... This is
>> in 2.0, although I *believe* it's also in the contrib section of 1.9 (or
>> in the regular API, I forget).
>>
>> Best
>> Erick
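
If I'm reading the contrib Javadoc right, using it would look something
like this (a sketch only -- PatternAnalyzer lives in
org.apache.lucene.index.memory in contrib, and remember the pattern
matches the *separators*, String.split style):

   // split on runs of whitespace; everything between them becomes a token
   Pattern separators = Pattern.compile("\\s+");
   Analyzer analyzer = new PatternAnalyzer(separators, true, null);  // lowercase, no stop set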
>>
>> On 9/1/06, Philip Brown <pmb@us.ibm.com> wrote:
>>>
>>>
>>> No, I've never used Luke.  Is there an easy way to examine my 
>>> RAMDirectory
>>> index?  I can create the index with no quoted keywords, and when I 
>>> search
>>> for a keyword, I get back the expected results (just can't search for a
>>> phrase that has whitespace in it).  If I create the index with 
>>> phrases in
>>> quotes, then when I search for anything in double quotes, I get back
>>> nothing.  If I create the index with everything in quotes, then when I
>>> search for anything by the keyword field, I get nothing, regardless of
>>> whether I use quotes in the query string or not.  (I can get results 
>>> back
>>> by
>>> searching on other fields.)  What do you think?
>>>
>>> Philip
>>>
>>>
>>> Erick Erickson wrote:
>>> >
>>> > OK, I've gotta ask. Have you examined your index with Luke to see if
>>> > what you *think* is in the index actually *is*???
>>> >
>>> > Erick
>>> >
>>> > On 9/1/06, Philip Brown <pmb@us.ibm.com> wrote:
>>> >>
>>> >>
>>> >> Interesting...just ran a test where I put double quotes around
>>> >> everything (including single keywords) in the source text and then
>>> >> ran searches for a known keyword with and without double quotes --
>>> >> doesn't find it either time.
>>> >>
>>> >>
>>> >> Mark Miller-5 wrote:
>>> >> >
>>> >> > Sorry to hear you're having trouble. You indeed need the double
>>> >> > quotes in the source text. You will also need them in the query
>>> >> > string. Make sure they are in both places. My machine is hosed right
>>> >> > now or I would do it for you real quick. My guess is that I forgot
>>> >> > to mention...not only do you need to add the <QUOTED> definition to
>>> >> > the TOKEN section, but below that you will find the grammar...you
>>> >> > need to add <QUOTED> to the grammar. If you look at how <NUM> and
>>> >> > <APOSTROPHE> are done you will probably see what you should do. If
>>> >> > not, my machine should be back up tomorrow...
>>> >> >
>>> >> > - Mark
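
For anyone following along, the second half of Mark's change would look
roughly like this in StandardTokenizer.jj -- a sketch from memory, so the
exact list of alternatives may not match the 2.0 source:

   ( token = <ALPHANUM> |
     token = <QUOTED> |       // new: a quoted string comes back as one token
     token = <APOSTROPHE> |
     token = <ACRONYM> |
     // ...the remaining existing alternatives, unchanged...
   )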
>>> >> >
>>> >> > On 9/1/06, Philip Brown <pmb@us.ibm.com> wrote:
>>> >> >>
>>> >> >>
>>> >> >> Well, I tried that, and it doesn't seem to work still.  I would be
>>> >> >> happy to zip up the new files, so you can see what I'm using --
>>> >> >> maybe you can get it to work.  The first time, I tried building
>>> >> >> the documents without quotes surrounding each phrase.  Then, I
>>> >> >> retried by enclosing every phrase within double quotes.  Neither
>>> >> >> seemed to work.  When constructing the query string for the
>>> >> >> search, I always added the double quotes (otherwise, it'd think it
>>> >> >> was multiple terms).  (I didn't even test the underscore and
>>> >> >> hyphenated terms.)  I thought Lucene was (sort of by default) set
>>> >> >> up to search quoted phrases.  From
>>> >> >> http://lucene.apache.org/java/docs/api/index.html --> A Phrase is
>>> >> >> a group of words surrounded by double quotes such as "hello
>>> >> >> dolly".  So, this should be easy, right?  I must be missing
>>> >> >> something stupid.
>>> >> >>
>>> >> >> Thanks,
>>> >> >>
>>> >> >> Philip
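
Looking back, a quick probe would have confirmed Mark's point about the
parser eating the quotes -- hypothetical, field name made up:

   // if this matches while the parsed quoted query does not, the index
   // holds the literal quote characters and the parser is stripping them
   Query probe = new TermQuery(new Term("keywordField", "\"phrase test\""));
   Hits hits = searcher.search(probe);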
>>> >> >>
>>> >> >>
>>> >> >> Mark Miller-5 wrote:
>>> >> >> >
>>> >> >> > So this will recognize anything in quotes as a single token, and
>>> >> >> > '_' and '-' will not break up words. There may be some
>>> >> >> > repercussions for the NUM token but nothing I'd worry about.
>>> >> >> > Maybe you want to use Unicode for '-' and '_' as well...I
>>> >> >> > wouldn't worry about it myself.
>>> >> >> >
>>> >> >> > - Mark
>>> >> >> >
>>> >> >> >
>>> >> >> > TOKEN : {                      // token patterns
>>> >> >> >
>>> >> >> >   // basic word: a sequence of digits & letters
>>> >> >> >   <ALPHANUM: (<LETTER>|<DIGIT>|<KOREAN>)+ >
>>> >> >> >
>>> >> >> >   // anything between double quotes: one token, quotes included
>>> >> >> > | <QUOTED: "\"" (~["\""])+ "\"">
>>> >> >> >
>>> >> >> >   // internal apostrophes: O'Reilly, you're, O'Reilly's
>>> >> >> >   // use a post-filter to remove possessives
>>> >> >> > | <APOSTROPHE: <ALPHA> ("'" <ALPHA>)+ >
>>> >> >> >
>>> >> >> >   // acronyms: U.S.A., I.B.M., etc.
>>> >> >> >   // use a post-filter to remove dots
>>> >> >> > | <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >
>>> >> >> >
>>> >> >> >   // company names like AT&T and Excite@Home.
>>> >> >> > | <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >
>>> >> >> >
>>> >> >> >   // email addresses
>>> >> >> > | <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@"
>>> >> >> >           <ALPHANUM> (("."|"-") <ALPHANUM>)+ >
>>> >> >> >
>>> >> >> >   // hostname
>>> >> >> > | <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
>>> >> >> >
>>> >> >> >   // floating point, serial, model numbers, ip addresses, etc.
>>> >> >> >   // every other segment must have at least one digit
>>> >> >> > | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
>>> >> >> >        | <HAS_DIGIT> <P> <ALPHANUM>
>>> >> >> >        | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>>> >> >> >        | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>>> >> >> >        | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>>> >> >> >        | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>>> >> >> >         )
>>> >> >> >   >
>>> >> >> > | <#P: ("_"|"-"|"/"|"."|",") >
>>> >> >> > | <#HAS_DIGIT:                 // at least one digit
>>> >> >> >     (<LETTER>|<DIGIT>)*
>>> >> >> >     <DIGIT>
>>> >> >> >     (<LETTER>|<DIGIT>)*
>>> >> >> >   >
>>> >> >> >
>>> >> >> > | < #ALPHA: (<LETTER>)+>
>>> >> >> > | < #LETTER:                   // unicode letters, plus '-' and '_'
>>> >> >> >       [
>>> >> >> >        "\u0041"-"\u005a",
>>> >> >> >        "\u0061"-"\u007a",
>>> >> >> >        "\u00c0"-"\u00d6",
>>> >> >> >        "\u00d8"-"\u00f6",
>>> >> >> >        "\u00f8"-"\u00ff",
>>> >> >> >        "\u0100"-"\u1fff",
>>> >> >> >        "-", "_"
>>> >> >> >       ]
>>> >> >> >   >