lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mark Miller <markrmil...@gmail.com>
Subject Re: Phrase search using quotes -- special Tokenizer
Date Sat, 02 Sep 2006 02:46:50 GMT
I am out of ideas. If I'm feeling perky I'll build you one in the morning.
> No, I've never used Luke.  Is there an easy way to examine my RAMDirectory
> index?  I can create the index with no quoted keywords, and when I search
> for a keyword, I get back the expected results (just can't search for a
> phrase that has whitespace in it).  If I create the index with phrases in
> quotes, then when I search for anything in double quotes, I get back
> nothing.  If I create the index with everything in quotes, then when I
> search for anything by the keyword field, I get nothing, regardless of
> whether I use quotes in the query string or not.  (I can get results back by
> searching on other fields.)  What do you think?
>
> Philip
>
>
> Erick Erickson wrote:
>   
>> OK, I've gotta ask. Have you examined your index with Luke to see if what
>> you *think* is in the index actually *is*???
>>
>> Erick
>>
>> On 9/1/06, Philip Brown <pmb@us.ibm.com> wrote:
>>     
>>> Interesting...just ran a test where I put double quotes around everything
>>> (including single keywords) of source text and then ran searches for a
>>> known
>>> keyword with and without double quotes -- doesn't find either time.
>>>
>>>
>>> Mark Miller-5 wrote:
>>>       
>>>> Sorry to hear you're having trouble. You indeed need the double quotes
>>>>         
>>> in
>>>       
>>>> the source text. You will also need them in the query string. Make sure
>>>> they
>>>> are in both places. My machine is hosed right now or I would do it for
>>>>         
>>> you
>>>       
>>>> real quick. My guess is that I forgot to mention...no only do you need
>>>>         
>>> to
>>>       
>>>> add the <QUOTED> definiton to the TOKEN section, but below that you
>>>>         
>>> will
>>>       
>>>> find the grammer...you need to add <QUOTED> to the grammer. If you
look
>>>> how
>>>> <NUM> and <APOSTROPHE> are done you will prob see what you should
do.
>>>>         
>>> If
>>>       
>>>> not, my machine should be back up tomarrow...
>>>>
>>>> - Mark
>>>>
>>>> On 9/1/06, Philip Brown <pmb@us.ibm.com> wrote:
>>>>         
>>>>> Well, I tried that, and it doesn't seem to work still.  I would be
>>>>>           
>>> happy
>>>       
>>>>> to
>>>>> zip up the new files, so you can see what I'm using -- maybe you can
>>>>>           
>>> get
>>>       
>>>>> it
>>>>> to work.  The first time, I tried building the documents without
>>>>>           
>>> quotes
>>>       
>>>>> surrounding each phrase.  Then, I retried by enclosing every phrase
>>>>> within
>>>>> double quotes.  Neither seemed to work.  When constructing the query
>>>>> string
>>>>> for the search, I always added the double quotes (otherwise, it'd
>>>>>           
>>> think
>>>       
>>>>> it
>>>>> was multiple terms).  (I didn't even test the underscore and
>>>>>           
>>> hyphenated
>>>       
>>>>> terms.)  I thought Lucene was (sort of by default) set up to search
>>>>> quoted
>>>>> phrases.  From http://lucene.apache.org/java/docs/api/index.html -->
A
>>>>> Phrase is a group of words surrounded by double quotes such as "hello
>>>>> dolly".  So, this should be easy, right?  I must be missing something
>>>>> stupid.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Philip
>>>>>
>>>>>
>>>>> Mark Miller-5 wrote:
>>>>>           
>>>>>> So this will recognize anything in quotes as a single token and '_'
>>>>>>             
>>> and
>>>       
>>>>>> '-' will not break up words. There may be some repercussions for
the
>>>>>>             
>>>>> NUM
>>>>>           
>>>>>> token but nothing I'd worry about. maybe you want to use Unicode
for
>>>>>>             
>>>>> '-'
>>>>>           
>>>>>> and '_' as well...I wouldn't worry about it myself.
>>>>>>
>>>>>> - Mark
>>>>>>
>>>>>>
>>>>>> TOKEN : {                      // token patterns
>>>>>>
>>>>>>   // basic word: a sequence of digits & letters
>>>>>>   <ALPHANUM: (<LETTER>|<DIGIT>|<KOREAN>)+ >
>>>>>>
>>>>>> | <QUOTED:     "\"" (~["\""])+ "\"">
>>>>>>
>>>>>>   // internal apostrophes: O'Reilly, you're, O'Reilly's
>>>>>>   // use a post-filter to remove possesives
>>>>>> | <APOSTROPHE: <ALPHA> ("'" <ALPHA>)+ >
>>>>>>
>>>>>>   // acronyms: U.S.A., I.B.M., etc.
>>>>>>   // use a post-filter to remove dots
>>>>>> | <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >
>>>>>>
>>>>>>   // company names like AT&T and Excite@Home.
>>>>>> | <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >
>>>>>>
>>>>>>   // email addresses
>>>>>> | <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@"
<ALPHANUM>
>>>>>> (("."|"-") <ALPHANUM>)+ >
>>>>>>
>>>>>>   // hostname
>>>>>> | <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
>>>>>>
>>>>>>   // floating point, serial, model numbers, ip addresses, etc.
>>>>>>   // every other segment must have at least one digit
>>>>>> | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
>>>>>>        | <HAS_DIGIT> <P> <ALPHANUM>
>>>>>>        | <ALPHANUM> (<P> <HAS_DIGIT> <P>
<ALPHANUM>)+
>>>>>>        | <HAS_DIGIT> (<P> <ALPHANUM> <P>
<HAS_DIGIT>)+
>>>>>>        | <ALPHANUM> <P> <HAS_DIGIT> (<P>
<ALPHANUM> <P>
>>>>>>             
>>> <HAS_DIGIT>)+
>>>       
>>>>>>        | <HAS_DIGIT> <P> <ALPHANUM> (<P>
<HAS_DIGIT> <P>
>>>>>>             
>>> <ALPHANUM>)+
>>>       
>>>>>>         )
>>>>>>   >
>>>>>> | <#P: ("_"|"-"|"/"|"."|",") >
>>>>>> | <#HAS_DIGIT:                      // at least one digit
>>>>>>     (<LETTER>|<DIGIT>)*
>>>>>>     <DIGIT>
>>>>>>     (<LETTER>|<DIGIT>)*
>>>>>>   >
>>>>>>
>>>>>> | < #ALPHA: (<LETTER>)+>
>>>>>> | < #LETTER:                      // unicode letters
>>>>>>       [
>>>>>>        "\u0041"-"\u005a",
>>>>>>        "\u0061"-"\u007a",
>>>>>>        "\u00c0"-"\u00d6",
>>>>>>        "\u00d8"-"\u00f6",
>>>>>>        "\u00f8"-"\u00ff",
>>>>>>        "\u0100"-"\u1fff",
>>>>>>        "-", "_"
>>>>>>       ]
>>>>>>   >
>>>>>>
>>>>>>
>>>>>>             
>>> ---------------------------------------------------------------------
>>>       
>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>> --
>>>>> View this message in context:
>>>>>
>>>>>           
>>> http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6106920
>>>       
>>>>> Sent from the Lucene - Java Users forum at Nabble.com.
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>>>>>
>>>>>           
>>>>         
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6107649
>>> Sent from the Lucene - Java Users forum at Nabble.com.
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>       
>>     
>
>   


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message