lucene-java-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: RE: Re:RE: Does the string "Cla$$War" affect Lucene?
Date Wed, 15 Aug 2012 01:54:48 GMT
WhitespaceTokenizer breaks only at spaces, tabs, and newlines, so it will
leave Cla$$War as one token. If you want Cla$$War to become one word under a
tokenizer that splits on punctuation, use a CharFilter to filter out all $
before tokenization.
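
To make the difference described above concrete, here is a minimal sketch in
plain Java. It does not use Lucene at all: the class TokenizeSketch and its
three methods are hypothetical names made up for illustration, and the
punctuation-splitting method is only a rough stand-in for what an analyzer
such as StandardAnalyzer does, not its actual algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; none of these names are Lucene APIs.
class TokenizeSketch {

    // Break only at runs of whitespace (spaces, tabs, newlines).
    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Rough approximation of a punctuation-splitting analyzer:
    // break at anything that is not a letter or digit, then lowercase.
    static List<String> punctuationSplitTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Stand-in for a character filter that removes every '$'
    // before the text reaches the tokenizer.
    static String stripDollar(String text) {
        return text.replace("$", "");
    }

    public static void main(String[] args) {
        String title = "Cla$$War";
        System.out.println(whitespaceTokens(title));                    // [Cla$$War]
        System.out.println(punctuationSplitTokens(title));              // [cla, war]
        System.out.println(punctuationSplitTokens(stripDollar(title))); // [clawar]
    }
}
```

Whichever route you pick, the same processing has to be applied at index time
and at query time, otherwise the query tokens will not match what is in the
index.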

Otherwise, Lucene has debug features to show you exactly how these are
broken up. The easiest way to explore them is to install Solr and use
the 'Analysis' page.

On Tue, Aug 14, 2012 at 6:37 PM, zhoucheng2008 <zhoucheng2008@gmail.com> wrote:
> I appreciate your input. However, my question is which analyzer and tokenizer to choose.
>
>
> ------------------ Original ------------------
> From:  "Uwe Schindler"<uwe@thetaphi.de>;
> Date:  Wed, Aug 15, 2012 00:52 AM
> To:  "java-user"<java-user@lucene.apache.org>;
>
> Subject:  RE: Re:RE: Does the string "Cla$$War" affect Lucene?
>
>
>
> Please read my answer posted before; it explains exactly what happens, so
> you can imagine what type of search input produces this. If you want to
> change the behavior, rethink your tokenization.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
>> -----Original Message-----
>> From: zhoucheng2008 [mailto:zhoucheng2008@gmail.com]
>> Sent: Tuesday, August 14, 2012 6:46 PM
>> To: java-user
>> Subject: Re: Re:RE: Does the string "Cla$$War" affect Lucene?
>>
>> Another phrase "$FREE.99" causes the same problem.
>>
>>
>> What are the ultimate solutions? How many cases would cause this problem?
>>
>>
>> Thanks
>>
>>
>>
>>
>> ------------------ Original ------------------
>> From:  "dyzc2010  "<1393975679@qq.com>;
>> Date:  Tue, Aug 14, 2012 11:27 PM
>> To:  "java-user"<java-user@lucene.apache.org>;
>>
>> Subject:  Re: Re:RE: Does the string "Cla$$War" affect Lucene?
>>
>>
>>
>> I know the reason for the lack of hits.
>>
>>
>> Without configuring autoGeneratePhraseQueries, a term like "I love you" is
>> split into "I", "love", and "you", and therefore gets quite a lot of hits.
>>
>>
>> With it configured, by contrast, the term is not split, and there are no hits.
>>
>>
>>
>>
>> ------------------ Original ------------------
>> From:  "Jack Krupansky"<jack@basetechnology.com>;
>> Date:  Tue, Aug 14, 2012 11:01 PM
>> To:  "java-user"<java-user@lucene.apache.org>;
>>
>> Subject:  Re: Re:RE: Does the string "Cla$$War" affect Lucene?
>>
>>
>>
>> Try enclosing "Cla$$War" in quotes, which should have the same effect as
>> turning on auto-phrase query generation.
>>
>> qp.parse("\"Cla$$War\"")
>>
>> (You only need to use "escape" for characters which are query syntax
>> characters.)
>>
>> And do a q.toString to see how the term was analyzed.
>>
>> I'm surprised that you got no hits with autoGeneratePhraseQueries - which
>> suggests that maybe the index didn't use the same analyzer or maybe the
>> literal text in the title is not exactly what you think it is.
>>
>> You could use the WhitespaceAnalyzer, but that would leave leading and
>> trailing punctuation.
>>
>> -- Jack Krupansky
>>
>> -----Original Message-----
>> From: zhoucheng2008
>> Sent: Tuesday, August 14, 2012 10:42 AM
>> To: java-user
>> Subject: Re:RE: Does the string "Cla$$War" affect Lucene?
>>
>> Sounds like some other analyzer could do the trick?
>>
>>
>> Anyway, I don't want a slower Lucene, and I want to treat "Cla$$War" as a
>> whole word.
>>
>>
>> What solutions are left?
>>
>>
>> Thanks.
>>
>>
>>
>>
>> ------------------ Original ------------------
>> From:  "Uwe Schindler"<uwe@thetaphi.de>;
>> Date:  Tue, Aug 14, 2012 04:56 PM
>> To:  "java-user"<java-user@lucene.apache.org>;
>>
>> Subject:  RE: Does the string "Cla$$War" affect Lucene?
>>
>>
>>
>> Hi,
>>
>> If you are using StandardAnalyzer, then "Cla$$War" is split at the $ signs,
>> so it searches for two tokens, "cla" and "war". If autogenerate phrase
>> queries is enabled for QueryParser, it will then create a phrase query
>> "cla war" out of it, which is slower because positions are involved. If
>> autogenerate phrases is not enabled, Lucene still has to search for 2
>> terms, so it might get slower if "cla" or "war" hit many documents. Whether
>> it is enabled depends on the matchVersion parameter passed to the constructor:
>> http://lucene.apache.org/core/3_6_1/api/core/org/apache/lucene/queryParser/QueryParser.html
>>
>> Uwe
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: uwe@thetaphi.de
>>
>>
>> > -----Original Message-----
>> > From: Ian Lea [mailto:ian.lea@gmail.com]
>> > Sent: Tuesday, August 14, 2012 10:39 AM
>> > To: java-user@lucene.apache.org
>> > Subject: Re: Does the string "Cla$$War" affect Lucene?
>> >
>> > Sounds extremely unlikely.  What is the query?  What analyzer? What
>> > version of Lucene?  What about other strings containing $$?
>> >
>> >
>> > --
>> > Ian.
>> >
>> >
>> > On Tue, Aug 14, 2012 at 9:13 AM, zhoucheng2008
>> > <zhoucheng2008@gmail.com> wrote:
>> > > Hi,
>> > >
>> > >
>> > > I have a big index, and when I searched it with a title string
>> > > "Cla$$War", Lucene became very slow. It doesn't happen when I search
>> > > with other title strings such as "Gone with Wind". Does the "$$" affect
>> > > the search performance?
>> > >
>> > >
>> > > Thanks,
>> > > Cheng
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: java-user-help@lucene.apache.org



-- 
Lance Norskog
goksron@gmail.com
