lucene-java-user mailing list archives

From "Zhang, Lisheng" <Lisheng.Zh...@broadvision.com>
Subject RE: Phrase Query Problem
Date Tue, 18 Dec 2007 18:39:46 GMT
Hi,

1) Whenever we switch to a different analyzer, we need to reindex the
   whole dataset, whether or not the new one is WhitespaceAnalyzer.
2) Using WhitespaceAnalyzer may increase disk usage and slow down
   indexing because more tokens are indexed (no stop words are
   dropped); by how much, I do not know.
3) WhitespaceAnalyzer also preserves case: if the input text has
   "Health", the query "health" may not return the doc, so make sure
   this is what you need. This analyzer also keeps all symbols, such
   as commas and periods. For example, if the text has "Number ONE
   issue is health safety!", the query "health safety" will not return
   the doc, because "safety!" is indexed as a token, not "safety".
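
Point 3 can be illustrated without Lucene at all. The sketch below is a
plain-Java stand-in for whitespace-only tokenization, not the Lucene API:
the class WhitespaceTokenDemo and its methods tokenize and hasTerm are
hypothetical names used only for this illustration. It assumes that a term
matches only when it is identical to an indexed token.

```java
import java.util.Arrays;
import java.util.List;

class WhitespaceTokenDemo {
    // Plain-Java stand-in (an assumption, not Lucene code): split on runs of
    // whitespace and keep case and punctuation exactly as they appear.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // A term matches only if it is identical to one of the indexed tokens.
    static boolean hasTerm(List<String> tokens, String term) {
        return tokens.contains(term);
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("Number ONE issue is health safety!");
        System.out.println(tokens);                     // [Number, ONE, issue, is, health, safety!]
        System.out.println(hasTerm(tokens, "safety"));  // false: the token is "safety!"
        System.out.println(hasTerm(tokens, "safety!")); // true
        System.out.println(hasTerm(tokenize("Health"), "health")); // false: case is kept
    }
}
```

If this behavior is not what the application needs, the text would have to
be normalized (lowercased, punctuation stripped) before indexing and before
building the query terms.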

I feel the most important thing is to pin down the exact query
requirements first, and then pick the analyzer.

Best regards, Lisheng

-----Original Message-----
From: Sirish Vadala [mailto:vsirishreddy@yahoo.co.in]
Sent: Tuesday, December 18, 2007 10:26 AM
To: java-user@lucene.apache.org
Subject: RE: Phrase Query Problem



OK, thanks... I will implement this using the WhitespaceAnalyzer... Let me
check the indexing speed, i.e. the time taken to index my data set... If that
takes too long, then I will probably look into implementing a custom
analyzer...


Zhang, Lisheng wrote:
> 
> Hi,
> 
> In case you do not want to toss away any stop words and even want to
> preserve case, WhitespaceAnalyzer can be used. Using
> WhitespaceTokenizer would also serve as a test (but you need to
> reindex the whole data set first) to make sure there are no other
> problems.
> 
> Best regards, Lisheng
> 
> 
> 
> -----Original Message-----
> From: mark harwood [mailto:markharw00d@yahoo.co.uk]
> Sent: Tuesday, December 18, 2007 9:42 AM
> To: java-user@lucene.apache.org
> Subject: Re: Phrase Query Problem
> 
> 
> You could write a custom analyzer that drops stop words but adds an extra 1
> to the "positionIncrement" property of the next valid Token after each
> omitted stop word.
> 
> This would retain the benefit of removing stop words from your index and
> yet prevent your example phrases from matching (because the remaining words
> are not recorded as being directly next to each other).
> 
> Cheers
> Mark
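
The position-increment idea above can be sketched in plain Java. This is a
simulation under stated assumptions, not Lucene code: the class
PositionGapDemo, its three-word stop list, and the methods analyze and
phraseMatches are all hypothetical. Each dropped stop word adds 1 to the
increment of the next kept token, and a slop-0 phrase matches only if its
terms sit at consecutive positions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class PositionGapDemo {
    // Illustrative stop list only (an assumption for this sketch).
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("and", "or", "in"));

    // Maps position -> term. A dropped stop word leaves a hole in the positions.
    static Map<Integer, String> analyze(String text) {
        Map<Integer, String> positions = new TreeMap<>();
        int position = -1;
        int increment = 1;
        for (String word : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (STOP_WORDS.contains(word)) {
                increment++;            // remember the gap for the next kept token
            } else {
                position += increment;
                positions.put(position, word);
                increment = 1;
            }
        }
        return positions;
    }

    // Slop-0 phrase match: term i of the phrase must sit at position start + i.
    static boolean phraseMatches(Map<Integer, String> positions, List<String> phrase) {
        for (int start : positions.keySet()) {
            boolean all = true;
            for (int i = 0; i < phrase.size(); i++) {
                if (!phrase.get(i).equals(positions.get(start + i))) {
                    all = false;
                    break;
                }
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> phrase = Arrays.asList("health", "safety");
        System.out.println(analyze("Health and Safety"));                        // {0=health, 2=safety}
        System.out.println(phraseMatches(analyze("Health and Safety"), phrase)); // false
        System.out.println(phraseMatches(analyze("Health Safety"), phrase));     // true
    }
}
```

With the gap recorded, "Health and Safety" no longer satisfies the exact
phrase "health safety", while the stop word itself still stays out of the
index.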
> 
> 
> ----- Original Message ----
> From: Sirish Vadala <vsirishreddy@yahoo.co.in>
> To: java-user@lucene.apache.org
> Sent: Tuesday, 18 December, 2007 5:10:19 PM
> Subject: RE: Phrase Query Problem
> 
> 
> Yes... If my query phrase is "Health Safety", docs with "Health and Safety"
> and "Health or Safety" are being returned...
> 
> So... is there any other way to handle this situation? Especially in the
> above-mentioned case, the user is expecting around 5 records and the query
> is fetching more than 550 records. 8-O
> 
> Thanks.
> 
> 
> Zhang, Lisheng wrote:
>> 
>> Hi,
>> 
>> Do you mean that your query phrase is "Health Safety",
>> but docs with "Health and Safety" returned?
>> 
>> If that is the case, the reason is that StandardAnalyzer
>> filters out "and" (also "or", "in" and others) as stop
>> words during indexing, and the QueryParser filters those
>> words out also.
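
A plain-Java simulation of this effect, as a rough sketch rather than Lucene
code: the class StopWordPhraseDemo, its three-word stop list, and the methods
analyze and phraseMatches are hypothetical. It assumes stop words are dropped
with no record of the gap they leave, so the surviving tokens end up directly
next to each other.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordPhraseDemo {
    // Illustrative stop list only (an assumption); a real analyzer's list is longer.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("and", "or", "in"));

    // Lowercase, split on whitespace, drop stop words. Survivors are stored
    // back to back, so the gap left by a dropped word is not recorded.
    static List<String> analyze(String text) {
        List<String> kept = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!STOP_WORDS.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    // Slop-0 phrase match: the phrase terms must appear consecutively.
    static boolean phraseMatches(List<String> docTokens, List<String> phrase) {
        return Collections.indexOfSubList(docTokens, phrase) >= 0;
    }

    public static void main(String[] args) {
        List<String> phrase = Arrays.asList("health", "safety");
        System.out.println(analyze("Health and Safety"));                        // [health, safety]
        System.out.println(phraseMatches(analyze("Health and Safety"), phrase)); // true
        System.out.println(phraseMatches(analyze("Health or Safety"), phrase));  // true
    }
}
```

Because "and" and "or" vanish from the token stream, both documents look
identical to "Health Safety" at query time, even with a slop of zero.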
>> 
>> Best regards, Lisheng
>> 
>> -----Original Message-----
>> From: Sirish Vadala [mailto:vsirishreddy@yahoo.co.in]
>> Sent: Monday, December 17, 2007 9:49 AM
>> To: java-user@lucene.apache.org
>> Subject: Phrase Query Problem
>> 
>> 
>> 
>> I have the following code for search:
>> 
>> BooleanQuery bQuery = new BooleanQuery();
>> Query queryAuthor;
>> queryAuthor = new TermQuery(new Term(IFIELD_LEAD_AUTHOR,
>> author.trim().toLowerCase()));
>> bQuery.add(queryAuthor, BooleanClause.Occur.MUST);
>> ....................................................................
>> ....................................................................
>> 
>> PhraseQuery pQuery = new PhraseQuery();
>> String[] phrase = txtWithPhrase.toLowerCase().split(" ");
>> for (int i = 0; i < phrase.length; i++) {
>>     pQuery.add(new Term(IFIELD_TEXT, phrase[i]));
>> }
>> pQuery.setSlop(0);
>> bQuery.add(pQuery, BooleanClause.Occur.MUST);
>> ....................................................................
>> ....................................................................
>> 
>> String[] sortOrder = {IFIELD_LEAD_AUTHOR, IFIELD_TEXT};
>> Sort sort = new Sort(sortOrder);
>> hits = indexSearcher.search(bQuery, sort);
>> 
>> Now my problem is this: if I search on a phrase with the text Health
>> Safety, it fetches all the records in which the text is Health
>> and/or/in Safety. It fetches these records even after I set the
>> slop of the phrase query to zero for an exact match. I am using the
>> standard analyzer while indexing my records.
>> 
>> Any help on this is greatly appreciated. 
>> 
>> Sirish Vadala
>> -- 
>> View this message in context:
>> http://www.nabble.com/Phrase-Query-Problem-tp14373945p14373945.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> 
> 
> -- 
> View this message in context:
>  http://www.nabble.com/Phrase-Query-Problem-tp14373945p14401354.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> 
> 

-- 
View this message in context:
http://www.nabble.com/Phrase-Query-Problem-tp14373945p14402820.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

