lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Rahul D Thakare" <rahul_thak...@rediffmail.com>
Subject Re: Re: Wild card and multiple keyword search
Date Wed, 13 Jul 2005 14:40:53 GMT
Hi Erik,

  Thanks for the reply.

  Here are the answers of your queries 

 que: - The question back to you is do you want searches for simply "MAIN" to  
find both "MAIN LOGIC" and "MAIN PARTS"?  Or should it return no  
documents since its not an exact match?

Ans: It should return no documents since it is not a exact match.

I would like to explain my point 2 in below mail again by giveing an example here

 eg- If there are two records with keyword ie "Main Logic" and "Main Board" respectively.

what we want is 
1. if the user serch on Main*, both the record should return.
2. If the user search on "Main Logic", only one record should be return.
3. If the user searches for Main, no record should be returned


 point 2 above is working now because we are using Field.Keyword (which doesn't tokenise the
spaces) and KeywordAnalyser.

 point 1 (wild card search) is not working because KeywordAnalyser does not recognise wild
card

would appreciate more information on this.

thanks in advance.

rahul

On Wed, 13 Jul 2005 Erik Hatcher wrote :
>
>On Jul 13, 2005, at 8:18 AM, Rahul D Thakare wrote:
>>  We are using doc.add(Field.Text("keywords",keywords)); to add the  keywords to the
document, where keywords is comma separated  keywords string.
>
>If the text is already comma separated and that is the level at which  you things tokenized,
then simply do something like this (untested  pseudo-code):
>
>     String[] values = keywords.split(",");
>     for (int i=0; i < values.length; i++)
>         doc.add(Field.Keyword("keywords", values[i]));
>
>>Lucene seems to tokenize the keywords with multiple words like(MAIN  BOARD) as different
keywords(ie as MAIN and BOARD). Tokenization is  based on comma and space...So if we search
for "MAIN BOARD",  documents having keywords like "MAIN LOGIC", "MAIN PARTS", etc also  show
up
>>
>>If one searches for "MAIN BOARD", we want get only the documents  have "MAIN BOARD".
 How to do this ?
>
>The question back to you is do you want searches for simply "MAIN" to  find both "MAIN
LOGIC" and "MAIN PARTS"?  Or should it return no  documents since its not an exact match?
>
>Using the above code, "MAIN" would find neither of those and the  query would have to
be exact.  I see below you've clarified this  requirement...
>
>>To achieve this we used doc.add(Field.Keyword("keywords",  keywords)); and while searching
>>we cannot use standard analyzer, while searching, as divides the  keywords if we search
keywords having space... so we wrote an  KeywordAnalyser(KeywordAnalyzer is basically returns
only one  single token) as given below.
>
>There is a KeywordAnalyzer now in the contrib/analyzers codebase, and  it will ship with
the next version of Lucene (or you could build it  yourself and use it).  There is also a
couple of variants of the  KeywordAnalyzer in the Lucene in Action code (www.lucenebook.com).
>
>>Which solve the above said problem, but we are not able to the wild  card searchs
like MAIN*, etc.
>>
>>We need both the functionality ie.
>>1.  if user searches for MAIN BOARD, should get only documents that  contain MAIN
BOARD and not MAIN LOGIC, MAIN, MAIN PART etc.
>>2. User should be able to do the wild card search like MAIN*, etc  and get the desired
documents.
>>
>>Please let us know, how we should do the indexing ? and which  analyzer to use to
do the search ?
>
>There are many ways to go about this sort of thing, and I apologize  for being short on
time and not able to explain them all fully.  One  option is to keep the tokenization using
a traditional analyzer so  that it separates by whitespace, but when a user queries it turns
 into a PhraseQuery.  If you really mean for wildcards to be single  words in the field (in
other words, users don't need to query on MA*)  then the space separated tokenization would
work fine here as well.
>
>It is important to think through the analysis process as well as the  search interface
issues (the interface must be given thorough  consideration and treated as a first class citizen
when discussing  implementations), especially when wildcard and range queries come  up.  It
has been a hot topic recently on how to deal with wildcards  and ranges efficiently.  In your
example, if by "MAIN*" you intend  for the word MAIN to be a unique token and the user would
choose a  full word to search upon and merely wants to find it within a larger  field then
wildcards are not necessary.
>
>     Erik
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>For additional commands, e-mail: java-user-help@lucene.apache.org
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message