lucene-java-user mailing list archives

From Rahil <qamar_ra...@yahoo.co.uk>
Subject Re: Performing a like query
Date Mon, 09 Oct 2006 09:54:51 GMT
Hi Steve

Thanks for your response. I was just wondering whether there is a
difference between the regular expression you sent me, i.e.

 (i)   \s*(?:\b|(?<=\S)(?=\s)|(?<=\s)(?=\S))\s*

and

 (ii)   \b

as they appear to lead to the same output. For example, splitting the
string "testing a-new string=3/4" with either one gives:
Item is :
Item is : testing
Item is : 
Item is : a
Item is : -
Item is : new
Item is : 
Item is : string
Item is : =
Item is : 3
Item is : /
Item is : 4

What I'd like to do, though, is drop the empty and whitespace-only
items from the split, so that the output is:
Item is : testing
Item is : a
Item is : -
Item is : new
Item is : string
Item is : =
Item is : 3
Item is : /
Item is : 4

I'm not great at regular expressions, so I would really appreciate it
if you could give me some insight into how expression (i) works.
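
For reference, here is a self-contained version of my test (SplitDemo
is just a throwaway name). Skipping the blank items gives the output I
am after, but I would still like to understand why they appear at all:

-- 
import java.util.regex.Pattern;

public class SplitDemo {
    public static void main(String[] args) {
        // expression (i), written as a Java string literal
        Pattern boundary = Pattern.compile(
                "\\s*(?:\\b|(?<=\\S)(?=\\s)|(?<=\\s)(?=\\S))\\s*");

        String[] items = boundary.split("testing a-new string=3/4");
        for (int i = 0; i < items.length; i++) {
            if (items[i].trim().length() > 0)   // skip blank items
                System.out.println("Item is : " + items[i]);
        }
    }
}
-- 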

Thanks for all your help
Rahil

Steven Rowe wrote:

>Hi Rahil,
>
>Rahil wrote:
>
>>I couldn't figure out a regular expression to pass to
>>Pattern.compile(String regex) that can tokenise the string "O/E -
>>visual acuity R-eye=6/24" into "O", "/", "E", "-", "visual",
>>"acuity", "R", "-", "eye", "=", "6", "/", "24".
>>
>
>The following regular expression should match boundaries between word
>and non-word characters, or between space and non-space, in either
>order, and it consumes any contiguous whitespace:
>
>   \s*(?:\b|(?<=\S)(?=\s)|(?<=\s)(?=\S))\s*
>
>Note that with the above regex, the "(%$#!)" in "some (%$#!) text" will
>be tokenized as a single token.
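>
>For example, a quick (untested) check:
>
>   String[] items = java.util.regex.Pattern.compile(
>       "\\s*(?:\\b|(?<=\\S)(?=\\s)|(?<=\\s)(?=\\S))\\s*")
>       .split("some (%$#!) text");
>   // items: "some", "(%$#!)", "text" (plus possible empty strings),
>   // since "(%$#!)" has no word boundary or whitespace inside it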
>
>Hope it helps,
>Steve
>
>>Erick Erickson wrote:
>>
>>>Well, I'm not the greatest expert, but a quick look doesn't show me
>>>anything obvious. But I have to ask: wouldn't WhitespaceAnalyzer work
>>>for you? Although I don't remember whether WhitespaceAnalyzer
>>>lowercases or not.
>>>
>>>It sure looks like you're getting reasonable results given how you're
>>>tokenizing.
>>>
>>>If not that, you might want to think about PatternAnalyzer. It's in
>>>the contrib memory section; see
>>>org.apache.lucene.index.memory.PatternAnalyzer. One note of caution:
>>>the regex identifies what is NOT a token, rather than what is. This
>>>threw me for a bit.
>>>
>>>I still claim that you could break the tokens up like "6", "/", "12",
>>>and make SpanNearQuery work with a slop of 0 (or 1, I don't remember
>>>right now), but that may well be more trouble than it's worth; it's
>>>up to you, of course. What you get out of this is, essentially, a
>>>query that's only satisfied if the terms you specify are right next
>>>to each other. So you'd find both documents in your example, since
>>>you would have tokenized "6", "/", "12" in, say, positions 0, 1, 2 in
>>>doc1 and 4, 5, 6 in the second doc. And since the tokens are next to
>>>each other in each doc, searching with a SpanNearQuery for "6", "/",
>>>and "12" that are "right next to each other", which you specify with
>>>a slop of 0 as I remember, should get both.
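>>>
>>>Off the top of my head (untested; "TERM" being your field name), that
>>>would look something like:
>>>
>>>    // org.apache.lucene.search.spans.*, org.apache.lucene.index.Term
>>>    SpanQuery[] parts = new SpanQuery[] {
>>>        new SpanTermQuery(new Term("TERM", "6")),
>>>        new SpanTermQuery(new Term("TERM", "/")),
>>>        new SpanTermQuery(new Term("TERM", "12"))
>>>    };
>>>    // slop 0 = no gaps between the spans; true = match in order
>>>    Query q = new SpanNearQuery(parts, 0, true);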
>>>
>>>Alternatively, if you tokenize it this way, a PhraseQuery might work
>>>as well. Thus, searching for "6 / 12" (as a phrase query; note the
>>>spaces) might be just what you want. You'd have to tokenize the
>>>query, but that's relatively easy. This is probably much simpler than
>>>a SpanNearQuery, now that I think about it...
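>>>
>>>A rough, untested sketch of that:
>>>
>>>    PhraseQuery pq = new PhraseQuery();   // slop defaults to 0
>>>    pq.add(new Term("TERM", "6"));
>>>    pq.add(new Term("TERM", "/"));
>>>    pq.add(new Term("TERM", "12"));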
>>>
>>>Be aware that if you use the *TermEnums we've been talking about,
>>>you'll probably wind up wrapping them in a ConstantScoreQuery. And if
>>>you have no *other* terms, you won't get any relevancy out of your
>>>search. This may be important...
>>>
>>>Anyway, that's as creative as I can be Sunday night <G>. Best of luck....
>>>
>>>Erick
>>>
>>>On 10/1/06, Rahil <qamar_rahil@yahoo.co.uk> wrote:
>>>
>>>>Hi Erick
>>>>
>>>>Thanks for your response. There's a lot to chew on in your reply,
>>>>and I'm looking at the suggestions you've made.
>>>>
>>>>Yeah, I have Luke installed and have queried my index, but I'm not
>>>>getting any great explanation out of it. A query for "6/12" is sent
>>>>as "TERM:6/12", which is quite straightforward. I did request an
>>>>explanation of the query in my code and got some more information,
>>>>but that wasn't of much help either.
>>>>-- 
>>>>Explanation explain = searcher.explain(query,0);
>>>>
>>>>OUTPUT:
>>>>query: +TERM:6/12
>>>>explain.getDescription() : weight(TERM:6/12 in 0), product of:
>>>>Detail 0 : 0.99999994 = queryWeight(TERM:6/12), product of:
>>>>  2.0986123 = idf(docFreq=1)
>>>>  0.47650534 = queryNorm
>>>>
>>>>Detail 1 : 0.0 = fieldWeight(TERM:6/12 in 0), product of:
>>>>  0.0 = tf(termFreq(TERM:6/12)=0)
>>>>  2.0986123 = idf(docFreq=1)
>>>>  0.5 = fieldNorm(field=TERM, doc=0)
>>>>
>>>>Number of results returned: 1
>>>>SampleLucene.displayIndexResults
>>>>SCORE    DESCRIPTIONSTATUS    CONCEPTID    TERM
>>>>1.0    0    260278007    6/12 (finding)
>>>>-- 
>>>>
>>>>My tokeniser, called BaseAnalyzer, extends Analyzer. Since I wanted
>>>>to retain all non-whitespace characters and not just letters and
>>>>digits, I introduced the following block of code in the overridden
>>>>tokenStream():
>>>>
>>>>-- 
>>>>public TokenStream tokenStream(String fieldName, Reader reader) {
>>>>
>>>>    return new CharTokenizer(reader) {
>>>>
>>>>        protected char normalize(char c) {
>>>>            return Character.toLowerCase(c);
>>>>        }
>>>>
>>>>        protected boolean isTokenChar(char c) {
>>>>            boolean space  = Character.isWhitespace(c);
>>>>            boolean letDig = Character.isLetterOrDigit(c);
>>>>
>>>>            boolean type;
>>>>            if (letDig && !space)        // letter or digit, not whitespace
>>>>                type = true;
>>>>            else if (!letDig && !space)  // not letter, digit or whitespace
>>>>                                         // (retain non-whitespace chars)
>>>>                type = true;
>>>>            else                         // whitespace: not a token char
>>>>                type = false;
>>>>            return type;
>>>>        }
>>>>    };
>>>>}
>>>>
>>>>---
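>>>>
>>>>Incidentally, since isLetterOrDigit() and isWhitespace() never both
>>>>hold for the same char, I think those three branches collapse to a
>>>>single test:
>>>>
>>>>    protected boolean isTokenChar(char c) {
>>>>        // keep every character that is not whitespace
>>>>        return !Character.isWhitespace(c);
>>>>    }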
>>>>The problem is that when the term "6/12 (finding)" is tokenised, two
>>>>tokens are generated, viz. '6/12' and '(finding)'. Therefore when I
>>>>search for '6/12' this term is returned, since in a way it is an
>>>>EXACT token match.
>>>>
>>>>However, when the term "R-eye=6/12 (finding)" is tokenised, it again
>>>>results in two tokens, viz. 'R-eye=6/12' and '(finding)'. So now if
>>>>I look for '6/12' it's no longer an exact match, since there is no
>>>>token with this EXACT value. A simple searcher.search(query) isn't
>>>>enough to pull out the partial token match.
>>>>
>>>>I don't think it will be useful to create separate tokens for "6",
>>>>"/", "12" or "R", "-", "eye", "=", and so on. I'm having a look at
>>>>RegexTermEnum and WildcardTermEnum as they might possibly help.
>>>>
>>>>Would appreciate your comments on the BaseAnalyzer tokenizer and the
>>>>query explanation I've received so far.
>>>>
>>>>Thanks
>>>>Rahil
>>>>
>>>>Erick Erickson wrote:
>>>>
>>>>>Most often, from what I've seen on this e-mail list, unexpected
>>>>>results are because you're not indexing on the tokens you *think*
>>>>>you're indexing. Or not searching on them. By that I mean that the
>>>>>analyzers you're using are behaving in ways you don't expect.
>>>>>
>>>>>That said, I think you're getting exactly what you should. I
>>>>>suspect you're indexing tokens as follows:
>>>>>doc1: "6/12" and "(finding)"
>>>>>doc2: "R-eye=6/12" and "(finding)"
>>>>>
>>>>>So it makes perfect sense that searching on 6/12 returns doc1 and
>>>>>searching on R-eye=6/12 returns doc2.
>>>>>
>>>>>So, first question: have you actually used something like Luke
>>>>>(google "luke lucene") to examine your index and see whether what
>>>>>you've put in there is what you expect? What analyzer is your
>>>>>custom analyzer built upon, and is it doing anything you're unaware
>>>>>of (for instance, lower-casing the 'R' in your second example)?
>>>>>
>>>>>Here's what I'd do:
>>>>>1> get Luke and see what's actually in your index.
>>>>>2> use searcher.explain (?) to see the query you're actually
>>>>>emitting.
>>>>>3> if you make no headway, post the smallest code snippets that
>>>>>illustrate the problem. Folks would need the indexing AND searching
>>>>>code.
>>>>
>>>>>As far as queries like "contains" in Java... well, sure. Write a
>>>>>filter that filters on regular expressions or wildcards (you'll
>>>>>need WildcardTermEnum and RegexTermEnum). Or index things
>>>>>differently: e.g. index "6/12" and "finding" on doc1, and "r",
>>>>>"eye", "6/12" and "finding" on doc2. Now your searches for "6/12"
>>>>>will work. Or index "6", "/", "12" and "finding" on doc1, index
>>>>>similarly for doc2, and use a SpanNearQuery with an appropriate
>>>>>slop value. Or...
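>>>>>
>>>>>For the wildcard route, an untested sketch (a leading '*' is legal
>>>>>in a programmatic WildcardQuery, though it can be slow on a big
>>>>>index):
>>>>>
>>>>>    // matches any token in TERM that contains "6/12"
>>>>>    Query q = new WildcardQuery(new Term("TERM", "*6/12*"));
>>>>>    Hits hits = searcher.search(q);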
>>>>>
>>>>>This is all gobbledygook if you haven't got a copy of "Lucene in
>>>>>Action", which you should read in order to get the most out of
>>>>>Lucene. It's for the 1.4 code base, but the 2.0 Lucene code base
>>>>>isn't that much different. More importantly, it ties lots of stuff
>>>>>together. Also, the junit tests that come along with the Lucene
>>>>>code can be invaluable in showing you how to do something.
>>>>>
>>>>>Hope this helps
>>>>>Erick
>>>>>
>>>>>On 10/1/06, Rahil <qamar_rahil@yahoo.co.uk> wrote:
>>>>>
>>>>>>Hi
>>>>>>
>>>>>>I have a custom-built Analyzer that retains all non-whitespace
>>>>>>characters as tokens in the field "TERM" (which is the only field
>>>>>>being tokenised).
>>>>>>
>>>>>>If I now query my index file for a term "6/12", for instance, I
>>>>>>get back only ONE result
>>>>>>
>>>>>>SCORE    DESCRIPTIONSTATUS    CONCEPTID    TERM
>>>>>>1.0    0    260278007    6/12 (finding)
>>>>>>
>>>>>>instead of TWO. There is another entry in the index file of the
>>>>>>form
>>>>>>
>>>>>>2561280012    0    163939000    R-eye=6/12 (finding)    0    3    en
>>>>>>
>>>>>>At first it wasn't quite obvious to me why this was happening.
>>>>>>But after playing around a bit I realised that if I pass the query
>>>>>>"R-eye=6/12" instead, I will get this second result (but not the
>>>>>>first one now). Is there a way to tweak the Query query =
>>>>>>parser.parse(searchString) method so that I can get both records
>>>>>>when I query for "6/12"? Something like a 'contains' query in Java.
>>>>>>
>>>>>>Will appreciate all help. Thanks a lot
>>>>>>
>>>>>>Regards
>>>>>>Rahil
