lucene-java-user mailing list archives

From Rahil <qamar_ra...@yahoo.co.uk>
Subject Re: Performing a like query
Date Fri, 06 Oct 2006 14:05:41 GMT
Hi Erick

I'm having trouble writing a good regular expression for the 
PatternAnalyzer to deal with both word and non-word characters. I couldn't 
figure out a regular expression for Pattern.compile(String regex) that can 
tokenise the string "O/E - visual acuity R-eye=6/24" into 
"O", "/", "E", "-", "visual", "acuity", 
"R", "-", "eye", "=", "6", "/", "24". I've given it quite a few shots but 
am now totally frustrated with it. I can either tokenise at \W+ or \w+ 
but not both.
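
For what it's worth, the closest I've got is matching the tokens themselves 
with plain java.util.regex (a run of word characters, or a single other 
non-space character). That gives exactly the split I want, but it describes 
the tokens rather than the separators, so I don't see how to turn it into 
the delimiter-style regex that PatternAnalyzer expects:

--
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenTest {
    public static void main(String[] args) {
        // A token is either a run of word characters or a single
        // non-word, non-space character ("/", "-", "=", ...).
        Pattern p = Pattern.compile("\\w+|[^\\w\\s]");
        Matcher m = p.matcher("O/E - visual acuity R-eye=6/24");
        while (m.find()) {
            System.out.println(m.group());
        }
        // prints (one per line): O / E - visual acuity R - eye = 6 / 24
    }
}
--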

Could you please help?

Thanks a lot. Much appreciated.

Regards
Rahil

Erick Erickson wrote:

> Well, I'm not the greatest expert, but a quick look doesn't show me
> anything obvious. But I have to ask, wouldn't WhitespaceAnalyzer work for
> you? Although I don't remember whether WhitespaceAnalyzer lowercases or not.
>
> It sure looks like you're getting reasonable results given how you're
> tokenizing.
>
> If not that, you might want to think about PatternAnalyzer. It's in the
> memory contrib module; see org.apache.lucene.index.memory.PatternAnalyzer.
> One note of caution: the regex identifies what is NOT a token, rather than
> what is. This threw me for a bit.
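>
> Roughly like this, I think -- the constructor arguments are from memory,
> so check the contrib javadocs for your version before trusting them:
>
> import java.util.regex.Pattern;
> import org.apache.lucene.index.memory.PatternAnalyzer;
>
> // The pattern describes the SEPARATORS, not the tokens: this one splits
> // on runs of whitespace and lowercases the result.
> // (Signature from memory -- verify; null should mean "no stop words".)
> PatternAnalyzer analyzer =
>     new PatternAnalyzer(Pattern.compile("\\s+"), true, null);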
>
> I still claim that you could break the tokens up like "6", "/", "12", and
> make SpanNearQuery work with a slop of 0 (or 1, I don't remember right now),
> but that may well be more trouble than it's worth; it's up to you of course.
> What you get out of this, essentially, is a query that's only satisfied if
> the terms you specify are right next to each other. So you'd find both
> documents in your example, since you would have tokenized "6", "/", "12"
> in, say, positions 0, 1, 2 in doc1 and 4, 5, 6 in the second doc. Since
> those tokens are adjacent in each doc, searching with a SpanNearQuery for
> "6", "/", and "12" that are "right next to each other" (which you specify
> with a slop of 0, as I remember) should get you both.
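>
> In code, something like this sketch (untested here, but these classes live
> in the core search.spans package):
>
> import org.apache.lucene.index.Term;
> import org.apache.lucene.search.spans.SpanNearQuery;
> import org.apache.lucene.search.spans.SpanQuery;
> import org.apache.lucene.search.spans.SpanTermQuery;
>
> // "6", "/", "12" must occur adjacent and in order: slop 0, inOrder true.
> SpanQuery[] clauses = new SpanQuery[] {
>     new SpanTermQuery(new Term("TERM", "6")),
>     new SpanTermQuery(new Term("TERM", "/")),
>     new SpanTermQuery(new Term("TERM", "12"))
> };
> SpanNearQuery near = new SpanNearQuery(clauses, 0, true);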
>
> Alternatively, if you tokenize it this way, a PhraseQuery might work as
> well. Thus, searching for "6 / 12" (as a phrase query; note the spaces)
> might be just what you want. You'd have to tokenize the query, but that's
> relatively easy. This is probably much simpler than a SpanNearQuery now
> that I think about it.....
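>
> Again just a sketch, assuming the field is called TERM as in your output:
>
> import org.apache.lucene.index.Term;
> import org.apache.lucene.search.PhraseQuery;
>
> // Matches the exact adjacent token sequence 6, /, 12 in the TERM field.
> PhraseQuery phrase = new PhraseQuery();
> phrase.add(new Term("TERM", "6"));
> phrase.add(new Term("TERM", "/"));
> phrase.add(new Term("TERM", "12"));
> phrase.setSlop(0);   // 0 is the default: adjacent, in order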
>
> Be aware that if you use the *TermEnums we've been talking about, you'll
> probably wind up wrapping them in a ConstantScoreQuery. And if you have no
> *other* terms, you won't get any relevancy out of your search. This may be
> important.....
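>
> One way to sketch that wrapping (here using a WildcardQuery inside a
> QueryFilter rather than writing your own *TermEnum-based filter; the field
> name and pattern are just placeholders, and the leading * is slow):
>
> import org.apache.lucene.index.Term;
> import org.apache.lucene.search.ConstantScoreQuery;
> import org.apache.lucene.search.QueryFilter;
> import org.apache.lucene.search.WildcardQuery;
>
> // All matching docs get the same constant score -- no tf/idf relevancy.
> ConstantScoreQuery contains = new ConstantScoreQuery(
>     new QueryFilter(new WildcardQuery(new Term("TERM", "*6/12*"))));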
>
> Anyway, that's as creative as I can be Sunday night <G>. Best of luck....
>
> Erick
>
> On 10/1/06, Rahil <qamar_rahil@yahoo.co.uk> wrote:
>
>>
>> Hi Erick
>>
>> Thanks for your response. There's a lot to chew on in your reply and I'm
>> looking at the suggestions you've made.
>>
>> Yeah, I have Luke installed and have queried my index, but there isn't any
>> great explanation I'm getting out of it. A query for "6/12" is sent as
>> "TERM:6/12", which is quite straightforward. I did an explanation of the
>> query in my code and got some more information, but that wasn't of much
>> help either.
>> -- 
>> Explanation explain = searcher.explain(query,0);
>>
>> OUTPUT:
>> query: +TERM:6/12
>> explain.getDescription() : weight(TERM:6/12 in 0), product of:
>> Detail 0 : 0.99999994 = queryWeight(TERM:6/12), product of:
>>   2.0986123 = idf(docFreq=1)
>>   0.47650534 = queryNorm
>>
>> Detail 1 : 0.0 = fieldWeight(TERM:6/12 in 0), product of:
>>   0.0 = tf(termFreq(TERM:6/12)=0)
>>   2.0986123 = idf(docFreq=1)
>>   0.5 = fieldNorm(field=TERM, doc=0)
>>
>> Number of results returned: 1
>> SampleLucene.displayIndexResults
>> SCORE    DESCRIPTIONSTATUS    CONCEPTID    TERM
>> 1.0    0    260278007    6/12 (finding)
>> -- 
>>
>> My tokeniser, called BaseAnalyzer, extends Analyzer. Since I wanted to
>> retain all non-whitespace characters and not just letters and digits, I
>> introduced the following block of code in the overridden tokenStream():
>> -- 
>> public TokenStream tokenStream(String fieldName, Reader reader) {
>>
>>     return new CharTokenizer(reader) {
>>
>>         protected char normalize(char c) {
>>             return Character.toLowerCase(c);
>>         }
>>
>>         protected boolean isTokenChar(char c) {
>>             boolean space  = Character.isWhitespace(c);
>>             boolean letDig = Character.isLetterOrDigit(c);
>>
>>             if (letDig && !space)           // letter or digit
>>                 return true;
>>             else if (!letDig && !space)     // other non-whitespace (punctuation etc.) -- retained
>>                 return true;
>>             else                            // whitespace -- token boundary
>>                 return false;
>>         }
>>     };
>> }
>>
>> ---
>> The problem is that when the term "6/12 (finding)" is tokenised, two
>> tokens are generated, viz. '6/12' and '(finding)'. Therefore when I
>> search for '6/12' this term is returned, since it is an EXACT token
>> match.
>>
>> However, when the term "R-eye=6/12 (finding)" is tokenised it again
>> results in two tokens, viz. 'R-eye=6/12' and '(finding)'. So now if I
>> look for '6/12' it's no longer an exact match, since there is no token
>> with this EXACT value. A simple searcher.search(query) isn't enough to
>> pull out the partial token match.
>>
>> I think it won't be useful to create separate tokens for "6", "/", "12"
>> or "R", "-", "eye", "=", and so on. I'm having a look at RegexTermEnum
>> and WildcardTermEnum as they might possibly help.
>>
>> Would appreciate your comments on the BaseAnalyzer tokenizer and the query
>> explanation I've received so far.
>>
>> Thanks
>> Rahil
>>
>> Erick Erickson wrote:
>>
>> > Most often, from what I've seen on this e-mail list, unexpected results
>> > are because you're not indexing on the tokens you *think* you're
>> > indexing. Or not searching on them. By that I mean that the analyzers
>> > you're using are behaving in ways you don't expect.
>> >
>> > That said, I think you're getting exactly what you should. I suspect
>> > you're indexing tokens as follows:
>> > doc1: "6/12" and "(finding)"
>> > doc2: "R-eye=6/12" and "(finding)"
>> >
>> > So it makes perfect sense that searching on 6/12 returns doc1 and
>> > searching on R-eye=6/12 returns doc2.
>> >
>> > So, first question: have you actually used something like Luke (google
>> > "luke lucene") to examine your index and see whether what you've put in
>> > there is what you expect? What analyzer is your custom analyzer built
>> > upon, and is it doing anything you're unaware of (for instance,
>> > lower-casing the 'R' in your second example)?
>> >
>> > Here's what I'd do.
>> > 1> Get Luke and see what's actually in your index.
>> > 2> Use searcher.explain (?) to see the query you're actually emitting.
>> > 3> If you make no headway, post the smallest code snippets you can that
>> > illustrate the problem. Folks would need the indexing AND searching code.
>> >
>> > As far as queries like "contains" in Java.... Well, sure. Write a filter
>> > that filters on regular expressions or wildcards (you'll need
>> > WildcardTermEnum and RegexTermEnum). Or index things differently (e.g.
>> > index "6/12" and "finding" on doc1, and "r", "eye", "6/12" and "finding"
>> > on doc2); now your searches for "6/12" will work. Or index "6", "/", "12"
>> > and "finding" on doc1, index similarly for doc2, and use a SpanNearQuery
>> > with an appropriate span value. Or....
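>> >
>> > By "write a filter" I mean something roughly like this sketch (the field
>> > name and wildcard are just placeholders, and a leading * means it walks
>> > every term in the index, so it won't be fast):
>> >
>> > import java.io.IOException;
>> > import java.util.BitSet;
>> > import org.apache.lucene.index.IndexReader;
>> > import org.apache.lucene.index.Term;
>> > import org.apache.lucene.index.TermDocs;
>> > import org.apache.lucene.search.Filter;
>> > import org.apache.lucene.search.WildcardTermEnum;
>> >
>> > // Sets a bit for every doc whose TERM field has a token matching *6/12*.
>> > public class ContainsFilter extends Filter {
>> >     public BitSet bits(IndexReader reader) throws IOException {
>> >         BitSet bits = new BitSet(reader.maxDoc());
>> >         WildcardTermEnum terms =
>> >             new WildcardTermEnum(reader, new Term("TERM", "*6/12*"));
>> >         TermDocs docs = reader.termDocs();
>> >         try {
>> >             do {
>> >                 Term t = terms.term();
>> >                 if (t == null) break;
>> >                 docs.seek(t);              // all docs containing this term
>> >                 while (docs.next()) {
>> >                     bits.set(docs.doc());
>> >                 }
>> >             } while (terms.next());
>> >         } finally {
>> >             terms.close();
>> >             docs.close();
>> >         }
>> >         return bits;
>> >     }
>> > }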
>> >
>> > This is all gobbledygook if you haven't gotten a copy of "Lucene In
>> > Action",
>> > which you should read in order to get the most out of Lucene. It's for
>> > the
>> > 1.4 code base, but the 2.0 Lucene code base isn't that much different.
>> > More
>> > importantly, it ties lots of stuff together. Also, the junit tests
>> > that come
>> > along with the Lucene code can be invaluable to show you how to do
>> > something.
>> >
>> > Hope this helps
>> > Erick
>> >
>> > On 10/1/06, Rahil <qamar_rahil@yahoo.co.uk> wrote:
>> >
>> >>
>> >> Hi
>> >>
>> >> I have a custom-built Analyzer where I tokenize on all non-whitespace
>> >> characters (not just letters and digits) in the field "TERM" (which is
>> >> the only field being tokenised).
>> >>
>> >> If I now query my index file for a term "6/12" for instance, I get back
>> >> only ONE result
>> >>
>> >> SCORE    DESCRIPTIONSTATUS    CONCEPTID    TERM
>> >> 1.0    0    260278007    6/12 (finding)
>> >>
>> >> instead of TWO. There is another token in the index file of the form
>> >>
>> >> 2561280012    0    163939000    R-eye=6/12 (finding)    0    3    en
>> >>
>> >> At first it wasn't quite obvious to me why this was happening. But after
>> >> playing around a bit I realised that if I pass the query "R-eye=6/12"
>> >> instead, I get this second result (but not the first one now). Is there
>> >> a way to tweak the Query query = parser.parse(searchString) call so that
>> >> I can get both records if I query for "6/12"? Something like a
>> >> 'contains' query in Java.
>> >>
>> >> Will appreciate all help. Thanks a lot
>> >>
>> >> Regards
>> >> Rahil
>> >>
>> >>
>> >>
>> >>
>> >
>>
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

