lucene-java-user mailing list archives

From Rahil <>
Subject Re: Performing a like query
Date Sun, 01 Oct 2006 22:48:04 GMT
Hi Erick

Thanks for your response. There's a lot to chew on in your reply, and I'm 
looking at the suggestions you've made.

Yes, I have Luke installed and have queried my index, but it hasn't given 
me much insight. A query for "6/12" is sent as "TERM:6/12", which is 
quite straightforward. I also ran an explanation of the query in my code 
and got some more information, but that wasn't of much help either:

Explanation explain = searcher.explain(query, 0);

query: +TERM:6/12
explain.getDescription() : weight(TERM:6/12 in 0), product of:
Detail 0 : 0.99999994 = queryWeight(TERM:6/12), product of:
  2.0986123 = idf(docFreq=1)
  0.47650534 = queryNorm

Detail 1 : 0.0 = fieldWeight(TERM:6/12 in 0), product of:
  0.0 = tf(termFreq(TERM:6/12)=0)
  2.0986123 = idf(docFreq=1)
  0.5 = fieldNorm(field=TERM, doc=0)

Number of results returned: 1
1.0    0    260278007    6/12 (finding)

My tokeniser, BaseAnalyzer, extends Analyzer. Since I wanted to retain 
all non-whitespace characters, not just letters and digits, I overrode 
tokenStream( ) as follows:

public TokenStream tokenStream(String fieldName, Reader reader) {
        return new CharTokenizer(reader) {

            // lower-case every character as it is read
            protected char normalize(char c) {
                return Character.toLowerCase(c);
            }

            // a character belongs to a token as long as it is not
            // whitespace; letters, digits and punctuation are all kept,
            // so tokens are split on whitespace only
            protected boolean isTokenChar(char c) {
                return !Character.isWhitespace(c);
            }
        };
}
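Since whitespace is the only split point, the tokenisation can be sketched in plain Java. This is purely illustrative: the class name TokenizerSketch, the method tokenize and the use of String.split are mine, standing in for the real character-by-character CharTokenizer above.

```java
import java.util.Arrays;
import java.util.List;

public class TokenizerSketch {
    // Mimics the analyzer: split on whitespace only, lower-casing each token
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("6/12 (finding)"));
        // prints [6/12, (finding)] -- '6/12' is a whole token
        System.out.println(tokenize("R-eye=6/12 (finding)"));
        // prints [r-eye=6/12, (finding)] -- no standalone '6/12' token
    }
}
```

Running this on the two terms shows exactly the token mismatch described below: the first term yields a standalone '6/12' token, the second does not.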

The problem is that when the term "6/12 (finding)" is tokenised, two 
tokens are generated, viz. '6/12' and '(finding)'. Therefore when I 
search for '6/12' this term is returned, since in a way it is an EXACT 
token.

However, when the term "R-eye=6/12 (finding)" is tokenised it again 
results in two tokens, viz. 'R-eye=6/12' and '(finding)'. So now if I 
look for '6/12' it is no longer an exact match, since there is no token 
with this EXACT value. A simple term query isn't useful to pull out the 
partial token match.

I don't think it would be useful to create separate tokens for "6", "/", 
"12" or "R", "-", "eye", "=", and so on. I'm having a look at 
RegexTermEnum and WildcardTermEnum, as they might possibly help.
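For illustration, here is a plain-Java sketch of what a wildcard-style 'contains' match over the indexed tokens would do. The class and method names (ContainsMatchSketch, matchingTerms) are mine, not Lucene APIs, and a simple substring check stands in for the actual wildcard pattern matching:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ContainsMatchSketch {
    // Enumerate the indexed tokens and keep those containing the search
    // string -- roughly what a term enumerator over the pattern "*6/12*"
    // would produce.
    static List<String> matchingTerms(List<String> indexedTokens, String search) {
        List<String> matches = new ArrayList<String>();
        for (String token : indexedTokens) {
            if (token.contains(search)) {
                matches.add(token);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("6/12", "(finding)", "r-eye=6/12");
        System.out.println(matchingTerms(tokens, "6/12"));
        // prints [6/12, r-eye=6/12] -- both documents would be matched
    }
}
```

In Lucene itself this would presumably correspond to something like a WildcardQuery on the pattern *6/12* (the QueryParser rejects a leading wildcard, though such a query can be built programmatically); note that it has to enumerate every term in the field, so it can be slow on a large index.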

Would appreciate your comments on the BaseAnalyzer tokeniser and the 
query explanation I've received so far.


Erick Erickson wrote:

> Most often, from what I've seen on this e-mail list, unexpected results 
> are because you're not indexing on the tokens you *think* you're 
> indexing. Or not searching on them. By that I mean that the analyzers 
> you're using are behaving in ways you don't expect.
>
> That said, I think you're getting exactly what you should. I suspect 
> you're indexing tokens as follows:
> doc1: "6/12" and "(finding)"
> doc2: "R-eye=6/12" and "(finding)"
> So it makes perfect sense that searching on 6/12 returns doc1 and 
> searching on R-eye=6/12 returns doc2.
>
> So, first question: have you actually used something like Luke (google 
> luke lucene) to examine your index and see if what you've put in there 
> is what you expect? What analyzer is your custom analyzer built upon, 
> and is it doing anything you're unaware of (for instance, lower-casing 
> the 'R' in your second example)?
>
> Here's what I'd do:
> 1> get Luke and see what's actually in your index.
> 2> use searcher.explain (?) to see the query you're actually emitting.
> 3> if you make no headway, post the smallest code snippets you can that 
> illustrate the problem. Folks would need the indexing AND searching code.
>
> As far as queries like "contains" in Java.... Well, sure. Write a 
> filter that filters on regular expressions or wildcards (you'll need 
> WildcardTermEnum and RegexTermEnum). Or index things differently (e.g. 
> index "6/12" and "finding" on doc1, and "r", "eye", "6/12" and "finding" 
> on doc2). Now your searches for "6/12" will work. Or index "6", "/", 
> "12" and "finding" on doc1, index similarly for doc2, and use a 
> SpanNearQuery with an appropriate span value. Or....
>
> This is all gobbledygook if you haven't gotten a copy of "Lucene in 
> Action", which you should read in order to get the most out of Lucene. 
> It's for the 1.4 code base, but the 2.0 Lucene code base isn't that much 
> different. More importantly, it ties lots of stuff together. Also, the 
> junit tests that come along with the Lucene code can be invaluable to 
> show you how to do something.
>
> Hope this helps
> Erick
> On 10/1/06, Rahil <> wrote:
>> Hi
>>
>> I have a custom-built Analyzer which tokenises all non-whitespace 
>> characters in the field "TERM" (the only field being tokenised).
>>
>> If I now query my index file for the term "6/12", for instance, I get 
>> back only ONE result:
>> 1.0    0    260278007    6/12 (finding)
>> instead of TWO. There is another token in the index file of the form
>> 2561280012    0    163939000    R-eye=6/12 (finding)    0    3    en
>>
>> At first it wasn't quite obvious to me why this was happening. But 
>> after playing around a bit I realised that if I pass the query 
>> "R-eye=6/12" instead, I get this second result (but not the first one 
>> now). Is there a way to tweak the Query query = parser.parse(searchString) 
>> method so that I can get both records if I query for "6/12"? Something 
>> like a 'contains' query in Java.
>>
>> Will appreciate all help. Thanks a lot.
>>
>> Regards
>> Rahil

To unsubscribe, e-mail:
For additional commands, e-mail:
