lucene-java-user mailing list archives

From Rahil <qamar_ra...@yahoo.co.uk>
Subject Re: Performing a like query
Date Tue, 10 Oct 2006 11:06:03 GMT
Hi Erick

I think you've made some really important observations. Steven has 
provided a good regular expression to help with word and non-word 
characters. For the moment I have reverted to my analyser and am going 
to try doing some clever pattern matching later. I'll also try using a 
different analyser for indexing and querying to see whether that makes 
a difference.
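
In case it helps anyone searching the archives later: a pattern along 
these lines (my own untested sketch, not necessarily Steven's exact 
expression) splits runs of word characters from single punctuation 
characters:

--
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizeDemo {
    public static void main(String[] args) {
        // A run of word characters, or any single character that is
        // neither a word character nor whitespace.
        Pattern p = Pattern.compile("\\w+|[^\\w\\s]");
        Matcher m = p.matcher("O/E - visual acuity R-eye=6/24");
        while (m.find()) {
            System.out.println(m.group());
        }
        // Prints O, /, E, -, visual, acuity, R, -, eye, =, 6, /, 24
        // (one token per line)
    }
}
--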

I'm determined to sort this out, but time constraints mean I can't give 
it too much time! :)

Thanks again for the insight
Rahil

Erick Erickson wrote:

> My intuition is that you'll have a real problem using regular 
> expressions.
> It'll either be incredibly ugly (and unmaintainable) or just won't work
> since the regular expression tools tend to throw out the delimiters.
>
> I think you'll be much better off writing your own analyzer (see the
> synonym injector code in LIA for a model, although you probably won't
> want the gap to be 0). The analyzer will just march down the string and
> return the first part of the line up to the delimiter, or return the
> delimiter itself if that's the first thing in the line. Logically, it
> looks kinda like this:
>
> 1st call: returns O, preserving the rest (/E - visual acuity R-eye=6/24)
> 2nd call: recognizes that the line starts with a delimiter and returns
> that, preserving (E - visual acuity R-eye=6/24)
> and so on.
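>
> A very rough, untested sketch of that idea against the old TokenStream
> API (Token(text, start, end), and next() returning null at the end of
> the stream), using a regex to do the marching. Class and variable names
> are just mine; imports would be org.apache.lucene.analysis.*, java.io.*,
> java.util.regex.*:
>
> public class DelimiterAnalyzer extends Analyzer {
>     public TokenStream tokenStream(String fieldName, final Reader reader) {
>         return new TokenStream() {
>             private Matcher matcher; // built lazily on the first call
>
>             public Token next() throws IOException {
>                 if (matcher == null) {
>                     // Slurp the whole field value...
>                     StringBuffer buf = new StringBuffer();
>                     int c;
>                     while ((c = reader.read()) != -1) {
>                         buf.append((char) c);
>                     }
>                     // ...and march down it: a word run, or a single
>                     // character that is neither word nor whitespace.
>                     matcher = Pattern.compile("\\w+|[^\\w\\s]")
>                             .matcher(buf.toString());
>                 }
>                 if (!matcher.find()) {
>                     return null; // end of stream
>                 }
>                 return new Token(matcher.group().toLowerCase(),
>                         matcher.start(), matcher.end());
>             }
>         };
>     }
> }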
>
>
> The only real question is whether you really want to preserve the
> delimiters or whether that's unnecessary. Remember that if you use the
> same analyzer in both indexing and querying (with a few subtleties), you
> get the same token stream. Whether that works for you depends upon
> whether the delimiters hold information, and whether their
> presence/absence alters the meaning of the field.
>
> Not a huge help, but it's all I can come up with.
>
> Good luck
> Erick
>
> On 10/6/06, Rahil <qamar_rahil@yahoo.co.uk> wrote:
>
>>
>> Hi Erick
>>
>> I'm having trouble writing a good regular expression for the
>> PatternAnalyzer to deal with word and non-word characters. I couldn't
>> figure out a valid regular expression for
>> Pattern.compile(String regex) which can tokenise the string "O/E -
>> visual acuity R-eye=6/24" into "O", "/", "E", "-", "visual", "acuity",
>> "R", "-", "eye", "=", "6", "/", "24". I've given it quite a few shots,
>> but am now totally frustrated with it. I can tokenise at either \W+ or
>> \w+ but not both.
>>
>> Could you please help?
>>
>> Thanks a lot. Much appreciated.
>>
>> Regards
>> Rahil
>>
>> Erick Erickson wrote:
>>
>> > Well, I'm not the greatest expert, but a quick look doesn't show me
>> > anything obvious. But I have to ask: wouldn't WhitespaceAnalyzer work
>> > for you? Although I don't remember whether WhitespaceAnalyzer
>> > lowercases or not.
>> >
>> > It sure looks like you're getting reasonable results given how you're
>> > tokenizing.
>> >
>> > If not that, you might want to think about PatternAnalyzer. It's in
>> > the contrib memory section; see
>> > org.apache.lucene.index.memory.PatternAnalyzer. One note of caution:
>> > the regex identifies what is NOT a token, rather than what is. This
>> > threw me for a bit.
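>> >
>> > If memory serves, constructing it looks something like this (note
>> > that the regex matches the separators, not the tokens; treat the
>> > exact constructor signature as from-memory):
>> >
>> > Analyzer analyzer = new PatternAnalyzer(
>> >         Pattern.compile("\\s+"), // what is NOT a token
>> >         true,                    // lowercase the tokens
>> >         null);                   // no stop words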
>> >
>> > I still claim that you could break the tokens up like "6", "/", "12"
>> > and make SpanNearQuery work with a span of 0 (or 1, I don't remember
>> > right now), but that may well be more trouble than it's worth; it's
>> > up to you, of course. What you get out of this, essentially, is a
>> > query that's only satisfied if the terms you specify are right next
>> > to each other. So you'd find both your documents in your example,
>> > since you would have tokenized "6", "/", "12" in, say, positions
>> > 0, 1, 2 in doc1 and 4, 5, 6 in the second doc. Since those tokens are
>> > next to each other in each doc, searching with a SpanNearQuery for
>> > "6", "/", and "12" that are "right next to each other" (which you
>> > specify with a slop of 0, as I remember) should get you both.
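>> >
>> > Roughly (untested, typed from memory):
>> >
>> > SpanQuery[] clauses = new SpanQuery[] {
>> >     new SpanTermQuery(new Term("TERM", "6")),
>> >     new SpanTermQuery(new Term("TERM", "/")),
>> >     new SpanTermQuery(new Term("TERM", "12"))
>> > };
>> > Query query = new SpanNearQuery(clauses, 0, true); // slop 0, in order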
>> >
>> > Alternatively, if you tokenize it this way, a PhraseQuery might work
>> > as well. Thus, searching for "6 / 12" (as a phrase query, and note
>> > the spaces) might be just what you want. You'd have to tokenize the
>> > query, but that's relatively easy. This is probably much simpler than
>> > a SpanNearQuery, now that I think about it.....
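>> >
>> > The programmatic version would be along these lines (a PhraseQuery's
>> > slop defaults to 0, i.e. exact adjacency):
>> >
>> > PhraseQuery pq = new PhraseQuery();
>> > pq.add(new Term("TERM", "6"));
>> > pq.add(new Term("TERM", "/"));
>> > pq.add(new Term("TERM", "12"));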
>> >
>> > Be aware that if you use the *TermEnums we've been talking about,
>> > you'll probably wind up wrapping them in a ConstantScoreQuery. And if
>> > you have no *other* terms, you won't get any relevancy out of your
>> > search. This may be important.....
>> >
>> > Anyway, that's as creative as I can be Sunday night <G>. Best of
>> > luck....
>> >
>> > Erick
>> >
>> > On 10/1/06, Rahil <qamar_rahil@yahoo.co.uk> wrote:
>> >
>> >>
>> >> Hi Erick
>> >>
>> >> Thanks for your response. There's a lot to chew on in your reply,
>> >> and I'm looking at the suggestions you've made.
>> >>
>> >> Yes, I have Luke installed and have queried my index, but I'm not
>> >> getting any great explanation out of it. A query for "6/12" is sent
>> >> as "TERM:6/12", which is quite straightforward. I also produced an
>> >> explanation of the query in my code and got some more information,
>> >> but that wasn't of much help either.
>> >> --
>> >> Explanation explain = searcher.explain(query,0);
>> >>
>> >> OUTPUT:
>> >> query: +TERM:6/12
>> >> explain.getDescription() : weight(TERM:6/12 in 0), product of:
>> >> Detail 0 : 0.99999994 = queryWeight(TERM:6/12), product of:
>> >>   2.0986123 = idf(docFreq=1)
>> >>   0.47650534 = queryNorm
>> >>
>> >> Detail 1 : 0.0 = fieldWeight(TERM:6/12 in 0), product of:
>> >>   0.0 = tf(termFreq(TERM:6/12)=0)
>> >>   2.0986123 = idf(docFreq=1)
>> >>   0.5 = fieldNorm(field=TERM, doc=0)
>> >>
>> >> Number of results returned: 1
>> >> SampleLucene.displayIndexResults
>> >> SCORE    DESCRIPTIONSTATUS    CONCEPTID    TERM
>> >> 1.0    0    260278007    6/12 (finding)
>> >> --
>> >>
>> >> My tokeniser, called BaseAnalyzer, extends Analyzer. Since I wanted
>> >> to retain all non-whitespace characters and not just letters and
>> >> digits, I introduced the following block of code in the overridden
>> >> tokenStream():
>> >>
>> >> --
>> >> public TokenStream tokenStream(String fieldName, Reader reader) {
>> >>
>> >>     return new CharTokenizer(reader) {
>> >>
>> >>         protected char normalize(char c) {
>> >>             return Character.toLowerCase(c);
>> >>         }
>> >>
>> >>         protected boolean isTokenChar(char c) {
>> >>             boolean type = false;
>> >>             boolean space = Character.isWhitespace(c);
>> >>             boolean letDig = Character.isLetterOrDigit(c);
>> >>
>> >>             if (letDig && !space)        // letter or digit, not whitespace
>> >>                 type = true;
>> >>             else if (!letDig && !space)  // not letter, digit or whitespace
>> >>                 type = true;             // (retain non-whitespace characters)
>> >>             else if (!letDig && space)   // whitespace: break the token here
>> >>                 type = false;
>> >>             return type;
>> >>         }
>> >>     };
>> >> }
>> >> --
>> >> The problem is that when the term "6/12 (finding)" is tokenised, two
>> >> tokens are generated, viz. '6/12' and '(finding)'. Therefore, when I
>> >> search for '6/12', the term is returned, since it is in effect an
>> >> EXACT token match.
>> >>
>> >> However, when the term "R-eye=6/12 (finding)" is tokenised, it again
>> >> results in two tokens, viz. 'R-eye=6/12' and '(finding)'. So now if
>> >> I look for '6/12' it's no longer an exact match, since there is no
>> >> token with this EXACT value. A simple searcher.search(query) isn't
>> >> enough to pull out the partial token match.
>> >>
>> >> I think it won't be useful to create separate tokens for "6", "/",
>> >> "12" or "R", "-", "eye", "=", and so on. I'm having a look at
>> >> RegexTermEnum and WildcardTermEnum as they might possibly help.
>> >>
>> >> I'd appreciate your comments on the BaseAnalyzer tokeniser and the
>> >> query explanation I've shown above.
>> >>
>> >> Thanks
>> >> Rahil
>> >>
>> >> Erick Erickson wrote:
>> >>
>> >> > Most often, from what I've seen on this e-mail list, unexpected
>> >> > results are because you're not indexing on the tokens you *think*
>> >> > you're indexing. Or not searching on them. By that I mean that the
>> >> > analyzers you're using are behaving in ways you don't expect.
>> >> >
>> >> > That said, I think you're getting exactly what you should. I
>> >> > suspect you're indexing tokens as follows:
>> >> > doc1: "6/12" and "(finding)"
>> >> > doc2: "R-eye=6/12" and "(finding)"
>> >> >
>> >> > So it makes perfect sense that searching on 6/12 returns doc1 and
>> >> > searching on R-eye=6/12 returns doc2.
>> >> >
>> >> > So, first question: have you actually used something like Luke
>> >> > (google "luke lucene") to examine your index and see whether what
>> >> > you've put in there is what you expect? What analyzer is your
>> >> > custom analyzer built upon, and is it doing anything you're
>> >> > unaware of (for instance, lower-casing the 'R' in your second
>> >> > example)?
>> >> >
>> >> > Here's what I'd do:
>> >> > 1> get Luke and see what's actually in your index.
>> >> > 2> use searcher.explain (?) to see the query you're actually
>> >> > emitting.
>> >> > 3> if you make no headway, post the smallest code snippets you can
>> >> > that illustrate the problem. Folks would need the indexing AND
>> >> > searching code.
>> >> >
>> >> > As far as queries like "contains" in Java.... Well, sure. Write a
>> >> > filter that filters on regular expressions or wildcards (you'll
>> >> > need WildcardTermEnum and RegexTermEnum). Or index things
>> >> > differently (e.g. index "6/12" and "finding" on doc1, and "r",
>> >> > "eye", "6/12" and "finding" on doc2). Now your searches for "6/12"
>> >> > will work. Or index "6", "/", "12" and "finding" on doc1, index
>> >> > similarly for doc2, and use a SpanNearQuery with an appropriate
>> >> > span value. Or....
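>> >> >
>> >> > For the filter route, the skeleton would be something like the
>> >> > following (untested; WildcardFilter is just a name I'm making up,
>> >> > and a leading * makes it enumerate every term, so it's slow):
>> >> >
>> >> > import java.io.IOException;
>> >> > import java.util.BitSet;
>> >> > import org.apache.lucene.index.IndexReader;
>> >> > import org.apache.lucene.index.Term;
>> >> > import org.apache.lucene.index.TermDocs;
>> >> > import org.apache.lucene.search.Filter;
>> >> > import org.apache.lucene.search.WildcardTermEnum;
>> >> >
>> >> > public class WildcardFilter extends Filter {
>> >> >     private final Term term;
>> >> >
>> >> >     public WildcardFilter(Term term) { this.term = term; }
>> >> >
>> >> >     public BitSet bits(IndexReader reader) throws IOException {
>> >> >         BitSet bits = new BitSet(reader.maxDoc());
>> >> >         WildcardTermEnum terms = new WildcardTermEnum(reader, term);
>> >> >         TermDocs docs = reader.termDocs();
>> >> >         try {
>> >> >             do {
>> >> >                 Term t = terms.term();
>> >> >                 if (t == null) break;
>> >> >                 docs.seek(t); // mark every doc holding this term
>> >> >                 while (docs.next()) {
>> >> >                     bits.set(docs.doc());
>> >> >                 }
>> >> >             } while (terms.next());
>> >> >         } finally {
>> >> >             terms.close();
>> >> >             docs.close();
>> >> >         }
>> >> >         return bits;
>> >> >     }
>> >> > }
>> >> >
>> >> > and then wrap it:
>> >> >
>> >> > Query q = new ConstantScoreQuery(
>> >> >         new WildcardFilter(new Term("TERM", "*6/12*")));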
>> >> >
>> >> > This is all gobbledygook if you haven't gotten a copy of "Lucene
>> >> > in Action", which you should read in order to get the most out of
>> >> > Lucene. It's for the 1.4 code base, but the 2.0 Lucene code base
>> >> > isn't that much different. More importantly, it ties lots of stuff
>> >> > together. Also, the JUnit tests that come along with the Lucene
>> >> > code can be invaluable in showing you how to do something.
>> >> >
>> >> > Hope this helps
>> >> > Erick
>> >> >
>> >> > On 10/1/06, Rahil <qamar_rahil@yahoo.co.uk> wrote:
>> >> >
>> >> >>
>> >> >> Hi
>> >> >>
>> >> >> I have a custom-built Analyzer that tokenises all non-whitespace
>> >> >> characters in the field "TERM" (which is the only field being
>> >> >> tokenised).
>> >> >>
>> >> >> If I now query my index file for a term "6/12", for instance, I
>> >> >> get back only ONE result
>> >> >>
>> >> >> SCORE    DESCRIPTIONSTATUS    CONCEPTID    TERM
>> >> >> 1.0    0    260278007    6/12 (finding)
>> >> >>
>> >> >> instead of TWO. There is another token in the index file of the form
>> >> >>
>> >> >> 2561280012    0    163939000    R-eye=6/12 (finding)    0    3    en
>> >> >>
>> >> >> At first it wasn't quite obvious to me why this was happening. But
>> >> >> after playing around a bit I realised that if I pass the query
>> >> >> "R-eye=6/12" instead, I get this second result (but not the first
>> >> >> one). Is there a way to tweak the Query query =
>> >> >> parser.parse(searchString) call so that I can get both records if
>> >> >> I query for "6/12"? Something like a 'contains' query in Java.
>> >> >>
>> >> >> I will appreciate any help. Thanks a lot
>> >> >>
>> >> >> Regards
>> >> >> Rahil

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

