lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Performing a like query
Date Fri, 06 Oct 2006 16:24:59 GMT
My intuition is that you'll have a real problem using regular expressions.
It'll either be incredibly ugly (and unmaintainable) or just won't work
since the regular expression tools tend to throw out the delimiters.
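
For what it's worth, plain java.util.regex can keep the delimiters if you
match tokens rather than split on them; a quick, untested sketch (this
won't plug into PatternAnalyzer, though, whose pattern defines what is NOT
a token):

--
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchTokens {
    public static void main(String[] args) {
        // \w+ grabs each run of word characters; [^\w\s] grabs each single
        // non-word, non-space character (the delimiters).
        Pattern p = Pattern.compile("\\w+|[^\\w\\s]");
        Matcher m = p.matcher("O/E - visual acuity R-eye=6/24");
        while (m.find()) {
            System.out.print(m.group() + " ");
        }
        // prints: O / E - visual acuity R - eye = 6 / 24
    }
}
--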

I think you'll be much better off writing your own analyzer (see LIA; the
synonym injector code is a model, although you probably won't want the gap
to be 0). The analyzer just marches down the string, returning the first
part of the line up to the delimiter, or the delimiter itself if that's the
first thing in the line. Logically, it looks kinda like this:

1st call: returns O, preserving the rest (/E - visual acuity R-eye=6/24)
2nd call: recognizes that the line now starts with a delimiter and returns
that, preserving (E - visual acuity R-eye=6/24)
and so on.
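
If it helps, here's roughly what I mean, written against the 1.4/2.0-era
TokenStream API (the class name is made up, and I haven't compiled this):

--
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;

public class DelimiterAnalyzer extends Analyzer {

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new Tokenizer(reader) {
            private int pos = 0;        // running character offset
            private int pending = -2;   // one-char pushback buffer (-2 = empty)

            public Token next() throws IOException {
                StringBuffer buf = new StringBuffer();
                int start = 0;
                while (true) {
                    int c = (pending != -2) ? pending : input.read();
                    pending = -2;
                    if (c == -1) break;                   // end of input
                    char ch = (char) c;
                    if (Character.isWhitespace(ch)) {
                        pos++;
                        if (buf.length() > 0) break;      // word run ended
                    } else if (Character.isLetterOrDigit(ch)) {
                        if (buf.length() == 0) start = pos;
                        pos++;
                        buf.append(Character.toLowerCase(ch));
                    } else {                              // a delimiter character
                        if (buf.length() > 0) {           // finish the word run first,
                            pending = c;                  // hold the delimiter for
                            break;                        // the next call
                        }
                        pos++;
                        return new Token(String.valueOf(ch), pos - 1, pos);
                    }
                }
                if (buf.length() == 0) return null;       // no more tokens
                return new Token(buf.toString(), start, start + buf.length());
            }
        };
    }
}
--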


The only real question is whether you really want to preserve the delimiters
or whether that's unnecessary. Remember that if you use the same analyzer in
both indexing and querying (with a few subtleties), you get the same token
stream. Whether that works for you depends upon whether the delimiters hold
information and their presence or absence alters the meaning of the field.

Not a huge help, but all I can come up with.

Good luck
Erick

On 10/6/06, Rahil <qamar_rahil@yahoo.co.uk> wrote:
>
> Hi Erick
>
> I'm having trouble writing a good regular expression for the
> PatternAnalyzer to deal with word and non-word characters. I couldn't
> figure out a valid regular expression for
> Pattern.compile(String regex) that can tokenise the string "O/E -
> visual acuity R-eye=6/24" into "O", "/", "E", "-", "visual", "acuity",
> "R", "-", "eye", "=", "6", "/", "24". I've given it quite a few shots but
> am now totally frustrated with it. I can either tokenise at \W+ or \w+
> but not both.
>
> Could you please help.
>
> Thanks a lot. Much appreciated.
>
> Regards
> Rahil
>
> Erick Erickson wrote:
>
> > Well, I'm not the greatest expert, but a quick look doesn't show me
> > anything obvious. But I have to ask: wouldn't WhitespaceAnalyzer work
> > for you? Although I don't remember whether WhitespaceAnalyzer lowercases
> > or not.
> >
> > It sure looks like you're getting reasonable results given how you're
> > tokenizing.
> >
> > If not that, you might want to think about PatternAnalyzer. It's in the
> > memory contribution section; see
> > org.apache.lucene.index.memory.PatternAnalyzer. One note of caution: the
> > regex identifies what is NOT a token, rather than what is. This threw me
> > for a bit.
> >
> > I still claim that you could break the tokens up like "6", "/", "12" and
> > make SpanNearQuery work with a slop of 0 (or 1, I don't remember right
> > now), but that may well be more trouble than it's worth; it's up to you,
> > of course. What you get out of this, essentially, is a query that's only
> > satisfied if the terms you specify are right next to each other. So you'd
> > find both your documents in your example, since you would have tokenized
> > "6", "/", "12" in, say, positions 0, 1, 2 in doc1 and 4, 5, 6 in the
> > second doc. Since they're tokens that are next to each other in each doc,
> > searching with a SpanNearQuery for "6", "/", and "12" that are right next
> > to each other (which you specify with a slop of 0, as I remember) should
> > get you both.
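> >
> > Something like this, if I have the API right (untested; "TERM" is just
> > the field name from your earlier mails):
> >
> > import org.apache.lucene.index.Term;
> > import org.apache.lucene.search.Query;
> > import org.apache.lucene.search.spans.SpanNearQuery;
> > import org.apache.lucene.search.spans.SpanQuery;
> > import org.apache.lucene.search.spans.SpanTermQuery;
> >
> > SpanQuery[] parts = new SpanQuery[] {
> >     new SpanTermQuery(new Term("TERM", "6")),
> >     new SpanTermQuery(new Term("TERM", "/")),
> >     new SpanTermQuery(new Term("TERM", "12"))
> > };
> > // slop 0, in order: the three tokens must be adjacent
> > Query q = new SpanNearQuery(parts, 0, true);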
> >
> > Alternatively, if you tokenize it this way, a PhraseQuery might work as
> > well. Thus, searching for "6 / 12" (as a phrase query; note the spaces)
> > might be just what you want. You'd have to tokenize the query, but that's
> > relatively easy. This is probably much simpler than a SpanNearQuery, now
> > that I think about it.....
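> >
> > Building the phrase from the tokens yourself would look something like
> > this (again untested):
> >
> > import org.apache.lucene.index.Term;
> > import org.apache.lucene.search.PhraseQuery;
> >
> > PhraseQuery pq = new PhraseQuery();
> > pq.add(new Term("TERM", "6"));
> > pq.add(new Term("TERM", "/"));
> > pq.add(new Term("TERM", "12"));
> > pq.setSlop(0);   // exact adjacency (0 is already the default)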
> >
> > Be aware that if you use the *TermEnums we've been talking about, you'll
> > probably wind up wrapping them in a ConstantScoreQuery. And if you have
> > no *other* terms, you won't get any relevancy out of your search. This
> > may be important.....
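> >
> > The filter itself would look roughly like the stock RangeFilter does;
> > here's an untested sketch using WildcardTermEnum (class name and the
> > "*6/12*" pattern are just for illustration):
> >
> > import java.io.IOException;
> > import java.util.BitSet;
> > import org.apache.lucene.index.IndexReader;
> > import org.apache.lucene.index.Term;
> > import org.apache.lucene.index.TermDocs;
> > import org.apache.lucene.search.Filter;
> > import org.apache.lucene.search.WildcardTermEnum;
> >
> > public class WildcardFilter extends Filter {
> >     private final Term pattern;
> >
> >     public WildcardFilter(Term pattern) { this.pattern = pattern; }
> >
> >     public BitSet bits(IndexReader reader) throws IOException {
> >         BitSet bits = new BitSet(reader.maxDoc());
> >         WildcardTermEnum terms = new WildcardTermEnum(reader, pattern);
> >         TermDocs docs = reader.termDocs();
> >         try {
> >             do {
> >                 Term t = terms.term();
> >                 if (t == null) break;
> >                 docs.seek(t);          // mark every doc holding a matching term
> >                 while (docs.next()) bits.set(docs.doc());
> >             } while (terms.next());
> >         } finally {
> >             docs.close();
> >             terms.close();
> >         }
> >         return bits;
> >     }
> > }
> >
> > // usage:
> > // Query q = new ConstantScoreQuery(
> > //     new WildcardFilter(new Term("TERM", "*6/12*")));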
> >
> > Anyway, that's as creative as I can be Sunday night <G>. Best of luck....
> >
> > Erick
> >
> > On 10/1/06, Rahil <qamar_rahil@yahoo.co.uk> wrote:
> >
> >>
> >> Hi Erick
> >>
> >> Thanks for your response. There's a lot to chew on in your reply, and
> >> I'm looking at the suggestions you've made.
> >>
> >> Yes, I have Luke installed and have queried my index, but it isn't
> >> giving me much of an explanation. A query for "6/12" is sent as
> >> "TERM:6/12", which is quite straightforward. I did run an explanation
> >> of the query in my code and got some more information, but that wasn't
> >> much help either.
> >> --
> >> Explanation explain = searcher.explain(query,0);
> >>
> >> OUTPUT:
> >> query: +TERM:6/12
> >> explain.getDescription() : weight(TERM:6/12 in 0), product of:
> >> Detail 0 : 0.99999994 = queryWeight(TERM:6/12), product of:
> >>   2.0986123 = idf(docFreq=1)
> >>   0.47650534 = queryNorm
> >>
> >> Detail 1 : 0.0 = fieldWeight(TERM:6/12 in 0), product of:
> >>   0.0 = tf(termFreq(TERM:6/12)=0)
> >>   2.0986123 = idf(docFreq=1)
> >>   0.5 = fieldNorm(field=TERM, doc=0)
> >>
> >> Number of results returned: 1
> >> SampleLucene.displayIndexResults
> >> SCORE    DESCRIPTIONSTATUS    CONCEPTID    TERM
> >> 1.0    0    260278007    6/12 (finding)
> >> --
> >>
> >> My tokeniser, BaseAnalyzer, extends Analyzer. Since I wanted to retain
> >> all non-whitespace characters, and not just letters and digits, I
> >> introduced the following block of code in the overridden tokenStream():
> >>
> >> --
> >> public TokenStream tokenStream(String fieldName, Reader reader) {
> >>
> >>     return new CharTokenizer(reader) {
> >>
> >>         protected char normalize(char c) {
> >>             return Character.toLowerCase(c);
> >>         }
> >>
> >>         protected boolean isTokenChar(char c) {
> >>             boolean space  = Character.isWhitespace(c);
> >>             boolean letDig = Character.isLetterOrDigit(c);
> >>
> >>             if (letDig && !space)          // letter or digit
> >>                 return true;
> >>             else if (!letDig && !space)    // other non-whitespace: retain it
> >>                 return true;
> >>             else                           // whitespace: token boundary
> >>                 return false;
> >>         }
> >>     };
> >> }
> >> --
> >> The problem is that when the term "6/12 (finding)" is tokenised, two
> >> tokens are generated, viz. '6/12' and '(finding)'. Therefore when I
> >> search for '6/12' this term is returned, since in a way it is an EXACT
> >> token match.
> >>
> >> However, when the term "R-eye=6/12 (finding)" is tokenised, it again
> >> results in two tokens, viz. 'R-eye=6/12' and '(finding)'. So now if I
> >> look for '6/12' it's no longer an exact match, since there is no token
> >> with this EXACT value. A simple searcher.search(query) isn't enough to
> >> pull out the partial token match.
> >>
> >> I think it won't be useful to create separate tokens for "6", "/", "12"
> >> or "R", "-", "eye", "=", and so on. I'm having a look at RegexTermEnum
> >> and WildcardTermEnum as they might possibly help.
> >>
> >> Would appreciate your comments on the BaseAnalyzer tokenizer and the
> >> query explanation I've received so far.
> >>
> >> Thanks
> >> Rahil
> >>
> >> Erick Erickson wrote:
> >>
> >> > Most often, from what I've seen on this e-mail list, unexpected
> >> > results are because you're not indexing on the tokens you *think*
> >> > you're indexing. Or not searching on them. By that I mean that the
> >> > analyzers you're using are behaving in ways you don't expect.
> >> >
> >> > That said, I think you're getting exactly what you should. I suspect
> >> > you're indexing tokens as follows:
> >> > doc1: "6/12" and "(finding)"
> >> > doc2: "R-eye=6/12" and "(finding)"
> >> >
> >> > So it makes perfect sense that searching on 6/12 returns doc1 and
> >> > searching on R-eye=6/12 returns doc2.
> >> >
> >> > So, first question: have you actually used something like Luke (google
> >> > "luke lucene") to examine your index and see whether what you've put
> >> > in there is what you expect? What analyzer is your custom analyzer
> >> > built upon, and is it doing anything you're unaware of (for instance,
> >> > lower-casing the 'R' in your second example)?
> >> >
> >> > Here's what I'd do.
> >> > 1> get Luke and see what's actually in your index.
> >> > 2> use searcher.explain to see the query you're actually emitting.
> >> > 3> if you make no headway, post the smallest code snippets you can
> >> > that illustrate the problem. Folks would need the indexing AND
> >> > searching code.
> >> >
> >> > As far as queries like "contains" in Java... well, sure. Write a
> >> > filter that filters on regular expressions or wildcards (you'll need
> >> > WildcardTermEnum and RegexTermEnum). Or index things differently:
> >> > e.g., index "6/12" and "finding" on doc1, and "r", "eye", "6/12" and
> >> > "finding" on doc2. Now your searches for "6/12" will work. Or index
> >> > "6", "/", "12" and "finding" on doc1, index similarly for doc2, and
> >> > use a SpanNearQuery with an appropriate span value. Or....
> >> >
> >> > This is all gobbledygook if you haven't gotten a copy of "Lucene in
> >> > Action", which you should read in order to get the most out of Lucene.
> >> > It's for the 1.4 code base, but the 2.0 code base isn't that much
> >> > different. More importantly, it ties lots of stuff together. Also, the
> >> > junit tests that come along with the Lucene code can be invaluable to
> >> > show you how to do something.
> >> >
> >> > Hope this helps
> >> > Erick
> >> >
> >> > On 10/1/06, Rahil <qamar_rahil@yahoo.co.uk> wrote:
> >> >
> >> >>
> >> >> Hi
> >> >>
> >> >> I have a custom-built Analyzer that tokenises all non-whitespace
> >> >> characters in the field "TERM" (which is the only field being
> >> >> tokenised).
> >> >>
> >> >> If I now query my index file for the term "6/12", for instance, I
> >> >> get back only ONE result:
> >> >>
> >> >> SCORE    DESCRIPTIONSTATUS    CONCEPTID    TERM
> >> >> 1.0    0    260278007    6/12 (finding)
> >> >>
> >> >> instead of TWO. There is another token in the index file of the form
> >> >>
> >> >> 2561280012    0    163939000    R-eye=6/12 (finding)    0    3    en
> >> >>
> >> >> At first it wasn't obvious to me why this was happening. But after
> >> >> playing around a bit I realised that if I pass the query "R-eye=6/12"
> >> >> instead, I get this second result (but not the first one now). Is
> >> >> there a way to tweak the Query query = parser.parse(searchString)
> >> >> call so that I can get both records when I query for "6/12"?
> >> >> Something like a 'contains' query in Java.
> >> >>
> >> >> Will appreciate all help. Thanks a lot
> >> >>
> >> >> Regards
> >> >> Rahil
> >> >>
> >> >
> >>
> >
>
