lucene-java-user mailing list archives

From Steven Rowe <sar...@syr.edu>
Subject Re: Performing a like query
Date Fri, 06 Oct 2006 16:30:17 GMT
Hi Rahil,

Rahil wrote:
> I couldn't figure out a valid regular expression for
> Pattern.compile(String regex) which can tokenise the string "O/E -
> visual acuity R-eye=6/24" into "O", "/", "E", "-", "visual", "acuity",
> "R", "-", "eye", "=", "6", "/", "24".

The following regular expression should match boundaries between word
and non-word characters, or between space and non-space, in either
order, and consumes any contiguous whitespace around the boundary:

   \s*(?:\b|(?<=\S)(?=\s)|(?<=\s)(?=\S))\s*

Note that with the above regex, the "(%$#!)" in "some (%$#!) text" will
be tokenized as a single token.
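
For example, here's an untested sketch of how I'd expect it to behave
with Java's split() (on older JDKs a zero-width match at the start of
the input may also yield an empty first element):

   import java.util.regex.Pattern;

   public class SplitDemo {
       public static void main(String[] args) {
           Pattern boundary = Pattern.compile(
               "\\s*(?:\\b|(?<=\\S)(?=\\s)|(?<=\\s)(?=\\S))\\s*");
           String[] tokens =
               boundary.split("O/E - visual acuity R-eye=6/24");
           // expected: O / E - visual acuity R - eye = 6 / 24
           for (int i = 0; i < tokens.length; i++)
               System.out.println(tokens[i]);
       }
   }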

Hope it helps,
Steve

> Erick Erickson wrote:
> 
>> Well, I'm not the greatest expert, but a quick look doesn't show me
>> anything obvious. But I have to ask, wouldn't WhitespaceAnalyzer work
>> for you? Although I don't remember whether WhitespaceAnalyzer
>> lowercases or not.
>>
>> It sure looks like you're getting reasonable results given how you're
>> tokenizing.
>>
>> If not that, you might want to think about PatternAnalyzer. It's in
>> the memory contrib section; see
>> org.apache.lucene.index.memory.PatternAnalyzer. One note of caution:
>> the regex identifies what is NOT a token, rather than what is. This
>> threw me for a bit.
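>>
>> Something along these lines might work, if I remember the constructor
>> right (untested; the pattern matches the delimiters between tokens,
>> not the tokens themselves):
>>
>>    import java.util.regex.Pattern;
>>    import org.apache.lucene.index.memory.PatternAnalyzer;
>>
>>    // split on runs of whitespace;
>>    // true = lowercase tokens, null = no stop words
>>    PatternAnalyzer analyzer = new PatternAnalyzer(
>>        Pattern.compile("\\s+"), true, null);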
>>
>> I still claim that you could break the tokens up like "6", "/", "12",
>> and make SpanNearQuery work with a slop of 0 (or 1, I don't remember
>> right now), but that may well be more trouble than it's worth; it's
>> up to you of course. What you get out of this is, essentially, a
>> query that's only satisfied if the terms you specify are right next
>> to each other. So you'd find both your documents in your example,
>> since you would have tokenized "6", "/", "12" in, say, positions 0,
>> 1, 2 in doc1 and 4, 5, 6 in the second doc. Since those tokens are
>> adjacent in each doc, searching with a SpanNearQuery for "6", "/",
>> and "12" that are "right next to each other" (which you specify with
>> a slop of 0, as I remember) should get you both.
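>>
>> A rough (untested) sketch of what I mean:
>>
>>    import org.apache.lucene.index.Term;
>>    import org.apache.lucene.search.spans.SpanNearQuery;
>>    import org.apache.lucene.search.spans.SpanQuery;
>>    import org.apache.lucene.search.spans.SpanTermQuery;
>>
>>    SpanQuery[] clauses = new SpanQuery[] {
>>        new SpanTermQuery(new Term("TERM", "6")),
>>        new SpanTermQuery(new Term("TERM", "/")),
>>        new SpanTermQuery(new Term("TERM", "12"))
>>    };
>>    // slop 0 = adjacent; true = clauses must appear in order
>>    SpanNearQuery query = new SpanNearQuery(clauses, 0, true);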
>>
>> Alternatively, if you tokenize it this way, a PhraseQuery might work
>> as well. Thus, searching for "6 / 12" (as a phrase query, and note
>> the spaces) might be just what you want. You'd have to tokenize the
>> query, but that's relatively easy. This is probably much simpler than
>> a SpanNearQuery now that I think about it.....
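>>
>> Again untested, but something like:
>>
>>    import org.apache.lucene.index.Term;
>>    import org.apache.lucene.search.PhraseQuery;
>>
>>    PhraseQuery phrase = new PhraseQuery();
>>    phrase.add(new Term("TERM", "6"));
>>    phrase.add(new Term("TERM", "/"));
>>    phrase.add(new Term("TERM", "12"));
>>    phrase.setSlop(0);  // terms must be exactly adjacent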
>>
>> Be aware that if you use the *TermEnums we've been talking about,
>> you'll probably wind up wrapping them in a ConstantScoreQuery. And if
>> you have no *other* terms, you won't get any relevancy ranking out of
>> your search. This may be important.....
>>
>> Anyway, that's as creative as I can be Sunday night <G>. Best of luck....
>>
>> Erick
>>
>> On 10/1/06, Rahil <qamar_rahil@yahoo.co.uk> wrote:
>>
>>>
>>> Hi Erick
>>>
>>> Thanks for your response. There's a lot to chew on in your reply and
>>> I'm looking at the suggestions you've made.
>>>
>>> Yeah, I have Luke installed and have queried my index, but it isn't
>>> giving me much of an explanation. A query for "6/12" is sent as
>>> "TERM:6/12", which is quite straightforward. I also generated an
>>> explanation of the query in my code and got some more information,
>>> but that wasn't much help either.
>>> -- 
>>> Explanation explain = searcher.explain(query,0);
>>>
>>> OUTPUT:
>>> query: +TERM:6/12
>>> explain.getDescription() : weight(TERM:6/12 in 0), product of:
>>> Detail 0 : 0.99999994 = queryWeight(TERM:6/12), product of:
>>>   2.0986123 = idf(docFreq=1)
>>>   0.47650534 = queryNorm
>>>
>>> Detail 1 : 0.0 = fieldWeight(TERM:6/12 in 0), product of:
>>>   0.0 = tf(termFreq(TERM:6/12)=0)
>>>   2.0986123 = idf(docFreq=1)
>>>   0.5 = fieldNorm(field=TERM, doc=0)
>>>
>>> Number of results returned: 1
>>> SampleLucene.displayIndexResults
>>> SCORE    DESCRIPTIONSTATUS    CONCEPTID    TERM
>>> 1.0    0    260278007    6/12 (finding)
>>> -- 
>>>
>>> My tokeniser, BaseAnalyzer, extends Analyzer. Since I wanted to
>>> retain all non-whitespace characters and not just letters and
>>> digits, I introduced the following block of code in the overridden
>>> tokenStream():
>>> --
>>> public TokenStream tokenStream(String fieldName, Reader reader) {
>>>
>>>     return new CharTokenizer(reader) {
>>>
>>>         protected char normalize(char c) {
>>>             return Character.toLowerCase(c);
>>>         }
>>>
>>>         protected boolean isTokenChar(char c) {
>>>             boolean space  = Character.isWhitespace(c);
>>>             boolean letDig = Character.isLetterOrDigit(c);
>>>
>>>             if (letDig && !space)        // letter or digit
>>>                 return true;
>>>             else if (!letDig && !space)  // other non-whitespace:
>>>                 return true;             // retain it
>>>             else                         // whitespace: token break
>>>                 return false;
>>>         }
>>>     };
>>> }
>>> --
>>>
>>> The problem is that when the term "6/12 (finding)" is tokenised, two
>>> tokens are generated, viz. '6/12' and '(finding)'. Therefore when I
>>> search for '6/12' this term is returned, since it is an EXACT token
>>> match.
>>>
>>> However, when the term "R-eye=6/12 (finding)" is tokenised, it again
>>> results in two tokens, viz. 'R-eye=6/12' and '(finding)'. So now if
>>> I look for '6/12' it's no longer an exact match, since there is no
>>> token with this EXACT value. A simple searcher.search(query) isn't
>>> enough to pull out the partial token match.
>>>
>>> I think it won't be useful to create separate tokens for "6", "/",
>>> "12" or "R", "-", "eye", "=", and so on. I'm having a look at
>>> RegexTermEnum and WildcardTermEnum as they might possibly help.
>>>
>>> Would appreciate your comments on the BaseAnalyzer tokenizer and the
>>> query explanation I've received so far.
>>>
>>> Thanks
>>> Rahil
>>>
>>> Erick Erickson wrote:
>>>
>>> > Most often, from what I've seen on this e-mail list, unexpected
>>> > results come about because you're not indexing the tokens you
>>> > *think* you're indexing, or not searching on them. By that I mean
>>> > that the analyzers you're using are behaving in ways you don't
>>> > expect.
>>> >
>>> > That said, I think you're getting exactly what you should. I
>>> > suspect you're indexing tokens as follows:
>>> > doc1: "6/12" and "(finding)"
>>> > doc2: "R-eye=6/12" and "(finding)"
>>> >
>>> > So it makes perfect sense that searching on 6/12 returns doc1 and
>>> > searching on R-eye=6/12 returns doc2.
>>> >
>>> > So, first question: have you actually used something like Luke
>>> > (google "luke lucene") to examine your index and see if what
>>> > you've put in there is what you expect? What analyzer is your
>>> > custom analyzer built upon, and is it doing anything you're
>>> > unaware of (for instance, lower-casing the 'R' in your second
>>> > example)?
>>> >
>>> > Here's what I'd do:
>>> > 1> get Luke and see what's actually in your index.
>>> > 2> use searcher.explain (?) to see the query you're actually
>>> > emitting.
>>> > 3> if you make no headway, post the smallest code snippets you can
>>> > that illustrate the problem. Folks would need the indexing AND
>>> > searching code.
>>> >
>>> > As far as queries like "contains" in Java.... Well, sure. Write a
>>> > filter that filters on regular expressions or wildcards (you'll
>>> > need WildcardTermEnum and RegexTermEnum). Or index things
>>> > differently: e.g. index "6/12" and "finding" on doc1 and "r",
>>> > "eye", "6/12" and "finding" on doc2. Now your searches for "6/12"
>>> > will work. Or index "6", "/", "12" and "finding" on doc1, index
>>> > similarly for doc2, and use a SpanNearQuery with an appropriate
>>> > slop value. Or....
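>>> >
>>> > Related to the wildcard idea, and simpler if speed isn't critical:
>>> > a programmatic WildcardQuery with leading and trailing wildcards
>>> > would match inside tokens (untested, and a leading wildcard has to
>>> > scan every term in the index, so it can be slow):
>>> >
>>> >    import org.apache.lucene.index.Term;
>>> >    import org.apache.lucene.search.WildcardQuery;
>>> >
>>> >    // QueryParser rejects leading wildcards by default, but
>>> >    // building the query directly works
>>> >    WildcardQuery contains =
>>> >        new WildcardQuery(new Term("TERM", "*6/12*"));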
>>> >
>>> > This is all gobbledygook if you haven't gotten a copy of "Lucene
>>> > in Action", which you should read in order to get the most out of
>>> > Lucene. It's for the 1.4 code base, but the 2.0 Lucene code base
>>> > isn't that much different. More importantly, it ties lots of stuff
>>> > together. Also, the JUnit tests that come along with the Lucene
>>> > code can be invaluable to show you how to do something.
>>> >
>>> > Hope this helps
>>> > Erick
>>> >
>>> > On 10/1/06, Rahil <qamar_rahil@yahoo.co.uk> wrote:
>>> >
>>> >>
>>> >> Hi
>>> >>
>>> >> I have a custom-built Analyzer where I tokenize all
>>> >> non-whitespace characters in the field "TERM" (which is the only
>>> >> field being tokenised).
>>> >>
>>> >> If I now query my index for the term "6/12", for instance, I get
>>> >> back only ONE result:
>>> >>
>>> >> SCORE    DESCRIPTIONSTATUS    CONCEPTID    TERM
>>> >> 1.0    0    260278007    6/12 (finding)
>>> >>
>>> >> instead of TWO. There is another token in the index file of the form
>>> >>
>>> >> 2561280012    0    163939000    R-eye=6/12 (finding)    0    3    en
>>> >>
>>> >> At first it wasn't quite obvious to me why this was happening,
>>> >> but after playing around a bit I realised that if I pass the
>>> >> query "R-eye=6/12" instead, I will get this second result (but
>>> >> not the first one now). Is there a way to tweak the
>>> >> Query query = parser.parse(searchString) call so that I can get
>>> >> both records if I query for "6/12"? Something like a 'contains'
>>> >> query in Java.
>>> >>
>>> >> Will appreciate all help. Thanks a lot
>>> >>
>>> >> Regards
>>> >> Rahil


