lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: searching for string with a blank
Date Tue, 30 Sep 2008 15:03:12 GMT
Sure, you can use the same analyzer at query time, or an analyzer
that does what you want. See KeywordAnalyzer for instance
(although that one doesn't lower case, so...). Or just make
your own analyzer, see the Synonym example in Lucene In Action.

Or be really simple, and just (say) lowercase your input stream
before passing it to the analyzer, probably both at index
and query time.

Do note that I *think* (but am not entirely sure) that UN_TOKENIZED
doesn't change the case of your indexed tokens, so you probably
have to re-think about what you really want to happen in your
indexing process. Luke helps here too <G>.

<<<the problem is that I use some fields tokenized and other untokenized
but that all the queries are parsed using my analyzer. Is this correct?>>>

Yes, that's the nub of it. See PerFieldAnalyzerWrapper for a way to
use different analyzers on different fields (both at index and query
time). It's perfectly reasonable to use different analyzers on different
fields, and easy with PerFieldAnalyzerWrapper.


Best
Erick


On Tue, Sep 30, 2008 at 10:40 AM, David Massart <dmassart@acm.org> wrote:

> Thanks Erick.
>
> I see, the problem is that I use some fields tokenized and other
> untokenized
> but that all the queries are parsed using my analyzer. Is this correct?
>
> I suppose that removing the blanks from  learningResourceType before adding
> it as an untokenized field helps (provided that I do the same operation at
> query time).
>
> Or are there more efficient ways to deal with the problem?
>
> Cheers,
>
> David
>
> On Tue, Sep 30, 2008 at 3:51 PM, Erick Erickson <erickerickson@gmail.com
> >wrote:
>
> > What *analyzer* are you using for your queries? Have Luke explain
> > your queries and I suspect you'll see your problem. Which if you're
> > using StandardAnalyzer is probably that 'case study'
> > will look like (lrt:case lrt:study). Note that the implied operator is OR
> > unless
> > you've changed it. But these are compared against your tokens
> >
> > application
> >
> > assessment
> >
> > broadcast
> >
> > case study
> >
> > course
> >
> > demonstration
> >
> > drill and practice
> >
> > educational game
> >
> > EACH OF WHICH IS A TOKEN. No indexed value matches
> > "case" (the closest is "case study", but it doesn't match, since
> > you indexed it UN_TOKENIZED, "case" != "case study")
> > and similarly for "study".
> >
> > If this is completely off base, perhaps you can provide some additional
> > detail.
> >
> > Best
> > Erick
> >
> > On Tue, Sep 30, 2008 at 9:24 AM, David Massart <dmassart@acm.org> wrote:
> >
> > > Hi all,
> > > Here is a new newbee question.
> > >
> > > I'm adding to a lucene index, documents containing a field lrt created
> as
> > > follows:
> > >
> > > doc.add( new Field( "lrt", learningResourceType
> > >
> > > .toLowerCase(), Field.Store.NO,
> > >
> > > Field.Index.UN_TOKENIZED ) ) ;
> > >
> > >
> > > where possible values for learningResourceType are:
> > >
> > >
> > > application
> > >
> > > assessment
> > >
> > > broadcast
> > >
> > > case study
> > >
> > > course
> > >
> > > demonstration
> > >
> > > drill and practice
> > >
> > > educational game
> > >
> > >
> > > In this index, searching for a string that doesn't contain a blank
> > > character
> > > works  fine
> > >
> > >
> > > for example, lrt: application
> > >
> > >
> > > but searching for a string with a blank does not return any results:
> > >
> > >
> > > for example, lrt: case study doesn't work
> > >
> > >
> > > Using lukeall, I can successfully query my index by escaping the blank
> > with
> > > a backslash
> > >
> > >
> > > for example, lrt: case\ study
> > >
> > >
> > > But it doesn't work when I try to do the same thing in Java
> > >
> > >
> > > String query = "lrt: case\\ study"
> > >
> > >
> > > doesn't return any result.
> > >
> > >
> > > I've also try to QueryParser.escape( "case study") but it doesn't seem
> to
> > > affect blank characters?
> > >
> > >
> > > Thanks for your help.
> > >
> > >
> > > David
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message