lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Re : Re: Re : Re: Re : Re: Re : Re: Question concerning Analyzers
Date Fri, 09 Feb 2007 17:38:57 GMT
The query should be tokenized *by the query parser*. You shouldn't have to
do the tokenizing yourself. When you print out the results of the parsing,
you should see something like field:value1 field:value2, which are built up
under the covers to be a BooleanQuery with a bunch of clauses.
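
As a rough illustration of the point above, here is a toy model in plain Java. It is not Lucene's API: ToyQueryParser and both analyzer functions are hypothetical names made up for this sketch. It only mimics the idea that the parser runs the analyzer itself and emits one field:token clause per token, so the caller never tokenizes by hand. With Lucene itself, the analogous entry point would be the QueryParser(String f, Analyzer a) constructor quoted further down in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

/**
 * Toy model only: this class is NOT part of Lucene. It mimics the idea that
 * the parser runs the analyzer itself and emits one clause per token.
 */
final class ToyQueryParser {
    private final String field;
    private final Function<String, List<String>> analyzer;

    ToyQueryParser(String field, Function<String, List<String>> analyzer) {
        this.field = field;
        this.analyzer = analyzer;
    }

    /** Analyze the raw query text and render one "field:token" clause per token. */
    String parse(String queryText) {
        List<String> clauses = new ArrayList<>();
        for (String token : analyzer.apply(queryText)) {
            clauses.add(field + ":" + token);
        }
        return String.join(" ", clauses);
    }

    /** Hypothetical analyzer: split on whitespace, then lowercase each token. */
    static List<String> whitespaceLowercase(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    /** Hypothetical analyzer: keep only runs of letters, so digits disappear. */
    static List<String> lettersOnly(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        ToyQueryParser same = new ToyQueryParser("body", ToyQueryParser::whitespaceLowercase);
        ToyQueryParser other = new ToyQueryParser("body", ToyQueryParser::lettersOnly);
        // Raw query text goes in untouched; the parser does the tokenizing.
        System.out.println(same.parse("2002 4WD Test"));
        System.out.println(other.parse("2002 4WD Test"));
    }
}
```

The second parser is only a guess at how numbers could vanish at query time: if the query-side analyzer discards digits, a term that was indexed as 2002 never makes it into a clause. Whether that is what happens in your setup can't be said without seeing the code.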

I think, though, I'm really at the end of any helpful suggestions I can come
up with without looking at some code from both the indexing and querying.
Otherwise, we'll just continue to mislead each other. If you haven't
already, I strongly urge you to get a copy of Lucene In Action since that'll
give you a much more thorough explication of tokenizing than I can.

Best
Erick

On 2/9/07, Xavier To <to.xavier@courrier.uqam.ca> wrote:
>
> Hey, thanks a lot for taking so much time here...
>
> I did check the analyzers and they appear to be the same... at least they
> are the same class and the same package. I just noticed something: they are
> using LowerCaseFilter... I was going to say "could it be the source of the
> numbers being ignored?" but it shouldn't be, since they are indexed (the
> modification of using WhitespaceAnalyzer during the search did return the
> exact number of results for "2002", which is 5).
>
> As for the tokenizing, shouldn't a query be tokenized? It was already
> like that, and all I did was modify it so it would use Lucene's tokenizing
> methods... If a query shouldn't be tokenized, maybe tokenizing it is the
> problem. If it should be tokenized, what am I doing wrong that forces me to
> add a single blank after each token? I mean, I don't understand what the
> analyzer has to do with the tokenizing process... The reason why I add a
> blank is that the tokens are appended into a string, and then the
> string is sent through QueryParser.
>
> As I said, I don't really understand why the guy who made this search
> engine didn't just send the query as one long string instead of tokenizing
> it, but since it was working fine with alphabetical searches, I said to
> myself "it must be the way to do it".
>
> Xavier Tô
> Bachelor's in Computer Science and Software Engineering
> to.xavier@courrier.uqam.ca
> (450)434-8905
>
> ----- Original Message -----
> From: Erick Erickson <erickerickson@gmail.com>
> Date: Thursday, February 8, 2007 5:13 pm
> Subject: Re: Re : Re: Re : Re: Re : Re: Question concerning Analyzers
>
> > See below....
> >
> > On 2/8/07, Xavier To <to.xavier@courrier.uqam.ca> wrote:
> > >
> > > Thanks for helping me.
> > >
> > > I don't really understand what you mean by my Tokenizer "corrects" what
> > > the indexing analyzer did.
> >
> >
> > You shouldn't have to change the tokens in the usual case to get the
> > search to work right. You mentioned tokenizing the search string, but then
> > having to add whitespace back in. That step is the step that "corrects"
> > what the analyzer did. I put "corrects" in quotes because it isn't really
> > correcting anything; the analyzers are doing what they should. But if you
> > have to make this manual change, you're trying to fix up the query string
> > to match what the analyzer did at index time. That will leave you
> > correcting this, then that, then the other thing, when it would be much
> > better just to use the same analyzer if possible. I've just seen too many
> > "oh, there's one more thing" statements in this situation.
> >
> >
> > > By the way, the tokenizer we use is one provided in Lucene. My guess is
> > > that the problem was that the analyzer was thought to be the same by the
> > > guy who made the search engine, but the querying analyzer is fetched
> > > inside a JAR by a bean. Could it be that this is the problem?
> >
> >
> > It shouldn't be if the same analyzer is fetched inside the bean. Can't you
> > check what analyzer is used in both cases?
> >
> > Erick
> >
> >
> > > Xavier Tô
> > > Bachelor's in Computer Science and Software Engineering
> > > to.xavier@courrier.uqam.ca
> > > (450)434-8905
> > >
> > > ----- Original Message -----
> > > From: Erick Erickson <erickerickson@gmail.com>
> > > Date: Thursday, February 8, 2007 12:51 pm
> > > Subject: Re: Re : Re: Re : Re: Question concerning Analyzers
> > >
> > > > Well, you've proved that your problem is that the analyzer you're
> > > > using when querying isn't matching what you use during indexing. I
> > > > think that what you've done will lead you into significant problems
> > > > down the road as your tokenizer then has to "correct" for what the
> > > > index analyzer did, though. What would probably be MUCH less work in
> > > > the long run is to align the analyzer you use at query time with the
> > > > analyzer you use at index time. You can use a PerFieldAnalyzerWrapper
> > > > to handle different fields in different ways. Forget your custom
> > > > tokenizer for the time being, just try using the same analyzer during
> > > > searching that you used during indexing. You can use the
> > > >
> > > > QueryParser(String f, Analyzer a)
> > > >
> > > > form of the QueryParser, where the Analyzer is the same one you used
> > > > when indexing. There are some circumstances where you want to use
> > > > different analyzers when querying and when indexing, but don't go
> > > > there unless you need to <G>....
> > > >
> > > > If that doesn't do what you want, what I'd really recommend is that
> > > > you make your own custom Analyzer, built on, say, WhitespaceTokenizer
> > > > and LowerCaseFilter. This is usually the way I've approached this kind
> > > > of problem. And use *that* one at index and query time.
> > > >
> > > > There's an example in Lucene In Action, see the SynonymAnalyzer
> > > > example. That example is MUCH more complex than you'll need <G>...
> > > >
> > > > Best
> > > > Erick
> > > >
> > > > On 2/8/07, Xavier To <to.xavier@courrier.uqam.ca> wrote:
> > > > >
> > > > > Hey!
> > > > >
> > > > > I tried using WhitespaceAnalyzer during the search and it works. I
> > > > > refactored the tokenizing process so it uses TokenStream instead of
> > > > > StringTokenizer and it works fine except for one thing: the query
> > > > > "this is a test" becomes "thisisatest". I fixed it by adding a space
> > > > > after each token except for the last one, but is there a clean way
> > > > > to do it? I'm using WhitespaceTokenizer.
> > > > >
> > > > > Thanks a bunch !
> > > > >
> > > > > Xavier Tô
> > > > > Bachelor's in Computer Science and Software Engineering
> > > > > to.xavier@courrier.uqam.ca
> > > > > (450)434-8905
> > > > >
> > > > > ----- Original Message -----
> > > > > From: Erick Erickson <erickerickson@gmail.com>
> > > > > Date: Wednesday, February 7, 2007 4:28 pm
> > > > > Subject: Re: Re : Re: Question concerning Analyzers
> > > > >
> > > > > > Then the analyzer you're using when parsing the query is stripping
> > > > > > them. It must be different than the one you use when indexing
> > > > > > somehow. At least that's the only explanation I can imagine....
> > > > > >
> > > > > > Perhaps, somehow, you are using a default analyzer when you parse
> > > > > > a query? Or you aren't specifying the field when you query and
> > > > > > thus a default is used? Or you are using a PerFieldAnalyzerWrapper
> > > > > > and dropping through to the default? Or ????
> > > > > >
> > > > > > Just for yucks, I'd try using WhitespaceAnalyzer on a query with
> > > > > > something you *know* exists in the index for a particular field
> > > > > > and work my way up to whatever your real problem is in small steps
> > > > > > (since you can't post code <G>)......
> > > > > >
> > > > > > Best
> > > > > > Erick
> > > > > >
> > > > > > On 2/7/07, Xavier To <to.xavier@courrier.uqam.ca> wrote:
> > > > > > >
> > > > > > > Thanks Erik and Erick,
> > > > > > >
> > > > > > > I guess my question was rather unclear, but you guys answered it
> > > > > > > all the same: it is impossible for an analyzer to index
> > > > > > > something and have the same analyzer ignore the thing indexed
> > > > > > > during a search.
> > > > > > >
> > > > > > > If it makes everything clearer, during indexing, numbers are
> > > > > > > indexed, whether or not they are accompanied by letters (2003
> > > > > > > and 4wd are both indexed). That's fine, since we want this. The
> > > > > > > problem occurs when I try to search for them: they are ignored.
> > > > > > > I know they are indexed because I ran through the index using
> > > > > > > Luke.
> > > > > > >
> > > > > > > Any thoughts regarding this problem?
> > > > > > >
> > > > > > > Xavier Tô
> > > > > > > Bachelor's in Computer Science and Software Engineering
> > > > > > > to.xavier@courrier.uqam.ca
> > > > > > > (450)434-8905
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > > From: Erik Hatcher <erik@ehatchersolutions.com>
> > > > > > > Date: Wednesday, February 7, 2007 3:15 pm
> > > > > > > Subject: Re: Question concerning Analyzers
> > > > > > >
> > > > > > > > There is no requirement that you use the same analyzer to
> > > > > > > > search as you used to index. So, yes, you could certainly
> > > > > > > > index things and ignore them during a search.
> > > > > > > >
> > > > > > > >       Erik
> > > > > > > >
> > > > > > > >
> > > > > > > > On Feb 7, 2007, at 2:10 PM, Xavier To wrote:
> > > > > > > >
> > > > > > > > > Hi, me again
> > > > > > > > >
> > > > > > > > > I'm still stuck with my search engine, but something popped
> > > > > > > > > into my head: can an analyzer index something but ignore it
> > > > > > > > > during a search? I'm asking this because now that I've been
> > > > > > > > > searching for an answer, I've come to think that I should
> > > > > > > > > redo the whole search engine, but I don't want to reproduce
> > > > > > > > > the same error as we have now. It would be stupid to
> > > > > > > > > accidentally redo the same mistake. I still haven't received
> > > > > > > > > news from my seniors about me posting code and all...
> > > > > > > > >
> > > > > > > > > Xavier Tô
> > > > > > > > > Bachelor's in Computer Science and Software Engineering
> > > > > > > > > to.xavier@courrier.uqam.ca
> > > > > > > > > (450)434-8905
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > ---------------------------------------------------------------------
> > > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
