lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Re : Re: Re : Re: Re : Re: Question concerning Analyzers
Date Thu, 08 Feb 2007 22:13:17 GMT
See below....

On 2/8/07, Xavier To <to.xavier@courrier.uqam.ca> wrote:
>
> Thanks for helping me.
>
> I don't really understand what you mean by my Tokenizer "corrects" what
> the indexing analyzer did.


You shouldn't have to do change the tokens in the usual case to get the
search to work right. You mentioned tokenizing the search string, but then
having to add whitespaces back in. That step is the step that "corrects"
what the analyzer did. I put "corrects" in quotes because it isn't really
correcting anything, the analyzers are doing what they should. But if you
have to make this manual change, you're trying to fix up the query string to
match what the analyzer did at index time. Which will leave you correcting
this, then that, then the other thing when it would be much better just to
use the same analyzer if possible. I've just seen too many "oh, there's one
more thing" statements in this situation.


By the way, the tokenizer we use is one provided in Lucene. My guess is that
> the problem was that the analyzer was thought to be the same by the guy who
> made the search engine, but the querying analyzer is fetched inside a JAR by
> a bean. Could it be that this is the problem ?


It shouldn't be if the same analyzer is fetched inside the bean. Can't you
check what analyzer is used in both cases?

Erick


Xavier Tô
> Bacc. en Informatique et Génie Logiciel
> to.xavier@courrier.uqam.ca
> (450)434-8905
>
> ----- Message d'origine -----
> De: Erick Erickson <erickerickson@gmail.com>
> Date: Jeudi, Février 8, 2007 12:51 pm
> Objet: Re: Re : Re: Re : Re: Question concerning Analyzers
>
> > Well, you've proved that your problem is that the analyzer you're
> > using when
> > querying isn't matching what you use during indexing. I think that
> > whatyou've done will lead you into significant problems down the
> > road as your
> > tokenizer then has to "correct" for what the index analyzer did
> > though.
> > What would probably be MUCH less work in the long run is to align the
> > analyzer you use at query time with the analyzer you use at index
> > time. You
> > can use a PerFieldAnalyzerWrapper to handle different fields in
> > differentways. Forget your custom tokenizer for the time being,
> > just try using the
> > same analyzer during searching that you used during indexing. You
> > can use
> > the
> > *QueryParser<file:///C:/lucene-
> > 2.0.0/docs/api/org/apache/lucene/queryParser/QueryParser.html#QueryParser%28java.lang.String,%20org.apache.lucene.analysis.Analyzer%29>*(String
> <http://java.sun.com/j2se/1.4/docs/api/java/lang/String.html> f,
> > Analyzer<file:///C:/lucene-
> > 2.0.0/docs/api/org/apache/lucene/analysis/Analyzer.html> a)
> >
> > form of the QueryParser, where the Analyzer is the same one you
> > used when
> > indexing. There are some circumstances where you want to use different
> > analyzers when querying and when indexing, but don't go there
> > unless you
> > need to <G>....
> >
> > If that doesn't do what you want, I'd really recommend is that you
> > make your
> > own custom Analyzer, built on, say, WhitespaceTokenizer,
> > LowerCaseFilter.This is usually the way I've approached this kind
> > of problem. And use *that*
> > one at index and query time.
> >
> > There's an example in Lucene In Action, see the SynonymAnalyzer
> > example.That example is MUCH more complex than you'll need <G>...
> >
> > Best
> > Erick
> >
> > On 2/8/07, Xavier To <to.xavier@courrier.uqam.ca> wrote:
> > >
> > > Hey !
> > >
> > > I tried using WhitespaceAnalyzer during the search and it works. I
> > > refactored the tokenizing process so it uses TokenStream instead of
> > > StringTokenizer and it works fine for one thing : the query "this
> > is a test"
> > > becomes "thisisatest". I fixed it by adding a space after each
> > token except
> > > for the last one, but is there a clean way to do it ? I'm using
> > > WhitespaceTokenizer.
> > >
> > > Thanks a bunch !
> > >
> > > Xavier Tô
> > > Bacc. en Informatique et Génie Logiciel
> > > to.xavier@courrier.uqam.ca
> > > (450)434-8905
> > >
> > > ----- Message d'origine -----
> > > De: Erick Erickson <erickerickson@gmail.com>
> > > Date: Mercredi, Février 7, 2007 4:28 pm
> > > Objet: Re: Re : Re: Question concerning Analyzers
> > >
> > > > Then the analyzer you're using when parsing the query is stripping
> > > > them. It
> > > > must be different than the one you use when indexing somehow.
> > At least
> > > > that's the only explanation I can imagine....
> > > >
> > > > Perhaps, somehow, you are using a default analyzer when you
> > parse a
> > > > query?Or you aren't specifying the field when you query and
> > thus a
> > > > default is
> > > > used? Or you are using a PerFieldAnalyzerWrapper and dropping
> > > > through to the
> > > > default? or ????
> > > >
> > > > Just for yucks, I'd try using WhitespaceAnalyzer on a query with
> > > > somethingyou *know* exists in the index for a particular field and
> > > > work my way up to
> > > > whatever your real problem is in small steps (since you can't post
> > > > code<G>)......
> > > >
> > > > Best
> > > > Erick
> > > >
> > > > On 2/7/07, Xavier To <to.xavier@courrier.uqam.ca> wrote:
> > > > >
> > > > > Thanks Erik and Erick,
> > > > >
> > > > > I guess my question was rather unclear, but you guys answered it
> > > > all the
> > > > > same : it is impossible for an analyzer to index something and
> > > > having the
> > > > > same analyzer ignore the thing indexed during a search.
> > > > >
> > > > > If it makes everything clearer, during indexation, numbers  are
> > > > indexed,> whether or not they are accompanied by letters ( 2003
> > and> > 4wd are both
> > > > > indexed). That's fine, since we want this.  The problem occurs
> > > > when I try to
> > > > > search for them : They are ignored. I know they are indexed
> > > > because I ran
> > > > > through the index using Luke.
> > > > >
> > > > > Any thoughts regarding this problem ?
> > > > >
> > > > > Xavier Tô
> > > > > Bacc. en Informatique et Génie Logiciel
> > > > > to.xavier@courrier.uqam.ca
> > > > > (450)434-8905
> > > > >
> > > > > ----- Message d'origine -----
> > > > > De: Erik Hatcher <erik@ehatchersolutions.com>
> > > > > Date: Mercredi, Février 7, 2007 3:15 pm
> > > > > Objet: Re: Question concerning Analyzers
> > > > >
> > > > > > There is no requirement that you use the same analyzer to
> > > > search as
> > > > > >
> > > > > > you used to index.  So, yes, you could certainly index
> > things and
> > > > > > ignore them during a search.
> > > > > >
> > > > > >       Erik
> > > > > >
> > > > > >
> > > > > > On Feb 7, 2007, at 2:10 PM, Xavier To wrote:
> > > > > >
> > > > > > > Hi, me again
> > > > > > >
> > > > > > > I'm still stuck with my search engine, but something popped
> > > > in my
> > > > > >
> > > > > > > head : Can an analyzer index something but ignore it
> > during a
> > > > > > > search ? I'm asking this because now that I've been
> > searching> > for> >
> > > > > > > an answer, I've come to think that I should redo the whole
> > > > search> >
> > > > > > > engine, but I don't want to reproduce the same error as
> > we have
> > > > > > > now. It would be stupid to accidentaly redo the same
> > mistake. I
> > > > > > > still haven't received news from my seniors about me posting
> > > > code> >
> > > > > > > and all...
> > > > > > >
> > > > > > > Xavier Tô
> > > > > > > Bacc. en Informatique et Génie Logiciel
> > > > > > > to.xavier@courrier.uqam.ca
> > > > > > > (450)434-8905
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ----------------------------------------------------------
> > ----
> > > > ----
> > > > > > ---
> > > > > > > To unsubscribe, e-mail: java-user-
> > unsubscribe@lucene.apache.org> > > > > For additional commands, e-
> > mail: java-user-
> > > > help@lucene.apache.org> >
> > > > > >
> > > > > > ------------------------------------------------------------
> > ----
> > > > ----
> > > > > > -
> > > > > > To unsubscribe, e-mail: java-user-
> > unsubscribe@lucene.apache.org> > > > For additional commands, e-
> > mail: java-user-help@lucene.apache.org
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --------------------------------------------------------------
> > ----
> > > > ---
> > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > For additional commands, e-mail: java-user-
> > help@lucene.apache.org> > >
> > > > >
> > > >
> > >
> > >
> > > ------------------------------------------------------------------
> > ---
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message