lucene-java-user mailing list archives

From prashant ullegaddi <prashullega...@gmail.com>
Subject Re: Weird behaviour
Date Sun, 02 Aug 2009 18:39:16 GMT
Thank you Phil and Shai.

I will write a different Analyzer.

On Sun, Aug 2, 2009 at 11:50 PM, Shai Erera <serera@gmail.com> wrote:

> You can always create your own Analyzer which creates a TokenStream just
> like StandardAnalyzer, but instead of using StandardFilter, write another
> TokenFilter which receives tokens of the HOST type and breaks them further
> into their components (e.g., extract "en", "wikipedia" and "org"). You can
> also return the original HOST token along with its components.
>
> I hope this helps.
>
> Shai
>
> On Sun, Aug 2, 2009 at 8:58 PM, prashant ullegaddi <prashullegaddi@gmail.com> wrote:
>
> > Hi Phil,
> >
> > The query you gave did work. Well, that proves StandardAnalyzer has a
> > different way of tokenizing URLs.
> >
> > Thanks,
> > Prashant.
> >
> > On Sun, Aug 2, 2009 at 11:22 PM, Phil Whelan <phil123@gmail.com> wrote:
> >
> > > Hi Prashant,
> > >
> > > I agree with Shai that using Luke and printing out what the Document
> > > looks like before it goes into the index are going to be your best
> > > bet for debugging this problem.
> > >
> > > The problem you're having is that StandardAnalyzer does not break up
> > > the hostname into separate terms, as it has a special case for
> > > hostnames and acronyms, so "en.wikipedia.org" is indexed as a single
> > > term.
> > >
> > > This should work...
> > > +title:"rahul dravid" +url:"en.wikipedia.org"
> > >
> > > Thanks,
> > > Phil
> > >
> > > On Sun, Aug 2, 2009 at 10:14 AM, prashant ullegaddi <prashullegaddi@gmail.com> wrote:
> > > > Yes, I'm sure that title:"Rahul Dravid" is extracted properly, and there
> > > > is a document relevant to this query as well.
> > > > The following query and its results prove it:
> > > >
> > > > Enter query:
> > > > Searching for: +title:"rahul dravid" +url:wiki
> > > > 4 total matching documents
> > > >   trec-id: clueweb09-enwp02-13-14368, URL:
> > > > http://en.wikipedia.org/wiki/Rahul_Dravid
> > > >   trec-id: clueweb09-enwp01-83-11378, URL:
> > > > http://en.wikipedia.org/wiki/Rahul_S_Dravid
> > > >   trec-id: clueweb09-en0011-08-22737, URL:
> > > > http://www.reference.com/browse/wiki/Rahul_Dravid
> > > >   trec-id: clueweb09-enwp01-69-13556, URL:
> > > > http://en.wikipedia.org/wiki/Rahul_Sharad_Dravid
> > > > Press (q)uit or enter number to jump to a page.
> > > >
> > > > But see the following query:
> > > >
> > > > Enter query:
> > > > +title:"rahul dravid" +url:"wikipedia"
> > > > Searching for: +title:"rahul dravid" +url:wikipedia
> > > > 0 total matching documents
> > > > Press (q)uit or enter number to jump to a page.
> > > >
> > > > Isn't it weird?
> > > >
> > > > -- Prashant.
> > >
> >
>
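
For reference, the splitting step Shai describes above might look roughly like the sketch below. It is plain Java with no Lucene dependency, so it only illustrates the logic a custom TokenFilter could apply to a HOST token; the class name HostTokenSplitter and the method split are hypothetical and not part of Lucene. It keeps the original host and appends its dot-separated components, as suggested in the thread.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Plain-Java sketch (no Lucene dependency) of the splitting step that a
 * custom TokenFilter could apply to a HOST-type token.
 * The class and method names are hypothetical, chosen for illustration.
 */
public class HostTokenSplitter {

    /**
     * Returns the original host followed by its dot-separated components.
     * A host without dots is returned on its own, since it has no
     * smaller components to add.
     */
    public static List<String> split(String host) {
        List<String> terms = new ArrayList<>();
        terms.add(host); // keep the original HOST token
        String[] parts = host.split("\\.");
        if (parts.length > 1) {
            for (String part : parts) {
                if (!part.isEmpty()) {
                    terms.add(part); // e.g. "en", "wikipedia", "org"
                }
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Prints the host followed by its components.
        System.out.println(split("en.wikipedia.org"));
    }
}
```

With terms like these in the index, a query such as +url:wikipedia would have a term to match, while +url:"en.wikipedia.org" would keep working because the original token is retained. Wiring this into a real Analyzer is left out here, since the exact TokenFilter API depends on the Lucene version in use.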
