lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shai Erera <ser...@gmail.com>
Subject Re: Weird behaviour
Date Sun, 02 Aug 2009 18:20:25 GMT
You can always create your own Analyzer which creates a TokenStream just
like StandardAnalyzer, but instead of using StandardFilter, write another
TokenFilter which receives the HOST token type, and breaks it further to its
components (e.g., extract "en", "wikipedia" and "org"). You can also return
the original HOST token and its components.

I hope this helps.

Shai

On Sun, Aug 2, 2009 at 8:58 PM, prashant ullegaddi <prashullegaddi@gmail.com
> wrote:

> Hi Phil,
>
> The query you gave did work. Well, that proves StandardAnalyzer has a
> different way
> of tokenizing URLs.
>
> Thanks,
> Prashant.
>
> On Sun, Aug 2, 2009 at 11:22 PM, Phil Whelan <phil123@gmail.com> wrote:
>
> > Hi Prashant,
> >
> > I agree with Shai, that using Luke and printing out what the Document
> > looks like before it goes into the index, are going to be your best
> > bet for debugging this problem.
> >
> > The problem you're having is that StandardAnalyzer does not break-up
> > the hostname into separate terms, as it has a special case for
> > hostnames and acronyms.
> >
> > This should work...
> > +title:"rahul dravid" +url:"en.wikipedia.org"
> >
> > Thanks,
> > Phil
> >
> > On Sun, Aug 2, 2009 at 10:14 AM, prashant
> > ullegaddi<prashullegaddi@gmail.com> wrote:
> > > Yes, I'm sure that title:"Rahul Dravid" is extracted properly, and
> there
> > is
> > > a document relevant to this query as well.
> > > The following query and its results proves it:
> > >
> > > Enter query:
> > > Searching for: +title:"rahul dravid" +url:wiki
> > > 4 total matching documents
> > >   trec-id: clueweb09-enwp02-13-14368, URL:
> > > http://en.wikipedia.org/wiki/Rahul_Dravid
> > >   trec-id: clueweb09-enwp01-83-11378, URL:
> > > http://en.wikipedia.org/wiki/Rahul_S_Dravid
> > >   trec-id: clueweb09-en0011-08-22737, URL:
> > > http://www.reference.com/browse/wiki/Rahul_Dravid
> > >   trec-id: clueweb09-enwp01-69-13556, URL:
> > > http://en.wikipedia.org/wiki/Rahul_Sharad_Dravid
> > > Press (q)uit or enter number to jump to a page.
> > >
> > > But see following query:
> > >
> > > Enter query:
> > > +title:"rahul dravid" +url:"wikipedia"
> > > Searching for: +title:"rahul dravid" +url:wikipedia
> > > 0 total matching documents
> > > Press (q)uit or enter number to jump to a page.
> > >
> > > Isn't it weird?
> > >
> > > -- Prashant.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message