lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shai Erera <ser...@gmail.com>
Subject Re: Weird behaviour
Date Sun, 02 Aug 2009 15:43:51 GMT
How do you parse/convert the page to a Document object? Are you sure the
title "Rahul Dravid" is extracted properly and put in the "title" field?

You can read about Luke here: http://www.getopt.org/luke/.

Can you do System.out.println(document.toString()) before you add it to the
index, and paste the output here?

Shai

On Sun, Aug 2, 2009 at 4:47 PM, prashant ullegaddi <prashullegaddi@gmail.com
> wrote:

> Firstly, I'm indexing the string in url field only.
>
> I've never used Luke, I don't know how to use.
>
> What I'm trying to do is search for those documents which are from
> some particular site, and have a given title.
>
>
> On Sun, Aug 2, 2009 at 4:07 PM, Shai Erera <serera@gmail.com> wrote:
>
> > You write that you index the string under the "url" field. Do you also
> > index
> > it under "title"? If not, that can explain why title:"Rahul Dravid" does
> > not
> > work for you.
> >
> > Also, did you try to look at the index w/ Luke? It will show you what are
> > the terms in the index.
> >
> > Another thing which is always good to debug such things is to create a
> > StandardAnalyzer, then request a tokenStream() from it, passing a
> > StringReader w/ the text you want to parse. Then just print the tokens
> > returned.
> >
> > I've done that, using the version from trunk, w/ Version.2_4, and the
> > tokens
> > that are extracted are:
> > (http,0,4,type=<ALPHANUM>)
> > (en.wikipedia.org,7,23,type=<HOST>)
> > (wiki,24,28,type=<ALPHANUM>)
> > (rahul,29,34,type=<ALPHANUM>)
> > (dravid,35,41,type=<ALPHANUM>)
> >
> > So:
> > 1) You don't get results for title:"Rahul Dravid" since you index it
> under
> > "url" and not "title".
> > 2) url:"wiki/Rahul_Dravid" works, since it looks for a phrase that exists
> > in
> > the index (look at the last 3 tokens produced by the Analyzer, in the
> > output
> > above).
> > 3) ur:"<entire string" also works, since you index all of it under the
> > "url"
> > field.
> >
> > Does this explain the behavior you see?
> >
> > Shai
> >
> > On Sun, Aug 2, 2009 at 1:27 PM, prashant ullegaddi <
> > prashullegaddi@gmail.com
> > > wrote:
> >
> > > Hi,
> > >
> > > I've indexed some 50million documents. I've indexed the target URL of
> > each
> > > document as "url" field by using
> > > StandardAnalyzer with index.ANALYZED. Suppose, there is a wikipedia
> page
> > > with title:"Rahul Dravid" and
> > > url: http://en.wikipedia.org/wiki/Rahul_Dravid.
> > >
> > > But when I search for +title:"Rahul Dravid" +url:"Wikipedia", I'm
> getting
> > > no
> > > results. I get the document(s) when
> > > I search for url:http://en.wikipedia.org/wiki/Rahul_Dravid or url:"
> > > en.wikipedia.org/wiki/Rahul_Dravid". I get
> > > results even when I search for url:"wiki/Rahul_Dravid".
> > >
> > > It'd be helpful if somebody can throw some light on this.
> > >
> > > -- Prashant.
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message