nutch-user mailing list archives

From "Rohit Potnis" <rohit.pot...@gmail.com>
Subject Re: Searching parameterized URLs
Date Thu, 01 May 2008 05:14:27 GMT
Sorry, I was replying to Jasper's comments on searching the index. Could
anyone help with my last reply in this thread (my alternative approach)?

Also, please ignore the *...* surrounding the Nutch configuration entries in
the previous email (I guess rich-text mail is not supported :)).
E.g. read *+^http://([a-z0-9]*\.)*somesite.com/* as just
+^http://([a-z0-9]*\.)*somesite.com/
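
For reference, here is a rough Python sketch of how a +/- rule list like
this is applied. It rests on assumptions: rules are tried in file order, the
first regex found anywhere in the URL decides, and a URL that matches no rule
is rejected. The -[?*!@=] rule is an assumption about the stock
crawl-urlfilter.txt (I believe the default file ships with such a line to skip
probable query URLs, but check your own copy). RULES and accepts are
hypothetical names for illustration, not Nutch APIs.

```python
import re

# Rules in file order as (sign, regex) pairs.
# ASSUMPTION: the first rule mirrors a "-[?*!@=]" line believed to be in the
# stock crawl-urlfilter.txt; verify against your own file.
RULES = [
    ("-", r"[?*!@=]"),                              # skip probable query URLs
    ("+", r"^http://([a-z0-9]*\.)*somesite.com/"),  # accept hosts in somesite.com
    ("-", r"."),                                    # skip everything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose regex is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched, so the URL is ignored


print(accepts("http://www.somesite.com/index.html"))              # True
print(accepts("http://www.somesite.com/somepage.jsp?id=someId"))  # False
# Same URL with the assumed skip rule removed:
print(accepts("http://www.somesite.com/somepage.jsp?id=someId", RULES[1:]))  # True
```

If the file really does have such a skip rule ahead of the accept rule, that
alone could explain a URL containing ?id= being filtered out.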

Waiting for a reply..

Rohit


On 4/30/08, Rohit Potnis <rohit.potnis@gmail.com> wrote:
>
> Thanks for your replies..
>
> @Otis:
>
> Continuing from our previous email exchange:
>
> The "xyz" value was not in my list of indexes.
>
> So I tried an alternative:
>
> I changed the URL in my urls folder to:
> http://www.somesite.com/somepage.jsp?id=someId
> hoping that this would fetch only that one URL.
>
> My crawl-urlfilter.txt was configured with:
> # accept hosts in MY.DOMAIN.NAME
> *+^http://([a-z0-9]*\.)*somesite.com/*
>
> and I executed the command: *bin/nutch crawl urls -dir crawldir -depth 10*
>
> This, however, fetched 0 records.
>
> So now I'm wondering if my alternative was correct? If not, can you please
> help me understand the right way to search this?
>
> thanks much,
> Rohit
>
>  On 4/30/08, ogjunk-nutch@yahoo.com <ogjunk-nutch@yahoo.com> wrote:
> >
> > JSP pages typically render HTML, so you don't need a JSP plugin, just the
> > parse-html plugin enabled in your nutch-site.xml.
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> > ----- Original Message ----
> > > From: Jasper Kamperman <jasper.kamperman@openwaternet.com>
> > > To: nutch-user@lucene.apache.org
> > > Sent: Wednesday, April 30, 2008 1:32:29 PM
> > > Subject: Re: Searching parameterized URLs
> > >
> > > > I think the first question is whether the page with URL
> > > > http://www.somesite.com/somepage.jsp?id=someId even made it into your
> > > > index. There are several ways to check this; personally I tend to use
> > > > Luke to look at the index. Tell Luke to open your
> > > > nutch-0.9/crawl/index directory (which is where the index ends up if
> > > > you follow the default instructions for running the crawl).
> > >
> > > > If the page is in your index, you can use Luke to see what fields were
> > > > extracted. Hopefully there is some field, say "foo", that contains
> > > > "xyz" somewhere. The Nutch demo app should then find the page if you
> > > > enter foo:xyz in the search bar. If "foo" is one of "content",
> > > > "title", "anchor" or "url", the demo app should find it if you
> > > > simply search for xyz, with no need to name any of the default fields.
> > >
> > > Since it is a jsp page, it is entirely possible that you either don't
> > > have the correct (jsp) plugin configured or that the plugin you have
> > > isn't smart enough to get the content out of a jsp page.
> > >
> > > Jasper
> > >
> > > On Apr 30, 2008, at 10:13 AM, Rohit Potnis wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm a nutch-newbie and am developing a search-based website.
> > > >
> > > > How can I use Nutch to search for parameterized URLs?
> > > >
> > > > > e.g. I want to search on an item called "xyz". The information on
> > > > > this item is available at
> > > > > http://www.somesite.com/somepage.jsp?id=someId
> > > > > where someId is the database ID (generated by the host application)
> > > > > for item "xyz".
> > > >
> > > > > I know that item "xyz" shows up with the above URL when I search
> > > > > using Google, but it doesn't appear when I search for it using the
> > > > > sample web application provided with Nutch.
> > > >
> > > > *Configuration:*
> > > >
> > > > > I have configured crawl-urlfilter.txt as follows:
> > > > >
> > > > > # accept hosts in MY.DOMAIN.NAME
> > > > *+^http://([a-z0-9]*\.)*somesite.com/*
> > > >
> > > > > My *urls* folder contains a text file with the line:
> > > > *http://www.somesite.com*
> > > >
> > > > > and I executed the command: *bin/nutch crawl urls -dir crawldir -depth 3*
> > > >
> > > > > How can I get http://www.somesite.com/somepage.jsp?id=someId when
> > > > > I search for "xyz", the same way it shows up in a Google search?
> > > >
> > > > Your help would be much appreciated,
> > > > Rohit
> > >
> > >
> >
> >
> >
>
