lucene-solr-user mailing list archives

From ".: Abhishek :." <ab1s...@gmail.com>
Subject Re: Nutch and Solr search on the fly
Date Thu, 10 Feb 2011 01:55:14 GMT
Hi Charan,

 Thanks for the clarifications.

 The link I have been referring to
(http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) does not say
anything about using the crawl command. Do I have to run it after the last
step mentioned?

Thanks,
Abi

On Thu, Feb 10, 2011 at 12:58 AM, charan kumar <charan.kumar@gmail.com> wrote:

> Hi Abhishek,
>
> depth is a parameter of the crawl command, not the fetch command.
>
> If you are using a custom script that calls the individual stages of a
> Nutch crawl, then depth N means running that script N times. You can put
> a loop in the script.
>
> Thanks,
> Charan
>
> On Wed, Feb 9, 2011 at 6:26 AM, .: Abhishek :. <ab1sh3k@gmail.com> wrote:
>
> > Hi Erick,
> >
> >  Thanks a bunch for the response.
> >
> >  That could be it, but I am wondering where to specify the depth in the
> > whole process described at
> > http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/. I tried
> > specifying it during the fetcher phase but it was just ignored :(
> >
> > Thanks,
> > Abi
> >
> > On Wed, Feb 9, 2011 at 10:11 PM, Erick Erickson <erickerickson@gmail.com> wrote:
> >
> > > WARNING: I don't do Nutch much, but could it be that your
> > > crawl depth is 1? See:
> > > http://wiki.apache.org/nutch/NutchTutorial
> > > and search for "depth".
> > >
> > > Best,
> > > Erick
> > >
> > > On Wed, Feb 9, 2011 at 9:06 AM, .: Abhishek :. <ab1sh3k@gmail.com> wrote:
> > >
> > > > Hi Markus,
> > > >
> > > >  I am sorry for not being clear. I meant to say that...
> > > >
> > > >  Suppose a URL, namely www.somehost.com/gifts/greetingcard.html (which
> > > > in turn contains links to a.html, b.html, c.html, d.html), is injected
> > > > into the seed.txt. After the whole process I was expecting a bunch of
> > > > other pages crawled from this seed URL. However, at the end of it all
> > > > I see is the contents of only this page, namely
> > > > www.somehost.com/gifts/greetingcard.html, and I do not see any of the
> > > > other pages (here a.html, b.html, c.html, d.html) crawled from this one.
> > > >
> > > >  The crawling happens only for the URLs mentioned in the seed.txt and
> > > > does not proceed further from there, so I am a bit confused. Why is it
> > > > not crawling the linked pages (a.html, b.html, c.html and d.html)? I get
> > > > a feeling that I am missing something that the author of the blog
> > > > (http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) assumed
> > > > everyone would know.
> > > >
> > > > Thanks,
> > > > Abi
> > > >
> > > >
> > > > On Wed, Feb 9, 2011 at 7:09 PM, Markus Jelsma <markus.jelsma@openindex.io> wrote:
> > > >
> > > > > The parsed data is only sent to the Solr index if you tell a segment
> > > > > to be indexed: solrindex <crawldb> <linkdb> <segment>
> > > > >
> > > > > If you did this only once, after injecting and then running the
> > > > > subsequent fetch, parse, update, index sequence, then you will, of
> > > > > course, only see those URLs. If you don't index a segment after it
> > > > > has been parsed, you need to do it later on.
> > > > >
> > > > > On Wednesday 09 February 2011 04:29:44 .: Abhishek :. wrote:
> > > > > > Hi all,
> > > > > >
> > > > > >  I am a newbie to Nutch and Solr. Well, relatively much newer to Solr
> > > > > > than Nutch :)
> > > > > >
> > > > > >  I have been using Nutch for the past two weeks, and I wanted to know
> > > > > > if I can query or search my Nutch crawls on the fly (before the crawl
> > > > > > completes). I am asking this because the websites I am crawling are
> > > > > > really huge and it takes around 3-4 days for a crawl to complete. I
> > > > > > want to analyze some quick results while the Nutch crawler is still
> > > > > > crawling the URLs. Someone suggested to me that Solr would make this
> > > > > > possible.
> > > > > >
> > > > > >  I followed the steps in
> > > > > > http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for this.
> > > > > > With this process, only the injected URLs show up in the Solr search.
> > > > > > I know I did something really foolish and the crawl never happened. I
> > > > > > feel I am missing some information here. I think somewhere in the
> > > > > > process a crawl should be happening, and I missed it.
> > > > > >
> > > > > >  Just wanted to see if someone could help me by pointing out where I
> > > > > > went wrong in the process. Forgive my foolishness and thanks for your
> > > > > > patience.
> > > > > >
> > > > > > Cheers,
> > > > > > Abi
> > > > >
> > > > > --
> > > > > Markus Jelsma - CTO - Openindex
> > > > > http://www.linkedin.com/in/markus17
> > > > > 050-8536620 / 06-50258350
> > > > >
> > > >
> > >
> >
>
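
For readers following the thread, the point made by Charan and Markus can be illustrated with a toy model. This is only a sketch in Python, not Nutch code: the link graph, the crawl function, and the page a1.html are hypothetical, and the real generate step also applies scoring and limits that are ignored here. Each round fetches only the URLs already in the crawl db, queues newly discovered links for the next round, and indexes the finished segment right away.

```python
# Hypothetical link graph standing in for the site described in the thread.
# a1.html is an invented page, added only to show a third round.
LINKS = {
    "greetingcard.html": ["a.html", "b.html", "c.html", "d.html"],
    "a.html": ["a1.html"],
    "b.html": [],
    "c.html": [],
    "d.html": [],
    "a1.html": [],
}


def crawl(seeds, depth):
    """Run up to `depth` generate/fetch/parse/update rounds.

    Returns the list of indexed pages, in the order they were indexed.
    """
    crawldb = {url: "unfetched" for url in seeds}  # inject the seed URLs
    index = []  # stands in for the Solr index

    for _ in range(depth):
        # generate: select every URL that has not been fetched yet
        segment = [url for url, state in crawldb.items() if state == "unfetched"]
        if not segment:
            break  # nothing left to crawl

        for url in segment:
            crawldb[url] = "fetched"  # fetch and parse the page
            for link in LINKS.get(url, []):
                # update: newly discovered links wait for the NEXT round
                crawldb.setdefault(link, "unfetched")

        # index this segment immediately, so it is searchable
        # while later rounds are still pending
        index.extend(segment)

    return index


if __name__ == "__main__":
    for depth in (1, 2, 3):
        print(depth, crawl(["greetingcard.html"], depth))
```

With depth 1 the index holds only greetingcard.html, which matches what Abi saw; with depth 2 it also holds a.html through d.html. Because each segment is indexed as soon as its round ends, partial results are searchable before the whole crawl finishes.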
