nutch-user mailing list archives

From Arun Kaundal <arun.kaun...@gmail.com>
Subject Re: Help require in local hard-disk crawling with Nutch
Date Thu, 01 Dec 2005 12:43:42 GMT
Thanks very much, Jack. You solved my problem. I have to make the necessary
changes. If I run into difficulty again, I won't hesitate to ask you.
    Thanks again


On 11/30/05, Jack Tang <himars@gmail.com> wrote:
>
> Hi
>
> I hope this helps
> http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
>
> /Jack
>
> On 11/30/05, Arun Kumar Sharma <sharma_arun_se@yahoo.co.in> wrote:
> > Nutch Geeks-
> >
> > I want to crawl my local hard disk and would like to know what I
> > need to do for this. I found this article helpful:
> >   http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6
> >
> >   But I need a little more clarification:
> >
> >   1. Can you send me the default configuration that I need in
> >   crawl-urlfilter.txt for spidering local files? Please make the
> >   necessary changes to the file content below:
> >
> >     # skip http:, ftp:, mailto:, & https: urls
> >     -^(http|ftp|mailto|https):
> >
> >     # skip image and other suffixes we can't yet parse
> >     -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
> >
> >     # skip URLs containing certain characters as probable queries, etc.
> >     -[?*!@=]
> >
> >     # accept hosts in MY.DOMAIN.NAME
> >     +^http://([a-z0-9]*\.)*www.mysite.com/
> >
> >     # skip everything else
> >     -.
> >
> >   After this I added a single entry to my nutch-site.xml file:
> >
> >   <nutch-conf>
> >     <property>
> >       <name>plugin.includes</name>
> >       <value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
> >     </property>
> >   </nutch-conf>
> >
> >     Is it correct? If not, what do I need to change?
> >
> >     If I do this I get the following error:
> >
> >     "051130 102544 SEVERE org.apache.nutch.plugin.PluginRuntimeException:  extension point: org.apache.nutch.searcher.QueryFilter does not exist.
> >     java.lang.ExceptionInInitializerError"
> >
> >   2. In the case of local hard-disk crawling, what do I need to add to
> >   urls.txt?
> >
> >   3. I want to crawl both PDF and MS Word files. How can I include the
> >   plugins for that? What configuration is required for that in the
> >   nutch-site.xml file?
> >
> >       Answer awaited anxiously.
> >
> > Bill Goffe <goffe@Oswego.EDU> wrote:
> >
> > Arun -
> >
> > I suspect others will mention this too, but see
> >
> > http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6
> >
> >           - Bill
> >
> >
> > > I want to crawl and index local system files. Is there any way to
> > > do this using Nutch? What do I need to do, and what configuration
> > > changes are required? I am very new to Nutch, so I need your help in
> > > this regard.
> > > Thanks in advance for a quick and good response.
> > >
> > >
> > > Regards,
> > >
> > > Arun Kumar Sharma (Tech Lead -Java/J2EE)
> > > Mob: +91.981.529.5761
> > --
> >          *------------------------------------------------------*
> >          | Bill Goffe                 goffe@oswego.edu          |
> >          | Department of Economics    voice: (315) 312-3444     |
> >          | SUNY Oswego                fax:   (315) 312-5444     |
> >          | 416 Mahar Hall                                       |
> >          | Oswego, NY  13126                                    |
> > *--------*------------------------------------------------------*-----------*
> > | "He's better about shaving his legs than I am. The pressure's on me to    |
> > | keep my legs smooth."                                                     |
> > |  -- Sheryl Crow, on her boyfriend Lance Armstrong. "Crow's Armstrong      |
> > |     Song: 'Make 'Em Suffer,'" July 15, 2005, CNN.com                      |
> > *---------------------------------------------------------------------------*
> >
> >
> >
> >
> >
> > Regards,
> >
> > Arun Kumar Sharma (Tech Lead -Java/J2EE)
> > Mob: +91.981.529.5761
> >
>
>
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
>
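
For readers following the thread, here is a rough Python sketch of how the
crawl-urlfilter.txt rules quoted above behave for file: URLs. It assumes the
documented behaviour of the regex URL filter (the first matching pattern
decides, and a URL that matches nothing is ignored). It is not Nutch's own
code. The helper names, the sample paths, and the "+^file:" rule in
LOCAL_RULES are hypothetical illustrations, not settings confirmed anywhere
in this thread.

```python
import re

# Rules as posted in the thread (comments dropped).
POSTED_RULES = r"""
-^(http|ftp|mailto|https):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
-[?*!@=]
+^http://([a-z0-9]*\.)*www.mysite.com/
-.
"""

# Hypothetical variant: the http accept rule is swapped for one that
# accepts file: URLs. This is an assumption made for illustration only.
LOCAL_RULES = r"""
-^(http|ftp|mailto|https):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
-[?*!@=]
+^file:
-.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose regex is found in the URL."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False  # no rule matched, so the URL is ignored


if __name__ == "__main__":
    posted = parse_rules(POSTED_RULES)
    local = parse_rules(LOCAL_RULES)
    for url in ("file:///home/arun/docs/report.html",
                "file:///home/arun/docs/logo.png",
                "http://www.mysite.com/index.html"):
        print(url, accepts(posted, url), accepts(local, url))
```

Under these assumptions the posted rules reject every file: URL, because the
only accept rule is for http:// and the final "-." catches everything else.
They also reject the http:// URL, since the first rule fires before the
accept rule is reached.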
