nutch-user mailing list archives

From "Fuad Efendi" <f...@efendi.ca>
Subject RE: URL built by JavaScript Function - Can this be Crawled
Date Tue, 15 Sep 2009 00:29:20 GMT
Google has sitemaps instead... initially designed to help find such
dynamic URLs (not necessarily built by JavaScript; they could come from
form submission)

Evaluating JavaScript is extremely CPU-costly for crawlers (a crawler isn't
a personal computer, where you have a single JavaScript thread on a
dual-core machine!) - especially if you need to execute 1000s of "use cases"
(combinations of method parameters) in order to find all possible return
values...


Google may use some JavaScript emulation (sometimes!) in order to find
black-hat SEOs etc., and to evaluate the quality of some landing pages for
AdWords (do they use AdSense?) - but that is not Googlebot's job...


Just generate a 'sitemap' (a seed.txt file) for Nutch...
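
A minimal sketch of that idea, combined with Ken's suggestion (quoted below) to look for calls to the function explicitly. Everything site-specific here is an assumption for illustration: the function name goToPage, its two string arguments, and the example.com URL layout are hypothetical and must be replaced with whatever your JavaScript really does. The only Nutch-specific fact relied on is that a seed file is plain text with one URL per line.

```python
import re
from urllib.parse import urlencode

# Hypothetical: the site's JS helper is called as goToPage('section', 'id').
CALL_PATTERN = re.compile(
    r"""goToPage\(\s*['"]([^'"]*)['"]\s*,\s*['"]([^'"]*)['"]\s*\)"""
)

# Hypothetical URL layout; mirror whatever the real JS function builds.
BASE_URL = "http://www.example.com/view"


def build_url(section, item_id):
    """Rebuild in Python the URL the (assumed) JS function would produce."""
    return BASE_URL + "?" + urlencode({"section": section, "id": item_id})


def extract_seed_urls(html):
    """Return de-duplicated URLs, in first-seen order, one per call found."""
    urls = []
    for section, item_id in CALL_PATTERN.findall(html):
        url = build_url(section, item_id)
        if url not in urls:
            urls.append(url)
    return urls


def write_seed_file(urls, path):
    """Write one URL per line, the plain-text format a seed list uses."""
    with open(path, "w", encoding="utf-8") as f:
        for url in urls:
            f.write(url + "\n")


if __name__ == "__main__":
    sample = (
        "<a href=\"#\" onclick=\"goToPage('news', '42')\">News</a>\n"
        "<a href=\"#\" onclick=\"goToPage('docs', '7')\">Docs</a>\n"
        "<a href=\"#\" onclick=\"goToPage('news', '42')\">News again</a>\n"
    )
    write_seed_file(extract_seed_urls(sample), "seed.txt")
```

This only finds calls whose arguments are string literals in the page source; arguments computed at run time would still need the parameter values to be listed by hand or pulled from your database.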



> -----Original Message-----
> From: Mohamed Parvez [mailto:parvez@gmail.com]
> Sent: September-14-09 12:36 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: URL built by JavaScript Function - Can this be Crawled
> 
> Thanks Ken.
> If Google itself has not fully implemented JavaScript analysis/execution
> for crawling, I am going to stay away from it and look for an alternate
> solution.
> 
> Thanks/Regards,
> Parvez
> 
> 
> 
> On Mon, Sep 14, 2009 at 11:15 AM, Ken Krugler
> <kkrugler_lists@transpac.com> wrote:
> 
> > JavaScript code that creates dynamic URLs is always a problem for web
> > crawlers.
> >
> > Most web sites try to make their content crawlable by creating
> > alternative static links to the content.
> >
> > I think Google now does some analysis/execution of JS code, but it's a
> > tricky problem.
> >
> > I would suggest modifying the HTML parser to explicitly look for calls
> > being made to your function, and generate appropriate outlinks.
> >
> > -- Ken
> >
> >
> >
> > On Sep 14, 2009, at 8:04am, Mohamed Parvez wrote:
> >
> >> Can anyone please throw some light on this?
> >>
> >> Thanks/Regards,
> >> Parvez
> >>
> >>
> >> On Fri, Sep 11, 2009 at 3:23 PM, Mohamed Parvez <parvez@gmail.com>
> >> wrote:
> >>
> >>> We have a JavaScript function that takes some params, builds a URL,
> >>> and then uses window.location to send the user to that URL.
> >>>
> >>> Our website uses this feature a lot, and most of the URLs are built
> >>> using this function.
> >>>
> >>> I am trying to crawl using Nutch, and I am also using the parse-js
> >>> plugin.
> >>>
> >>> But it does not look like Nutch is able to crawl these URLs.
> >>>
> >>> Am I doing something wrong, or is Nutch not able to crawl URLs built
> >>> by a JavaScript function?
> >>>
> >>> ----
> >>> Thanks/Regards,
> >>> Parvez
> >>>
> >>>
> >>>
> > --------------------------
> > Ken Krugler
> > TransPac Software, Inc.
> > <http://www.transpac.com>
> > +1 530-210-6378
> >
> >


