nutch-user mailing list archives

From Scott Lundgren <slundg...@qsfllc.com>
Subject Re: [MASSMAIL]Re: website structure discovery?
Date Mon, 30 Mar 2015 21:14:31 GMT
I’m using regex-urlfilter.txt not only to keep Nutch from leaving a site that’s in seed.txt
but also to keep Nutch focused on the URLs within the seed that I want it to crawl. For
example, seed.txt contains http://bizjournals.com/triangle; I want to crawl http://www.bizjournals.com/triangle/news
but not http://www.bizjournals.com/triangle/jobs/, http://www.bizjournals.com/triangle/calendar/
or http://www.bizjournals.com/triangle/people/.
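
For illustration, the include/exclude behaviour described above can be modelled in a few lines of Python. This is only a sketch of how Nutch's regex URL filter is generally understood to evaluate its rules (top to bottom, the first matching pattern decides, "+" accepts, "-" rejects, and a URL that matches nothing is dropped). The patterns themselves are assumptions derived from the example URLs, not a tested configuration, and the function name accepts() is hypothetical.

```python
import re

# Hypothetical rules in the same +/- style as a Nutch regex URL filter file.
# Order matters: the first pattern that matches a URL decides its fate.
RULES = [
    ("-", r"^https?://(www\.)?bizjournals\.com/triangle/(jobs|calendar|people)(/|$)"),
    ("+", r"^https?://(www\.)?bizjournals\.com/triangle/news(/|$)"),
    # The seed itself has to pass the filter, otherwise nothing is crawled.
    ("+", r"^https?://(www\.)?bizjournals\.com/triangle/?$"),
]
COMPILED = [(sign, re.compile(pattern)) for sign, pattern in RULES]


def accepts(url):
    """Return True if the first rule matching `url` is a '+' rule."""
    for sign, pattern in COMPILED:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched, so the URL is dropped


if __name__ == "__main__":
    for candidate in [
        "http://bizjournals.com/triangle",
        "http://www.bizjournals.com/triangle/news",
        "http://www.bizjournals.com/triangle/jobs/",
    ]:
        print(accepts(candidate), candidate)
```

Running candidate URLs through a model like this before editing the real filter file can catch a rule that is too broad or placed in the wrong order.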

Figuring out these regexes involves mousing over the links of a site in the Chrome browser and
a text-only browser. It’s a little time-consuming, and I have 200+ sites to set up. I’ll
try standing up a separate instance of Nutch plus the links-extractor and D3.js solution.

Scott Lundgren
Software Engineer
(704) 973-7388
slundgren@qsfllc.com

QuietStream Financial, LLC<http://www.quietstreamfinancial.com>
11121 Carmel Commons Boulevard | Suite 250
Charlotte, North Carolina 28226

Our Portfolio of Commercial Real Estate Solutions:
•        Commercial Defeasance<http://www.defeasewithease.com/> (Defease With Ease®)
•        Fairview Real Estate Solutions<http://www.fairviewres.com/>
•        Great River Mortgage Capital<http://www.greatrivermortgagecapital.com/>
•        Tax Credit Asset Management<http://www.tcamre.com/>
•        Radian Generation<http://www.radiangeneration.com/>
•        EntityKeeper<http://www.entitykeeper.com/>™
•        Crowd With Ease<http://www.crowdwithease.com>™
•        FullCapitalStack<http://www.fullcapitalstack.com>™
•        CrowdRabbit<http://www.crowdrabbit.com>™

On Mar 30, 2015, at 3:32 PM, Jorge Luis Betancourt González <jlbetancourt@uci.cu> wrote:

What are you using regex-urlfilter.txt for? What is your goal? To crawl only the websites you
are interested in, meaning not "leaving" your seed URLs? If the website design changes, as long
as the URLs stay the same this shouldn't be such a big deal.

By default Nutch doesn't index the link structure (inlinks & outlinks) of each page.
You can use [1], which will allow you to store this information in Solr/ES, although it only
works for Nutch 1.x. After this you can write a small application that generates what
you want. For instance, I've used [1] and d3.js to create some simple graphs of the link
structure of the crawled sites; this is not exactly what you want but can be a starting point.
I think a sitemap generator shouldn't be too hard to create from the indexed inlinks
& outlinks, or by pulling the data directly out of the info Nutch stores.

[1] https://github.com/jorgelbg/links-extractor
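
As a rough sketch of the "small application" idea above, the following Python turns per-page outlink records into the nodes/links shape commonly fed to d3.js force layouts. The record layout (a "url" field plus an "outlinks" list) and the function name to_d3_graph() are assumptions made for illustration; they are not the actual schema produced by [1], and the field names would need adjusting to whatever the index really stores.

```python
import json


def to_d3_graph(pages):
    """Convert {"url": ..., "outlinks": [...]} records into d3-style nodes and links."""
    index = {}  # url -> position of its node in `nodes`
    nodes, links = [], []

    def node_id(url):
        # Register each URL once and reuse its position afterwards.
        if url not in index:
            index[url] = len(nodes)
            nodes.append({"id": url})
        return index[url]

    for page in pages:
        source = node_id(page["url"])
        for target_url in page.get("outlinks", []):
            links.append({"source": source, "target": node_id(target_url)})
    return {"nodes": nodes, "links": links}


if __name__ == "__main__":
    # Hypothetical sample records standing in for documents read from Solr/ES.
    sample = [
        {"url": "http://example.com/", "outlinks": ["http://example.com/news"]},
        {"url": "http://example.com/news", "outlinks": ["http://example.com/"]},
    ]
    print(json.dumps(to_d3_graph(sample), indent=2))
```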

----- Original Message -----
From: "Scott Lundgren" <slundgren@qsfllc.com>
To: user@nutch.apache.org
Sent: Monday, March 30, 2015 12:48:40 PM
Subject: [MASSMAIL]Re: website structure discovery?

Sorta. I’m using Nutch to crawl and index very specific areas of content on a test website,
which has resulted in a highly crafted regex-urlfilter.txt file. The downside is that the process
is brittle: a website redesign breaks the setup. It’s also a slow process that I have to repeat for
each site, and eventually I want to be crawling & indexing several hundred specific sites.
So I need a way to index and “onboard” a new site in an automated way.

So I’m wondering whether Nutch is the best spider/tool to run through an entire site and
produce a visual graph or text representation of the site’s directory/URL structure
when a sitemap file is not available.
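
To make the "text representation of the directory/URL structure" concrete, here is a minimal Python sketch that nests a flat list of crawled URLs by path segment and prints an indented tree. It assumes the URLs have already been exported from the crawl by some means; the function names build_tree() and render() are hypothetical, and query strings and fragments are simply ignored.

```python
from urllib.parse import urlparse


def build_tree(urls):
    """Nest URL path segments into dicts: {host: {segment: {...}}}."""
    tree = {}
    for url in urls:
        parsed = urlparse(url)
        node = tree.setdefault(parsed.netloc, {})
        for segment in parsed.path.split("/"):
            if segment:  # skip the empty pieces around leading/trailing slashes
                node = node.setdefault(segment, {})
    return tree


def render(tree, depth=0):
    """Return the tree as indented text lines, sorted alphabetically."""
    lines = []
    for name in sorted(tree):
        lines.append("  " * depth + name)
        lines.extend(render(tree[name], depth + 1))
    return lines


if __name__ == "__main__":
    crawled = [
        "http://www.bizjournals.com/triangle/news",
        "http://www.bizjournals.com/triangle/jobs/",
        "http://www.bizjournals.com/triangle/calendar/",
    ]
    print("\n".join(render(build_tree(crawled))))
```

A tree like this also shows at a glance which top-level path segments exist on a site, which is the information needed when deciding what the URL filter should include or exclude.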

Scott Lundgren

On Mar 30, 2015, at 10:28 AM, Mattmann, Chris A (3980) <chris.a.mattmann@jpl.nasa.gov> wrote:

Hi Scott,

It’s a pretty good tool for that - it is a Web Crawler, which
is used to discover the web graph of a domain or of the entire
internet - from pages, to documents, to images, to other web
resources.

Nutch crawls, identifies URLs, fetches them, parses them, and
indexes them for search. It can do so in a scalable fashion to
grow with the size of what you are trying to discover.

Does that help?

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Scott Lundgren <slundgren@qsfllc.com>
Reply-To: "user@nutch.apache.org" <user@nutch.apache.org>
Date: Monday, March 30, 2015 at 5:56 AM
To: "user@nutch.apache.org" <user@nutch.apache.org>
Subject: website structure discovery?

If I want to crawl & learn the directory & information structure of a
website, is Nutch a good tool for this problem?
Would you recommend a different tool?

Scott Lundgren