nutch-user mailing list archives

From Alexander Aristov <alexander.aris...@gmail.com>
Subject Re: Will Solr/Nutch crawl multi websites (aka a mini google with faceted search)?
Date Tue, 13 Sep 2011 08:56:25 GMT
Hi

I would start from a different angle. Crawling such sites is not an easy
task, and your parser would have to be very smart.

I would first investigate whether your target web sites have a public API
that could be used to run searches, and then aggregate the results into one
set.
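
To illustrate the aggregation idea, here is a minimal sketch in Python. It
assumes each site's API has already returned a list of products; the site
names, the Product fields and the price bucket width are all made up for
illustration, and no real merchant API is called.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    merchant: str   # hypothetical source name
    title: str
    price: float


def aggregate(*result_sets):
    """Merge per-merchant result lists into one list sorted by price."""
    merged = [p for results in result_sets for p in results]
    return sorted(merged, key=lambda p: p.price)


def price_facets(products, width=100):
    """Count products per price bucket, keyed like '100-200'."""
    counts = Counter()
    for p in products:
        low = int(p.price // width) * width
        counts[f"{low}-{low + width}"] += 1
    return dict(counts)


# Made-up results standing in for responses from three site APIs.
site_a = [Product("site-a", "Camera X", 149.0)]
site_b = [Product("site-b", "Camera Y", 249.0),
          Product("site-b", "Camera Z", 199.99)]
site_c = [Product("site-c", "Camera W", 120.5)]

results = aggregate(site_a, site_b, site_c)
facets = price_facets(results)
```

Once the merged set is indexed in Solr, the facet counts would normally come
from Solr itself rather than from code like price_facets; the function is
only there to show what a price facet amounts to.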

Best Regards
Alexander Aristov


On 12 September 2011 16:28, Markus Jelsma <markus.jelsma@openindex.io> wrote:

> I don't know much about alternative pieces of software. I do know that
> making parse plugins in Nutch is quite easy and flexible with full access
> to the DOM.
>
> On Monday 12 September 2011 14:15:49 dpt9876 wrote:
> > Ok nice. So it's possible. Do you think this is a better method than
> > scraping with an alternative tool? It seems to me that it is, in that it
> > will work better with my end state, Solr faceted search, and I can remove
> > layers of complexity.
> >
> > On Sep 12, 2011 8:03 PM, "Markus Jelsma-2 [via Lucene]"
> > <ml-node+s472066n3329431h0@n3.nabble.com> wrote:
> > > Yes you can. As Ken replied in your Solr thread you must create custom
> > > parse and indexing filters. The parse filter is needed to extract the
> > > information and store it in the document and the index filter is used to
> > > pass that new information to the Solr index.
> > >
> > > On Monday 12 September 2011 12:55:49 dpt9876 wrote:
> > >> Hi, the friendly guys at the Solr user group pointed me here.
> > >>
> > >> I am wondering if Nutch/Solr will do the following for a project I am
> > >> working on.
> > >> I want to create a search engine with facets for potentially hundreds of
> > >> websites.
> > >> Similar to, say, crawling amazon + buy.com + ebay so that someone can
> > >> search these 3 sites from my 1 website.
> > >> (I realise there are better ways of doing the above example, it's for
> > >> illustrative purposes).
> > >> Eventually I would build that search crawl to index say 200 or 1000
> > >> merchants.
> > >> Someone would come to my site and search for "digital camera".
> > >>
> > >> They would get results from all 3 indexes and hopefully dynamic facets,
> > >> e.g.
> > >>   Price $100-200
> > >>   Price $200-300
> > >>   Resolution 1mp-2mp
> > >>
> > >> etc etc
> > >>
> > >> Can this be done on the fly?
> > >>
> > >> I ask this because I am currently developing webscrapers to crawl these
> > >> websites, dump that data into a db, then was thinking of tacking on a
> > >> solr server to crawl my db.
> > >>
> > >> Problem with that approach is that crawling the world's ecommerce sites
> > >> will take forever, when it seems solr might do that for me? (I have read
> > >> about multiple indexes etc).
> > >>
> > >> Many thanks
> > >>
> > >> --
> > >> View this message in context:
> > >> http://lucene.472066.n3.nabble.com/Will-Solr-Nutch-crawl-multi-websites-aka-a-mini-google-with-faceted-search-tp3329346p3329346.html
> > >> Sent from the Nutch - User mailing list archive at Nabble.com.
> > >
> > > --
> > > Markus Jelsma - CTO - Openindex
> > > http://www.linkedin.com/in/markus17
> > > 050-8536620 / 06-50258350
> > >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Will-Solr-Nutch-crawl-multi-websites-aka-a-mini-google-with-faceted-search-tp3329346p3329454.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>
