lucene-solr-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject RE: Which open-source crawler to use with SolrJ and Postgresql ?
Date Tue, 16 Feb 2016 19:45:30 GMT
Hello - Nutch 1.x is much more feature-rich than 2.x, but both can handle tremendously large crawls with
ease. Of the others mentioned, I have only tried ManifoldCF, which is very good at retrieving
data from shared file systems and repositories like FileNet.

We use Nutch 1.x for most of our crawls, small and large, and actively create issues and commits for it.
Nutch 2.x is fun, though, if your primary data store is not a Hadoop sequence file but
any store supported by Apache Gora, which has matured and stabilized a lot.

Markus

 
 
-----Original message-----
> From:Davis, Daniel (NIH/NLM) [C] <daniel.davis@nih.gov>
> Sent: Tuesday 16th February 2016 17:08
> To: solr-user@lucene.apache.org
> Subject: RE: Which open-source crawler to use with SolrJ and Postgresql ?
> 
> I'm far, far from an expert on this sort of thing, but my personal experience one year
ago was that Nutch-1 was easier to use, and the blog post I link below suggests that the abstraction
layer in Nutch-2 really costs some time. I expect that Nutch-2 has matured some since then,
but going with Nutch-1 is not a bad choice.
> 
> http://digitalpebble.blogspot.com/2013/09/nutch-fight-17-vs-221.html
> 
> There are other dogs in this fight, as shown by the SolrEcosystem wiki page:
> 
> https://wiki.apache.org/solr/SolrEcosystem
> 
> - Apache ManifoldCF has a crawler for web pages and a GUI to configure and start things
that must be done by hand for Nutch (unless there is a front-end I don't know about). Web
crawling is not the primary reason ManifoldCF exists.
> - Heritrix is a good crawler, dedicated to handling broad and incremental crawling well.
> - Norconex Collectors is sort of a toolkit for building such crawlers.
> - Aspire (by Search Technologies) seems a bit complex, but has a web crawler. Again,
it's more of a toolkit for building such crawlers.
> 
> I sure wish I knew which one to go with ;)
> 
> Dan Davis, Systems/Applications Architect (Contractor),
> Office of Computer and Communications Systems,
> National Library of Medicine, NIH
> 
> 
> 
> -----Original Message-----
> From: Emir Arnautovic [mailto:emir.arnautovic@sematext.com] 
> Sent: Tuesday, February 16, 2016 10:58 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Which open-source crawler to use with SolrJ and Postgresql ?
> 
> Markus,
> The ticket I ran into is for Nutch 2, and NUTCH-2197 is for Nutch 1.
> 
> I haven't been using Nutch for a while, so I cannot recommend a version.
> 
> Thanks,
> Emir
> 
> On 16.02.2016 16:37, Markus Jelsma wrote:
> > Nutch has Solr 5 cloud support in trunk; I committed it earlier this month.
> > https://issues.apache.org/jira/browse/NUTCH-2197
> >
> > Markus
> >   
> > -----Original message-----
> >> From:Emir Arnautovic <emir.arnautovic@sematext.com>
> >> Sent: Tuesday 16th February 2016 16:26
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Which open-source crawler to use with SolrJ and Postgresql ?
> >>
> >> Hi,
> >> It is most common to use Nutch as the crawler, but it seems that it still
> >> does not have support for SolrCloud (if I am reading this ticket
> >> correctly: https://issues.apache.org/jira/browse/NUTCH-1662). Anyway,
> >> I would recommend Nutch with the standard HTTP client.
> >>
> >> Regards,
> >> Emir
> >>
> >> On 16.02.2016 16:02, Victor D'agostino wrote:
> >>> Hi
> >>>
> >>> I am building a Solr 5 architecture with 3 Solr nodes and 1 zookeeper.
> >>> The database backend is postgresql 9 on RHEL 6.
> >>>
> >>> I am looking for a free open-source crawler which uses SolrJ.
> >>>
> >>> What do you guys recommend ?
> >>>
> >>> Best regards
> >>> Victor d'Agostino
> >>>
> >>>
> >>> 
> >>> ________________
> >>> This message and any attached documents may contain confidential
> >>> information. If you are not the intended recipient, please delete
> >>> it and notify the sender immediately. Any use of this message
> >>> not in accordance with its purpose, and any distribution or
> >>> publication, in whole or in part and by whatever means, is strictly
> >>> prohibited. Since communications over the internet are not secure,
> >>> the integrity of this message is not guaranteed, and the sending
> >>> company cannot be held responsible for its content.
> >> --
> >> Monitoring * Alerting * Anomaly Detection * Centralized Log 
> >> Management Solr & Elasticsearch Support * http://sematext.com/
> >>
> >>
> 
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr & Elasticsearch
Support * http://sematext.com/
> 
> 
