lucene-solr-user mailing list archives

From Rafał Kuć <r....@solr.pl>
Subject Re: AW: Website (crawler for) indexing
Date Thu, 06 Sep 2012 14:20:51 GMT
Hello!

I think that really depends on what you want to achieve and what parts
of your current system you would like to reuse. If it is only HTML
processing I would let Nutch and Solr do that. Of course you can
extend Nutch (it has a plugin API) and implement the custom logic you
need as a Nutch plugin. There is even an example Nutch plugin
available (http://wiki.apache.org/nutch/WritingPluginExample), but it's
for Nutch 1.3.
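
To give an idea of how small the custom HTML-processing step can be, here is a minimal sketch in plain Java. It is not Nutch or SolrJ code: the class name HtmlToFields and the field names "id", "title" and "content" are assumptions made up for illustration, and the regex-based tag stripping is only a stand-in for whatever cleanup your existing code already does. It uses only the JDK, so it can be tried on its own.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns a fetched HTML page into a flat map of index fields.
public class HtmlToFields {
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern SCRIPT_STYLE = Pattern.compile(
            "<(script|style)[^>]*>.*?</\\1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Field names ("id", "title", "content") are assumptions, not a fixed schema.
    public static Map<String, String> toFields(String url, String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", url);

        // Keep the page title as its own field when one is present.
        Matcher title = TITLE.matcher(html);
        if (title.find()) {
            fields.put("title", title.group(1).trim());
        }

        // Drop script/style blocks and the title element, then strip remaining tags.
        String text = SCRIPT_STYLE.matcher(html).replaceAll(" ");
        text = TITLE.matcher(text).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");

        // Collapse runs of whitespace left behind by the removed markup.
        fields.put("content", text.replaceAll("\\s+", " ").trim());
        return fields;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Hi</title></head>"
                + "<body><p>Hello <b>world</b></p></body></html>";
        System.out.println(toFields("http://example.com/", html));
    }
}
```

In a SolrJ-based indexer, each map entry would typically be copied into a SolrInputDocument with addField before the document is sent to Solr; a real HTML parser is a better choice than regular expressions for anything beyond simple pages.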

-- 
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

> Thanks Rafał and Markus for your comments.

> I think Droids has a serious problem with URL parameters in the
> current version (0.2.0) from Maven Central:
> https://issues.apache.org/jira/browse/DROIDS-144

> I knew about Nutch, but I haven't been able to implement a crawler
> with it. Have you done that or seen an example application?
> It's probably easy to call a Nutch jar and make it index a website, and maybe I will have
> to do that.
> But as we already have a Java implementation to index other
> sources, it would be nice if we could integrate the crawling part too.

> Regards,
> Alexander 

> ------------------------------------

> Hello!

> You can implement your own crawler using Droids
> (http://incubator.apache.org/droids/) or use Apache Nutch
> (http://nutch.apache.org/), which is very easy to integrate with
> Solr and is a very powerful crawler.

> --
> Regards,
>  Rafał Kuć
>  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

>> This may be a bit off topic: How do you index an existing website and
>> control the data going into the index?

>> We already have Java code to process the HTML (or XHTML) and turn it 
>> into a SolrJ Document (removing tags and other things we do not want 
>> in the index). We use SolrJ for indexing.
>> So I guess the question is essentially which Java crawler could be useful.

>> We used to use wget on the command line in our publishing process, but we no longer
>> want to do that.

>> Thanks,
>> Alexander

