lucene-solr-user mailing list archives

From Rafał Kuć <r....@solr.pl>
Subject Re: Website (crawler for) indexing
Date Wed, 05 Sep 2012 15:12:32 GMT
Hello!

You can implement your own crawler using Droids
(http://incubator.apache.org/droids/) or use Apache Nutch
(http://nutch.apache.org/), which is very easy to integrate with Solr
and is a very powerful crawler.
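
If you would rather roll your own, the core loop is small. Below is a rough
sketch using only the JDK, not the Droids or Nutch API. The class name
SimpleCrawler, the method names, the regex-based link extraction and the
pluggable fetcher are all assumptions made for illustration; a real crawler
would also need robots.txt handling, politeness delays and a proper HTML
parser.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Droids, Nutch or SolrJ.
final class SimpleCrawler {
    // Naive href extraction; fragments after '#' are dropped.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    /** Resolves every href found in the page against the page's own URL. */
    static List<URI> extractLinks(URI base, String html) {
        List<URI> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                links.add(base.resolve(m.group(1).trim()).normalize());
            } catch (IllegalArgumentException e) {
                // Malformed link: skip it.
            }
        }
        return links;
    }

    /**
     * Breadth-first crawl limited to the seed's host.
     * Returns the URLs that were attempted, in visiting order.
     * The fetcher returns the page body, or null if the page is unavailable.
     */
    static Set<URI> crawl(URI seed, Function<URI, String> fetcher, int maxPages) {
        Set<URI> visited = new LinkedHashSet<>();
        Deque<URI> frontier = new ArrayDeque<>();
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            URI url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already seen
            }
            String html = fetcher.apply(url);
            if (html == null) {
                continue;
            }
            // This is where the existing HTML-to-SolrJ code would be called.
            for (URI link : extractLinks(url, html)) {
                boolean sameHost = seed.getHost() != null
                    && seed.getHost().equals(link.getHost());
                if (sameHost && !visited.contains(link)) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }
}
```

The fetcher is passed in as a function so the HTTP client can be swapped
out, and so the loop can be tried against a map of in-memory pages before
it is pointed at a live site.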

-- 
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

> This may be a bit off topic: How do you index an existing website
> and control the data going into index?

> We already have Java code to process the HTML (or XHTML) and turn
> it into a SolrJ Document (removing tags and other things we do not
> want in the index). We use SolrJ for indexing.
> So I guess the question is essentially which Java crawler could be useful.

> We used to use wget on the command line in our publishing process, but we no longer want
> to do that.

> Thanks,
> Alexander

