lucene-solr-user mailing list archives

From "Lochschmied, Alexander" <Alexander.Lochschm...@vishay.com>
Subject Re: Website (crawler for) indexing
Date Thu, 06 Sep 2012 13:59:37 GMT
Thanks Rafał and Markus for your comments.

I think Droids has a serious problem with URL parameters in the current version (0.2.0) from
Maven Central:
https://issues.apache.org/jira/browse/DROIDS-144

I knew about Nutch, but I haven't been able to implement a crawler with it. Have you done
that or seen an example application?
It's probably easy to call the Nutch jar and make it index a website, and maybe I will have
to do that.
But as we already have a Java implementation to index other sources, it would be nice if we
could integrate the crawling part too.
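
To illustrate what I mean, here is a rough, untested sketch of such an integrated crawl.
jsoup stands in for our existing HTML processing, and the Solr URL, field names and seed
URL are placeholders:

  import java.io.IOException;
  import java.util.ArrayDeque;
  import java.util.HashSet;
  import java.util.Queue;
  import java.util.Set;

  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;
  import org.jsoup.Jsoup;
  import org.jsoup.nodes.Document;
  import org.jsoup.nodes.Element;

  public class SimpleSiteCrawler {
      public static void main(String[] args) throws Exception {
          // Placeholders: adjust the Solr URL and the seed/host filter.
          HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
          String seed = "http://www.example.com/";

          Queue<String> frontier = new ArrayDeque<String>();
          Set<String> seen = new HashSet<String>();
          frontier.add(seed);

          // Crude breadth-first crawl, capped at 100 pages.
          while (!frontier.isEmpty() && seen.size() < 100) {
              String url = frontier.poll();
              if (!seen.add(url)) continue; // already visited

              Document page;
              try {
                  page = Jsoup.connect(url).get();
              } catch (IOException e) {
                  continue; // skip pages that fail to fetch or are not HTML
              }

              // Build the Solr document; our real tag-stripping code
              // would go here instead of page.body().text().
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", url);
              doc.addField("title", page.title());
              doc.addField("content", page.body().text());
              solr.add(doc);

              // Follow only links that stay on the same site.
              for (Element link : page.select("a[href]")) {
                  String next = link.attr("abs:href");
                  if (next.startsWith(seed)) {
                      frontier.add(next);
                  }
              }
          }
          solr.commit();
          solr.shutdown();
      }
  }

A real crawler would of course also need politeness delays, robots.txt handling and
duplicate-content detection, but this is the shape of the integration I am after.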

Regards,
Alexander 

------------------------------------

Hello!

You can implement your own crawler using Droids
(http://incubator.apache.org/droids/) or use Apache Nutch (http://nutch.apache.org/), which
is very easy to integrate with Solr and is a very powerful crawler.
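
For example, with a recent Nutch 1.x release a whole crawl plus Solr indexing can be a
single command (the seed directory, depth and Solr URL below are placeholders to adjust):

  bin/nutch crawl urls -dir crawl -depth 3 -topN 50 -solr http://localhost:8983/solr/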

--
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

> This may be a bit off topic: How do you index an existing website and 
> control the data going into the index?

> We already have Java code to process the HTML (or XHTML) and turn it 
> into a SolrJ Document (removing tags and other things we do not want 
> in the index). We use SolrJ for indexing.
> So I guess the question is essentially which Java crawler could be useful.

> We used to use wget on the command line in our publishing process, but we no longer
> want to do that.

> Thanks,
> Alexander
