lucene-solr-user mailing list archives

From "Lochschmied, Alexander" <>
Subject Website (crawler for) indexing
Date Wed, 05 Sep 2012 15:05:15 GMT
This may be a bit off topic: how do you index an existing website and control what data goes
into the index?

We already have Java code to process the HTML (or XHTML) and turn it into a SolrJ Document
(removing tags and other things we do not want in the index). We use SolrJ for indexing.
So I guess the question is essentially which Java crawler could be useful.
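
To make the question concrete, here is a minimal sketch of the kind of crawl loop we have in mind, using only the JDK. Everything in it is an assumption for illustration: the class name `SiteCrawler`, the `fetcher` callback (which in real use would do an HTTP GET), the `maxPages` cap, and the naive regex link extraction. Each returned page would then be handed to our existing HTML-to-SolrJ code.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of SolrJ or any existing crawler library.
class SiteCrawler {
    // Naive href matcher; a real HTML parser would be more robust.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the links found in html, resolved against base. */
    static Set<URI> extractLinks(String html, URI base) {
        Set<URI> links = new LinkedHashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                links.add(base.resolve(m.group(1).trim()).normalize());
            } catch (IllegalArgumentException e) {
                // Skip hrefs that are not valid URIs.
            }
        }
        return links;
    }

    /** Breadth-first crawl limited to the start URI's host; returns URI -> page body. */
    static Map<URI, String> crawl(URI start, Function<URI, String> fetcher, int maxPages) {
        Map<URI, String> pages = new LinkedHashMap<>();
        Set<URI> seen = new LinkedHashSet<>();
        Deque<URI> queue = new ArrayDeque<>();
        queue.add(start);
        seen.add(start);
        while (!queue.isEmpty() && pages.size() < maxPages) {
            URI current = queue.poll();
            String html = fetcher.apply(current);
            if (html == null) {
                continue; // fetch failed or page missing
            }
            pages.put(current, html);
            for (URI link : extractLinks(html, current)) {
                boolean sameHost = start.getHost() != null
                    && start.getHost().equals(link.getHost());
                if (sameHost && seen.add(link)) {
                    queue.add(link); // enqueue each same-host page only once
                }
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a website, so the sketch runs without network access.
        Map<String, String> site = Map.of(
            "http://example.com/",
            "<a href=\"/a.html\">A</a> <a href='http://other.org/'>X</a>",
            "http://example.com/a.html",
            "<a href=\"/\">home</a>");
        Map<URI, String> pages = crawl(URI.create("http://example.com/"),
            uri -> site.get(uri.toString()), 100);
        pages.keySet().forEach(System.out::println);
    }
}
```

Passing the fetcher in as a function keeps the crawl logic separate from HTTP handling, so the same loop could sit in front of our existing document-building code. A sketch like this ignores robots.txt, redirects, and politeness delays, which is partly why we are asking about an existing Java crawler.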

We used to use wget on the command line in our publishing process, but we no longer want to
do that.

