lucene-solr-user mailing list archives

From: Dominique Bejean
Subject: Re: Website (crawler for) indexing
Date: Fri, 07 Sep 2012 15:26:27 GMT
Maybe you can take a look at Crawl-Anywhere, which has an administration
web interface, a Solr indexer, and a search web application.



On 05/09/12 17:05, Lochschmied, Alexander wrote:
> This may be a bit off topic: How do you index an existing website and control the data
> going into the index?
> We already have Java code to process the HTML (or XHTML) and turn it into a SolrJ Document
> (removing tags and other things we do not want in the index). We use SolrJ for indexing.
> So I guess the question is essentially which Java crawler could be useful.
> We used to use wget on the command line in our publishing process, but we no longer want
> to do that.
> Thanks,
> Alexander
