lucene-solr-user mailing list archives

From "Markus.Mirsberger" <markus.mirsber...@gmx.de>
Subject Re: Crawl Anywhere -
Date Mon, 11 Feb 2013 07:03:17 GMT
Hi,

did you try Heritrix?

The documents are stored as HTML inside a WARC file, which can be
postprocessed easily.
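
As a rough idea of what that postprocessing could look like, here is a minimal Python sketch that pulls the HTML payloads out of a WARC file using only the standard library. It is not part of Heritrix; the function names (iter_warc_records, extract_html) are made up for this example, and it assumes well-formed records with a Content-Length header and HTTP bodies that are neither chunked nor compressed.

```python
import gzip


def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("not a WARC record header: %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the WARC header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Assumption: every record carries a valid Content-Length header.
        body = stream.read(int(headers["content-length"]))
        yield headers, body


def extract_html(path):
    """Yield (url, payload_bytes) for every HTTP response record in the file."""
    # .warc.gz files are read through gzip, which handles one gzip member
    # per record transparently.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as f:
        for headers, body in iter_warc_records(f):
            if headers.get("warc-type") != "response":
                continue  # skip warcinfo, request, metadata records
            # The record body is the raw HTTP response: status line and
            # headers, a blank line, then the payload.
            _, sep, payload = body.partition(b"\r\n\r\n")
            if sep:
                yield headers.get("warc-target-uri"), payload
```

Looping over extract_html("crawl.warc.gz") (a placeholder file name) would give you each URL together with its raw page bytes, which you can then write out as individual .html files. The sketch does not check the Content-Type, so images and other non-HTML responses come through as well; for real crawls a dedicated WARC library is probably the safer choice.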


Cheers,
Markus


On 11.02.2013 12:16, SivaKarthik wrote:
> Dear Erick,
>     Thanks for your reply.
>     Yes, Nutch can meet my requirement,
>    but the problem is that I want to store the crawled documents in HTML or XML
> format instead of the MapReduce format.
>    I am not sure whether Nutch plugins are available to convert them into XML files.
>    Please share any ideas you have.
>
> Thank you
>
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/ANNOUNCE-Web-Crawler-tp2607831p4039619.html
> Sent from the Solr - User mailing list archive at Nabble.com.

