hadoop-common-user mailing list archives

From "Doug Judd" <d...@zvents.com>
Subject Heritrix-Hadoop connector (hdfs-writer-processor)
Date Fri, 26 Jan 2007 01:23:55 GMT

I've written an extension to the Internet Archive's open source "Heritrix"
crawler that lets it write into HDFS in SequenceFile format.  The key
is the URL and the value is the HTTP response with some additional
metadata.  It's quite simple to use: just drop a few jar files into
the Heritrix lib/ directory and you're good to go.  Here's a link to the
download page:  http://www.zvents.com/labs/hdfs_writer_processor .  For
those of you who are interested, give it a whirl and feel free to send me

- Doug Judd
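
The key/value layout described in the message (URL as key, HTTP response plus
some metadata as value) can be pictured with a small sketch. This is only an
illustration in plain Python, not the processor's code and not the Hadoop
SequenceFile binary format: the CrawlRecord class, the "fetch-time" metadata
field, the sample values, and the choice to place metadata lines ahead of the
response are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class CrawlRecord:
    """Hypothetical stand-in for one key/value pair (not the real layout)."""
    url: str                                   # key: the crawled URL
    http_response: bytes                       # raw HTTP response bytes
    metadata: Dict[str, str] = field(default_factory=dict)  # assumed extras

    def value(self) -> bytes:
        # Assumed layout: metadata lines, a blank line, then the response.
        meta = "".join(f"{k}: {v}\r\n" for k, v in sorted(self.metadata.items()))
        return meta.encode("utf-8") + b"\r\n" + self.http_response


def to_pairs(records: List[CrawlRecord]) -> Iterator[Tuple[str, bytes]]:
    """Yield (key, value) pairs in the order the records were crawled."""
    for record in records:
        yield record.url, record.value()


# Made-up sample record, purely for illustration.
record = CrawlRecord(
    url="http://example.com/",
    http_response=b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nhi",
    metadata={"fetch-time": "2007-01-26T01:23:55Z"},
)
pairs = list(to_pairs([record]))
```

Keying each record by URL means a downstream job can group or look up pages by
address without parsing the response itself.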
