hbase-user mailing list archives

From stack <st...@duboce.net>
Subject Re: HBase performance tuning
Date Fri, 28 Mar 2008 06:24:14 GMT
Goel, Ankur wrote:
>  ...
> I'll check and let you know if the code can be contributed.
> Once I get a green, I'll make some modifications to make it
> more generic and share with you folks to understand how we can
> improve it further before posting.

A while back, I had a go at making such a Writer: see 
http://www.duboce.net/~stack/hbase-writer.tgz.  It's old and probably won't 
work w/ current hbase -- I haven't tried it -- and it's for the Heritrix 1.x 
generation, but it shouldn't be hard to update.  When I left it, I was 
trying to mavenize it and was going to put the needed jars -- hadoop, etc. -- 
up on the Archive's build box.  Publishing such a Writer is a little 
awkward given the different licenses.  Having maven pull the jars seemed 
like one way of working within the constraints imposed by licensing 
(the Archive is apparently moving toward Apache licensing, which should 
alleviate at least the above issue).


> Thanks
> -Ankur
> -----Original Message-----
> From: stack [mailto:stack@duboce.net] 
> Sent: Thursday, March 27, 2008 10:08 PM
> To: hbase-user@hadoop.apache.org
> Subject: Re: HBase performance tuning
> I have some familiarity with that crawler.
> Tell us more about your writer.   Is it proprietary?  If not, can we get
> it into a place where others could use it if wanted?
> Thanks,
> St.Ack
> Goel, Ankur wrote:
>> I am crawling the web indeed, but only the sites that are present in 
>> my seedlist. The crawler used here is heritrix 2.0 - 
>> http://webteam.archive.org/confluence/display/Heritrix/2.0.0.
>> I developed a Heritrix specific HBase writer that can be integrated 
>> with Heritrix to write the crawled content directly into HBase.
>> -Ankur
