hbase-user mailing list archives

From Dru Jensen <drujen...@gmail.com>
Subject Re: Unknown Scanner Exception
Date Tue, 12 Aug 2008 19:22:38 GMT
Hi Andy,

I am pulling HTML from different web pages and storing it in HBase.  I
tried to use Heritrix and Nutch, but they don't have Bigtable support
(yet), and I don't need to index; I just need to store the pages for
archiving purposes.

I think it would be an excellent example or use case since it seems  
I'm not the only one running into these issues.

My next big challenge is performance.  It took 18 hours to pull 8000
pages and the task never completed.  It launched 4 MR tasks.  I'm not
sure whether I got a lock on the table that wouldn't release or
something else happened.  I am going to add more logging and try to
track down what is causing the slowness.

I am storing the results in a column family in the same table I am  
scanning.   Maybe I should use a different table to store the results?

Is it better to commit during the reduce task or inside the map task?

In earlier versions, I was using the IdentityTableReducer, but if the
map task failed, I would lose all results up to that point, which
(after running for 18 hrs) made me want to change career paths.

Now I am getting an instance of HTable and committing directly in
the map task.  Although I no longer lose the work done before a
failure, I'm worried this isn't the best approach for performance.

Which way do you recommend?


On Aug 12, 2008, at 10:45 AM, Andrew Purtell wrote:

> Dru,
> My UnknownScannerException (USE) issues with TableMap were also related to HTTP
> transactions in the map taking too long. Might make for a useful
> design note. I'd be curious to know more details about what you
> are trying to accomplish if you are willing to share them...
>   - Andy
>> From: Dru Jensen <drujensen@gmail.com>
>> Subject: Re: Unknown Scanner Exception
>> To: hbase-user@hadoop.apache.org
>> Date: Tuesday, August 12, 2008, 10:00 AM
>> J-D and Andy,
>> This seems to solve the problem.  I thought I had set this
>> parameter before but realized I set the "master" lease time
>> instead of the "region server" lease time.
> [...]
>> The MR task makes http calls, so I also needed to set the
>> timeout on the call to make sure it doesn't take longer than
>> the ping back to the server.
> [...]
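
For readers hitting the same problem, here is a minimal sketch of bounding an HTTP call made from inside a map task so that it stays well under the scanner lease. It is not the code used in this thread: the class name BoundedFetch, the helper timeoutBudgetMillis, and the idea of spending only a fraction of the lease on one call are all assumptions made for illustration. The lease value itself would have to come from your own region server configuration. Only standard java.net calls are used.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper class, not part of HBase or Hadoop.
final class BoundedFetch {

    // Budget for one HTTP call: a fraction (0 < fraction < 1) of the lease period.
    // leaseMillis is assumed to mirror the region server lease setting.
    static int timeoutBudgetMillis(long leaseMillis, double fraction) {
        if (leaseMillis <= 0 || fraction <= 0.0 || fraction >= 1.0) {
            throw new IllegalArgumentException("need leaseMillis > 0 and 0 < fraction < 1");
        }
        return (int) Math.min(Integer.MAX_VALUE, (long) (leaseMillis * fraction));
    }

    // Fetch a page with connect and read timeouts taken from the budget.
    // Note: the read timeout bounds each blocking read, not the whole
    // download, so a slowly trickling server can still exceed the budget.
    static byte[] fetch(String url, int budgetMillis) throws IOException {
        int half = Math.max(1, budgetMillis / 2); // 0 would mean "wait forever"
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(half);
        conn.setReadTimeout(half);
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            conn.disconnect();
        }
    }
}
```

A map task could compute the budget once, call fetch for each row, and treat a java.net.SocketTimeoutException as a page to skip or retry later rather than letting the scanner lease expire.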
