hbase-user mailing list archives

From Bryan Keller <brya...@gmail.com>
Subject Task tracker timeout with filtered table scan
Date Thu, 31 May 2012 16:27:26 GMT
I have a large table that I am running a map reduce job on. The job scans for a particular
column value in the table using a TableInputFormat with a filter on the scan. This value only
matches a few rows, so most of the rows are filtered out.

The problem is that the TableInputFormat will not report status back to the task tracker
until the regionserver sends back a row matching the filter. If there are only a few matching
rows and the table is very large, it can take a long time for a row to come back from the regionserver.
This can result in a task tracker timeout. The problem is exacerbated by large region file
sizes.

I can sort of work around this by increasing the mapred.task.timeout property, but that doesn't
seem like a good fix. The other solution would be to not use a filter and instead filter out rows
in the MapReduce job, which would increase I/O. Are there any other solutions? It seems the TableInputFormat
shouldn't have to wait for the regionserver before reporting status back to the task tracker.
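
To make the timing concrete, here is a rough model in plain Java. It is not HBase or Hadoop code: the class ScanGapModel, the method maxGapMs, the per-row cost, and the row counts are all made up for illustration, and the 600000 ms timeout is assumed to be the stock mapred.task.timeout value. It only simulates the longest stretch without a status report when rows are dropped on the regionserver versus when every row is handed to the mapper.

```java
import java.util.function.LongPredicate;

// Hypothetical model, not HBase API: simulates how long a task goes
// without reporting status during a filtered scan.
final class ScanGapModel {
    // Assumed default of mapred.task.timeout, in milliseconds.
    static final long TASK_TIMEOUT_MS = 600_000L;

    // Returns the longest simulated interval (ms) between status reports.
    // A report is assumed to happen only when a row reaches the client.
    static long maxGapMs(long rows, long msPerRow,
                         LongPredicate matches, boolean serverSideFilter) {
        long sinceLastReport = 0;
        long maxGap = 0;
        for (long row = 0; row < rows; row++) {
            sinceLastReport += msPerRow; // time spent reading this row
            // With a server-side filter only matching rows come back;
            // without one, every row reaches the mapper.
            boolean rowReachesClient = !serverSideFilter || matches.test(row);
            if (rowReachesClient) {
                maxGap = Math.max(maxGap, sinceLastReport);
                sinceLastReport = 0;
            }
        }
        // Count the tail of the scan after the last returned row.
        return Math.max(maxGap, sinceLastReport);
    }

    public static void main(String[] args) {
        // Made-up numbers: one million rows, 1 ms each, one match at the end.
        LongPredicate rareMatch = row -> row == 999_999L;
        long filtered = maxGapMs(1_000_000L, 1L, rareMatch, true);
        long unfiltered = maxGapMs(1_000_000L, 1L, rareMatch, false);
        System.out.println("server-side filter gap (ms): " + filtered
                + ", exceeds timeout: " + (filtered > TASK_TIMEOUT_MS));
        System.out.println("mapper-side filter gap (ms): " + unfiltered
                + ", exceeds timeout: " + (unfiltered > TASK_TIMEOUT_MS));
    }
}
```

In this model the mapper-side variant never goes long without a report, but every row has to be shipped to the task, which is the extra I/O mentioned above.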
