hbase-issues mailing list archives

From "Lukas Nalezenec (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-10413) Tablesplit.getLength returns 0
Date Mon, 03 Feb 2014 16:24:09 GMT

    [ https://issues.apache.org/jira/browse/HBASE-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889593#comment-13889593 ]

Lukas Nalezenec commented on HBASE-10413:

I made big changes to the code.
You can check it and discuss it at https://github.com/apache/hbase/pull/8/files .

I have to write unit tests before making the patch.

- I need help with the unit tests. Is there a simple unit test helper/utility I can use? I
need to create a table with some regions and then work with their sizes. It should run locally,
and there should be some level of abstraction.

- I have added a configuration option for disabling this feature:
  Is there a policy about new configuration options?
  Should I move the configuration key constant somewhere else?
  Should the feature be disabled or enabled by default?

- Computation of region sizes might be slow. We might need some parallelization.
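
One possible shape for the parallelization mentioned in the last point, as a minimal sketch only. ParallelRegionSizes and the sizeOfRegion callback are hypothetical names and not part of HBase; the callback stands in for whatever slow per-region lookup the patch ends up using.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToLongFunction;

// Hypothetical helper, not part of HBase: computes one size per region in parallel.
class ParallelRegionSizes {

  // sizeOfRegion is an assumption: it stands in for the slow per-region size lookup.
  static Map<String, Long> compute(List<String> regionNames,
                                   ToLongFunction<String> sizeOfRegion,
                                   int threads)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      // Submit one task per region and keep the futures in input order.
      List<Future<Long>> futures = new ArrayList<>();
      for (String region : regionNames) {
        Callable<Long> task = () -> sizeOfRegion.applyAsLong(region);
        futures.add(pool.submit(task));
      }
      // Collect results; get() blocks until the matching task has finished.
      Map<String, Long> sizes = new LinkedHashMap<>();
      for (int i = 0; i < regionNames.size(); i++) {
        sizes.put(regionNames.get(i), futures.get(i).get());
      }
      return sizes;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    // Fake lookup for demonstration only: pretend the size is the name length.
    Map<String, Long> sizes =
        compute(Arrays.asList("region-a", "region-bb"), name -> name.length(), 2);
    System.out.println(sizes);
  }
}
```

A fixed-size pool keeps the number of concurrent lookups bounded, which matters if each lookup talks to a shared service.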

from mail:
+  public void setLength(long length) {
This method in TableSplit can be package private.

I think that a lot of people use TableSplit in their custom input formats. IMHO this method
should be part of the API.
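
To illustrate why a public setter could be useful, here is a minimal sketch of a custom input format filling in a length and then ordering its splits by it. SizedSplit and SplitOrdering are hypothetical stand-ins, not HBase classes, and the largest-first ordering is an assumption about how a scheduler might use the value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a split; this is NOT HBase's TableSplit.
class SizedSplit {
  private final String regionName;
  private long length; // estimated size in bytes; 0 means unknown

  SizedSplit(String regionName) {
    this.regionName = regionName;
  }

  // Public so that code outside the package (a custom input format) can set it.
  public void setLength(long length) {
    this.length = length;
  }

  public long getLength() {
    return length;
  }

  public String getRegionName() {
    return regionName;
  }
}

class SplitOrdering {
  // Returns a new list ordered largest first; the input list is not modified.
  static List<SizedSplit> largestFirst(List<SizedSplit> splits) {
    List<SizedSplit> sorted = new ArrayList<>(splits);
    sorted.sort(Comparator.comparingLong(SizedSplit::getLength).reversed());
    return sorted;
  }

  public static void main(String[] args) {
    SizedSplit small = new SizedSplit("small");
    small.setLength(10L);
    SizedSplit large = new SizedSplit("large");
    large.setLength(300L);
    for (SizedSplit s : largestFirst(Arrays.asList(small, large))) {
      System.out.println(s.getRegionName() + " " + s.getLength());
    }
  }
}
```

If every split reports 0, the sort leaves them in their original order, so a large region can end up being processed last.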

> Tablesplit.getLength returns 0
> ------------------------------
>                 Key: HBASE-10413
>                 URL: https://issues.apache.org/jira/browse/HBASE-10413
>             Project: HBase
>          Issue Type: Bug
>          Components: Client, mapreduce
>    Affects Versions:
>            Reporter: Lukas Nalezenec
>            Assignee: Lukas Nalezenec
> InputSplits should be sorted by length, but TableSplit does not have a real getLength implementation:
>   @Override
>   public long getLength() {
>     // Not clear how to obtain this... seems to be used only for sorting splits
>     return 0;
>   }
> This is causing us problems with scheduling: we have jobs that are supposed to finish
> in limited time, but they often get stuck in the last mapper working on a large region.
> Can we implement this method?
> What is the best way?
> We were thinking about estimating the size from the size of the files on HDFS.
> We would like to get a Scanner from TableSplit, use startRow, stopRow and column families
> to get the corresponding region, then compute the size on HDFS for the given region and column family.

> Update:
> This ticket was about a production issue. I talked with the guy who worked on this and he
> said our production issue was probably not directly caused by getLength() returning 0.

