hbase-dev mailing list archives

From "Schubert Zhang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-1778) Improve PerformanceEvaluation
Date Fri, 04 Sep 2009 06:12:57 GMT

    [ https://issues.apache.org/jira/browse/HBASE-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751326#action_12751326
] 

Schubert Zhang commented on HBASE-1778:
---------------------------------------

This happens because the old code uses TextInputFormat, which is not correct in this case.
TextInputFormat extends FileInputFormat.

The old code has the following mistakes in its MapReduce programming, at the concept level or the implementation level.

1. FileInputFormat treats the input file as its input data, and its RecordReader reads
lines from each input split of that file. Its getSplits() method splits the input file based
on size in bytes, but the old code wants to split the input file based on lines, so this
is the wrong usage here.

I think the author of the old code did not understand how FileInputFormat/TextInputFormat work. I
have implemented a new PeInputFormat and PeInputSplit to avoid this misuse and confusion.
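
To illustrate the idea of splitting by rows rather than by bytes, here is a minimal sketch in plain Java. It is not the code from the attached patch: RowRangeSplit and compute() are hypothetical names, and the sketch leaves out the Hadoop interfaces so that it runs on its own. In a real job, logic like this would live in the getSplits() method of a custom InputFormat.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a split that carries a row range instead of a byte range. */
final class RowRangeSplit {
    final long startRow;   // first row number covered by this split
    final long rows;       // number of rows in this split
    final long totalRows;  // total rows across all splits

    RowRangeSplit(long startRow, long rows, long totalRows) {
        this.startRow = startRow;
        this.rows = rows;
        this.totalRows = totalRows;
    }

    /** Builds exactly numSplits splits that together cover rows [0, totalRows). */
    static List<RowRangeSplit> compute(long totalRows, int numSplits) {
        if (totalRows < 0 || numSplits <= 0) {
            throw new IllegalArgumentException("totalRows must be >= 0 and numSplits > 0");
        }
        List<RowRangeSplit> splits = new ArrayList<>(numSplits);
        long base = totalRows / numSplits;       // rows every split receives
        long remainder = totalRows % numSplits;  // leftover rows, one each to the first splits
        long start = 0;
        for (int i = 0; i < numSplits; i++) {
            long rows = base + (i < remainder ? 1 : 0);
            splits.add(new RowRangeSplit(start, rows, totalRows));
            start += rows;
        }
        return splits;
    }
}
```

Because the number of splits is an input to the computation rather than a side effect of file size, the number of maps always equals the number requested, and no split can end up empty or holding two ranges.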

2. I have not changed the overall architecture of PerformanceEvaluation, in order to keep the
current workflow.
But in my opinion, the current MapReduce implementation of PerformanceEvaluation
is not a conventional one.

In fact, aside from the splitting mechanism described above, the real input data in this case
does not come from the input file. The file holds no data, only row-range configuration info. So
each row to be inserted into the HBase table should come from the RecordReader of the InputFormat,
and in this case the RecordReader should generate sequential or random rows.
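
A reader that generates rows instead of reading lines could look roughly like the following plain-Java sketch. SequentialRowReader and RandomRowReader are hypothetical names, not classes from the patch or from Hadoop, and they implement java.util.Iterator rather than Hadoop's RecordReader so that the example is self-contained.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Random;

/** Hypothetical reader that yields the row numbers startRow, startRow + 1, ... for one split. */
final class SequentialRowReader implements Iterator<Long> {
    private final long endRow; // exclusive upper bound
    private long nextRow;

    SequentialRowReader(long startRow, long rows) {
        this.nextRow = startRow;
        this.endRow = startRow + rows;
    }

    @Override
    public boolean hasNext() {
        return nextRow < endRow;
    }

    @Override
    public Long next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return nextRow++;
    }
}

/** Hypothetical reader that yields `rows` pseudo-random row numbers in [0, totalRows). */
final class RandomRowReader implements Iterator<Long> {
    private final Random random;
    private final long totalRows;
    private long remaining;

    RandomRowReader(long rows, long totalRows, long seed) {
        if (totalRows <= 0) {
            throw new IllegalArgumentException("totalRows must be > 0");
        }
        this.random = new Random(seed);
        this.totalRows = totalRows;
        this.remaining = rows;
    }

    @Override
    public boolean hasNext() {
        return remaining > 0;
    }

    @Override
    public Long next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        remaining--;
        // floorMod keeps the result non-negative; the slight modulo bias is ignored in this sketch.
        return Math.floorMod(random.nextLong(), totalRows);
    }
}
```

With readers like these, the mapper receives one row per record and no longer needs to parse a range description out of a line of text.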

In my tests, the old code sometimes gives the wrong number of maps. E.g., I set the number
of maps to 40 (4 clients * 10), but I got 41 splits (maps). This is caused by the wrong split
method, which splits the file based on bytes. After I added debug logging in getSplits(), I got
the following wrongly split maps:

split0: startRow=0 perClientRunRows=104857 totalRows=4194304
startRow=1048576 perClientRunRows=104857 totalRows=4194304

split1: startRow=2097152 perClientRunRows=104857 totalRows=4194304

split2: startRow=3145728 perClientRunRows=104857 totalRows=4194304

split3: startRow=4194304 perClientRunRows=104857 totalRows=4194304

split4: startRow=5242880 perClientRunRows=104857 totalRows=4194304

split5: startRow=6291456 perClientRunRows=104857 totalRows=4194304

split6: startRow=7340032 perClientRunRows=104857 totalRows=4194304

split7: startRow=8388608 perClientRunRows=104857 totalRows=4194304

split8: startRow=9437184 perClientRunRows=104857 totalRows=4194304

split9: startRow=10485760 perClientRunRows=104857 totalRows=4194304

split10: startRow=11534336 perClientRunRows=104857 totalRows=4194304

split11: startRow=12582912 perClientRunRows=104857 totalRows=4194304

split12: startRow=13631488 perClientRunRows=104857 totalRows=4194304

split13: startRow=14680064 perClientRunRows=104857 totalRows=4194304

split14: startRow=15728640 perClientRunRows=104857 totalRows=4194304

split15: null

split16: startRow=16777216 perClientRunRows=104857 totalRows=4194304
......
(many other splits are omitted here...)

We can see that some splits include two row ranges, and some splits have nothing.
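
The effect can be reproduced with a simplified model in plain Java. This is only a sketch, not Hadoop's actual split code: ByteSplitDemo and linesPerSplit() are hypothetical names, the file is cut into equal byte ranges, and each line is attributed to the range that contains its first byte, which approximates how a line-oriented reader assigns lines to splits.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: cut a text file into equal byte ranges and count the lines in each range. */
final class ByteSplitDemo {

    /** Returns how many lines start inside each of numSplits byte ranges. */
    static int[] linesPerSplit(String[] lines, int numSplits) {
        long total = 0;
        for (String line : lines) {
            total += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for the newline
        }
        long splitSize = Math.max(1, total / numSplits);
        int[] counts = new int[numSplits];
        long offset = 0; // byte offset at which the current line starts
        for (String line : lines) {
            // Leftover bytes at the end of the file are folded into the last range.
            int index = (int) Math.min(offset / splitSize, numSplits - 1);
            counts[index]++;
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return counts;
    }
}
```

For example, linesPerSplit(new String[] {"a", "b", "cccccccc", "d"}, 4) returns [2, 1, 0, 1]: four lines and four ranges, yet the first range holds two lines and the third holds none, because the lines differ in length.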





> Improve PerformanceEvaluation
> -----------------------------
>
>                 Key: HBASE-1778
>                 URL: https://issues.apache.org/jira/browse/HBASE-1778
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: test
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.20.1
>
>         Attachments: HBase-0.20.0-PE.pdf, HBASE-1778.patch
>
>
> The current PerformanceEvaluation class has two problems:
> - It is not updated for hadoop-0.20.0.
> - The approach to splitting maps is not strict. It needs correct InputSplit and InputFormat
> classes. The current code uses TextInputFormat and FileSplit, which is not reasonable.
> We will fix these problems.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

