hbase-issues mailing list archives

From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HBASE-8369) MapReduce over snapshot files
Date Fri, 25 Oct 2013 22:13:35 GMT

    [ https://issues.apache.org/jira/browse/HBASE-8369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13805748#comment-13805748 ]

Enis Soztutar edited comment on HBASE-8369 at 10/25/13 10:11 PM:
-----------------------------------------------------------------

I uploaded the slides about this issue here: http://www.slideshare.net/enissoz/mapreduce-over-snapshots.
They also contain some numbers for performance comparison.
Here are the raw numbers (in MB/s):

||Data size||6.6 G||13.2 G||19.8 G||26.4 G||
|StoreFileCount per region|3|6|9|12|
|Scan (MB/s)|8.2|7.6|11.2|7.2|
|SnapshotScan (MB/s)|60.8|59.5|55.3|46.7|
|ScanMR (MB/s)|75.9|80.4|82.3|140.7|
|SnapshotScanMR (MB/s)|198.6|275.6|311.6|329.4|

The main takeaway seems to be that single-scanner speed improves about 5-6x, from 11 MB/s to 55 MB/s
(the 19.8 G column). That is also about half of raw disk speed (for a single disk). Do not read much
into the MR throughput improving as the store file count increases: job launch costs take up a
relatively smaller share of the run time as data sizes increase.
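
The arithmetic behind these remarks can be rechecked from the table. The sketch below is only an illustration: the class and method names are made up for this example, it assumes that 1 G means 1024 MB, and the fixed-overhead model (run time = launch overhead + size / steady-state rate) is an assumption used to illustrate the job launch cost argument, not something measured in the tests.

```java
import java.util.Locale;

/** Recomputes ratios from the throughput table; all inputs are copied from it. */
final class SnapshotScanNumbers {
    // Columns correspond to data sizes 6.6 G, 13.2 G, 19.8 G, 26.4 G.
    static final double[] SIZE_G = {6.6, 13.2, 19.8, 26.4};
    static final double[] SCAN = {8.2, 7.6, 11.2, 7.2};
    static final double[] SNAPSHOT_SCAN = {60.8, 59.5, 55.3, 46.7};
    static final double[] SCAN_MR = {75.9, 80.4, 82.3, 140.7};
    static final double[] SNAPSHOT_SCAN_MR = {198.6, 275.6, 311.6, 329.4};

    /** How many times faster the improved throughput is than the baseline. */
    static double speedup(double baselineMBps, double improvedMBps) {
        return improvedMBps / baselineMBps;
    }

    /** Run time in seconds implied by a size and a throughput (assumes 1 G = 1024 MB). */
    static double seconds(double sizeG, double mbPerSec) {
        return sizeG * 1024.0 / mbPerSec;
    }

    /**
     * Fits t = overhead + size / rate through two measurements.
     * Returns {overheadSeconds, steadyStateRateMBps}. Assumed model, for illustration only.
     */
    static double[] fitFixedOverhead(double sizeA, double mbpsA, double sizeB, double mbpsB) {
        double tA = seconds(sizeA, mbpsA);
        double tB = seconds(sizeB, mbpsB);
        double secondsPerMb = (tB - tA) / ((sizeB - sizeA) * 1024.0);
        double overhead = tA - sizeA * 1024.0 * secondsPerMb;
        return new double[] {overhead, 1.0 / secondsPerMb};
    }

    public static void main(String[] args) {
        for (int i = 0; i < SIZE_G.length; i++) {
            System.out.printf(Locale.ROOT,
                "%.1f G: SnapshotScan/Scan = %.1fx, SnapshotScanMR/ScanMR = %.1fx%n",
                SIZE_G[i],
                speedup(SCAN[i], SNAPSHOT_SCAN[i]),
                speedup(SCAN_MR[i], SNAPSHOT_SCAN_MR[i]));
        }
        // Fit the assumed model through the smallest and largest SnapshotScanMR runs.
        double[] fit = fitFixedOverhead(SIZE_G[0], SNAPSHOT_SCAN_MR[0],
                                        SIZE_G[3], SNAPSHOT_SCAN_MR[3]);
        System.out.printf(Locale.ROOT,
            "Assumed model: fixed overhead ~ %.0f s, steady-state rate ~ %.0f MB/s%n",
            fit[0], fit[1]);
    }
}
```

Under that assumed model, a fixed per-job cost is spread over more data in the larger runs, which is one way to read the rising MR numbers without attributing them to the store file count.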


> MapReduce over snapshot files
> -----------------------------
>
>                 Key: HBASE-8369
>                 URL: https://issues.apache.org/jira/browse/HBASE-8369
>             Project: HBase
>          Issue Type: New Feature
>          Components: mapreduce, snapshots
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 0.98.0
>
>         Attachments: HBASE-8369-0.94.patch, HBASE-8369-0.94_v2.patch, HBASE-8369-0.94_v3.patch,
>                      HBASE-8369-0.94_v4.patch, HBASE-8369-0.94_v5.patch, HBASE-8369-trunk_v1.patch,
>                      HBASE-8369-trunk_v2.patch, HBASE-8369-trunk_v3.patch, hbase-8369_v0.patch
>
>
> The idea is to add an InputFormat that can run a MapReduce job directly over snapshot files,
> bypassing the HBase server layer. The InputFormat is similar in usage to TableInputFormat, taking
> a Scan object from the user, but instead of running against an online table, it runs against a table
> snapshot. We do one split per region in the snapshot, and open an HRegion inside the RecordReader.
> A RegionScanner is used internally for doing the scan without any HRegionServer bits.
> Users have been asking and searching for ways to run MR jobs by reading directly from
> hfiles, so this allows new use cases if reading from stale data is ok:
>  - Take snapshots periodically, and run MR jobs only on snapshots.
>  - Export snapshots to a remote HDFS cluster, and run the MR jobs on that cluster without an HBase
>    cluster.
>  - (Future use case) Combine snapshot data with online HBase data: scan from yesterday's
>    snapshot, but read today's data from the online HBase cluster.
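
For readers who want to see roughly what such a job looks like from the user's side, here is a minimal sketch. It assumes the API as it appears in released HBase versions (`TableMapReduceUtil.initTableSnapshotMapperJob`, backed by `TableSnapshotInputFormat`); the patches attached to this issue may differ in detail. The class name, the counter names, and the two command-line arguments (snapshot name and restore directory) are hypothetical, and the HBase and Hadoop client jars are assumed to be on the classpath.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

/** Hypothetical example job: counts the rows of a snapshot without going through region servers. */
public class SnapshotRowCount {

    /** Receives the same (row key, Result) pairs that a TableInputFormat mapper would. */
    public static class RowCountMapper extends TableMapper<NullWritable, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            // Hypothetical counter group and name, used only to report the row count.
            context.getCounter("snapshot", "rows").increment(1);
        }
    }

    public static void main(String[] args) throws Exception {
        String snapshotName = args[0];        // hypothetical: name of an existing snapshot
        Path restoreDir = new Path(args[1]);  // hypothetical: writable scratch directory

        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "row count over snapshot " + snapshotName);
        job.setJarByClass(SnapshotRowCount.class);

        // The user still describes the read with an ordinary Scan, as with TableInputFormat.
        Scan scan = new Scan();
        scan.setCacheBlocks(false); // a full one-off scan gains little from the block cache

        // Wires the snapshot-backed InputFormat into the job instead of an online table.
        TableMapReduceUtil.initTableSnapshotMapperJob(
            snapshotName, scan, RowCountMapper.class,
            NullWritable.class, NullWritable.class,
            job, true /* addDependencyJars */, restoreDir);

        job.setNumReduceTasks(0);                         // map-only job
        job.setOutputFormatClass(NullOutputFormat.class); // results are reported via the counter
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The mapper is an ordinary `TableMapper`, so in this sketch an existing job written for an online table would only need its setup call changed to read from a snapshot instead.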



--
This message was sent by Atlassian JIRA
(v6.1#6144)
