hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erik Holstad (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-1261) Improvements to flushing, compacting, and splitting
Date Fri, 24 Apr 2009 23:19:30 GMT

    [ https://issues.apache.org/jira/browse/HBASE-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702616#action_12702616

Erik Holstad commented on HBASE-1261:

wo types of compactions, minor and major.
Tasks that can be performed during a compaction:
1. Sorting the KeyValue from the different file
2. Checking TTL
3. Checking number of versions to keep
4. Removing KeyValues that are effected by deletes
5. Removing the deletes

For a major compaction all of the above is done, so the question is what we
should do for a minor. 

The thing is that when merging the different files into one list we need to sort
the smallest entry in all the files to be able to tell which one to put in first
,which means that we already are doing most of the work. So adding 2,3 and 4
will only add 3 extra checks, but will probably give better performance in the
system in total, but needs to be tested. The deletes can't be removes for a non
major compaction, so they will be kept in the new file, but will only apply to 
older files, which will speed up reading.

Today when doing the merge we open up a scanner for all the n files that we want
to merge and for every value we put into the new file we do n compares, so if
the total number of inserts are m we get something like m*n compares since we
are not using any of the information from the last compare, so we end up redoing
a lot of work. 

I propose a TreeSet or a Heap structure so we get something like m*log(n). 
After that we just need to do some testing to see how much the extra compares
will add and how big the read performance gain will be by enforcing the extra

> Improvements to flushing, compacting, and splitting
> ---------------------------------------------------
>                 Key: HBASE-1261
>                 URL: https://issues.apache.org/jira/browse/HBASE-1261
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: documentation, master, regionserver
>    Affects Versions: 0.19.0
>            Reporter: Evgeny Ryabitskiy
>            Assignee: Evgeny Ryabitskiy
> This issue concludes my task for GSoC 2009 
> The idea is to make all futures from Bigtable paper and other ideas in this processes
of to flushing, compacting and splitting.
> Some talk in IRC related to this issue creation:
> [22:37] <jgray2> bigtable has an optimization which, when flushing the memcache,
will actually write out a new storefile that merges the existing storefile with the memcache
> [22:37] <jgray2> that way you don't have more than one storefile
> [22:37] <jgray2> which is a huge inefficiency

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message