hbase-dev mailing list archives

From "Billy Pearson" <sa...@pearsonwholesale.com>
Subject Re: Data serialization doesn't seem to respect MAX_VERSIONS
Date Tue, 16 Sep 2008 02:17:10 GMT
The values are still stored, and the files are large, because the only time we 
remove versions beyond max_versions is during a major compaction. By default 
that runs once a day, as set in hbase-default.xml. We run minor compactions on 
small groups of the map files more often, but we cannot enforce max_versions 
until the major compaction, when we combine all the map files, can see all 
the versions, and keep only the latest X.
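
To illustrate the difference, here is a toy model in Python. It is only a sketch of the behaviour described above, not HBase's actual code: the function names and the (row, column, timestamp, value) tuple layout are made up for the example, and it assumes, as stated, that a minor compaction merges a few files without dropping any versions.

```python
from collections import defaultdict


def minor_compact(store_files, count):
    """Merge the first `count` files into one; no versions are dropped."""
    merged = [cell for f in store_files[:count] for cell in f]
    return [merged] + store_files[count:]


def major_compact(store_files, max_versions):
    """Merge all files and keep only the newest `max_versions` per cell."""
    by_cell = defaultdict(list)
    for f in store_files:
        for row, column, ts, value in f:
            by_cell[(row, column)].append((ts, value))
    merged = []
    for (row, column), versions in sorted(by_cell.items()):
        versions.sort(reverse=True)  # newest timestamp first
        for ts, value in versions[:max_versions]:
            merged.append((row, column, ts, value))
    return [merged]


def total_cells(store_files):
    return sum(len(f) for f in store_files)


# Six flushed files, each holding 10 updates to the same cell.
files = [
    [("single_row", "test:column", f * 10 + i, "version_%d" % (f * 10 + i))
     for i in range(10)]
    for f in range(6)
]

after_minor = minor_compact(files, 3)
after_major = major_compact(after_minor, max_versions=1)

print(len(after_minor), total_cells(after_minor))  # 4 60
print(after_major)  # [[('single_row', 'test:column', 59, 'version_59')]]
```

In this model the minor compaction reduces the number of files but still holds all 60 versions; only the major compaction, which sees every file at once, cuts the cell down to its latest version.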

HBASE-871 makes it possible to set, at the column family level, when a major 
compaction runs, to help with tables/column families that receive lots of 
updates to the same rows over a day's time.


"Adal Chiriliuc" <adalc@adobe.com> wrote in 

We've been inserting data into HBase and found that the size of the files on 
local disk/HDFS is much larger than expected.

So I made a small script which updates the same row over Thrift many times. 
The table was created with MAX_VERSIONS = 1.

This is what I found:

If I modify the same cell 100,000 times, the final region "data" file on 
disk contains around 50,000 of those modifications after I shut down HBase.

If I modify the same cell 200,000 times, the final region "data" file on 
disk contains around 100,000 of those modifications after I shut down HBase.

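# Imports are not shown in the original post. thrift_util is presumably a
# local helper; Hbase, ColumnDescriptor and Mutation are Thrift bindings.
client = thrift_util.create_client(Hbase.Client, "localhost", 9090, 30.0)

# Create a table with one column family that keeps a single version per cell.
cd = ColumnDescriptor()
cd.name = "test:"
cd.maxVersions = 1
client.createTable("bug_test", [cd])

# Overwrite the same cell 100,000 times, printing progress every 1,000 writes.
for i in range(100000):
    mutation = Mutation()
    mutation.column = "test:column"
    mutation.value = "version_%d" % i
    client.mutateRow("bug_test", "single_row", [mutation])
    if i % 1000 == 0:
        print(i)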

Is this expected behavior? Our use case involves multiple updates of the 
same cell using big blobs of data (25 KB).

Note: when getting a cell or scanning the table, everything is OK: only the 
last inserted version of the cell is returned. The older values of the cell 
are only present in the storage files.

Best regards,
