hbase-dev mailing list archives

From Adal Chiriliuc <ad...@adobe.com>
Subject RE: Data serialization doesn't seem to respect MAX_VERSIONS
Date Mon, 15 Sep 2008 16:05:19 GMT
Forgot to specify: this happens with the latest trunk version.

-----Original Message-----
From: Adal Chiriliuc [mailto:adalc@adobe.com]
Sent: 15 September 2008 19:03
To: hbase-dev@hadoop.apache.org
Subject: Data serialization doesn't seem to respect MAX_VERSIONS

Hello,

We've been inserting data into HBase and found that the files on local
disk/HDFS are much larger than expected.

So I made a small script which updates the same row many times over Thrift. The table was
created with MAX_VERSIONS = 1.

This is what I found:

If I modify the same cell 100,000 times, the final region "data" file on disk contains around
50,000 of those modifications after I shut down HBase.

If I modify the same cell 200,000 times, the final region "data" file on disk contains around
100,000 of those modifications after I shut down HBase.

# Generated HBase Thrift bindings; thrift_util is our local helper that
# sets up the socket/transport and returns a connected client.
from hbase import Hbase
from hbase.ttypes import ColumnDescriptor, Mutation
import thrift_util

client = thrift_util.create_client(Hbase.Client, "localhost", 9090, 30.0)

# Create a table with a single family capped at one version per cell.
cd = ColumnDescriptor()
cd.name = "test:"
cd.maxVersions = 1
client.createTable("bug_test", [cd])

# Overwrite the same cell 100,000 times.
for i in range(100000):
    mutation = Mutation()
    mutation.column = "test:column"
    mutation.value = "version_%d" % i
    client.mutateRow("bug_test", "single_row", [mutation])
    if i % 1000 == 0:
        print i
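
For reference, here is roughly how the on-disk counts above can be measured (a sketch; the path is illustrative, the actual region and mapfile names will differ):

# Count how many distinct "version_N" values survive in the region's
# store file (illustrative path; substitute the real region/file names).
import re

data = open("/tmp/hbase/bug_test/REGION_NAME/test/mapfiles/FILE_ID/data", "rb").read()
print len(set(re.findall(r"version_\d+", data)))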

Is this expected behavior? Our use case involves repeatedly updating the same cell with large
blobs of data (25 KB).
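
To put a number on why this worries us, a quick back-of-the-envelope calculation for that workload:

# If old versions are never dropped from disk, one heavily-updated
# cell balloons from 25 KB to gigabytes.
updates = 100000
blob_size = 25 * 1024        # 25 KB per value
print updates * blob_size    # 2560000000 bytes, ~2.4 GB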

Note: when getting a cell or scanning the table, everything is OK: only the last inserted version
of the cell is returned. The older values are present only in the storage files.
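
For completeness, this is the kind of read-side check I mean (a sketch against the same Thrift client; getVer's return type differs between IDL revisions, but the count is what matters here):

# Ask for up to 10 versions of the cell; with MAX_VERSIONS = 1 only
# the last inserted value should come back.
versions = client.getVer("bug_test", "single_row", "test:column", 10)
print len(versions)  # prints 1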

Best regards,
Adal
