pig-user mailing list archives

From Ian Stevens <i.stev...@syncapse.com>
Subject Simple Pig query returns inaccurate result size for HBase tables of 1.8m+ rows
Date Wed, 05 Jan 2011 21:14:12 GMT
Hi everyone. In considering Pig for our HBase querying needs, I've run into a discrepancy between
the size of Pig's result set and the size of the table being queried. I hope this is due to
a misunderstanding of HBase and Pig on my part. The test case which generates the discrepancy
is fairly simple, however.

The link below contains a Jython script which populates an HBase table with data in two column
families. A corresponding Pig query retrieves data for one column and saves it to a CSV:

https://gist.github.com/766929

The Jython script has the following usage:

> jython hbase_test.py [table] [column count] [row count] [batch count]

This will populate a table named [table] with two column families. The first contains static
data. The second contains the given number of columns, populated with data.
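
For readers who cannot open the gist, the layout described above can be sketched in plain Python roughly as follows. This is not the script from the gist and it does not talk to HBase. The row-key format, the family name "data", the qualifier names, and the reading of [batch count] as a batch size are all assumptions made for illustration; only the "meta" family name and the 'value_%d_%d' format come from this message.

```python
import random


def make_row(row_index, column_count, rng):
    """Build one row as (row_key, {"family:qualifier": value}).

    Assumptions (not taken from the gist): the row-key format, the
    family name "data", and the qualifier names are hypothetical.
    """
    row_key = "row_%d" % row_index
    cells = {"meta:static": "static"}  # first family: static data
    for col in range(column_count):
        # 'value_%d_%d' is the format quoted later in this message;
        # which integers fill it is an assumption.
        cells["data:col_%d" % col] = "value_%d_%d" % (col, rng.randint(0, 10**6))
    return row_key, cells


def make_batches(column_count, row_count, batch_size, seed=0):
    """Yield lists of (row_key, cells) holding at most batch_size rows each."""
    rng = random.Random(seed)
    batch = []
    for i in range(row_count):
        batch.append(make_row(i, column_count, rng))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch
```

Each yielded batch would then be handed to whatever HBase client the real script uses. Every row carries one cell in the first family plus [column count] cells in the second.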

The Pig query will return an inaccurate number of results for certain table sizes and configurations,
most notably with tables of 1.8 million rows or more and with more than 2 columns
in the queried column family, e.g.:

> jython hbase_test.py test 3 1800000 100000

For instance, if I execute the above command and the corresponding Pig query, the query returns
905914 results rather than the expected 1800000. If the table is re-populated and queried a
second time, the number of results changes. If I run the query again without re-populating the
table, I get the same number of results. The HBase shell returns an accurate row count.
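
One way to measure the shortfall is to count the lines Pig wrote and compare them with the number of rows inserted. The sketch below assumes the query stores its output in a directory of part-* files, one result per line, and that no value contains a newline. The function names and the directory name in the usage note are hypothetical.

```python
import glob
import os


def count_output_rows(output_dir):
    """Count lines across the part-* files in a Pig output directory."""
    total = 0
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path, "rb") as handle:
            total += sum(1 for _ in handle)
    return total


def check_row_count(output_dir, expected_rows):
    """Return (actual, missing) so a shortfall is easy to report."""
    actual = count_output_rows(output_dir)
    return actual, expected_rows - actual
```

For the run described above, a call such as check_row_count("results", 1800000) would report 894086 missing rows if the output held 905914 lines.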

Some notes on reproducing this issue (or not):

* If the Jython script doesn't populate the meta column family, the issue goes away with the
same query.
* If the Jython script populates 2 columns instead of 3, the issue goes away with the same
query.
* The size of the column key or its value may influence whether the issue occurs.
For instance, if I change the script to store 'value_%d' instead of 'value_%d_%d', retaining
the random int, the issue goes away with the same query.

I am using Pig 0.8.0 and HBase 0.20.6 on a MacBook running Snow Leopard using the stock Java
that came with the OS. Attached is a log of the Pig console output. The error logs contain
nothing of import.

Am I doing anything incorrectly? Is there a way I can work around this issue without compromising
the column family being queried?

This appears to be a fairly simple case of Pig/HBase usage. Can anyone else reproduce the
issue?

thanks,
Ian.

