hive-dev mailing list archives

From Carter Shanklin
Subject Re: Interesting claims that seem untrue
Date Mon, 16 Sep 2013 20:58:14 GMT

If nothing else I'm glad it was interesting enough to generate some
discussion. These sorts of stats are always subjects of a lot of
controversy. I have seen a lot of these sorts of charts float around in
confidential slide decks and I think it's good to have them out in the open
where anyone can critique and correct them.

In this case, Ed, you've pointed out a legitimate flaw in my analysis.
Redoing the analysis, I found that, due to a bug in my scripts, JIRAs that
didn't have Hudson comments in them were not counted (this was one way the
scripts identified SVN commit IDs, a method I have since removed due to
flakiness). Brock's patch was the single largest victim of this bug but not
the only one: there were some from Cloudera, NexR, Hortonworks and Facebook,
and even 2 from you, Ed. The interested can see a full list of exclusions here:
I apologize to those under-represented; there wasn't any intent on my part
to minimize anyone's work. The impact on the final totals is Cloudera +5.4%,
NexR +0.8%, Facebook -2.7%, Hortonworks -3.3%. I will be updating the blog
later today with relevant corrections.

There is going to be continued interest in seeing charts like these, for
example when Hive 12 is officially done. Sanjay suggested that LoC counts
may not be the best way to represent true contribution. I agree that not
all lines of code are created equal; for example, a few monster patches
recently went in rearranging HCatalog namespaces and, I think, also
indentation style. This (hopefully) mechanical work is not on the same
footing as adding new query language features. Still, it is work, and it
wouldn't be fair to pretend it didn't happen. If anyone has ideas on better
ways to fairly capture contribution, I'm open to suggestions.
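
As one possible starting point, here is a rough sketch (not the script behind
the blog numbers) that reads a unified diff and reports added lines in two
buckets: lines that differ from a removed line in the same hunk only by
whitespace, such as re-indentation, and everything else. The function name
classify_added_lines and the matching rule are assumptions made for
illustration. Renames, moved code and generated test output would still need
separate handling, and a removed line whose content begins with "--" would be
mistaken for a file header.

```python
from collections import Counter


def classify_added_lines(diff_text):
    """Return (substantive, mechanical) counts of added lines in a unified diff.

    An added line is "mechanical" when, with all whitespace stripped, it
    matches a line removed in the same hunk. Blank added lines are ignored.
    """
    substantive = 0
    mechanical = 0
    removed = Counter()  # whitespace-stripped removed lines in current hunk
    added = []           # whitespace-stripped added lines in current hunk

    def flush():
        # Match each added line against the removed lines of the same hunk.
        nonlocal substantive, mechanical
        pool = removed.copy()
        for line in added:
            if not line:
                continue
            if pool[line] > 0:
                pool[line] -= 1
                mechanical += 1
            else:
                substantive += 1
        removed.clear()
        added.clear()

    for raw in diff_text.splitlines():
        if raw.startswith("@@"):
            flush()  # a new hunk starts; settle the previous one
        elif raw.startswith("+++") or raw.startswith("---"):
            continue  # file header lines, not content
        elif raw.startswith("+"):
            added.append("".join(raw[1:].split()))
        elif raw.startswith("-"):
            removed["".join(raw[1:].split())] += 1
    flush()
    return substantive, mechanical
```

Reporting both numbers side by side keeps the mechanical work visible without
putting it on the same footing as new features.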

On Thu, Sep 12, 2013 at 7:19 AM, Edward Capriolo wrote:

> I was reading the horton-works blog and found an interesting article.
> There is a very interesting graphic which attempts to demonstrate lines of
> code in the 12 release.
> Although I do not know how they are calculated, they are probably counting
> code generated by tests output, but besides that they are wrong.
> One claim is that Cloudera contributed 4,244 lines of code.
> So to debunk that claim:
> In Brock Noland from
> cloudera, created the ptest2 testing framework. He did all the work for
> ptest2 in hive 12, and it is clearly more than 4,244.
> This consists of 84 java files
> [edward@desksandra ptest2]$ find . -name "*.java" | wc -l
> 84
> and by itself is 8001 lines of code.
> [edward@desksandra ptest2]$ find . -name "*.java" | xargs cat | wc -l
> 8001
> [edward@desksandra hive-trunk]$ wc -l HIVE-4675.patch
> 7902 HIVE-4675.patch
> This is not the only feature from cloudera in hive 12.
> There is also a section of the article that talks of a "ROAD MAP" for hive
> features. I did not know we (hive) had a road map. I have advocated
> switching to feature based release and having a road map before, but it was
> suggested that might limit people from itch-scratching.

Carter Shanklin
Director, Product Management
(M): +1.650.644.8795 (T): @cshanklin

