hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Keith Wiley <kwi...@keithwiley.com>
Subject Re: Need rank()
Date Tue, 02 Apr 2013 19:51:20 GMT
Yep, the original article is definitely erroneous in this regard.  I figured out that eventually.
 I'm not sure how much I can trust that resource now.  I may have to look elsewhere.  I agree
that Edward's description is pretty good, but as I said earlier, I can't actually use his
code, so I'm trying to cobble a workable solution together from the various resources available.
 Ritesh's article, despite the error in the Hive syntax, is still useful in that it enables
one to quickly compile a simple rank jar without relying on git, maven, or other project dependencies
-- problems which have plagued me with Edward's approach.  So, if I can use Ritesh's method
to write a simple rank function, and Edward's accurate description of how to construct the
query, then I can put all the pieces together into a workable solution.

I'll let you know if I get it.

On Apr 2, 2013, at 10:56 , Igor Tatarinov wrote:

> You are getting the error because you are ORDERing BY rank but rank is not in the top
SELECT
> 
> Also, DISTRIBUTE BY/SORT BY are done after SELECT so you have to use a subquery:
> SELECT ..., rank(user)
> FROM (SELECT ... DISTRIBUTE BY ... SORT BY)
> 
> igor
> decide.com
> 
> 
> On Tue, Apr 2, 2013 at 10:03 AM, Keith Wiley <kwiley@keithwiley.com> wrote:
> On Apr 1, 2013, at 16:12 , Alexander Pivovarov wrote:
> 
> > http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/
> 
> Is there any possibility there is a bug in Ritesh Agrawal's query statement from that
article?  I created a test table with the exact column names from the example in the article
and used a minimally altered version of the command (I removed the where clause to simplify
things a bit) and got an error which suggests there is something slightly wrong with the command
(or perhaps the table has to be configured a special way).  Here's what I get when I almost
perfectly duplicate that example:
> 
> hive> describe test;
> OK
> user    string
> category        string
> value   int
> Time taken: 0.082 seconds
> ==================================================
> hive> select * from test;
> OK
> user1   cat1    1
> user1   cat1    2
> user1   cat1    3
> user1   cat2    10
> user1   cat2    20
> user1   cat2    30
> user2   cat1    11
> user2   cat1    21
> user2   cat1    31
> user2   cat2    5
> user2   cat2    6
> user2   cat2    7
> Time taken: 0.202 seconds
> ==================================================
> hive> SELECT user, category, value
>     > FROM (
>     > SELECT user, category, rank(user) as rank, value
>     > FROM test
>     > DISTRIBUTE BY user
>     > SORT BY user, value desc
>     > ) a
>     > WHERE rank > 1
>     > ORDER BY user, rank;
> FAILED: Error in semantic analysis: Line 9:15 Invalid table alias or column reference
'rank': (possible column names are: _col0, _col1, _col2)
> hive>
> 
> ________________________________________________________________________________
> Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com
> 
> "What I primarily learned in grad school is how much I *don't* know.
> Consequently, I left grad school with a higher ignorance to knowledge ratio than
> when I entered."
>                                            --  Keith Wiley
> ________________________________________________________________________________
> 
> 


________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"Luminous beings are we, not this crude matter."
                                           --  Yoda
________________________________________________________________________________


Mime
View raw message