hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nitin Pawar <nitinpawar...@gmail.com>
Subject Re: hive query to select top 10 product of each subcategory and select most recent product info
Date Fri, 11 Apr 2014 08:58:40 GMT
will it be a good idea to just get top 10 ranked products by whatever your
ranking is based on and then join it with its metadata (self join or any
other way) ?


On Fri, Apr 11, 2014 at 1:52 PM, Mohit Durgapal <durgapalmohit@gmail.com>wrote:

> Hi Nitin,
>
> The ddl is as follows:
>
> CREATE EXTERNAL TABLE user_logs(
> users_iduuid    string,
> siteid  int,
> site_catid      int,
> stext   string,
> catg    int,   // CATEGORY
> scatg   int, // SUBCATEGORY
> catgname        string,
> scatgname       string,
> brand   string,    // PRODUCT BRAND NAME
> prrange string,
> curr    int,
> pname   string, // product name
> pid     int,  // product ID
> price   string,  //Product Price
> prodnbr int,
> mrp     string,  //MRP
> prURL string, //Product url
> prIMGURL string, //Product Image URL
> opr     string,
> oid     string,
> txsucc  string,
> last_updated    string //timestamp
> )
>
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
>
>
> I am looking for an output where I have top 10 products from each
> subcategory(on the basis of count) with all their information like product
> name, price, url, imgurl. Any there will be multiple entries for the same
> products (pids) within the same subcategory, In that case I have to pick
> the product info that is latest(by last_updated field).
>
>
> I have written a query but it is considering a multiple entries of
> product as  different products If price or any other info changes for that
> product.
>
>
>
> select siteid,site_catid,catg,scatg,COLLECT_SET(PRODDESC) from
> (
>   select
> PRODDESC,displays,siteid,site_catid,catg,scatg,rank(siteid,site_catid,catg,scatg)
> as row_number from
>   (
>   select count(*) as
> displays,siteid,site_catid,catg,scatg,CONCAT('{','pname:',pname,',price:',price,',','mrp:',mrp,',curr:',curr,',pid:',pid,'}')
> as PRODDESC from
>   user_logs group by siteid,site_catid,catg,scatg,pid,pname,price,mrp,curr
> order by siteid,site_catid,catg,scatg,displays desc
>   ) A
>   ) B
> WHERE row_number < 10
> group by siteid,site_catid,catg,scatg
> order by siteid,site_catid,catg,scatg desc;
>
> The rank() method simply helps in fetching top 10 within a subcategory.
> Every time it encounters the same combination of
> siteid,site_catid,catg,scatg it increments row_number goes till 10.
>
> The problem above is that I am forced to put product info such as
> "pname,price,mrp,...." in the group by clause otherwise I will not be able
> to get  that information  in "select". Therefore, even if someone changes
> just the price a product(this happens very frequently) it is considered a
> different product by the above query. And that is something I don't want.
>
> I hope I have made it a little more clear?  Thanks for your reply :)
>
>
>
> On Fri, Apr 11, 2014 at 12:45 PM, Nitin Pawar <nitinpawar432@gmail.com>wrote:
>
>> may be you can share your table ddl, your query and what output r u
>> looking for
>>
>>
>> On Fri, Apr 11, 2014 at 12:26 PM, Mohit Durgapal <durgapalmohit@gmail.com
>> > wrote:
>>
>>> I have a hive table partitioned by dates. It contains ecomm data in the
>>> format siteid,sitecatid,catid,subcatgid,pid,pname,pprice,pmrp,pdesc....
>>>
>>>
>>>
>>> What I need to do is to run a query on table above in hive for top 10
>>> products(count wise) in each sub category. What adds a bit more complexity
>>> is that I need all the information of the product. Now when I do group by
>>> with only subcatg,pid, I can only select the same fields. But I want all
>>> the data for that product coming in the same row as subcatg & prodid like
>>> prodname, proddesc,price, mrp,imageurl. And since some information like
>>> price & proddesc of a product keep on changing I want to pick the latest
>>> column values(according to a date field) for a pid if we are able to do a
>>> group by on subcatg,pid.
>>>
>>>
>>> I am not able to find a solution to my problem in hive. Any help would
>>> be much appreciated.
>>>
>>>
>>> Regards
>>> Mohit
>>>
>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>


-- 
Nitin Pawar

Mime
View raw message