Return-Path: X-Original-To: apmail-hive-user-archive@www.apache.org Delivered-To: apmail-hive-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id B9292DC7B for ; Mon, 19 Nov 2012 22:55:44 +0000 (UTC) Received: (qmail 83629 invoked by uid 500); 19 Nov 2012 22:55:43 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 83577 invoked by uid 500); 19 Nov 2012 22:55:43 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 83568 invoked by uid 99); 19 Nov 2012 22:55:43 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 19 Nov 2012 22:55:43 +0000 X-ASF-Spam-Status: No, hits=-0.7 required=5.0 tests=RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of edlinuxguru@gmail.com designates 209.85.223.176 as permitted sender) Received: from [209.85.223.176] (HELO mail-ie0-f176.google.com) (209.85.223.176) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 19 Nov 2012 22:55:37 +0000 Received: by mail-ie0-f176.google.com with SMTP id 13so1715937iea.35 for ; Mon, 19 Nov 2012 14:55:16 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=PNFOMocVmbcqqqC6zdvs/CUEE9LBuLQuc8dmMLUr50A=; b=c6t3Q331vu42IYK2qg4fNYKUOnV68/fjiVMCMzAkhn17yhhIMtVCzq/AURePhRPxvz 6Ic2jz7Pdmucp35YZ6Ve96XACqjr94XhhZ6wJg0+jtjR+CqVGjrNr5urLNRKBg6cacur wSklqpkxVV+mBTURT04v4tTgo+25MmNcVOe7ABvIfdq3gHOfL1Gi/ZecyqzTFGSOqHHa PhIC+vyNg91r/i0dWWL8sN4R2rGVfPQi9ZIyIOqgh74HUbDhNRnNjYyfstECr1H2xoXO Wl59EVlQffCaTAdhHb9W4Sg7AgEqpAsjubLw3EUmEyPI0A542D57MuPVv9Z9jrJdnw7A k2QA== MIME-Version: 1.0 Received: by 10.50.178.106 with SMTP id cx10mr8287584igc.24.1353365716233; Mon, 19 Nov 2012 14:55:16 -0800 (PST) Received: by 10.64.97.106 with HTTP; Mon, 19 Nov 2012 14:55:16 -0800 (PST) In-Reply-To: References: Date: Mon, 19 Nov 2012 17:55:16 -0500 Message-ID: Subject: Re: how to obtain the latest record for each user in a hive table? From: Edward Capriolo To: user@hive.apache.org Content-Type: text/plain; charset=ISO-8859-1 X-Virus-Checked: Checked by ClamAV on apache.org On Mon, Nov 19, 2012 at 4:02 PM, qiaoresearcher wrote: > The table format is something like: > > user_id visiting_time visiting_web_page > user1 time11 page_string_11 > user1 time12 page_string_12 > user1 time13 page_string_13 <-- latest time > stamp > user1 time14 page_string_14 <-- also latest > time stamp, same as the one above > user1 time15 page_string_15 > ... ... ..... > user2 time21 page_string_21 > user2 time22 page_string_22 > user2 time23 page_string_23 > user2 time24 page_string_24 > user2 time25 page_string_25 > .... ..... .... > > Assume time13 and time14 are the latest time stamp for user1, time22 and > time25 are the latest time stamp for user2, > how to obtain output like: > user1 [page_string_13, page_string_14] > user2 [page_string_22, page_string_25] > > ps: run { select user, max(visiting_time) group by user } WILL return > result like: > user1, time13 (same as time 14) > user2, time22 (same as time 25) > > many thanks in advance! > Part of the solution is to use hive collect_set UDF. This collects group into a hive array type. select userid,visitingtime, collect_Set(visiting_web_page) from table group by userid,visitingtime; Collect_set de-duplicates. If you do not want this use the collect_udf I wrote. https://github.com/edwardcapriolo/hive-collect If you want only the newest one you can use rank. https://github.com/edwardcapriolo/hive-rank