Date: Fri, 2 Apr 2010 07:35:27 -0400
Subject: Re: Lucene Challenge - sum, count, avg, etc.
From: prasenjit mukherjee
To: java-user@lucene.apache.org, jeacott@hardlight.com.au

Pig generally takes CSV-style flat files as input. You then apply
join/group-by/sum/count etc. to the loaded variables (aka relations).

For Michel's example with the following data:

*Affiliate / SaleDate / SaleAmount*
* mike / 2010-03-01 / 10.00
* john / 2010-03-01 / 10.00

one can write the following Pig script:

  -- my_udf_convert_in_secs is a user-defined function (not shown here)
  r1 = load 'data.csv' using PigStorage(',')
       as (affiliate:chararray, saledate:chararray, amount:double);
  r1 = filter r1 by my_udf_convert_in_secs(saledate) > my_udf_convert_in_secs('2010-03-01')
       AND my_udf_convert_in_secs(saledate) < my_udf_convert_in_secs('2010-03-06');
  r2 = group r1 by affiliate;
  r3 = foreach r2 generate group as affiliate, SUM(r1.amount) as totalrevenue;
  r3 = order r3 by totalrevenue;
  store r3 into 'output.csv' using PigStorage(',');

Remember that it's only because of the additional sorting requirement
that we are forced to use Pig; otherwise Lucene can do the job (except
the sorting) much faster.

-Prasen

On Thu, Apr 1, 2010 at 9:40 PM, Jason Eacott wrote:
> Thanks for the ref - didn't know about Pig before.
> the language and approach looks useful, so now I'm wondering if it
> couldn't be used
> across lucene over hadoop too. If data was indexed in lucene and Pig knew that,
> then it could make for an interesting alternate lucene query language.
>
> could this work?
>
>
> prasenjit mukherjee wrote:
>> This looks like a use case more suited for Pig ( over Hadoop ).
>>
>> It could be difficult for lucene to do sort and sum simultaneously as
>> sorting itself depends upon summed value.
>>
>> On Thu, Apr 1, 2010 at 11:47 PM, Michel Nadeau wrote:
>>> Well that's my problem: we have a lot of records of all types (affiliates,
>>> sales) so looping tons of records each time isn't possible.
>>>
>>> - Mike
>>> akaris@gmail.com
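
For readers without a Pig setup, the aggregation discussed in this thread (filter sales by date, group by affiliate, sum the amounts, then sort by the summed total) can be sketched in plain Python. This is only an in-memory illustration of the logic, not how Pig or Lucene execute it. The function name revenue_by_affiliate and the third sample row are made up for the example; the first two rows are the ones from the thread.

```python
from collections import defaultdict
from datetime import date

# (affiliate, sale date, sale amount)
SALES = [
    ("mike", date(2010, 3, 1), 10.00),  # row from the thread
    ("john", date(2010, 3, 1), 10.00),  # row from the thread
    ("mike", date(2010, 3, 2), 5.00),   # hypothetical extra row, so the totals differ
]


def revenue_by_affiliate(sales, start, end):
    """Sum amounts per affiliate for start < sale date < end, sorted by total."""
    totals = defaultdict(float)
    for affiliate, sale_date, amount in sales:
        # Same strict date window as the filter step in the Pig script.
        if start < sale_date < end:
            totals[affiliate] += amount
    # Sorting needs the finished sums, so it can only happen after the
    # grouping pass. Ties are broken by affiliate name.
    return sorted(totals.items(), key=lambda item: (item[1], item[0]))


if __name__ == "__main__":
    for affiliate, total in revenue_by_affiliate(
        SALES, date(2010, 2, 28), date(2010, 3, 6)
    ):
        print(affiliate, total)
```

The two-phase shape (accumulate all sums, then order them) is the point made earlier in the thread: the sort key does not exist until every matching record has been summed.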