From: Benson Margulies
To: accumulo-user@incubator.apache.org
Date: Fri, 2 Mar 2012 15:59:34 -0500
Subject: Writing an iterator that calculates on compaction

Folks,

I am trying to get organized to get my feet wet in using the ability of
accumulo to compute near the data. I beg your pardon in advance for the
following exercise in laying out what I have in mind and asking for some
pointers -- particularly to examples on the 1.4 branch of code that I
could warp to achieve my nefarious purposes.

So, start with this data model:

  ROWID   CF          CQ         V
  itemid  'context'   dimension  value
  itemid  something   else       entirely...

In short, for an 'item', there's a sparse feature vector associated with
it (identified by cf='context'), and some other things.

Meanwhile, in another table we have:

  ROWID      CF       CQ         V
  clusterid  'items'  itemid1    -blank-
  clusterid  'items'  itemid2    -blank-

In other words, a cluster is a grouping of the items from the first
group, identified by their rowids.

My initial test of my ability to find my way around a brightly lit room
with a flashlight is to calculate the centroids of these clusters, and
store them as an additional CF:

    CF='centroid' CQ=dimension V=value

And then my second test is to calculate the distance from each item to
the centroid of its cluster, and store that.

Finally, I want to peruse items in descending order of their
distance-from-centroid values.

TIA
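
For reference, the arithmetic described above can be sketched in plain Java, outside Accumulo entirely. This is only an illustration under stated assumptions, not an iterator: it uses no Accumulo classes, it assumes Euclidean distance (the message does not name a metric), it treats a dimension missing from a sparse vector as 0, and the class name CentroidSketch and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of the arithmetic only; no Accumulo classes are used. */
class CentroidSketch {

    /** Centroid of sparse vectors: per-dimension sum divided by the item count. */
    static Map<String, Double> centroid(List<Map<String, Double>> items) {
        Map<String, Double> sums = new HashMap<>();
        for (Map<String, Double> item : items) {
            for (Map.Entry<String, Double> e : item.entrySet()) {
                sums.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Double> e : sums.entrySet()) {
            result.put(e.getKey(), e.getValue() / items.size());
        }
        return result;
    }

    /** Euclidean distance (an assumption); an absent dimension counts as 0. */
    static double distance(Map<String, Double> a, Map<String, Double> b) {
        double sumSq = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            double d = e.getValue() - b.getOrDefault(e.getKey(), 0.0);
            sumSq += d * d;
        }
        // Dimensions present only in b were not visited by the loop above.
        for (Map.Entry<String, Double> e : b.entrySet()) {
            if (!a.containsKey(e.getKey())) {
                sumSq += e.getValue() * e.getValue();
            }
        }
        return Math.sqrt(sumSq);
    }

    /** Item ids of one cluster, ordered by descending distance from its centroid. */
    static List<String> byDescendingDistance(Map<String, Map<String, Double>> cluster) {
        Map<String, Double> c = centroid(new ArrayList<>(cluster.values()));
        List<String> ids = new ArrayList<>(cluster.keySet());
        ids.sort(Comparator.comparingDouble(
                (String id) -> distance(cluster.get(id), c)).reversed());
        return ids;
    }

    public static void main(String[] args) {
        // Hypothetical cluster: itemid -> (dimension -> value), i.e. the 'context' CF.
        Map<String, Double> item3 = new HashMap<>();
        item3.put("x", 3.0);
        item3.put("y", 6.0);
        Map<String, Map<String, Double>> cluster = new HashMap<>();
        cluster.put("item1", Collections.singletonMap("x", 1.0));
        cluster.put("item2", Collections.singletonMap("x", 2.0));
        cluster.put("item3", item3);

        List<Map<String, Double>> vectors = Arrays.asList(
                cluster.get("item1"), cluster.get("item2"), cluster.get("item3"));
        System.out.println("centroid = " + centroid(vectors));
        System.out.println("order    = " + byDescendingDistance(cluster));
    }
}
```

In the data model above, each inner map corresponds to the 'context' entries of one item row, and the centroid map corresponds to the proposed CF='centroid' entries of a cluster row. Whether this computation belongs in a compaction-time iterator or in a client-side job is left open here.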