Message-ID: <51E5D654.7090909@gmail.com>
Date: Tue, 16 Jul 2013 19:25:08 -0400
From: Josh Elser
To: user@accumulo.apache.org
Subject: Re: accumulo for a bi-map?

Instead of keeping all match scores inside of one Value, have you
considered thinking about your data in terms of edges?

key:abcd->efgh score, value:88%
key:abcd->ijkl score, value:90%
key:efgh->abcd score, value:88%
key:ijkl->abcd score, value:90%

If you do go the route of storing both directions in Accumulo, a
structure like this will likely be much easier to maintain, as you're
not trying to manage difficult aggregation rules for multiple updates
to the matches for a single record. Additionally, you should get really
good compression (and even better in 1.5) when you have large row
prefixes (many matches for abcd will equate to abcd being stored
"once").

You could also store all of the features for a record in a key which
only has the record in the row.

key:abcd feature:foo1
key:abcd feature:foo2
etc.

Also, I'd encourage you to try to upgrade to 1.5.0 if you can, but, if
not, definitely update to 1.4.3, as it fixes a fair number of bugs. It's
as simple as stopping Accumulo, copying the 1.4.3 Accumulo jar files
into $ACCUMULO_HOME/lib, and removing the 1.4.1 jars.

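To make the edge layout above concrete, here is a minimal sketch in
Python. It does not use the Accumulo client API: SortedTable is a
hypothetical in-memory stand-in for a table whose keys are kept sorted,
and record_match and matches_for are illustrative helper names, not
anything from the thread. The "src->dst" key format follows the example
keys above; the rest is an assumption.

```python
from bisect import bisect_left, insort


class SortedTable:
    """In-memory stand-in for a sorted key/value table (not the Accumulo API)."""

    def __init__(self):
        self._keys = []    # keys kept in sorted order, as a sorted store would
        self._values = {}  # key -> latest value written

    def put(self, key, value):
        # A repeated put to the same key simply replaces the value here.
        if key not in self._values:
            insort(self._keys, key)
        self._values[key] = value

    def scan_prefix(self, prefix):
        """Yield (key, value) pairs whose key starts with prefix, in key order."""
        i = bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            key = self._keys[i]
            yield key, self._values[key]
            i += 1


def record_match(table, a, b, score):
    """Write the match once per direction so either record can find it."""
    table.put(f"{a}->{b}", score)
    table.put(f"{b}->{a}", score)


def matches_for(table, record):
    """Return {other_record: score} for every edge leaving `record`."""
    prefix = f"{record}->"
    return {key[len(prefix):]: score for key, score in table.scan_prefix(prefix)}


table = SortedTable()
record_match(table, "abcd", "efgh", 88)
record_match(table, "abcd", "ijkl", 90)
record_match(table, "abcd", "efgh", 91)  # a later run replaces the old score
```

Each match is one independent entry per direction, so updating a score
is a plain overwrite of two keys rather than a read-modify-write of a
combined Value, and looking up everything that matches a record is a
single prefix scan such as matches_for(table, "abcd").
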
(apparently Dave Marion and I think alike)

- Josh

On 07/16/2013 05:28 PM, Marc Reichman wrote:
> We are using accumulo as a mechanism to store feature data (binary
> byte[]) for some simple keys which are used for a search algorithm. We
> currently search by iterating over the feature space using
> AccumuloRowInputFormat. Results come out of a reducer into HDFS,
> currently in a SequenceFile.
>
> A customer has asked if we can store our results somewhere in our
> Hadoop infrastructure, and also perform nightly searches of everything
> vs everything to keep match results up to date.
>
> To me, the storage of the results in alternate column families (from
> the features) would be a way to store the matches alongside the
> key rows:
> (key: abcd, features:{...}, matches:{ 'm0': efgh-88%, 'm1': ijkl-90%,
> ..., 'mN': etc })
> (key: ijkl, features:{...}, matches:{ 'm0': efgh-88%, 'm1': abcd-90%,
> ..., 'mN': etc })
>
> Match scores are equal between two items regardless of perspective, so
> a->b is 90% as b->a is 90%.
>
> Is there a way to simply add columns to an existing family without
> having to name them or keep track of how many there are? Am I better
> off making a column family for each match key and then storing score
> and other fields in columns? Or making one column with the key as the
> name and the score as the value for each match under one family?
>
> Ideally I would have some form of bidirectional map so I could look at
> any key and find all the results as other keys, and find any results
> to get other matches.
>
> One approach is to simply add both sides of the relationship every
> time anything matches anything else, which seems a bit wasteful,
> space-wise.
>
> Curious if any pre-existing ideas are out there. Currently on hadoop
> 1.0.3/accumulo 1.4.1, not set in (hard) concrete.
>
> Thanks,
> Marc