Message-ID: <51E81C6E.8030700@gmail.com>
Date: Thu, 18 Jul 2013 12:48:46 -0400
From: Josh Elser
To: user@accumulo.apache.org
Subject: Re: accumulo for a bi-map?
References: <51E5D654.7090909@gmail.com>

Just be aware that if you have extremely wide matches (one record
matching many other records), you've now forced these records to only
ever be hosted on one tablet server (as a row cannot be split across
tablets). Given the size of what you outlined so far, you'd probably
have to get up to the scale of tens of millions before this is a
problem.

On 7/18/13 12:15 PM, Marc Reichman wrote:
> I have implemented an approach like Dave Marion's, where on a match
> during search I insert two rows:
>
> Row    Column Family    Column Qualifier    Value
> abcd   ijkl             90                  (empty)
> ijkl   abcd             90                  (empty)
>
> This works great for what I need to get, all abcd matches, all ijkl
> matches, specifically abcd->ijkl or reversed. For threshold filtering,
> I'm currently getting all of the results (from these cases) and then not
> retaining items below my threshold. I've looked at some ways to use a
> scan iterator to do this but I'm coming up short.
> Best idea I've had yet is to extend the ColumnQualifierFilter to see
> if I can do a "greater than" instead of an equals to decide whether to
> accept. Any thoughts?
>
> On Wed, Jul 17, 2013 at 10:26 AM, Marc Reichman wrote:
> > Thank you all for your responses. Some follow-up thoughts/questions:
> >
> > The use cases I'm chasing right now for retrieval are shaping up to be:
> > 1. Get one ABCD->IJKL match score
> > 2. Get all ABCD->* match scores
> > 3. Either of the above, only greater than a specified threshold.
> >
> > It's looking like the results may go into a different table than the
> > original features, so I can work a little more flexibly.
> >
> > So far, Dave Marion's approach seems most closely suited to this,
> > but in a different table I wouldn't get the features back if I just
> > did a basic scan for the row key without other factors, which would
> > satisfy use case #2. I can satisfy case #1 easily if I make the
> > targets (IJKL) a qualifier and constrain by it on my scan as needed.
> >
> > For #3, I'm a bit unsure of the best way to do this. A simple
> > solution would be to just pull all the results from the #1/#2 cases
> > and filter out undesirables in my client-side code. Assuming
> > key:source, fam:target, col:score, is there some form of iterator or
> > filter I could use to process the column names and throw out what I
> > don't want with decent data locality for the processing?
> >
> > Would it make any major impact if the scores were not integers but
> > doubles? I'm already anticipating having to parse doubles from the
> > scores as-stored in byte[] string form, but I don't know if the
> > performance impact would make any difference whether I do that
> > locally afterwards or in an iterator.
> >
> > I feel like this is close and I appreciate the guidance.
> >
> > Thanks,
> > Marc
> >
> > On Tue, Jul 16, 2013 at 6:25 PM, Josh Elser wrote:
> > > Instead of keeping all match scores inside of one Value, have
> > > you considered thinking about your data in terms of edges?
> > >
> > > key:abcd->efgh score, value:88%
> > > key:abcd->ijkl score, value:90%
> > > key:efgh->abcd score, value:88%
> > > key:ijkl->abcd score, value:90%
> > >
> > > If you do go the route of storing both directions in Accumulo, a
> > > structure like this will likely be much easier to maintain, as
> > > you're not trying to manage difficult aggregation rules for
> > > multiple updates to the matches for a single record.
> > > Additionally, you should get really good compression (and even
> > > better in 1.5) when you have large row prefixes (many matches
> > > for abcd will equate to abcd being stored "once").
> > >
> > > You could also store all of the features for a record in a key
> > > which only has the record in the row.
> > >
> > > key:abcd feature:foo1
> > > key:abcd feature:foo2
> > > etc.
> > >
> > > Also, I'd encourage you to try to upgrade to 1.5.0 if you can,
> > > but, if not, definitely update to 1.4.3 as it fixes a fair
> > > number of bugs. It's as simple as stopping Accumulo, and copying
> > > in the 1.4.3 Accumulo jar files to $ACCUMULO_HOME/lib, and
> > > removing the 1.4.1 jars.
> > >
> > > (apparently Dave Marion and I think alike)
> > >
> > > - Josh
> > >
> > > On 07/16/2013 05:28 PM, Marc Reichman wrote:
> > > > We are using accumulo as a mechanism to store feature data
> > > > (binary byte[]) for some simple keys which are used for a
> > > > search algorithm. We currently search by iterating over the
> > > > feature space using AccumuloRowInputFormat. Results come out
> > > > of a reducer into HDFS, currently in a SequenceFile.
> > > >
> > > > A customer has asked if we can store our results somewhere
> > > > in our Hadoop infrastructure, and also perform nightly
> > > > searches of everything vs everything to keep match results
> > > > up to date.
> > > >
> > > > To me, the storage of the results in alternate column
> > > > families (from the features) would be a way to store the
> > > > matches alongside the key rows:
> > > > (key: abcd, features:{...}, matches:{ 'm0': efgh-88%, 'm1': ijkl-90%, ..., 'mN': etc })
> > > > (key: ijkl, features:{...}, matches:{ 'm0': efgh-88%, 'm1': abcd-90%, ..., 'mN': etc })
> > > >
> > > > Match scores are equal between two items regardless of
> > > > perspective, so a->b is 90% just as b->a is 90%.
> > > >
> > > > Is there a way to simply add columns to an existing family
> > > > without having to name them or keep track of how many there
> > > > are? Am I better off making a column family for each match
> > > > key and then storing the score and other fields in columns? Or
> > > > making one column with the key as the name and the score as
> > > > the value for each match under one family?
> > > >
> > > > Ideally I would have some form of bidirectional map so I
> > > > could look at any key and find all the results as other
> > > > keys, and find any results to get other matches.
> > > >
> > > > One approach is to simply add both sides of the relationship
> > > > every time anything matches anything else, which seems a bit
> > > > wasteful, space-wise.
> > > >
> > > > Curious if any pre-existing ideas are out there. Currently
> > > > on hadoop 1.0.3/accumulo 1.4.1, not set in (hard) concrete.
> > > >
> > > > Thanks,
> > > > Marc
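
For anyone reading this thread in the archive, here is a rough sketch of the
layout discussed above: one entry per direction of each match, with the source
record as the row, the target as the column family and the score as the column
qualifier. It is plain Python over a sorted list, not Accumulo client code;
EdgeTable, put_match and scan are hypothetical names invented for
illustration, and the threshold check merely stands in for what a server-side
filter might do.

```python
from bisect import bisect_left


class EdgeTable:
    """Toy stand-in for a sorted key/value table (hypothetical, not the Accumulo API)."""

    def __init__(self):
        # Keys are (row, family, qualifier) tuples kept in sorted order.
        self._keys = []

    def put_match(self, a, b, score):
        # Write both directions so a lookup can start from either record.
        for row, family in ((a, b), (b, a)):
            key = (row, family, str(score))
            i = bisect_left(self._keys, key)
            if i == len(self._keys) or self._keys[i] != key:
                self._keys.insert(i, key)

    def scan(self, row, family=None, min_score=None):
        # Seek to the first key of the row, then read until the row changes.
        i = bisect_left(self._keys, (row,))
        while i < len(self._keys) and self._keys[i][0] == row:
            _, fam, qual = self._keys[i]
            i += 1
            if family is not None and fam != family:
                continue
            # Parse before comparing: as strings, "9" sorts after "100".
            score = float(qual)
            if min_score is not None and score <= min_score:
                continue
            yield fam, score


table = EdgeTable()
table.put_match("abcd", "efgh", 88)
table.put_match("abcd", "ijkl", 90)

print(list(table.scan("abcd")))                 # use case 2: all abcd->* scores
print(list(table.scan("ijkl", family="abcd")))  # use case 1: one pair, reversed
print(list(table.scan("abcd", min_score=89)))   # use case 3: above a threshold
```

Because the score is stored as text, the sketch parses it to a number before
comparing, which also covers the case where scores are doubles rather than
integers.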