Mailing-List: contact user-help@accumulo.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@accumulo.apache.org
Message-ID: <51E5D654.7090909@gmail.com>
Date: Tue, 16 Jul 2013 19:25:08 -0400
From: Josh Elser <josh.elser@gmail.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:17.0) Gecko/20130610 Thunderbird/17.0.6
MIME-Version: 1.0
To: user@accumulo.apache.org
Subject: Re: accumulo for a bi-map?
References: 
 <CADDp_G8g1V0A_LwQzVgCpHHPg9u77Ps5Ly6MwzWEowvK9udiOw@mail.gmail.com>
In-Reply-To: 
 <CADDp_G8g1V0A_LwQzVgCpHHPg9u77Ps5Ly6MwzWEowvK9udiOw@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Instead of keeping all match scores inside of one Value, have you
considered modeling your data in terms of edges?

key:abcd->efgh score, value:88%
key:abcd->ijkl score, value:90%
key:efgh->abcd score, value:88%
key:ijkl->abcd score, value:90%
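
As a rough sketch of the edge layout above, the following uses a plain
java.util.TreeMap as a stand-in for a sorted Accumulo table. It does not
use the Accumulo client API. The class name EdgeSketch, the "->" key
separator, and the integer percent scores are assumptions made only for
illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a sorted key/value table; not the Accumulo API.
final class EdgeSketch {
    // Keys sort lexicographically, as keys do in an Accumulo table.
    private final TreeMap<String, Integer> table = new TreeMap<>();

    // Record one match by writing both directions as separate entries.
    void addMatch(String a, String b, int scorePercent) {
        table.put(a + "->" + b, scorePercent);
        table.put(b + "->" + a, scorePercent);
    }

    // All matches for one record: a range scan over the "record->" prefix.
    List<String> matchesFor(String record) {
        String start = record + "->";
        // Character.MAX_VALUE sorts after any character used in these keys.
        String end = start + Character.MAX_VALUE;
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : table.subMap(start, end).entrySet()) {
            out.add(e.getKey().substring(start.length()) + "=" + e.getValue() + "%");
        }
        return out;
    }

    public static void main(String[] args) {
        EdgeSketch t = new EdgeSketch();
        t.addMatch("abcd", "efgh", 88);
        t.addMatch("abcd", "ijkl", 90);
        System.out.println(t.matchesFor("abcd")); // [efgh=88%, ijkl=90%]
        System.out.println(t.matchesFor("ijkl")); // [abcd=90%]
    }
}
```

Re-running addMatch for the same pair simply overwrites the two entries,
so an updated score needs no merge logic.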

If you do go the route of storing both directions in Accumulo, a 
structure like this will likely be much easier to maintain, as you're 
not trying to manage difficult aggregation rules for multiple updates to 
the matches for a single record. Additionally, you should get really 
good compression (and even better in 1.5) when you have large row 
prefixes (many matches for abcd will equate to abcd being stored "once").
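
To illustrate why a shared prefix costs so little, here is a toy
front-coding pass over sorted keys. This is only a sketch of the general
idea and is not Accumulo's actual on-disk encoding; the class name
PrefixSketch and the "<shared length>|<suffix>" output format are
assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy front-coding of sorted keys; NOT Accumulo's actual file format.
final class PrefixSketch {
    // Encode each key as "<shared prefix length>|<remaining suffix>",
    // where the prefix is shared with the previous key in sorted order.
    static List<String> encode(List<String> sortedKeys) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String key : sortedKeys) {
            int shared = 0;
            int max = Math.min(prev.length(), key.length());
            while (shared < max && prev.charAt(shared) == key.charAt(shared)) {
                shared++;
            }
            out.add(shared + "|" + key.substring(shared));
            prev = key;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("abcd->efgh", "abcd->ijkl", "abcd->mnop");
        System.out.println(encode(keys)); // [0|abcd->efgh, 6|ijkl, 6|mnop]
    }
}
```

Only the first key carries "abcd->" in full; later keys store just the
part that differs.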

You could also store all of the features for a record in a key which 
only has the record in the row.

key:abcd feature:foo1
key:abcd feature:foo2
etc.
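
A minimal sketch of that row-only layout, again with a TreeMap standing
in for the table rather than the real Accumulo API. The RowSketch class,
the "feature" and "match" column family names, and the string
placeholders for the binary feature values are all assumptions.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of row / column family / qualifier keys; not the Accumulo API.
final class RowSketch {
    private static final char SEP = '\0'; // sorts before any printable character

    private final TreeMap<String, String> table = new TreeMap<>();

    void put(String row, String family, String qualifier, String value) {
        table.put(row + SEP + family + SEP + qualifier, value);
    }

    // Return qualifier -> value for one column family of one row.
    Map<String, String> fetch(String row, String family) {
        String start = row + SEP + family + SEP;
        String end = start + Character.MAX_VALUE;
        Map<String, String> out = new TreeMap<>();
        for (Map.Entry<String, String> e : table.subMap(start, end).entrySet()) {
            out.put(e.getKey().substring(start.length()), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        RowSketch t = new RowSketch();
        t.put("abcd", "feature", "foo1", "<bytes>");
        t.put("abcd", "feature", "foo2", "<bytes>");
        t.put("abcd", "match", "efgh", "88%");
        System.out.println(t.fetch("abcd", "feature").keySet()); // [foo1, foo2]
        System.out.println(t.fetch("abcd", "match"));            // {efgh=88%}
    }
}
```

Because the match partner is the column qualifier, new matches can be
added without numbering them or tracking how many already exist.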

Also, I'd encourage you to upgrade to 1.5.0 if you can; if not,
definitely update to 1.4.3, as it fixes a fair number of bugs. That
update is as simple as stopping Accumulo, copying the 1.4.3 Accumulo jar
files into $ACCUMULO_HOME/lib, and removing the 1.4.1 jars.

(apparently Dave Marion and I think alike)

- Josh

On 07/16/2013 05:28 PM, Marc Reichman wrote:
> We are using accumulo as a mechanism to store feature data (binary 
> byte[]) for some simple keys which are used for a search algorithm. We 
> currently search by iterating over the feature space using 
> AccumuloRowInputFormat. Results come out of a reducer into HDFS, 
> currently in a SequenceFile.
>
> A customer has asked if we can store our results somewhere in our 
> Hadoop infrastructure, and also perform nightly searches of everything 
> vs everything to keep match results up to date.
>
> To me, the storage of the results in alternate column families (from
> the features) would be a way to store the matches alongside the
> key rows:
> (key: abcd, features:{...}, matches:{ 'm0': efgh-88%, 'm1': ijkl-90%,
> ..., 'mN': etc })
> (key: ijkl, features:{...}, matches:{ 'm0': efgh-88%, 'm1': abcd-90%,
> ..., 'mN': etc })
>
> Match scores are equal between two items regardless of perspective, so
> if a->b is 90%, then b->a is also 90%.
>
> Is there a way to simply add columns to an existing family without
> having to name them or keep track of how many there are? Am I better
> off making a column family for each match key and then storing score
> and other fields in columns? Or making one column per match under one
> family, with the key as the name and the score as the value?
>
> Ideally I would have some form of bidirectional map so I could look at 
> any key and find all the results as other keys, and find any results 
> to get other matches.
>
> One approach is to simply add both sides of the relationship every 
> time anything matches anything else, which seems a bit wasteful, 
> space-wise.
>
> Curious if any pre-existing ideas are out there. Currently on hadoop 
> 1.0.3/accumulo 1.4.1, not set in (hard) concrete.
>
> Thanks,
> Marc
>
>