From: "Dave Marion"
Reply-To: user@accumulo.apache.org
Subject: RE: accumulo for a bi-map?
Date: Tue, 16 Jul 2013 19:16:25 -0400

I’m not sure how familiar you are with Accumulo, but you do not need to specify your columns when you create the table. You could create a table that stores the feature vector for your source followed by columns for the related objects. Sounds like you are already thinking down this path. For example:

 

 

Row     Column Family   Column Qualifier   Value
----    -------------   ----------------   --------------
abcd                                       Feature vector
abcd    efgh            88
abcd    ijkl            90
ijkl                                       Feature vector
ijkl    abcd            90
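
As a rough illustration of the layout in the table above, here is a small Python sketch that models the sorted (row, column family, column qualifier) key space with an in-memory list. It is not the Accumulo client API: `put`, `add_match`, `scan_row`, and `matches_for` are hypothetical helper names, and the feature vector is a placeholder string. It assumes each match is written under both rows, so either key can be looked up with a single row scan.

```python
from bisect import bisect_left, insort

# In-memory stand-in for a sorted key-value table (an assumption for
# illustration only, not the Accumulo API). Entries are tuples of
# (row, colf, colq, value) and the list is kept in sorted order.
table = []


def put(row, colf, colq, value=""):
    """Insert one entry while keeping the table sorted by key."""
    insort(table, (row, colf, colq, value))


def add_match(a, b, score):
    """Record a symmetric match by writing one entry under each row."""
    put(a, b, str(score))
    put(b, a, str(score))


def scan_row(row):
    """Return every entry for one row via a range scan over sorted keys."""
    start = bisect_left(table, (row,))
    entries = []
    for entry in table[start:]:
        if entry[0] != row:
            break
        entries.append(entry)
    return entries


def matches_for(row):
    """Map each matched key to its score, skipping the feature vector entry."""
    return {colf: int(colq) for _, colf, colq, _ in scan_row(row) if colf}


# Populate with the rows from the example table.
put("abcd", "", "", "<feature vector>")
put("ijkl", "", "", "<feature vector>")
add_match("abcd", "efgh", 88)
add_match("abcd", "ijkl", 90)

print(matches_for("abcd"))
print(matches_for("ijkl"))
```

Writing both directions doubles the number of match entries, but every lookup stays a single contiguous scan of one row.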

 

 

The RFile format will compress repeating row, colf, and colq values down to 1. Not sure how you are searching, but you could switch the colq and colf in the example above to sort by relative score. Requirements change over time, so the table format above would also allow you to store different versions of the same relationship so that you could track the history over time if that became important. It would also allow you to provide a different score for each direction of the relationship if that matters later.
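
To make the colf/colq switch concrete, here is a hedged Python sketch of keys with the score in the column family and the matched key in the column qualifier. Since keys sort lexicographically rather than numerically, it assumes integer scores in 0..100 and zero-pads them. `encode_score` and `match_key` are hypothetical helpers, and the keys `mnop` and `qrst` are made up for the example. The `descending` option stores 100 minus the score, which is one possible way to get the best matches first.

```python
def encode_score(score, descending=False):
    """Zero-pad an integer score in 0..100 so string order matches numeric order."""
    if not 0 <= score <= 100:
        raise ValueError("score must be in 0..100")
    n = 100 - score if descending else score
    return f"{n:03d}"


def match_key(row, other, score, descending=False):
    """Build a (row, colf, colq) key with the score in colf and the match in colq."""
    return (row, encode_score(score, descending), other)


# Hypothetical matches for row "abcd": (matched key, score).
matches = [("efgh", 88), ("ijkl", 90), ("mnop", 100), ("qrst", 7)]

# Sorting the tuples mimics the lexicographic key order within a row.
ascending = sorted(match_key("abcd", other, s) for other, s in matches)
descending = sorted(match_key("abcd", other, s, descending=True) for other, s in matches)

for key in descending:
    print(key)
```

Without the padding, the string "100" would sort before "88", so the raw score text is only safe as a sort key while every score has the same number of digits.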

 

From: Marc Reichman [mailto:mreichman@pixelforensics.com]
Sent: Tuesday, July 16, 2013 5:28 PM
To: user@accumulo.apache.org
Subject: accumulo for a bi-map?

 

We are using accumulo as a mechanism to store feature data (binary byte[]) for some simple keys which are used for a search algorithm. We currently search by iterating over the feature space using AccumuloRowInputFormat. Results come out of a reducer into HDFS, currently in a SequenceFile.

 

A customer has asked if we can store our results somewhere in our Hadoop infrastructure, and also perform nightly searches of everything vs everything to keep match results up to date.

 

To me, the storage of the results in alternate column families (from the features) would be a way to store the matches alongside the key rows:

(key: abcd, features: {...}, matches: { 'm0': efgh-88%, 'm1': ijkl-90%, ..., 'mN': etc })

(key: ijkl, features: {...}, matches: { 'm0': efgh-88%, 'm1': abcd-90%, ..., 'mN': etc })

 

Match scores are equal between two items regardless of perspective, so if a->b is 90%, then b->a is also 90%.

 

Is there a way to simply add columns to an existing family without having to name them or keep track of how many there are? Am I better off making a column family for each match key and then storing the score and other fields in columns? Or making one column with the key as the name and the score as the value for each match under one family?

 

Ideally I would have some form of bidirectional map so I could look at any key and find all the results as other keys, and find any results to get other matches.

 

One approach is to simply add both sides of the relationship every time anything matches anything else, which seems a bit wasteful, space-wise.

 

Curious if any pre-existing ideas are out there. Currently on hadoop 1.0.3/accumulo 1.4.1, not set in (hard) concrete.

 

Thanks,

Marc

 

 
