Mailing-List: contact user-help@accumulo.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@accumulo.apache.org
Received-SPF: pass (athena.apache.org: domain of dlmarion@comcast.net
 designates 76.96.62.16 as permitted sender)
From: "Dave Marion" <dlmarion@comcast.net>
To: <user@accumulo.apache.org>
References: 
 <CADDp_G8g1V0A_LwQzVgCpHHPg9u77Ps5Ly6MwzWEowvK9udiOw@mail.gmail.com>
In-Reply-To: 
 <CADDp_G8g1V0A_LwQzVgCpHHPg9u77Ps5Ly6MwzWEowvK9udiOw@mail.gmail.com>
Subject: RE: accumulo for a bi-map?
Date: Tue, 16 Jul 2013 19:16:25 -0400
Message-ID: <008901ce827a$7e7152a0$7b53f7e0$@comcast.net>
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----=_NextPart_000_008A_01CE8258.F7646D90"
Thread-Index: AQK3Y51mzkN0vWDoUDNIEY+ZebZd/5eWKhyQ
Content-Language: en-us

This is a multipart message in MIME format.

------=_NextPart_000_008A_01CE8258.F7646D90
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: 7bit

I'm not sure how familiar you are with Accumulo, but you do not need to
specify your columns when you create the table. You could create a table
that stores the feature vector for your source followed by columns for the
related objects. Sounds like you are already thinking down this path. For
example:

 
Row   | Column Family | Column Qualifier | Value
------+---------------+------------------+---------------
abcd  |               |                  | Feature vector
abcd  | efgh          | 88               |
abcd  | ijkl          | 90               |
ijkl  |               |                  | Feature vector
ijkl  | abcd          | 90               |
 
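As a rough illustration of the layout in the table above, here is a minimal
Java sketch. It is an in-memory stand-in built on a TreeMap and does not use
the Accumulo client API; the class name MatchTableSketch, its method names,
and the NUL-separated key encoding are all assumptions made for this example.
It stores each match under both rows and then reads back every match for one
row with a single range lookup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical in-memory model of a sorted key-value table (not Accumulo itself).
class MatchTableSketch {
    // Join row, column family and column qualifier with NUL so that the
    // TreeMap orders entries by row, then family, then qualifier.
    static String key(String row, String colf, String colq) {
        return row + '\0' + colf + '\0' + colq;
    }

    // Feature vector entry: empty family and qualifier, data in the value.
    static void putFeatures(TreeMap<String, String> table, String row, String features) {
        table.put(key(row, "", ""), features);
    }

    // Store one match in both directions so either key can find the other.
    static void putMatch(TreeMap<String, String> table, String a, String b, int score) {
        table.put(key(a, b, Integer.toString(score)), "");
        table.put(key(b, a, Integer.toString(score)), "");
    }

    // Return "otherKey=score" for every match stored under the given row.
    static List<String> matchesFor(TreeMap<String, String> table, String row) {
        List<String> out = new ArrayList<>();
        // Every key of this row sorts between "row\0" and "row\1".
        for (String k : table.subMap(row + '\0', row + '\1').keySet()) {
            String[] parts = k.split("\0", -1);
            if (!parts[1].isEmpty()) { // skip the feature vector entry
                out.add(parts[1] + "=" + parts[2]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        putFeatures(table, "abcd", "<feature bytes>");
        putMatch(table, "abcd", "efgh", 88);
        putMatch(table, "abcd", "ijkl", 90);
        System.out.println(matchesFor(table, "abcd"));
        System.out.println(matchesFor(table, "ijkl"));
    }
}
```

Writing both directions costs a second entry per match, but it means a lookup
by either key is one contiguous range read.
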
The RFile format compresses repeated row, colf, and colq values, so a value
shared by consecutive keys is not written out again each time. Not sure how
you are searching, but you could switch the colq and colf in the example
above to sort by relative score. Requirements change over time, and the table
format above would also let you store different versions of the same
relationship, so you could track its history if that becomes important. It
would also let you store a different score for each direction of the
relationship if that matters later.
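To illustrate switching the two column fields so that matches sort by score,
here is a minimal Java sketch, again using a TreeMap as a stand-in rather
than the Accumulo client API. The class name ScoreOrderSketch, its method
names, and the three-digit zero padding are assumptions for this example.
The padding matters because keys compare as text, so an unpadded "100" would
sort before "88".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical in-memory model: key = row, score, related key (fields swapped).
class ScoreOrderSketch {
    // Zero-pad so text order matches numeric order ("088" < "090" < "100").
    static String scoreField(int percent) {
        return String.format("%03d", percent);
    }

    // Store the match under both rows, with the score ahead of the other key.
    static void putMatch(TreeMap<String, String> table, String a, String b, int percent) {
        table.put(a + '\0' + scoreField(percent) + '\0' + b, "");
        table.put(b + '\0' + scoreField(percent) + '\0' + a, "");
    }

    // Return "otherKey=score" for one row, in ascending score order.
    static List<String> matchesByScore(TreeMap<String, String> table, String row) {
        List<String> out = new ArrayList<>();
        for (String k : table.subMap(row + '\0', row + '\1').keySet()) {
            String[] p = k.split("\0", -1);
            out.add(p[2] + "=" + Integer.parseInt(p[1]));
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        putMatch(table, "abcd", "mnop", 100);
        putMatch(table, "abcd", "ijkl", 90);
        putMatch(table, "abcd", "efgh", 88);
        System.out.println(matchesByScore(table, "abcd"));
    }
}
```

If the best matches should come first, one option is to store an inverted
value such as 100 minus the score in the same padded form.
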

 
From: Marc Reichman [mailto:mreichman@pixelforensics.com] 
Sent: Tuesday, July 16, 2013 5:28 PM
To: user@accumulo.apache.org
Subject: accumulo for a bi-map?

 
We are using Accumulo as a mechanism to store feature data (binary byte[])
for some simple keys which are used for a search algorithm. We currently
search by iterating over the feature space using AccumuloRowInputFormat.
Results come out of a reducer into HDFS, currently in a SequenceFile.

 
A customer has asked if we can store our results somewhere in our Hadoop
infrastructure, and also perform nightly searches of everything vs
everything to keep match results up to date.

 
To me, the storage of the results in alternate column families (from the
features) would be a way to store the matches alongside the key rows:

(key: abcd, features: {...}, matches: { 'm0': efgh-88%, 'm1': ijkl-90%, ...,
'mN': etc })

(key: ijkl, features: {...}, matches: { 'm0': efgh-88%, 'm1': abcd-90%, ...,
'mN': etc })

 
Match scores are equal between two items regardless of perspective, so if
a->b is 90%, then b->a is also 90%.

 
Is there a way to simply add columns to an existing family without having to
name them or keep track of how many there are? Am I better off making a
column family for each match key and then storing the score and other fields
in columns? Or making one column per match under a single family, with the
key as the name and the score as the value?

 
Ideally I would have some form of bidirectional map, so I could look up any
key and find all of its results as other keys, then look up any of those
results to get their matches.

 
One approach is to simply add both sides of the relationship every time
anything matches anything else, which seems a bit wasteful, space-wise.

 
Curious if any pre-existing ideas are out there. Currently on Hadoop
1.0.3/Accumulo 1.4.1, not set in (hard) concrete.

 
Thanks,

Marc

 
------=_NextPart_000_008A_01CE8258.F7646D90
Content-Type: text/html;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

<html xmlns:v=3D"urn:schemas-microsoft-com:vml" =
xmlns:o=3D"urn:schemas-microsoft-com:office:office" =
xmlns:w=3D"urn:schemas-microsoft-com:office:word" =
xmlns:m=3D"http://schemas.microsoft.com/office/2004/12/omml" =
xmlns=3D"http://www.w3.org/TR/REC-html40"><head><meta =
http-equiv=3DContent-Type content=3D"text/html; =
charset=3Dus-ascii"><meta name=3DGenerator content=3D"Microsoft Word 14 =
(filtered medium)"><style><!--
/* Font Definitions */
@font-face
	{font-family:"Cordia New";
	panose-1:2 11 3 4 2 2 2 2 2 4;}
@font-face
	{font-family:"Cordia New";
	panose-1:2 11 3 4 2 2 2 2 2 4;}
@font-face
	{font-family:Calibri;
	panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
	{font-family:Tahoma;
	panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
	{margin:0in;
	margin-bottom:.0001pt;
	font-size:12.0pt;
	font-family:"Times New Roman","serif";}
a:link, span.MsoHyperlink
	{mso-style-priority:99;
	color:blue;
	text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
	{mso-style-priority:99;
	color:purple;
	text-decoration:underline;}
span.EmailStyle17
	{mso-style-type:personal-reply;
	font-family:"Calibri","sans-serif";
	color:#1F497D;}
.MsoChpDefault
	{mso-style-type:export-only;
	font-family:"Calibri","sans-serif";}
@page WordSection1
	{size:8.5in 11.0in;
	margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
	{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext=3D"edit" spidmax=3D"1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext=3D"edit">
<o:idmap v:ext=3D"edit" data=3D"1" />
</o:shapelayout></xml><![endif]--></head><body lang=3DEN-US link=3Dblue =
vlink=3Dpurple><div class=3DWordSection1><p class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'>I&#8217;m not sure how familiar you are with Accumulo, but you do not =
need to specify your columns when you create the table. You could create =
a table that stores the feature vector for your source followed by =
columns for the related objects. Sounds like you are already thinking =
down this path. For example:<o:p></o:p></span></p><p =
class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'><o:p>&nbsp;</o:p></span></p><p class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'><o:p>&nbsp;</o:p></span></p><table class=3DMsoTableGrid border=3D1 =
cellspacing=3D0 cellpadding=3D0 =
style=3D'border-collapse:collapse;border:none'><tr><td width=3D160 =
valign=3Dtop style=3D'width:119.7pt;border:solid windowtext =
1.0pt;padding:0in 5.4pt 0in 5.4pt'><p class=3DMsoNormal align=3Dcenter =
style=3D'text-align:center'><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'>Row<o:p></o:p></span></p></td><td width=3D160 valign=3Dtop =
style=3D'width:119.7pt;border:solid windowtext =
1.0pt;border-left:none;padding:0in 5.4pt 0in 5.4pt'><p class=3DMsoNormal =
align=3Dcenter style=3D'text-align:center'><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'>Column Family<o:p></o:p></span></p></td><td width=3D160 valign=3Dtop =
style=3D'width:119.7pt;border:solid windowtext =
1.0pt;border-left:none;padding:0in 5.4pt 0in 5.4pt'><p class=3DMsoNormal =
align=3Dcenter style=3D'text-align:center'><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'>Column Qualifier<o:p></o:p></span></p></td><td width=3D160 =
valign=3Dtop style=3D'width:119.7pt;border:solid windowtext =
1.0pt;border-left:none;padding:0in 5.4pt 0in 5.4pt'><p class=3DMsoNormal =
align=3Dcenter style=3D'text-align:center'><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'>Value<o:p></o:p></span></p></td></tr><tr><td width=3D160 valign=3Dtop =
style=3D'width:119.7pt;border:solid windowtext =
1.0pt;border-top:none;padding:0in 5.4pt 0in 5.4pt'><p =
class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'>abcd<o:p></o:p></span></p></td><td width=3D160 valign=3Dtop =
style=3D'width:119.7pt;border-top:none;border-left:none;border-bottom:sol=
id windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:0in =
5.4pt 0in 5.4pt'><p class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'><o:p>&nbsp;</o:p></span></p></td><td width=3D160 valign=3Dtop =
style=3D'width:119.7pt;border-top:none;border-left:none;border-bottom:sol=
id windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:0in =
5.4pt 0in 5.4pt'><p class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'><o:p>&nbsp;</o:p></span></p></td><td width=3D160 valign=3Dtop =
style=3D'width:119.7pt;border-top:none;border-left:none;border-bottom:sol=
id windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:0in =
5.4pt 0in 5.4pt'><p class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'>Feature vector<o:p></o:p></span></p></td></tr><tr><td width=3D160 =
valign=3Dtop style=3D'width:119.7pt;border:solid windowtext =
1.0pt;border-top:none;padding:0in 5.4pt 0in 5.4pt'><p =
class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'>abcd<o:p></o:p></span></p></td><td width=3D160 valign=3Dtop =
style=3D'width:119.7pt;border-top:none;border-left:none;border-bottom:sol=
id windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:0in =
5.4pt 0in 5.4pt'><p class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'>efgh<o:p></o:p></span></p></td><td width=3D160 valign=3Dtop =
style=3D'width:119.7pt;border-top:none;border-left:none;border-bottom:sol=
id windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:0in =
5.4pt 0in 5.4pt'><p class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'>88<o:p></o:p></span></p></td><td width=3D160 valign=3Dtop =
style=3D'width:119.7pt;border-top:none;border-left:none;border-bottom:sol=
id windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:0in =
5.4pt 0in 5.4pt'><p class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'><o:p>&nbsp;</o:p></span></p></td></tr><tr><td width=3D160 =
valign=3Dtop style=3D'width:119.7pt;border:solid windowtext =
1.0pt;border-top:none;padding:0in 5.4pt 0in 5.4pt'><p =
class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'>abcd<o:p></o:p></span></p></td><td width=3D160 valign=3Dtop =
style=3D'width:119.7pt;border-top:none;border-left:none;border-bottom:sol=
id windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:0in =
5.4pt 0in 5.4pt'><p class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'>ijkl<o:p></o:p></span></p></td><td width=3D160 valign=3Dtop =
style=3D'width:119.7pt;border-top:none;border-left:none;border-bottom:sol=
id windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:0in =
5.4pt 0in 5.4pt'><p class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'>90<o:p></o:p></span></p></td><td width=3D160 valign=3Dtop =
style=3D'width:119.7pt;border-top:none;border-left:none;border-bottom:sol=
id windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:0in =
5.4pt 0in 5.4pt'><p class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'><o:p>&nbsp;</o:p></span></p></td></tr><tr><td width=3D160 =
valign=3Dtop style=3D'width:119.7pt;border:solid windowtext =
1.0pt;border-top:none;padding:0in 5.4pt 0in 5.4pt'><p =
class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'>ijkl<o:p></o:p></span></p></td><td width=3D160 valign=3Dtop =
style=3D'width:119.7pt;border-top:none;border-left:none;border-bottom:sol=
id windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:0in =
5.4pt 0in 5.4pt'><p class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'><o:p>&nbsp;</o:p></span></p></td><td width=3D160 valign=3Dtop =
style=3D'width:119.7pt;border-top:none;border-left:none;border-bottom:sol=
id windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:0in =
5.4pt 0in 5.4pt'><p class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'><o:p>&nbsp;</o:p></span></p></td><td width=3D160 valign=3Dtop =
style=3D'width:119.7pt;border-top:none;border-left:none;border-bottom:sol=
id windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:0in =
5.4pt 0in 5.4pt'><p class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'>Feature vector<o:p></o:p></span></p></td></tr><tr><td width=3D160 =
valign=3Dtop style=3D'width:119.7pt;border:solid windowtext =
1.0pt;border-top:none;padding:0in 5.4pt 0in 5.4pt'><p =
class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'>ijkl<o:p></o:p></span></p></td><td width=3D160 valign=3Dtop =
style=3D'width:119.7pt;border-top:none;border-left:none;border-bottom:sol=
id windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:0in =
5.4pt 0in 5.4pt'><p class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'>abcd<o:p></o:p></span></p></td><td width=3D160 valign=3Dtop =
style=3D'width:119.7pt;border-top:none;border-left:none;border-bottom:sol=
id windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:0in =
5.4pt 0in 5.4pt'><p class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'>90<o:p></o:p></span></p></td><td width=3D160 valign=3Dtop =
style=3D'width:119.7pt;border-top:none;border-left:none;border-bottom:sol=
id windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:0in =
5.4pt 0in 5.4pt'><p class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'><o:p>&nbsp;</o:p></span></p></td></tr></table><p =
class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'><o:p>&nbsp;</o:p></span></p><p class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'> The RFile format compresses repeated row, colf, and colq values, so =
a value shared by consecutive keys is not written out again each time. =
Not sure how you are searching, but you could switch the colq and colf =
in the example above to sort by relative score. Requirements change =
over time, and the table format above would also let you store =
different versions of the same relationship, so you could track its =
history if that becomes important. It would also let you store a =
different score for each direction of the relationship if that matters =
later.<o:p></o:p></span></p><p =
class=3DMsoNormal><span =
style=3D'font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497=
D'><o:p>&nbsp;</o:p></span></p><p class=3DMsoNormal><b><span =
style=3D'font-size:10.0pt;font-family:"Tahoma","sans-serif"'>From:</span>=
</b><span style=3D'font-size:10.0pt;font-family:"Tahoma","sans-serif"'> =
Marc Reichman [mailto:mreichman@pixelforensics.com] <br><b>Sent:</b> =
Tuesday, July 16, 2013 5:28 PM<br><b>To:</b> =
user@accumulo.apache.org<br><b>Subject:</b> accumulo for a =
bi-map?<o:p></o:p></span></p><p =
class=3DMsoNormal><o:p>&nbsp;</o:p></p><div><p class=3DMsoNormal>We are =
using accumulo as a mechanism to store feature data (binary byte[]) for =
some simple keys which are used for a search algorithm. We currently =
search by iterating over the feature space using AccumuloRowInputFormat. =
Results come out of a reducer into HDFS, currently in a =
SequenceFile.<o:p></o:p></p><div><p =
class=3DMsoNormal><o:p>&nbsp;</o:p></p></div><div><p class=3DMsoNormal>A =
customer has asked if we can store our results somewhere in our Hadoop =
infrastructure, and also perform nightly searches of everything vs =
everything to keep match results up to date.<o:p></o:p></p></div><div><p =
class=3DMsoNormal><o:p>&nbsp;</o:p></p></div><div><p =
class=3DMsoNormal>To me, the storage of the results in alternate column =
families (from the features) would be a way to store the matches =
alongside the key rows:<o:p></o:p></p></div><div><p =
class=3DMsoNormal>(key: abcd, features:{...}, matches{ 'm0: efgh-88%, =
'm1': ijkl-90%, ..., 'mN': etc }<o:p></o:p></p></div><div><p =
class=3DMsoNormal>(key: ijkl, features:{...}, matches{ 'm0: efgh-88%, =
'm1': abcd-90%, ..., 'mN': etc }<o:p></o:p></p></div><div><p =
class=3DMsoNormal><o:p>&nbsp;</o:p></p></div><div><p =
class=3DMsoNormal>Match scores are equal between two items regardless of =
perspective, so a-&gt;b is 90% as b-&gt;a is =
90%.<o:p></o:p></p></div><div><p =
class=3DMsoNormal><o:p>&nbsp;</o:p></p></div><div><p =
class=3DMsoNormal>Is there a way to simply add columns to an existing =
family without having to name them or keep track of how many there are? =
Am I better off making a column family for each match key and then store =
score and other fields in columns? Making one column with the key as the =
name and the score as the value for each match under one =
family?<o:p></o:p></p></div><div><p =
class=3DMsoNormal><o:p>&nbsp;</o:p></p></div><div><p =
class=3DMsoNormal>Ideally I would have some form of bidirectional map so =
I could look at any key and find all the results as other keys, and find =
any results to get other matches.<o:p></o:p></p></div><div><p =
class=3DMsoNormal><o:p>&nbsp;</o:p></p></div><div><p =
class=3DMsoNormal>One approach is to simply add both sides of the =
relationship every time anything matches anything else, which seems a =
bit wasteful, space-wise.<o:p></o:p></p></div><div><p =
class=3DMsoNormal><o:p>&nbsp;</o:p></p></div><div><p =
class=3DMsoNormal>Curious if any pre-existing ideas are out there. =
Currently on hadoop 1.0.3/accumulo 1.4.1, not set in (hard) =
concrete.<o:p></o:p></p></div><div><p =
class=3DMsoNormal><o:p>&nbsp;</o:p></p></div><div><p =
class=3DMsoNormal>Thanks,<o:p></o:p></p></div><div><p =
class=3DMsoNormal>Marc<o:p></o:p></p></div><div><p =
class=3DMsoNormal><o:p>&nbsp;</o:p></p></div><div><p =
class=3DMsoNormal><o:p>&nbsp;</o:p></p></div></div></div></body></html>
------=_NextPart_000_008A_01CE8258.F7646D90--