From: java8964 <java8964@hotmail.com>
To: user@hadoop.apache.org
Subject: Question related to Decompressor interface
Date: Sat, 9 Feb 2013 15:49:31 -0500

Hi,

Currently I am researching options for encrypting data in MapReduce, as we plan to use the Amazon EMR or EC2 services for our data.

I am thinking that the compression codec is a good place to integrate the encryption logic, and I found that some people have had the same idea. I googled around and found this code:

https://github.com/geisbruch/HadoopCryptoCompressor/

It doesn't seem maintained any more, but it gave me a starting point. I downloaded the source code and tried some tests with it.

It doesn't work out of the box; there are some bugs I had to fix to make it work. I believe it includes 'AES' as an example algorithm.
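For background on how the codec gets picked up at all: as far as I understand it, Hadoop's CompressionCodecFactory maps an input file to a codec by its file extension, which is how foo.log.crypto (below) ends up routed to this code. Here is a small sketch of that lookup; the CryptoCodec class name is just my placeholder for whatever the project actually calls its codec class:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecLookup {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Register the codec classes; the factory maps file extensions to
    // codecs using each codec's getDefaultExtension().
    conf.set("io.compression.codecs",
        "org.apache.hadoop.io.compress.GzipCodec,"
        + "org.apache.hadoop.io.compress.crypto.CryptoCodec"); // placeholder name
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(new Path("foo.log.crypto"));
    System.out.println(codec == null ? "no codec matched" : codec.getClass().getName());
  }
}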
But right now I am facing a problem when I try to use it in my test MapReduce program. Here is the stack trace I got:

2013-02-08 23:16:47,038 INFO org.apache.hadoop.io.compress.crypto.CryptoBasicDecompressor: buf length = 512, and offset = 0, length = -132967308
java.lang.IndexOutOfBoundsException
    at java.nio.ByteBuffer.wrap(ByteBuffer.java:352)
    at org.apache.hadoop.io.compress.crypto.CryptoBasicDecompressor.setInput(CryptoBasicDecompressor.java:100)
    at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:97)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:83)
    at java.io.InputStream.read(InputStream.java:82)
    at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:209)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:173)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:114)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:458)
    at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:645)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)

I know the error is thrown from this custom CryptoBasicDecompressor class, but I really have questions about the interface it implements: Decompressor.

There is limited documentation on this interface; for example, when and how will the method setInput() be invoked? If I want to write my own Decompressor, what do the methods in this interface mean? In the case above, I enabled some debug output, and you can see that the byte[] array passed to setInput() has a length of only 512, while the third parameter (length) passed in is negative: -132967308. That caused the IndexOutOfBoundsException. The GzipDecompressor in Hadoop would also throw an IndexOutOfBoundsException from this method in this case, so this is an expected RuntimeException. But why does it happen in my test case?

Here is my test case:

I have a simple log text file of about 700 KB. I encrypted it with the code above using 'AES'; I can encrypt it and decrypt it back to my original content. The file name is foo.log.crypto, and this extension is registered to invoke the CryptoBasicDecompressor in my test Hadoop setup on the CDH4.1.2 release (Hadoop 2.0). Everything works as I expected: the CryptoBasicDecompressor is invoked when the input file is foo.log.crypto, as you can see in the stack trace above. But I don't know why the third parameter (length) of setInput() is negative at runtime.
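To make my question concrete, here is a skeleton of the Decompressor contract as I currently understand it from reading Hadoop's ZlibDecompressor source; the bounds check mirrors ZlibDecompressor.setInput(), the actual buffering and cipher work are omitted, and my reading may well be wrong:

import java.io.IOException;
import org.apache.hadoop.io.compress.Decompressor;

// Skeleton only: the real decrypt/buffer logic is left out.
public class SkeletonDecompressor implements Decompressor {
  private byte[] input;
  private int inputOff, inputLen;
  private boolean finished;

  @Override
  public void setInput(byte[] b, int off, int len) {
    // Same guard ZlibDecompressor uses; a negative len fails right here,
    // which matches the IndexOutOfBoundsException in my stack trace.
    if (b == null) throw new NullPointerException();
    if (off < 0 || len < 0 || off > b.length - len)
      throw new ArrayIndexOutOfBoundsException();
    this.input = b; this.inputOff = off; this.inputLen = len;
  }

  @Override public boolean needsInput() { return inputLen <= 0; }
  @Override public void setDictionary(byte[] b, int off, int len) { }
  @Override public boolean needsDictionary() { return false; }
  @Override public boolean finished() { return finished; }
  @Override public int getRemaining() { return inputLen; }

  @Override
  public int decompress(byte[] b, int off, int len) throws IOException {
    // Decrypt from (input, inputOff, inputLen) into b here and return the
    // number of bytes produced; return 0 when more input is needed.
    return 0;
  }

  @Override public void reset() { input = null; inputOff = inputLen = 0; finished = false; }
  @Override public void end() { }
}

From the stack trace, the len value handed to setInput() comes from BlockDecompressorStream, which as far as I can tell reads block sizes out of the stream itself, so I wonder whether my file is simply not in the block format that stream expects.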
In addition, I have further questions about using a Compressor/Decompressor to handle encrypting/decrypting files. Ideally, I wonder whether encrypting/decrypting can support file splits. That presumably depends on the algorithm we use, is that right? If so, what kind of algorithm can do that? I am not sure whether it is like the compressor case, where most codecs do not support file splits; if so, that may not be good for my requirements.

If we have a 1 GB file encrypted in Amazon S3, after it is copied to the HDFS of an Amazon EMR cluster, can each block of the data be decrypted independently by its mapper and then passed to the underlying RecordReader, so the whole file is processed concurrently? Has anyone done this before? If so, what encryption algorithm supports it? Any ideas?
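One thought I had, not verified: a counter-based mode like AES/CTR looks split-friendly in principle, because the keystream for any 16-byte block can be computed from the block index alone, without reading the preceding bytes. Here is a small JCE sketch of seeking into a CTR stream; the class and method names are mine, just to illustrate the idea:

import java.math.BigInteger;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

// Sketch: build a cipher positioned at an arbitrary byte offset of an
// AES/CTR stream by advancing the counter, instead of decrypting from 0.
public class CtrSeek {
  static Cipher cipherAt(SecretKey key, byte[] baseIv, long byteOffset) throws Exception {
    long blockIndex = byteOffset / 16; // AES block size is 16 bytes
    // Treat the 16-byte IV as a big-endian counter and add the block index.
    BigInteger ctr = new BigInteger(1, baseIv).add(BigInteger.valueOf(blockIndex));
    byte[] raw = ctr.toByteArray();
    byte[] iv16 = new byte[16];
    int copy = Math.min(raw.length, 16);
    System.arraycopy(raw, raw.length - copy, iv16, 16 - copy, copy);
    Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
    c.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv16));
    // After init, decrypt from the nearest block boundary and discard
    // (byteOffset % 16) leading output bytes to land exactly on byteOffset.
    return c;
  }
}

If something like this is sound, each mapper could start decrypting at its own split boundary, which is exactly the behavior I am after. But I would like to hear from anyone who has actually done it.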
Thanks,

Yong