From: java8964 <java8964@hotmail.com>
To: user@hadoop.apache.org
Subject: Question related to Decompressor interface
Date: Sat, 9 Feb 2013 15:49:31 -0500

Hi,

Currently I am researching options for encrypting data in MapReduce, as we plan to use the Amazon EMR or EC2 services for our data.

I am thinking that the compression codec is a good place to integrate the encryption logic, and I found that some people have had the same idea. I googled around and found this code:

https://github.com/geisbruch/HadoopCryptoCompressor/

It doesn't seem maintained any more, but it gave me a starting point. I downloaded the source code and tried some tests with it.

It doesn't work out of the box; there are some bugs I had to fix to make it work. I believe it includes 'AES' as an example algorithm.
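For background on how the codec gets picked up at all: as far as I understand it, Hadoop's CompressionCodecFactory maps an input file to a codec by its file extension, which is how foo.log.crypto (below) ends up routed to this code. Here is a small sketch of that lookup; the CryptoCodec class name is just my placeholder for whatever the project actually calls its codec class:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecLookup {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Register the codec classes; the factory maps file extensions to
    // codecs using each codec's getDefaultExtension().
    conf.set("io.compression.codecs",
        "org.apache.hadoop.io.compress.GzipCodec,"
        + "org.apache.hadoop.io.compress.crypto.CryptoCodec"); // placeholder name
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(new Path("foo.log.crypto"));
    System.out.println(codec == null ? "no codec matched" : codec.getClass().getName());
  }
}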
But right now I am facing a problem when I try to use it in my test MapReduce program. Here is the stack trace I got:

2013-02-08 23:16:47,038 INFO org.apache.hadoop.io.compress.crypto.CryptoBasicDecompressor: buf length = 512, and offset = 0, length = -132967308
java.lang.IndexOutOfBoundsException
    at java.nio.ByteBuffer.wrap(ByteBuffer.java:352)
    at org.apache.hadoop.io.compress.crypto.CryptoBasicDecompressor.setInput(CryptoBasicDecompressor.java:100)
    at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:97)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:83)
    at java.io.InputStream.read(InputStream.java:82)
    at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:209)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:173)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:114)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:458)
    at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:645)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)

I know the error is thrown from this custom CryptoBasicDecompressor class, but I really have questions about the interface it implements: Decompressor.

There is limited documentation on this interface; for example, when and how will the method setInput() be invoked? If I want to write my own Decompressor, what do the methods in this interface mean? In the case above, I enabled some debug output, and you can see that the byte[] array passed to setInput() has a length of only 512, while the third parameter (length) passed in is negative: -132967308. That caused the IndexOutOfBoundsException. The GzipDecompressor in Hadoop would also throw an IndexOutOfBoundsException from this method in this case, so this is an expected RuntimeException. But why does it happen in my test case?

Here is my test case:

I have a simple log text file of about 700 KB. I encrypted it with the code above using 'AES'; I can encrypt it and decrypt it back to my original content. The file name is foo.log.crypto, and this extension is registered to invoke the CryptoBasicDecompressor in my test Hadoop setup on the CDH4.1.2 release (Hadoop 2.0). Everything works as I expected: the CryptoBasicDecompressor is invoked when the input file is foo.log.crypto, as you can see in the stack trace above. But I don't know why the third parameter (length) of setInput() is negative at runtime.
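To make my question concrete, here is a skeleton of the Decompressor contract as I currently understand it from reading Hadoop's ZlibDecompressor source; the bounds check mirrors ZlibDecompressor.setInput(), the actual buffering and cipher work are omitted, and my reading may well be wrong:

import java.io.IOException;
import org.apache.hadoop.io.compress.Decompressor;

// Skeleton only: the real decrypt/buffer logic is left out.
public class SkeletonDecompressor implements Decompressor {
  private byte[] input;
  private int inputOff, inputLen;
  private boolean finished;

  @Override
  public void setInput(byte[] b, int off, int len) {
    // Same guard ZlibDecompressor uses; a negative len fails right here,
    // which matches the IndexOutOfBoundsException in my stack trace.
    if (b == null) throw new NullPointerException();
    if (off < 0 || len < 0 || off > b.length - len)
      throw new ArrayIndexOutOfBoundsException();
    this.input = b; this.inputOff = off; this.inputLen = len;
  }

  @Override public boolean needsInput() { return inputLen <= 0; }
  @Override public void setDictionary(byte[] b, int off, int len) { }
  @Override public boolean needsDictionary() { return false; }
  @Override public boolean finished() { return finished; }
  @Override public int getRemaining() { return inputLen; }

  @Override
  public int decompress(byte[] b, int off, int len) throws IOException {
    // Decrypt from (input, inputOff, inputLen) into b here and return the
    // number of bytes produced; return 0 when more input is needed.
    return 0;
  }

  @Override public void reset() { input = null; inputOff = inputLen = 0; finished = false; }
  @Override public void end() { }
}

From the stack trace, the len value handed to setInput() comes from BlockDecompressorStream, which as far as I can tell reads block sizes out of the stream itself, so I wonder whether my file is simply not in the block format that stream expects.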
In addition, I have further questions about using a Compressor/Decompressor to handle encrypting/decrypting files. Ideally, I wonder whether encrypting/decrypting can support file splits. That presumably depends on the algorithm we use, is that right? If so, what kind of algorithm can do that? I am not sure whether it is like the compressor case, where most codecs do not support file splits; if so, that may not be good for my requirements.

If we have a 1 GB file encrypted in Amazon S3, after it is copied to the HDFS of an Amazon EMR cluster, can each block of the data be decrypted independently by its mapper and then passed to the underlying RecordReader, so the whole file is processed concurrently? Has anyone done this before? If so, what encryption algorithm supports it? Any ideas?
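One thought I had, not verified: a counter-based mode like AES/CTR looks split-friendly in principle, because the keystream for any 16-byte block can be computed from the block index alone, without reading the preceding bytes. Here is a small JCE sketch of seeking into a CTR stream; the class and method names are mine, just to illustrate the idea:

import java.math.BigInteger;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

// Sketch: build a cipher positioned at an arbitrary byte offset of an
// AES/CTR stream by advancing the counter, instead of decrypting from 0.
public class CtrSeek {
  static Cipher cipherAt(SecretKey key, byte[] baseIv, long byteOffset) throws Exception {
    long blockIndex = byteOffset / 16; // AES block size is 16 bytes
    // Treat the 16-byte IV as a big-endian counter and add the block index.
    BigInteger ctr = new BigInteger(1, baseIv).add(BigInteger.valueOf(blockIndex));
    byte[] raw = ctr.toByteArray();
    byte[] iv16 = new byte[16];
    int copy = Math.min(raw.length, 16);
    System.arraycopy(raw, raw.length - copy, iv16, 16 - copy, copy);
    Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
    c.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv16));
    // After init, decrypt from the nearest block boundary and discard
    // (byteOffset % 16) leading output bytes to land exactly on byteOffset.
    return c;
  }
}

If something like this is sound, each mapper could start decrypting at its own split boundary, which is exactly the behavior I am after. But I would like to hear from anyone who has actually done it.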
Thanks,

Yong