spark-user mailing list archives

From "Venkat, Ankam" <Ankam.Ven...@centurylink.com>
Subject RE: How to 'Pipe' Binary Data in Apache Spark
Date Thu, 22 Jan 2015 16:34:10 GMT
How much time would it take to port it?

Spark committers:  Please let us know your thoughts.

Regards,
Venkat

From: Frank Austin Nothaft [mailto:fnothaft@berkeley.edu]
Sent: Thursday, January 22, 2015 9:08 AM
To: Venkat, Ankam
Cc: Nick Allen; user@spark.apache.org
Subject: Re: How to 'Pipe' Binary Data in Apache Spark

Venkat,

No problem!

So, creating a custom InputFormat or using sc.binaryFiles alone is not the right solution.
We also need a modified version of RDD.pipe to support binary data?  Is my understanding
correct?

Yep! That is correct. The custom InputFormat allows Spark to load binary formatted data from
disk/HDFS/S3/etc., but then the default RDD.pipe reads/writes text to a pipe, so you'd need
the custom mapPartitions call.


If yes, can this be added as a new enhancement JIRA request?

The code that I have right now is fairly custom to my application, but if there was interest,
I would be glad to port it for the Spark core.

Regards,

Frank Austin Nothaft
fnothaft@berkeley.edu
fnothaft@eecs.berkeley.edu
202-340-0466

On Jan 22, 2015, at 7:11 AM, Venkat, Ankam <Ankam.Venkat@centurylink.com> wrote:


Thanks Frank for your response.

So, creating a custom InputFormat or using sc.binaryFiles alone is not the right solution.
We also need a modified version of RDD.pipe to support binary data?  Is my understanding
correct?

If yes, can this be added as a new enhancement JIRA request?

Nick:  What's your take on this?

Regards,
Venkat Ankam


From: Frank Austin Nothaft [mailto:fnothaft@berkeley.edu]
Sent: Wednesday, January 21, 2015 12:30 PM
To: Venkat, Ankam
Cc: Nick Allen; user@spark.apache.org
Subject: Re: How to 'Pipe' Binary Data in Apache Spark

Hi Venkat/Nick,

The Spark RDD.pipe method pipes text data into a subprocess and then receives text data back
from that process. Once you have the binary data loaded into an RDD properly, to pipe binary
data to/from a subprocess (i.e., you want the data in the pipes to contain binary, not text),
you need to implement your own, modified version of RDD.pipe. The implementation<https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala>
of RDD.pipe spawns a process per partition (IIRC), as well as threads for writing to and reading
from the process (as well as stderr for the process). When writing via RDD.pipe, Spark calls
*.toString on the object, and pushes that text representation down the pipe. There is an example
of how to pipe binary data from within a mapPartitions call using the Scala API in lines 107-177
of this file<https://github.com/bigdatagenomics/avocado/blob/master/avocado-core/src/main/scala/org/bdgenomics/avocado/genotyping/ExternalGenotyper.scala>.
This specific code contains some nastiness around the packaging of downstream libraries that
we rely on in that project, so I'm not sure if it is the cleanest way, but it is a workable
way.
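
As a rough illustration of the approach described above (one subprocess per partition, with a separate thread writing to it), here is a minimal Python sketch. It is not the code from the linked ExternalGenotyper.scala file, and the names pipe_partition and feed are hypothetical. It assumes every record in the partition is already a bytes object and that the command accepts concatenated records on stdin.

```python
import subprocess
import sys
import threading


def pipe_partition(records, command):
    """Stream every bytes record of one partition through a single subprocess.

    Returns the raw bytes that the subprocess wrote to stdout.
    """
    proc = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE)

    def feed():
        # Write on a separate thread so that reading stdout and writing stdin
        # cannot block each other when a pipe buffer fills up.
        try:
            for record in records:
                proc.stdin.write(record)  # raw bytes, no str() conversion
            proc.stdin.close()
        except BrokenPipeError:
            pass  # the subprocess exited before reading all of its input

    writer = threading.Thread(target=feed)
    writer.start()
    output = proc.stdout.read()  # read until the subprocess closes stdout
    writer.join()
    if proc.wait() != 0:
        raise RuntimeError("command exited with status %d" % proc.returncode)
    return output


if __name__ == "__main__":
    # Stand-in command for the demo: a Python child that echoes stdin to stdout.
    echo = [sys.executable, "-c",
            "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())"]
    print(pipe_partition(iter([b"\x00\xff", b"\x80\n"]), echo))
```

With an RDD of bytes records, a call along the lines of rdd.mapPartitions(lambda it: [pipe_partition(it, ["/usr/bin/binary-to-csv.sh"])]) would yield one bytes result per partition. How to split that output back into records depends on the external program.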

Regards,

Frank Austin Nothaft
fnothaft@berkeley.edu
fnothaft@eecs.berkeley.edu
202-340-0466

On Jan 21, 2015, at 9:17 AM, Venkat, Ankam <Ankam.Venkat@centurylink.com> wrote:



I am trying to solve a similar problem.  I am using option #2 as suggested by Nick.

I have created an RDD with sc.binaryFiles for a list of .wav files.  But, I am not able to
pipe it to the external programs.

For example:
>>> sq = sc.binaryFiles("wavfiles")  <-- All .wav files stored in the "wavfiles" directory on HDFS
>>> sq.keys().collect() <-- works fine.  Shows the list of file names.
>>> sq.values().collect() <-- works fine.  Shows the content of the files.
>>> sq.values().pipe(lambda x: subprocess.call(['/usr/local/bin/sox', '-t', 'wav', '-', '-n', 'stats'])).collect() <-- Does not work.  Tried different options.
AttributeError: 'function' object has no attribute 'read'

Any suggestions?
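
For context on the error above: RDD.pipe expects a command string rather than a Python function, which is consistent with the AttributeError when a lambda is passed. One possible workaround, sketched here under assumptions, is to run the external program once per record from inside a map call and hand it the bytes on stdin. The helper name run_on_bytes is hypothetical, and subprocess.run needs Python 3.5 or later. Both stdout and stderr are captured because some tools write their reports to stderr.

```python
import subprocess
import sys


def run_on_bytes(data, command):
    """Send one bytes record to the command's stdin.

    Returns (stdout, stderr) as bytes; raises CalledProcessError on a
    non-zero exit status.
    """
    result = subprocess.run(
        command,
        input=data,               # raw bytes go straight to the child's stdin
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return result.stdout, result.stderr


if __name__ == "__main__":
    # Stand-in command for the demo: reverse the input bytes on stdout and
    # report the input length on stderr.
    demo = [sys.executable, "-c",
            "import sys; d = sys.stdin.buffer.read(); "
            "sys.stdout.buffer.write(d[::-1]); sys.stderr.write(str(len(d)))"]
    print(run_on_bytes(b"\x00\x01\xff", demo))
```

Applied to the session above, this might look like sq.values().map(lambda data: run_on_bytes(data, ['/usr/local/bin/sox', '-t', 'wav', '-', '-n', 'stats'])).collect(), assuming sox is installed at that path on every worker node.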

Regards,
Venkat Ankam

From: Nick Allen [mailto:nick@nickallen.org]
Sent: Friday, January 16, 2015 11:46 AM
To: user@spark.apache.org
Subject: Re: How to 'Pipe' Binary Data in Apache Spark

I just wanted to reiterate the solution for the benefit of the community.

The problem is not from my use of 'pipe', but that 'textFile' cannot be used to read in binary
data. (Doh) There are a couple options to move forward.

1. Implement a custom 'InputFormat' that understands the binary input data. (Per Sean Owen)

2. Use 'SparkContext.binaryFiles' to read in the entire binary file as a single record. This
will impact performance as it prevents the use of more than one mapper on the file's data.

In my specific case for #1 I can only find one project from RIPE-NCC (https://github.com/RIPE-NCC/hadoop-pcap)
that does this. Unfortunately, it appears to only support a limited set of network protocols.


On Fri, Jan 16, 2015 at 10:40 AM, Nick Allen <nick@nickallen.org> wrote:
Per your last comment, it appears I need something like this:

https://github.com/RIPE-NCC/hadoop-pcap

Thanks a ton.  That gets me oriented in the right direction.

On Fri, Jan 16, 2015 at 10:20 AM, Sean Owen <sowen@cloudera.com> wrote:
Well it looks like you're reading some kind of binary file as text.
That isn't going to work, in Spark or elsewhere, as binary data is not
even necessarily a valid encoding of a string. There are no line
breaks to delimit lines and thus elements of the RDD.
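
A small, Spark-free Python illustration of both points, using made-up sample bytes (the first four happen to match the little-endian pcap magic number, the rest are arbitrary): decoding binary data as text does not round-trip, and splitting on newline bytes cuts the data wherever a 0x0A byte happens to occur.

```python
# Made-up sample: pcap little-endian magic followed by arbitrary bytes,
# two of which happen to be 0x0A (newline).
data = bytes([0xD4, 0xC3, 0xB2, 0xA1, 0x0A, 0x00, 0xFF, 0x0A, 0x01])

# Invalid UTF-8 sequences are replaced during decoding, so encoding the
# text again does not give back the original bytes.
as_text = data.decode("utf-8", errors="replace")
round_trip = as_text.encode("utf-8")
print("round trip preserved the data:", round_trip == data)

# Line-based splitting treats every 0x0A byte as a record boundary,
# regardless of the real record structure.
chunks = data.split(b"\n")
print("number of 'lines':", len(chunks))
```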

Your input has some record structure (or else it's not really useful
to put it into an RDD). You can encode this as a SequenceFile and read
it with objectFile.

You could also write a custom InputFormat that knows how to parse pcap
records directly.
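
To give a sense of what parsing pcap records directly involves, here is a minimal Python sketch that splits the bytes of a classic pcap file into packet records. It is not a Hadoop InputFormat. It only shows the record parsing such a reader would perform, applied to a whole file held in memory (for example the value of one sc.binaryFiles pair). The function name split_pcap is hypothetical, and the sketch assumes the classic microsecond-resolution pcap layout: a 24-byte global header, then a 16-byte header before each packet.

```python
import struct

PCAP_GLOBAL_HEADER = 24  # bytes in the file-level header
PCAP_RECORD_HEADER = 16  # bytes in each per-packet header


def split_pcap(blob):
    """Yield (ts_sec, ts_usec, packet_bytes) for each record in a pcap blob."""
    magic = blob[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # file written on a little-endian machine
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"  # file written on a big-endian machine
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")

    offset = PCAP_GLOBAL_HEADER
    while offset + PCAP_RECORD_HEADER <= len(blob):
        # Per-packet header: seconds, microseconds, captured length, original length.
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack_from(
            endian + "IIII", blob, offset)
        offset += PCAP_RECORD_HEADER
        packet = blob[offset:offset + incl_len]
        if len(packet) < incl_len:
            break  # truncated final record; stop rather than yield partial data
        yield ts_sec, ts_usec, packet
        offset += incl_len
```

With whole files loaded through binaryFiles, something like sc.binaryFiles("pcaps").flatMap(lambda kv: split_pcap(kv[1])) would give an RDD of packet records, where "pcaps" is a placeholder path. Each file is still parsed by a single task, so this does not remove the single-mapper limitation of binaryFiles.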

On Fri, Jan 16, 2015 at 3:09 PM, Nick Allen <nick@nickallen.org> wrote:
> I have an RDD containing binary data. I would like to use 'RDD.pipe' to pipe
> that binary data to an external program that will translate it to
> string/text data. Unfortunately, it seems that Spark is mangling the binary
> data before it gets passed to the external program.
>
> This code is representative of what I am trying to do. What am I doing
> wrong? How can I pipe binary data in Spark?  Maybe it is getting corrupted
> when I read it in initially with 'textFile'?
>
> bin = sc.textFile("binary-data.dat")
> csv = bin.pipe ("/usr/bin/binary-to-csv.sh")
> csv.saveAsTextFile("text-data.csv")
>
> Specifically, I am trying to use Spark to transform pcap (packet capture)
> data to text/csv so that I can perform an analysis on it.
>
> Thanks!
>
> --
> Nick Allen <nick@nickallen.org>



This communication is the property of CenturyLink and may contain confidential or privileged
information. Unauthorized use of this communication is strictly prohibited and may be unlawful.
If you have received this communication in error, please immediately notify the sender by
reply e-mail and destroy all copies of the communication and any attachments.
