incubator-crunch-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Josh Wills <>
Subject Re: PTypes and parallel do
Date Wed, 01 Aug 2012 01:17:48 GMT
Hey Jonathan,

With respect to handling arrays (as opposed to Collection<T)>s), there
are lots of different serialization approaches you could take
depending on what you are planning to do with the output, and so my
general inclination has been to not impose a particular approach on
people within the core PTypeFamily stuff. That may or may not be the
right solution, and it's certainly something we need feedback on.

My advice is to think about how you want to consume or output the
array and then use the _derived_ method on the PTypeFamily of your
choice to convert from some base format (like a String or a byte
array) to the float array you want to work with. Here's some code that
I just spun up that uses Avro's bytes() type as the base for working
with a float array:

  private static final MapFn<ByteBuffer, float[]> IN = new
MapFn<ByteBuffer, float[]>() {
    public float[] map(ByteBuffer input) {
      float[] f = new float[(input.limit() - input.position()) / 4];
      return f;
  private static final MapFn<float[], ByteBuffer> OUT = new
MapFn<float[], ByteBuffer>() {
    public ByteBuffer map(float[] input) {
      byte[] bb = new byte[input.length * 4];
      for (int i = 0; i < input.length; i++) {
        int f = Float.floatToRawIntBits(input[i]);
        bb[4*i] = (byte)((f >> 24) & 0xff);
        bb[4*i + 1] = (byte)((f >> 16) & 0xff);
        bb[4*i + 2] = (byte)((f >> 8) & 0xff);
        bb[4*i + 3] = (byte)((f >> 0) & 0xff);
      return ByteBuffer.wrap(bb);

  public static final PType<float[]> FLOAT_ARRAYS =
Avros.derived(float[].class, IN, OUT, Avros.bytes());

  public void testInAndOut() throws Exception {
    float[] data = new float[] { 17.29f, 2.0f, 3.14f };
    float[] out =;
    assertEquals(data.length, out.length);
    for (int i = 0; i < out.length; i++) {
      assertEquals(data[i], out[i], 0.0001f);


On Tue, Jul 31, 2012 at 5:34 PM, Jonathan Dinu <> wrote:
> Hi,
> I am pretty new to Crunch and I am having a difficult time working around PTypes specifically
as the third argument to parallelDo when serializing the collection.
> I have been using some of the PTypeFamily methods for primitives like Strings but I am
trying to create a PCollection output that contains Arrays of Floats.  What I want out is
PCollection<Float[]> but i cannot seem to coerce the right Ptype.  I am not sure if
this is the best approach or if I should be using a different type to store my records in
the collection.  Or if there is a way to register custom types and define how they should
be serialized to disk.
> Any help is greatly appreciated.
> Thanks,
> Joanthan

View raw message