mahout-dev mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: Use of JSON for Serialization in Mahout
Date Mon, 09 May 2011 19:19:19 GMT
There is no problem with using a third-party library licensed under
the Apache license.

However, in the case of this particular library, I would not use JSON.
For one thing, we just finished removing usages of it; and even when
we did use it, it was never for key/value serialization.

It's a somewhat verbose format and just not appropriate at scale,
where a compact binary format can save terabytes of storage and
network transfer, not to mention hours of CPU time.
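
To make the size difference concrete, here is a minimal sketch that encodes
one record both ways. The record (an int item ID plus a double score) and the
names EncodingSizeDemo, asJson and asBinary are made up for illustration; they
are not from Mahout or Cloud9, and the JSON is built by hand rather than with
any particular library.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: compares the encoded size of one (itemId, score) record.
final class EncodingSizeDemo {

    // Hand-built JSON text, encoded as UTF-8. Field names are repeated in every record.
    static byte[] asJson(int itemId, double score) {
        String json = "{\"itemId\":" + itemId + ",\"score\":" + score + "}";
        return json.getBytes(StandardCharsets.UTF_8);
    }

    // Fixed-width binary: 4 bytes for the int, 8 bytes for the double.
    static byte[] asBinary(int itemId, double score) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(itemId);
            out.writeDouble(score);
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```

For the record (123456, 0.8125) the JSON form is 32 bytes and the binary form
is 12. The exact ratio depends on the field names and values, but the
per-record overhead is paid again for every key and value in a job.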

I don't think it's hard or time-consuming to write Writable
implementations for the few new key/value classes you'll need. Most
everything you'll want is written by Mahout or Hadoop already. The
read / write methods you'd implement are just tens of lines of code
anyway.
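
As a rough sketch of what such a class looks like: the two methods below have
the same signatures as those of Hadoop's org.apache.hadoop.io.Writable
interface, but the class deliberately does not import Hadoop so that it
compiles on its own. The class name ItemScoreWritable, its fields, and the
toBytes / fromBytes helpers are hypothetical; in a real job you would add
"implements Writable" and let the framework drive the streams.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical value class. In a Hadoop job it would declare
// "implements org.apache.hadoop.io.Writable".
final class ItemScoreWritable {
    private int itemId;
    private double score;

    // Hadoop creates Writables reflectively, so keep a no-arg constructor.
    ItemScoreWritable() {}

    ItemScoreWritable(int itemId, double score) {
        this.itemId = itemId;
        this.score = score;
    }

    // Serialize the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeInt(itemId);
        out.writeDouble(score);
    }

    // Read the fields back in exactly the order write() emitted them.
    public void readFields(DataInput in) throws IOException {
        itemId = in.readInt();
        score = in.readDouble();
    }

    int getItemId() { return itemId; }
    double getScore() { return score; }

    // Illustration-only helper: serialize to an in-memory byte array.
    static byte[] toBytes(ItemScoreWritable value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            value.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Illustration-only helper: rebuild an instance from bytes.
    static ItemScoreWritable fromBytes(byte[] bytes) {
        ItemScoreWritable value = new ItemScoreWritable();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            value.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return value;
    }
}
```

A round trip such as
ItemScoreWritable.fromBytes(ItemScoreWritable.toBytes(new ItemScoreWritable(7, 0.5)))
should give back the same two field values. A key class would additionally
need ordering (Hadoop's WritableComparable), which is not shown here.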

On Mon, May 9, 2011 at 8:03 PM, Dhruv <dhruv21@gmail.com> wrote:
> Cloud 9 is an easy-to-use Hadoop MapReduce library by Jimmy Lin from the
> University of Maryland using the Apache 2.0 license (
> http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/). The library contains a
> very convenient, lightweight JSON serializable class. One can use this class
> instead of rolling one's own custom serializable objects, and it could help me
> for the GSoC.
>
> What are Mahout's/ASF's policies regarding the use of such open third party
> libraries?
>
> What is the general opinion regarding using JSON serialization on Hadoop?
>
> In another email conversation, Grant did mention that JSON is slow and also
> that GSON had been used in the past by Mahout.
>
> Also, I had allocated sufficient time in my proposal, almost one month for
> implementing this custom object during the mapper's implementation so I
> could still just go ahead as planned before.
>
