mahout-dev mailing list archives

From Hector Yee <hector....@gmail.com>
Subject Re: Hadoop serialization compression and precision loss
Date Thu, 14 Jul 2011 14:50:02 GMT
It's not lossy; that would be a disaster if it were. You specify the
compressor so you can use whatever codecs are supported, e.g. LZO.

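For concreteness, a minimal sketch (not from the original message) of
writing and reading a block-compressed SequenceFile with an explicit
codec; the path and the key/value types are invented for illustration.
The read-back check is the point: DoubleWritable serializes the raw
8-byte IEEE-754 representation, and the codec compresses those bytes
losslessly, so the value comes back bit-for-bit identical.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.DoubleWritable;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.SequenceFile.CompressionType;
  import org.apache.hadoop.io.compress.CompressionCodec;
  import org.apache.hadoop.io.compress.GzipCodec;
  import org.apache.hadoop.util.ReflectionUtils;

  public class SequenceFileCompressionSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path path = new Path("/tmp/doubles.seq");  // hypothetical path

      // Block compression with an explicitly chosen codec; an LZO codec
      // could be swapped in here if the native LZO libraries are installed.
      CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
      SequenceFile.Writer writer = SequenceFile.createWriter(
          fs, conf, path, IntWritable.class, DoubleWritable.class,
          CompressionType.BLOCK, codec);
      try {
        writer.append(new IntWritable(0), new DoubleWritable(Math.PI));
      } finally {
        writer.close();
      }

      // Read it back: the stored bits are exactly what was written.
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
      try {
        IntWritable key = new IntWritable();
        DoubleWritable value = new DoubleWritable();
        reader.next(key, value);
        System.out.println(value.get() == Math.PI);  // prints true
      } finally {
        reader.close();
      }
    }
  }
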
On Thu, Jul 14, 2011 at 7:40 AM, Dhruv Kumar <dkumar@ecs.umass.edu> wrote:

> On Thu, Jul 14, 2011 at 10:29 AM, Sean Owen <srowen@gmail.com> wrote:
>
> > Serialization itself has no effect on accuracy; doubles are encoded
> > exactly as they are in memory. That's not to say there can't be an
> > accuracy issue in how some computation proceeds, but it is not a
> > function of serialization.
> >
>
> Interesting, are there factors specific to Hadoop (not just subtleties of
> Java or the OS) which can affect accuracy and that I should be concerned
> about?
>
> Also, SequenceFile stores compressed key-value pairs, does it not? Is that
> compression lossy?
>
>
> > On Thu, Jul 14, 2011 at 2:54 PM, Dhruv Kumar <dkumar@ecs.umass.edu>
> wrote:
> >
> > > What are the algorithms and codecs used in Hadoop to compress data and
> > > pass it around between mappers and reducers? I'm curious to understand
> > > the effects they have (if any) on double-precision values.
> > >
> > > So far my trainer (MAHOUT-627) uses unscaled EM training, and I'm soon
> > > starting the work on using log-scaled values for improved accuracy and
> > > minimized underflow. It will be interesting to compare the accuracy of
> > > the unscaled and log-scaled variants, so I'm curious.
> > >
> >
>
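
On the question further up about compression of data passed between
mappers and reducers: intermediate (map-output) compression is off by
default and is enabled per job. A minimal sketch, assuming the pre-0.21
property names; the codec class is just an example:

  Configuration conf = new Configuration();
  // Compress the serialized map output shipped to the reducers.
  // This is byte-level, lossless compression of the already-serialized
  // records, so it has no effect on the double values themselves.
  conf.setBoolean("mapred.compress.map.output", true);
  conf.set("mapred.map.output.compression.codec",
      "org.apache.hadoop.io.compress.DefaultCodec");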



-- 
Yee Yang Li Hector
http://hectorgon.blogspot.com/ (tech + travel)
http://hectorgon.com (book reviews)
