I have the data and I just verified that the problem described is still
happening. Do you want me to try something else on it?
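
In case it helps with reproducing this, below is a rough sketch of how the expected counts could be computed outside Pig from local copies of the two inputs, so the bz and gz results can each be checked against an independent answer. It assumes tab-delimited rows of (target, source), which is the PigStorage default delimiter, and that the gz parts end in .gz. The function names and local paths are made up for illustration and are not part of Pig.

```python
import bz2
import glob
import gzip
import os
from collections import Counter


def read_links(paths):
    """Yield (target, source) int pairs from tab-delimited .bz2 or .gz files."""
    for path in paths:
        # Assumption: anything not ending in .bz2 is a gzip part.
        opener = bz2.open if path.endswith(".bz2") else gzip.open
        with opener(path, "rt") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:
                    continue
                target, source = line.split("\t")
                yield int(target), int(source)


def expected_counts(paths, wanted_target=98):
    """Mirror the query: filter on target, join back on source, count per source."""
    links = list(read_links(paths))
    # a1: sources of the rows whose target matches (duplicates kept).
    a1 = Counter(source for target, source in links if target == wanted_target)
    # b, c, d: every links row joins with each a1 row that has the same source.
    result = Counter()
    for _, source in links:
        if source in a1:
            result[source] += a1[source]
    return dict(result)


def compare_inputs(bz2_path, gz_dir):
    """Return {source: (bz_count, gz_count)} for sources whose counts differ."""
    bz_counts = expected_counts([bz2_path])
    gz_counts = expected_counts(sorted(glob.glob(os.path.join(gz_dir, "*"))))
    sources = set(bz_counts) | set(gz_counts)
    return {
        s: (bz_counts.get(s), gz_counts.get(s))
        for s in sources
        if bz_counts.get(s) != gz_counts.get(s)
    }
```

Calling compare_inputs("links.txt.bz2", "links-gz") on local copies should return an empty dict if both inputs really hold the same rows. If it does, any difference in the Pig output would point at how the bz file is read rather than at the data itself.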
On Tue, Mar 17, 2009 at 4:33 PM, Benjamin Reed <breed@yahoo-inc.com> wrote:
> is there a way to reproduce the dataset?
> thanx
> ben
>
> -----Original Message-----
> From: Tamir Kamara [mailto:tamirkamara@gmail.com]
> Sent: Tuesday, March 17, 2009 6:19 AM
> To: pig-user@hadoop.apache.org
> Subject: Re: bzip/gzip
>
> Sure. My query is simple enough:
>
> --links = LOAD '/user/hadoop/links/links.txt.bz2' AS (target:int, source:int);
> links = LOAD '/user/hadoop/links/links-gz/*' AS (target:int, source:int);
> a = filter links by target==98;
> a1 = foreach a generate source;
> b = JOIN links by source, a1 by source USING "replicated";
> c = group b by links::source;
> d = foreach c generate group as source, COUNT(*);
> dump d;
>
> I used the same source file to create both the bz file and the split gz
> files. The gz files produced the right results, while the bz results were
> off by 1 or 2 for all records.
>
> Thanks,
> Tamir
>
>
> On Tue, Mar 17, 2009 at 3:08 PM, Benjamin Reed <breed@yahoo-inc.com> wrote:
>
> > can you give more information on the wrong results you are getting? it
> > would be great if we could reproduce the problem.
> >
> > ben
> >
> > -----Original Message-----
> > From: Tamir Kamara [mailto:tamirkamara@gmail.com]
> > Sent: Monday, March 16, 2009 11:10 AM
> > To: pig-user@hadoop.apache.org
> > Subject: Re: bzip/gzip
> >
> > Hi,
> >
> > I did some testing with both gzip and bzip2.
> > As Alan wrote, bz has the advantage of being splittable out of the box, but
> > its disadvantage is performance, in both compression and decompression:
> > bz is slow, and I don't think the smaller file is worth it.
> > I also got wrong results when using bz files with the latest trunk, which
> > suggests that there are still some problems. I emailed the details of the
> > problem here a week ago.
> > For now, when I need to, I split the files manually, gzip them, move them
> > into a specific directory in the dfs, and then load that entire directory
> > with pig.
> > I also tried to use lzo but had some problems with it. What I did see is
> > that lzo is faster than gzip but produces larger files.
> > As I understand the situation, pig can only write bz files but can also read
> > gz, lzo and zlib (handled by hadoop).
> > I originally wanted pig to write normal text files and have hadoop compress
> > the output to the other compression types (e.g. lzo), and I configured
> > hadoop as mentioned in the docs but still got uncompressed output. If
> > anyone knows how to use this feature, please write.
> >
> > Tamir
> >
> >
> > On Mon, Mar 16, 2009 at 5:10 PM, Alan Gates <gates@yahoo-inc.com> wrote:
> >
> > > I haven't worked extensively with compressed data, so I'll let others who
> > > have share their experience. But pig does work with bzip data, which can be
> > > split. PigStorage checks to see if the input or output file ends in .bz,
> > > and if so uses bzip to read/write the data. There have been some bugs in
> > > this code, so you should make sure you have the top of trunk version, as
> > > it's been fixed fairly recently.
> > >
> > > gzip files cannot be split, and if you gzip your whole file, you can't
> > > really use it with map/reduce or pig. But hadoop now supports compressing
> > > each block. As I understand it, lzo is preferred over gzip for this. When
> > > you use this, it works fine with pig because hadoop handles the
> > > (de)compression underneath pig. You should be able to find info on how to
> > > do this on your cluster in the hadoop docs.
> > >
> > > Alan.
> > >
> > >
> > > On Mar 14, 2009, at 2:00 AM, Vadim Zaliva wrote:
> > >
> > >> I am considering starting to use compression for data files I process
> > >> with PIG. I am using trunk version of PIG
> > >> on Hadoop-0.18.3. Uncompressed files are about 500MB each, and I plan
> > >> to have a total of a few dozen terabytes of uncompressed data.
> > >> The DFS block size I am using is 96MB.
> > >>
> > >> I am looking for feedback on the idea of using {B|G}ZIP compressed
> > >> files. Could PIG handle them? How would it affect splitting?
> > >> I have read somewhere that bzip files could be split, whereas gzip
> > >> could not. Could somebody confirm this?
> > >>
> > >> Thanks!
> > >>
> > >> Vadim
> > >>
> > >
> > >
> >
>