Date: Tue, 10 Sep 2013 12:21:26 -0400
Subject: Re: Concatenate multiple sequence files into 1 big sequence file
From: Jerry Lam <chilinglam@gmail.com>
To: user@hadoop.apache.org

Hi guys,

Thank you for all the advice here. I really appreciate it.

I read through the code in filecrush and found that it does exactly what
I'm currently doing. The logic resides in CrushReducer.java, in the
following lines that do the concatenation:

    while (reader.next(key, value)) {
        sink.write(key, value);
        reporter.incrCounter(ReducerCounter.RECORDS_CRUSHED, 1);
    }

I wonder if there are faster ways to do this? Preferably a solution that
only streams a set of sequence files into the final sequence file.

Best Regards,

Jerry

On Tue, Sep 10, 2013 at 11:20 AM, Adam Muise <amuise@hortonworks.com> wrote:
> Jerry,
>
> It might not help with this particular file, but you might consider the
> approach used at BlackBerry when dealing with your data. They
> block-compressed data into small Avro files and then concatenated them
> into large Avro files without decompressing. Check out the Boom file
> format here:
>
> https://github.com/blackberry/hadoop-logdriver
>
> For now, use filecrush:
> https://github.com/edwardcapriolo/filecrush
>
> Cheers,
>
> On Tue, Sep 10, 2013 at 11:07 AM, Jerry Lam <chilinglam@gmail.com> wrote:
>> Hi Hadoop users,
>>
>> I have been trying to concatenate multiple sequence files into one.
>> Since the total size of the sequence files is quite big (1 TB), I won't
>> use MapReduce, because it would require 1 TB on the reducer host to hold
>> the temporary data.
>> I ended up doing what was suggested in this thread:
>> http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201308.mbox/%3CCAOcnVr2CuBdNkXutyydGjw2td19HHYiMwo4=JUa=SrXi51717w@mail.gmail.com%3E
>>
>> It works very well. I wonder if there is a faster way to append to a
>> sequence file.
>>
>> Currently, the code looks like this (omitting opening and closing the
>> sequence files, exception handling, etc.):
>>
>>     // each seq is a sequence file
>>     // writer is a sequence file writer
>>     for (val seq : seqs) {
>>         reader = new SequenceFile.Reader(conf, Reader.file(seq.getPath()));
>>         while (reader.next(readerKey, readerValue)) {
>>             writer.append(readerKey, readerValue);
>>         }
>>     }
>>
>> Is there a better way to do this? Note that I think it is wasteful to
>> deserialize and serialize the key and value in the while loop, because
>> the program simply appends to the sequence file. Also, I don't seem to
>> be able to read and write fast enough (about 6 MB/sec).
>>
>> Any advice is appreciated,
>>
>> Jerry

> --
> Adam Muise
> Solution Engineer
> Hortonworks
> amuise@hortonworks.com
> 416-417-4037
>
> Hortonworks - Develops, Distributes and Supports Enterprise Apache Hadoop.
>
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity
> to which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.
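[Editor's note] On the "streaming without (de)serializing" question above: a sequence file cannot be built by naive byte concatenation, since each file carries its own header and sync markers, but Hadoop's SequenceFile API does expose raw-record methods (Reader.nextRawKey/nextRawValue and Writer.appendRaw in Hadoop 1.x/2.x) that copy records without deserializing keys and values, provided source and destination use matching compression settings. The underlying byte-streaming pattern itself can be illustrated without any Hadoop dependency using zero-copy FileChannel.transferTo; the class and method names below are hypothetical, a sketch rather than a drop-in replacement:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

public class StreamConcat {
    /**
     * Byte-level concatenation of plain files via FileChannel.transferTo,
     * which lets the OS stream bytes without copying them through user space.
     * (A real sequence-file merge would copy raw records, not whole files.)
     */
    public static void concat(List<Path> inputs, Path output) throws IOException {
        try (FileChannel out = FileChannel.open(output,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            for (Path in : inputs) {
                try (FileChannel src = FileChannel.open(in, StandardOpenOption.READ)) {
                    long pos = 0;
                    long size = src.size();
                    // transferTo may move fewer bytes than requested, so loop.
                    while (pos < size) {
                        pos += src.transferTo(pos, size - pos, out);
                    }
                }
            }
        }
    }
}
```

Whether this helps with the observed 6 MB/sec depends on where the bottleneck is; if it is per-record serialization, the raw-record API is the closer analogue, and if it is I/O, no record-level change will help.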