hadoop-common-user mailing list archives

From Jason Venner <jason.had...@gmail.com>
Subject Re: Problem to create sequence file for
Date Tue, 27 Oct 2009 14:54:00 GMT
If your string is up to 300 MB, you will probably need 2 GB or more of heap to
write it:
1 copy in the String: 600 MB if your file is all ASCII (Java strings store
characters as 16-bit chars)
1 copy in the byte array as UTF-8 (1x to 3x expansion): say 600 MB
1 copy in the on-the-wire format: say 700 MB
possibly 1 copy in a transit buffer on the way to the remote file system: say 720 MB
That adds up to roughly 1.9 GB to 2.6 GB.
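The back-of-the-envelope sum above can be checked directly (a rough sketch; the per-copy sizes are estimates from this thread, not measurements):

```python
# Rough estimate of heap needed to write a ~300 MB ASCII file as a Hadoop
# Text value. All sizes in MB; the wire-format and transit-buffer figures
# are the guesses quoted above, not measured values.
string_copy = 300 * 2    # Java String stores 16-bit chars: 600 MB
utf8_bytes = 600         # UTF-8 encode buffer (1x to 3x of the char data)
wire_format = 700        # serialized record
transit_buffer = 720     # optional buffer en route to the remote filesystem

low = string_copy + utf8_bytes + wire_format  # without the transit buffer
high = low + transit_buffer                   # with the transit buffer
print(f"{low / 1000:.1f} GB to {high / 1000:.1f} GB")  # 1.9 GB to 2.6 GB
```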


Hopefully there are not more copies made ;)

Try setting your heap to 3 to 5 GB with a 64-bit JVM.
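For example (an illustrative command line; the jar name and arguments are hypothetical, only the SequenceFileCreator class name comes from your stack trace):

```shell
# Give the JVM a 4 GB max heap; this requires a 64-bit JVM.
java -Xmx4g -cp sequencefilecreator.jar SequenceFileCreator \
    /local/input/dir hdfs://namenode:9000/dest
```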

On Tue, Oct 27, 2009 at 9:25 AM, bhushan_mahale <
bhushan_mahale@persistent.co.in> wrote:

> Hi Jason,
>
> Thanks for the reply.
> The string is the entire content of the input text file.
> It could be as long as ~300 MB.
> I tried increasing the JVM heap, but unfortunately it kept giving the same
> error.
>
> The other option I am considering is to split the input files first.
>
> - Bhushan
> -----Original Message-----
> From: Jason Venner [mailto:jason.hadoop@gmail.com]
> Sent: Tuesday, October 27, 2009 7:19 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Problem to create sequence file for
>
> How large is the string that is being written?
> Does it contain the entire contents of your file?
> You may simply need to increase your JVM heap size.
>
>
> On Tue, Oct 27, 2009 at 3:43 AM, bhushan_mahale <
> bhushan_mahale@persistent.co.in> wrote:
>
> > Hi,
> >
> > I have written a code to create sequence files for given text files.
> > The program takes the following input parameters:
> >
> >  1.  Local source directory - contains all the input text files
> >  2.  Destination HDFS URI - the location on HDFS where the sequence file
> > will be copied
> >
> > The key for a sequence-record is the file-name.
> > The value for a sequence-record is the content of the text file.
> >
> > The program runs fine for a large number of input text files. But if the
> > size of a single input text file is > 100 MB, it throws the following
> > exception:
> >
> > Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> >        at java.lang.String.toCharArray(String.java:2726)
> >        at org.apache.hadoop.io.Text.encode(Text.java:388)
> >        at org.apache.hadoop.io.Text.set(Text.java:178)
> >        at org.apache.hadoop.io.Text.<init>(Text.java:81)
> >        at SequenceFileCreator.create(SequenceFileCreator.java:106)
> >        at SequenceFileCreator.processFile(SequenceFileCreator.java:168)
> >
> > I am using "org.apache.hadoop.io.SequenceFile.Writer" for creating the
> > sequence file. The Text class is used for keyclass and valclass.
> >
> > I tried increasing the program's maximum heap size, but it throws the
> > same error.
> >
> > Can you provide your suggestions?
> >
> > Thanks,
> > - Bhushan
> >
> >
> > DISCLAIMER
> > ==========
> > This e-mail may contain privileged and confidential information which is
> > the property of Persistent Systems Ltd. It is intended only for the use
> of
> > the individual or entity to which it is addressed. If you are not the
> > intended recipient, you are not authorized to read, retain, copy, print,
> > distribute or use this message. If you have received this communication
> in
> > error, please notify the sender and delete all copies of this message.
> > Persistent Systems Ltd. does not accept any liability for virus infected
> > mails.
> >
>
>
>
> --
> Pro Hadoop, a book to guide you from beginner to hadoop mastery,
> http://www.amazon.com/dp/1430219424?tag=jewlerymall
> www.prohadoopbook.com a community for Hadoop Professionals
>
>



