hadoop-common-user mailing list archives

From Andreas Kostyrka <andr...@kostyrka.org>
Subject Re: streaming problem
Date Wed, 19 Mar 2008 10:03:48 GMT
Ok, tracked it down. It seems that Hadoop Streaming "corrupts" the input
files. Is there any way to force it to pass whole files to a one-to-one mapper?
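
(One untested idea: since input splits never cross file boundaries, pushing
the minimum split size above the size of the largest input file should give
each map task one whole file. If I remember right the streaming jar accepts
-jobconf for this, so something along the lines of:

time bin/hadoop jar contrib/streaming/hadoop-0.16.0-streaming.jar -jobconf mapred.min.split.size=1000000000 -mapper workloadmf -reducer NONE -input testlogs/* -output testlogs-output -cacheFile /dist/workloadmf#workloadmf

Both the -jobconf flag and the mapred.min.split.size property name are from
memory, so treat them as assumptions. And since the inputs are gzipped, which
as far as I know isn't splittable, the files shouldn't be getting split
mid-record in the first place, in which case the corruption is coming from
somewhere else.)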

TIA,

Andreas

On Wednesday, 19.03.2008 at 09:18 +0100, Andreas Kostyrka wrote:
> The /home/hadoop/dist/workloadmf script is available on all nodes.
> 
> But it was missing one package it needed to run correctly ;(
> 
> Anyway, I still have the problem that, running with
> -reducer NONE, my output seems to get lost. Some of the
> output files contain a few output lines, but not many :(
> (And the expected size of each output file was around 25 MB or so :( )
> 
> Ah the joys,
> 
> Andreas
> 
> On Wednesday, 19.03.2008 at 10:13 +0530, Amareshwari
> Sriramadasu wrote:
> > Hi Andreas,
> >  It looks like your mapper is not available to the streaming jar. Where is
> > your mapper script? Did you use the distributed cache to distribute the mapper?
> > You can use -file <mapper-script-path on local fs> to make it part of the
> > jar, or use -cacheFile /dist/workloadmf#workloadmf to distribute the
> > script. Distributing it this way will add your script to the PATH.
> > 
> > So now your command will be:
> > 
> > time bin/hadoop jar contrib/streaming/hadoop-0.16.0-streaming.jar -mapper workloadmf -reducer NONE -input testlogs/* -output testlogs-output -cacheFile /dist/workloadmf#workloadmf
> > 
> > or
> > 
> > time bin/hadoop jar contrib/streaming/hadoop-0.16.0-streaming.jar -mapper workloadmf -reducer NONE -input testlogs/* -output testlogs-output -file <path-on-local-fs>
> > 
> > Thanks,
> > Amareshwari
> > 
> > Andreas Kostyrka wrote:
> > > Some additional details, in case they help: the HDFS is hosted on AWS S3,
> > > and the input file set consists of 152 gzipped Apache log files.
> > >
> > > Thanks,
> > >
> > > Andreas
> > >
> > > On Tuesday, 18.03.2008 at 22:17 +0100, Andreas Kostyrka wrote:
> > >   
> > >> Hi!
> > >>
> > >> I'm trying to run a streaming job on Hadoop 0.16.0, and I've distributed the
> > >> scripts to be used to all nodes:
> > >>
> > >> time bin/hadoop jar contrib/streaming/hadoop-0.16.0-streaming.jar -mapper ~/dist/workloadmf -reducer NONE -input testlogs/* -output testlogs-output
> > >>
> > >> Now, this gives me:
> > >>
> > >> java.io.IOException: log:null
> > >> R/W/S=1/0/0 in:0=1/2 [rec/s] out:0=0/2 [rec/s]
> > >> minRecWrittenToEnableSkip_=9223372036854775807 LOGNAME=null
> > >> HOST=null
> > >> USER=hadoop
> > >> HADOOP_USER=null
> > >> last Hadoop input: |null|
> > >> last tool output: |null|
> > >> Date: Tue Mar 18 21:06:13 GMT 2008
> > >> java.io.IOException: Broken pipe
> > >> 	at java.io.FileOutputStream.writeBytes(Native Method)
> > >> 	at java.io.FileOutputStream.write(FileOutputStream.java:260)
> > >> 	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
> > >> 	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
> > >> 	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
> > >> 	at java.io.DataOutputStream.flush(DataOutputStream.java:106)
> > >> 	at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
> > >> 	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> > >> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
> > >> 	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2071)
> > >>
> > >>
> > >> 	at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:107)
> > >> 	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> > >> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
> > >> 	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2071)
> > >>
> > >> Any ideas what my problems could be?
> > >>
> > >> TIA,
> > >>
> > >> Andreas
> > >>     
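
PS regarding the broken pipe in the trace quoted above: that exception from
PipeMapper usually just means the mapper process exited, or never started
properly, before it had consumed all of its input. A quick sanity check
(only a sketch, using the stock /bin/cat as a pass-through mapper) would be:

time bin/hadoop jar contrib/streaming/hadoop-0.16.0-streaming.jar -mapper /bin/cat -reducer NONE -input testlogs/* -output testlogs-output

If that produces sensible output, the job setup itself is fine and the problem
is in the workloadmf script or its environment on the nodes.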
