Date: Thu, 6 Aug 2015 19:44:53 +0530
Subject: Reading data from FTP Server in Hadoop/Cascading
From: Arshad Ali Sayed
To: user@hadoop.apache.org

I want to read data from an FTP server. I am providing the path of the file, which resides on the FTP server, in the format ftp://Username:Password@host/path.

When I use a plain MapReduce program to read data from this file, it works fine. I want to read the same file through the Cascading framework, using Cascading's Hfs tap.
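For reference, the plain MapReduce version that reads from the same ftp:// path is roughly shaped like this (a simplified sketch, imports omitted; the driver class name, host, credentials and output path are placeholders, not my real values):

    // Simplified sketch of the working map-only job reading an ftp:// input path.
    JobConf job = new JobConf(MyFtpReadJob.class);      // MyFtpReadJob is a placeholder driver class
    job.setJobName("ftp-read-demo");
    job.setMapperClass(IdentityMapper.class);           // just pass each line through
    job.setNumReduceTasks(0);                           // map-only: lines go straight to the output
    job.setInputFormat(TextInputFormat.class);
    job.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(job, new Path("ftp://user:pwd@xx.xx.xx.xx/input1"));
    FileOutputFormat.setOutputPath(job, new Path("OP/op_mr"));
    JobClient.runJob(job);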
With Cascading, it throws the following exception:

    java.io.IOException: Stream closed
        at org.apache.hadoop.fs.ftp.FTPInputStream.close(FTPInputStream.java:98)
        at java.io.FilterInputStream.close(Unknown Source)
        at org.apache.hadoop.util.LineReader.close(LineReader.java:83)
        at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:168)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:254)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:440)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

Below is the Cascading code with which I am reading the files:

    public class FTPWithHadoopDemo {
        public static void main(String args[]) {
            Tap source = new Hfs(new TextLine(new Fields("line")), "ftp://user:pwd@xx.xx.xx.xx//input1");
            Tap sink = new Hfs(new TextLine(new Fields("line1")), "OP\\op", SinkMode.REPLACE);
            Pipe pipe = new Pipe("First");
            pipe = new Each(pipe, new RegexSplitGenerator("\\s+"));
            pipe = new GroupBy(pipe);
            Pipe tailpipe = new Every(pipe, new Count());
            FlowDef flowDef = FlowDef.flowDef().addSource(pipe, source).addTailSink(tailpipe, sink);
            new HadoopFlowConnector().connect(flowDef).complete();
        }
    }

I tried to look in the Hadoop source code for this exception. I found that the MapTask class has a method runOldMapper which deals with the stream, and that this method has a finally block where the stream gets closed. When I remove that in.close() line from the code, it works fine (my guess at why the second close blows up is in the P.S. below). Here is the code:

    private <INKEY, INVALUE, OUTKEY, OUTVALUE> void runOldMapper(final JobConf job, final TaskSplitIndex splitIndex,
            final TaskUmbilicalProtocol umbilical, TaskReporter reporter)
            throws IOException, InterruptedException, ClassNotFoundException {
        InputSplit inputSplit = getSplitDetails(new Path(splitIndex.getSplitLocation()), splitIndex.getStartOffset());

        updateJobWithSplit(job, inputSplit);
        reporter.setInputSplit(inputSplit);

        RecordReader<INKEY, INVALUE> in = isSkipping()
                ? new SkippingRecordReader<INKEY, INVALUE>(inputSplit, umbilical, reporter)
                : new TrackedRecordReader<INKEY, INVALUE>(inputSplit, job, reporter);
        job.setBoolean("mapred.skip.on", isSkipping());

        int numReduceTasks = conf.getNumReduceTasks();
        LOG.info("numReduceTasks: " + numReduceTasks);
        MapOutputCollector collector = null;
        if (numReduceTasks > 0) {
            collector = new MapOutputBuffer(umbilical, job, reporter);
        } else {
            collector = new DirectMapOutputCollector(umbilical, job, reporter);
        }
        MapRunnable<INKEY, INVALUE, OUTKEY, OUTVALUE> runner = ReflectionUtils.newInstance(job.getMapRunnerClass(),
                job);

        try {
            runner.run(in, new OldOutputCollector(collector, conf), reporter);
            collector.flush();
        } finally {
            // close
            in.close(); // close input
            collector.close();
        }
    }

I asked the same question on the cascading-user group and they replied: "HFS supports whatever Hadoop supports, so if you supply a URI in the format ftp://, it should do the right thing." But I am still getting this exception.

Please assist me in solving this problem.

Thanks,
Arshadali
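P.S. My working theory, for what it is worth: the stream is being closed twice. As far as I can tell (paraphrasing from memory, not an exact copy of the Hadoop source), FTPInputStream.close() refuses a second close:

    // Paraphrased sketch of org.apache.hadoop.fs.ftp.FTPInputStream.close(), not verbatim.
    public synchronized void close() throws IOException {
        if (closed) {
            throw new IOException("Stream closed");   // the exception I am seeing
        }
        super.close();
        closed = true;
        // ... then completes the pending FTP command and disconnects the FTP client
    }

So if the record reader (or Cascading) has already closed the underlying FTP stream, the in.close() in MapTask's finally block reaches this method a second time and throws. That would also explain why removing in.close() makes the job run.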