Subject: Re: map reduce and sync
From: Lucas Bernardi <lucejb@gmail.com>
To: user@hadoop.apache.org
Date: Mon, 4 Mar 2013 13:09:53 -0300

Ok, so I found a workaround for this issue; I'm sharing it here for others.

The key problem is that Hadoop won't update the file size until the file is
closed, so FileInputFormat sees never-closed files as empty and generates no
splits for the map reduce job.
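Just to make the symptom concrete, here is a minimal writer-side sketch
(illustrative only, not code from this thread; the path and class name are
made up, and it uses the Hadoop 1.x API discussed below):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SyncLengthDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/tmp/sync-demo.log"); // illustrative path

            FSDataOutputStream out = fs.create(path);
            out.writeBytes("some search query\tresult1,result2\n");
            out.sync(); // push the bytes to readers without closing the file

            // The data is readable (FSDataInputStream, 'hadoop fs -tail'), but
            // the reported length stays 0 until close(), which is why
            // FileInputFormat computes no splits for the file.
            System.out.println("reported length: " + fs.getFileStatus(path).getLen());
        }
    }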
To fix this problem I changed the way the file length is calculated,
overriding the listStatus method in a new InputFormat implementation that
inherits from FileInputFormat:

    @Override
    protected List<FileStatus> listStatus(JobContext job) throws IOException {
        List<FileStatus> listStatus = super.listStatus(job);
        List<FileStatus> result = Lists.newArrayList();
        DFSClient dfsClient = null;
        try {
            dfsClient = new DFSClient(job.getConfiguration());
            for (FileStatus fileStatus : listStatus) {
                long len = fileStatus.getLen();
                if (len == 0) {
                    DFSInputStream open = dfsClient.open(fileStatus.getPath().toUri().getPath());
                    long fileLength = open.getFileLength();
                    open.close();
                    FileStatus fileStatus2 = new FileStatus(fileLength, fileStatus.isDir(),
                            fileStatus.getReplication(), fileStatus.getBlockSize(),
                            fileStatus.getModificationTime(), fileStatus.getAccessTime(),
                            fileStatus.getPermission(), fileStatus.getOwner(),
                            fileStatus.getGroup(), fileStatus.getPath());
                    result.add(fileStatus2);
                } else {
                    result.add(fileStatus);
                }
            }
        } finally {
            if (dfsClient != null) {
                dfsClient.close();
            }
        }
        return result;
    }

This worked just fine for me. (A sketch of how it can be wired into a
wordcount job is at the end of this message, after the quoted thread.)

What do you think?

Thanks!
Lucas

On Mon, Feb 25, 2013 at 7:03 PM, Lucas Bernardi <lucejb@gmail.com> wrote:
> It looks like getSplits in FileInputFormat is ignoring 0-length files...
> That would also explain the weird behavior of tail, which seems to always
> jump to the start, since the file length is 0.
>
> So, basically, sync doesn't update the file length, and any code based on
> file size is unreliable.
>
> Am I right?
>
> How can I get around this?
>
> Lucas
>
> On Mon, Feb 25, 2013 at 12:38 PM, Lucas Bernardi <lucejb@gmail.com> wrote:
>> I didn't notice, thanks for the heads up.
>>
>> On Mon, Feb 25, 2013 at 4:31 AM, Harsh J <harsh@cloudera.com> wrote:
>>> Just an aside (I've not tried to look at the original issue yet), but
>>> Scribe has not been maintained (nor has seen a release) in over a year
>>> now -- looking at the commit history. Same case with both Facebook's
>>> and Twitter's forks.
>>>
>>> On Mon, Feb 25, 2013 at 7:16 AM, Lucas Bernardi <lucejb@gmail.com> wrote:
>>>> Yeah, I looked at Scribe; it looks good but sounds like too much for
>>>> my problem. I'd rather make it work the simple way. Could you please
>>>> post your code? Maybe I'm doing something wrong on the sync side, or
>>>> maybe a buffer size, block size or some other parameter is
>>>> different...
>>>>
>>>> Thanks!
>>>> Lucas
>>>>
>>>> On Sun, Feb 24, 2013 at 10:31 PM, Hemanth Yamijala
>>>> <yhemanth@thoughtworks.com> wrote:
>>>>> I am using the same version of Hadoop as you.
>>>>>
>>>>> Can you look at something like Scribe, which AFAIK fits the use case
>>>>> you describe?
>>>>>
>>>>> Thanks
>>>>> Hemanth
>>>>>
>>>>> On Sun, Feb 24, 2013 at 3:33 AM, Lucas Bernardi <lucejb@gmail.com> wrote:
>>>>>> That is exactly what I did, but in my case it is as if the file
>>>>>> were empty; the job counters say no bytes were read.
>>>>>> I'm using hadoop 1.0.3 -- which version did you try?
>>>>>>
>>>>>> What I'm trying to do is just some basic analytics on a product
>>>>>> search system. There is a search service; every time a user
>>>>>> performs a search, the search string and the results are stored in
>>>>>> this file, and the file is sync'ed. I'm actually using Pig to do
>>>>>> some basic counts, and it doesn't work, like I described, because
>>>>>> the file looks empty to the map reduce components. I thought it was
>>>>>> about Pig, but I wasn't sure, so I tried a simple MR job and used
>>>>>> the word count to test whether the map reduce components actually
>>>>>> see the sync'ed bytes.
>>>>>>
>>>>>> Of course if I close the file, everything works perfectly, but I
>>>>>> don't want to close the file every once in a while, since that
>>>>>> means I would have to create another one (there is no append
>>>>>> support), and that would end up with too many tiny files, something
>>>>>> we know is bad for MR performance, and I don't want to add more
>>>>>> parts to this (like a file merging tool). I think using sync is a
>>>>>> clean solution, since we don't care about write performance, so I'd
>>>>>> rather keep it like this if I can make it work.
>>>>>>
>>>>>> Any idea besides the hadoop version?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Lucas
>>>>>>
>>>>>> On Sat, Feb 23, 2013 at 11:54 AM, Hemanth Yamijala
>>>>>> <yhemanth@thoughtworks.com> wrote:
>>>>>>> Hi Lucas,
>>>>>>>
>>>>>>> I tried something like this but got different results.
>>>>>>>
>>>>>>> I wrote code that opened a file on HDFS, wrote a line and called
>>>>>>> sync. Without closing the file, I ran a wordcount with that file
>>>>>>> as input. It did work fine and was able to count the words that
>>>>>>> were sync'ed (even though the file length seems to come up as 0,
>>>>>>> like you noted in fs -ls).
>>>>>>>
>>>>>>> So, not sure what's happening in your case. In the MR job, do the
>>>>>>> job counters indicate no bytes were read?
>>>>>>>
>>>>>>> On a different note, if you can describe a little more what you
>>>>>>> are trying to accomplish, we could probably work out a better
>>>>>>> solution.
>>>>>>>
>>>>>>> Thanks
>>>>>>> hemanth
>>>>>>>
>>>>>>> On Sat, Feb 23, 2013 at 7:15 PM, Lucas Bernardi <lucejb@gmail.com>
>>>>>>> wrote:
>>>>>>>> Hello Hemanth, thanks for answering.
>>>>>>>> The file is open by a separate process, not map reduce related at
>>>>>>>> all. You can think of it as a servlet receiving requests and
>>>>>>>> writing them to this file; every time a request is received it is
>>>>>>>> written and org.apache.hadoop.fs.FSDataOutputStream.sync() is
>>>>>>>> invoked.
>>>>>>>>
>>>>>>>> At the same time, I want to run a map reduce job over this file.
>>>>>>>> Simply running the word count example doesn't seem to work; it is
>>>>>>>> as if the file were empty.
>>>>>>>>
>>>>>>>> hadoop fs -tail works just fine, and reading the file using
>>>>>>>> org.apache.hadoop.fs.FSDataInputStream also works ok.
>>>>>>>>
>>>>>>>> Last thing: the web interface doesn't see the contents, and the
>>>>>>>> command hadoop fs -ls says the file is empty.
>>>>>>>>
>>>>>>>> What am I doing wrong?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> Lucas
>>>>>>>>
>>>>>>>> On Sat, Feb 23, 2013 at 4:37 AM, Hemanth Yamijala
>>>>>>>> <yhemanth@thoughtworks.com> wrote:
>>>>>>>>> Could you please clarify: are you opening the file in your
>>>>>>>>> mapper code and reading from there?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Hemanth
>>>>>>>>>
>>>>>>>>> On Friday, February 22, 2013, Lucas Bernardi wrote:
>>>>>>>>>> Hello there, I'm trying to use hadoop map reduce to process an
>>>>>>>>>> open file. The writing process writes a line to the file and
>>>>>>>>>> syncs the file to readers
>>>>>>>>>> (org.apache.hadoop.fs.FSDataOutputStream.sync()).
>>>>>>>>>>
>>>>>>>>>> If I try to read the file from another process, it works fine,
>>>>>>>>>> at least using org.apache.hadoop.fs.FSDataInputStream.
>>>>>>>>>>
>>>>>>>>>> hadoop fs -tail also works just fine.
>>>>>>>>>>
>>>>>>>>>> But it looks like map reduce doesn't read any data.
>>>>>>>>>> I tried using the word count example, same thing: it is as if
>>>>>>>>>> the file were empty to the map reduce framework.
>>>>>>>>>>
>>>>>>>>>> I'm using hadoop 1.0.3 and pig 0.10.0.
>>>>>>>>>>
>>>>>>>>>> I need some help around this.
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> Lucas
>>>
>>> --
>>> Harsh J
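P.S. For completeness, here is a rough sketch of how the listStatus()
override above could be wired into the stock wordcount job used for testing
in this thread. This is illustrative only: the class names
(SyncedFileWordCount, SyncAwareTextInputFormat) are made up, and the mapper
and reducer are just the standard wordcount ones.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SyncedFileWordCount {

        // Hypothetical name. Extending TextInputFormat keeps record reading
        // unchanged; only split calculation changes via the listStatus()
        // override shown earlier in this message.
        public static class SyncAwareTextInputFormat extends TextInputFormat {
            // ... paste the listStatus(JobContext) override here, along with
            // the DFSClient/DFSInputStream/Lists imports it needs ...
        }

        // Standard wordcount mapper: emit (word, 1) for every token in a line.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Standard wordcount reducer: sum the counts for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            protected void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "wordcount over a still-open, sync'ed file");
            job.setJarByClass(SyncedFileWordCount.class);

            // The only change from the stock wordcount: the sync-aware format.
            job.setInputFormatClass(SyncAwareTextInputFormat.class);

            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }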