Subject: Re: map reduce and sync
From: Lucas Bernardi <lucejb@gmail.com>
To: user@hadoop.apache.org
Date: Mon, 25 Feb 2013 12:38:40 -0300

I didn't notice, thanks for the heads up.

On Mon, Feb 25, 2013 at 4:31 AM, Harsh J wrote:
> Just an aside (I've not tried to look at the original issue yet), but
> Scribe has not been maintained (nor has seen a release) in over a year
> now, judging by the commit history. The same goes for both Facebook's
> and Twitter's forks.
>
> On Mon, Feb 25, 2013 at 7:16 AM, Lucas Bernardi wrote:
> > Yeah, I looked at Scribe. It looks good, but it sounds like too much
> > for my problem; I'd rather make it work the simple way. Could you
> > please post your code? Maybe I'm doing something wrong on the sync
> > side. Maybe a buffer size, block size or some other parameter is
> > different...
> >
> > Thanks!
> > Lucas
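(For concreteness: a minimal sketch of the kind of writer under discussion,
with the buffer size and block size made explicit. This is not code from the
thread; the path and sizes are illustrative, using the Hadoop 1.0.x
FileSystem API.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SyncWriter {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path("/tmp/sync-test.log"); // illustrative path

            // The create() overload that spells out the parameters
            // suspected of differing between the two setups.
            FSDataOutputStream out = fs.create(path,
                    true,                // overwrite
                    4096,                // buffer size (io.file.buffer.size)
                    (short) 3,           // replication
                    64L * 1024 * 1024);  // block size

            out.write("a search query\tresult1,result2\n".getBytes("UTF-8"));
            out.sync(); // push data to readers without closing the stream
            // The sketch deliberately ends without close(); in the scenario
            // discussed, the stream stays open and more lines are sync()'ed.
        }
    }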
> >
> > On Sun, Feb 24, 2013 at 10:31 PM, Hemanth Yamijala wrote:
> >> I am using the same version of Hadoop as you.
> >>
> >> Can you look at something like Scribe, which AFAIK fits the use case
> >> you describe.
> >>
> >> Thanks
> >> Hemanth
> >>
> >> On Sun, Feb 24, 2013 at 3:33 AM, Lucas Bernardi wrote:
> >>> That is exactly what I did, but in my case it is as if the file were
> >>> empty; the job counters say no bytes were read.
> >>> I'm using Hadoop 1.0.3. Which version did you try?
> >>>
> >>> What I'm trying to do is just some basic analytics on a product
> >>> search system. There is a search service; every time a user performs
> >>> a search, the search string and the results are stored in this file,
> >>> and the file is sync'ed. I'm actually using Pig to do some basic
> >>> counts. It doesn't work, as I described, because the file looks
> >>> empty to the map reduce components. I thought the problem was Pig,
> >>> but I wasn't sure, so I tried a simple MR job and used word count to
> >>> test whether the map reduce components actually see the sync'ed
> >>> bytes.
> >>>
> >>> Of course if I close the file, everything works perfectly, but I
> >>> don't want to close the file periodically, since that means I would
> >>> have to create another one each time (there being no append
> >>> support), and that would end up with too many tiny files, something
> >>> we know is bad for MR performance, and I don't want to add more
> >>> parts to this (like a file merging tool). I think using sync is a
> >>> clean solution, since we don't care about writing performance, so
> >>> I'd rather keep it like this if I can make it work.
> >>>
> >>> Any idea besides the Hadoop version?
> >>>
> >>> Thanks!
> >>>
> >>> Lucas
> >>>
> >>> On Sat, Feb 23, 2013 at 11:54 AM, Hemanth Yamijala wrote:
> >>>> Hi Lucas,
> >>>>
> >>>> I tried something like this but got different results.
> >>>>
> >>>> I wrote code that opened a file on HDFS, wrote a line and called
> >>>> sync. Without closing the file, I ran a wordcount with that file as
> >>>> input. It did work fine and was able to count the words that were
> >>>> sync'ed (even though the file length seems to come as 0, like you
> >>>> noted in fs -ls).
> >>>>
> >>>> So, not sure what's happening in your case. In the MR job, do the
> >>>> job counters indicate no bytes were read?
> >>>>
> >>>> On a different note, if you can describe a little more what you are
> >>>> trying to accomplish, we could probably work out a better solution.
> >>>>
> >>>> Thanks
> >>>> hemanth
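(A sketch of that kind of experiment; not Hemanth's actual code. Hadoop
1.0.x APIs, illustrative path. It shows the mismatch the thread keeps
circling: the length listed by the NameNode, which fs -ls reports, can lag
behind what a fresh reader actually sees after sync().)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SyncVisibilityCheck {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path("/tmp/sync-test.log"); // illustrative path

            FSDataOutputStream out = fs.create(path, true);
            out.write("hello hadoop sync\n".getBytes("UTF-8"));
            out.sync(); // data should now be visible to new readers

            // What fs -ls shows (the length recorded at the NameNode):
            System.out.println("listed length: "
                    + fs.getFileStatus(path).getLen());

            // What a fresh reader actually sees:
            FSDataInputStream in = fs.open(path);
            byte[] buf = new byte[8192];
            int n = in.read(buf);
            System.out.println("bytes read:    " + (n < 0 ? 0 : n));
            in.close();
            // out is deliberately left open, as in the scenario discussed.
        }
    }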
> >>>>
> >>>> On Sat, Feb 23, 2013 at 7:15 PM, Lucas Bernardi wrote:
> >>>>> Hello Hemanth, thanks for answering.
> >>>>> The file is opened by a separate process, not map reduce related
> >>>>> at all. You can think of it as a servlet receiving requests and
> >>>>> writing them to this file; every time a request is received, it is
> >>>>> written and org.apache.hadoop.fs.FSDataOutputStream.sync() is
> >>>>> invoked.
> >>>>>
> >>>>> At the same time, I want to run a map reduce job over this file.
> >>>>> Simply running the word count example doesn't seem to work; it is
> >>>>> as if the file were empty.
> >>>>>
> >>>>> hadoop fs -tail works just fine, and reading the file using
> >>>>> org.apache.hadoop.fs.FSDataInputStream also works ok.
> >>>>>
> >>>>> Last thing: the web interface doesn't see the contents, and the
> >>>>> command hadoop fs -ls says the file is empty.
> >>>>>
> >>>>> What am I doing wrong?
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> Lucas
> >>>>>
> >>>>> On Sat, Feb 23, 2013 at 4:37 AM, Hemanth Yamijala wrote:
> >>>>>> Could you please clarify, are you opening the file in your mapper
> >>>>>> code and reading from there?
> >>>>>>
> >>>>>> Thanks
> >>>>>> Hemanth
> >>>>>>
> >>>>>> On Friday, February 22, 2013, Lucas Bernardi wrote:
> >>>>>>> Hello there, I'm trying to use hadoop map reduce to process an
> >>>>>>> open file. The writing process writes a line to the file and
> >>>>>>> syncs the file to readers
> >>>>>>> (org.apache.hadoop.fs.FSDataOutputStream.sync()).
> >>>>>>>
> >>>>>>> If I try to read the file from another process, it works fine,
> >>>>>>> at least using org.apache.hadoop.fs.FSDataInputStream.
> >>>>>>>
> >>>>>>> hadoop fs -tail also works just fine.
> >>>>>>>
> >>>>>>> But it looks like map reduce doesn't read any data. I tried
> >>>>>>> using the word count example, same thing; it is as if the file
> >>>>>>> were empty to the map reduce framework.
> >>>>>>>
> >>>>>>> I'm using hadoop 1.0.3 and pig 0.10.0.
> >>>>>>>
> >>>>>>> I need some help around this.
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>> Lucas
>
> --
> Harsh J
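(Finally, an illustrative follower in the spirit of hadoop fs -tail, the
kind of reader that does work against such a file: it repeatedly reopens
the stream and reads past its last offset, picking up bytes made visible by
each sync(). Again an assumption-laden sketch, not code from the thread.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SyncFollower {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path(args[0]);
            long offset = 0;
            byte[] buf = new byte[4096];
            while (true) {
                // Reopen so the reader sees bytes sync()'ed since last pass.
                FSDataInputStream in = fs.open(path);
                in.seek(offset);
                int n;
                while ((n = in.read(buf)) > 0) {
                    System.out.write(buf, 0, n);
                    offset += n;
                }
                System.out.flush();
                in.close();
                Thread.sleep(1000); // wait for the writer to sync more data
            }
        }
    }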