Subject: Re: map reduce and sync
From: Lucas Bernardi <lucejb@gmail.com>
To: user@hadoop.apache.org
Date: Mon, 25 Feb 2013 12:38:40 -0300

I didn't notice, thanks for the heads up.

On Mon, Feb 25, 2013 at 4:31 AM, Harsh J wrote:
> Just an aside (I've not tried to look at the original issue yet), but
> Scribe has not been maintained (nor has seen a release) in over a year
> now, judging by the commit history. The same goes for both Facebook's
> and Twitter's forks.
>
> On Mon, Feb 25, 2013 at 7:16 AM, Lucas Bernardi wrote:
> > Yeah, I looked at Scribe. It looks good, but it sounds like too much
> > for my problem; I'd rather make it work the simple way. Could you
> > please post your code? Maybe I'm doing something wrong on the sync
> > side. Maybe a buffer size, block size or some other parameter is
> > different...
> >
> > Thanks!
> > Lucas
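(For concreteness: a minimal sketch of the kind of writer under discussion,
with the buffer size and block size made explicit. This is not code from the
thread; the path and sizes are illustrative, using the Hadoop 1.0.x
FileSystem API.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SyncWriter {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path("/tmp/sync-test.log"); // illustrative path

            // The create() overload that spells out the parameters
            // suspected of differing between the two setups.
            FSDataOutputStream out = fs.create(path,
                    true,                // overwrite
                    4096,                // buffer size (io.file.buffer.size)
                    (short) 3,           // replication
                    64L * 1024 * 1024);  // block size

            out.write("a search query\tresult1,result2\n".getBytes("UTF-8"));
            out.sync(); // push data to readers without closing the stream
            // The sketch deliberately ends without close(); in the scenario
            // discussed, the stream stays open and more lines are sync()'ed.
        }
    }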
> >
> > On Sun, Feb 24, 2013 at 10:31 PM, Hemanth Yamijala wrote:
> >> I am using the same version of Hadoop as you.
> >>
> >> Can you look at something like Scribe, which AFAIK fits the use case
> >> you describe.
> >>
> >> Thanks
> >> Hemanth
> >>
> >> On Sun, Feb 24, 2013 at 3:33 AM, Lucas Bernardi wrote:
> >>> That is exactly what I did, but in my case it is as if the file were
> >>> empty; the job counters say no bytes were read.
> >>> I'm using Hadoop 1.0.3. Which version did you try?
> >>>
> >>> What I'm trying to do is just some basic analytics on a product
> >>> search system. There is a search service; every time a user performs
> >>> a search, the search string and the results are stored in this file,
> >>> and the file is sync'ed. I'm actually using Pig to do some basic
> >>> counts. It doesn't work, as I described, because the file looks
> >>> empty to the map reduce components. I thought the problem was Pig,
> >>> but I wasn't sure, so I tried a simple MR job and used word count to
> >>> test whether the map reduce components actually see the sync'ed
> >>> bytes.
> >>>
> >>> Of course if I close the file, everything works perfectly, but I
> >>> don't want to close the file periodically, since that means I would
> >>> have to create another one each time (there being no append
> >>> support), and that would end up with too many tiny files, something
> >>> we know is bad for MR performance, and I don't want to add more
> >>> parts to this (like a file merging tool). I think using sync is a
> >>> clean solution, since we don't care about writing performance, so
> >>> I'd rather keep it like this if I can make it work.
> >>>
> >>> Any idea besides the Hadoop version?
> >>>
> >>> Thanks!
> >>>
> >>> Lucas
> >>>
> >>> On Sat, Feb 23, 2013 at 11:54 AM, Hemanth Yamijala wrote:
> >>>> Hi Lucas,
> >>>>
> >>>> I tried something like this but got different results.
> >>>>
> >>>> I wrote code that opened a file on HDFS, wrote a line and called
> >>>> sync. Without closing the file, I ran a wordcount with that file as
> >>>> input. It did work fine and was able to count the words that were
> >>>> sync'ed (even though the file length seems to come as 0, like you
> >>>> noted in fs -ls).
> >>>>
> >>>> So, not sure what's happening in your case. In the MR job, do the
> >>>> job counters indicate no bytes were read?
> >>>>
> >>>> On a different note, if you can describe a little more what you are
> >>>> trying to accomplish, we could probably work out a better solution.
> >>>>
> >>>> Thanks
> >>>> hemanth
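(A sketch of that kind of experiment; not Hemanth's actual code. Hadoop
1.0.x APIs, illustrative path. It shows the mismatch the thread keeps
circling: the length listed by the NameNode, which fs -ls reports, can lag
behind what a fresh reader actually sees after sync().)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SyncVisibilityCheck {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path("/tmp/sync-test.log"); // illustrative path

            FSDataOutputStream out = fs.create(path, true);
            out.write("hello hadoop sync\n".getBytes("UTF-8"));
            out.sync(); // data should now be visible to new readers

            // What fs -ls shows (the length recorded at the NameNode):
            System.out.println("listed length: "
                    + fs.getFileStatus(path).getLen());

            // What a fresh reader actually sees:
            FSDataInputStream in = fs.open(path);
            byte[] buf = new byte[8192];
            int n = in.read(buf);
            System.out.println("bytes read:    " + (n < 0 ? 0 : n));
            in.close();
            // out is deliberately left open, as in the scenario discussed.
        }
    }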
> >>>>
> >>>> On Sat, Feb 23, 2013 at 7:15 PM, Lucas Bernardi wrote:
> >>>>> Hello Hemanth, thanks for answering.
> >>>>> The file is opened by a separate process, not map reduce related
> >>>>> at all. You can think of it as a servlet receiving requests and
> >>>>> writing them to this file; every time a request is received, it is
> >>>>> written and org.apache.hadoop.fs.FSDataOutputStream.sync() is
> >>>>> invoked.
> >>>>>
> >>>>> At the same time, I want to run a map reduce job over this file.
> >>>>> Simply running the word count example doesn't seem to work; it is
> >>>>> as if the file were empty.
> >>>>>
> >>>>> hadoop fs -tail works just fine, and reading the file using
> >>>>> org.apache.hadoop.fs.FSDataInputStream also works ok.
> >>>>>
> >>>>> Last thing: the web interface doesn't see the contents, and the
> >>>>> command hadoop fs -ls says the file is empty.
> >>>>>
> >>>>> What am I doing wrong?
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> Lucas
> >>>>>
> >>>>> On Sat, Feb 23, 2013 at 4:37 AM, Hemanth Yamijala wrote:
> >>>>>> Could you please clarify, are you opening the file in your mapper
> >>>>>> code and reading from there?
> >>>>>>
> >>>>>> Thanks
> >>>>>> Hemanth
> >>>>>>
> >>>>>> On Friday, February 22, 2013, Lucas Bernardi wrote:
> >>>>>>> Hello there, I'm trying to use hadoop map reduce to process an
> >>>>>>> open file. The writing process writes a line to the file and
> >>>>>>> syncs the file to readers
> >>>>>>> (org.apache.hadoop.fs.FSDataOutputStream.sync()).
> >>>>>>>
> >>>>>>> If I try to read the file from another process, it works fine,
> >>>>>>> at least using org.apache.hadoop.fs.FSDataInputStream.
> >>>>>>>
> >>>>>>> hadoop fs -tail also works just fine.
> >>>>>>>
> >>>>>>> But it looks like map reduce doesn't read any data. I tried
> >>>>>>> using the word count example, same thing; it is as if the file
> >>>>>>> were empty to the map reduce framework.
> >>>>>>>
> >>>>>>> I'm using hadoop 1.0.3 and pig 0.10.0.
> >>>>>>>
> >>>>>>> I need some help around this.
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>> Lucas
>
> --
> Harsh J
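(Finally, an illustrative follower in the spirit of hadoop fs -tail, the
kind of reader that does work against such a file: it repeatedly reopens
the stream and reads past its last offset, picking up bytes made visible by
each sync(). Again an assumption-laden sketch, not code from the thread.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SyncFollower {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path(args[0]);
            long offset = 0;
            byte[] buf = new byte[4096];
            while (true) {
                // Reopen so the reader sees bytes sync()'ed since last pass.
                FSDataInputStream in = fs.open(path);
                in.seek(offset);
                int n;
                while ((n = in.read(buf)) > 0) {
                    System.out.write(buf, 0, n);
                    offset += n;
                }
                System.out.flush();
                in.close();
                Thread.sleep(1000); // wait for the writer to sync more data
            }
        }
    }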