hbase-user mailing list archives

From Zheng Lv <lvzheng19800...@gmail.com>
Subject Re: Cannot open filename Exceptions
Date Wed, 24 Mar 2010 03:42:56 GMT
Hello Stack,
  >So, for sure ugly stuff is going on.  I filed
  >https://issues.apache.org/jira/browse/HBASE-2365.  It looks like we're
  >doubly assigning a region.
  Can you tell me how this happened in detail? Thanks a lot.

  >Can you confirm that 209 lags behind the master (207) by about 25
  >seconds?  Are you running NTP on these machines so they sync their
  >clocks?
  Yes, 209 lags behind 207 by about 30 seconds. But can this lead to these
exceptions?
  There was something wrong with our NTP script before; it has been fixed
now.
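
  For example, the offset can be checked on each node without touching the
clock (assuming the standard ntpdate/ntpq tools are installed):

    # ask the master (cactus207 here) how far off the local clock is
    ntpdate -q cactus207
    # or list the local ntpd's peers with their offsets (in milliseconds)
    ntpq -p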

  >With DEBUG enabled have you been able to reproduce?
  These days the exception has not appeared again; if it does, I'll show you
the logs.
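
  (For reference, DEBUG can be enabled in HBase 0.20 by editing
conf/log4j.properties on each node and restarting the daemons:

    # raise HBase logging from the default INFO to DEBUG
    log4j.logger.org.apache.hadoop.hbase=DEBUG

  The level can usually also be changed on a running daemon through the web
UI's /logLevel servlet.)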

  Thanks a lot again.
    LvZheng

2010/3/23 Stack <stack@duboce.net>

> So, for sure ugly stuff is going on.  I filed
> https://issues.apache.org/jira/browse/HBASE-2365.  It looks like we're
> doubly assigning a region.
>
> Can you confirm that 209 lags behind the master (207) by about 25
> seconds?  Are you running NTP on these machines so they sync their
> clocks?
>
> With DEBUG enabled have you been able to reproduce?
>
> That said there might be enough in these logs to go on if you can
> confirm the above.
>
> Thanks for your patience Zheng.
>
> St.Ack
>
>
>
> On Thu, Mar 18, 2010 at 11:43 PM, Zheng Lv <lvzheng19800619@gmail.com> wrote:
> > Hello Stack,
> >  I must say thank you for your patience too.
> >  I'm sorry that you tried many times but the logs you got were not that
> > useful. I have now turned the logging to DEBUG level, so if we get these
> > exceptions again, I will send you debug logs. Anyway, I have still
> > uploaded the logs you wanted to rapidshare, although they are not at
> > DEBUG level. The urls:
> > http://rapidshare.com/files/365292889/hadoop-root-namenode-cactus207.log.2010-03-15.html
> > http://rapidshare.com/files/365293127/hbase-root-master-cactus207.log.2010-03-15.html
> > http://rapidshare.com/files/365293238/hbase-root-regionserver-cactus208.log.2010-03-15.html
> > http://rapidshare.com/files/365293391/hbase-root-regionserver-cactus209.log.2010-03-15.html
> > http://rapidshare.com/files/365293488/hbase-root-regionserver-cactus210.log.2010-03-15.html
> >
> >  >For sure you've upped xceivers on your hdfs cluster and you've upped
> >  >the file descriptors as per the 'Getting Started'? (Sorry, have to
> >  >ask).
> >  Before I got your mail, we hadn't set the properties you mentioned,
> > because we never got "too many open files" or the other errors mentioned
> > in the 'Getting Started' docs. But now I have upped these properties.
> > We'll see what happens.
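
  (The two settings usually meant here are the datanode transceiver limit
and the per-user open-file limit; the values below are illustrative only:

    <!-- hdfs-site.xml on every datanode; note the property name's
         historical misspelling -->
    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>

    # /etc/security/limits.conf, for the user running the daemons
    root - nofile 32768

  Both require restarting the daemons to take effect.)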
> >
> >  If you need more information, just tell me.
> >
> >  Thanks again,
> >  LvZheng.
> >
> >
> > 2010/3/19 Stack <stack@duboce.net>
> >
> >> Yeah, I had to retry a couple of times ("Too busy; try back later --
> >> or sign up premium service!").
> >>
> >> It would have been nice to have wider log snippets.  I'd like to have
> >> seen if the issue was double assignment.  The master log snippet only
> >> shows the split.  Regionserver 209's log is the one where the
> >> interesting stuff is going on around this time, 2010-03-15
> >> 16:06:51,150, but it's not in the provided set.  Nor are you running
> >> at DEBUG level, so it'd be harder to see what is up even if you
> >> provided it.
> >>
> >> Looking in 208, I see a few exceptions beyond the one you paste below.
> >>  For sure you've upped xceivers on your hdfs cluster and you've upped
> >> the file descriptors as per the 'Getting Started'? (Sorry, have to
> >> ask).
> >>
> >> Can I have more of the logs?  Can I have all of the namenode log, all
> >> of the master log and 209's log?  This rapidshare thing is fine with
> >> me.  I don't mind retrying.
> >>
> >> Sorry it took me a while to get to this.
> >> St.Ack
> >>
> >> On Wed, Mar 17, 2010 at 8:32 PM, Zheng Lv <lvzheng19800619@gmail.com> wrote:
> >> > Hello Stack,
> >> >    >Sorry. It's taken me a while.  Let me try and get to this this
> >> >    >evening
> >> >    Is it downloading the log files that takes you a while? I'm sorry;
> >> > I used to upload files to SkyDrive, but now we can't access that
> >> > website. Is there any file-hosting service you can download from
> >> > quickly? I can upload to it.
> >> >    LvZheng
> >> > 2010/3/18 Stack <saint.ack@gmail.com>
> >> >
> >> >> Sorry. It's taken me a while.  Let me try and get to this this evening.
> >> >>
> >> >> Thank you for your patience
> >> >>
> >> >>
> >> >> On Mar 17, 2010, at 2:29 AM, Zheng Lv <lvzheng19800619@gmail.com> wrote:
> >> >>
> >> >> Hello Stack,
> >> >>>  Did you receive my mail? It looks like you didn't.
> >> >>>   LvZheng
> >> >>>
> >> >>> 2010/3/16 Zheng Lv <lvzheng19800619@gmail.com>
> >> >>>
> >> >>>> Hello Stack,
> >> >>>>  I have uploaded some parts of the logs from the master,
> >> >>>> regionserver 208 and regionserver 210 to:
> >> >>>>  http://rapidshare.com/files/363988384/master_207_log.txt.html
> >> >>>>  http://rapidshare.com/files/363988673/regionserver_208_log.txt.html
> >> >>>>  http://rapidshare.com/files/363988819/regionserver_210_log.txt.html
> >> >>>>  I noticed that there are some LeaseExpiredExceptions and
> >> >>>> "2010-03-15 16:06:32,864 ERROR
> >> >>>> org.apache.hadoop.hbase.regionserver.CompactSplitThread:
> >> >>>> Compaction/Split failed for region ..." before 17 o'clock. Did
> >> >>>> these lead to the error? Why did they happen? How can we avoid
> >> >>>> them?
> >> >>>>  Thanks.
> >> >>>>   LvZheng
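
  (Those events are quick to pull out of the logs with something like:

    # list the compaction/split failures and lease expirations on 15 March
    grep -n "CompactSplitThread\|LeaseExpiredException" \
        hbase-root-regionserver-cactus209.log.2010-03-15

  substituting the log of whichever regionserver hosted the region.)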
> >> >>>> 2010/3/16 Stack <stack@duboce.net>
> >> >>>>
> >> >>>>> Maybe just the master log from around this time would be
> >> >>>>> sufficient to figure out the story.
> >> >>>>> St.Ack
> >> >>>>>
> >> >>>>> On Mon, Mar 15, 2010 at 10:04 PM, Stack <stack@duboce.net> wrote:
> >> >>>>>
> >> >>>>>> Hey Zheng:
> >> >>>>>>
> >> >>>>>> On Mon, Mar 15, 2010 at 8:16 PM, Zheng Lv <lvzheng19800619@gmail.com> wrote:
> >> >>>>>
> >> >>>>>>> Hello Stack,
> >> >>>>>>> After we got these exceptions, we restarted the cluster and
> >> >>>>>>> restarted the job that had failed, and the job succeeded.
> >> >>>>>>> Now when we access
> >> >>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894,
> >> >>>>>>> we got "Cannot access
> >> >>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894: No such
> >> >>>>>>> file or directory".
> >> >>>>>>>
> >> >>>>>> So, that would seem to indicate that the reference was in memory
> >> >>>>>> only; that file was not in the filesystem.  You could have tried
> >> >>>>>> closing that region.  It would also have been interesting to find
> >> >>>>>> the history of that region, to try and figure out how it came to
> >> >>>>>> hold in memory a reference to a file since removed.
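
  (Closing a region by hand can be done from the HBase shell; the master
then reassigns it, and the regionserver reopens it with a fresh set of
store-file references. A sketch, where REGIONNAME stands for the full
region name printed in the exception:

    hbase> close_region 'REGIONNAME'

  In the 0.20 shell an optional second argument names the regionserver to
send the close to.)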
> >> >>>>>>
> >> >>>>>>> The messages about this file in the namenode logs are in here:
> >> >>>>>>> http://rapidshare.com/files/363938595/log.txt.html
> >> >>>>>>
> >> >>>>>> This is interesting.  Do you have regionserver logs from 209,
> >> >>>>>> 208, and 210 for the corresponding times?
> >> >>>>>>
> >> >>>>>> Thanks,
> >> >>>>>> St.Ack
> >> >>>>>>
> >> >>>>>>> The job that failed started at about 17 o'clock.
> >> >>>>>>> By the way, the hadoop version we are using is 0.20.1, and the
> >> >>>>>>> hbase version we are using is 0.20.3.
> >> >>>>>>>
> >> >>>>>>> Regards,
> >> >>>>>>> LvZheng
> >> >>>>>>> 2010/3/16 Stack <stack@duboce.net>
> >> >>>>>>>
> >> >>>>>>>> Can you get that file from hdfs?
> >> >>>>>>>>
> >> >>>>>>>>   ./bin/hadoop fs -get /hbase/summary/1491233486/metrics/5046821377427277894
> >> >>>>>>>>
> >> >>>>>>>> Does it look wholesome?  Is it empty?
> >> >>>>>>>>
> >> >>>>>>>> What if you trace the life of that file in the regionserver
> >> >>>>>>>> logs or, probably better, over in the namenode log?  If you
> >> >>>>>>>> move this file aside, does the region deploy?
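
  (Concretely, with the paths and log names from this thread, that trace
might look like:

    # is the file still present, and are its blocks healthy?
    ./bin/hadoop fs -ls /hbase/summary/1491233486/metrics
    ./bin/hadoop fsck /hbase/summary/1491233486/metrics/5046821377427277894 \
        -files -blocks -locations

    # follow the file's create/delete history in the namenode log
    grep 5046821377427277894 hadoop-root-namenode-cactus207.log.2010-03-15

  The fsck flags print each block and the datanodes holding it.)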
> >> >>>>>>>>
> >> >>>>>>>> St.Ack
> >> >>>>>>>>
> >> >>>>>>>> On Mon, Mar 15, 2010 at 3:40 AM, Zheng Lv <lvzheng19800619@gmail.com> wrote:
> >> >>>>>>>>
> >> >>>>>>>>> Hello Everyone,
> >> >>>>>>>>>  Recently we often got these in our client logs:
> >> >>>>>>>>>  org.apache.hadoop.hbase.client.RetriesExhaustedException:
> >> >>>>>>>>> Trying to contact region server 172.16.1.208:60020 for region
> >> >>>>>>>>> summary,SITE_0000000032\x01pt\x0120100314000000\x01\x25E7\x258C\x25AE\x25E5\x258E\x25BF\x25E5\x2586\x2580\x25E9\x25B9\x25B0\x25E6\x2591\x25A9\x25E6\x2593\x25A6\x25E6\x259D\x2590\x25E6\x2596\x2599\x25E5\x258E\x2582\x2B\x25E6\x25B1\x25BD\x25E8\x25BD\x25A6\x25E9\x2585\x258D\x25E4\x25BB\x25B6\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581,1268640385017,
> >> >>>>>>>>> row
> >> >>>>>>>>> 'SITE_0000000032\x01pt\x0120100315000000\x01\x2521\x25EF\x25BC\x2581\x25E9\x2594\x2580\x25E5\x2594\x25AE\x252F\x25E6\x2594\x25B6\x25E8\x25B4\x25AD\x25EF\x25BC\x2581VM700T\x2BVM700T\x2B\x25E5\x259B\x25BE\x25E5\x2583\x258F\x25E4\x25BF\x25A1\x25E5\x258F\x25B7\x25E4\x25BA\x25A7\x25E7\x2594\x259F\x25E5\x2599\x25A8\x2B\x25E7\x2594\x25B5\x25E5\x25AD\x2590\x25E6\x25B5\x258B\x25E9\x2587\x258F\x25E4\x25BB\x25AA\x25E5\x2599\x25A8\x25EF\x25BC\x258C\x25E5\x2598\x2580\x25E9\x2593\x2583\x25E9\x2593\x2583--\x25E7\x259C\x259F\x25E5\x25AE\x259E\x25E5\x25AE\x2589\x25E5\x2585\x25A8\x25E7\x259A\x2584\x25E7\x2594\x25B5\x25E8\x25AF\x259D\x25E3\x2580\x2581\x25E7\x25BD\x2591\x25E7\x25BB\x259C\x25E4\x25BA\x2592\x25E5\x258A\x25A8\x25E4\x25BA\x25A4\x25E5\x258F\x258B\x25E7\x25A4\x25BE\x25E5\x258C\x25BA\x25EF\x25BC\x2581',
> >> >>>>>>>>> but failed after 10 attempts.
> >> >>>>>>>>> Exceptions:
> >> >>>>>>>>> java.io.IOException: java.io.IOException: Cannot open filename
> >> >>>>>>>>> /hbase/summary/1491233486/metrics/5046821377427277894
> >> >>>>>>>>>   at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1474)
> >> >>>>>>>>>   at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1800)
> >> >>>>>>>>>   at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1616)
> >> >>>>>>>>>   at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1743)
> >> >>>>>>>>>   at java.io.DataInputStream.read(DataInputStream.java:132)
> >> >>>>>>>>>   at org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:99)
> >> >>>>>>>>>   at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
> >> >>>>>>>>>   at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1020)
> >> >>>>>>>>>   at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:971)
> >> >>>>>>>>>   at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.loadBlock(HFile.java:1304)
> >> >>>>>>>>>   at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1186)
> >> >>>>>>>>>   at org.apache.hadoop.hbase.io.HalfHFileReader$1.seekTo(HalfHFileReader.java:207)
> >> >>>>>>>>>   at org.apache.hadoop.hbase.regionserver.StoreFileGetScan.getStoreFile(StoreFileGetScan.java:80)
> >> >>>>>>>>>   at org.apache.hadoop.hbase.regionserver.StoreFileGetScan.get(StoreFileGetScan.java:65)
> >> >>>>>>>>>   at org.apache.hadoop.hbase.regionserver.Store.get(Store.java:1461)
> >> >>>>>>>>>   at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2396)
> >> >>>>>>>>>   at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2385)
> >> >>>>>>>>>   at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1731)
> >> >>>>>>>>>   at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
> >> >>>>>>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >> >>>>>>>>>   at java.lang.reflect.Method.invoke(Method.java:597)
> >> >>>>>>>>>   at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
> >> >>>>>>>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> >> >>>>>>>>>  Is there any way to fix this problem? Or is there anything
> >> >>>>>>>>> we can do, even manually, to relieve it?
> >> >>>>>>>>>  Any suggestions?
> >> >>>>>>>>>  Thank you.
> >> >>>>>>>>>  LvZheng
