accumulo-user mailing list archives

From John Vines <vi...@apache.org>
Subject Re: Problem during compacting a table
Date Wed, 05 Aug 2015 14:58:03 GMT
1. It looks like an underlying HDFS issue. I'm not familiar with that
message; maybe it's new in Hadoop 2.7? I'm not sure how well tested we are
against Hadoop 2.7, especially with Accumulo 1.6, so that could be a factor.

2. You don't need tablets to major compact to use locality groups.

3. The shell was waiting for the major compaction to finish because you
gave it the -w flag. If you didn't want the shell to wait, do not provide
that flag.
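
To see how long the shell actually blocked while waiting, the timestamps in
the quoted shell log can be subtracted. This is a minimal Python sketch,
assuming the log line layout shown in the quoted message below; the helper
name `elapsed_seconds` is hypothetical and not part of Accumulo.

```python
from datetime import datetime

# Timestamp layout used by the Accumulo shell log lines quoted in this thread.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_seconds(start_line: str, end_line: str) -> float:
    """Return the seconds between the leading timestamps of two log lines."""

    def leading_timestamp(line: str) -> datetime:
        # The timestamp is the first two space-separated fields of the line.
        date_part, time_part = line.split(" ", 2)[:2]
        return datetime.strptime(f"{date_part} {time_part}", LOG_TIME_FORMAT)

    delta = leading_timestamp(end_line) - leading_timestamp(start_line)
    return delta.total_seconds()


# Log lines copied from the quoted shell output.
started = "2015-08-05 12:28:50,586 [shell.Shell] INFO : Compacting table ..."
finished = (
    "2015-08-05 13:26:24,319 [shell.Shell] INFO : "
    "Compaction of table page_content completed for given range"
)

waited = elapsed_seconds(started, finished)
print(f"shell blocked for {waited / 60:.1f} minutes")
```

For the quoted run this comes to roughly 57 minutes, which is the duration of
the compaction itself rather than a hang in the shell.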

On Wed, Aug 5, 2015 at 5:28 AM mohit.kaushik <mohit.kaushik@orkash.com>
wrote:

> After being stuck for a long time, the compaction completed for the table,
> but the question is still the same: why was the shell stuck on IO for so long?
>
> *2015-08-05 12:28:50,583 [Shell.audit] INFO : root@orkash page_content> compact -w
> 2015-08-05 12:28:50,586 [shell.Shell] INFO : Compacting table ...
> 2015-08-05 12:30:51,563 [impl.ThriftTransportPool] WARN : Thread "shell" stuck on IO  to orkash4:9999 (0) for at least 120031 ms
> 2015-08-05 13:26:24,301 [impl.ThriftTransportPool] INFO : Thread "shell" no longer stuck on IO  to orkash4:9999 (0) sawError = false
> 2015-08-05 13:26:24,319 [shell.Shell] INFO : Compaction of table page_content completed for given range*
>
>
> On 08/05/2015 12:20 PM, mohit.kaushik wrote:
>
> These errors are shown in the logs of the Hadoop namenode and slaves...
>
> *Namenode log*
>
> *2015-08-05 12:05:14,518 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 391508
> 2015-08-05 12:05:14,664 INFO BlockStateChange: BLOCK* ask 192.168.10.121:50010 to replicate blk_1073780327_39560 to datanode(s) 192.168.10.122:50010
> 2015-08-05 12:05:14,664 INFO BlockStateChange: BLOCK* ask 192.168.10.121:50010 to replicate blk_1073780379_39612 to datanode(s) 192.168.10.122:50010
> 2015-08-05 12:05:24,621 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 192.168.10.122:50010 is added to blk_1073782847_42080 size 134217728
> 2015-08-05 12:05:26,665 INFO BlockStateChange: BLOCK* ask 192.168.10.121:50010 to replicate blk_1073780611_39844 to datanode(s) 192.168.10.122:50010
> 2015-08-05 12:05:27,232 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 192.168.10.122:50010 is added to blk_1073793941_53178 size 134217728
> 2015-08-05 12:05:27,950 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 192.168.10.122:50010 is added to blk_1073783859_43092 size 134217728
> 2015-08-05 12:05:28,798 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 192.168.10.122:50010 is added to blk_1073793387_52620 size 22496
> 2015-08-05 12:05:29,666 INFO BlockStateChange: BLOCK* ask 192.168.10.123:50010 to replicate blk_1073780678_39911 to datanode(s) 192.168.10.121:50010
> 2015-08-05 12:05:29,666 INFO BlockStateChange: BLOCK* ask 192.168.10.121:50010 to replicate blk_1073780682_39915 to datanode(s) 192.168.10.122:50010
> 2015-08-05 12:05:32,002 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 192.168.10.122:50010 is added to blk_1073796582_55826{UCState=UNDER_CONSTRUCTION, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-896dada5-52c0-4a69-beed-dfbc5d437fc6:NORMAL:192.168.10.123:50010|RBW], ReplicaUC[[DISK]DS-dd6d6a25-122f-4958-a20b-4ccb82f49f11:NORMAL:192.168.10.121:50010|RBW], ReplicaUC[[DISK]DS-188489f9-89d3-40bd-9d20-9db358d644c9:NORMAL:192.168.10.122:50010|RBW]]} size 0
> 2015-08-05 12:05:32,072 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 192.168.10.121:50010 is added to blk_1073796582_55826{UCState=UNDER_CONSTRUCTION, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-896dada5-52c0-4a69-beed-dfbc5d437fc6:NORMAL:192.168.10.123:50010|RBW], ReplicaUC[[DISK]DS-dd6d6a25-122f-4958-a20b-4ccb82f49f11:NORMAL:192.168.10.121:50010|RBW], ReplicaUC[[DISK]DS-188489f9-89d3-40bd-9d20-9db358d644c9:NORMAL:192.168.10.122:50010|RBW]]} size 0
> 2015-08-05 12:05:32,129 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 192.168.10.123:50010 is added to blk_1073796582_55826{UCState=UNDER_CONSTRUCTION, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-896dada5-52c0-4a69-beed-dfbc5d437fc6:NORMAL:192.168.10.123:50010|RBW], ReplicaUC[[DISK]DS-dd6d6a25-122f-4958-a20b-4ccb82f49f11:NORMAL:192.168.10.121:50010|RBW], ReplicaUC[[DISK]DS-188489f9-89d3-40bd-9d20-9db358d644c9:NORMAL:192.168.10.122:50010|RBW]]} size 0*
> ... and more
>
> *Slave log (too many)*
>
> *k_1073794728_53972 on DS-896dada5-52c0-4a69-beed-dfbc5d437fc6, because the block scanner is disabled.
> 2015-08-05 11:50:30,438 INFO org.apache.hadoop.hdfs.server.datanode.BlockScanner: Not scanning suspicious block BP-2102462487-192.168.10.124-1436956492274:blk_1073794738_53982 on DS-896dada5-52c0-4a69-beed-dfbc5d437fc6, because the block scanner is disabled.
> 2015-08-05 11:50:31,024 INFO org.apache.hadoop.hdfs.server.datanode.BlockScanner: Not scanning suspicious block BP-2102462487-192.168.10.124-1436956492274:blk_1073794728_53972 on DS-896dada5-52c0-4a69-beed-dfbc5d437fc6, because the block scanner is disabled.
> 2015-08-05 11:50:31,027 INFO org.apache.hadoop.hdfs.server.datanode.BlockScanner: Not scanning suspicious block BP-2102462487-192.168.10.124-1436956492274:blk_1073794738_53982 on DS-896dada5-52c0-4a69-beed-dfbc5d437fc6, because the block scanner is disabled.
> 2015-08-05 11:50:31,095 INFO org.apache.hadoop.hdfs.server.datanode.BlockScanner: Not scanning suspicious block BP-2102462487-192.168.10.124-1436956492274:blk_1073794740_53984 on DS-896dada5-52c0-4a69-beed-dfbc5d437fc6, because the block scanner is disabled.
> 2015-08-05 11:50:31,105 INFO org.apache.hadoop.hdfs.server.datanode.BlockScanner: Not scanning suspicious block BP-2102462487-192.168.10.124-1436956492274:blk_1073794740_53984 on DS-896dada5-52c0-4a69-beed-dfbc5d437fc6, because the block scanner is disabled.
> 2015-08-05 11:50:31,136 INFO org.apache.hadoop.hdfs.server.datanode.BlockScanner: Not scanning suspicious block BP-2102462487-192.168.10.124-1436956492274:blk_1073794740_53984 on DS-896dada5-52c0-4a69-beed-dfbc5d437fc6, because the block scanner is disabled.
> 2015-08-05 11:50:31,136 INFO org.apache.hadoop.hdfs.server.datanode.BlockScanner: Not scanning suspicious block BP-2102462487-192.168.10.124-1436956492274:blk_1073794740_53984 on DS-896dada5-52c0-4a69-beed-dfbc5d437fc6, because the block scanner is disabled.*
>
>
> I am using locality groups, so I *need* to compact the tables. Please
> explain how I can get rid of the suspicious blocks.
>
> Thanks
>
> On 08/05/2015 10:53 AM, mohit.kaushik wrote:
>
> Yes, one of my datanodes was down because its disk was detached for some
> time, and the tserver on that node was lost, but it is up and running again.
>
> fsck shows that the file system is healthy, but there are many messages
> reporting under-replicated blocks: my replication factor is 3, yet it shows
> that the required number of replicas is 5.
>
> */user/root/.Trash/Current/accumulo/tables/+r/root_tablet/delete+A0000d29.rf+F0000d28.rf:
> Under replicated
> BP-2102462487-192.168.10.124-1436956492274:blk_1073796198_55442. Target
> Replicas is 5 but found 3 replica(s).*
>
> Thanks & Regards
> Mohit Kaushik
>
> On 08/04/2015 09:18 PM, John Vines wrote:
>
> It looks like an HDFS issue. Did a datanode go down? Did you turn
> replication down to 1? The combination of those two would definitely
> cause the problems you're seeing, as the latter disables any sort of
> robustness in the underlying filesystem.
>
> On Tue, Aug 4, 2015 at 8:10 AM mohit.kaushik <mohit.kaushik@orkash.com>
> wrote:
>
>> On 08/04/2015 05:35 PM, mohit.kaushik wrote:
>>
>> Hello All,
>>
>> I am using Apache Accumulo 1.6.3 with Apache Hadoop 2.7.0 on a 3-node
>> cluster. When I run the compact command from the shell, it gives the
>> following warning.
>>
>> root@orkash testScan> compact -w
>> 2015-08-04 17:10:52,702 [Shell.audit] INFO : root@orkash testScan>
>> compact -w
>> 2015-08-04 17:10:52,706 [shell.Shell] INFO : Compacting table ...
>> 2015-08-04 17:12:53,986 [impl.ThriftTransportPool] *WARN : Thread
>> "shell" stuck on IO  to orkash4:9999 (0) for at least 120034 ms*
>>
>>
>> The tablet servers show a problem with a data block, which looks something
>> like HDFS-8659 <https://issues.apache.org/jira/browse/HDFS-8659>
>>
>> *2015-08-04 15:00:27,825 [hdfs.DFSClient] WARN : Failed to connect to /192.168.10.121:50010 for block, add to deadNodes and continue. java.io.IOException: Got error, status message opReadBlock BP-2102462487-192.168.10.124-1436956492274:blk_1073780678_39911 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not found for BP-2102462487-192.168.10.124-1436956492274:blk_1073780678_39911, for OP_READ_BLOCK, self=/192.168.10.121:38752, remote=/192.168.10.121:50010, for file /accumulo/tables/h/t-000016s/F000016t.rf, for pool BP-2102462487-192.168.10.124-1436956492274 block 1073780678_39911*
>> *java.io.IOException: Got error, status message opReadBlock BP-2102462487-192.168.10.124-1436956492274:blk_1073780678_39911 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not found for BP-2102462487-192.168.10.124-1436956492274:blk_1073780678_39911, for OP_READ_BLOCK, self=/192.168.10.121:38752, remote=/192.168.10.121:50010, for file /accumulo/tables/h/t-000016s/F000016t.rf, for pool BP-2102462487-192.168.10.124-1436956492274 block 1073780678_39911*
>> *        at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:140)*
>> *        at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:456)*
>> *        at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:424)*
>> *        at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:814)*
>> *        at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:693)*
>> *        at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:352)*
>> *        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:618)*
>> *        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:844)*
>> *        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:896)*
>> *        at java.io.DataInputStream.read(DataInputStream.java:149)*
>> *        at org.apache.accumulo.core.file.rfile.bcfile.BoundedRangeFileInputStream$1.run(BoundedRangeFileInputStream.java:104)*
>> *        at org.apache.accumulo.core.file.rfile.bcfile.BoundedRangeFileInputStream$1.run(BoundedRangeFileInputStream.java:100)*
>> *        at java.security.AccessController.doPrivileged(Native Method)*
>> *        at org.apache.accumulo.core.file.rfile.bcfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:100)*
>> *        at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:159)*
>> *        at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:143)*
>> *        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)*
>> *        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)*
>> *        at java.io.BufferedInputStream.read(BufferedInputStream.java:254)*
>> *        at java.io.FilterInputStream.read(FilterInputStream.java:83)*
>> *        at java.io.DataInputStream.readInt(DataInputStream.java:387)*
>> *        at org.apache.accumulo.core.file.rfile.MultiLevelIndex$IndexBlock.readFields(MultiLevelIndex.java:269)*
>> *        at org.apache.accumulo.core.file.rfile.MultiLevelIndex$Reader.getIndexBlock(MultiLevelIndex.java:724)*
>> *        at org.apache.accumulo.core.file.rfile.MultiLevelIndex$Reader.access$100(MultiLevelIndex.java:497)*
>> *        at org.apache.accumulo.core.file.rfile.MultiLevelIndex$Reader$Node.getNext(MultiLevelIndex.java:587)*
>> *        at org.apache.accumulo.core.file.rfile.MultiLevelIndex$Reader$Node.getNextNode(MultiLevelIndex.java:593)*
>> *        at org.apache.accumulo.core.file.rfile.MultiLevelIndex$Reader$IndexIterator.getNextNode(MultiLevelIndex.java:616)*
>> *        at org.apache.accumulo.core.file.rfile.MultiLevelIndex$Reader$IndexIterator.next(MultiLevelIndex.java:659)*
>> *        at org.apache.accumulo.core.file.rfile.RFile$LocalityGroupReader._next(RFile.java:559)*
>>
>> Regards
>> Mohit Kaushik
>>
>>
>> And the compaction never completes.
>>
>>
>
> --
>
> * Mohit Kaushik*
> Software Engineer
> A Square,Plot No. 278, Udyog Vihar, Phase 2, Gurgaon 122016, India
> *Tel:* +91 (124) 4969352 | *Fax:* +91 (124) 4033553
>
