Subject: Fwd: Data authorization/visibility limit in Accumulo
From: Fikri Akbar
To: dev@accumulo.apache.org
Date: Fri, 8 Apr 2016 17:30:23 +0700

Hi Guys,

We're a group of Accumulo enthusiasts from Indonesia. We've been trying to implement Accumulo for several different types of data processing purposes, and we have several questions regarding Accumulo that you might be able to help us with. We ran into these issues while trying to process a heavy amount of data; our questions are as follows:

1. Let's say that I have a file in HDFS that's about 300 GB with a total of 1.6 billion rows, and each line is separated by "^". What is the most effective way to move that data into Accumulo, assuming the structure of each cell is [rowkey cf:cq vis value] => [lineNumber raw:columnName fileName columnValue]?
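To make the intended layout concrete, here is a rough sketch of the per-line mutation we have in mind, assuming "^" separates the columns within a line (the table name, column names, and the choice of a plain BatchWriter are only placeholders; whether a BatchWriter or a bulk import is the better option at this scale is part of what we're asking):

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.ColumnVisibility;

public class LineIngestSketch {
    // conn, tableName, lines and columnNames stand in for our own setup.
    static void ingest(Connector conn, String tableName, Iterable<String> lines,
                       String[] columnNames, String fileName) throws Exception {
        BatchWriter writer = conn.createBatchWriter(tableName, new BatchWriterConfig());
        long lineNumber = 0;
        for (String line : lines) {
            String[] fields = line.split("\\^");                    // columns separated by "^"
            Mutation m = new Mutation(Long.toString(lineNumber++)); // rowkey = line number
            for (int i = 0; i < fields.length && i < columnNames.length; i++) {
                // cf = "raw", cq = column name, visibility = file name, value = column value
                m.put("raw", columnNames[i], new ColumnVisibility(fileName),
                      new Value(fields[i].getBytes()));
            }
            writer.addMutation(m);
        }
        writer.close();
    }
}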
2. What is the most effective way to ingest data if we're receiving data with a size of more than 1 TB on a daily basis?

3. We're currently testing Accumulo's data-level access control, but we ran into an issue with the number of dataset authorizations once the datasets exceeded 20,000. For example, let's say user X has a dataset called one.txt. User X is then given an authorization for one.txt (let's call it X.one.txt). Now, if X has more datasets than that (one.txt, two.txt, three.txt, ..., n.txt), user X ends up with multiple authorizations (as many as there are datasets, i.e. n authorizations), and when we tried this with more than 20,000 datasets (so the user holds more than 20,000 authorizations), we were no longer able to execute "getauths". We think this is a very crucial issue, especially when (as in one of our cases) more than 20,000 datasets are granted authorization at once.
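For context, the part of our client that hits this (the AuxUser.setUserAuth path in the shell trace below) essentially just uses the standard SecurityOperations API; here is a simplified sketch, in which the way the label list is built and granted is only a placeholder for what our code does:

import java.util.ArrayList;
import java.util.List;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.security.Authorizations;

public class AuthSketch {
    // datasetNames stands in for the user's dataset list (one.txt, two.txt, ..., n.txt).
    static Authorizations refreshUserAuths(Connector conn, String principal,
                                           List<String> datasetNames) throws Exception {
        List<String> labels = new ArrayList<String>();
        for (String name : datasetNames) {
            labels.add(principal + "." + name);   // e.g. X.one.txt
        }
        // One authorization label per dataset; with >20,000 datasets this list gets very large.
        conn.securityOperations().changeUserAuthorizations(principal,
                new Authorizations(labels.toArray(new String[labels.size()])));
        // This is the call that fails with "Internal error processing getUserAuthorizations".
        return conn.securityOperations().getUserAuthorizations(principal);
    }
}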
The following are the error logs from our system.

*Error log in shell:*

org.apache.accumulo.core.client.AccumuloException: org.apache.thrift.TApplicationException: Internal error processing getUserAuthorizations
    at org.apache.accumulo.core.client.impl.SecurityOperationsImpl.execute(SecurityOperationsImpl.java:83)
    at org.apache.accumulo.core.client.impl.SecurityOperationsImpl.getUserAuthorizations(SecurityOperationsImpl.java:182)
    at com.msk.auxilium.table.AuxUser.setUserAuth(AuxUser.java:310)
    at com.msk.auxilium.commons.UserSystem.getAuxUser(UserSystem.java:24)
    at com.msk.auxilium.tester.HDFSTest.main(HDFSTest.java:57)
Caused by: org.apache.thrift.TApplicationException: Internal error processing getUserAuthorizations
    at org.apache.thrift.TApplicationException.read(TApplicationException.java:108)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71)
    at org.apache.accumulo.core.client.impl.thrift.ClientService$Client.recv_getUserAuthorizations(ClientService.java:580)
    at org.apache.accumulo.core.client.impl.thrift.ClientService$Client.getUserAuthorizations(ClientService.java:565)
    at org.apache.accumulo.core.client.impl.SecurityOperationsImpl$6.execute(SecurityOperationsImpl.java:185)
    at org.apache.accumulo.core.client.impl.SecurityOperationsImpl$6.execute(SecurityOperationsImpl.java:182)
    at org.apache.accumulo.core.client.impl.ServerClient.executeRaw(ServerClient.java:90)
    at org.apache.accumulo.core.client.impl.SecurityOperationsImpl.execute(SecurityOperationsImpl.java:69)
    ... 4 more

*Error log in accumulo master (web):*

tserver: Zookeeper error, will retry
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/281c3ac0-74eb-4135-bc63-3158eabe2c47/tables/1a/conf/table.split.threshold
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
    at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:210)
    at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:132)
    at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:235)
    at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:190)
    at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:130)
    at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:118)
    at org.apache.accumulo.core.conf.AccumuloConfiguration.getMemoryInBytes(AccumuloConfiguration.java:100)
    at org.apache.accumulo.tserver.Tablet.findSplitRow(Tablet.java:2892)
    at org.apache.accumulo.tserver.Tablet.needsSplit(Tablet.java:3032)
    at org.apache.accumulo.tserver.TabletServer$MajorCompactor.run(TabletServer.java:2603)
    at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
    at java.lang.Thread.run(Thread.java:745)

*Garbage collector:*

Zookeeper error, will retry
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/281c3ac0-74eb-4135-bc63-3158eabe2c47/tables
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
    at org.apache.accumulo.fate.zookeeper.ZooCache$1.run(ZooCache.java:169)
    at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:132)
    at org.apache.accumulo.fate.zookeeper.ZooCache.getChildren(ZooCache.java:180)
    at org.apache.accumulo.core.client.impl.Tables.getMap(Tables.java:126)
    at org.apache.accumulo.core.client.impl.Tables.getNameToIdMap(Tables.java:197)
    at org.apache.accumulo.core.client.impl.Tables._getTableId(Tables.java:173)
    at org.apache.accumulo.core.client.impl.Tables.getTableId(Tables.java:166)
    at org.apache.accumulo.core.client.impl.ConnectorImpl.getTableId(ConnectorImpl.java:84)
    at org.apache.accumulo.core.client.impl.ConnectorImpl.createScanner(ConnectorImpl.java:151)
    at org.apache.accumulo.gc.SimpleGarbageCollector$GCEnv.getCandidates(SimpleGarbageCollector.java:278)
    at org.apache.accumulo.gc.GarbageCollectionAlgorithm.getCandidates(GarbageCollectionAlgorithm.java:238)
    at org.apache.accumulo.gc.GarbageCollectionAlgorithm.collect(GarbageCollectionAlgorithm.java:272)
    at org.apache.accumulo.gc.SimpleGarbageCollector.run(SimpleGarbageCollector.java:544)
    at org.apache.accumulo.gc.SimpleGarbageCollector.main(SimpleGarbageCollector.java:154)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.accumulo.start.Main$1.run(Main.java:141)
    at java.lang.Thread.run(Thread.java:745)

We tried to find resources on this issue, but we couldn't find anything that mentions a limit on the number of authorizations per user. FYI, we're using Accumulo version 1.6.
Sorry for the long email :) and have a great day.

Regards,

*Fikri Akbar*
Technology

*PT Mediatrac Sistem Komunikasi*
Grha Tirtadi 2nd Floor | Jl. Senopati 71-73 | Jakarta 12110 | Indonesia | *Map* 6°13'57.37"S 106°48'42.29"E
*P* +62 21 520 2568 | *F* +62 21 520 4180 | *M* +62 812 1243 4786 | www.mediatrac.co.id