Date: Wed, 22 Nov 2017 08:14:00 +0000 (UTC)
From: "Don Bosco Durai (JIRA)"
To: dev@ranger.apache.org
Subject: [jira] [Commented] (RANGER-1837) Enhance Ranger Audit to HDFS to support ORC file format

[ https://issues.apache.org/jira/browse/RANGER-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16262125#comment-16262125 ]

Don Bosco Durai commented on RANGER-1837:
-----------------------------------------

bq. and send it to AsyncAuditQueue. This chunk becomes the batch size of the data that is going to the next point in this flow, in this case, AsyncAuditQueue

The issue I see is similar to a two-phase commit. If you have read a chunk and put it on AsyncAuditQueue, and the write to the destination then fails for whatever reason (e.g. the HDFS file close fails), we are back to square one. This problem is amplified when you have multiple destinations.

bq. I believe that you wanted to eliminate AsyncAuditQueue in this flow and send directly to HDFSDestination / SOLR destination via an AuditFileQueue.

Every destination has a queue associated with it. Most of them are BatchQueue, which uses a file spooler in case the destination is down or the incoming flow is faster than what the destination can handle.
My suggestion was, in the ORC/Parquet case, to replace the BatchQueue with the new AuditFileQueue, so that nothing is stored in memory and we can build a bigger buffer in the file. Then, at regular intervals, copy the file to HDFS directly or via the ORC library.

bq. If you are proposing this, then that is what I was mentioning about the refactoring / introducing a new pipeline to handle this scenario

I wouldn't call it refactoring, because it is part of the existing design. Yes, we will have to write a new queue, and some work and testing will be required.

bq. I have one more request which is related to the data flow rate to different destinations.

This is exactly the reason we need a queue per destination. Having the AuditFileCacheProvider up front limits our ability to support variable flow rates and also doesn't give us the expected reliability.

I did a quick review of AuditFileCacheProvider and AuditFileCacheProviderSpool. It seems you already have most of the code in AuditFileCacheProviderSpool, and AuditFileCacheProvider already extends BaseAuditHandler. So you could technically clone it into a FileAuditQueue that extends AuditQueue. The runLogAudit() method in AuditFileCacheProviderSpool can then call destination.logFile() (a new method in BaseAuditHandler). The default implementation would be what you already have in runLogAudit(), which reads each line and calls destination.log(), while any destination that can work at the file level overrides the method to take the file name/handle and do a bulk operation.
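To make the logFile() idea concrete, here is a minimal, self-contained sketch of the pattern described above. All class and method names (SketchAuditHandler, SketchHdfsJsonDestination, SketchOrcDestination) are hypothetical stand-ins, not the actual Ranger classes; the point is only the shape of the design: a default logFile() in the base handler that replays the spool file line by line through log(), which a columnar-format destination can override to consume the whole file in one bulk operation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for BaseAuditHandler with the proposed logFile() method.
abstract class SketchAuditHandler {
    // Default: read each line of the spool file and pass it to log().
    public boolean logFile(Path spoolFile) throws IOException {
        for (String event : Files.readAllLines(spoolFile)) {
            if (!log(event)) {
                return false; // destination failed; caller keeps the file for retry
            }
        }
        return true;
    }
    public abstract boolean log(String event);
}

// A line-oriented destination only needs to implement log().
class SketchHdfsJsonDestination extends SketchAuditHandler {
    final List<String> written = new ArrayList<>();
    @Override public boolean log(String event) { written.add(event); return true; }
}

// A columnar destination overrides logFile() to consume the whole file at once
// (real code would stream it into an ORC/Parquet writer here).
class SketchOrcDestination extends SketchAuditHandler {
    int bulkFiles = 0;
    @Override public boolean logFile(Path spoolFile) { bulkFiles++; return true; }
    @Override public boolean log(String event) { return true; }
}

public class FileAuditQueueSketch {
    public static void main(String[] args) throws IOException {
        // Simulate a spool file produced by the (hypothetical) FileAuditQueue.
        Path spool = Files.createTempFile("audit", ".log");
        Files.write(spool, Arrays.asList("{\"evt\":1}", "{\"evt\":2}"));

        SketchHdfsJsonDestination json = new SketchHdfsJsonDestination();
        json.logFile(spool); // default path: replayed event by event
        System.out.println("line events replayed: " + json.written.size());

        SketchOrcDestination orc = new SketchOrcDestination();
        orc.logFile(spool); // overridden path: one bulk file operation
        System.out.println("bulk files consumed: " + orc.bulkFiles);
    }
}
```

The failure-handling contract is what makes this safer than the AsyncAuditQueue path: because the spool file is only discarded after logFile() returns true, a failed destination write (e.g. an HDFS close failure) simply leaves the file in place for retry.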
> Enhance Ranger Audit to HDFS to support ORC file format
> -------------------------------------------------------
>
> Key: RANGER-1837
> URL: https://issues.apache.org/jira/browse/RANGER-1837
> Project: Ranger
> Issue Type: Improvement
> Components: audit
> Reporter: Kevin Risden
> Assignee: Ramesh Mani
> Attachments: 0001-RANGER-1837-Enhance-Ranger-Audit-to-HDFS-to-support-.patch, 0001-RANGER-1837-Enhance-Ranger-Audit-to-HDFS-to-support_001.patch, AuditDataFlow.png
>
> My team has done some research and found that Ranger HDFS audits are:
> * Stored as JSON objects (one per line)
> * Not compressed
> This is currently very verbose and would benefit from compression, since this data is not frequently accessed.
> From Bosco on the mailing list:
> {quote}You are right, currently one of the options is saving the audits in HDFS itself as JSON files in one folder per day. I have loaded these JSON files from the folder into Hive as compressed ORC format. The compressed files in ORC were less than 10% of the original size, so it was a significant decrease in size. Also, it is easier to run analytics on the Hive tables.
>
> So, there are a couple of ways of doing it:
>
> * Write an Oozie job which runs every night and loads the previous day's worth of audit logs into ORC or another format.
> * Write an AuditDestination which can write in the format you want.
>
> Regardless of which approach you take, this would be a good feature for Ranger.{quote}
> http://mail-archives.apache.org/mod_mbox/ranger-user/201710.mbox/%3CCAJU9nmiYzzUUX1uDEysLAcMti4iLmX7RE%3DmN2%3DdoLaaQf87njQ%40mail.gmail.com%3E

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)