ranger-dev mailing list archives

From "Ramesh Mani (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (RANGER-1837) Enhance Ranger Audit to HDFS to support ORC file format
Date Tue, 21 Nov 2017 20:57:00 GMT

[ https://issues.apache.org/jira/browse/RANGER-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16261466#comment-16261466 ]

Ramesh Mani edited comment on RANGER-1837 at 11/21/17 8:56 PM:
---------------------------------------------------------------

Current implementation is:

# AuditFileCacheProvider -> This receives the audit log and stores it to the local filesystem via AuditFileCacheProviderSpooler.
# AuditFileCacheProviderSpooler has a thread that reads the local audit files in chunks (configured via the param "xasecure.audit.provider.filecache.filespool.buffer.size") and sends them to AsyncAuditQueue. This chunk becomes the batch size of the data going to the next point in this flow, in this case AsyncAuditQueue.
# AsyncAuditQueue -> This is the existing component, which uses an AuditBatchQueue for each configured destination. Here the queue size is one buffer, configurable via "xasecure.audit.destination.<destination>.batch.batch.size" (<destination> = hdfs/solr/etc.). I used the existing AsyncAuditQueue so that, in case of failures at the destination, it can back up with its own spooling and forwarding mechanism.
# Finally, HDFSAuditDestination has a WRITER, which can write JSON or ORC files. When the writer is ORCWriter, it has a buffer size which determines the batch size of each ORC file created in HDFS or another destination.

So configuring these buffers determines the batch size when the ORC files are created.
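The buffer settings described above can be collected into a properties fragment. This is a sketch only: the property names are the ones quoted in this comment, and the values are illustrative assumptions, not recommended defaults.

```properties
# Chunk size the spooler thread reads from the local audit file;
# this becomes the batch size flowing into AsyncAuditQueue.
xasecure.audit.provider.filecache.filespool.buffer.size=1000

# Per-destination batch size (<destination> = hdfs, solr, etc.); when the
# writer is ORCWriter, this bounds the batch written into each ORC file.
xasecure.audit.destination.hdfs.batch.batch.size=1000
```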


I believe you want to eliminate AsyncAuditQueue from this flow and send directly to the HDFS / SOLR destination via an AuditFileQueue. If you are proposing this, then that is the refactoring / new pipeline I was mentioning to handle this scenario. Please correct me if I am wrong about this.

I have one more request, related to the data flow rate to different destinations. Currently, if we store the data locally and forward it, all destinations get the data at the same rate. Suppose the AuditFileCacheProvider file rollover time is 1 hr; each destination will then get the data after 1 hr. Some may want the SOLR destination to have the data more quickly than HDFS/S3. In that case we need to keep the existing pipeline for one or more destinations, and use store-and-forward for the other destinations. So this also needs refactoring, to introduce a mechanism to pick queues for each destination or group of destinations. Please let me know about this.
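The mechanism requested above might look like the following properties sketch. These property names are hypothetical, invented purely for illustration; none of them exist in Ranger today.

```properties
# HYPOTHETICAL properties -- not part of Ranger; shown only to illustrate
# picking a queue per destination or group of destinations.

# SOLR keeps the existing low-latency async pipeline:
xasecure.audit.destination.solr.queue=async

# HDFS/S3 use store-and-forward via the file cache, rolled over hourly,
# so they receive data at the rollover interval instead:
xasecure.audit.destination.hdfs.queue=filecache
xasecure.audit.destination.hdfs.queue.filecache.rollover.sec=3600
```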

!AuditDataFlow.jpg|thumbnail!




> Enhance Ranger Audit to HDFS to support ORC file format
> -------------------------------------------------------
>
>                 Key: RANGER-1837
>                 URL: https://issues.apache.org/jira/browse/RANGER-1837
>             Project: Ranger
>          Issue Type: Improvement
>          Components: audit
>            Reporter: Kevin Risden
>            Assignee: Ramesh Mani
>         Attachments: 0001-RANGER-1837-Enhance-Ranger-Audit-to-HDFS-to-support-.patch,
0001-RANGER-1837-Enhance-Ranger-Audit-to-HDFS-to-support_001.patch, AuditDataFlow.png
>
>
> My team has done some research and found that Ranger HDFS audits are:
> * Stored as JSON objects (one per line)
> * Not compressed
> This is currently very verbose and would benefit from compression since this data is
not frequently accessed. 
> From Bosco on the mailing list:
> {quote}You are right, currently one of the options is saving the audits in HDFS itself as JSON files in one folder per day. I have loaded these JSON files from the folder into Hive as compressed ORC format. The compressed files in ORC were less than 10% of the original size. So, it was a significant decrease in size. Also, it is easier to run analytics on the Hive tables.
>  
> So, there are a couple of ways of doing it.
>  
> Write an Oozie job which runs every night and loads the previous day's worth of audit logs into ORC or another format.
> Write an AuditDestination which can write into the format you want.
>  
> Regardless which approach you take, this would be a good feature for Ranger.{quote}
> http://mail-archives.apache.org/mod_mbox/ranger-user/201710.mbox/%3CCAJU9nmiYzzUUX1uDEysLAcMti4iLmX7RE%3DmN2%3DdoLaaQf87njQ%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
