ranger-dev mailing list archives

From "Ramesh Mani (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (RANGER-1837) Enhance Ranger Audit to HDFS to support ORC file format
Date Fri, 17 Nov 2017 23:07:00 GMT

    [ https://issues.apache.org/jira/browse/RANGER-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257622#comment-16257622
] 

Ramesh Mani edited comment on RANGER-1837 at 11/17/17 11:06 PM:
----------------------------------------------------------------

[~bosco][~risdenk][~abhayk] I have attached patch #2 on this after working on the comments;
sorry for the delay, as I was busy with other commitments.
[~bosco]
Regarding the review question about not having the buffer: the current implementation
uses the existing AuditQueue as the data pipe into the destination. Avoiding this would
require one more major refactoring of the audit framework:
1) a new Ranger Audit Pipeline that has no buffers/queues and can support multiple
destinations. It should be able to handle the batches received from the sources.
2) this new Ranger Audit Pipeline should support variable destination data flow rates, i.e.
audit to the Solr destination should be immediate (basically no store-and-forward), whereas
audit to HDFS can run at a different rate based on the batch size, format, etc.
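The proposed pipeline could be sketched roughly as follows. This is a hypothetical illustration of the idea only; none of these class or method names exist in the current Ranger audit framework:

```java
import java.util.List;

// Hypothetical sketch of the proposed buffer-less pipeline; the names below
// are illustrative and are not part of the current Ranger audit framework.
interface AuditDestination {
    void write(List<String> batch);
}

interface AuditPipeline {
    void deliver(List<String> batch);
}

// Batches from a source are handed straight to every destination, with no
// intermediate queue or spool file; each destination consumes at its own rate.
class ImmediatePipeline implements AuditPipeline {
    private final List<AuditDestination> destinations;

    ImmediatePipeline(List<AuditDestination> destinations) {
        this.destinations = destinations;
    }

    @Override
    public void deliver(List<String> batch) {
        for (AuditDestination d : destinations) {
            // e.g. Solr would write immediately, HDFS/ORC at its own pace
            d.write(batch);
        }
    }
}
```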

The current framework provides 3 buffer sizes:
1) xasecure.audit.provider.filecache.filespool.buffer.size=
    This determines the batch size read from the local spool file and sent to the audit
queue. The default is 1000 lines, which we need to increase for the ORC file format. This is
the batch size of the ORC file to be created, so it has to be configured according to the
file spool size, which is determined by the file spool rollover time.
2) xasecure.audit.destination.hdfs.batch.batch.size=
   This is the audit queue batch size in this pipeline; this many lines are read from the
queue and sent to the destination. The default is 1000 lines.
3) xasecure.audit.destination.hdfs.orc.buffersize=
  This is the ORCWriter buffer size, which holds the data before it is written. It changes
dynamically based on the audit batch size coming from the source.
I tested it with 1 hour of data for the HDFS plugin, and setting all three to 10000 was fine.
It didn't create multiple files for that volume, but this depends on the amount of HDFS
activity. I still need to check with the Kafka plugin.
Please let me know.
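For reference, using the property names above, a configuration matching my test setup would look like the fragment below (the 10000 values are what I tested with, not recommended defaults):

```properties
# Batch read from the local spool file into the audit queue (default 1000 lines)
xasecure.audit.provider.filecache.filespool.buffer.size=10000
# Batch read from the audit queue and sent to the HDFS destination (default 1000 lines)
xasecure.audit.destination.hdfs.batch.batch.size=10000
# ORCWriter buffer that holds records before the ORC file is written
xasecure.audit.destination.hdfs.orc.buffersize=10000
```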



> Enhance Ranger Audit to HDFS to support ORC file format
> -------------------------------------------------------
>
>                 Key: RANGER-1837
>                 URL: https://issues.apache.org/jira/browse/RANGER-1837
>             Project: Ranger
>          Issue Type: Improvement
>          Components: audit
>            Reporter: Kevin Risden
>            Assignee: Ramesh Mani
>         Attachments: 0001-RANGER-1837-Enhance-Ranger-Audit-to-HDFS-to-support-.patch,
0001-RANGER-1837-Enhance-Ranger-Audit-to-HDFS-to-support_001.patch
>
>
> My team has done some research and found that Ranger HDFS audits are:
> * Stored as JSON objects (one per line)
> * Not compressed
> This is currently very verbose and would benefit from compression since this data is
not frequently accessed. 
> From Bosco on the mailing list:
> {quote}You are right, currently one of the options is saving the audits in HDFS itself
as JSON files in one folder per day. I have loaded these JSON files from the folder into Hive
as compressed ORC format. The compressed files in ORC were less than 10% of the original size.
So, it was significant decrease in size. Also, it is easier to run analytics on the Hive tables.
>  
> So, there are couple of ways of doing it.
>  
> Write an Oozie job which runs every night and loads the previous day worth audit logs
into ORC or other format
> Write an AuditDestination which can write into the format you want to.
>  
> Regardless which approach you take, this would be a good feature for Ranger.{quote}
> http://mail-archives.apache.org/mod_mbox/ranger-user/201710.mbox/%3CCAJU9nmiYzzUUX1uDEysLAcMti4iLmX7RE%3DmN2%3DdoLaaQf87njQ%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
