ranger-dev mailing list archives

From "Ramesh Mani (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (RANGER-1837) HDFS Audit Compression
Date Fri, 27 Oct 2017 20:29:00 GMT

    https://issues.apache.org/jira/browse/RANGER-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16222753#comment-16222753

Ramesh Mani commented on RANGER-1837:

[~bosco] [~risdenk] [~madhan.neethiraj], I have completed initial work on an HdfsAuditDestination
with ORC as the file format.
The approach I took:
1) Create a new audit destination, HDFSAuditDestinationORC, with a parameter to trigger it.
2) Use the AuditFileCacheProvider, so that audits are stored locally before being pushed
to their respective destinations (refer to https://issues.apache.org/jira/browse/RANGER-1310).
This is needed to build a batch for the ORC file; otherwise each ORC file would hold only
a small set of records, limited by the AuditQueue buffer size.
3) A batch for ORC is the set of records in each local file, and the batch size is determined
by the rollover time we specify for these local files in the AuditFileCacheProvider.
4) Each ORC file is written to HDFS, flushed, and closed immediately.
5) We don't have a streaming API into an ORC file unless Hive APIs are used; in that case
a Hive table for the audits has to exist, and writing is done via Hive JDBC calls. (That
may be another option we can consider.)
6) The compression technique for ORC files can be configured via the parameter "xasecure.audit.destination.hdfs.orc.compression=SNAPPY|ZLIB|NONE".
7) Buffer size and stripe size, which are ORC file configurations, can be tuned based on
need.
8) A Hive external table can be created on the location where the ORC files are; the ORC
file directory structure is the same as the one currently used for HDFS audit files.
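For reference, the configuration for this destination might look like the sketch below. Only the compression property is described in this patch; the buffer/stripe property names and the values shown are hypothetical placeholders:
{code}
# Compression codec for ORC audit files: SNAPPY, ZLIB, or NONE
xasecure.audit.destination.hdfs.orc.compression=SNAPPY
# Hypothetical property names for ORC buffer/stripe sizes -- actual names may differ
xasecure.audit.destination.hdfs.orc.buffersize=262144
xasecure.audit.destination.hdfs.orc.stripesize=67108864
{code}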

Please provide your comments on this. Thanks.
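As a sketch of the external-table idea above (the column list and HDFS location here are illustrative assumptions, not the final Ranger audit schema or path):
{code:sql}
-- Illustrative only: columns and location are assumptions,
-- not the final Ranger audit layout.
CREATE EXTERNAL TABLE ranger_audit_orc (
  evtTime   STRING,
  reqUser   STRING,
  resource  STRING,
  access    STRING,
  result    INT
)
STORED AS ORC
LOCATION '/ranger/audit/hdfs';
{code}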

> HDFS Audit Compression
> ----------------------
>                 Key: RANGER-1837
>                 URL: https://issues.apache.org/jira/browse/RANGER-1837
>             Project: Ranger
>          Issue Type: Improvement
>          Components: audit
>            Reporter: Kevin Risden
> My team has done some research and found that Ranger HDFS audits are:
> * Stored as JSON objects (one per line)
> * Not compressed
> This is currently very verbose and would benefit from compression since this data is
not frequently accessed. 
> From Bosco on the mailing list:
> {quote}You are right, currently one of the options is saving the audits in HDFS itself
as JSON files in one folder per day. I have loaded these JSON files from the folder into Hive
as compressed ORC format. The compressed files in ORC were less than 10% of the original size.
So, it was significant decrease in size. Also, it is easier to run analytics on the Hive tables.
> So, there are a couple of ways of doing it:
> Write an Oozie job which runs every night and loads the previous day's worth of audit
logs into ORC or another format.
> Write an AuditDestination which can write into the format you want.
> Regardless which approach you take, this would be a good feature for Ranger.{quote}
> http://mail-archives.apache.org/mod_mbox/ranger-user/201710.mbox/%3CCAJU9nmiYzzUUX1uDEysLAcMti4iLmX7RE%3DmN2%3DdoLaaQf87njQ%40mail.gmail.com%3E

This message was sent by Atlassian JIRA
