hadoop-hdfs-user mailing list archives

From Shahab Yunus <shahab.yu...@gmail.com>
Subject Re: Data cleansing in modern data architecture
Date Sun, 20 Jul 2014 21:20:04 GMT
I am assuming you meant the batch jobs that are/were used in old world for
data cleansing.

As far as I understand, there is no hard and fast rule for it; it depends
on the functional and system requirements of the use case.

It also depends on the technology being used and how it manages stored data.
E.g. in HBase or Cassandra, you can write batch jobs which clean, correct,
or remove unwanted or incorrect data, and then the underlying stores usually
have a concept of compaction, which not only defragments data files but also,
at that point, physically removes from disk all the entries marked as deleted.
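As a toy illustration of that idea (a simplified sketch of delete markers and compaction in an append-only store, not the actual HBase or Cassandra internals):

```python
# Toy model of an append-only store with delete markers ("tombstones").
# Simplified sketch only -- not how HBase/Cassandra really work internally.

class AppendOnlyStore:
    def __init__(self):
        self.segments = []  # simulated on-disk entries: (key, value, deleted)

    def put(self, key, value):
        self.segments.append((key, value, False))

    def delete(self, key):
        # A delete only appends a tombstone; older entries stay on disk.
        self.segments.append((key, None, True))

    def get(self, key):
        # The newest entry for a key wins; a tombstone hides older values.
        for k, v, deleted in reversed(self.segments):
            if k == key:
                return None if deleted else v
        return None

    def compact(self):
        # Compaction rewrites the data, keeping only the newest entry per
        # key and physically dropping tombstoned keys.
        latest = {}
        for entry in self.segments:
            latest[entry[0]] = entry
        self.segments = [e for e in latest.values() if not e[2]]

store = AppendOnlyStore()
store.put("row1", "bad data")
store.delete("row1")        # logically gone...
len(store.segments)         # 2 -> ...but both entries still occupy disk
store.compact()
len(store.segments)         # 0 -> physically removed only at compaction
```

This also shows why heavy tombstone buildup can hurt reads: until compaction runs, `get` may have to scan past many dead entries.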

But there are considerations to be aware of, given that compaction is a
heavy process, and in some cases (e.g. Cassandra) there can be problems when
there is too much data to be removed. Not only that: in some cases,
marked-to-be-deleted data, until it is actually deleted/compacted, can slow
down normal operations of the data store as well.

One can also leverage, in HBase's case, the versioning mechanism: the
afore-mentioned batch job can simply overwrite the same row key, and the
previous version would no longer be the latest. If the max-versions
parameter is configured as 1, then no previous version would be maintained
(physically it would be, but it would be removed at compaction time and
would not be visible to reads in the meantime).
In the end, cleansing can basically be done either after or before loading,
but given the append-only and no-hard-delete design approach of most NoSQL
stores, I would say it is easier to do the cleaning before the data is
loaded into the NoSQL store. Of course, it bears repeating that it depends
on the use case.
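A pre-load cleansing step can be as simple as a validate/normalize pass over the incoming records before anything is written to the store. The record fields below ("id", "amount") are hypothetical, just to show the shape of the idea:

```python
# Sketch of cleansing *before* load: validate and normalize records, then
# write only the clean ones downstream. Field names are hypothetical.

def cleanse(records):
    clean = []
    for r in records:
        if not r.get("id"):            # drop junk rows from the source system
            continue
        try:
            amount = float(r.get("amount", 0))
        except (TypeError, ValueError): # unparseable numeric field -> reject
            continue
        clean.append({"id": r["id"].strip(), "amount": amount})
    return clean

raw = [
    {"id": " 42 ", "amount": "19.99"},
    {"id": "", "amount": "7.00"},    # missing key -> rejected
    {"id": "43", "amount": "oops"},  # bad amount -> rejected
]
cleanse(raw)  # [{'id': '42', 'amount': 19.99}]
```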

Having said that, on a side-note and a bit off-topic, it reminds me of the
Lambda Architecture, which combines batch and real-time computation for big
data using various technologies. It uses the idea of constant periodic
refreshes to reload the data, and within each periodic refresh the
expectation is that any invalid older data will be corrected and
overwritten by the new refresh load. Thus the 'batch part' of the LA
basically takes care of data cleansing by reloading everything. But the LA
is mostly for those systems which are OK with eventually consistent
behavior, and it might not be suitable for some systems.


On Sun, Jul 20, 2014 at 2:36 PM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   In the old world, data cleaning used to be a large part of the data
> warehouse load. Now that we’re working in a schemaless environment, I’m not
> sure where data cleansing is supposed to take place. NoSQL sounds fun
> because theoretically you just drop everything in but transactional systems
> that generate the data are still full of bugs and create junk data.
> My question is, where does data cleaning/master data management/CDI belong
> in a modern data architecture? Before it hits Hadoop? After?
> B.
