hadoop-hdfs-user mailing list archives

From "Adaryl \"Bob\" Wakefield, MBA" <adaryl.wakefi...@hotmail.com>
Subject Re: Data cleansing in modern data architecture
Date Sun, 10 Aug 2014 08:32:13 GMT
It’s a lot of theory right now, so let me give you the full background and see if we can
refine the answer further.

I’ve had a lot of clients with data warehouses that just weren’t functional for various
reasons. I’m researching Hadoop to try and figure out a way to totally eliminate traditional
data warehouses. I know all the arguments for keeping them around and I’m not impressed
with any of them. I’ve noticed for a while that traditional data storage methods just aren’t
up to the task for the things we’re asking data to do these days.

I’ve got MOST of it figured out. I know how to store and deliver analytics using all the
various tools within the Apache project (and some NOT in the Apache project). What I haven’t
figured out is how to do data cleansing or master data management, both of which are hard to
do if you can’t change anything.

So let’s say there is a transactional system. It’s a web application that is the business’s
main source of revenue. All the activity of the user on the website is easily structured (so
basically we’re not dealing with unstructured data). The nature of the data is financial.

The pipeline is fairly straightforward. The data is extracted from the transactional system
and placed into a Hadoop environment. From there, it’s exposed by Hive so non-technical
business analysts with SQL skills can do what they need to do. Pretty typical, right?

The problem is the web app is not perfect and occasionally produces junk data. Nothing obvious.
It may be a few days before the error is noticed. An example would be phantom invoices. Those
invoices get in Hadoop. A few days later an analyst notices that the invoice figures for some
period are inflated. 

Once we identify the offending records there is NO reason for them to remain in the system;
it’s meaningless junk data. Those records are of zero value. I encounter this scenario in
the real world quite often. In the old world, we would just blow away the offending records.
Just writing a view to skip over a couple of records or exclude a few dozen doesn’t make much
sense. It’s better to just blow these records away; I’m just not certain what the best
way to accomplish that is in the new world.

Adaryl "Bob" Wakefield, MBA
Mass Street Analytics
Twitter: @BobLovesData

From: Sriram Ramachandrasekaran 
Sent: Saturday, August 09, 2014 11:55 PM
To: user@hadoop.apache.org 
Subject: Re: Data cleansing in modern data architecture

While I may not have enough context on your entire processing pipeline, here are my thoughts.

1. It's always useful to have raw data, irrespective of whether it was right or wrong. The way
to look at it is that it's the source of truth at timestamp t.
2. Note that you only know that the data at timestamp t for an id X was wrong because subsequent
info about X seems to conflict with the info at t, or because some manual debugging finds it out.

Any system that does reporting/analytics is better off not meddling with the raw data.
There should be processed or computed views of this data that massage it, get rid of noisy
data, merge duplicate entries, etc., and then finally produce an output that's suitable for
your reports/analytics. So, your idea to write transaction logs to HDFS is fine (unless you
are twisting your systems to get it that way), but you just need to introduce one more layer
of indirection, which has the business logic to handle noise/errors like this.

For your specific case, you could have a transaction-processor job that produces a view. It
takes care of squashing transactions based on id (or something that makes sense in your system)
and then applies the business logic for handling the bugs/discrepancies in them. Your
views could be loaded into a nice columnar store for faster query retrieval (if you have pointed
queries based on a key); otherwise, a different store would be needed. Yes, this has the overhead
of running the view creation job, but I think the ability to go back to the raw data and investigate
what happened there is worth it.
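
As a rough illustration of such a view job, here is a minimal Python sketch. Everything in it is an assumption made for the example: the record layout (id, ts, amount), the sample invoices, and the name build_view are hypothetical and not taken from any real pipeline. The raw log is never modified; the view drops ids that were flagged as bad and keeps only the latest record per id.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical raw transaction log: append-only, never edited in place.
RAW_INVOICES: List[dict] = [
    {"id": "inv-1", "ts": 1, "amount": 100.0},
    {"id": "inv-1", "ts": 2, "amount": 120.0},  # later correction of inv-1
    {"id": "inv-2", "ts": 1, "amount": 999.0},  # phantom invoice found later
    {"id": "inv-3", "ts": 3, "amount": 50.0},
]


def build_view(raw: Iterable[dict], bad_ids: Set[str]) -> Dict[str, dict]:
    """Squash records by id (latest timestamp wins) and skip known-bad ids."""
    view: Dict[str, dict] = {}
    for rec in raw:
        if rec["id"] in bad_ids:
            continue  # business rule: flagged records never reach the view
        current = view.get(rec["id"])
        if current is None or rec["ts"] > current["ts"]:
            view[rec["id"]] = rec
    return view


# Re-run the job whenever the list of flagged ids changes.
view = build_view(RAW_INVOICES, bad_ids={"inv-2"})
total = sum(rec["amount"] for rec in view.values())
print(sorted(view), total)
```

The flagged-id set is the only thing that changes when a new bug is found; the raw data stays available for investigation.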

Your approach of structuring it and storing it in HBase is also fine as long as you keep the
concerns separate(if your write/read workloads are poles apart).

Hope this helps.

On Sun, Aug 10, 2014 at 9:36 AM, Adaryl "Bob" Wakefield, MBA <adaryl.wakefield@hotmail.com>

  Or...as an alternative, since HBase uses HDFS to store its data, can we get around the
no-editing-files rule by dropping structured data into HBase? That way, we have data in HDFS
that can be deleted. Any real problem with that idea?

  Adaryl "Bob" Wakefield, MBA
  Mass Street Analytics
  Twitter: @BobLovesData

  From: Adaryl "Bob" Wakefield, MBA 
  Sent: Saturday, August 09, 2014 8:55 PM
  To: user@hadoop.apache.org 
  Subject: Re: Data cleansing in modern data architecture

  Answer: No, we can’t get rid of bad records. We have to go back and rebuild the entire
file. We can’t edit records, but we can get rid of entire files, right? This would suggest
that appending data to files isn’t that great of an idea. It sounds like it would be more
appropriate to cut a Hadoop data load up into periodic files (days, months, etc.) that can
easily be rebuilt should errors occur....
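
  To make the periodic-file idea concrete, here is a small Python sketch. It uses the local filesystem as a stand-in for HDFS, and the file naming scheme (invoices_<day>.jsonl), the record fields, and the helper names write_partition and rebuild_partition are all hypothetical. The point is only that a bad record is removed by rewriting the one file that holds it, not by editing the file in place.

```python
import json
import os
import tempfile
from typing import Iterable, Set


def write_partition(root: str, day: str, records: Iterable[dict]) -> str:
    """Write one day's records to its own file, replacing any earlier copy."""
    path = os.path.join(root, f"invoices_{day}.jsonl")
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        for rec in records:
            handle.write(json.dumps(rec) + "\n")
    os.replace(tmp_path, path)  # swap in the whole file; no in-place edit
    return path


def rebuild_partition(root: str, day: str, bad_ids: Set[str]) -> int:
    """Rewrite a single day's file without the flagged records."""
    path = os.path.join(root, f"invoices_{day}.jsonl")
    with open(path, encoding="utf-8") as handle:
        kept = [rec for rec in map(json.loads, handle) if rec["id"] not in bad_ids]
    write_partition(root, day, kept)
    return len(kept)


root = tempfile.mkdtemp()
write_partition(root, "2014-08-08", [{"id": "a", "amount": 10},
                                     {"id": "ghost", "amount": 999}])
write_partition(root, "2014-08-09", [{"id": "b", "amount": 20}])

# Only the affected day is rebuilt; the other file is left untouched.
remaining = rebuild_partition(root, "2014-08-08", bad_ids={"ghost"})
print(remaining)
```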

  Adaryl "Bob" Wakefield, MBA
  Mass Street Analytics
  Twitter: @BobLovesData

  From: Adaryl "Bob" Wakefield, MBA 
  Sent: Saturday, August 09, 2014 4:01 AM
  To: user@hadoop.apache.org 
  Subject: Re: Data cleansing in modern data architecture

  I’m sorry but I have to revisit this again. Going through the reply below I realized that
I didn’t quite get my question answered. Let me be more explicit with the scenario.

  There is a bug in the transactional system.
  The data gets written to HDFS where it winds up in Hive.
  Somebody notices that their report is off/the numbers don’t look right.
  We investigate and find the bug in the transactional system.

  Question: Can we then go back into HDFS and rid ourselves of the bad records? If not, what
is the recommended course of action?

  Adaryl "Bob" Wakefield, MBA
  Mass Street Analytics

  From: Shahab Yunus 
  Sent: Sunday, July 20, 2014 4:20 PM
  To: user@hadoop.apache.org 
  Subject: Re: Data cleansing in modern data architecture

  I am assuming you meant the batch jobs that are/were used in old world for data cleansing.

  As far as I understand, there is no hard and fast rule for it; it depends on the functional and
system requirements of the use case.

  It is also dependent on the technology being used and how it manages 'deletion'.

  E.g. in HBase or Cassandra, you can write batch jobs which clean, correct, or remove unwanted
or incorrect data, and then the underlying stores usually have a concept of compaction, which
not only defragments data files but also, at that point, removes from disk all the entries marked
as deleted.

  But there are considerations to be aware of, given that compaction is a heavy process and
in some cases (e.g. Cassandra) there can be problems when there is too much data to be removed.
Not only that, in some cases marked-to-be-deleted data can slow down normal operations of
the data store until it is actually deleted/compacted.

  In HBase's case, one can also leverage the versioning mechanism: the aforementioned batch
job can simply overwrite the same row key, and the previous version would no longer be the
latest. If the max-version parameter is configured as 1, then no previous version would be maintained
(physically it would be, and it would be removed at compaction time, but it would not be queryable).
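
  The behaviour described above (overwrites and delete markers that are hidden from reads right away but only physically removed at compaction) can be modelled with a toy in-memory store. This is a sketch for intuition only: ToyVersionedStore and all of its methods are hypothetical names, and it is not the HBase or Cassandra client API.

```python
from typing import Dict, List, Optional, Tuple

TOMBSTONE = object()  # marker written in place of a value on delete


class ToyVersionedStore:
    """Toy append-only key/value model; NOT a real HBase or Cassandra API."""

    def __init__(self, max_versions: int = 1) -> None:
        if max_versions < 1:
            raise ValueError("max_versions must be at least 1")
        self.max_versions = max_versions
        self._cells: Dict[str, List[Tuple[int, object]]] = {}
        self._clock = 0  # stand-in for a write timestamp

    def _append(self, key: str, value: object) -> None:
        self._clock += 1
        self._cells.setdefault(key, []).append((self._clock, value))

    def put(self, key: str, value: object) -> None:
        self._append(key, value)  # an overwrite is just a newer version

    def delete(self, key: str) -> None:
        self._append(key, TOMBSTONE)  # mark as deleted, nothing removed yet

    def get(self, key: str) -> Optional[object]:
        versions = self._cells.get(key)
        if not versions:
            return None
        latest = versions[-1][1]
        return None if latest is TOMBSTONE else latest

    def physical_entries(self) -> int:
        return sum(len(versions) for versions in self._cells.values())

    def compact(self) -> None:
        """Drop tombstoned data and versions beyond max_versions."""
        compacted: Dict[str, List[Tuple[int, object]]] = {}
        for key, versions in self._cells.items():
            live: List[Tuple[int, object]] = []
            for stamp, value in versions:
                if value is TOMBSTONE:
                    live = []  # everything written before the marker is dead
                else:
                    live.append((stamp, value))
            if live:
                compacted[key] = live[-self.max_versions:]
        self._cells = compacted


store = ToyVersionedStore(max_versions=1)
store.put("inv-1", 100.0)
store.put("inv-1", 120.0)   # correction overwrites the same row key
store.put("inv-2", 999.0)
store.delete("inv-2")       # phantom invoice is marked, not yet removed
before = store.physical_entries()
store.compact()
after = store.physical_entries()
print(before, after, store.get("inv-1"), store.get("inv-2"))
```

  Reads see the corrected state immediately, while the entry count only shrinks once compact() runs, which mirrors why pending deletes can cost space and read time until compaction.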

  In the end, cleansing can basically be done before or after loading, but given the append-only
and no-hard-delete design approaches of most NoSQL stores, I would say it would be easier
to do cleaning before data is loaded into the NoSQL store. Of course, it bears repeating that
it depends on the use case.

  Having said that, on a side note and a bit off-topic, it reminds me of the Lambda Architecture,
which combines batch and real-time computation for big data using various technologies. It
uses the idea of constant periodic refreshes to reload the data, and within each periodic
refresh the expectation is that any invalid older data would be corrected and overwritten
by the new refresh load. Thus, the 'batch part' of the LA basically takes care of data cleansing
by reloading everything. But the LA is mostly for those systems which are OK with eventually
consistent behavior, and it might not be suitable for some systems.
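
  A very small Python sketch of that refresh idea follows. It is a simplification under stated assumptions: the record fields (day, amount), the sample loads, and the function names recompute_batch_view and query are hypothetical, and the real-time layer is reduced to a plain dictionary of recent increments.

```python
from collections import Counter
from typing import Dict, Iterable


def recompute_batch_view(load: Iterable[dict]) -> Counter:
    """Rebuild the batch view from scratch from a full load."""
    totals: Counter = Counter()
    for rec in load:
        totals[rec["day"]] += rec["amount"]
    return totals


def query(batch_view: Dict[str, int], realtime_view: Dict[str, int], day: str) -> int:
    """Answer a query by merging the batch view with recent increments."""
    return batch_view.get(day, 0) + realtime_view.get(day, 0)


# Hypothetical loads: the earlier one still contains a phantom invoice.
stale_load = [{"day": "d1", "amount": 100}, {"day": "d1", "amount": 999}]
fresh_load = [{"day": "d1", "amount": 100}]
realtime = {"d1": 5}  # activity that arrived after the last batch run

before = query(recompute_batch_view(stale_load), realtime, "d1")
after = query(recompute_batch_view(fresh_load), realtime, "d1")
print(before, after)
```

  Nothing is deleted record by record here; the next full recompute simply replaces the view that contained the bad data.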


  On Sun, Jul 20, 2014 at 2:36 PM, Adaryl "Bob" Wakefield, MBA <adaryl.wakefield@hotmail.com>

    In the old world, data cleaning used to be a large part of the data warehouse load. Now
that we’re working in a schemaless environment, I’m not sure where data cleansing is supposed
to take place. NoSQL sounds fun because theoretically you just drop everything in, but the transactional
systems that generate the data are still full of bugs and create junk data.

    My question is, where does data cleaning/master data management/CDI belong in a modern
data architecture? Before it hits Hadoop? After?


It's just about how deep your longing is!
