hadoop-mapreduce-user mailing list archives

From Sriram Ramachandrasekaran <sri.ram...@gmail.com>
Subject Re: Data cleansing in modern data architecture
Date Sun, 10 Aug 2014 04:55:21 GMT
While I may not have enough context on your entire processing pipeline,
here are my thoughts.
1. It's always useful to keep the raw data, irrespective of whether it was
right or wrong. The way to look at it is: it's the source of truth at
timestamp t.
2. Note that you only know the data at timestamp t for an id X was wrong
because subsequent info about X conflicts with what was recorded at t, or
because some manual debugging uncovers it.

Any system that does reporting/analytics is better off not meddling with
the raw data. There should be processed or computed views of this data
that massage it, get rid of noisy data, merge duplicate entries, etc., and
finally produce an output that's suitable for your reports/analytics. So
your idea to write transaction logs to HDFS is fine (unless you are
twisting your systems to get it that way), but you just need to introduce
one more layer of indirection, which holds the business logic to handle
noise/errors like this.

For your specific case, you could have a transaction processor job which
produces a view: it takes care of squashing transactions based on id (or
whatever key makes sense in your system) and then applies the business
logic for handling the bugs/discrepancies in them. Your views could be
loaded into a nice columnar store for faster query retrieval (if you have
pointed queries based on a key); otherwise a different store would be
needed. Yes, this has the overhead of running the view-creation job, but I
think the ability to go back to the raw data and investigate what happened
there is worth it.
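
Just to make that concrete, here is a rough sketch of what such a
view-creation job could look like in plain MapReduce. Everything in it is
an assumption for illustration (tab-separated records of the form
id<TAB>timestamp<TAB>payload, and a keep-the-latest-per-id squashing
rule); your own record format and business rules would go in the reducer:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TransactionViewJob {

  // Re-key every raw record by its transaction id.
  public static class IdMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2); // id \t rest
      if (parts.length == 2) {
        ctx.write(new Text(parts[0]), new Text(parts[1]));
      }
    }
  }

  // Squash all records for one id down to the latest one; the business
  // logic that drops or repairs records hit by the bug would live here.
  public static class SquashReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text id, Iterable<Text> records, Context ctx)
        throws IOException, InterruptedException {
      long latestTs = Long.MIN_VALUE;
      String latestPayload = null;
      for (Text rec : records) {
        String[] parts = rec.toString().split("\t", 2); // timestamp \t payload
        if (parts.length < 2) continue;
        long ts = Long.parseLong(parts[0]);
        if (ts > latestTs) {
          latestTs = ts;
          latestPayload = parts[1];
        }
      }
      if (latestPayload != null) {
        ctx.write(id, new Text(latestPayload));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "transaction-view");
    job.setJarByClass(TransactionViewJob.class);
    job.setMapperClass(IdMapper.class);
    job.setReducerClass(SquashReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // raw logs (untouched)
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // computed view
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

You would run this over the raw transaction logs and point Hive (or
whatever store serves your reports) at the job's output directory, leaving
the raw data untouched.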

Your approach of structuring the data and storing it in HBase is also fine,
as long as you keep the concerns separate (e.g., if your write/read
workloads are poles apart).

Hope this helps.





On Sun, Aug 10, 2014 at 9:36 AM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   Or...as an alternative, since HBase uses HDFS to store its data, can
> we get around the no-editing-files rule by dropping structured data into
> HBase? That way, we have data in HDFS that can be deleted. Any real
> problem with that idea?
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
>  *From:* Adaryl "Bob" Wakefield, MBA <adaryl.wakefield@hotmail.com>
> *Sent:* Saturday, August 09, 2014 8:55 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Data cleansing in modern data architecture
>
>   Answer: No, we can't get rid of bad records. We have to go back and
> rebuild the entire file. We can't edit records, but we can get rid of
> entire files, right? This would suggest that appending data to files
> isn't that great of an idea. It sounds like it would be more appropriate
> to cut a Hadoop data load up into periodic files (days, months, etc.)
> that can easily be rebuilt should errors occur....
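>
> A tiny sketch of that rebuild step, assuming one directory per day under
> a hypothetical /data/transactions/dt=YYYY-MM-DD layout (the path and
> naming here are made up for illustration):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class RebuildDay {
>   public static void main(String[] args) throws Exception {
>     String day = args[0]; // e.g. "2014-08-09"
>     FileSystem fs = FileSystem.get(new Configuration());
>     // Individual records can't be edited in place, but a whole day's
>     // directory can be dropped and reloaded from the source system.
>     Path dayDir = new Path("/data/transactions/dt=" + day);
>     fs.delete(dayDir, true);
>     // ...then re-run the export/load for just that day into dayDir...
>   }
> }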
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
>  *From:* Adaryl "Bob" Wakefield, MBA <adaryl.wakefield@hotmail.com>
> *Sent:* Saturday, August 09, 2014 4:01 AM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Data cleansing in modern data architecture
>
>   I'm sorry, but I have to revisit this again. Going through the reply
> below, I realized that I didn't quite get my question answered. Let me
> be more explicit with the scenario.
>
> There is a bug in the transactional system.
> The data gets written to HDFS where it winds up in Hive.
> Somebody notices that their report is off/the numbers don’t look right.
> We investigate and find the bug in the transactional system.
>
> Question: Can we then go back into HDFS and rid ourselves of the bad
> records? If not, what is the recommended course of action?
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
>
>  *From:* Shahab Yunus <shahab.yunus@gmail.com>
> *Sent:* Sunday, July 20, 2014 4:20 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Data cleansing in modern data architecture
>
>  I am assuming you meant the batch jobs that are/were used in the old
> world for data cleansing.
>
> As far as I understand, there is no hard and fast rule for it; it
> depends on the functional and system requirements of the use case.
>
> It is also dependent on the technology being used and how it manages
> 'deletion'.
>
> E.g., in HBase or Cassandra, you can write batch jobs which clean,
> correct, or remove unwanted or incorrect data, and then the underlying
> stores usually have a concept of compaction, which not only defragments
> data files but also, at that point, removes from disk all the entries
> marked as deleted.
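>
> As a minimal sketch of such a cleanup job against HBase (the table name,
> column family, row keys, and the 0.98-era HTable client API are
> assumptions, not something from this thread):
>
> import java.util.ArrayList;
> import java.util.List;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.Delete;
> import org.apache.hadoop.hbase.client.HTable;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class PurgeBadRows {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = HBaseConfiguration.create();
>     HTable table = new HTable(conf, "transactions");
>     List<Delete> deletes = new ArrayList<Delete>();
>     for (String rowKey : args) {            // row keys found to be bad
>       deletes.add(new Delete(Bytes.toBytes(rowKey)));
>     }
>     table.delete(deletes); // only writes tombstones; the cells are
>     table.close();         // physically removed at major compaction
>   }
> }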
>
> But there are considerations to be aware of, given that compaction is a
> heavy process and in some cases (e.g. Cassandra) there can be problems
> when there is too much data to be removed. Not only that: in some cases,
> marked-to-be-deleted data, until it is actually deleted/compacted, can
> slow down normal operations of the data store as well.
>
> In HBase's case, one can also leverage the versioning mechanism: the
> afore-mentioned batch job can simply overwrite the same row key, and the
> previous version would no longer be the latest. If the max-versions
> parameter is configured as 1, then no previous version would be
> maintained (physically it would still be there until it is removed at
> compaction time, but it would not be queryable).
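>
> For example (again just a sketch, with a made-up table, column family,
> and row key, using the 0.98-era client API; the column family is assumed
> to have been created with max versions set to 1):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.HTable;
> import org.apache.hadoop.hbase.client.Put;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class OverwriteBadRow {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = HBaseConfiguration.create();
>     // Column family "d" assumed created with MAX_VERSIONS => 1.
>     HTable table = new HTable(conf, "transactions");
>     // Re-put the corrected value under the same row key; the old
>     // version stops being queryable and is dropped at compaction.
>     Put put = new Put(Bytes.toBytes("txn-12345"));
>     put.add(Bytes.toBytes("d"), Bytes.toBytes("amount"), Bytes.toBytes("19.99"));
>     table.put(put);
>     table.close();
>   }
> }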
>
> In the end, cleansing can basically be done before or after loading,
> but given the append-only and no-hard-delete design approaches of most
> NoSQL stores, I would say it is easier to do the cleaning before the
> data is loaded into the NoSQL store. Of course, it bears repeating that
> it depends on the use case.
>
> Having said that, on a side note and a bit off-topic, this reminds me
> of the Lambda Architecture, which combines batch and real-time
> computation for big data using various technologies. It uses the idea
> of constant periodic refreshes to reload the data, and within each
> periodic refresh the expectation is that any invalid older data will be
> corrected and overwritten by the new load. Thus the 'batch part' of the
> LA basically takes care of data cleansing by reloading everything. But
> the LA is mostly for those systems which are OK with eventually
> consistent behavior, and it might not be suitable for some systems.
>
> Regards,
> Shahab
>
>
> On Sun, Jul 20, 2014 at 2:36 PM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
>>   In the old world, data cleaning used to be a large part of the data
>> warehouse load. Now that we're working in a schemaless environment, I'm
>> not sure where data cleansing is supposed to take place. NoSQL sounds
>> fun because theoretically you just drop everything in, but the
>> transactional systems that generate the data are still full of bugs and
>> create junk data.
>>
>> My question is, where does data cleaning/master data management/CDI
>> belong in a modern data architecture? Before it hits Hadoop? After?
>>
>> B.
>>
>
>



-- 
It's just about how deep your longing is!
