hadoop-hdfs-user mailing list archives

From Peyman Mohajerian <mohaj...@gmail.com>
Subject Re: Data cleansing in modern data architecture
Date Sun, 24 Aug 2014 22:13:18 GMT
If your data is in different partitions in HDFS, you can simply use tools
like Hive or Pig to read the data in a given partition, filter out the bad
data, and overwrite the partition. This kind of data cleansing is common
practice; I'm not sure why there is so much back and forth on this topic.
Of course the HBase approach works too, but I think it only makes sense if
you frequently have a large number of bad records. Otherwise, the
conventional way to do it in HDFS is to run a weekly or nightly job,
typically map/reduce, that scans your data, filters it, and writes it back.
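
To make the read, filter, overwrite pattern concrete, here is a minimal
sketch in plain Python. It is not Hive or Pig code: it assumes a local
directory of CSV files with a header row as a stand-in for one HDFS
partition. The names rewrite_partition and duplicate_filter, and the
choice of columns that identify a double entry, are hypothetical.

```python
import csv
import os
import shutil
import tempfile


def duplicate_filter(key_fields):
    """Return a predicate that flags a row once its key has been seen before.

    key_fields is an assumption: the columns that together identify
    a double-entered record (e.g. invoice id and amount).
    """
    seen = set()

    def is_bad(row):
        key = tuple(row[field] for field in key_fields)
        if key in seen:
            return True
        seen.add(key)
        return False

    return is_bad


def rewrite_partition(partition_dir, is_bad):
    """Filter every CSV file in one partition and overwrite the partition.

    Cleaned files are written to a staging directory next to the
    partition, then swapped in, so a failed run leaves the original
    data untouched. Returns the number of rows dropped.
    """
    partition_dir = os.path.abspath(partition_dir)
    parent = os.path.dirname(partition_dir)
    staging = tempfile.mkdtemp(prefix=".staging-", dir=parent)
    dropped = 0

    for name in sorted(os.listdir(partition_dir)):
        src = os.path.join(partition_dir, name)
        dst = os.path.join(staging, name)
        with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
            reader = csv.DictReader(fin)
            writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
            writer.writeheader()
            for row in reader:
                if is_bad(row):
                    dropped += 1
                else:
                    writer.writerow(row)

    # Swap the cleaned copy into place, then discard the old data.
    backup = partition_dir + ".old"
    os.rename(partition_dir, backup)
    os.rename(staging, partition_dir)
    shutil.rmtree(backup)
    return dropped
```

For the double-entry case described below, a call such as
rewrite_partition(path, duplicate_filter(["invoice_id", "amount"])) would
keep the first copy of each invoice and drop the repeats. This assumes a
legitimate invoice never repeats those column values within a partition;
if it can, the key needs more columns.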

On Mon, Aug 18, 2014 at 3:06 PM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   Exception files would only work in the case where a known error is
> thrown. The specific case I was trying to find a solution for is when data
> is the result of bugs in the transactional system or some other system that
> generates data based on human interaction. Here is an example:
> Customer Service Reps record interactions with clients through a web
> application.
> There is a bug in the web application such that invoices get double
> entered.
> This double entering goes on for days until it’s discovered by someone in
> accounting.
> We now have to go in and remove those double entries because it’s messing
> up every SUM() function result.
> In the old world, it was simply a matter of going in the warehouse and
> blowing away those records. I think the solution we came up with is instead
> of dropping that data into a file, drop it into HBase where you can do row
> level deletes.
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>  *From:* Jens Scheidtmann <jens.scheidtmann@gmail.com>
> *Sent:* Monday, August 18, 2014 12:53 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Data cleansing in modern data architecture
>     Hi Bob,
> the answer to your original question depends entirely on the procedures
> and conventions set forth for your data warehouse. So only you can answer
> it.
> If you're asking for best practices, it still depends:
> - How large are your files?
> - Have you enough free space for recoding?
> - Are you better off writing an "exception" file?
> - How do you make sure it is always respected?
> - etc.
> Best regards,
> Jens
