spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jörn Franke <jornfra...@gmail.com>
Subject Re: Masking username in Spark with regexp_replace and reverse functions
Date Sun, 17 Mar 2019 09:50:43 GMT
For the approach below you have to check for collisions, ie different name lead to same masked
value.

You could hash it. However in order to avoid that one can just try different hashes you need
to include in each name a different random factor. 

However, the anonymization problem is bigger, because based on other fields and correlation,
the individual might still be identifiable.

> Am 16.03.2019 um 18:39 schrieb Mich Talebzadeh <mich.talebzadeh@gmail.com>:
> 
> Hi,
> 
> I am looking at Description column of a bank statement (CSV download) that has the following
format
> 
> scala> account_table.printSchema
> root
>  |-- TransactionDate: date (nullable = true)
>  |-- TransactionType: string (nullable = true)
>  |-- Description: string (nullable = true)
>  |-- Value: double (nullable = true)
>  |-- Balance: double (nullable = true)
>  |-- AccountName: string (nullable = true)
>  |-- AccountNumber: string (nullable = true)
> 
> The column description for BACS payments contains the name of the individual who paid
into the third party account. I need to mask the name but cannot simply use a literal as below
for all contents of descriptions column!
> 
> f1.withColumn("Description", lit("*** Masked ***")).select('Description.as("Who paid")
> 
> So I try the following combination
> 
> f1.select(trim(substring(substring_index('Description, ",", 1),2,50)).as("name in clear"),
> reverse(regexp_replace(regexp_replace(regexp_replace(substring(regexp_replace('Description,
"^['A-Z]", "XX"),2,6),"[A-F]","X")," ","X"),"[,]","R")).as("Masked")).show
> +------------------+------+
> |          in clear|Masked|
> +------------------+------+
> |       FATAH SABAH|HXTXXX|
> |       C HIGGINSON|GIHXXX|
> |           SOLTA A|XTLOSX|
> +------------------+------+
> 
> This seems to work as it not only masks the name but also makes it consistent for all
names (in other words, the same username gets the same mask).
> 
> Are there any better alternatives?
> 
> Thanks
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage
or destruction of data or any other property which may arise from relying on this email's
technical content is explicitly disclaimed. The author will in no case be liable for any monetary
damages arising from such loss, damage or destruction.
>  

Mime
View raw message