Bill Olander


In this quick tutorial, I share 3 methods to keep you and your data out of trouble.

Disclaimer: the fields of data security and data protection are vast, and this tutorial barely skims the surface. Check with your institution about the specific standards and tools that may be relevant to you.

Quick note on the tutorial

You should be able to follow and recreate all of the results by copying the syntax in the grey boxes.

Installing packages

To get started, if you don’t have them already, you will need the following packages: charlatan, dplyr, safer and anonymizer. Note that anonymizer has to be installed from GitHub, as the package is not available on CRAN.

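## Getting all necessary packages

# Try to load each package; install any that are missing, then load them
using <- function(...) {
    libs <- unlist(list(...))
    # require() returns FALSE (with a warning) when a package is not installed
    req <- unlist(lapply(libs, require, character.only = TRUE))
    need <- libs[!req]
    if (length(need) > 0) {
        install.packages(need)
        lapply(need, require, character.only = TRUE)
    }
}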


using("charlatan","dplyr","safer")

# anonymizer is not on CRAN; uncomment the next line to install it from GitHub
#devtools::install_github("paulhendricks/anonymizer")
library(anonymizer)

## Also removing files if exist
fn <- "fakedata.csv"
if (file.exists(fn))   file.remove(fn)
## [1] TRUE
fn <- "fakedata_encrypted.csv"
if (file.exists(fn))   file.remove(fn)
## [1] TRUE
fn <- "fakedata_decrypted.csv"
if (file.exists(fn))   file.remove(fn)
## [1] TRUE

Make a fake dataset

We can use the charlatan package to create a dataset with some fake sensitive data.

First, let’s load charlatan and quickly make a fake dataset that has names, jobs and phone numbers for 30 people:

library("charlatan")

fakedata <- ch_generate('name', 'job', 'phone_number', n = 30)

Then, let’s add 4 more fake variables: Food Consumption Groups (fcg), admin level 1 name (adm1name) and GPS coordinates (lat & long):

fakedata$fcg <- rep(c("poor", "borderline", "acceptable"), 10)
fakedata$adm1name <- rep(c("North", "Mountain", "Isles", "Rock", "Stormlands",  "Dorne"), 5)
x <- fraudster()
fakedata$lat <- round(replicate(30, x$lat()),2)
fakedata$long <- round(replicate(30, x$lon()),2)

Last, let’s take a look at the dataset we created:

str(fakedata)
## Classes 'tbl_df', 'tbl' and 'data.frame':    30 obs. of  7 variables:
##  $ name        : chr  "Dr. Randy Pfannerstill V" "Jarrad Olson-Stracke" "Emogene Goodwin" "Bascom Koch" ...
##  $ job         : chr  "Environmental consultant" "Photographer" "English as a foreign language teacher" "Retail manager" ...
##  $ phone_number: chr  "238-806-6174x569" "1-916-904-8331x54314" "08030585426" "1-527-985-8556x476" ...
##  $ fcg         : chr  "poor" "borderline" "acceptable" "poor" ...
##  $ adm1name    : chr  "North" "Mountain" "Isles" "Rock" ...
##  $ lat         : num  33.3 47.1 -66.6 89.7 -76 ...
##  $ long        : num  -127.2 -149.3 154.6 93.8 -176.5 ...

Case #1 : Get rid of sensitive information before sharing

Maybe we only need to share the job, adm1name and fcg variables with someone else. These three variables are not “sensitive”, so all we have to do is keep them, or equivalently exclude the other variables in the dataset. Doing this is easy using the select verb from dplyr.

First, let’s load dplyr and create the dataset we’d like to share, fakedata_external, from the dataset fakedata, selecting only the variables job, adm1name and fcg:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
fakedata_external <- fakedata %>% 
                     select(job, adm1name, fcg)

Take a look: it contains only the 3 variables and is safe for sharing.

str(fakedata_external)
## Classes 'tbl_df', 'tbl' and 'data.frame':    30 obs. of  3 variables:
##  $ job     : chr  "Environmental consultant" "Photographer" "English as a foreign language teacher" "Retail manager" ...
##  $ adm1name: chr  "North" "Mountain" "Isles" "Rock" ...
##  $ fcg     : chr  "poor" "borderline" "acceptable" "poor" ...

Alternatively, instead of specifying the variables you want to keep, like we did above, you can just specify the variables you want to get rid of.

Let’s create the dataset fakedata_external2 from the dataset fakedata by de-selecting the variables name, phone_number, lat and long:

fakedata_external2 <- fakedata %>% 
                      select(-name, -phone_number, -lat, -long)

Voilà, we get the same three variables as above (this time the columns simply keep their original order):

str(fakedata_external2)
## Classes 'tbl_df', 'tbl' and 'data.frame':    30 obs. of  3 variables:
##  $ job     : chr  "Environmental consultant" "Photographer" "English as a foreign language teacher" "Retail manager" ...
##  $ fcg     : chr  "poor" "borderline" "acceptable" "poor" ...
##  $ adm1name: chr  "North" "Mountain" "Isles" "Rock" ...
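Before sending a file out, it can help to double check that none of the sensitive columns slipped through. Here is a minimal sketch in base R. The helper check_no_sensitive() and the small stand-in data frame are made up for illustration; they are not part of dplyr or of the dataset above.

```r
# Hypothetical helper (not from any package): stop if any column named in
# `sensitive` is still present in the data frame we are about to share.
check_no_sensitive <- function(df, sensitive) {
  leaked <- intersect(names(df), sensitive)
  if (length(leaked) > 0) {
    stop("Sensitive columns still present: ", paste(leaked, collapse = ", "))
  }
  invisible(TRUE)
}

# Small stand-in for fakedata so the example runs on its own
demo <- data.frame(
  name = c("A. Person", "B. Person"),
  job  = c("Photographer", "Retail manager"),
  fcg  = c("poor", "borderline"),
  lat  = c(33.3, 47.1),
  stringsAsFactors = FALSE
)
sensitive_vars <- c("name", "phone_number", "lat", "long")

# Drop the sensitive columns, then verify nothing sensitive is left
demo_external <- demo[, setdiff(names(demo), sensitive_vars), drop = FALSE]
check_no_sensitive(demo_external, sensitive_vars)  # passes silently
```

Calling check_no_sensitive(demo, sensitive_vars) on the unfiltered data would stop with an error instead, which is the behaviour you want right before a file leaves your machine.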

Case #2 : Anonymize sensitive information for sharing

We might want to transform or anonymize sensitive information so that it can still be used, but with less risk.

We can anonymize variables using the anonymizer package (read more in the anonymizer package documentation: https://github.com/paulhendricks/anonymizer) and the mutate verb from dplyr.

First, let’s load anonymizer and dplyr and create the dataset fakedata_anonymized, with anonymized values for the variables name, phone_number, lat and long, using the crc32 algorithm (you can read more about this and other options in the anonymizer documentation):

library(anonymizer)
library(dplyr)

fakedata_anonymized <- mutate(fakedata, 
                              name = anonymize(name, .algo = "crc32"), 
                              phone_number = anonymize(phone_number, .algo = "crc32"), 
                              lat = anonymize(lat, .algo = "crc32"), 
                              long = anonymize(long, .algo = "crc32"))

Let’s take a look

str(fakedata_anonymized)
## Classes 'tbl_df', 'tbl' and 'data.frame':    30 obs. of  7 variables:
##  $ name        : chr  "d2e421d6" "c378128e" "e205f2fe" "01a00703" ...
##  $ job         : chr  "Environmental consultant" "Photographer" "English as a foreign language teacher" "Retail manager" ...
##  $ phone_number: chr  "99577e90" "6d98e308" "ae3fc32c" "cca13c83" ...
##  $ fcg         : chr  "poor" "borderline" "acceptable" "poor" ...
##  $ adm1name    : chr  "North" "Mountain" "Isles" "Rock" ...
##  $ lat         : chr  "00cfb3e3" "2860e8bf" "140a458c" "6ff47c9e" ...
##  $ long        : chr  "cea952c8" "eb28284f" "75278005" "8e89df28" ...

Yep, all the variables with sensitive data have now been anonymized.
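One caveat: crc32 is a checksum rather than a cryptographic hash, and an unsalted hash of values drawn from a small set of possibilities (phone numbers, coordinates rounded to two decimals) can in principle be reversed by hashing every candidate and comparing. A rough sketch of one mitigation, adding a secret salt before hashing, is shown below. It uses the digest package directly rather than anonymizer; the helper hash_with_salt() and the salt value are assumptions made up for this example.

```r
library(digest)

# Hypothetical helper: prepend a secret salt to each value, then hash it.
# serialize = FALSE hashes the string itself rather than R's serialized object.
hash_with_salt <- function(x, salt, algo = "sha256") {
  vapply(paste0(salt, as.character(x)),
         digest, character(1),
         algo = algo, serialize = FALSE,
         USE.NAMES = FALSE)
}

phones <- c("238-806-6174x569", "08030585426", "238-806-6174x569")
salt <- "replace-with-a-long-random-secret"  # assumption: kept private, never shared

hashed <- hash_with_salt(phones, salt)

hashed[1] == hashed[3]  # identical inputs still map to the same value
hashed[1] == hashed[2]  # different inputs map to different values
```

Because equal inputs still produce equal outputs, the hashed column can be used for grouping or joining, while someone without the salt cannot simply hash a list of candidate phone numbers to look values up.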

Case #3 : Encrypt a file containing sensitive information

Finally, sometimes we might need to share the whole dataset in its original condition. To do this, we’ll want to encrypt the dataset, and we can use the package safer.

First, let’s load safer and write the dataset to a CSV file, fakedata.csv:

library(safer)
write.csv(fakedata, "fakedata.csv")

Now, we will create the file fakedata_encrypted.csv by encrypting the file fakedata.csv. We made up the password/key m@keupuR0wnp@ss:

encrypt_file(infile = "fakedata.csv", key = "m@keupuR0wnp@ss", outfile = "fakedata_encrypted.csv")
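In a real project you may prefer not to type the key into the script at all, since scripts tend to get shared and committed to version control. One possible approach, sketched here in base R, is to read the key from an environment variable. The helper get_key() and the variable name FAKEDATA_KEY are arbitrary choices for this example, not something safer requires.

```r
# Hypothetical helper: read the key from an environment variable and fail
# loudly if it has not been set. The variable name is an arbitrary choice.
get_key <- function(var = "FAKEDATA_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set")
  }
  key
}

# For demonstration only: normally the variable is set outside the script,
# for example in ~/.Renviron, so the key never appears in the code.
Sys.setenv(FAKEDATA_KEY = "m@keupuR0wnp@ss")

key <- get_key()

# The key can then be passed on in place of the literal string, e.g.
# encrypt_file(infile = "fakedata.csv", key = key,
#              outfile = "fakedata_encrypted.csv")
```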

If we import it and take a quick look, fakedata_encrypted.csv is unusable to anyone without the key:

tried <- try(read.csv("fakedata_encrypted.csv"),
                 silent = TRUE)
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## line 2 appears to contain embedded nulls
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## line 3 appears to contain embedded nulls
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## line 5 appears to contain embedded nulls
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
## EOF within quoted string
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
## embedded nul(s) found in input
head(tried, 2)
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             X
## 1 IP\xfc@\xc4\xefV\xb8\021e\xeb\x8b /\xc0\xe2Oa\001\xbd\x86\xf6\x96\xde廔<cã\xa27\x96Nn\a\xa8\xe5\x8aĕ9_\xdf\030\xc9\xe4\xf1O\xf66\fcķ\xca\xdḇ\\hT\027i\xb6+\xc7\025\xf7\0345\xec#\x82|\xca\xcfZݚ\xc0\x93\xd0_\xf9\177Z\xc1\030\177\xedx\x85\xf7n\xca\xe6P*60\xb2\x89)\xf6\xe9\xc1\x98\xd4\xc1\xbc\xb0\a6\037\x8c\x97i\xb4@QT1ٯi\xceu\xb7il\xedƷG\xf7F\xb8\xedo\xe5c\xec\022\xb8n\x8bm\x9bh\xf2\0037ls\a\x8e\003ܘ{\xe9\xc5\005\021\xc9s/\xd3W\xdb\xc5\025\xa9j\xd8H\xa4\x89\xce\xf0QI]7]T\x95k\bdz\x90vϞ\xb1SD
## 2                                                                                                                                                                                                                                                                                                                      \x94\xd9m\xb0V\xbf\xb3l\x90\x95\x9c\xa0\xc13s\aV37\016\xc83\x98k\xfd?\xb5<\xdb\035f\x83\xd8\017\x82:\xfbu%\xb7\025b.\x99TS\xf4\xb7\xa99d\xd8\xe2 \xde|\x98\x86@ɶi\x88\xa5A\t\xf2Ux\xc1

The recipient needs both fakedata_encrypted.csv and the key. It’s good practice to send the key in a separate message, not in the same message or by the same method used to share the encrypted file. With both in hand, your recipient can use the following code to decrypt the file:

decrypt_file(infile = "fakedata_encrypted.csv", key = "m@keupuR0wnp@ss", outfile = "fakedata_decrypted.csv")

Take a look: we’ve decrypted it and it’s now usable.

fakedata_decrypted <- read.csv("fakedata_decrypted.csv")

str(fakedata_decrypted)
## 'data.frame':    30 obs. of  8 variables:
##  $ X           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ name        : Factor w/ 30 levels "Adrain Stehr-Williamson",..: 9 16 11 4 19 1 10 2 14 7 ...
##  $ job         : Factor w/ 28 levels "Acupuncturist",..: 12 18 11 23 2 9 3 26 23 10 ...
##  $ phone_number: Factor w/ 30 levels "(115)489-6918x03271",..: 22 16 11 13 27 1 24 15 14 29 ...
##  $ fcg         : Factor w/ 3 levels "acceptable","borderline",..: 3 2 1 3 2 1 3 2 1 3 ...
##  $ adm1name    : Factor w/ 6 levels "Dorne","Isles",..: 4 3 2 5 6 1 4 3 2 5 ...
##  $ lat         : num  33.3 47.1 -66.6 89.7 -76 ...
##  $ long        : num  -127.2 -149.3 154.6 93.8 -176.5 ...
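Notice that the decrypted data has an extra column X and that the text columns came back as factors. Neither is caused by the encryption: X holds the row names that write.csv adds by default, and in R versions before 4.0 read.csv converts strings to factors by default. A small sketch of a cleaner round trip, using a made-up stand-in data frame and a temporary file:

```r
# Small stand-in for fakedata so the example runs on its own
demo <- data.frame(
  name = c("A. Person", "B. Person"),
  fcg  = c("poor", "borderline"),
  lat  = c(33.3, 47.1),
  stringsAsFactors = FALSE
)

tmp <- tempfile(fileext = ".csv")

# row.names = FALSE keeps the row numbers out of the file (no extra X column)
write.csv(demo, tmp, row.names = FALSE)

# stringsAsFactors = FALSE keeps text columns as character on any R version
demo_back <- read.csv(tmp, stringsAsFactors = FALSE)

# Check that the round trip preserved the data
isTRUE(all.equal(demo, demo_back))

unlink(tmp)
```

The same two arguments can be used on either side of encrypt_file and decrypt_file, since those functions only transform the file on disk.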

The End

To wrap up, here are the 3 scenarios in which you might find yourself needing to transform and share sensitive data:

  • Case 1 - Get rid of the sensitive data: use the select verb to create a new dataset which doesn’t contain the sensitive variables.

  • Case 2 - Anonymize sensitive data: use the package anonymizer and the mutate verb to anonymize sensitive variables.

  • Case 3 - Encrypt sensitive data: use the package safer to create an encrypted dataset. Send your recipient the password separately and they’ll be able to use safer to decrypt the file.
