In this quick tutorial, I share 3 methods to keep you and your data out of trouble.
Disclaimer: the fields of data security and data protection are vast, and this tutorial barely skims the surface. Check with your institution about the specific standards and tools that may be relevant to you.
Quick note on the tutorial
You should be able to follow and recreate all of the results by copying the syntax in the grey boxes.
Installing packages
To get started, install the following packages if you don’t have them already: charlatan, dplyr, safer and anonymizer. Note that you will need to install anonymizer from GitHub, as the package is not available on CRAN.
## Install (if needed) and load all necessary packages
using <- function(...) {
  libs <- unlist(list(...))
  # require() returns FALSE for packages that are not installed
  req <- unlist(lapply(libs, require, character.only = TRUE))
  need <- libs[!req]
  if (length(need) > 0) {
    install.packages(need)
    # load what was just installed; invisible() keeps the result list from printing
    invisible(lapply(need, require, character.only = TRUE))
  }
}
using("charlatan", "dplyr", "safer")
#devtools::install_github("paulhendricks/anonymizer")
library(anonymizer)
## Remove files left over from a previous run, if they exist
fn <- "fakedata.csv"
if (file.exists(fn)) file.remove(fn)
## [1] TRUE
fn <- "fakedata_encrypted.csv"
if (file.exists(fn)) file.remove(fn)
## [1] TRUE
fn <- "fakedata_decrypted.csv"
if (file.exists(fn)) file.remove(fn)
## [1] TRUE
Make a fake dataset
We can use the charlatan package to create a dataset with some fake sensitive data.
First, let’s load charlatan and quickly make a fake dataset that has names, jobs and phone numbers for 30 people:
library("charlatan")
fakedata <- ch_generate('name', 'job', 'phone_number', n = 30)
Then, let’s add 4 more fake variables: Food Consumption Groups (fcg), an admin-1 area name (adm1name) and GPS coordinates (lat and long):
fakedata$fcg <- rep(c("poor", "borderline", "acceptable"), 10)
fakedata$adm1name <- rep(c("North", "Mountain", "Isles", "Rock", "Stormlands", "Dorne"), 5)
x <- fraudster()
fakedata$lat <- round(replicate(30, x$lat()),2)
fakedata$long <- round(replicate(30, x$lon()),2)
Last, let’s take a look at the dataset we created:
str(fakedata)
## Classes 'tbl_df', 'tbl' and 'data.frame': 30 obs. of 7 variables:
## $ name : chr "Dr. Randy Pfannerstill V" "Jarrad Olson-Stracke" "Emogene Goodwin" "Bascom Koch" ...
## $ job : chr "Environmental consultant" "Photographer" "English as a foreign language teacher" "Retail manager" ...
## $ phone_number: chr "238-806-6174x569" "1-916-904-8331x54314" "08030585426" "1-527-985-8556x476" ...
## $ fcg : chr "poor" "borderline" "acceptable" "poor" ...
## $ adm1name : chr "North" "Mountain" "Isles" "Rock" ...
## $ lat : num 33.3 47.1 -66.6 89.7 -76 ...
## $ long : num -127.2 -149.3 154.6 93.8 -176.5 ...
Case #1 : Get rid of sensitive information before sharing
Maybe we only need to share the job, adm1name and fcg variables with someone else. These three variables are not “sensitive”, so all we have to do is keep them or, equivalently, exclude the other variables in the dataset. This is easy with the select verb from dplyr.
First, let’s load dplyr, then create the dataset we’d like to share, fakedata_external, from the dataset fakedata, selecting only the variables job, adm1name and fcg.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
fakedata_external <- fakedata %>%
select(job, adm1name, fcg)
Take a look: it contains only the 3 variables and is safe for sharing.
str(fakedata_external)
## Classes 'tbl_df', 'tbl' and 'data.frame': 30 obs. of 3 variables:
## $ job : chr "Environmental consultant" "Photographer" "English as a foreign language teacher" "Retail manager" ...
## $ adm1name: chr "North" "Mountain" "Isles" "Rock" ...
## $ fcg : chr "poor" "borderline" "acceptable" "poor" ...
Alternatively, instead of specifying the variables you want to keep, as we did above, you can specify the variables you want to get rid of.
Let’s create the dataset fakedata_external2 from the dataset fakedata by de-selecting the variables name, phone_number, lat and long.
fakedata_external2 <- fakedata %>%
select(-name, -phone_number, -lat, -long)
Voilà, we get the same three variables as above (only the column order differs):
str(fakedata_external2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 30 obs. of 3 variables:
## $ job : chr "Environmental consultant" "Photographer" "English as a foreign language teacher" "Retail manager" ...
## $ fcg : chr "poor" "borderline" "acceptable" "poor" ...
## $ adm1name: chr "North" "Mountain" "Isles" "Rock" ...
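Before sending a file out, it can help to check programmatically that no sensitive column slipped through. The following is a minimal base R sketch, not part of the original workflow: the toy data frame, its values and the helper `remaining_sensitive()` are all made up for illustration, and the list of sensitive columns is simply the one used in this tutorial.

```r
# Toy stand-in for `fakedata`; all values are invented for illustration.
toy <- data.frame(
  name = c("A. Person", "B. Person"),
  job = c("Photographer", "Retail manager"),
  phone_number = c("000-0000", "111-1111"),
  fcg = c("poor", "borderline"),
  adm1name = c("North", "Mountain"),
  lat = c(33.3, 47.1),
  long = c(-127.2, -149.3),
  stringsAsFactors = FALSE
)

# Columns treated as sensitive in this tutorial.
sensitive <- c("name", "phone_number", "lat", "long")

# Base R equivalent of de-selecting: keep every column not listed as sensitive.
toy_external <- toy[, setdiff(names(toy), sensitive), drop = FALSE]

# Hypothetical helper: returns the sensitive column names still present in df.
remaining_sensitive <- function(df, sensitive_cols) {
  intersect(names(df), sensitive_cols)
}

# Stop before sharing if anything sensitive is left.
leftover <- remaining_sensitive(toy_external, sensitive)
if (length(leftover) > 0) {
  stop("Sensitive columns still present: ", paste(leftover, collapse = ", "))
}
names(toy_external)
```

Keeping the list of sensitive columns in one vector means the same list can drive both the de-selection and the check.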
Case #2 : Anonymize sensitive information for sharing
We might want to transform or anonymize sensitive information so that it can still be used, but with less risk.
We can anonymize variables using the anonymizer package (read more in the anonymizer package documentation, https://github.com/paulhendricks/anonymizer) and the mutate verb from dplyr.
First, let’s load anonymizer and dplyr, then create the dataset fakedata_anonymized with anonymized values for the variables name, phone_number, lat and long using the crc32 algorithm (you can read more about this and other options in the anonymizer documentation).
library(anonymizer)
library(dplyr)
fakedata_anonymized <- mutate(fakedata,
name = anonymize(name, .algo = "crc32"),
phone_number = anonymize(phone_number, .algo = "crc32"),
lat = anonymize(lat, .algo = "crc32"),
long = anonymize(long, .algo = "crc32"))
Let’s take a look
str(fakedata_anonymized)
## Classes 'tbl_df', 'tbl' and 'data.frame': 30 obs. of 7 variables:
## $ name : chr "d2e421d6" "c378128e" "e205f2fe" "01a00703" ...
## $ job : chr "Environmental consultant" "Photographer" "English as a foreign language teacher" "Retail manager" ...
## $ phone_number: chr "99577e90" "6d98e308" "ae3fc32c" "cca13c83" ...
## $ fcg : chr "poor" "borderline" "acceptable" "poor" ...
## $ adm1name : chr "North" "Mountain" "Isles" "Rock" ...
## $ lat : chr "00cfb3e3" "2860e8bf" "140a458c" "6ff47c9e" ...
## $ long : chr "cea952c8" "eb28284f" "75278005" "8e89df28" ...
Yep, all the variables with sensitive data have now been anonymized.
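Eyeballing str() works for 30 rows, but a scripted check scales better. Here is a rough base R sketch of two sanity checks: that no original value survives in an anonymized column, and that the number of distinct values is unchanged (so the codes can still be used to count unique people). The two small data frames are stand-ins built from the first three names and codes printed above; the variable names `leaks` and `same_cardinality` are just illustrative.

```r
# Stand-ins for `fakedata` and `fakedata_anonymized`, using the first three
# names and their codes as they appear in the output above.
original <- data.frame(
  name = c("Dr. Randy Pfannerstill V", "Jarrad Olson-Stracke", "Emogene Goodwin"),
  stringsAsFactors = FALSE
)
anonymized <- data.frame(
  name = c("d2e421d6", "c378128e", "e205f2fe"),
  stringsAsFactors = FALSE
)

# Columns that were supposed to be anonymized.
cols <- "name"

# TRUE for a column if any original value still appears in the anonymized data.
leaks <- vapply(cols, function(col) {
  any(anonymized[[col]] %in% original[[col]])
}, logical(1))

# TRUE for a column if the number of distinct values was preserved.
same_cardinality <- vapply(cols, function(col) {
  length(unique(anonymized[[col]])) == length(unique(original[[col]]))
}, logical(1))

leaks
same_cardinality
```

With the real data, `cols` would list all four anonymized variables; note that lat and long change type from numeric to character, which `%in%` handles by coercion.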
Case #3 : Encrypt a file containing sensitive information
Finally, sometimes we might need to share the whole dataset in its original condition. To do this, we’ll want to encrypt the dataset, and we can use the package safer.
First, let’s load safer and write fakedata to a CSV file:
library(safer)
write.csv(fakedata, "fakedata.csv")
Now, we will create the file fakedata_encrypted.csv by encrypting the file fakedata.csv. We made up the password/key m@keupuR0wnp@ss:
encrypt_file(infile = "fakedata.csv", key = "m@keupuR0wnp@ss", outfile = "fakedata_encrypted.csv")
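A key typed directly into a script travels with the script. One common alternative, sketched here in base R under stated assumptions, is to read the key from an environment variable. The variable name `FAKEDATA_KEY` and the helper `get_key()` are hypothetical, chosen only for this illustration.

```r
# Hypothetical helper: fetch the key from an environment variable and
# fail loudly if it has not been set.
get_key <- function(var = "FAKEDATA_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set")
  }
  key
}

# For demonstration only: set the variable inside the session. In practice it
# would be set outside the script, for example in ~/.Renviron.
Sys.setenv(FAKEDATA_KEY = "m@keupuR0wnp@ss")

key <- get_key()

# The key can then be passed on instead of a literal string, e.g.:
# encrypt_file(infile = "fakedata.csv", key = key, outfile = "fakedata_encrypted.csv")
```

This keeps the key out of any script you share or commit, though it is still stored in plain text wherever the variable is defined.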
Importing and taking a quick look, fakedata_encrypted.csv looks unusable to those without the key:
tried <- try(read.csv("fakedata_encrypted.csv"),
silent = TRUE)
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## line 2 appears to contain embedded nulls
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## line 3 appears to contain embedded nulls
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## line 5 appears to contain embedded nulls
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
## EOF within quoted string
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
## embedded nul(s) found in input
head(tried, 2)
## X
## 1 IP\xfc@\xc4\xefV\xb8\021e\xeb\x8b /\xc0\xe2Oa\001\xbd\x86\xf6\x96\xde廔<cã\xa27\x96Nn\a\xa8\xe5\x8aĕ9_\xdf\030\xc9\xe4\xf1O\xf66\fcķ\xca\xdḇ\\hT\027i\xb6+\xc7\025\xf7\0345\xec#\x82|\xca\xcfZݚ\xc0\x93\xd0_\xf9\177Z\xc1\030\177\xedx\x85\xf7n\xca\xe6P*60\xb2\x89)\xf6\xe9\xc1\x98\xd4\xc1\xbc\xb0\a6\037\x8c\x97i\xb4@QT1ٯi\xceu\xb7il\xedƷG\xf7F\xb8\xedo\xe5c\xec\022\xb8n\x8bm\x9bh\xf2\0037ls\a\x8e\003ܘ{\xe9\xc5\005\021\xc9s/\xd3W\xdb\xc5\025\xa9j\xd8H\xa4\x89\xce\xf0QI]7]T\x95k\bdz\x90vϞ\xb1SD
## 2 \x94\xd9m\xb0V\xbf\xb3l\x90\x95\x9c\xa0\xc13s\aV37\016\xc83\x98k\xfd?\xb5<\xdb\035f\x83\xd8\017\x82:\xfbu%\xb7\025b.\x99TS\xf4\xb7\xa99d\xd8\xe2 \xde|\x98\x86@ɶi\x88\xa5A\t\xf2Ux\xc1
But if we share fakedata_encrypted.csv along with the key (it’s good practice to send the key to the recipient in a separate message, not through the same message or channel as the encrypted file), the recipient can use the following code to decrypt the file:
decrypt_file(infile = "fakedata_encrypted.csv", key = "m@keupuR0wnp@ss", outfile = "fakedata_decrypted.csv")
Take a look: we’ve decrypted it and it’s now usable.
fakedata_decrypted <- read.csv("fakedata_decrypted.csv")
str(fakedata_decrypted)
## 'data.frame': 30 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ name : Factor w/ 30 levels "Adrain Stehr-Williamson",..: 9 16 11 4 19 1 10 2 14 7 ...
## $ job : Factor w/ 28 levels "Acupuncturist",..: 12 18 11 23 2 9 3 26 23 10 ...
## $ phone_number: Factor w/ 30 levels "(115)489-6918x03271",..: 22 16 11 13 27 1 24 15 14 29 ...
## $ fcg : Factor w/ 3 levels "acceptable","borderline",..: 3 2 1 3 2 1 3 2 1 3 ...
## $ adm1name : Factor w/ 6 levels "Dorne","Isles",..: 4 3 2 5 6 1 4 3 2 5 ...
## $ lat : num 33.3 47.1 -66.6 89.7 -76 ...
## $ long : num -127.2 -149.3 154.6 93.8 -176.5 ...
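Notice that the decrypted file has 8 variables rather than 7: the extra X column holds the row names that write.csv saves by default. The strings also come back as factors, which was the read.csv default before R 4.0.0. If you want the recipient to get exactly the original columns, a small base R sketch like the one below may help; the toy data frame and temporary file names are made up for the example.

```r
# Toy data frame standing in for `fakedata`.
toy <- data.frame(
  job = c("Photographer", "Retail manager"),
  fcg = c("poor", "borderline"),
  stringsAsFactors = FALSE
)

# Temporary files so the example leaves nothing behind in the working directory.
with_rownames <- tempfile(fileext = ".csv")
without_rownames <- tempfile(fileext = ".csv")

# Default behaviour: row names are written and come back as a column named X.
write.csv(toy, with_rownames)
names(read.csv(with_rownames))

# With row.names = FALSE the file contains only the real columns.
write.csv(toy, without_rownames, row.names = FALSE)

# stringsAsFactors = FALSE keeps text columns as character on any R version.
roundtrip <- read.csv(without_rownames, stringsAsFactors = FALSE)
isTRUE(all.equal(roundtrip, toy))
```

The same two arguments can be applied to the write.csv and read.csv calls on either side of the encryption step.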
The End
To wrap up, here are the 3 scenarios in which you might need to transform and share sensitive data:
Case 1 - Get rid of only the sensitive data: use the select verb to create a new dataset that doesn’t contain the sensitive variables.
Case 2 - Anonymize sensitive data: use the anonymizer package and the mutate verb to anonymize sensitive variables.
Case 3 - Encrypt sensitive data: use the safer package to create an encrypted dataset. Send your recipient the password separately and they’ll be able to use safer to decrypt the file.