How Learning R Saves Me Many Hours at Work [And I’m not a Data Scientist]
If you are an office worker (like me) who frequently works with massive data but does not have data science skills (yet!), you might find this story interesting.
Recently, due to organizational changes, I took over some reporting jobs from my colleagues. The reports are consumed by the board of directors so that they have the most up-to-date data and information to make business decisions.
So timeliness and accuracy are especially important for these reports.
However, in creating these reports, my colleagues have to process several Excel and CSV files consisting of hundreds of thousands of data points. And these data points need to be cross-referenced with each other to ensure data consistency.
These reports need to be produced on a weekly basis, and it usually takes my colleagues up to 3 days to create them, since they only use Microsoft Excel to process all the data sources.
My task is to learn the process and produce the reports in time for the weekly deadline, while still performing my existing job. And I only got 2 days to accomplish all of that.
I knew that if I relied only on Microsoft Excel to process the data and create the reports, I would not make the deadline, not to mention the errors I might make.
Enter RStudio
Although not a data scientist, I have a strong interest in data science and machine learning.
So, out of curiosity, several months before the new assignment, I taught myself how to perform data processing using R and Python. I even took a class on machine learning, secretly hoping that I could apply that new knowledge at work (though I knew it wouldn’t happen).
Nevertheless, for me, the learning was effort well spent. At least I could use data science skills for personal projects. Then I was presented with this reporting assignment.
My Reporting Workflow
Realizing that I could use RStudio to simplify the reporting process, I quickly came up with an R workflow that ingests the files extracted from company systems and creates an output that can be distributed to the whole company.
If you’re a data scientist, you will find this flow obvious, but that is not the case for ordinary office workers who do not have data science skills yet have to work with massive amounts of data, sometimes unstructured and on a regular basis.
Importing Data Sources into RStudio
XLSX and CSV files are the most common structured data formats distributed internally in a company. So my very first step is to import the required data sources into RStudio so they can be processed further.
To import the Excel files, I use the read_excel function from the readxl library.
library(readxl)
acct_df <- read_excel("file_cif.xlsx")
This imports an Excel file named “file_cif.xlsx” into RStudio and converts it into a data frame.
Although the read_excel function automatically determines the type of each column, sometimes you need to tell R what the type of a column should be. This is especially true if you have datetime values in your Excel file.
col_types <- c(rep("guess", 5), "text", rep("guess", 2), "text",
               "guess", "guess")
acct_df <- read_excel("file_cif.xlsx", col_types = col_types)
To import the CSV files, I use the read.csv function from base R.
card_df <- read.csv("fname_card.csv", sep = ";")
This imports a file called “fname_card.csv” into RStudio and converts it into a data frame.
By default, read.csv expects a comma as the separator, so when the file uses a different character (a semicolon in this case), you can use the sep argument to specify the separator manually.
Both functions above import and convert the input files into data frames, which can easily be manipulated, cleansed, and processed in R.
Processing and Cleansing the Data
This is the part where, in my opinion, R works very well. I prefer working with R over Python for data wrangling and cleansing. But for machine learning applications, Python is my preferred tool.
There are some recurring patterns when cleansing and wrangling data extracted from a company’s internal systems. These are the kinds of issues I most frequently need to clean up.
Converting numeric data from scientific notation
Sometimes, after a file containing a number column is imported, the numbers are converted into scientific notation, e.g. 1.2345e+08. To convert them into a regular number format, we can use the format function.
string <- "6.123456e+12"phone_number = format(string,scientific = F)
Removing commas or dots from the numbers
Depending on which part of the world you’re from, numbers can be written using commas or dots as thousands separators. R cannot automatically interpret those kinds of numbers as numeric data, so we need to remove those characters. We can use a simple regular expression with the sub or gsub functions to remove them.
string <- "USD1,000,000.00price <- sub("USD","",string)price <- gsub(",","",price)price <- sub(".00","",price)
You will probably realize that these functions can also remove other unwanted characters from the numbers. For example, if the numbers are preceded by a currency symbol, we can also use sub or gsub to remove the symbol. As a matter of fact, these functions can be used to remove or replace any unwanted characters in character data.
Converting numbers into numeric data
Sometimes when a file is imported, its numeric data is read by R as character or string data. To convert it into numeric data, we can use the as.numeric function.
string <- "100000"price <- as.numeric(string)
Converting datetime format
From my experience, datetime is the trickiest data type to process in R. There are several R libraries dedicated to processing and cleansing datetime data. However, the most frequent task is to convert datetimes from character type into date type and then fix inconsistencies in the date format.
string <- "12 01 2010"datetime <- as.Date(string,"%d %m %Y")
The script above converts the string into a Date data type and changes the format into the default R date format, i.e. yyyy-mm-dd.
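When the source files mix more than one date format, one simple approach is to try the first format and fall back to a second one for the rows that fail to parse. This is only a sketch with two made-up formats; the formats in your files will differ:

# Hypothetical column mixing "dd mm yyyy" and "yyyy-mm-dd"
strings <- c("12 01 2010", "2010-01-12")
dates <- as.Date(strings, "%d %m %Y")       # rows in the second format become NA
fallback <- as.Date(strings, "%Y-%m-%d")
dates[is.na(dates)] <- fallback[is.na(dates)]   # both elements are now "2010-01-12"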
Removing whitespace from character data
For various reasons, data imported into R may contain unwanted whitespace, especially at the beginning and end of a character string. To remove the whitespace, we can use the trimws function from base R.
string <- " this is a sentence. "clean_string <- trimws(string, which = "both")
Extracting a substring from character data
Often we just need a portion of character data for further processing. For example, you may want to extract the month name from the character data “15Jan”. For this purpose, we can use the substring function. The numbers indicate the start and end positions of the characters to keep.
string <- "15Jan"month_text <- substring(string,3,5)
Data Manipulation and Visualization
Now that all the necessary data are clean and consistent, we can proceed with the data manipulation. This is the step where the actual work of creating the reports happens. For this purpose, the dplyr library is very useful. With dplyr we can filter the data, select the columns that we want to keep, and even merge one data frame with another. Let’s look at the script example below.
library(dplyr)

user_df <- read.csv("fname_user.csv", sep = ";")
user_df <- user_df %>%
  select(Cardnum, phone_number, registration_date) %>%
  filter(registration_date > "2019-12-31")
In the script above, we use read.csv to import a CSV file called “fname_user.csv” into RStudio and convert it into a data frame.
Then we use the select function from the dplyr library to select the columns that we want to keep (and drop the rest). This improves processing time.
And last, we use filter to keep only the rows whose “registration_date” is after Dec 31, 2019.
The actual script used for processing the data and creating the report is more complicated than this example, but it should give you an overview of what the script looks like. The result of the data manipulation is a data frame whose format and structure fulfill the reporting requirements.
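The cross-referencing between data sources mentioned at the start is essentially merging data frames on a common key, which dplyr’s join functions handle. A minimal sketch, assuming the card and user data frames share a Cardnum column (the column names here are illustrative):

library(dplyr)

# Illustrative cross-reference: attach the matching user record to each card
report_df <- card_df %>%
  left_join(user_df, by = "Cardnum") %>%
  filter(!is.na(registration_date))   # keep only cards that matched a registered user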
If we are required to provide some visualization of the data, i.e. present the data in charts, then my favorite library is plotly.
library(plotly)

fig <- group_user %>%
  plot_ly(x = ~user_reg_datetime,
          y = ~total, type = "bar") %>%
  layout(xaxis = list(title = ""),
         yaxis = list(title = "Number of Users",
                      range = c(0, 9000)))
This script visualizes the data frame “group_user” as a bar chart like the one shown in the picture below.
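For completeness, the group_user data frame in that snippet is just an aggregated view of the user data. Something along these lines would produce it, though the exact grouping is an assumption on my part:

library(dplyr)

# Assumed aggregation: count users per registration date
group_user <- user_df %>%
  group_by(user_reg_datetime = as.Date(registration_date)) %>%
  summarise(total = n())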
Creating The Output for Distribution
The reports are now done, in the format and structure required by management. So the next step is to get the report out of R into a format that can easily be distributed and consumed by other people in an office environment. There are several options for doing this:
- Save the data frame as an XLSX or CSV file. Using the write_xlsx function from the writexl library, we can convert and save the data frame into an Excel file that can be opened by people who are not familiar with R. Similarly, using the write.csv function, we can save the data frame into a CSV file (a short sketch follows this list).
- If you visualize the data frame using plotly, you can save the chart as a PNG file by clicking the camera icon at the top of the chart. Then you can insert the image into any PowerPoint or Word document.
- There are also other options, like using R Markdown documents or creating a dashboard with a library such as Shiny Dashboard. But these options are rarely implemented in an organization that does not have a data science function.
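As a rough sketch of the export step (the file names are placeholders, and report_df stands for the final data frame from the manipulation step):

library(writexl)

# Save the final report both as an Excel file and as a CSV file
write_xlsx(report_df, "weekly_report.xlsx")
write.csv(report_df, "weekly_report.csv", row.names = FALSE)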
Using R and data science knowledge, I am able to produce the same reports faster, with consistent results, and in a scalable way. By scalable, I mean that if this kind of report requires more data sources to be processed in the future, I just need to slightly modify the scripts.
At the beginning of the assignment, I needed a few hours to understand the data flow, data structure, and report format, on top of information like where to get the data and who to contact for it. Then I spent 3–4 hours writing the script in R, debugging it, and reviewing the report prototype.
But now that all this work is done, I only need a few minutes to run the script each week and a few more minutes to maintain the code. By my estimation, it now takes less than 1 hour per week to run the script, maintain the code, and distribute the results.
In the end, I was able to meet the deadline, shorten the processing time from over 2 days to less than 1 hour, and still have some spare time for coffee breaks. All because I know R.