Data Science with ‘Strenge Jacke!’: R Packages and data wrangling issues

Today’s post shares another interesting blog about Data Science. This time the valued new discovery comes from Germany (or from a German-speaking fellow…) and is called Strenge Jacke! I must say I know almost nothing of the German language, but fortunately this blog is written in good English overall.

This is a blog that deals mostly with Data Science topics and with the technicalities of implementing analysis algorithms in R. It is therefore aimed at readers at an intermediate to advanced level. My own level in R at the moment is between basic and intermediate, and I am working diligently to step up to the next level. This is best done by hands-on effort, working with real data sets and going through all the nitty-gritty.

Notwithstanding my currently steep learning curve, reading blogs such as this one is also helpful. The main reason is the R scripts they display. I am finding that reading code while you learn the nitty-gritty is illuminating in its own right. With this in mind, the best way to reach a proficient level in data analysis with R is to know as much as possible about its numerous packages (their number increases daily), install them, and experiment with what they can do on real data sets.

The blog chosen today is one such nice example to share with the followers of The Information Age. The post I’ve chosen is from November last year, and it is another of the many posts on this blog where the R scripts implementing the packages relevant to the task at hand are illuminating for the Data Scientist or Engineer, especially one beginning a career, but also for the more seasoned.

The post is also important because it tells us what the current trends in R packages are. The most important ones revolve around data quality (tidy data and the tidyverse packages), package design, and the increasingly critical topic of API design. In environments with high rates of data transactions and transformations, interface design becomes of paramount importance. To this end the author of the blog post gives a brief overview of several R packages relevant to those tasks, with some nice graphics worth mentioning and viewing.

In this post I decided to skip the original R scripts, to encourage interested readers to check them for themselves in the original post.

Pipe-friendly workflow with sjPlot, sjmisc and sjstats, part 1 #rstats #tidyverse

 

Recent developments in R packages are increasingly focusing on the philosophy of tidy data and a common package design and API. Tidy data is an important part of data exploration and analysis, as shown in the following figure:

(…)

Tidying data not only includes data cleaning, but also data transformation, both being necessary to perform the core steps of data analysis and visualization. This is a complex process, which involves many steps. You need many packages and functions to perform those tasks. This is where a common package design and API comes into play: „A powerful strategy for solving complex problems is to combine many simple pieces“, says the tidyverse manifesto. For a coding workflow, this means:

  • compose single functions with the pipe
  • design your API so that it is easy to use by humans

The latter bullet point is helpful to achieve the first bullet point.

The sj-packages, namely sjPlot (data visualization), sjmisc (labelled data and data transformation) and sjstats (common statistical functions, also for regression modelling), are revised accordingly, to follow this design philosophy (as far as possible and feasible).

This last paragraph is significant. It presents three different R packages: one for data visualization, one for labelled data and data transformation, and finally one for common statistical functions and regression modelling. They all share the sj prefix of the sj-packages. The %>% pipe operator is explained as follows:

PIPE-FRIENDLY FUNCTIONS AND TIDY DATA

 

The „pipe-operator“ (%>%) was introduced by the magrittr-package and aims at a semantical change of your code, making reading and writing code more intuitive. The %>% simply takes the left-hand side of an input and puts its result as first argument to the right-hand side.
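To make the description above concrete, here is a minimal sketch (my own, not taken from the original post) comparing a nested call with the equivalent piped call. It assumes the magrittr package is installed; the variable names are only for illustration.

```r
library(magrittr)  # provides the %>% operator

x <- c(4, 9, 16)

# Nested call: has to be read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped call: reads left to right; each result becomes the
# first argument of the function on the right-hand side
piped <- x %>% sqrt() %>% mean() %>% round(1)

# Both forms compute the same value
identical(nested, piped)
```

Both versions do the same work; the pipe only changes the order in which the steps are written down.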

This kind of semantic code change can be helpful on many occasions for the developer or data engineer, for instance to pass the data along as the first argument after a tidying step:

When doing data „tidying“ and transformation, the result of a left-hand side function is usually a data frame (for instance, when working with packages like dplyr or tidyr). Hence, the first argument of a function following the tidyverse-philosophy should always be the data.
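The "data first" convention can be sketched roughly as follows. This is my own illustration, not code from the original post: `add_ratio()` is a hypothetical helper, and base R's `subset()` and `head()` stand in for the dplyr or tidyr verbs that the quote mentions, since they also take the data frame as their first argument.

```r
library(magrittr)

# Hypothetical helper written "data first": it takes a data frame
# as its first argument and returns a data frame, so it can sit
# anywhere in a pipe.
add_ratio <- function(data, num, den) {
  data$ratio <- data[[num]] / data[[den]]
  data
}

# Each step receives the data frame produced by the previous step
result <- mtcars %>%
  subset(cyl == 4) %>%       # keep four-cylinder cars only
  add_ratio("hp", "wt") %>%  # add horsepower-to-weight ratio
  head(3)                    # first three rows

result
```

Because every function both accepts and returns a data frame, the steps can be reordered or extended without rewriting the calls around them.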

The sjPlot Package

The sjPlot package presented in the post (I wonder whether sj stands for Strenge Jacke, the package author being the blogger himself; that may well be true, and the full documentation provided hints at it) is for data visualization:

USING SJPLOT IN A PIPE-WORKFLOW

The sjPlot-package for data visualization (see comprehensive documentation here) already included some functions that required the data as first argument (e.g. sjp.likert() or sjp.pca()). Other commonly used functions, however, do not follow this design: sjp.frq() requires a vector, and sjp.xtab() requires two vectors, but no data frame, which makes them impossible to integrate seamlessly into a pipe-workflow.

THE SJPLOT()-FUNCTION

On the other hand, to quickly create figures for data exploration, it is often more feasible to just pass a vector to a function, instead of a prepared data frame. For this reason, I decided not to revise these functions and change their argument-structure. Instead, the latest sjPlot-update got a new wrapper function, sjplot(), which allows an easy integration of sjPlot-functions into a pipe-workflow. Since sjplot() is generic, you have to specify the plot-function via the fun-argument.
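The general idea of such a wrapper can be sketched as below. This is emphatically not the sjPlot implementation or its argument conventions: `pipe_wrap()`, `freq_table()` and `cross_table()` are hypothetical names of my own, and plain frequency tables stand in for plots so that the result is easy to inspect. The only point is to show how a data-first wrapper can hand columns over to functions that expect bare vectors.

```r
library(magrittr)

# Hypothetical vector-based functions: they expect vectors,
# not a data frame, so they do not fit into a pipe directly.
freq_table  <- function(x) table(x)
cross_table <- function(x, grp) table(x, grp)

# Hypothetical wrapper: data first, then column names as strings,
# then the function to call, given via the `fun` argument.
pipe_wrap <- function(data, ..., fun) {
  cols <- c(...)                                  # column names
  args <- lapply(cols, function(nm) data[[nm]])   # columns as vectors
  do.call(fun, args)                              # fun(vec1, vec2, ...)
}

# One vector as input
one_way <- mtcars %>% pipe_wrap("cyl", fun = freq_table)

# Two vectors as input work in the same manner
two_way <- mtcars %>% pipe_wrap("cyl", "gear", fun = cross_table)
```

The vector-based functions stay untouched and can still be called directly for quick exploration, while the wrapper provides the data-first entry point that a pipe needs.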

Functions that require multiple vectors as input – like sjp.scatter() or sjp.grpfrq() – work in the same manner:

(…)

Final remarks

 

The post goes on with further examples, such as the plot_grid function, which arranges multiple plots into a single grid layout, along with its possibilities and disadvantages. But I leave it there for now and recommend that readers follow through the scripts, click the packages’ links and further their data science and engineering practice.

The author also had some final words to leave us with:

FINAL WORDS

These were some examples of how to use the sjPlot-package in a pipe-workflow. Feel free to add your comments, further suggestions, either here or (preferably) at GitHub.

featured image: Pipe-friendly workflow with sjPlot, sjmisc and sjstats, part 1 #rstats #tidyverse
