Preface

Welcome! If you perform any kind of statistical data analysis with tools from the data science ecosystem, you have probably seen a variety of online tutorials and handbooks (not to mention the traditional bibles of statistics). Unfortunately, most of them are written for a specific purpose or target group. They leave out many data types and difficult aspects, often hiding them in order to focus on the theoretical side of the problem. In this handbook I am not going to omit any of them. What is more, the aspects that are most problematic for a data analyst will be highlighted, explained in detail, and solved using the newest libraries available at the time of writing. Readers who believe Python is the best tool for statistical analysis might see some of R’s advantages by the end of the book.

We will start by preparing the infrastructure for the analysis, followed by a quick tour of R programming basics. Finally, the two most important chapters of this book will walk you through descriptive and inferential statistical data analysis. The online version is interactive: in many places the reader will have the chance to go through tutorials, answer questions, and solve tasks before moving on to the next page.

If you find unsolved problems, missing data types, or mistakes, give me a shout on Twitter using #Karol_Flisik. If you have ideas for ways to improve this book, please consider contributing to the project.

Why `Statistics with R`?

Data analysts from a range of different fields use R and RStudio in their “workshop”. But the definition of the “workshop” is constantly changing and depends on the needs and environment.

Nowadays a workshop may cover only the technical part of the analysis (e.g. data wrangling and cleansing). Some analysts need only quick descriptive reports. Others need a full, detailed statistical review of the database they have.

Philosophy

Karl Pearson, a British mathematician and the father of modern Statistics, is credited with the quote:

“Statistics is the grammar of Science”

Any time there is a data-oriented problem, there should be a best, flexible, and sometimes ready-made solution for it (for example, in an existing library). Usually we end up with statistics. Unfortunately, not everyone needs the most sophisticated reports or the latest technical tools such as R-based code and reports. But one question should always be asked: is this the best and proper solution at all? Is it the best grammar for the story?

This is why all possible technical aspects of the analysis are covered in this handbook [at least I tried!]. It might remind you of a typical cookbook. Many times we will go through an analysis by first considering the theoretical background, then choosing the best technical tools (i.e. libraries in R), and ending with the most important question to answer: which statistics, estimators, and tests should we apply in this particular case?

If you are interested in a different philosophy of approaching and solving statistical analysis, and could improve this project, please contribute!

One example: Jennifer Bryan and Jim Hester offer a very interesting tutorial called “What They Forgot to Teach You About R”. It focuses only on issues that are usually omitted in typical data analysis, such as organizing projects, maintaining R libraries, debugging code, and reproducing problems.

What is in this handbook?

Throughout the book, you will see a variety of (un)solved examples (some of them I will leave for you). Please remember that there are probably many other ways (maybe longer, maybe more complex) leading to the final (and correct!) solution as well.

RStudio makes it easier: it is simply a more convenient environment to work in. When you start a new problem, it’s best to delete all the variables to ensure you don’t accidentally use old data; just type `rm(list = ls())`. Note that you will need to reload your data if you need it again. When you close R, you do not need to save the workspace if it asks. Saving the workspace just saves the variables you have defined.
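As a minimal sketch of the clean-up step described above, the snippet below creates two leftover variables, clears the workspace, and checks that nothing remains. The names `old_data`, `old_result`, and `remaining` are made up for illustration, and the sketch assumes it is run at the top level of an R session (the global environment).

```r
# Two throwaway variables, as if left over from an earlier analysis
# (hypothetical names, used only for this illustration).
old_data <- c(1, 2, 3)
old_result <- mean(old_data)

# Show the objects currently defined in the workspace.
print(ls())

# Remove every object that ls() reports from the current environment.
rm(list = ls())

# ls() is evaluated before the assignment, so it sees an empty workspace.
remaining <- ls()
print(remaining)  # character(0)
```

Keep in mind that `rm(list = ls())` only removes visible objects. It does not unload packages, reset options, or remove objects whose names start with a dot, so restarting the R session is the more thorough way to get a clean slate.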

Watch out for “gotchas” along the way. I tried my best to point out the tricky bits of a workflow as well as the problem we are solving.

In tip boxes like this one, I’ll point out some additional tips, to help you keep your analysis or report looking ✨fresh ✨.

These tips highlight advice and tricks from community members.

Resources

Finally, I want to point out a few pragmatic features:

For every chapter you can download the .Rmd source file simply by clicking on the download button in the upper toolbar.
The button to the left allows you to edit.
Clicking on the GitHub icon on the far right takes you to the source repository for this book.
How to search within this book: click on the to see the instructions.

About me

Karol Flisikowski works as Assistant Professor at the Department of Statistics and Econometrics at Gdańsk University of Technology (Poland). He is responsible for lecturing descriptive and mathematical statistics (Statistics I-II) and Data Analysis. Recently his research interests have focused on studies of credit constraints. As the dean’s proxy for e-learning development he provides support for faculty members, organizes training, monitors online courses, and evaluates their quality. As the university coordinator for e-learning he has been involved in the certification of lecturers, which results in a certificate confirming skills in the field of e-course design (under current ministry-level regulations it is obligatory for each faculty member). Besides that, since 2015 Karol has worked as the university administrator of the Moodle platform (http://enauczanie.pg.edu.pl), taking care of technical processes such as registration, the logical division of courses within the category of faculty/center, and backup. He is also the founder and a member of the newly opened Center for Modern Education at Gdańsk University of Technology (2020). In June 2020 Karol received a “Masters of Didactics” grant from the Ministry of Higher Education and Science for the years 2020-2022 (GUT).

Department of Statistics and Econometrics, Gdańsk Tech, https://flisikowski.eu/

Statistics with R

A Handbook for Statistical Analysis with R and R-Studio