# Chapter 15 Univariate Analysis

## 15.1 Measurement Scales

We have two kinds of variables:

• Qualitative, or Attribute, or Categorical, Variable: A variable that categorizes or describes an element of a population. Note: Arithmetic operations, such as addition and averaging, are not meaningful for data resulting from a qualitative variable.

• Quantitative, or Numerical, Variable: A variable that quantifies an element of a population. Note: Arithmetic operations, such as addition and averaging, are meaningful for data resulting from a quantitative variable.

Qualitative and quantitative variables may be further subdivided:

Nominal Variable: A qualitative variable that categorizes (or describes, or names) an element of a population.

Ordinal Variable: A qualitative variable that incorporates an ordered position, or ranking.

Discrete Variable: A quantitative variable that can assume a countable number of values. Intuitively, a discrete variable can assume values corresponding to isolated points along a line interval. That is, there is a gap between any two values. One example: binary variable (0-1).

Continuous Variable: A quantitative variable that can assume an uncountable number of values. Intuitively, a continuous variable can assume any value along a line interval, including every possible value between any two values.
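These four scale types map naturally onto R's own data types; a small illustrative sketch (all variable names and values here are made up for the example):

```r
color  <- factor(c("red", "green", "red"))        # nominal: names only
grade  <- factor(c("low", "high", "mid"),
                 levels = c("low", "mid", "high"),
                 ordered = TRUE)                  # ordinal: ordered categories
count  <- c(0L, 1L, 1L, 0L)                       # discrete: countable (binary)
weight <- c(2.54, 3.17, 2.98)                     # continuous: any value in an interval

is.ordered(grade)   # ordinal variables carry an ordering
mean(weight)        # averaging is meaningful only for quantitative variables
```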

## 15.2 Central Tendency

We can use many different statistics to describe the central tendency of a given distribution.

### 15.2.1 Arithmetic mean

The arithmetic mean (mean) is the most common measure of central tendency.

The mean is the sum of the values divided by the number of values, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. Unfortunately, it is easily affected by extreme values (outliers).

• It requires at least the interval scale.
• All values are used.
• It is unique.
• It is easy to calculate and allows easy mathematical treatment.
• The arithmetic mean is the only measure of central tendency for which the sum of the deviations of each value from the mean is zero.
• It is easily affected by extremes, such as very big or small numbers in the set (non-robust).
• For data stored in frequency tables, use the weighted mean!
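As the last point suggests, data stored in a frequency table call for the weighted mean; a minimal sketch with a made-up table of values and their frequencies:

```r
# Hypothetical frequency table: value x observed n times
x <- c(10, 20, 30)
n <- c(5, 3, 2)

# Weighted mean: sum(x * n) / sum(n) = 170 / 10 = 17
weighted.mean(x, w = n)
## [1] 17
```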

Let’s calculate the mean for the miles per gallon variable (“mtcars” data):

mean(mtcars$mpg)
## [1] 20.09062

### 15.2.3 Mode

Mode is a measure of central tendency: the value that occurs most often. It is not affected by extreme values!

It can be used for both numerical and categorical data!

There may be no mode!

There may be several modes!!
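Without any extra packages, a (multi-)mode can be computed with base R's table(); a minimal sketch (the function name mode_values is made up, since base R's mode() does something unrelated):

```r
# Returns all values tied for the highest frequency
mode_values <- function(x) {
  tab <- table(x)
  as.numeric(names(tab)[tab == max(tab)])
}

mode_values(mtcars$mpg)  # mpg has several modes, each occurring twice
```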

Mode function is included in “DescTools” library (not in Base R):

library("DescTools")
Mode(mtcars$mpg)
## [1] 10.4 15.2 19.2 21.0 21.4 22.8 30.4
## attr(,"freq")
## [1] 2

### 15.2.4 Quantiles

Quantiles are values that split sorted data or a probability distribution into equal parts. In general terms, a q-quantile divides sorted data into q parts.

The most commonly used quantiles have special names:

• Quartiles (4-quantiles): Three quartiles split the data into four parts.
• Deciles (10-quantiles): Nine deciles split the data into 10 parts.
• Percentiles (100-quantiles): 99 percentiles split the data into 100 parts.

There is always one fewer quantile than there are parts created by the quantiles.

#### 15.2.4.1 Quartiles

Quartiles split the ranked data into 4 segments with an equal number of values per segment:

The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger. Q2 is the same as the median (50% are smaller, 50% are larger). Only 25% of the observations are greater than the third quartile!

Let’s calculate the quartiles for the mpg variable using the quantile function:

quantile(mtcars$mpg)
##     0%    25%    50%    75%   100%
## 10.400 15.425 19.200 22.800 33.900

#### 15.2.4.2 Deciles

Further division of a distribution into a number of equal parts is sometimes used; the most common of these are deciles, percentiles, and fractiles.

Deciles divide the sorted data into 10 sections.

Now, let’s calculate all deciles for the mpg variable:

quantile(mtcars$mpg, probs = seq(0.1, 0.9, by = 0.1))
##   10%   20%   30%   40%   50%   60%   70%   80%   90%
## 14.34 15.20 15.98 17.92 19.20 21.00 21.47 24.08 30.09

A single percentile works the same way, e.g. the 37th percentile:

quantile(mtcars$mpg, probs = 0.37)
##    37%
## 17.535

#### 15.2.4.4 Fractiles

A fractile is the same idea expressed as a proportion rather than a percentage. For example, the 30th percentile is the 0.30 fractile.
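In R, percentiles and fractiles are the same quantile() call, since probs is already given as a proportion:

```r
# The 30th percentile of mpg, i.e. the 0.30 fractile
quantile(mtcars$mpg, probs = 0.30)
##   30%
## 15.98
```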

## 15.3 Dispersion

Measures of variation give us information on the spread (dispersion or variability) of the data values.

We can have the same center but different variation!

### 15.3.1 Range

Range is the simplest measure of variation: the difference between the largest and the smallest observations.

Range ignores the way in which the data are distributed and is very sensitive to outliers.

Let’s calculate the range for the mpg variable:

max(mtcars$mpg) - min(mtcars$mpg)
## [1] 23.5

A more robust alternative is the interquartile range (IQR), the difference between the third and the first quartile:

IQR(mtcars$mpg)
## [1] 7.375

### 15.3.3 Variance

Variance is a measure of spread: the average of the squared deviations of the values from the mean,

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

The sample variance is (approximately) the average of the squared deviations of the values from the sample mean, dividing by $n - 1$ instead of $n$:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

Let’s calculate the sample variance for the mpg variable:

var(mtcars$mpg)
## [1] 36.3241

The standard deviation is the square root of the variance, expressed in the same units as the data:

sd(mtcars$mpg)
## [1] 6.026948
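The sample-variance formula can be checked directly against R's built-in var():

```r
# Sample variance computed from its definition, compared with var()
x <- mtcars$mpg
n <- length(x)
manual <- sum((x - mean(x))^2) / (n - 1)
all.equal(manual, var(x))
## [1] TRUE
```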

### 15.3.5 % Variability

Many times it is easier to interpret volatility by simply converting the standard deviation into the percentage (relative) spread around the mean.

The coefficient of variability measures relative variation.

It can be used to compare two or more sets of data measured in different units.

The formula for CV:

$$CV = \frac{s}{\bar{x}} \cdot 100\%$$

Let’s write our own function to calculate it for the mpg variable:

cv <- function(variable) {
  sd(variable) / mean(variable) * 100
}
cv(mtcars$mpg)
## [1] 29.99881
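Because the CV is unitless, it can compare the spread of variables measured in different units; a short sketch comparing mpg (miles per gallon) with wt (weight in 1000 lbs) from mtcars:

```r
# CV as a percentage: relative spread around the mean
cv <- function(variable) sd(variable) / mean(variable) * 100

cv(mtcars$mpg)  # about 30%
cv(mtcars$wt)   # also about 30%: similar relative spread despite different units
```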

### 15.9.2 Winsorized mean

The winsorized mean, often mistakenly called the “windsor mean” :), is a statistical measure of central tendency close to the usual arithmetic mean or median, and most similar to the trimmed mean. It is calculated in the same way as the arithmetic mean, after replacing selected extreme observations (a predetermined number of the smallest and largest values in the sample) with the minimum and maximum values of the remaining part.

This procedure is sometimes called winsorisation. This name (and the name of the average) comes from the surname of the statistician Charles Winsor (1895-1951).

Typically, 10 to 25 percent of the observations at each end of the distribution are replaced. When the coefficient is 0 percent, the winsorized mean reduces to the arithmetic mean; when all observations except one or two are replaced, it approaches the median.

The winsorized mean is more robust to outliers than the arithmetic mean, but less robust than the median. It is an example of a robust estimate of the arithmetic mean in the population; however, for asymmetric distributions it is not an unbiased estimator. An additional disadvantage, compared to the trimmed mean, is the large weight with which estimation errors in the two observations whose values replace the outliers enter the overall error.

Now, let’s calculate the winsorized mean and standard deviation of the mpg distribution:

library(psych)
winsor.mean(mtcars$mpg, trim = 0.2, na.rm = TRUE)
## [1] 19.3675
winsor.sd(mtcars$mpg, trim = 0.2, na.rm = TRUE)
## [1] 3.508253
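A quantile-based winsorization can also be written by hand; a minimal sketch (the function name winsorize is made up) that clamps values below the trim quantile up to it and values above the upper quantile down to it, then takes the ordinary mean:

```r
winsorize <- function(x, trim = 0.2) {
  lo <- quantile(x, trim)        # lower replacement value
  hi <- quantile(x, 1 - trim)    # upper replacement value
  pmin(pmax(x, lo), hi)          # clamp both tails
}

mean(winsorize(mtcars$mpg))
## [1] 19.3675
```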

### 15.9.5 IQR deviation

We can also focus our attention on the middle of the distribution and look at measures based on the interquartile range (IQR):

Interquartile deviation $Q_x$ (half the interquartile range):

$$Q_x = \frac{Q_3 - Q_1}{2}$$

Interquartile coefficient of variability $V_x$, based on quartiles 1, 2 and 3:

$$V_x = \frac{Q_3 - Q_1}{Q_2}$$
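Both measures are easy to compute for mpg from the quartiles (the names Qx and Vx follow the text):

```r
# Quartiles of mpg: Q1, Q2 (median), Q3
q  <- quantile(mtcars$mpg, c(0.25, 0.50, 0.75))

Qx <- (q[[3]] - q[[1]]) / 2       # interquartile deviation: (Q3 - Q1) / 2
Vx <- (q[[3]] - q[[1]]) / q[[2]]  # variability relative to the median

Qx
## [1] 3.6875
Vx
## [1] 0.3841146
```

These match the Sx (3.69) and IQR Var % (0.38) rows of the summary table below.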

## 15.10 Summary reports

Using the kableExtra package we can easily create summary tables with graphics and/or statistics.

##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
##     group_rows

[Table: per-cylinder (4, 6, 8) summary of mpg with inline boxplot, histogram, line and point graphics]

We can finally summarize the basic measures for the mpg variable by number of cylinders using the kable function. You can customize your final report; see some hints here.

Table: (#tab:kable_report2) Miles per gallon per number of cylinders

|           | 4 cylinders | 6 cylinders | 8 cylinders |
|-----------|------------:|------------:|------------:|
| Min       | 10.40 | 10.40 | 10.40 |
| Max       | 33.90 | 33.90 | 33.90 |
| Q1        | 15.43 | 15.43 | 15.43 |
| Median    | 19.20 | 19.20 | 19.20 |
| Q3        | 22.80 | 22.80 | 22.80 |
| Mean      | 20.09 | 20.09 | 20.09 |
| Sd        | 6.03 | 6.03 | 6.03 |
| IQR       | 7.38 | 7.38 | 7.38 |
| Sx        | 3.69 | 3.69 | 3.69 |
| Var %     | 0.30 | 0.30 | 0.30 |
| IQR Var % | 0.38 | 0.38 | 0.38 |
| Skewness  | 0.61 | 0.61 | 0.61 |
| Kurtosis  | -0.37 | -0.37 | -0.37 |
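Leaving the inline graphics aside, the numeric part of such a per-group report can be sketched in base R alone (the helper name summarise_group is made up):

```r
# One row of basic measures per group of a numeric vector
summarise_group <- function(v) {
  c(min = min(v), q1 = unname(quantile(v, 0.25)), median = median(v),
    mean = mean(v), q3 = unname(quantile(v, 0.75)), max = max(v),
    sd = sd(v), iqr = IQR(v))
}

# Split mpg by cylinder count and apply the summary to each group
round(t(sapply(split(mtcars$mpg, mtcars$cyl), summarise_group)), 2)
```

The result is a matrix with one row per cylinder group (4, 6, 8), which kable can then render as a report table.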

## 15.11 Tutorial

Please work through the tutorial “Univariate description and visualization” written by Bruce Dudek to see how to summarize and visualize different datasets: