This vignette shows how to use the tidycharts package. It contains examples of different chart types and tips for proper data visualization.
This package and vignette are created for a user who:
# install from CRAN
# install.packages('tidycharts')
library(tidycharts)
Bar charts should be used to show structure at a single point in time. A typical use case is to visualize the profit of a company broken down by department. The data could look like this:
data_barchart <- data.frame(
  dep = c('Services', 'Production', 'Marketing', 'Purchasing'),
  profit = c(17, 15, 2, -3),
  operational = c(9, 7, 1.5, -0.4),
  property = c(4, 4, 0.5, -0.6),
  bonus = c(4, 4, 0, -2),
  prev_year = c(10, 16, 4, -1),
  plan = c(11, 13, 2, -2.5)
)
In the example data, operational, property and bonus are parts of profit and sum up to it.
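The relationship between the parts and the total can be checked with a few lines of base R. This is only a minimal sketch: the names profit_check, parts_sum and parts_match are introduced here for illustration, and the figures are copied from data_barchart so that the snippet runs on its own.

```r
# Same figures as in data_barchart, repeated so the snippet is self-contained.
profit_check <- data.frame(
  profit = c(17, 15, 2, -3),
  operational = c(9, 7, 1.5, -0.4),
  property = c(4, 4, 0.5, -0.6),
  bonus = c(4, 4, 0, -2)
)

# Sum the three components row by row and compare them with the total.
parts_sum <- rowSums(profit_check[c('operational', 'property', 'bonus')])
parts_match <- isTRUE(all.equal(unname(parts_sum), profit_check$profit))
parts_match
```

A check like this is worth running before drawing stacked charts, where the parts are expected to add up to the whole.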
Creating a bar chart is simple: we use the bar_chart function. After the function is called, the chart is printed automatically. It can also be assigned to a variable as a one-element character vector with SVG content.
bar_chart(data = data_barchart, cat = 'dep', series = 'profit')
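Because the chart is returned as SVG text, it can presumably be stored in a variable and written to a file with base R. The sketch below assumes tidycharts is installed; the variable names and the temporary file path are illustrative only, and the data is a subset of data_barchart so the snippet runs on its own.

```r
library(tidycharts)

# Minimal data, a subset of data_barchart, so the snippet is self-contained.
profit_by_dep <- data.frame(
  dep = c('Services', 'Production', 'Marketing', 'Purchasing'),
  profit = c(17, 15, 2, -3)
)

# Assigning the result to a variable keeps the chart from being printed.
chart_svg <- bar_chart(data = profit_by_dep, cat = 'dep', series = 'profit')

# Write the SVG text to a file (a temporary path, chosen for illustration).
svg_file <- tempfile(fileext = '.svg')
writeLines(as.character(chart_svg), svg_file)
```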
A plot should have an informative title. We can use the add_title function to add one, chaining the commands with the pipe operator (%>%).
bar_chart(data = data_barchart, cat = 'dep', series = 'profit') %>%
  add_title('The company XYZ', 'Profit', 'in mEUR', 'by departments, 2020')
We can show the structure of each value by specifying a different series argument. It can be a vector of column names. In that case, a stacked bar chart will be generated.
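A stacked variant of the earlier call might look like the sketch below. It assumes tidycharts is installed and that series labels default to the column names when none are given; the names profit_parts and stacked_chart are illustrative, and the figures are copied from data_barchart.

```r
library(tidycharts)

# Components of profit, copied from data_barchart so the snippet is self-contained.
profit_parts <- data.frame(
  dep = c('Services', 'Production', 'Marketing', 'Purchasing'),
  operational = c(9, 7, 1.5, -0.4),
  property = c(4, 4, 0.5, -0.6),
  bonus = c(4, 4, 0, -2)
)

# Passing several column names as `series` requests a stacked bar chart.
stacked_chart <- bar_chart(data = profit_parts, cat = 'dep',
                           series = c('operational', 'property', 'bonus'))

# Evaluate `stacked_chart` at the console (or print it) to display the chart.
```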
A normalized bar chart should be used to show the proportions of parts within each category. A typical use is to visualize the percentage structure of profit in each department of a company.
bar_chart_normalized(data = data_barchart, cat = 'dep',
                     series = c('operational', 'property', 'bonus'),
                     series_labels = c('op', 'prop', 'bon')) %>%
  add_title('The company XYZ', 'Profit', 'in mEUR', 'normalized in department')
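To see the numbers behind such a chart, the share of each part within a department can be computed in base R. This is a sketch of the arithmetic only; how tidycharts normalizes the data internally, especially for negative values, is not covered here. The names dep_parts, part_cols and shares are illustrative, and the figures are copied from data_barchart.

```r
# Components of profit, copied from data_barchart so the snippet is self-contained.
dep_parts <- data.frame(
  dep = c('Services', 'Production', 'Marketing', 'Purchasing'),
  operational = c(9, 7, 1.5, -0.4),
  property = c(4, 4, 0.5, -0.6),
  bonus = c(4, 4, 0, -2)
)
part_cols <- c('operational', 'property', 'bonus')

# Divide every part by the row total so that each row sums to 1.
shares <- dep_parts[part_cols] / rowSums(dep_parts[part_cols])
rownames(shares) <- dep_parts$dep

# Express the shares as percentages.
round(100 * shares, 1)
```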
We use reference values (indices) to show a reference level on the plot. In the following example, an index line shows the best result from the previous year (PY).
bar_chart_reference(data = data_barchart,
                    cat = 'dep', series = 'profit',
                    ref_val = 10, ref_label = 'PY best result') %>%
  add_title('The company XYZ', 'Profit', 'in mEUR',
            'with reference value of 10 mEUR')
To visualize 2 or 3 series of data that do not sum up to a meaningful value, a grouped bar chart should be used. The first series is drawn as bars in the foreground, the second as bars in the background, and the third as triangles. The style of the bars and triangles indicates the type of data, the so-called scenarios.
The most typical use case of this chart is to visualize the profit of different departments in a company compared with budget and previous-year data.
# the column names in the styles data frame are not important,
# only the order of the columns matters
styles <- data.frame(style_foreground = rep('actual', 4),
                     style_background = rep('previous', 4),
                     style_markers = rep('plan', 4))
bar_chart_grouped(data = data_barchart,
                  cat = 'dep',
                  foreground = 'profit',
                  background = 'prev_year',
                  markers = 'plan',
                  series_labels = c('PY', 'AC', 'PL'),
                  styles = styles) %>%
  add_title('The company XYZ', 'Profit', 'in mEUR',
            'compared to different scenarios')
Show variance in data using relative variance plots or absolute variance plots. Define the baseline and the real values. The axis on variance plots is styled according to the scenario of the baseline data. A relative variance plot shows the difference in percent, and an absolute variance plot shows it in base units.
Example usage: visualize the difference between two scenarios broken down by department.
bar_chart_absolute_variance(data = data_barchart,
                            cat = 'dep',
                            baseline = 'plan',
                            real = 'profit',
                            y_title = 'Plan vs. actual',
                            y_style = 'plan') %>%
  add_title('The company XYZ', 'Profit variance', 'in mEUR',
            'between plan and actual')
bar_chart_relative_variance(data = data_barchart,
                            cat = 'dep',
                            baseline = 'plan',
                            real = 'profit',
                            y_title = 'Plan vs. actual',
                            y_style = 'plan') %>%
  add_title('The company XYZ', 'Profit variance', 'in %',
            'between plan and actual (plan=100%)')
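The two kinds of variance can be sketched in base R as follows. This illustrates the arithmetic only and is not necessarily what tidycharts computes internally. In particular, dividing by the absolute value of the baseline is an assumption made here so that a result worse than a negative plan gets a negative sign. The names plan_vs_actual, abs_var and rel_var are illustrative, and the figures are copied from data_barchart.

```r
# Plan and actual profit, copied from data_barchart so the snippet is self-contained.
plan_vs_actual <- data.frame(
  dep = c('Services', 'Production', 'Marketing', 'Purchasing'),
  plan = c(11, 13, 2, -2.5),
  profit = c(17, 15, 2, -3)
)

# Absolute variance: real value minus baseline, in base units (mEUR).
abs_var <- plan_vs_actual$profit - plan_vs_actual$plan

# Relative variance in percent of the baseline.
# Assumption: abs() keeps the sign meaningful when the baseline is negative.
rel_var <- 100 * abs_var / abs(plan_vs_actual$plan)

data.frame(dep = plan_vs_actual$dep, abs_var, rel_var = round(rel_var, 1))
```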
For demonstration we will use the mtcars dataset, which is built into R.
scatter_data <- mtcars[c('hp','qsec','cyl', 'wt')]
Scatter plots, also known as point plots, are used to visualize multidimensional relationships between variables. Therefore, they are used extensively in exploratory data analysis.
In a scatter plot, 2 numerical dimensions are visualized by the position of a point on the Cartesian plane.
scatter_plot(scatter_data, x = 'hp', y = 'qsec',
             legend_title = '',
             x_names = c('Horsepower', 'in hp'),
             y_names = c('1/4 mile time', 'in s')) %>%
  add_title('The mtcars dataset', '', '', '')
Optionally, a categorical dimension can be added in the form of point color.
scatter_plot(scatter_data, x = 'hp',
             y = 'qsec',
             cat = 'cyl',
             legend_title = 'No. cylinders',
             x_names = c('Horsepower', 'in hp'),
             y_names = c('1/4 mile time', 'in s')) %>%
  add_title('The mtcars dataset', '', '', '')
Bubble plots can visualize the same dimensions as scatter plots. In addition, a third numeric dimension can be added in the form of point size. However, there is a trade-off between the dimensionality and size of your data and the readability of the generated plots, so be careful when using bubble charts.
scatter_plot(scatter_data, x = 'hp',
             y = 'qsec',
             cat = 'cyl',
             bubble_value = 'wt',
             legend_title = 'No. cylinders',
             x_names = c('Horsepower', 'in hp'),
             y_names = c('1/4 mile time', 'in s')) %>%
  add_title('The mtcars dataset', '', '', '')
Charts with vertical columns are intended to visualize time series data. Note that the column width depends on the x-axis interval: the longer the interval, the wider the column. As a general guideline, plot up to 24 columns. If your data has more than 24 time points, see the line chart section.
Here is what an example data frame for a column chart could look like:
data_time_series <- data.frame(
  time = month.abb,
  Poland = 2 + 0.5 * sin(1:12),
  Germany = 3 + sin(3:14),
  Slovakia = 2 + 2 * cos(1:12)
)
The time column consists of the three-letter abbreviations of the English month names. The other columns consist of artificial data, which could be, for example, sales in different countries.
Use the basic column chart to make a simple visualization of a time series. Pass the interval parameter to change the width of the columns.
A typical task for this kind of plot could be the following: show sales from different countries over the months.
column_chart(data_time_series, x = 'time',
             series = c('Poland', 'Germany', 'Slovakia'), interval = 'months') %>%
  add_title('The company XYZ', 'Profit', 'in mEUR', 'by country, 2020')
To visualize contribution, waterfall charts can be used. We need to transform the data a little before passing it to the plotting function.
Example usage: visualize the contribution of monthly sales to yearly sales.
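library(dplyr)  # provides group_by, summarise, arrange and mutate
# total sales per month, put in calendar order, then accumulated over the year
df_summarized <- data_time_series %>%
  group_by(time) %>%
  summarise(Sales = sum(Poland, Germany, Slovakia)) %>%
  arrange(match(time, month.abb)) %>%
  mutate(Sales = round(cumsum(Sales), 2))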
column_chart_waterfall(df_summarized, x = 'time', series = 'Sales') %>%
  add_title('The company XYZ', 'Profit', 'in mEUR', 'cumulative, 2020')
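For readers who prefer to avoid the dplyr dependency, the same transformation can be sketched in base R. The names sales, monthly_total and sales_cumulative are illustrative; the data is the same artificial series as in data_time_series, repeated so that the snippet runs on its own, and the result should hold the same values as df_summarized.

```r
# Same artificial data as data_time_series, repeated so the snippet is self-contained.
sales <- data.frame(
  time = month.abb,
  Poland = 2 + 0.5 * sin(1:12),
  Germany = 3 + sin(3:14),
  Slovakia = 2 + 2 * cos(1:12)
)

# Total sales per month across the three countries.
monthly_total <- unname(rowSums(sales[c('Poland', 'Germany', 'Slovakia')]))

# Running total over the year, rounded to two decimals.
sales_cumulative <- data.frame(
  time = sales$time,
  Sales = round(cumsum(monthly_total), 2)
)
sales_cumulative
```

The rows are already in calendar order here, so no reordering step is needed.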
Other types of column charts are available, e.g. column_chart_grouped or column_chart_normalized. The same data visualization rules apply to them as to bar charts. Feel free to explore them, and see the reference page if you need help.
Line charts, like column charts, should be used to show time series data. Some line charts, however, require a more complicated data structure.
The basic line chart uses lines with markers to show the data. A typical use is to visualize several data series that do not sum up to a meaningful value, for example the market value of different companies over the years.
set.seed(123)
data_companies <- data.frame(
  time = 2010:2020,
  Alpha.inc = 25 + round(1:11 + rnorm(11)),
  Beta.inc = round(40 + rnorm(11, sd = 2)),
  Gamma.inc = round(50 + rnorm(11, sd = 5))
)
line_chart_markers(data_companies, x = 'time',
                   series = c('Alpha.inc', 'Beta.inc', 'Gamma.inc'),
                   series_labels = c('Alpha.inc', 'Beta.inc', 'Gamma.inc'),
                   interval = 'years') %>%
  add_title('Some companies', 'Market value', 'in mEUR', '2010...2020')
Use a dense line plot to visualize up to 6 time series with more than one point per category on the x-axis. More advanced users are encouraged to use the line_chart_dense_custom function, where they can choose which points are highlighted with a value label.
The most typical example is data with a time granularity of one day, such as the mean daily temperature over the course of a year.
data_dense <- data.frame(
  dates = seq.Date(as.Date('2019-01-01'), as.Date('2019-12-31'), by = 1),
  Warsaw = 7 + 9 * sin((365:1 - 60) / 365 * 2 * pi) + rnorm(365, sd = 2),
  London = 8 + 5 * sin((365:1 - 55) / 365 * 2 * pi) + rnorm(365, sd = 1.2)
)
line_chart_dense(data_dense, 'dates', c('Warsaw', 'London')) %>%
  add_title('Temperature in European Cities', 'Daily mean', 'in deg. C', 'In 2019')
One may wonder which type of plot to choose: a line chart or a column chart. The answer depends on the data. If you want to visualize only one series, both line and column charts are appropriate. The differences grow as the number of series increases. If the sum of the series is meaningful, use stacked columns, or optionally stacked lines. If not, use a line plot.
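The guidelines from this vignette can be condensed into a small helper. The function suggest_chart below is hypothetical and not part of tidycharts; it only encodes the rules of thumb stated above (up to 24 columns, a single series, whether the sum of the series is meaningful) and is meant as a sketch, not as a definitive rule.

```r
# Hypothetical helper, not part of tidycharts: encodes the rules of thumb
# from this vignette for choosing between line and column charts.
suggest_chart <- function(n_points, n_series, sum_is_meaningful = FALSE) {
  # More than 24 time points: columns become too narrow, prefer lines.
  if (n_points > 24) {
    return('line chart')
  }
  # A single series can be shown either way.
  if (n_series == 1) {
    return('line or column chart')
  }
  # Several series whose sum is meaningful: stack them.
  if (sum_is_meaningful) {
    return('stacked column chart')
  }
  # Several series that do not add up to anything meaningful.
  'line chart'
}

suggest_chart(n_points = 12, n_series = 3, sum_is_meaningful = TRUE)
suggest_chart(n_points = 365, n_series = 2)
```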