This document demonstrates how to generate a simulated data set and visualize the relationship between age and Body Mass Index (BMI) using R. The analysis involves creating hypothetical observations for height, weight, and age, calculating BMI, fitting a simple linear regression, and producing a scatter plot with a regression line. The goal is to illustrate a simple workflow for data generation, transformation, analysis, and visualization in R using the tidyverse, which includes ggplot2.
The analysis used dplyr for data manipulation and ggplot2 for data visualisation, both of which are core tidyverse packages. So, rather than load the individual packages, I loaded only the tidyverse package, which attaches them all.
The ggpmisc package was used to add the regression equation and R² values directly on the scatter plot.
library("tidyverse")
## Warning: package 'lubridate' was built under R version 4.5.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.1
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("ggpmisc")
## Loading required package: ggpp
## Warning: package 'ggpp' was built under R version 4.5.3
## Registered S3 methods overwritten by 'ggpp':
## method from
## heightDetails.titleGrob ggplot2
## widthDetails.titleGrob ggplot2
##
## Attaching package: 'ggpp'
##
## The following object is masked from 'package:ggplot2':
##
## annotate
Because the height and weight values were generated randomly, I used
the set.seed() function so that the same random numbers are produced
every time the code is run. Since this document is intended to be
shared, setting the seed also ensures that readers who run the code
obtain the same values.
set.seed(123)
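As a quick illustration of why the seed matters, the sketch below draws the same sample twice, resetting the seed before each draw, and checks that the two draws match. The names first_draw and second_draw are illustrative and are not part of the analysis. The final set.seed(123) call is there only so that code run afterwards starts from the same state as the analysis.

```r
# Reset the seed before each draw so both calls start from the same state
set.seed(123)
first_draw <- sample(1:100, size = 5)

set.seed(123)
second_draw <- sample(1:100, size = 5)

# TRUE when the same seed reproduces the same sequence of random numbers
identical(first_draw, second_draw)

# Restore the seed used by the analysis before continuing
set.seed(123)
```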
A sequence of possible heights ranging from 2 to 6.5 feet, in
increments of 0.1 feet, was first created using the seq() function.
From this sequence, 15 observations were randomly selected with
replacement using the sample() function to simulate the heights for
this hypothetical study.
The argument replace = TRUE allows the same value to be
selected more than once.
height <- sample(seq(from = 2, by = 0.1, to = 6.5), size = 15, replace = TRUE)
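The effect of sampling with replacement can be checked without drawing any new random numbers. The sketch below counts the candidate heights and then looks for repeated values among the heights in meters, which are copied from the data frame printed later in this document. The names height_options and height_m are illustrative.

```r
# Candidate heights in feet: 2.0, 2.1, ..., 6.5
height_options <- seq(from = 2, to = 6.5, by = 0.1)
length(height_options)  # number of distinct candidate values

# Heights in meters, copied from the printed data frame
height_m <- c(1.52400, 1.03632, 1.00584, 0.67056, 1.85928,
              1.88976, 1.70688, 1.00584, 1.34112, 1.37160,
              1.40208, 0.73152, 1.40208, 1.43256, 0.85344)

# Values that occur more than once; repeats can only arise
# when sampling with replacement
repeated_heights <- unique(height_m[duplicated(height_m)])
repeated_heights
```

Because the 15 draws come from a larger pool of candidates, replacement is not strictly required here; it simply allows two respondents to share the same height.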
Because BMI calculations require height in meters, the generated height values were converted from feet to meters.
The conversion factor used was:
1 foot = 0.3048 meters
Each height value was multiplied by 0.3048.
height <- height * 0.3048
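The conversion can also be wrapped in a small helper, sketched below. The function name feet_to_meters is hypothetical and does not appear in the analysis; the example converts the two endpoints of the simulated range plus 5 feet as a reference point.

```r
# Hypothetical helper: convert feet to meters (1 foot = 0.3048 meters)
feet_to_meters <- function(feet) {
  feet * 0.3048
}

# Endpoints of the simulated height range, plus 5 feet as a reference
converted <- feet_to_meters(c(2, 5, 6.5))
converted
```

One thing to keep in mind with the in-place version used above is that running the conversion line a second time would convert the values twice. Storing the result under a new name would avoid that.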
A vector of weight values was created to represent the body weight of respondents. Weights ranged from 23 kg to 75.5 kg, increasing in increments of 0.5 kg. Similar to the height generation process, 15 observations were randomly sampled with replacement.
weight <- sample(seq(from = 23, by = 0.5, to = 75.5), size = 15, replace = TRUE)
weight
## [1] 69.0 72.0 58.5 35.5 26.0 43.5 27.0 64.0 40.5 61.5 63.0 44.0 74.0 60.5 30.0
The ages of the respondents were then specified manually using the
c() function. This vector contained 15 age values,
corresponding to the number of individuals in the data set.
age <- c(28, 35, 19, 44, 50, 23, 29, 41, 38, 22, 36, 57, 48, 31, 25)
age
## [1] 28 35 19 44 50 23 29 41 38 22 36 57 48 31 25
The vectors for height, weight, and
age were combined into a single data frame using the
data.frame() function. This structure allows the variables
to be organized in tabular format where each row represents a respondent
and each column represents a variable.
The data frame was stored in a variable named resp.
resp <- data.frame(height, weight, age)
resp
## height weight age
## 1 1.52400 69.0 28
## 2 1.03632 72.0 35
## 3 1.00584 58.5 19
## 4 0.67056 35.5 44
## 5 1.85928 26.0 50
## 6 1.88976 43.5 23
## 7 1.70688 27.0 29
## 8 1.00584 64.0 41
## 9 1.34112 40.5 38
## 10 1.37160 61.5 22
## 11 1.40208 63.0 36
## 12 0.73152 44.0 57
## 13 1.40208 74.0 48
## 14 1.43256 60.5 31
## 15 0.85344 30.0 25
BMI was calculated using the standard formula: \[BMI = \frac{weight}{height^2}\]
Weight was measured in kilograms and height in meters. The BMI values were added as a new column in the data set.
resp$BMI <- resp$weight / (resp$height^2)
For clarity and consistency, BMI values were rounded to two decimal
places using the round() function.
resp$BMI <- round(resp$BMI, 2)
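As a worked check of the formula, the sketch below computes BMI for the first respondent in the printed data frame (69.0 kg and 1.524 m). The variable names are illustrative.

```r
# First respondent in the printed data frame
weight_kg <- 69.0
height_m  <- 1.524

# BMI = weight / height^2, rounded to two decimal places
bmi_first <- round(weight_kg / height_m^2, 2)
bmi_first
```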
The BMI variable was renamed from BMI to bmi using the
rename() function.
resp <- resp |>
  rename(bmi = BMI)
Only the variables relevant to the regression analysis (age and bmi)
were retained using the select() function.
resp <- resp |>
  select(age, bmi)
It is worth noting that the data set could have been subset using
base R functions such as subset() to select only the
variables required for the visualisation. Similarly, column names could
have been modified using the colnames() function.
I could also have continued the analysis using the original data frame without creating a reduced version of the data set. However, the tidyverse approach was intentionally used to manipulate the data.
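For comparison, a minimal base R sketch of the same renaming and selection steps is shown below. It uses three rows copied from the printed data frame, and the names resp_demo and resp_base are illustrative only.

```r
# Three rows copied from the printed data frame, for illustration only
resp_demo <- data.frame(
  height = c(1.52400, 1.03632, 1.00584),
  weight = c(69.0, 72.0, 58.5),
  age    = c(28, 35, 19)
)
resp_demo$BMI <- round(resp_demo$weight / resp_demo$height^2, 2)

# Rename BMI to bmi with colnames()
colnames(resp_demo)[colnames(resp_demo) == "BMI"] <- "bmi"

# Keep only age and bmi with subset()
resp_base <- subset(resp_demo, select = c(age, bmi))
resp_base
```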
Simple linear regression analysis was carried out to understand how Body Mass Index (BMI) changes with age.
rlm <- lm(bmi ~ age, data = resp)
The summary() function displays the regression results, including the regression coefficients, R², and p-values.
rlm_summary <- summary(rlm)
rlm_summary
##
## Call:
## lm(formula = bmi ~ age, data = resp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.408 -15.636 -5.467 22.528 32.318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.1296 20.7888 0.728 0.480
## age 0.7160 0.5662 1.264 0.228
##
## Residual standard error: 23.85 on 13 degrees of freedom
## Multiple R-squared: 0.1095, Adjusted R-squared: 0.04102
## F-statistic: 1.599 on 1 and 13 DF, p-value: 0.2283
Note: The regression equation, coefficients,
p-values, and model statistics reported here were dynamically extracted
from rlm_summary and inserted into the text using inline R
code. This ensures that any changes to the data set, model
specification, or random seed are automatically reflected in the output,
eliminating the need for manual updates.
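The sketch below shows one way such values can be pulled out of a summary object. To stay self-contained, it rebuilds the data from the values printed earlier rather than re-running the random sampling. The names coefs, slope, slope_p, intercept, and r_squared are illustrative and are not necessarily the ones used in the source file. In an R Markdown document, a value like this would typically be inserted with an inline expression such as `r round(slope, 2)`.

```r
# Data copied from the printed data frame (height in meters, weight in kg)
height <- c(1.52400, 1.03632, 1.00584, 0.67056, 1.85928,
            1.88976, 1.70688, 1.00584, 1.34112, 1.37160,
            1.40208, 0.73152, 1.40208, 1.43256, 0.85344)
weight <- c(69.0, 72.0, 58.5, 35.5, 26.0, 43.5, 27.0, 64.0,
            40.5, 61.5, 63.0, 44.0, 74.0, 60.5, 30.0)
age    <- c(28, 35, 19, 44, 50, 23, 29, 41, 38, 22, 36, 57, 48, 31, 25)
resp   <- data.frame(age = age, bmi = round(weight / height^2, 2))

rlm <- lm(bmi ~ age, data = resp)
rlm_summary <- summary(rlm)

# Coefficient table: one row per term; columns hold estimate, SE, t, and p
coefs     <- coef(rlm_summary)
intercept <- coefs["(Intercept)", "Estimate"]
slope     <- coefs["age", "Estimate"]
slope_p   <- coefs["age", "Pr(>|t|)"]
r_squared <- rlm_summary$r.squared

# Values as they might be rounded for use in the text
round(c(intercept = intercept, slope = slope,
        slope_p = slope_p, r_squared = r_squared), 3)
```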
The fitted regression model is expressed as:
\[BMI = 15.13 + 0.72 \times Age\]
The regression analysis indicates a positive relationship between age and BMI, as reflected by the slope coefficient 0.72. This suggests that, on average, BMI is expected to increase by approximately 0.72 units for every one-year increase in age. However, this relationship is not statistically significant (p = 0.228), indicating that there is insufficient evidence to conclude that BMI is linearly associated with age in this data set. This is unsurprising, because height and weight were sampled independently of age.
The intercept (15.13) represents the estimated BMI when age is zero. Although this provides the baseline level of BMI in the regression equation, it has limited practical interpretation in this context, because age zero lies far outside the observed range of 19 to 57 years. It is also not statistically significant (p = 0.48).
The coefficient of determination measures the proportion of variation in BMI explained by age. For this study, an \(R^2\) value of 0.11 implies that approximately 10.95% of the variability in BMI is accounted for by age, while the remaining 89.05% is due to other factors not included in the model. This indicates that age is a weak predictor of BMI.
The F-statistic (1.6, p = 0.228) tests the overall significance of the regression model. In this case, the model is not statistically significant, indicating that age does not significantly improve the prediction of BMI compared to a model with no predictors.
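These quantities can be reproduced by hand, which also shows why the F-test and the slope's t-test give the same p-value when there is a single predictor. The sketch below rebuilds the data from the values printed earlier; names such as ss_res, ss_tot, and f_p_value are illustrative.

```r
# Data copied from the printed data frame (height in meters, weight in kg)
height <- c(1.52400, 1.03632, 1.00584, 0.67056, 1.85928,
            1.88976, 1.70688, 1.00584, 1.34112, 1.37160,
            1.40208, 0.73152, 1.40208, 1.43256, 0.85344)
weight <- c(69.0, 72.0, 58.5, 35.5, 26.0, 43.5, 27.0, 64.0,
            40.5, 61.5, 63.0, 44.0, 74.0, 60.5, 30.0)
age    <- c(28, 35, 19, 44, 50, 23, 29, 41, 38, 22, 36, 57, 48, 31, 25)
resp   <- data.frame(age = age, bmi = round(weight / height^2, 2))

rlm <- lm(bmi ~ age, data = resp)
rlm_summary <- summary(rlm)

# R-squared as 1 - (residual sum of squares / total sum of squares)
ss_res <- sum(residuals(rlm)^2)
ss_tot <- sum((resp$bmi - mean(resp$bmi))^2)
r_squared_manual <- 1 - ss_res / ss_tot

# Overall F-test p-value from the F-statistic and its degrees of freedom
f_stat    <- rlm_summary$fstatistic  # named vector: value, numdf, dendf
f_p_value <- unname(pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
                       lower.tail = FALSE))

# With a single predictor, F equals the squared t value of the slope
t_slope <- coef(rlm_summary)["age", "t value"]

all.equal(r_squared_manual, rlm_summary$r.squared)
all.equal(unname(f_stat["value"]), t_slope^2)
```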
A scatter plot was created using the ggplot2 package to visualize the relationship between age and BMI.
The ggplot() function initializes the plot and
defines the data set and aesthetic mappings.
Individual observations were displayed using
geom_point().
The geom_smooth() function was used to add a
straight line of best fit to the plot in order to visualize the trends
in the data set.
rp <- ggplot(resp, aes(x = age, y = bmi)) +
  geom_point() +
  geom_smooth(method = "lm", colour = "red", linewidth = 1.2)
The regression equation and \(R^{2}\) were added using the
stat_poly_eq() function from the ggpmisc
package.
rp <- rp +
  stat_poly_eq(
    aes(label = paste(after_stat(eq.label), after_stat(rr.label), sep = "~~~")),
    formula = y ~ x,
    parse = TRUE,
    label.x = "right",
    label.y = "top"
  )
The scale_x_continuous() and scale_y_continuous() functions were used
to replace the default axis limits with my preferred ranges.
The labs() function sets the axis labels and plot title. This
overrides the default axis names, which are derived from the column
names in the data frame called resp.
rp <- rp +
  scale_x_continuous(limits = c(15, 60)) +
  scale_y_continuous(limits = c(0, 150)) +
  labs(
    x = "Age (years)",
    y = "BMI (kg/m²)",
    title = "SCATTER PLOT OF BMI VS AGE"
  )
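One point worth keeping in mind, sketched below with made-up data: limits set through scale_x_continuous() or scale_y_continuous() exclude values outside the range from the plot, whereas coord_cartesian() only zooms the view. In this analysis all ages (19 to 57) and BMI values fall inside the chosen limits, so nothing is excluded. The data frame demo and the plot names p_limits and p_zoom are hypothetical.

```r
library(ggplot2)

# Made-up data with one y value (100) far above the others
demo <- data.frame(x = 1:4, y = c(1, 2, 3, 100))

# Scale limits: values outside 0-10 are excluded from the plot
p_limits <- ggplot(demo, aes(x = x, y = y)) +
  geom_point() +
  scale_y_continuous(limits = c(0, 10))

# coord_cartesian(): zooms to 0-10 but keeps every observation
p_zoom <- ggplot(demo, aes(x = x, y = y)) +
  geom_point() +
  coord_cartesian(ylim = c(0, 10))

# Number of usable (non-missing) y values each plot works with
usable_limits <- sum(!is.na(layer_data(p_limits)$y))
usable_zoom   <- sum(!is.na(layer_data(p_zoom)$y))
c(scale_limits = usable_limits, coord_zoom = usable_zoom)
```

The difference matters most for layers that compute a summary, such as geom_smooth(), because any excluded points would not contribute to the fitted line.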
The theme() function customizes the visual appearance of
the plot. The axis titles, plot title, and tick labels were set to
bold. The hjust argument was used to
center-align the plot title.
rp <- rp +
  theme(
    axis.title = element_text(face = "bold"),
    plot.title = element_text(face = "bold", hjust = 0.5),
    axis.text = element_text(face = "bold")
  )
The plot object was printed to display the graph.
rp
## `geom_smooth()` using formula = 'y ~ x'
Figure 1: Scatter plot showing the relationship between age and BMI with fitted regression line and 95% confidence interval
As shown in Figure 1, the independent variable (age) was
plotted on the \(X\)-axis, while the
dependent variable (BMI) was plotted on the \(Y\)-axis. The plot suggests that BMI tends
to increase as age increases, indicating the presence of a positive
linear relationship between the two variables.
However, the plot also shows that the data points are widely scattered around the regression line, indicating high variability. Furthermore, there is no strong clustering of points around the line, which supports the conclusion that the relationship is weak. A few points also appear relatively far from the regression line, suggesting the presence of possible outliers in the data set.
The grey shaded area in the plot represents the confidence interval around the regression line (95% by default in geom_smooth()). It indicates the range within which the mean BMI at a given age is expected to lie. This shaded region helps to assess the reliability of the trend line: narrow bands indicate higher confidence, whereas wider bands indicate lower confidence.
In this case, the confidence band is relatively wide, particularly at higher age values. This suggests that there is considerable uncertainty in the estimated relationship between age and BMI, and therefore the regression line is not highly precise.
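The width of the band can also be inspected numerically. The sketch below uses predict() with interval = "confidence", which I understand to be the same kind of interval that geom_smooth() shades for a linear model, and compares the band width at the youngest, mean, and oldest ages. The data are rebuilt from the values printed earlier; new_ages, ci, and band_width are illustrative names.

```r
# Data copied from the printed data frame (height in meters, weight in kg)
height <- c(1.52400, 1.03632, 1.00584, 0.67056, 1.85928,
            1.88976, 1.70688, 1.00584, 1.34112, 1.37160,
            1.40208, 0.73152, 1.40208, 1.43256, 0.85344)
weight <- c(69.0, 72.0, 58.5, 35.5, 26.0, 43.5, 27.0, 64.0,
            40.5, 61.5, 63.0, 44.0, 74.0, 60.5, 30.0)
age    <- c(28, 35, 19, 44, 50, 23, 29, 41, 38, 22, 36, 57, 48, 31, 25)
resp   <- data.frame(age = age, bmi = round(weight / height^2, 2))

rlm <- lm(bmi ~ age, data = resp)

# 95% confidence interval for mean BMI at the youngest, mean, and oldest ages
new_ages <- data.frame(age = c(min(resp$age), mean(resp$age), max(resp$age)))
ci <- predict(rlm, newdata = new_ages, interval = "confidence", level = 0.95)

# Band width (upper minus lower limit); it is narrowest at the mean age
band_width <- unname(ci[, "upr"] - ci[, "lwr"])
cbind(new_ages, ci, width = band_width)
```

The band widens with distance from the mean age. Because the oldest respondent (57) is further from the mean than the youngest (19), the band is widest at the upper end of the age range.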
Although the plot shows a positive trend, the relatively wide confidence interval supports the conclusion that the relationship between age and BMI is weak and less reliable, consistent with the low \(R^2\) value observed. As such, age alone does not adequately explain variations in BMI, and additional variables would likely be required to develop a more reliable model.
Finally, the plot was exported as a PNG image file using the
ggsave() function.
ggsave("visualization.png",
       plot = rp,
       width = 8,
       height = 6,
       dpi = 300)
## `geom_smooth()` using formula = 'y ~ x'
Thank you for reading!
Let me know if you found the write-up helpful by sending me a message.