Introduction

This document demonstrates the process of generating a simulated data set and visualizing the relationship between age and Body Mass Index (BMI) using R. The analysis involves creating hypothetical observations for height, weight, and age, calculating BMI, performing a linear regression analysis, and producing a scatter plot with a regression line. The goal is to illustrate a simple workflow for data generation, transformation, analysis, and visualization in R using the tidyverse (which includes dplyr and ggplot2) and ggpmisc packages.

Workflow

1. Loading Required Packages

The analysis used dplyr for data manipulation and ggplot2 for data visualisation, both of which are part of the tidyverse ecosystem. So, rather than load the individual packages, I loaded only the tidyverse package, which attaches both.

The ggpmisc package was used to add the regression equation and R² values directly on the scatter plot.

library("tidyverse")
## Warning: package 'lubridate' was built under R version 4.5.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("ggpmisc")
## Loading required package: ggpp
## Warning: package 'ggpp' was built under R version 4.5.3
## Registered S3 methods overwritten by 'ggpp':
##   method                  from   
##   heightDetails.titleGrob ggplot2
##   widthDetails.titleGrob  ggplot2
## 
## Attaching package: 'ggpp'
## 
## The following object is masked from 'package:ggplot2':
## 
##     annotate

2. Setting Seed for Reproducibility

Because the height and weight values are generated randomly, I used the set.seed() function so that the same random numbers are produced every time the code is run. Since this document is intended to be shared, this also lets readers reproduce the values shown here.

set.seed(123)
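As a minimal sketch of what set.seed() does, resetting the seed before each call makes sample() return the same values. The names first_draw and second_draw are illustrative and are not part of the analysis.

```r
# Illustrative only: first_draw and second_draw are hypothetical names.
set.seed(123)
first_draw <- sample(1:100, size = 5)

set.seed(123)                       # reset the generator to the same state
second_draw <- sample(1:100, size = 5)

identical(first_draw, second_draw)  # TRUE: both draws contain the same values
```

The exact numbers can still differ between R versions that use different default sampling algorithms, so identical output is only expected on comparable R versions.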

3. Height Data Generation

A sequence of possible heights from 2 to 6.5 feet, in increments of 0.1 feet, was first created using the seq() function. From this sequence, 15 values were then randomly selected with replacement using the sample() function to simulate the observations for this hypothetical study.

The argument replace = TRUE allows the same value to be selected more than once.

height <- sample(seq(from = 2, by = 0.1, to = 6.5), size = 15, replace = TRUE)
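The sketch below illustrates the two pieces of this step separately: the grid of candidate values that seq() builds, and the effect of replace = TRUE. The names height_grid and draws and the seed value are hypothetical, and the sketch is not meant to reproduce the values in this document.

```r
# Hypothetical names; not part of the original analysis.
height_grid <- seq(from = 2, to = 6.5, by = 0.1)  # candidate heights in feet
length(height_grid)                               # number of candidate values
range(height_grid)                                # smallest and largest candidate

set.seed(1)                                       # arbitrary seed for this sketch
draws <- sample(height_grid, size = 15, replace = TRUE)
anyDuplicated(draws) > 0  # TRUE when at least one value was drawn more than once

# Without replacement, size cannot exceed the number of candidates,
# so a call such as sample(height_grid, size = 50) stops with an error.
```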

4. Converting Height from Feet to Meters

Because BMI calculations require height in meters, the generated height values were converted from feet to meters.

The conversion factor used was:

           1 foot = 0.3048 meters

Each height value was multiplied by 0.3048.

height <- height * 0.3048
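As a worked check of the conversion, a height of 5 feet becomes 5 × 0.3048 = 1.524 meters. The helper name feet_to_meters below is hypothetical and is not used in the analysis; it only wraps the same multiplication.

```r
# Hypothetical helper, shown only to illustrate the conversion.
feet_to_meters <- function(feet) {
  feet * 0.3048  # 1 foot = 0.3048 meters
}

# Lower limit, a mid-range value, and upper limit of the simulated height range
feet_to_meters(c(2, 5, 6.5))
```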

5. Weight Data Generation

A vector of weight values was created to represent the body weight of respondents. Weights ranged from 23 kg to 75.5 kg, increasing in increments of 0.5 kg. Similar to the height generation process, 15 observations were randomly sampled with replacement.

weight <- sample(seq(from = 23, by = 0.5, to = 75.5), size = 15, replace = TRUE)

weight
##  [1] 69.0 72.0 58.5 35.5 26.0 43.5 27.0 64.0 40.5 61.5 63.0 44.0 74.0 60.5 30.0

6. Defining Age Values

The ages of the respondents were then specified manually using the c() function. This vector contained 15 age values, corresponding to the number of individuals in the data set.

age <- c(28, 35, 19, 44, 50, 23, 29, 41, 38, 22, 36, 57, 48, 31, 25)

age
##  [1] 28 35 19 44 50 23 29 41 38 22 36 57 48 31 25

7. Creating a Data Frame

The vectors for height, weight, and age were combined into a single data frame using the data.frame() function. This structure allows the variables to be organized in tabular format where each row represents a respondent and each column represents a variable.

The data frame was stored in a variable named resp.

resp <- data.frame(height, weight, age)

resp
##     height weight age
## 1  1.52400   69.0  28
## 2  1.03632   72.0  35
## 3  1.00584   58.5  19
## 4  0.67056   35.5  44
## 5  1.85928   26.0  50
## 6  1.88976   43.5  23
## 7  1.70688   27.0  29
## 8  1.00584   64.0  41
## 9  1.34112   40.5  38
## 10 1.37160   61.5  22
## 11 1.40208   63.0  36
## 12 0.73152   44.0  57
## 13 1.40208   74.0  48
## 14 1.43256   60.5  31
## 15 0.85344   30.0  25

8. Calculating Body Mass Index

BMI was calculated using the standard formula: \[BMI = \frac{weight}{height^2}\]

Weight was measured in kilograms and height in meters. The BMI values were added as a new column in the data set.

resp$BMI <- resp$weight / (resp$height^2)
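As a worked example, using the first respondent in the data frame printed in step 7 (weight 69.0 kg, height 1.524 m):

\[BMI = \frac{69.0}{1.524^2} = \frac{69.0}{2.3226} \approx 29.71\]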

9. Rounding the BMI Values

For clarity and consistency, BMI values were rounded to two decimal places using the round() function.

resp$BMI <- round(resp$BMI, 2)

10. Renaming the BMI Variable

The BMI variable was renamed from BMI to bmi using the rename() function.

resp <- resp |>
  rename(bmi = BMI)

11. Selecting Variables for Analysis

Only the variables relevant to the regression analysis (age and bmi) were retained using the select() function.

resp <- resp |>
  select(c(age, bmi))

N.B.

It is worth noting that the data set could instead have been subset with base R functions such as subset() to keep only the variables required for the visualisation. Similarly, the column names could have been modified using the colnames() function.

I could also have continued the analysis using the original data frame without creating a reduced version of the data set. However, the tidyverse approach was intentionally used to manipulate the data.
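For comparison, a minimal base R sketch of the same renaming and selection steps might look like the following. The data frame resp_demo and the result resp_base are hypothetical names used only for this illustration; the two rows are the first two respondents from the data frame printed in step 7.

```r
# Hypothetical two-row example (first two respondents from step 7).
resp_demo <- data.frame(
  height = c(1.52400, 1.03632),
  weight = c(69.0, 72.0),
  age    = c(28, 35)
)
resp_demo$BMI <- round(resp_demo$weight / resp_demo$height^2, 2)

# Rename BMI to bmi with colnames() instead of dplyr::rename()
colnames(resp_demo)[colnames(resp_demo) == "BMI"] <- "bmi"

# Keep only age and bmi with subset() instead of dplyr::select()
resp_base <- subset(resp_demo, select = c(age, bmi))
resp_base
```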

12. Performing Linear Regression Analysis

Simple linear regression analysis was carried out to understand how Body Mass Index (BMI) changes with age.

rlm <- lm(bmi ~ age, data = resp)
  • lm() is the core function for fitting linear models.
  • bmi ~ age: This formula tells R that bmi is the dependent variable (response) and age is the independent variable (predictor).
  • data = resp: This tells R to look for the variables bmi and age inside a data set named resp.

13. Linear Regression Result

The summary() function displays the regression result including the regression coefficients, R², and p-values.

rlm_summary<-summary(rlm)

rlm_summary
## 
## Call:
## lm(formula = bmi ~ age, data = resp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.408 -15.636  -5.467  22.528  32.318 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  15.1296    20.7888   0.728    0.480
## age           0.7160     0.5662   1.264    0.228
## 
## Residual standard error: 23.85 on 13 degrees of freedom
## Multiple R-squared:  0.1095, Adjusted R-squared:  0.04102 
## F-statistic: 1.599 on 1 and 13 DF,  p-value: 0.2283

Note: The regression equation, coefficients, p-values, and model statistics reported here were dynamically extracted from rlm_summary and inserted into the text using inline R code. This ensures that any changes to the data set, model specification, or random seed are automatically reflected in the output, eliminating the need for manual updates.
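A minimal sketch of how such values can be pulled out of a summary object is given below. To keep it self-contained, the data frame is rebuilt from the height, weight, and age values printed in step 7. The object names (resp_demo, fit, fit_summary, slope, and so on) are hypothetical, and the inline expressions actually used in the source of this document are not shown here.

```r
# Rebuild the data from the table printed in step 7 (hypothetical object names).
resp_demo <- data.frame(
  height = c(1.52400, 1.03632, 1.00584, 0.67056, 1.85928, 1.88976, 1.70688,
             1.00584, 1.34112, 1.37160, 1.40208, 0.73152, 1.40208, 1.43256,
             0.85344),
  weight = c(69.0, 72.0, 58.5, 35.5, 26.0, 43.5, 27.0, 64.0, 40.5, 61.5,
             63.0, 44.0, 74.0, 60.5, 30.0),
  age    = c(28, 35, 19, 44, 50, 23, 29, 41, 38, 22, 36, 57, 48, 31, 25)
)
resp_demo$bmi <- round(resp_demo$weight / resp_demo$height^2, 2)

fit         <- lm(bmi ~ age, data = resp_demo)
fit_summary <- summary(fit)

# Coefficient table: rows are model terms, columns are the statistics
coefs     <- fit_summary$coefficients
intercept <- coefs["(Intercept)", "Estimate"]
slope     <- coefs["age", "Estimate"]
slope_p   <- coefs["age", "Pr(>|t|)"]

# Model-level statistics
r_squared <- fit_summary$r.squared
f_value   <- fit_summary$fstatistic[["value"]]
f_p       <- pf(f_value,
                fit_summary$fstatistic[["numdf"]],
                fit_summary$fstatistic[["dendf"]],
                lower.tail = FALSE)

# Rounded values of the kind that could be placed in inline R code
round(c(intercept = intercept, slope = slope, r_squared = r_squared), 2)
```

In an R Markdown document, an inline expression such as `r round(slope, 2)` placed in the text would then print the current value of the slope each time the document is knitted.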

14. Interpretation of Result

The fitted regression model is expressed as:

\[BMI = 15.13 + 0.72 \times Age\]

The regression analysis indicates a positive relationship between age and BMI, as reflected by the slope coefficient of 0.72. This suggests that, on average, BMI is expected to increase by approximately 0.72 units for every one-year increase in age. However, this relationship is not statistically significant (p = 0.228), indicating that there is insufficient evidence to conclude that BMI changes with age in this data set.

The intercept (15.13) represents the estimated BMI when age is zero. Although this provides the baseline level of BMI in the regression equation, it has limited practical interpretation in this context and is also not statistically significant (p = 0.48).

The coefficient of determination measures the proportion of variation in BMI explained by age. For this study, an \(R^2\) value of 0.1095 (about 0.11) implies that approximately 10.95% of the variability in BMI is accounted for by age, while the remaining 89.05% is left unexplained by the model (other factors or random variation). This indicates that age is a weak predictor of BMI.

The F-statistic (1.6, p = 0.228) tests the overall significance of the regression model. In this case, the model is not statistically significant, indicating that age does not significantly improve the prediction of BMI compared to a model with no predictors.
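In a simple linear regression with a single predictor, the overall F-statistic equals the square of the slope's t-statistic, which is why the p-value for age (0.228) and the p-value for the model (0.2283) agree. Using the t value from the output in step 13:

\[F = t^2 = 1.264^2 \approx 1.598\]

This matches the reported F-statistic of 1.599 up to rounding of the displayed t value.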

15. Creating A Scatter Plot

A scatter plot was created using the ggplot2 package to visualize the relationship between age and BMI.

  • The ggplot() function initializes the plot and defines the data set and aesthetic mappings.

  • Individual observations were displayed using geom_point().

  • The geom_smooth() function was used to add a straight line of best fit to the plot in order to visualize the trends in the data set.

rp <- ggplot(resp, aes(x = age, y = bmi)) +
  geom_point() +
  geom_smooth(method = lm, colour = "red", linewidth = 1.2)

16. Adding Regression Equation and R² Value

The regression equation and \(R^{2}\) were added using the stat_poly_eq() function from the ggpmisc package.

rp<-rp+ stat_poly_eq(
    aes(
      label= paste(
        after_stat(eq.label),
        after_stat(rr.label),
        sep = "~~~")),
    formula = y~x,
    parse = TRUE,
    label.x = "right", 
    label.y = "top"     
  )

17. Add Axis Limits and Labels

The scale_x_continuous() and scale_y_continuous() functions were used to replace the default axis limits with my preferred ranges (15 to 60 for age and 0 to 150 for BMI).

The labs() function sets the axis labels and the plot title. This overrides the default axis labels, which are derived from the column names in the data frame called resp.

rp<- rp + 
  scale_x_continuous(limits=c(15,60))+
  scale_y_continuous(limits=c(0,150))+
  labs(
    x= "Age (years)",
    y = "BMI (kg/m²)",
    title = "SCATTER PLOT OF BMI VS AGE"
  )

18. Theme Customization

The theme() function customizes the visual appearance of the plot. The axis titles, plot title, and tick labels were set to bold. The hjust argument was used to center-align the plot title.

rp<- rp +
theme(
    axis.title =element_text(face="bold"),
    plot.title = element_text(face="bold", hjust = 0.5),
    axis.text= element_text(face="bold")
  )

19. Final Plot!

The plot object was printed to display the graph.

rp
## `geom_smooth()` using formula = 'y ~ x'
Figure 1: Scatter plot showing the relationship between age and BMI with fitted regression line and 95% confidence interval


20. Interpretation of the Plot

As shown in Figure 1, the independent variable (age) was plotted on the \(X\)-axis, while the dependent variable (BMI) was plotted on the \(Y\)-axis. The plot suggests that BMI tends to increase as age increases, indicating the presence of a positive linear relationship between the two variables.

However, the plot also shows that the data points are widely scattered around the regression line, indicating high variability. Furthermore, there is no strong clustering of points around the line, which supports the conclusion that the relationship is weak. A few points also appear relatively far from the regression line, suggesting the presence of possible outliers in the data set.

The grey shaded area in the plot represents the confidence interval around the regression line (95% by default in geom_smooth()). It indicates the range within which the mean BMI at a given age is expected to lie. This shaded region helps to assess the reliability of the trend line: narrow bands indicate higher confidence, whereas wider bands indicate lower confidence.

In this case, the confidence band is relatively wide, particularly at higher age values. This suggests that there is considerable uncertainty in the estimated relationship between age and BMI, and therefore the regression line is not highly precise.
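As a rough sketch of what the shaded band represents, the same kind of interval can be computed directly with predict() and interval = "confidence". The data frame is rebuilt from the values printed in step 7, and the object names (resp_demo, fit, new_ages, band) and the three ages chosen are hypothetical. The result should correspond to the band drawn by geom_smooth() for a linear model at the 95% level, although that correspondence is assumed here and was not checked against the plot itself.

```r
# Rebuild the data from the table printed in step 7 (hypothetical object names).
resp_demo <- data.frame(
  height = c(1.52400, 1.03632, 1.00584, 0.67056, 1.85928, 1.88976, 1.70688,
             1.00584, 1.34112, 1.37160, 1.40208, 0.73152, 1.40208, 1.43256,
             0.85344),
  weight = c(69.0, 72.0, 58.5, 35.5, 26.0, 43.5, 27.0, 64.0, 40.5, 61.5,
             63.0, 44.0, 74.0, 60.5, 30.0),
  age    = c(28, 35, 19, 44, 50, 23, 29, 41, 38, 22, 36, 57, 48, 31, 25)
)
resp_demo$bmi <- round(resp_demo$weight / resp_demo$height^2, 2)

fit <- lm(bmi ~ age, data = resp_demo)

# Ages at which to evaluate the fitted line and its 95% confidence interval
new_ages <- data.frame(age = c(20, 35, 55))
band <- predict(fit, newdata = new_ages, interval = "confidence", level = 0.95)

# Columns: fit (predicted mean BMI), lwr and upr (interval limits), width
band_width <- band[, "upr"] - band[, "lwr"]
cbind(new_ages, band, width = band_width)
```

For a straight-line fit, the interval is narrowest at the mean age (about 35 years in this data set) and widens as age moves away from the mean in either direction.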

21. Conclusion

Although the plot shows a positive trend, the relatively wide confidence interval supports the conclusion that the relationship between age and BMI is weak and less reliable, consistent with the low \(R^2\) value observed. As such, age alone does not adequately explain variations in BMI, and additional variables would likely be required to develop a more reliable model.

22. Saving the Plot Using ggsave()

Finally, the plot was exported as a PNG image file using the ggsave() function.

ggsave("visualization.png",
       plot = rp,
       width = 8,
       height = 6,
       dpi = 300)
## `geom_smooth()` using formula = 'y ~ x'

Thank you for reading!

Let me know if you found the write-up explanatory by sending me a message.