--- title: "A Quick Start Guide on Using std_selected()" author: "Shu Fai Cheung" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{A Quick Start Guide on Using std_selected()} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 4, fig.align = "center" ) ``` # Introduction This vignette illustrates how to use `std_selected()`, the main function from the `stdmod` package. More about this package can be found in `vignette("stdmod", package = "stdmod")` or at [https://sfcheung.github.io/stdmod/](https://sfcheung.github.io/stdmod/). # This Guide Shows to use `std_selected()` to: - get the correct standardized regression coefficients of a moderated regression model, and - form the valid confidence intervals of the standardized regression coefficients using nonparametric bootstrapping that takes into account the sampling variation due to standardization. # Sample Dataset ```{r} library(stdmod) dat <- sleep_emo_con head(dat, 3) ``` This dataset has 500 cases, with sleep duration (measured in average hours), conscientiousness, emotional stability, age, and gender (a `"female"` and `"male"`). The names of some variables are shortened for readability: ```{r} colnames(dat)[2:4] <- c("sleep", "cons", "emot") head(dat, 3) ``` # Model Suppose this is the moderated regression model: - Dependent variable (Outcome Variable): sleep duration (`sleep`) - Independent variable (Predictor / Focal Variable): emotional stability (`emot`) - Moderator: conscientiousness (`cons`) - Control variables: `age` and `gender` `lm()` can be used to fit this model: ```{r} lm_out <- lm(sleep ~ age + gender + emot * cons, dat = dat) summary(lm_out) ``` The unstandardized moderation effect is significant, B = `r formatC(coef(lm_out)["emot:cons"], 4, format = "f")`. For each one unit increase of conscientiousness score, the effect of emotional stability decreases by `r formatC(-1 * coef(lm_out)["emot:cons"], 4, format = "f")`. # Correct Standardization For the Moderated Regression Suppose we want to find the correct standardized solution for the moderated regression, that is, all variables except for categorical variables are standardized. In a moderated regression model, the product term should be formed *after* standardization. Instead of doing the standardization ourselves before calling `lm()`, we can pass the `lm()` output to `std_selected()`, and use `~ .` for the arguments `to_scale` and `to_center`. ```r lm_stdall <- std_selected(lm_out, to_scale = ~ ., to_center = ~ .) ``` Since 0.2.6.3, `to_standardize` can be used as a shortcut: ```{r} lm_stdall <- std_selected(lm_out, to_standardize = ~ .) ``` ```{r} summary(lm_stdall) ``` In this example, the coefficient of the product term, which naturally can be called the **standardized moderation effect**, is significant, B = `r formatC(coef(lm_stdall)["emot:cons"], 4, format = "f")`. For each one *standard deviation* increase of conscientiousness score, the **standardized effect** of emotional stability decreases by `r formatC(-1 * coef(lm_stdall)["emot:cons"], 4, format = "f")`. ## The Arguments Standardization is equivalent to centering by mean and then scaling by (dividing by) standard deviation. The argument `to_center` specifies the variables to be centered by their means, and the argument `to_scale` specifies the variables to be scaled by their standard deviations. The formula interface of `lm()` is used in these two arguments, with the variables on the right hand side being the variables to be centered and/or scaled. The "`.`" on the right hand side represents all variables in the model, including the outcome variable (sleep duration in this example). `std_selected()` will also skip categorical variables automatically skipped because standardizing them will make their coefficients not easy to interpret. Since 0.2.6.3, `to_standardize` is added as a shortcut. Listing a variable on `to_standardize` is equivalent to listing this variable on both `to_center` and `to_scale`. ## Advantage Using `std_selected` minimizes impact on the workflow. Do regression as usual. Get the correct standardized coefficients only when we need to interpret them. ## Nonparametric Bootstrap Confidence Intervals There is one problem with standardized coefficients. The confidence intervals based on ordinary least squares (OLS) fitted to the standardized variables do not take into account the sampling variation of the sample means and standard deviations ([Yuan & Chan, 2011](https://doi.org/10.1007/s11336-011-9224-6)). [Cheung, Cheung, Lau, Hui, and Vong (2022)](https://doi.org/10.1037/hea0001188) suggest using nonparametric bootstrapping, with standardization conducted in each bootstrap sample. This can be done by `std_selected_boot()`, a wrapper of `std_selected()`: ```{r echo = FALSE} if (file.exists("stdmod_lm_stdall_boot.rds")) { lm_stdall_boot <- readRDS("stdmod_lm_stdall_boot.rds") } else { set.seed(870432) lm_stdall_boot <- std_selected_boot(lm_out, to_scale = ~ ., to_center = ~ ., nboot = 5000) saveRDS(lm_stdall_boot, "stdmod_lm_stdall_boot.rds", compress = "xz") } ``` ```{r eval = FALSE} set.seed(870432) lm_stdall_boot <- std_selected_boot(lm_out, to_scale = ~ ., to_center = ~ ., nboot = 5000) ``` Since 0.2.6.3, `to_standardize` can be used as a shortcut: ```r lm_stdall_boot <- std_selected_boot(lm_out, to_standardize = ~ . nboot = 5000) ``` The minimum additional argument is `nboot`, the number of bootstrap samples. ```{r} summary(lm_stdall_boot) ``` The output is similar to that of `std_selected()`, with additional information on the bootstrapping process. ```{r echo = FALSE} tmp <- summary(lm_stdall_boot)$coefficients ``` The 95% bootstrap percentile confidence interval of the standardized moderation effect is `r formatC(tmp["emot:cons", "CI Lower"], 4, format = "f")` to `r formatC(tmp["emot:cons", "CI Upper"], 4, format = "f")`. # Standardize Independent Variable (Focal Variable) and Moderator `std_selected()` and `std_selected_boot()` can also be used to standardize only selected variables. There are cases in which we do not want to standardize some continuous variables because they are measured on interpretable units, such as hours. Suppose we want to standardize only emotional stability and conscientiousness, and do not standardize sleep duration. We just list `emot` and `cons` on `to_center` and `to_scale`: ```r lm_std1 <- std_selected(lm_out, to_scale = ~ emot + cons, to_center = ~ emot + cons) ``` Since 0.2.6.3, `to_standardize` can be used a shortuct: ```{r} lm_std1 <- std_selected(lm_out, to_standardize = ~ emot + cons) ``` ```{r} summary(lm_std1) ``` The *partially* standardized moderation effect is `r formatC(coef(lm_std1)["emot:cons"], 4, format = "f")`. For each one *standard deviation* increase of conscientiousness score, the *partially* standardized effect of emotional stability decreases by `r formatC(-1 * coef(lm_std1)["emot:cons"], 4, format = "f")`. ## Nonparametric Bootstrap Confidence Intervals The function `std_selected_boot()` can also be used to form the nonparametric bootstrap confidence interval when only some of the variables are standardized: ```{r echo = FALSE} if (file.exists("stdmod_lm_std1_boot.rds")) { lm_std1_boot <- readRDS("stdmod_lm_std1_boot.rds") } else { set.seed(870432) lm_std1_boot <- std_selected_boot(lm_out, to_scale = ~ emot + cons, to_center = ~ emot + cons, nboot = 5000) saveRDS(lm_std1_boot, "stdmod_lm_std1_boot.rds", compress = "xz") } ``` ```{r eval = FALSE} set.seed(870432) lm_std1_boot <- std_selected_boot(lm_out, to_scale = ~ emot + cons, to_center = ~ emot + cons, nboot = 5000) ``` Since 0.2.6.3, `to_standardize` can be used as a shortcut: ```r lm_std1_boot <- std_selected_boot(lm_out, to_standardize = ~ emot + cons, nboot = 5000) ``` Again, the only additional argument is `nboot`. ```{r} summary(lm_std1_boot) ``` ```{r echo = FALSE} tmp <- summary(lm_std1_boot)$coefficients ``` The 95% bootstrap percentile confidence interval of the partially standardized moderation effect is `r formatC(tmp["emot:cons", "CI Lower"], 4, format = "f")` to `r formatC(tmp["emot:cons", "CI Upper"], 4, format = "f")`. # Further Information A more detailed illustration can be found at `vignette("moderation", package = "stdmod")`. `vignette("std_selected", package = "stdmod")` illustrates how `std_selected()` can be used to form nonparametric bootstrap percentile confidence interval for standardized regression coefficients ("betas") for regression models without a product term. Further information on the functions can be found in their help pages (`std_selected()` and `std_selected_boot()`). For example, parallel computation can be used when doing bootstrapping, if the number of bootstrapping samples request is large. # Reference(s) Cheung, S. F., Cheung, S.-H., Lau, E. Y. Y., Hui, C. H., & Vong, W. N. (2022) Improving an old way to measure moderation effect in standardized units. *Health Psychology*, *41*(7), 502-505. https://doi.org/10.1037/hea0001188. Yuan, K.-H., & Chan, W. (2011). Biases and standard errors of standardized regression coefficients. *Psychometrika, 76*(4), 670-690. https://doi.org/10.1007/s11336-011-9224-6