Health Survey Data Analysis of BMI

project
code
analysis
Author

Hamza Rahal

Published

June 25, 2024

1. Survey of BMI and physical activity

We’ve all taken a survey at some point, but do you ever wonder what happens to your answers? Surveys are given to a carefully selected sample of people with the goal of generalizing the results to a much larger population.

The National Health and Nutrition Examination Survey (NHANES) data is a complex survey of tens of thousands of people designed to assess the health and nutritional status of adults and children in the United States. The NHANES data includes many measurements related to overall health, physical activity, diet, psychological health, socioeconomic factors and more.

Depending on the sampling design, each person has a sampling weight that quantifies how many people in the larger population their data is meant to represent. In this notebook, we’ll apply survey methods that use sampling weights to estimate and model relationships between measurements.

We are going to focus on a common health indicator, Body Mass Index (BMI kg/m2), and how it is related to physical activity. We’ll visualize the data and use survey-weighted regression to test for associations.

Code
# Load the NHANES and dplyr packages
library(NHANES)
library(dplyr)

Attachement du package : 'dplyr'
Les objets suivants sont masqués depuis 'package:stats':

    filter, lag
Les objets suivants sont masqués depuis 'package:base':

    intersect, setdiff, setequal, union
Code
# Load the NHANESraw data
data("NHANESraw")

# Take a glimpse at the contents
glimpse(NHANESraw)
Rows: 20,293
Columns: 78
$ ID               <int> 51624, 51625, 51626, 51627, 51628, 51629, 51630, 5163…
$ SurveyYr         <fct> 2009_10, 2009_10, 2009_10, 2009_10, 2009_10, 2009_10,…
$ Gender           <fct> male, male, male, male, female, male, female, female,…
$ Age              <int> 34, 4, 16, 10, 60, 26, 49, 1, 10, 80, 10, 80, 4, 35, …
$ AgeMonths        <int> 409, 49, 202, 131, 722, 313, 596, 12, 124, NA, 121, N…
$ Race1            <fct> White, Other, Black, Black, Black, Mexican, White, Wh…
$ Race3            <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Education        <fct> High School, NA, NA, NA, High School, 9 - 11th Grade,…
$ MaritalStatus    <fct> Married, NA, NA, NA, Widowed, Married, LivePartner, N…
$ HHIncome         <fct> 25000-34999, 20000-24999, 45000-54999, 20000-24999, 1…
$ HHIncomeMid      <int> 30000, 22500, 50000, 22500, 12500, 30000, 40000, 4000…
$ Poverty          <dbl> 1.36, 1.07, 2.27, 0.81, 0.69, 1.01, 1.91, 1.36, 2.68,…
$ HomeRooms        <int> 6, 9, 5, 6, 6, 4, 5, 5, 7, 4, 5, 5, 7, NA, 6, 6, 5, 6…
$ HomeOwn          <fct> Own, Own, Own, Rent, Rent, Rent, Rent, Rent, Own, Own…
$ Work             <fct> NotWorking, NA, NotWorking, NA, NotWorking, Working, …
$ Weight           <dbl> 87.4, 17.0, 72.3, 39.8, 116.8, 97.6, 86.7, 9.4, 26.0,…
$ Length           <dbl> NA, NA, NA, NA, NA, NA, NA, 75.7, NA, NA, NA, NA, NA,…
$ HeadCirc         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Height           <dbl> 164.7, 105.4, 181.3, 147.8, 166.0, 173.0, 168.4, NA, …
$ BMI              <dbl> 32.22, 15.30, 22.00, 18.22, 42.39, 32.61, 30.57, NA, …
$ BMICatUnder20yrs <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ BMI_WHO          <fct> 30.0_plus, 12.0_18.5, 18.5_to_24.9, 12.0_18.5, 30.0_p…
$ Pulse            <int> 70, NA, 68, 68, 72, 72, 86, NA, 70, 88, 84, 54, NA, N…
$ BPSysAve         <int> 113, NA, 109, 93, 150, 104, 112, NA, 108, 139, 94, 12…
$ BPDiaAve         <int> 85, NA, 59, 41, 68, 49, 75, NA, 53, 43, 45, 60, NA, N…
$ BPSys1           <int> 114, NA, 112, 92, 154, 102, 118, NA, 106, 142, 94, 12…
$ BPDia1           <int> 88, NA, 62, 36, 70, 50, 82, NA, 60, 62, 38, 62, NA, N…
$ BPSys2           <int> 114, NA, 114, 94, 150, 104, 108, NA, 106, 140, 92, 12…
$ BPDia2           <int> 88, NA, 60, 44, 68, 48, 74, NA, 50, 46, 40, 62, NA, N…
$ BPSys3           <int> 112, NA, 104, 92, 150, 104, 116, NA, 110, 138, 96, 11…
$ BPDia3           <int> 82, NA, 58, 38, 68, 50, 76, NA, 56, 40, 50, 58, NA, N…
$ Testosterone     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ DirectChol       <dbl> 1.29, NA, 1.55, 1.89, 1.16, 1.16, 1.16, NA, 1.58, 1.9…
$ TotChol          <dbl> 3.49, NA, 4.97, 4.16, 5.22, 4.14, 6.70, NA, 4.14, 4.7…
$ UrineVol1        <int> 352, NA, 281, 139, 30, 202, 77, NA, 39, 128, 109, 38,…
$ UrineFlow1       <dbl> NA, NA, 0.415, 1.078, 0.476, 0.563, 0.094, NA, 0.300,…
$ UrineVol2        <int> NA, NA, NA, NA, 246, NA, NA, NA, NA, NA, NA, NA, NA, …
$ UrineFlow2       <dbl> NA, NA, NA, NA, 2.51, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Diabetes         <fct> No, No, No, No, Yes, No, No, No, No, No, No, Yes, No,…
$ DiabetesAge      <int> NA, NA, NA, NA, 56, NA, NA, NA, NA, NA, NA, 70, NA, N…
$ HealthGen        <fct> Good, NA, Vgood, NA, Fair, Good, Good, NA, NA, Excell…
$ DaysPhysHlthBad  <int> 0, NA, 2, NA, 20, 2, 0, NA, NA, 0, NA, 0, NA, NA, NA,…
$ DaysMentHlthBad  <int> 15, NA, 0, NA, 25, 14, 10, NA, NA, 0, NA, 0, NA, NA, …
$ LittleInterest   <fct> Most, NA, NA, NA, Most, None, Several, NA, NA, None, …
$ Depressed        <fct> Several, NA, NA, NA, Most, Most, Several, NA, NA, Non…
$ nPregnancies     <int> NA, NA, NA, NA, 1, NA, 2, NA, NA, NA, NA, NA, NA, NA,…
$ nBabies          <int> NA, NA, NA, NA, 1, NA, 2, NA, NA, NA, NA, NA, NA, NA,…
$ Age1stBaby       <int> NA, NA, NA, NA, NA, NA, 27, NA, NA, NA, NA, NA, NA, N…
$ SleepHrsNight    <int> 4, NA, 8, NA, 4, 4, 8, NA, NA, 6, NA, 9, NA, 7, NA, N…
$ SleepTrouble     <fct> Yes, NA, No, NA, No, No, Yes, NA, NA, No, NA, No, NA,…
$ PhysActive       <fct> No, NA, Yes, NA, No, Yes, No, NA, NA, Yes, NA, No, NA…
$ PhysActiveDays   <int> NA, NA, 5, NA, NA, 2, NA, NA, NA, 4, NA, NA, NA, NA, …
$ TVHrsDay         <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ CompHrsDay       <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ TVHrsDayChild    <int> NA, 4, NA, 1, NA, NA, NA, NA, 1, NA, 3, NA, 2, NA, 5,…
$ CompHrsDayChild  <int> NA, 1, NA, 1, NA, NA, NA, NA, 0, NA, 0, NA, 1, NA, 0,…
$ Alcohol12PlusYr  <fct> Yes, NA, NA, NA, No, Yes, Yes, NA, NA, Yes, NA, No, N…
$ AlcoholDay       <int> NA, NA, NA, NA, NA, 19, 2, NA, NA, 1, NA, NA, NA, NA,…
$ AlcoholYear      <int> 0, NA, NA, NA, 0, 48, 20, NA, NA, 52, NA, 0, NA, NA, …
$ SmokeNow         <fct> No, NA, NA, NA, Yes, No, Yes, NA, NA, No, NA, No, NA,…
$ Smoke100         <fct> Yes, NA, NA, NA, Yes, Yes, Yes, NA, NA, Yes, NA, Yes,…
$ SmokeAge         <int> 18, NA, NA, NA, 16, 15, 38, NA, NA, 16, NA, 21, NA, N…
$ Marijuana        <fct> Yes, NA, NA, NA, NA, Yes, Yes, NA, NA, NA, NA, NA, NA…
$ AgeFirstMarij    <int> 17, NA, NA, NA, NA, 10, 18, NA, NA, NA, NA, NA, NA, N…
$ RegularMarij     <fct> No, NA, NA, NA, NA, Yes, No, NA, NA, NA, NA, NA, NA, …
$ AgeRegMarij      <int> NA, NA, NA, NA, NA, 12, NA, NA, NA, NA, NA, NA, NA, N…
$ HardDrugs        <fct> Yes, NA, NA, NA, No, Yes, Yes, NA, NA, NA, NA, NA, NA…
$ SexEver          <fct> Yes, NA, NA, NA, Yes, Yes, Yes, NA, NA, NA, NA, NA, N…
$ SexAge           <int> 16, NA, NA, NA, 15, 9, 12, NA, NA, NA, NA, NA, NA, NA…
$ SexNumPartnLife  <int> 8, NA, NA, NA, 4, 10, 10, NA, NA, NA, NA, NA, NA, NA,…
$ SexNumPartYear   <int> 1, NA, NA, NA, NA, 1, 1, NA, NA, NA, NA, NA, NA, NA, …
$ SameSex          <fct> No, NA, NA, NA, No, No, Yes, NA, NA, NA, NA, NA, NA, …
$ SexOrientation   <fct> Heterosexual, NA, NA, NA, NA, Heterosexual, Heterosex…
$ WTINT2YR         <dbl> 80100.544, 53901.104, 13953.078, 11664.899, 20090.339…
$ WTMEC2YR         <dbl> 81528.772, 56995.035, 14509.279, 12041.635, 21000.339…
$ SDMVPSU          <int> 1, 2, 1, 2, 2, 1, 2, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1,…
$ SDMVSTRA         <int> 83, 79, 84, 86, 75, 88, 85, 86, 88, 77, 86, 79, 84, 7…
$ PregnantNow      <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, U…

2. Visualize survey weight and strata variables

We see from glimpse() that the NHANESraw data has many health measurement variables. It also contains a sampling weight variable WTMEC2YR.

Since NHANESraw data spans 4 years (2009–2012) and the sampling weights are based on 2 years of data, we first need to create a weight variable that scales the sample across the full 4 years. We will divide the 2-year weight in half so that in total, the weights sum to the total US population.

The NHANES data has oversampled some geographic regions and specific minority groups. By examining the distribution of sampling weights for each race, we can see that Whites are undersampled and have higher weights while oversampled Black, Mexican, Hispanic people have lower weights since each sampled person in these minority groups represents fewer US people.

Code
# Load the ggplot2 package
library(ggplot2)

# Use mutate to create a 4-year weight variable and call it WTMEC4YR
NHANESraw <- NHANESraw %>%
                mutate(WTMEC4YR = WTMEC2YR / 2) 

# Calculate the sum of this weight variable
NHANESraw %>% summarize(total_WTMEC4YR = sum(WTMEC4YR))
# A tibble: 1 × 1
  total_WTMEC4YR
           <dbl>
1     304267200.
Code
# Plot the sample weights using boxplots, with Race1 on the x-axis
ggplot(NHANESraw, aes(x=Race1, y=WTMEC4YR))+
geom_boxplot()

3. Specify the survey design

We will now use the survey package to specify the complex survey design that we will use in later analyses. The NHANESraw data contains a strata variable SDMVSTRA, and a cluster id variable (also known as a primary sampling unit, PSU), SDMVPSU, that accounts for design effects of clustering. These clusters (PSUs) are nested within strata.

Code
# Load the survey package
library(survey)
Le chargement a nécessité le package : grid
Le chargement a nécessité le package : Matrix
Le chargement a nécessité le package : survival

Attachement du package : 'survey'
L'objet suivant est masqué depuis 'package:graphics':

    dotchart
Code
# Specify the survey design
nhanes_design <- svydesign(
    data = NHANESraw,
    strata = ~SDMVSTRA,
    id = ~SDMVPSU,
    nest = TRUE,
    weights = ~WTMEC4YR)

# Print a summary of this design
summary(nhanes_design)
Stratified 1 - level Cluster Sampling design (with replacement)
With (62) clusters.
svydesign(data = NHANESraw, strata = ~SDMVSTRA, id = ~SDMVPSU, 
    nest = TRUE, weights = ~WTMEC4YR)
Probabilities:
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
8.986e-06 5.664e-05 1.054e-04       Inf 1.721e-04       Inf 
Stratum Sizes: 
            75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91
obs        803 785 823 829 696 751 696 724 713 683 592 946 598 647 251 862 998
design.PSU   2   2   2   2   2   2   2   2   2   2   2   3   2   2   2   3   3
actual.PSU   2   2   2   2   2   2   2   2   2   2   2   3   2   2   2   3   3
            92  93  94  95  96  97  98  99 100 101 102 103
obs        875 602 688 722 676 608 708 682 700 715 624 296
design.PSU   3   2   2   2   2   2   2   2   2   2   2   2
actual.PSU   3   2   2   2   2   2   2   2   2   2   2   2
Data variables:
 [1] "ID"               "SurveyYr"         "Gender"           "Age"             
 [5] "AgeMonths"        "Race1"            "Race3"            "Education"       
 [9] "MaritalStatus"    "HHIncome"         "HHIncomeMid"      "Poverty"         
[13] "HomeRooms"        "HomeOwn"          "Work"             "Weight"          
[17] "Length"           "HeadCirc"         "Height"           "BMI"             
[21] "BMICatUnder20yrs" "BMI_WHO"          "Pulse"            "BPSysAve"        
[25] "BPDiaAve"         "BPSys1"           "BPDia1"           "BPSys2"          
[29] "BPDia2"           "BPSys3"           "BPDia3"           "Testosterone"    
[33] "DirectChol"       "TotChol"          "UrineVol1"        "UrineFlow1"      
[37] "UrineVol2"        "UrineFlow2"       "Diabetes"         "DiabetesAge"     
[41] "HealthGen"        "DaysPhysHlthBad"  "DaysMentHlthBad"  "LittleInterest"  
[45] "Depressed"        "nPregnancies"     "nBabies"          "Age1stBaby"      
[49] "SleepHrsNight"    "SleepTrouble"     "PhysActive"       "PhysActiveDays"  
[53] "TVHrsDay"         "CompHrsDay"       "TVHrsDayChild"    "CompHrsDayChild" 
[57] "Alcohol12PlusYr"  "AlcoholDay"       "AlcoholYear"      "SmokeNow"        
[61] "Smoke100"         "SmokeAge"         "Marijuana"        "AgeFirstMarij"   
[65] "RegularMarij"     "AgeRegMarij"      "HardDrugs"        "SexEver"         
[69] "SexAge"           "SexNumPartnLife"  "SexNumPartYear"   "SameSex"         
[73] "SexOrientation"   "WTINT2YR"         "WTMEC2YR"         "SDMVPSU"         
[77] "SDMVSTRA"         "PregnantNow"      "WTMEC4YR"        

4. Subset the data

Analysis of survey data requires careful consideration of the sampling design and weights at every step. Something as simple as filtering the data becomes complicated when weights are involved.

When we wish to examine a subset of the data (i.e. the subpopulation of adult Hispanics with diabetes, or pregnant women), we must explicitly specify this in the design. We cannot simply remove that subset of the data through filtering the raw data because the survey weights will no longer be correct and will not add up to the full US population.

BMI categories are different for children and young adults younger than 20 so we will subset the data to only analyze adults of at least 20 years of age.

Code
# Select adults of Age >= 20 with subset
nhanes_adult <- nhanes_design%>%
                subset(Age >=20)

# Print a summary of this subset
summary(nhanes_adult)
Stratified 1 - level Cluster Sampling design (with replacement)
With (62) clusters.
subset(., Age >= 20)
Probabilities:
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
8.986e-06 4.303e-05 8.107e-05       Inf 1.240e-04       Inf 
Stratum Sizes: 
            75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91
obs        471 490 526 500 410 464 447 400 411 395 357 512 327 355 153 509 560
design.PSU   2   2   2   2   2   2   2   2   2   2   2   3   2   2   2   3   3
actual.PSU   2   2   2   2   2   2   2   2   2   2   2   3   2   2   2   3   3
            92  93  94  95  96  97  98  99 100 101 102 103
obs        483 376 368 454 362 315 414 409 377 460 308 165
design.PSU   3   2   2   2   2   2   2   2   2   2   2   2
actual.PSU   3   2   2   2   2   2   2   2   2   2   2   2
Data variables:
 [1] "ID"               "SurveyYr"         "Gender"           "Age"             
 [5] "AgeMonths"        "Race1"            "Race3"            "Education"       
 [9] "MaritalStatus"    "HHIncome"         "HHIncomeMid"      "Poverty"         
[13] "HomeRooms"        "HomeOwn"          "Work"             "Weight"          
[17] "Length"           "HeadCirc"         "Height"           "BMI"             
[21] "BMICatUnder20yrs" "BMI_WHO"          "Pulse"            "BPSysAve"        
[25] "BPDiaAve"         "BPSys1"           "BPDia1"           "BPSys2"          
[29] "BPDia2"           "BPSys3"           "BPDia3"           "Testosterone"    
[33] "DirectChol"       "TotChol"          "UrineVol1"        "UrineFlow1"      
[37] "UrineVol2"        "UrineFlow2"       "Diabetes"         "DiabetesAge"     
[41] "HealthGen"        "DaysPhysHlthBad"  "DaysMentHlthBad"  "LittleInterest"  
[45] "Depressed"        "nPregnancies"     "nBabies"          "Age1stBaby"      
[49] "SleepHrsNight"    "SleepTrouble"     "PhysActive"       "PhysActiveDays"  
[53] "TVHrsDay"         "CompHrsDay"       "TVHrsDayChild"    "CompHrsDayChild" 
[57] "Alcohol12PlusYr"  "AlcoholDay"       "AlcoholYear"      "SmokeNow"        
[61] "Smoke100"         "SmokeAge"         "Marijuana"        "AgeFirstMarij"   
[65] "RegularMarij"     "AgeRegMarij"      "HardDrugs"        "SexEver"         
[69] "SexAge"           "SexNumPartnLife"  "SexNumPartYear"   "SameSex"         
[73] "SexOrientation"   "WTINT2YR"         "WTMEC2YR"         "SDMVPSU"         
[77] "SDMVSTRA"         "PregnantNow"      "WTMEC4YR"        
Code
# Compare the number of observations in the full data to the adult data
nrow(nhanes_adult)
[1] 11778
Code
nrow(nhanes_design)
[1] 20293

5. Visualizing BMI

We let svydesign() do its magic, but how does this help us learn about the full US population? With survey methods, we can use the sampling weights to estimate the true distributions of measurements within the entire population. This works for many statistics such as means, proportions, and standard deviations.

We’ll use survey methods to estimate average BMI in the US adult population and also to draw a weighted histogram of the distribution.

Code
# Calculate the mean BMI in NHANESraw
bmi_mean_raw <- NHANESraw %>% 
    filter(Age >= 20) %>%
    summarize(avg.BMI = mean(BMI, na.rm=TRUE))
bmi_mean_raw
# A tibble: 1 × 1
  avg.BMI
    <dbl>
1    29.0
Code
# Calculate the survey-weighted mean BMI of US adults
bmi_mean <- svymean(~BMI, design = nhanes_adult, na.rm = TRUE)
bmi_mean
      mean     SE
BMI 28.734 0.1235
Code
# Draw a weighted histogram of BMI in the US population
NHANESraw %>% 
  filter(Age >= 20) %>%
    ggplot(mapping=aes(x=BMI, weight=WTMEC4YR)) + 
    geom_histogram()+
    geom_vline(xintercept = coef(bmi_mean), color="red")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 547 rows containing non-finite outside the scale range
(`stat_bin()`).

6. Is BMI lower in physically active people?

The distribution of BMI looks to be about what we might expect with most people under 40 kg/m2 and a slight positive skewness because a few people have much higher BMI. Now to the question of interest: does the distribution of BMI differ between people who are physically active versus those who are not physically active? We can visually compare BMI with a boxplot as well as formally test for a difference in mean BMI.

Code
# Load the broom library
library(broom)

# Make a boxplot of BMI stratified by physically active status
NHANESraw %>% 
  filter(Age>=20) %>%
    ggplot(mapping=aes(x=PhysActive, y= BMI, weight=WTMEC4YR))+
    geom_boxplot()
Warning: Removed 547 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Computation failed in `stat_boxplot()`.
Caused by error in `loadNamespace()`:
! aucun package nommé 'quantreg' n'est trouvé

Code
# Conduct a t-test comparing mean BMI between physically active status
survey_ttest <- svyttest(BMI~PhysActive, design = nhanes_adult)

# Use broom to show the tidy results
tidy(survey_ttest)
# A tibble: 1 × 8
  estimate statistic  p.value parameter conf.low conf.high method    alternative
     <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl> <chr>     <chr>      
1    -1.85     -9.72 4.56e-11        32    -2.23     -1.46 Design-b… two.sided  

7. Could there be confounding by smoking? (part 1)

The relationship between physical activity and BMI is likely not so simple as “if you exercise you will lower your BMI.” In fact, many other lifestyle or demographic variables could be confounding this relationship. One such variable could be smoking status. If someone smokes, is he or she more or less likely to be physically active? Are smokers more likely to have higher or lower BMI? We can examine these relationships in the survey data. Note that many people chose not to answer the smoking question, so we reduce our sample size when looking at this data.

First, let’s look at the relationship between smoking and physical activity.

Code
# Estimate the proportion who are physically active by current smoking status
phys_by_smoke <- svyby(~PhysActive, by = ~SmokeNow, 
                       FUN = svymean, 
                       design = nhanes_adult, 
                       keep.names = FALSE)

# Print the table
phys_by_smoke
  SmokeNow PhysActiveNo PhysActiveYes se.PhysActiveNo se.PhysActiveYes
1       No    0.4566990     0.5433010      0.01738054       0.01738054
2      Yes    0.5885421     0.4114579      0.01163246       0.01163246
Code
# Plot the proportions
ggplot(data = phys_by_smoke, aes(SmokeNow, PhysActiveYes, fill = SmokeNow)) +
 geom_col()+
    ylab("Proportion Physically Active")

8. Could there be confounding by smoking? (part 2)

Now let’s examine the relationship between smoking with BMI.

Code
# Estimate mean BMI by current smoking status
BMI_by_smoke <- svyby(~BMI, by = ~SmokeNow, 
                       FUN = svymean, 
                       design = nhanes_adult, 
                       na.rm = TRUE)
BMI_by_smoke
    SmokeNow      BMI        se
No        No 29.25734 0.1915138
Yes      Yes 27.74873 0.1652377
Code
# Plot the distribution of BMI by current smoking status
NHANESraw %>% 
  filter(Age>=20, !is.na(SmokeNow)) %>% 
    ggplot(mapping=aes(x=SmokeNow, y= BMI, weight=WTMEC4YR))+
    geom_boxplot()
Warning: Removed 244 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Computation failed in `stat_boxplot()`.
Caused by error in `loadNamespace()`:
! aucun package nommé 'quantreg' n'est trouvé

9. Add smoking in the mix

We saw that people who smoke are less likely to be physically active and have a higher BMI on average. We also saw that people who are physically active have a lower BMI on average. How do these seemingly conflicting associations work together? To get a better sense of what’s going on, we can compare BMI by physical activity stratified by smoking status.

Code
# Plot the distribution of BMI by smoking and physical activity status
NHANESraw %>% 
  filter(Age>=20) %>%
    ggplot(mapping=aes(x=SmokeNow, y= BMI, weight=WTMEC4YR, color=PhysActive))+
    geom_boxplot()
Warning: Removed 547 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Computation failed in `stat_boxplot()`.
Caused by error in `loadNamespace()`:
! aucun package nommé 'quantreg' n'est trouvé

10. Incorporate possible confounding in the model

In the above plot, we see that people who are physically active tend to have lower BMI no matter their smoking status, and this is true even if they didn’t answer the question. However, we also see that smokers have lower BMI in general. Also, looking closely we see the difference in BMI comparing physically active people to non-physically active people is slightly smaller in smokers than in non-smokers.

Previously, we used a simple t-test to compare mean BMI in physically active people and non-physically active people. In order to adjust for smoking status, as well as other possible confounders or predictors of BMI, we can use a linear regression model with multiple independent variables. When using survey data, we use a weighted linear regression method which is a special case of generalized linear models (GLMs).

Code
# Fit a multiple regression model
mod1 <- svyglm(BMI ~ SmokeNow * PhysActive, design = nhanes_adult)

# Tidy the model results
tidy_mod1 <- tidy(mod1)
tidy_mod1
# A tibble: 4 × 5
  term                      estimate std.error statistic  p.value
  <chr>                        <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)                  30.5      0.210    146.   2.62e-44
2 SmokeNowYes                  -2.24     0.267     -8.40 2.26e- 9
3 PhysActiveYes                -2.35     0.236     -9.97 4.96e-11
4 SmokeNowYes:PhysActiveYes     1.00     0.344      2.92 6.52e- 3
Code
# Calculate expected mean difference in BMI for activity within non-smokers
diff_non_smoke <- tidy_mod1 %>% 
    filter(term == "PhysActiveYes") %>% 
    select(estimate)
diff_non_smoke
# A tibble: 1 × 1
  estimate
     <dbl>
1    -2.35
Code
# Calculate expected mean difference in BMI for activity within smokers
diff_smoke <- tidy_mod1 %>% 
   filter(term %in% c('PhysActiveYes','SmokeNowYes:PhysActiveYes')) %>% 
    summarize(estimate = sum(estimate))
diff_smoke
# A tibble: 1 × 1
  estimate
     <dbl>
1    -1.35

11. What does it all mean?

We fit a linear regression model where the association of physical activity with BMI could vary by smoking status. The interaction between physical activity and smoking has a small p-value, which suggests the association does vary by smoking status. The difference between physically active and non-physically active people is larger in magnitude in the non-smoker population.

We should check the model fit and technical assumptions of our regression model. Then, we can conclude that physically active people tend to have lower BMI, as do smokers. Although they have similar effect sizes, we probably wouldn’t want to recommend smoking along with exercise!

In order to determine whether physical activity causes lower BMI, we would need to use causal inference methods or a randomized control study. We can adjust for other possible confounders in our regression model to determine if physical activity is still associated with BMI, but we fall short of confirming that physical activity itself can lower one’s BMI.

Code
# Adjust mod1 for other possible confounders
mod2 <- svyglm(BMI ~ PhysActive*SmokeNow + Race1 + Alcohol12PlusYr + Gender, 
               design = nhanes_adult)

# Tidy the output
tidy(mod2)
# A tibble: 10 × 5
   term                      estimate std.error statistic  p.value
   <chr>                        <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)                 33.2       0.316   105.    1.75e-33
 2 PhysActiveYes               -2.11      0.273    -7.75  5.56e- 8
 3 SmokeNowYes                 -2.23      0.303    -7.34  1.40e- 7
 4 Race1Hispanic               -1.47      0.420    -3.49  1.88e- 3
 5 Race1Mexican                -0.191     0.464    -0.412 6.84e- 1
 6 Race1White                  -2.08      0.320    -6.49  1.04e- 6
 7 Race1Other                  -3.11      0.620    -5.01  4.09e- 5
 8 Alcohol12PlusYrYes          -0.855     0.358    -2.39  2.50e- 2
 9 Gendermale                  -0.256     0.230    -1.11  2.78e- 1
10 PhysActiveYes:SmokeNowYes    0.737     0.387     1.90  6.92e- 2