Astronaut
Bagging with tidymodels and #TidyTuesday astronaut missions
From the blog by Julia Silge Julia Silge Tidymodels Blog
Lately I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to evaluate complex models. Today’s screencast focuses on bagging using this week’s #TidyTuesday dataset on astronaut missions.
Here is the code I used in the video, for those who prefer reading instead of or in addition to video.
Explore the data
## Rows: 1277 Columns: 24
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (11): name, original_name, sex, nationality, military_civilian, selectio...
## dbl (13): id, number, nationwide_number, year_of_birth, year_of_selection, m...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 289 x 2
## in_orbit n
## <chr> <int>
## 1 ISS 174
## 2 Mir 71
## 3 Salyut 6 24
## 4 Salyut 7 24
## 5 STS-42 8
## 6 explosion 7
## 7 STS-103 7
## 8 STS-107 7
## 9 STS-109 7
## 10 STS-110 7
## # ... with 279 more rows
## # A tibble: 10 x 24
## id number nationwide_number name original_name sex year_of_birth
## <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 1 1 1 Gagarin,~ <U+0413><U+0410><U+0413><U+0410><U+0420><U+0418><U+041D> <U+042E><U+0440><U+0438><U+0439> <U+0410><U+043B>~ male 1934
## 2 2 2 2 Titov, G~ <U+0422><U+0418><U+0422><U+041E><U+0412> <U+0413><U+0435><U+0440><U+043C><U+0430><U+043D> <U+0421><U+0442>~ male 1935
## 3 3 3 1 Glenn, J~ Glenn, John H.,~ male 1921
## 4 4 3 1 Glenn, J~ Glenn, John H.,~ male 1921
## 5 5 4 2 Carpente~ Carpenter, M. S~ male 1925
## 6 6 5 2 Nikolaye~ <U+041D><U+0418><U+041A><U+041E><U+041B><U+0410><U+0415><U+0412> <U+0410><U+043D><U+0434><U+0440><U+0438><U+044F>~ male 1929
## 7 7 5 2 Nikolaye~ <U+041D><U+0418><U+041A><U+041E><U+041B><U+0410><U+0415><U+0412> <U+0410><U+043D><U+0434><U+0440><U+0438><U+044F>~ male 1929
## 8 8 6 4 Popovich~ <U+041F><U+041E><U+041F><U+041E><U+0412><U+0418><U+0427> <U+041F><U+0430><U+0432><U+0435><U+043B> <U+0420>~ male 1930
## 9 9 6 4 Popovich~ <U+041F><U+041E><U+041F><U+041E><U+0412><U+0418><U+0427> <U+041F><U+0430><U+0432><U+0435><U+043B> <U+0420>~ male 1930
## 10 10 7 3 Schirra,~ Schirra, Walter~ male 1923
## # ... with 17 more variables: nationality <chr>, military_civilian <chr>,
## # selection <chr>, year_of_selection <dbl>, mission_number <dbl>,
## # total_number_of_missions <dbl>, occupation <chr>, year_of_mission <dbl>,
## # mission_title <chr>, ascend_shuttle <chr>, in_orbit <chr>,
## # descend_shuttle <chr>, hours_mission <dbl>, total_hrs_sum <dbl>,
## # field21 <dbl>, eva_hrs_mission <dbl>, total_eva_hrs <dbl>
How has the duration of missions changed over time?
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 6 rows containing non-finite values (stat_boxplot).
This duration is what we want to build a model to predict, using the other information in this per-astronaut-per-mission dataset. Let’s get ready for modeling next, by bucketing some of the spacecraft together (such as all the space shuttle missions) and taking the logarithm of the mission length.
It may make more sense to perform transformations like taking the logarithm of the outcome during data cleaning, before feature engineering and using any tidymodels packages like recipes. This kind of transformation is deterministic and can cause problems for tuning and resampling.
Build a model
We can start by loading the tidymodels metapackage, and splitting our data into training and testing sets.
Next, let’s preprocess our data to get it ready for modeling.
Let’s walk through the steps in this recipe.
- First, we must tell the recipe() what our model is going to be (using a formula here) and what data we are using.
- Next, update the role for the two columns that are not predictors or outcome. This way, we can keep them in the data for identification later.
- There are a lot of different occupations and spacecraft in this dataset, so let’s collapse some of the less frequently occurring levels into an “Other” category, for each predictor.
- Finally, we can create indicator variables.
We’re going to use this recipe in a workflow() so we don’t need to stress about whether to prep() or not.
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: None
##
## -- Preprocessor ----------------------------------------------------------------
## 2 Recipe Steps
##
## * step_other()
## * step_dummy()
For this analysis, we are going to build a bagging, i.e. bootstrap aggregating, model. This is an ensembling and model averaging method that:
- improves accuracy and stability
- reduces overfitting and variance
In tidymodels, you can create bagging ensemble models with baguette, a parsnip-adjacent package. The baguette functions create new bootstrap training sets by sampling with replacement and then fit a model to each new training set. These models are combined by averaging the predictions for the regression case, like what we have here (by voting, for classification).
Let’s make two bagged models, one with decision trees and one with MARS models.
## Warning: package 'baguette' was built under R version 4.1.2
## Bagged Decision Tree Model Specification (regression)
##
## Main Arguments:
## cost_complexity = 0
## min_n = 2
##
## Engine-Specific Arguments:
## times = 25
##
## Computational engine: rpart
## Bagged MARS Model Specification (regression)
##
## Engine-Specific Arguments:
## times = 25
##
## Computational engine: earth
Let’s fit these models to the training data.
## == Workflow [trained] ==========================================================
## Preprocessor: Recipe
## Model: bag_tree()
##
## -- Preprocessor ----------------------------------------------------------------
## 2 Recipe Steps
##
## * step_other()
## * step_dummy()
##
## -- Model -----------------------------------------------------------------------
## Bagged CART (regression with 25 members)
##
## Variable importance scores include:
##
## # A tibble: 13 x 4
## term value std.error used
## <chr> <dbl> <dbl> <int>
## 1 year_of_mission 799. 28.9 25
## 2 in_orbit_Other 439. 48.0 25
## 3 occupation_flight.engineer 284. 30.8 25
## 4 in_orbit_STS 265. 23.4 25
## 5 in_orbit_Mir 161. 16.4 25
## 6 in_orbit_Salyut 91.9 7.99 25
## 7 occupation_pilot 73.5 16.3 25
## 8 occupation_msp 71.6 7.11 25
## 9 occupation_other..space.tourist. 44.1 4.04 25
## 10 military_civilian_military 37.6 3.29 25
## 11 occupation_psp 19.5 4.39 25
## 12 occupation_Other 18.9 2.58 21
## 13 in_orbit_Mir.EP 11.5 2.47 25
## == Workflow [trained] ==========================================================
## Preprocessor: Recipe
## Model: bag_mars()
##
## -- Preprocessor ----------------------------------------------------------------
## 2 Recipe Steps
##
## * step_other()
## * step_dummy()
##
## -- Model -----------------------------------------------------------------------
## Bagged MARS (regression with 25 members)
##
## Variable importance scores include:
##
## # A tibble: 13 x 4
## term value std.error used
## <chr> <dbl> <dbl> <int>
## 1 in_orbit_STS 100 0 25
## 2 in_orbit_Other 91.6 1.80 25
## 3 year_of_mission 63.8 4.41 25
## 4 in_orbit_Mir.EP 29.7 1.74 25
## 5 in_orbit_Salyut 24.4 2.60 24
## 6 occupation_Other 7.41 1.19 14
## 7 military_civilian_military 4.91 0.551 15
## 8 occupation_flight.engineer 2.80 0 1
## 9 in_orbit_Mir 0.666 0.668 4
## 10 occupation_msp 0.372 0.353 3
## 11 occupation_other..space.tourist. 0.293 0 1
## 12 occupation_pilot 0.236 0 1
## 13 occupation_psp 0.172 0 1
The models return aggregated variable importance scores, and we can see that the spacecraft and year are importance in both models.
Evaluate model
Let’s evaluate how well these two models did by evaluating performance on the test data.
## # A tibble: 318 x 9
## name mission_title hours_mission military_civili~ occupation year_of_mission
## <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 Tito~ Vostok 2 3.22 military pilot 1961
## 2 Glen~ MA-6 1.61 military pilot 1962
## 3 Glen~ STS-95 5.36 military psp 1998
## 4 Niko~ Soyuz 9 6.05 military pilot 1970
## 5 Popo~ Soyuz 14 5.93 military commander 1974
## 6 Byko~ Soyuz 31/29 5.24 military commander 1978
## 7 Koma~ Soyuz 1 3.29 military commander 1967
## 8 Leon~ Voskhod 2 3.26 military pilot 1965
## 9 Borm~ Gemini 7 5.80 military commander 1965
## 10 Borm~ Apollo 8 4.99 military commander 1968
## # ... with 308 more rows, and 3 more variables: in_orbit <chr>,
## # .pred_tree <dbl>, .pred_mars <dbl>
We can use the metrics() function from yardstick for both sets of predictions
## # A tibble: 3 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 0.663
## 2 rsq standard 0.769
## 3 mae standard 0.356
## # A tibble: 3 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 0.712
## 2 rsq standard 0.733
## 3 mae standard 0.384
Both models performed pretty similarly.
Let’s make some “new” astronauts to understand the kinds of predictions our bagged tree model is making.
## # A tibble: 18 x 6
## in_orbit military_civilian occupation year_of_mission name mission_title
## <fct> <chr> <chr> <dbl> <chr> <chr>
## 1 ISS civilian Other 2000 id id
## 2 ISS civilian Other 2010 id id
## 3 ISS civilian Other 2020 id id
## 4 STS civilian Other 1980 id id
## 5 STS civilian Other 1990 id id
## 6 STS civilian Other 2000 id id
## 7 STS civilian Other 2010 id id
## 8 Mir civilian Other 1990 id id
## 9 Mir civilian Other 2000 id id
## 10 Mir civilian Other 2010 id id
## 11 Mir civilian Other 2020 id id
## 12 Other civilian Other 1960 id id
## 13 Other civilian Other 1970 id id
## 14 Other civilian Other 1980 id id
## 15 Other civilian Other 1990 id id
## 16 Other civilian Other 2000 id id
## 17 Other civilian Other 2010 id id
## 18 Other civilian Other 2020 id id
Let’s start with the decision tree model.
What about the MARS model?
You can really get a sense of how these two kinds of models work from the differences in these plots (tree vs. splines with knots), but from both, we can see that missions to space stations are longer, and missions in that “Other” category change characteristics over time pretty dramatically.