Astronaut

Bagging with tidymodels and #TidyTuesday astronaut missions

From the blog by Julia Silge (Tidymodels Blog)

Lately I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to evaluate complex models. Today’s screencast focuses on bagging using this week’s #TidyTuesday dataset on astronaut missions.

Here is the code I used in the video, for those who prefer reading instead of or in addition to video.

Explore the data

## Rows: 1277 Columns: 24
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (11): name, original_name, sex, nationality, military_civilian, selectio...
## dbl (13): id, number, nationwide_number, year_of_birth, year_of_selection, m...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 289 x 2
##    in_orbit      n
##    <chr>     <int>
##  1 ISS         174
##  2 Mir          71
##  3 Salyut 6     24
##  4 Salyut 7     24
##  5 STS-42        8
##  6 explosion     7
##  7 STS-103       7
##  8 STS-107       7
##  9 STS-109       7
## 10 STS-110       7
## # ... with 279 more rows
## # A tibble: 10 x 24
##       id number nationwide_number name      original_name    sex   year_of_birth
##    <dbl>  <dbl>             <dbl> <chr>     <chr>            <chr>         <dbl>
##  1     1      1                 1 Gagarin,~ ГАГАРИН Юрий Ал~ male           1934
##  2     2      2                 2 Titov, G~ ТИТОВ Герман Ст~ male           1935
##  3     3      3                 1 Glenn, J~ Glenn, John H.,~ male           1921
##  4     4      3                 1 Glenn, J~ Glenn, John H.,~ male           1921
##  5     5      4                 2 Carpente~ Carpenter, M. S~ male           1925
##  6     6      5                 2 Nikolaye~ НИКОЛАЕВ Андрия~ male           1929
##  7     7      5                 2 Nikolaye~ НИКОЛАЕВ Андрия~ male           1929
##  8     8      6                 4 Popovich~ ПОПОВИЧ Павел Р~ male           1930
##  9     9      6                 4 Popovich~ ПОПОВИЧ Павел Р~ male           1930
## 10    10      7                 3 Schirra,~ Schirra, Walter~ male           1923
## # ... with 17 more variables: nationality <chr>, military_civilian <chr>,
## #   selection <chr>, year_of_selection <dbl>, mission_number <dbl>,
## #   total_number_of_missions <dbl>, occupation <chr>, year_of_mission <dbl>,
## #   mission_title <chr>, ascend_shuttle <chr>, in_orbit <chr>,
## #   descend_shuttle <chr>, hours_mission <dbl>, total_hrs_sum <dbl>,
## #   field21 <dbl>, eva_hrs_mission <dbl>, total_eva_hrs <dbl>

How has the duration of missions changed over time?

## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 6 rows containing non-finite values (stat_boxplot).

This duration is what we want to build a model to predict, using the other information in this per-astronaut-per-mission dataset. Let’s get ready for modeling next, by bucketing some of the spacecraft together (such as all the space shuttle missions) and taking the logarithm of the mission length.

It may make more sense to perform transformations like taking the logarithm of the outcome during data cleaning, before feature engineering and before using any tidymodels packages like recipes. This kind of transformation is deterministic, and applying it to the outcome inside a recipe can cause problems for tuning and resampling.
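As a minimal sketch of this cleaning step, assuming a handful of made-up rows in place of the real data (the `toy` values below are hypothetical, and dropping zero-length missions before taking the log is an assumption):

```r
library(dplyr)
library(stringr)

# Hypothetical toy rows standing in for the astronauts data
toy <- tibble(
  in_orbit = c("STS-42", "STS-107", "Salyut 6", "Salyut 7", "ISS", "Mir"),
  occupation = c("Pilot", "MSP", "Commander", "Flight Engineer", "MSP", "commander"),
  hours_mission = c(193, 382, 2315, 0, 4000, 3000)
)

toy_clean <- toy %>%
  mutate(
    # Bucket spacecraft: every shuttle flight becomes "STS", every Salyut station "Salyut"
    in_orbit = case_when(
      str_detect(in_orbit, "^Salyut") ~ "Salyut",
      str_detect(in_orbit, "^STS") ~ "STS",
      TRUE ~ in_orbit
    ),
    # Normalize capitalization so "Commander" and "commander" are one level
    occupation = str_to_lower(occupation)
  ) %>%
  # log(0) is -Inf, so drop zero-length missions first (an assumption here)
  filter(hours_mission > 0) %>%
  # Deterministic outcome transformation, done before any resampling
  mutate(hours_mission = log(hours_mission))

toy_clean
```

Because the log is taken here, every later step (splitting, resampling, metrics) works on the log scale without the recipe ever touching the outcome.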

Build a model

We can start by loading the tidymodels metapackage, and splitting our data into training and testing sets.
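A minimal sketch of such a split, assuming simulated data (the `toy_df` columns are hypothetical stand-ins, and the seed, the 75/25 proportion, and stratifying on the outcome are illustrative choices):

```r
library(tidymodels)

set.seed(123)

# Hypothetical data: a log-duration outcome and one predictor
toy_df <- data.frame(
  hours_mission = rnorm(200, mean = 5, sd = 1.5),
  year_of_mission = sample(1961:2019, 200, replace = TRUE)
)

# Stratifying on the outcome keeps its distribution similar in both sets
toy_split <- initial_split(toy_df, prop = 0.75, strata = hours_mission)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

c(train = nrow(toy_train), test = nrow(toy_test))
```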

Next, let’s preprocess our data to get it ready for modeling.

Let’s walk through the steps in this recipe.

  • First, we must tell the recipe() what our model is going to be (using a formula here) and what data we are using.
  • Next, update the role for the two columns that are not predictors or outcome (name and mission_title). This way, we can keep them in the data for identification later.
  • There are a lot of different occupations and spacecraft in this dataset, so let’s collapse some of the less frequently occurring levels into an “Other” category, for each predictor.
  • Finally, we can create indicator variables.
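The steps above can be sketched as follows, assuming a small made-up training set (`toy_train` and its values are hypothetical, and the 5% threshold for pooling rare levels is an illustrative choice, not necessarily the one used in the video):

```r
library(recipes)

set.seed(234)

# Hypothetical training data with the same kinds of columns as the real set
toy_train <- data.frame(
  name = paste("astronaut", 1:100),
  mission_title = paste("mission", 1:100),
  hours_mission = rnorm(100, mean = 5, sd = 1),
  military_civilian = rep(c("military", "civilian"), 50),
  occupation = rep(c("pilot", "commander", "msp", "rare job"), times = c(40, 30, 29, 1)),
  year_of_mission = sample(1961:2019, 100, replace = TRUE),
  in_orbit = rep(c("ISS", "STS", "Mir", "Other"), 25),
  stringsAsFactors = FALSE
)

toy_recipe <- recipe(hours_mission ~ ., data = toy_train) %>%
  # Keep the identifier columns in the data, but not as predictors
  update_role(name, mission_title, new_role = "id") %>%
  # Pool levels seen in fewer than 5% of rows into "Other"
  step_other(occupation, in_orbit, threshold = 0.05, other = "Other") %>%
  # Turn the remaining nominal predictors into indicator columns
  step_dummy(all_nominal_predictors())

# prep() and bake() are used here only to inspect what the recipe produces
baked <- bake(prep(toy_recipe), new_data = NULL)
names(baked)
```

In this toy data the single "rare job" row ends up in the `occupation_Other` indicator, and the id columns pass through unchanged.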

We’re going to use this recipe in a workflow() so we don’t need to stress about whether to prep() or not.

## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: None
## 
## -- Preprocessor ----------------------------------------------------------------
## 2 Recipe Steps
## 
## * step_other()
## * step_dummy()

For this analysis, we are going to build a bagging, i.e. bootstrap aggregating, model. This is an ensembling and model averaging method that:

  • improves accuracy and stability
  • reduces overfitting and variance

In tidymodels, you can create bagging ensemble models with baguette, a parsnip-adjacent package. The baguette functions create new bootstrap training sets by sampling with replacement and then fit a model to each new training set. These models are combined by averaging the predictions for the regression case, like what we have here (by voting, for classification).
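To make the idea concrete, here is a hand-rolled sketch of bagging for regression using rpart directly. It illustrates the technique only and is not how baguette is implemented; the simulated data and all object names are hypothetical.

```r
library(rpart)

set.seed(123)

# Hypothetical noisy regression problem
n <- 200
toy <- data.frame(x = runif(n, 0, 10))
toy$y <- sin(toy$x) + rnorm(n, sd = 0.3)

times <- 25
new_data <- data.frame(x = seq(0, 10, by = 0.5))

# Fit one tree per bootstrap resample and collect its predictions
member_preds <- sapply(seq_len(times), function(i) {
  boot_rows <- sample(n, size = n, replace = TRUE)  # sample rows with replacement
  fit <- rpart(y ~ x, data = toy[boot_rows, ])
  predict(fit, newdata = new_data)
})

# Regression case: the bagged prediction is the average across ensemble members
bagged_pred <- rowMeans(member_preds)
head(bagged_pred)
```

Each column of `member_preds` is one tree's view of the data. Averaging across columns smooths out the jumps of any single tree, which is where the variance reduction comes from.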

Let’s make two bagged models, one with decision trees and one with MARS models.
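A sketch of two specifications consistent with the printed output (rpart and earth engines, `times = 25`, regression mode); the object names `tree_spec` and `mars_spec` are assumptions:

```r
library(tidymodels)
library(baguette)

# Bagged decision trees: 25 bootstrap resamples, each fit with rpart
tree_spec <- bag_tree() %>%
  set_engine("rpart", times = 25) %>%
  set_mode("regression")

# Bagged MARS models: 25 bootstrap resamples, each fit with earth
mars_spec <- bag_mars() %>%
  set_engine("earth", times = 25) %>%
  set_mode("regression")

tree_spec
mars_spec
```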

## Warning: package 'baguette' was built under R version 4.1.2
## Bagged Decision Tree Model Specification (regression)
## 
## Main Arguments:
##   cost_complexity = 0
##   min_n = 2
## 
## Engine-Specific Arguments:
##   times = 25
## 
## Computational engine: rpart
## Bagged MARS Model Specification (regression)
## 
## Engine-Specific Arguments:
##   times = 25
## 
## Computational engine: earth

Let’s fit these models to the training data.
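A minimal sketch of fitting a bagged tree through a workflow, assuming simulated training data (`toy_train` is hypothetical, and `times = 5` is used only to keep the example quick):

```r
library(tidymodels)
library(baguette)

set.seed(345)

# Hypothetical training data: the outcome here is pure noise, for illustration
toy_train <- tibble(
  hours_mission = rnorm(120, mean = 5, sd = 1),
  year_of_mission = sample(1961:2019, 120, replace = TRUE),
  in_orbit = rep(c("ISS", "STS", "Mir"), 40)
)

# A workflow bundles the preprocessing recipe with the model
toy_wf <- workflow() %>%
  add_recipe(
    recipe(hours_mission ~ ., data = toy_train) %>%
      step_dummy(all_nominal_predictors())
  )

tree_spec <- bag_tree() %>%
  set_engine("rpart", times = 5) %>%
  set_mode("regression")

# fit() preps the recipe and trains the ensemble in one call
tree_fit <- toy_wf %>%
  add_model(tree_spec) %>%
  fit(data = toy_train)

predict(tree_fit, new_data = head(toy_train))
```

The same pattern applies to a `bag_mars()` specification: swap the model passed to `add_model()` and refit.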

## == Workflow [trained] ==========================================================
## Preprocessor: Recipe
## Model: bag_tree()
## 
## -- Preprocessor ----------------------------------------------------------------
## 2 Recipe Steps
## 
## * step_other()
## * step_dummy()
## 
## -- Model -----------------------------------------------------------------------
## Bagged CART (regression with 25 members)
## 
## Variable importance scores include:
## 
## # A tibble: 13 x 4
##    term                             value std.error  used
##    <chr>                            <dbl>     <dbl> <int>
##  1 year_of_mission                  799.      28.9     25
##  2 in_orbit_Other                   439.      48.0     25
##  3 occupation_flight.engineer       284.      30.8     25
##  4 in_orbit_STS                     265.      23.4     25
##  5 in_orbit_Mir                     161.      16.4     25
##  6 in_orbit_Salyut                   91.9      7.99    25
##  7 occupation_pilot                  73.5     16.3     25
##  8 occupation_msp                    71.6      7.11    25
##  9 occupation_other..space.tourist.  44.1      4.04    25
## 10 military_civilian_military        37.6      3.29    25
## 11 occupation_psp                    19.5      4.39    25
## 12 occupation_Other                  18.9      2.58    21
## 13 in_orbit_Mir.EP                   11.5      2.47    25
## == Workflow [trained] ==========================================================
## Preprocessor: Recipe
## Model: bag_mars()
## 
## -- Preprocessor ----------------------------------------------------------------
## 2 Recipe Steps
## 
## * step_other()
## * step_dummy()
## 
## -- Model -----------------------------------------------------------------------
## Bagged MARS (regression with 25 members)
## 
## Variable importance scores include:
## 
## # A tibble: 13 x 4
##    term                               value std.error  used
##    <chr>                              <dbl>     <dbl> <int>
##  1 in_orbit_STS                     100         0        25
##  2 in_orbit_Other                    91.6       1.80     25
##  3 year_of_mission                   63.8       4.41     25
##  4 in_orbit_Mir.EP                   29.7       1.74     25
##  5 in_orbit_Salyut                   24.4       2.60     24
##  6 occupation_Other                   7.41      1.19     14
##  7 military_civilian_military         4.91      0.551    15
##  8 occupation_flight.engineer         2.80      0         1
##  9 in_orbit_Mir                       0.666     0.668     4
## 10 occupation_msp                     0.372     0.353     3
## 11 occupation_other..space.tourist.   0.293     0         1
## 12 occupation_pilot                   0.236     0         1
## 13 occupation_psp                     0.172     0         1

The models return aggregated variable importance scores, and we can see that the spacecraft and year are important in both models.

Evaluate model

Let’s check how well these two models did by evaluating performance on the test data.

## # A tibble: 318 x 9
##    name  mission_title hours_mission military_civili~ occupation year_of_mission
##    <chr> <chr>                 <dbl> <chr>            <chr>                <dbl>
##  1 Tito~ Vostok 2               3.22 military         pilot                 1961
##  2 Glen~ MA-6                   1.61 military         pilot                 1962
##  3 Glen~ STS-95                 5.36 military         psp                   1998
##  4 Niko~ Soyuz 9                6.05 military         pilot                 1970
##  5 Popo~ Soyuz 14               5.93 military         commander             1974
##  6 Byko~ Soyuz 31/29            5.24 military         commander             1978
##  7 Koma~ Soyuz 1                3.29 military         commander             1967
##  8 Leon~ Voskhod 2              3.26 military         pilot                 1965
##  9 Borm~ Gemini 7               5.80 military         commander             1965
## 10 Borm~ Apollo 8               4.99 military         commander             1968
## # ... with 308 more rows, and 3 more variables: in_orbit <chr>,
## #   .pred_tree <dbl>, .pred_mars <dbl>

We can use the metrics() function from yardstick for both sets of predictions, .pred_tree and .pred_mars.
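As a small sketch of that call, assuming a made-up results table (the numbers in `toy_rs` are invented for illustration and are not the test-set predictions):

```r
library(yardstick)

# Hypothetical truth and predictions on the log-hours scale
toy_rs <- data.frame(
  hours_mission = c(3.2, 1.6, 5.4, 6.0, 5.9),
  .pred_tree = c(3.0, 2.1, 5.0, 6.3, 5.5),
  .pred_mars = c(3.5, 2.4, 4.8, 5.6, 6.4)
)

# For a numeric outcome, metrics() returns rmse, rsq, and mae
tree_metrics <- metrics(toy_rs, truth = hours_mission, estimate = .pred_tree)
mars_metrics <- metrics(toy_rs, truth = hours_mission, estimate = .pred_mars)

tree_metrics
mars_metrics
```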

## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       0.663
## 2 rsq     standard       0.769
## 3 mae     standard       0.356
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       0.712
## 2 rsq     standard       0.733
## 3 mae     standard       0.384

Both models performed pretty similarly, with RMSE values of 0.663 and 0.712 on the log scale.

Let’s make some “new” astronauts to understand the kinds of predictions our bagged tree model is making.
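One way to build such a grid, sketched with `tidyr::crossing()`; the filter conditions are assumptions chosen so that each spacecraft only appears in decades where it plausibly flew:

```r
library(dplyr)
library(tidyr)

orbit_levels <- c("ISS", "STS", "Mir", "Other")

new_astronauts <- crossing(
  # A factor keeps the spacecraft in this order rather than alphabetical
  in_orbit = factor(orbit_levels, levels = orbit_levels),
  military_civilian = "civilian",
  occupation = "Other",
  year_of_mission = seq(1960, 2020, by = 10),
  # Placeholder values for the two id columns
  name = "id",
  mission_title = "id"
) %>%
  # Drop spacecraft/decade combinations that do not make sense
  filter(
    !(in_orbit == "ISS" & year_of_mission < 2000),
    !(in_orbit == "Mir" & year_of_mission < 1990),
    !(in_orbit == "STS" & (year_of_mission < 1980 | year_of_mission > 2010))
  )

new_astronauts
```

Passing this grid as `new_data` to `predict()` on a fitted workflow gives one prediction per row, which can then be plotted by year and spacecraft.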

## # A tibble: 18 x 6
##    in_orbit military_civilian occupation year_of_mission name  mission_title
##    <fct>    <chr>             <chr>                <dbl> <chr> <chr>        
##  1 ISS      civilian          Other                 2000 id    id           
##  2 ISS      civilian          Other                 2010 id    id           
##  3 ISS      civilian          Other                 2020 id    id           
##  4 STS      civilian          Other                 1980 id    id           
##  5 STS      civilian          Other                 1990 id    id           
##  6 STS      civilian          Other                 2000 id    id           
##  7 STS      civilian          Other                 2010 id    id           
##  8 Mir      civilian          Other                 1990 id    id           
##  9 Mir      civilian          Other                 2000 id    id           
## 10 Mir      civilian          Other                 2010 id    id           
## 11 Mir      civilian          Other                 2020 id    id           
## 12 Other    civilian          Other                 1960 id    id           
## 13 Other    civilian          Other                 1970 id    id           
## 14 Other    civilian          Other                 1980 id    id           
## 15 Other    civilian          Other                 1990 id    id           
## 16 Other    civilian          Other                 2000 id    id           
## 17 Other    civilian          Other                 2010 id    id           
## 18 Other    civilian          Other                 2020 id    id

Let’s start with the decision tree model.

What about the MARS model?

You can really get a sense of how these two kinds of models work from the differences in these plots (tree vs. splines with knots), but from both, we can see that missions to space stations are longer, and missions in that “Other” category change characteristics over time pretty dramatically.

Edward Hillenaar
Writer - Data Scientist - Philosopher

My research interests include psychology, philosophy and data science of the origin and nature of human consciousness.