ISBN 978-951-29-8222-6 (PRINT)
ISBN 978-951-29-8223-3 (PDF)
ISSN 2343-3159 (Painettu/Print)
ISSN 2343-3167 (Verkkojulkaisu/Online)
ESSAYS ON ECONOMIC
FORECASTING USING
MACHINE LEARNING 
Lauri Nevasalmi 
TURUN YLIOPISTON JULKAISUJA  ANNALES UNIVERSITATIS TURKUENSIS 
SARJA – SER. E OSA  TOM. 67 |  OECONOMICA |  TURKU 2020 

 
 
 
 
 
 
 
 
  
      
University of Turku
Turku School of Economics
Department of Economics
Economics
Doctoral programme of Turku School of Economics
Supervised by
Professor Heikki Kauppi
University of Turku
Turku, Finland
Professor Matti Viren
University of Turku
Turku, Finland
Reviewed by
Professor Charlotte Christiansen
Aarhus University
Aarhus, Denmark
Professor Seppo Pynnönen
University of Vaasa
Vaasa, Finland
Opponent
Professor Charlotte Christiansen
Aarhus University
Aarhus, Denmark
The originality of this publication has been checked in accordance with the University 
of Turku quality assurance system using the Turnitin OriginalityCheck service.
Painosalama, Turku, Finland 2020
  
 
 
 
 
 
   
  
 
     
     
  
   
   
      
 
    
  
   
  
 
 
     
      
  
  
   
  
   
      
  
     
   
  
UNIVERSITY OF TURKU
Turku School of Economics
Department of Economics
Economics
LAURI NEVASALMI: Essays on economic forecasting using machine 
learning
Doctoral Dissertation, 163 pp.
Doctoral Programme of Turku School of Economics
November 2020
ABSTRACT
This thesis studies the additional value introduced by different machine learning
methods to economic forecasting. Flexible machine learning methods can discover
various complex relationships in data and are well-suited for analysing so-called big
data and potential problems therein. Several new extensions to existing machine
learning methods are proposed from the viewpoint of economic forecasting.
In Chapter 2, the main objective is to predict U.S. economic recession periods
with a high-dimensional dataset. A cost-sensitive extension to the gradient boosting 
machine learning algorithm is proposed, which takes into account the scarcity of
recession periods. The results show how the cost-sensitive extension outperforms the
traditional gradient boosting model and leads to more accurate recession forecasts.
Chapter 3 considers a variety of machine learning methods for
predicting daily returns of the S&P 500 stock market index. A new multinomial
approach is suggested, which allows us to focus on predicting the large absolute
returns instead of the noisy variation around zero return. In terms of both the
statistical and economic evaluation criteria, gradient boosting turns out to be the
best-performing machine learning method.
In Chapter 4, the asset allocation decisions between risky and risk-free assets are
determined using a flexible utility maximization based approach. Instead of the
commonly considered two-step approach, where portfolio weights are based on the
excess return predictions obtained with statistical predictive regressions, here the
optimal weights are found directly by incorporating a custom objective function into
the gradient boosting algorithm. The empirical results using monthly U.S. market
returns show that the utility-based approach leads to substantial and quantitatively
meaningful economic value over the past approaches.
UNIVERSITY OF TURKU
Turku School of Economics
Department of Economics
Economics
LAURI NEVASALMI: Essays on economic forecasting using machine
learning
Doctoral Dissertation, 163 pp.
Doctoral Programme of Turku School of Economics
November 2020
TIIVISTELMÄ
This dissertation examines what added value machine learning methods can
bring to economic forecasting applications. Flexible machine learning methods are
able to model complex functional forms and are well suited to analysing big data,
i.e. large datasets. The dissertation extends machine learning methods
specifically from the viewpoint of economic forecasting applications.
Chapter 2 forecasts U.S. economic recession periods using a very
large set of predictors. The gradient boosting machine learning method is extended
to take into account a key feature of the data, namely that recession periods occur
fairly rarely, as the economy is in expansion most of the time. The results
show that the extended gradient boosting method is able to forecast
future recession months considerably more accurately than traditional methods.
Chapter 3 employs several different machine learning methods to forecast the
daily returns of the S&P 500 stock market index. Unlike earlier approaches,
this study categorizes returns into three classes
with the aim of focusing on forecasting the more informative large positive and
negative returns. Based on the results, gradient boosting turns out to be
the best method in terms of both statistical and economic forecast
criteria.
Chapter 4 examines how, instead of the traditional two-step approach relying
on return forecasts, the allocation decision between a risky and a risk-free
asset can be formed directly on the basis of the utility experienced by the investor.
Utility maximization is carried out using the gradient boosting method and the
custom objective function it allows. Empirical results based on U.S. data
show that portfolio allocation based on investor utility
leads to more profitable allocation decisions than the traditional two-step approach.
Acknowledgements 
After completing my Master’s Thesis the idea of learning more about binary 
dependent variable models grew inside me. In Finland there are only a few 
people with an in-depth knowledge on this matter. Soon after starting my 
career as a doctoral student I found myself working with both of them. 
First of all, I would like to thank my thesis supervisor Professor Heikki 
Kauppi. Thank you for introducing me to the Turku School of Economics and 
providing me the chance to pursue my goals. Also thank you for the endless 
amount of flexibility when it comes to meeting arrangements or adapting to
new situations in life. We have had many extremely fruitful discussions from 
which I have learned so much. Who would have thought that your vague idea 
to give a closer look at one particular paper on machine learning resulted in 
an entire Doctoral Thesis and an additional Master’s degree on the matter? 
Next my deepest gratitude goes towards Associate Professor Henri Nyberg.
You have always found the time to carefully read and comment on my ongoing
work. Without your help throughout the years I would not be writing this.
Also thank you for constantly challenging me as a young scientist. Whether
it was about the first public appearance or the first submission, you always pushed
me forward but also provided invaluable help when needed. I am extremely
grateful for all the things you have done for me. I cannot thank you enough,
so I sincerely hope the color of Manchester will be red in the near future.
Thank you Susanne and Salla for getting the best out of the hectic first year
as a doctoral student. The unforgettable year provided many great memories 
and hopefully life-long friendships. I wish to thank the pre-examiners of 
my thesis, Professor Charlotte Christiansen and Professor Seppo Pynnönen, 
for their insightful comments and suggestions. The long list of people who 
have given helpful comments on different parts of this thesis receive my 
utmost appreciation at the beginning of each corresponding chapter. The 
financial support from the Emil Aaltonen foundation (personal grant and
project funding) and the Academy of Finland (grant 321968) is also gratefully
acknowledged. Similarly, I am greatly indebted to my parents for the help
during the times when the financial support has felt insufficient.
Finally, I would like to thank my wife Mari. Thank you for always believing 
in me and being there for me when things seemed helpless. I am forever 
grateful for being able to reach towards my dreams knowing that our children 
are having the best of times with you. And no matter what life brings ahead 
of us after graduation, I could not feel more confident as long as I am with
you. Thank you Leevi, Milli and Masi for reminding me how the little things 
in life matter and always putting a smile on my face. This thesis is dedicated 
to all of you. 
Espoo, October 2020 
Lauri Nevasalmi 
Contents
Abstract
Tiivistelmä
1 Introduction
1.1 Background
1.2 Econometric framework
1.2.1 More flexible functional forms using machine learning
1.2.2 Loss function
1.3 Summary of the essays
1.3.1 Recession forecasting with big data
1.3.2 Forecasting multinomial stock returns using machine learning methods
1.3.3 Moving forward from predictive regressions: Boosting asset allocation decisions
References
2 Recession forecasting with big data
2.1 Introduction
2.2 Methodology
2.2.1 Gradient boosting
2.2.2 Cost-sensitive gradient boosting with class weights
2.2.3 Regularization parameters in gradient boosting
2.3 Results
2.3.1 Data and model setup
2.3.2 In-sample results
2.3.3 Out-of-sample results
2.4 Conclusions
References
3 Forecasting multinomial stock returns using machine learning methods
3.1 Introduction
3.2 Methodology
3.2.1 Multinomial stock returns
3.2.2 Machine learning methods
3.3 Data and model setup
3.3.1 Data
3.3.2 Tuning parameter optimization
3.4 Empirical results
3.4.1 Statistical predictive performance
3.4.2 Economic predictive performance
3.5 Conclusions
References
Appendix A: Full predictor set
Appendix B: Model selection results of the tree-based methods
4 Moving forward from predictive regressions: Boosting asset allocation decisions
4.1 Introduction
4.2 Methodology
4.2.1 Starting point and two-step statistical approach
4.2.2 Objective function
4.2.3 Customized gradient boosting
4.3 Empirical results
4.3.1 Dataset
4.3.2 Evaluation and benchmarks
4.3.3 In-sample (full sample) results
4.3.4 In-sample extensions
4.3.5 Out-of-sample forecasting results
4.4 Discussion
4.5 Conclusions
References
Tables and Figures
Appendix A: Additional empirical results
Appendix B: Comparison to Brandt and Santa-Clara (2006)
Appendix C: Tuning customized gradient boosting

Chapter 1 
Introduction 
1.1 Background 
The amount of data has grown exponentially in the previous decade. The
ever-increasing flow of information opens up a wide range of completely new
possibilities for economic research. Various studies show how web searches
or satellite images, for example, can be used to predict economic activity (see
e.g., Ettredge, Gerdes and Karuga, 2005; Choi and Varian, 2012; Henderson,
Storeygard and Weil, 2012). New data-related issues, such as missing data, a
large number of potential predictors and mixed data types, arise with such
'modern' economic data. Mullainathan and Spiess (2017) emphasize how the
term big data reflects the change in both the scale and nature of data. The
demand for flexible methods that can handle such datasets and the potential
problems therein is growing rapidly in both industry and academia.
The popularity of flexible machine learning methods has been increasing in
the previous two decades. Machine learning methods are typically introduced
to economic forecasting as a tool to exploit the potential non-linearities in the
dataset (see e.g., Stock and Watson, 1999; Kuan and Liu, 1995). Other interesting
features of machine learning, such as the model selection capabilities of
certain methods, have received attention only recently (see e.g., Bühlmann,
2006; Wohlrabe and Buchen, 2014). This thesis puts emphasis on both approaches
and studies the additional value introduced by different machine
learning methods to economic forecasting, not just by allowing for more
flexible functional forms in the data but also in terms of the ability to deal
with various issues that can be encountered in economic forecasting problems
with big data. These include studying the class imbalance problem with a
high-dimensional dataset or even customizing the entire objective function to
better meet the needs of a particular forecasting problem.
Class imbalance and its effects on classification are well covered in
the machine learning literature (see e.g., Galar et al., 2012). Surprisingly, the
class imbalance problem has not been properly taken into consideration in
previous economic research. That is the case despite the numerous potential
economic applications, such as recession forecasting or fraud detection, where
the binary response can be quite severely imbalanced. Moreover, most of the
standard machine learning algorithms assume balanced class distributions
and hence the forecasting performance can deteriorate in the presence of
class imbalance (see He and Garcia, 2009). On the other hand, customizing the
entire objective function is connected to the recent discussion by Elliott and
Timmermann (2016, Chapter 2) on what the appropriate objective function is
in econometric inference. In this thesis we introduce an innovative synthesis
between financial economics and machine learning. This is done by utilizing
the gradient boosting machine learning algorithm as a tool to optimize a
custom objective function motivated by economics and finance.
In an ideal situation the economic forecasting model is based on economic
theory and the main analysis concerns specifying the exact functional form for
the relationship between the dependent variable and the predictors. The majority
of the time, however, the final model composition (and the functional form)
is purely empirical. But how to identify the optimal predictor set? Several
potential predictor candidates have been discovered in the economic literature
starting from the pioneering work of Mitchell and Burns (1938). Different
financial variables have shown predictive power beyond the benchmark autoregressive
specifications when forecasting GDP growth, whereas the yield
curve is the single best predictor of economic recession periods, for example
(see e.g., Stock and Watson, 2003; Wheelock and Wohar, 2009). In such a
data-rich environment induced by the recent growth in the availability and
accessibility of data, the number of potential predictive variables can be very
large (see e.g., Bühlmann, 2006). Traditional econometric methods can only
handle very limited predictor sets at a time and some sort of model selection
procedure is required. Such a procedure can easily become computationally
very demanding or even infeasible as the number of predictors grows.
The ability to handle large predictor sets also varies between different machine
learning methods. The risk of overfitting the data with large predictor
sets is a serious issue for many commonly used machine learning methods.
Methods such as support vector machines or nearest neighbor are known
to suffer from the curse of dimensionality (see e.g., Hastie, Tibshirani and
Friedman, 2009). The ensemble methods random forest and gradient boosting
machine, on the other hand, have an attractive feature: model selection
is conducted simultaneously with model estimation. These methods require
no prior knowledge of the relevance of different predictors and can provide
important details about the most informative predictive variables. For a more
detailed description of the internal variable selection associated with these
two methods, see Breiman (2001) and Friedman (2001).
Unlike in various business-oriented applications, the computational efficiency
of the forecasting method does not play a significant role, as the number
of observations in economic datasets is typically quite limited. Economic forecasting,
however, places a different type of restriction on the selected method.
Although accurate final predictions, possibly exploiting non-linearities in the
data, are the ultimate goal, the econometrician favours interpretable models
(Einav and Levin, 2014). Some machine learning methods, such as neural
networks, fall in the category of so-called 'black-box' models (Hastie et al.,
2009). Such models provide no exact information on the linkages between the
response and the predictive variables, or this information is hard to visualize.
Thus the interpretability of the model, in addition to the ability to handle
potentially large predictor sets, needs to be taken into account when selecting
the appropriate forecasting method with modern economic data.
The rest of this introductory chapter is organized as follows. Section 1.2 
presents the econometric framework with different machine learning methods 
by allowing for various functional forms and loss functions. In addition, short 
summaries of the three essays are presented in Section 1.3. 
1.2 Econometric framework 
This section considers the underlying functional forms and objective functions
of the five machine learning methods used in this thesis. These are the gradient
boosting machine, random forest, neural networks, support vector machine
and k-nearest neighbor method.
1.2.1 More flexible functional forms using machine learning
The goal of statistical modelling and prediction is to specify the functional
relationship between the dependent variable and the predictors. In economic
forecasting we typically assume the response at time $t + h$, where $h$ is the
forecast horizon, to be a function of the predictive variables and a forecast error
\[
y_{t+h} = F(x_t) + e_{t+h}, \qquad t = 1, \ldots, N, \tag{1.1}
\]
where $y_{t+h}$ is the response, $F(x_t)$ is a potentially non-linear function of the
predictors $x_t$ and $e_{t+h}$ is the forecast error. In order to make the function
approximation strategy feasible, $F(\cdot)$ is typically restricted to a certain
parameterized class of functions.
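The timing convention in (1.1) can be made concrete with a short sketch that pairs each predictor vector dated t with the response observed h periods later. The function name make_forecast_pairs and the toy data are illustrative assumptions, not part of the thesis.

```python
def make_forecast_pairs(x, y, h):
    """Pair each predictor vector x[t] with the response y[t + h].

    x: list of predictor vectors, y: list of responses on the same time index.
    Returns (X, Y) where Y[i] is observed h periods after X[i].
    """
    if h < 0:
        raise ValueError("forecast horizon h must be non-negative")
    n = len(y) - h  # number of usable (x_t, y_{t+h}) pairs
    X = [x[t] for t in range(n)]
    Y = [y[t + h] for t in range(n)]
    return X, Y


# Toy series with five time periods and a single predictor.
x = [[0.1], [0.2], [0.3], [0.4], [0.5]]
y = [1.0, 2.0, 3.0, 4.0, 5.0]
X, Y = make_forecast_pairs(x, y, h=2)
print(X)  # predictors dated t
print(Y)  # responses dated t + 2
```

With a horizon of h, the last h predictor observations have no matching response and are dropped from the estimation sample.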
The linear regression model has been the cornerstone of statistical analysis
for several decades (Hastie et al., 2009). In the linear regression model we
assume the underlying functional form to be a linear function of the predictor
variables
\[
F(x_t) = \beta_0 + x_t' \beta, \tag{1.2}
\]
where $\beta_0$ is the constant term and $x_t$ is a $p \times 1$ vector of predictors. For many
real-world applications the assumption of linearity is very restrictive. Machine
learning methods allow for more flexible functional forms and can discover
all sorts of complex relationships in data. It should, however, be noted that
the linearity assumption could easily be incorporated into the majority of these
methods as well.
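As a minimal illustration of (1.2), the sketch below evaluates the linear form for a given coefficient vector and, for the special case of a single predictor, estimates the coefficients by least squares. The helper names and the toy data are assumptions made for this example only.

```python
def predict_linear(beta0, beta, x_t):
    """Evaluate F(x_t) = beta0 + x_t' beta for a list-valued x_t."""
    return beta0 + sum(b * v for b, v in zip(beta, x_t))


def fit_simple_ols(x, y):
    """Least squares estimates (beta0, beta) for a single scalar predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    beta0 = mean_y - beta * mean_x
    return beta0, beta


# Toy data generated exactly from y = 1 + 2x, so the fit recovers (1, 2).
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
beta0, beta = fit_simple_ols(x, y)
print(beta0, beta)
print(predict_linear(beta0, [beta], [4.0]))
```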
As an extension to the linear model in (1.2), let us consider a single hidden-layer
feed-forward neural network with linear output. In this case the network
consists of three layers: the input layer, a single hidden layer with $M$ hidden
nodes and the output layer. The subsequent layers in the network are connected
with each other through weights. The network architecture induced
by the assumptions of a single hidden layer and feed-forward weights is the
most commonly used one in practice (Bishop, 2006). The final model in such
a network is a weighted sum of linear combinations of the original predictors,
each passed through a non-linear activation function
\[
F(x_t) = \sum_{m=1}^{M} \gamma_m \sigma(b_m + x_t' \alpha_m), \tag{1.3}
\]
where $b_m$ is a bias term and $\alpha_m$ is a $p \times 1$ vector of weights connecting the
predictors to the hidden node $m$. Similarly, $\gamma_m$ is the weight connecting the
hidden node $m$ to the final output node. One can note the resemblance of
the term inside the brackets to the linear model provided in (1.2). Without
the non-linear activation function $\sigma(\cdot)$, which usually is a sigmoid function
or hyperbolic tangent, the final output would be a linear combination of
linear combinations and hence also linear. The total number of hidden nodes
$M$ limits the range of functions that a single hidden layer neural network
model can approximate. With a sufficiently large $M$ the model in (1.3) can
approximate any continuous function to arbitrary accuracy (see e.g., Hornik,
Stinchcombe and White, 1989; Cybenko, 1989).
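The forward pass in (1.3) can be sketched in a few lines. The weights below are arbitrary illustrative values rather than estimates, and the function name network_output is an assumption of this example. Replacing the sigmoid with the identity shows how the network collapses to a linear function of the predictors.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def network_output(x_t, b, alpha, gamma, activation=sigmoid):
    """F(x_t) = sum_m gamma[m] * activation(b[m] + x_t' alpha[m])."""
    total = 0.0
    for b_m, alpha_m, gamma_m in zip(b, alpha, gamma):
        z = b_m + sum(a * v for a, v in zip(alpha_m, x_t))  # linear part of node m
        total += gamma_m * activation(z)
    return total


# Illustrative weights for M = 2 hidden nodes and p = 2 predictors.
b = [0.5, -1.0]
alpha = [[1.0, 2.0], [0.0, -1.0]]
gamma = [2.0, 3.0]

print(network_output([1.0, 1.0], b, alpha, gamma))  # non-linear output
# With an identity activation the model reduces to -2 + 2*x1 + x2.
print(network_output([1.0, 1.0], b, alpha, gamma, activation=lambda z: z))
```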
In the support vector machine by Vapnik (1995) non-linearity is introduced
to the model by transforming the original input feature space into
an enlarged space using non-linear functions. In a classification context the
data could be linearly separable in this higher dimensional feature space.
By considering a $d$-dimensional set of transformation functions we have
$h(x_t) = (h_1(x_t), \ldots, h_d(x_t))'$ and the final model can be written as
\[
F(x_t) = \beta_0 + h(x_t)' \beta. \tag{1.4}
\]
Note that although the classification problem might be linearly separable in
the higher dimensional feature space, when transformed back to the original
feature space it results in a non-linear decision boundary. Instead of the exact
transformation $h(x_t)$, a kernel function, which computes inner products in
the transformed space, is sufficient. A radial basis function and a $d$th-degree
polynomial are typical choices for the kernel function (Hastie et al., 2009).
With a linear kernel the final output is also a linear function of the predictors.
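The remark that a kernel function is sufficient can be checked numerically. The sketch below uses a homogeneous second-degree polynomial kernel on two-dimensional inputs, chosen here only because its explicit feature map is small enough to write out, and a radial basis function kernel whose bandwidth parameter gamma is an illustrative assumption.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def h_quadratic(x):
    """Explicit feature map for 2-d inputs whose inner product equals (x'z)^2."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


def poly2_kernel(x, z):
    """Homogeneous second-degree polynomial kernel, computed in the original space."""
    return dot(x, z) ** 2


def rbf_kernel(x, z, gamma=1.0):
    """Radial basis function kernel exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


x = [1.0, 2.0]
z = [3.0, -1.0]
print(poly2_kernel(x, z))                 # kernel evaluated without the transformation
print(dot(h_quadratic(x), h_quadratic(z)))  # same value via the explicit transformation
print(rbf_kernel(x, z))
```

The first two printed values agree up to floating-point rounding, which is the sense in which the kernel replaces the explicit transformation.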
The linear regression model in (1.2) assumes a global linear function, whereas
in the k-nearest neighbor method originally presented by Fix and Hodges
(1951) the underlying function is assumed to be well approximated by a locally
constant function. Let $\kappa(x_t)$ denote the indices of the $k$ observations closest to
$x_t$ based on some distance metric. With a continuous response the final model
in the k-nearest neighbor method is simply an average of the $k$ responses
\[
F(x_t) = \frac{1}{k} \sum_{i \in \kappa(x_t)} y_{i+h}, \tag{1.5}
\]
where $y_{i+h}$ is the response attached to index $i$ (taking into account the forecasting
horizon $h$). In the classification approach the final model is based
on a majority vote between the $k$ data points and ties are broken at random.
The parameter $k$ controls the complexity of the model, where smaller values
of $k$ result in more flexible models (Bishop, 2006).
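A minimal sketch of the k-nearest neighbor prediction in (1.5) for a continuous response follows. It assumes the Euclidean distance and assumes that y_train[i] already holds the response aligned h periods after X_train[i]. Distance ties are resolved by index here, which is a simplification.

```python
import math


def knn_predict(x_new, X_train, y_train, k):
    """Average the responses of the k training points closest to x_new."""
    distances = [(math.dist(x_new, x_i), i) for i, x_i in enumerate(X_train)]
    distances.sort()  # nearest first; equal distances are ordered by index
    neighbours = [i for _, i in distances[:k]]
    return sum(y_train[i] for i in neighbours) / k


X_train = [[0.0], [1.0], [2.0], [10.0]]
y_train = [0.0, 1.0, 2.0, 10.0]

print(knn_predict([1.2], X_train, y_train, k=1))  # follows the single nearest point
print(knn_predict([1.2], X_train, y_train, k=2))
print(knn_predict([1.2], X_train, y_train, k=4))  # global average of the responses
```

Setting k equal to the sample size returns the global mean for every input, which illustrates why larger values of k give less flexible models.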
In the ensemble methods random forest and gradient boosting the final
model, also called an ensemble, has an additive form
\[
F(x_t) = \sum_{m=1}^{M} f_m(x_t), \tag{1.6}
\]
where $f_m(x_t)$ is the base learner function at iteration $m$. Note that the single
hidden-layer feed-forward network in (1.3) and the support vector machine in
(1.4) can also be seen as additive models with unique base learner functions
(Friedman, 2001). In random forest the base learner in (1.6) is a classification
or regression tree, while in gradient boosting one can consider for example
linear or spline based functions as well.
The power of random forest is based on creating a collection of independent
trees that are minimally correlated with each other (Breiman, 2001).
Depending on the learning problem the final prediction is either a majority
vote or an average of the predictions induced by each individual tree. Gradient
boosting takes a slightly different approach as the final model is built
in a forward stagewise manner by adding new base learner functions to the
ensemble that best fit the negative gradient of the loss function. Provided
with a sufficient amount of data and a flexible base learner, such as regression
trees or smoothing splines, boosting can basically approximate any kind of
functional form.
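The forward stagewise construction can be sketched for the squared error loss, where the negative gradient with respect to the current fit is simply the residual. The sketch assumes a scalar predictor, single-split regression stumps as base learners and a shrinkage factor learning_rate. These choices are made for brevity and are not taken from the text above.

```python
def fit_stump(x, residuals):
    """Least squares regression stump (one split) on a scalar predictor."""
    best = None
    xs = sorted(set(x))
    for left, right in zip(xs[:-1], xs[1:]):
        split = 0.5 * (left + right)
        lo = [r for xi, r in zip(x, residuals) if xi <= split]
        hi = [r for xi, r in zip(x, residuals) if xi > split]
        mean_lo, mean_hi = sum(lo) / len(lo), sum(hi) / len(hi)
        sse = sum((r - mean_lo) ** 2 for r in lo) + sum((r - mean_hi) ** 2 for r in hi)
        if best is None or sse < best[0]:
            best = (sse, split, mean_lo, mean_hi)
    _, split, mean_lo, mean_hi = best
    return lambda xi: mean_lo if xi <= split else mean_hi


def gradient_boost(x, y, n_iter=50, learning_rate=0.1):
    """Build F = f0 + learning_rate * sum_m f_m by fitting stumps to residuals."""
    f0 = sum(y) / len(y)  # initial constant fit
    learners = []
    fitted = [f0] * len(y)
    for _ in range(n_iter):
        # Negative gradient of 0.5 * (y - F)^2 with respect to F is y - F.
        residuals = [yi - fi for yi, fi in zip(y, fitted)]
        stump = fit_stump(x, residuals)
        learners.append(stump)
        fitted = [fi + learning_rate * stump(xi) for fi, xi in zip(fitted, x)]

    def predict(xi):
        return f0 + learning_rate * sum(s(xi) for s in learners)

    return predict


# Toy step function: the boosted stumps move the constant fit towards 0 and 1.
x = [float(i) for i in range(10)]
y = [0.0 if xi < 5 else 1.0 for xi in x]
predict = gradient_boost(x, y)
print(predict(2.0), predict(7.0))
```

Each iteration removes a fixed share of the remaining residual in this toy example, so the fit approaches the step function gradually rather than in one step.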
1.2.2 Loss function 
In the general estimation problem the goal is to find the function $F(x_t)$ that
minimizes the expected loss of some predefined loss function
\[
\hat{F}(x_t) = \arg\min_{F(x_t)} E\left[ L\left(y_{t+h}, F(x_t)\right) \right], \tag{1.7}
\]
where $y_{t+h}$ is the response and $x_t$ is a vector of predictor variables. As was
considered in the previous section, the optimal function with each method is
assumed to belong to some predefined class of functions. The actual functional
form of the loss function $L(\cdot)$ in (1.7) depends on both the response and the
considered machine learning method. In economic forecasting problems the
response is typically either continuous or discrete with two or more classes.
With a continuous response both the traditional linear regression model
and the majority of the machine learning methods aim to minimize the (sample)
sum of squared errors (SSE)
\[
L_{SSE} = \sum_{t=1}^{N} \left( y_{t+h} - F(x_t) \right)^2. \tag{1.8}
\]
In the binary two-class classification problem the typically considered loss
function is the binomial deviance. Using more familiar terminology, the binomial
deviance can simply be expressed as the negative log-likelihood (Hastie
et al., 2009). Assuming a logistic transformation function, the deviance can be
written as
\[
L_{Dev} = -\sum_{t=1}^{N} \left( y_{t+h} F(x_t) - \log\left(1 + \exp(F(x_t))\right) \right). \tag{1.9}
\]
In the multinomial extension there are $L$ classes and a separate functional
estimate $F_l(x_t)$ for each class. By denoting each class $l$ of the multinomial
response with a separate binary variable $y_{t+h,l}$, the multinomial deviance can
be written as
\[
L_{MDev} = -\sum_{t=1}^{N} \sum_{l=1}^{L} y_{t+h,l} \log(p_{t,l}(x_t)), \tag{1.10}
\]
where $p_{t,l}(x_t)$ is the symmetric multiple logistic transform for class $l$.
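The three loss functions (1.8) to (1.10) can be written out directly. The sketch assumes a 0/1 coding of the binary response with F interpreted as the log-odds, and it takes the symmetric multiple logistic transform to be exp(F_l) divided by the sum of exp(F_j) over all classes. The function names are illustrative.

```python
import math


def sse_loss(y, F):
    """Sum of squared errors, as in (1.8)."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, F))


def binomial_deviance(y, F):
    """Negative log-likelihood (1.9) with y in {0, 1} and F the log-odds."""
    return -sum(yi * fi - math.log(1.0 + math.exp(fi)) for yi, fi in zip(y, F))


def multiple_logistic(scores):
    """Class probabilities exp(F_l) / sum_j exp(F_j), computed stably."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multinomial_deviance(Y, F):
    """(1.10): Y[t] is a one-hot list and F[t] holds one score per class."""
    loss = 0.0
    for y_t, f_t in zip(Y, F):
        p_t = multiple_logistic(f_t)
        loss -= sum(y_tl * math.log(p_tl) for y_tl, p_tl in zip(y_t, p_t))
    return loss


print(sse_loss([1.0, 2.0], [0.5, 2.5]))
print(binomial_deviance([1, 0], [0.7, -0.3]))
# With two classes and the second score fixed at zero, (1.10) reproduces (1.9).
print(multinomial_deviance([[1, 0], [0, 1]], [[0.7, 0.0], [-0.3, 0.0]]))
```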
The support vector machine takes a slightly different approach. The original
idea of the support vector machine is presented in a classification context,
where the final goal is to produce a margin-maximizing classifier (see Vapnik,
1995). Evgeniou, Pontil and Poggio (2000), however, show that the optimization
problem in the support vector machine can also be presented in terms of
minimizing a regularized loss function
\[
L_{SVM} = \sum_{t=1}^{N} V(y_{t+h}, F(x_t)) + \lambda J(F). \tag{1.11}
\]
In support vector machines we have a specific form for both the general error
measure $V(\cdot)$ and the penalty function $J(\cdot)$. The general error measure also
depends on whether we are dealing with a regression or classification problem.
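To make the structure of (1.11) concrete, the sketch below evaluates the objective for a linear classifier. The text does not spell out V and J, so the example assumes the choices commonly associated with support vector classification: the hinge loss with labels coded as -1 and +1, and the squared Euclidean norm of the coefficient vector as penalty.

```python
def hinge(y, f):
    """Hinge loss max(0, 1 - y * f) for a label y in {-1, +1} (assumed form of V)."""
    return max(0.0, 1.0 - y * f)


def svm_objective(X, y, beta0, beta, lam):
    """Sum of hinge losses plus lam * ||beta||^2 for a linear F (assumed form of J)."""
    loss = 0.0
    for x_t, y_t in zip(X, y):
        f_t = beta0 + sum(b * v for b, v in zip(beta, x_t))
        loss += hinge(y_t, f_t)
    penalty = sum(b * b for b in beta)
    return loss + lam * penalty


X = [[2.0], [-2.0], [0.5]]
y = [1, -1, -1]
# The first two points lie outside the margin and contribute no loss;
# the third is misclassified and contributes 1.5.
print(svm_objective(X, y, beta0=0.0, beta=[1.0], lam=0.5))
```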
Random forest and the k-nearest neighbor algorithm cannot be conceived as
direct optimization procedures. The k-nearest neighbor algorithm by Fix and
Hodges (1951) is a model-free method since the classification or prediction
of a new observation is based purely on the data points of the training set.
The search for the k nearest neighbors of each new data point is based on
some distance measure, say the Euclidean distance, but this distance metric
cannot be considered a conventional loss function. Similarly for random
forest: although each independent tree in the final ensemble is optimized based
on the usual criteria for classification and regression trees, such as the
least squares criterion or the Gini index, the entire ensemble is not directly
optimizing any loss function (Wyner et al., 2017).
1.3 Summary of the essays 
In this section, we review three empirical applications of economic forecasting
using machine learning methods. In Section 1.3.1, we study the predictability
of binary economic recession periods. Section 1.3.2 considers predicting multinomial
stock returns, where the continuous daily stock returns are discretized
into three classes. In Section 1.3.3 we take a step further from the classical
stock return predictability studies and focus on optimizing asset allocation
decisions directly.
1.3.1 Recession forecasting with big data 
In this chapter we study the ability of the gradient boosting machine to forecast
economic recession periods in the U.S. using high-dimensional data.
Recessions are shorter events than expansion periods, leading to quite
heavily imbalanced binary class labels. Class imbalance combined with a large number
of predictive variables creates a risk of forecasting only the non-recessionary
periods well. We propose a new cost-sensitive extension to the gradient boosting
model using binary class weights. In this approach the sample counterpart
of (1.7) is a weighted average instead of the arithmetic mean. We use the data-based
approach by Zhou (2012) and choose the binary weights according to
the class imbalance observed in the dataset. To the best of our knowledge, a
cost-sensitive gradient boosting model using class weights has not been utilized in
previous economic research.
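The idea of replacing the arithmetic mean of the losses with a weighted average can be sketched as follows. The weighting rule used here, where each class is weighted by the sample share of the other class, is only one hypothetical way of letting the weights reflect the observed imbalance. The exact rule used in Chapter 2 is not stated in this summary.

```python
import math


def class_weights(y):
    """Hypothetical rule: weight each class by the sample share of the other class."""
    share_pos = sum(y) / len(y)
    return {1: 1.0 - share_pos, 0: share_pos}


def weighted_deviance(y, F, weights):
    """Weighted average of the binomial deviance terms in (1.9).

    Assumes both classes are present, so that the total weight is positive.
    """
    num = 0.0
    den = 0.0
    for y_t, f_t in zip(y, F):
        w_t = weights[y_t]
        num += w_t * -(y_t * f_t - math.log(1.0 + math.exp(f_t)))
        den += w_t
    return num / den


# One recession month (1) among four observations.
y = [1, 0, 0, 0]
w = class_weights(y)
print(w)
# The rare class now carries the same total weight as the common class.
print(w[1] * y.count(1), w[0] * y.count(0))
print(weighted_deviance(y, [0.0, 0.0, 0.0, 0.0], w))
```

Under this rule a misclassified recession month costs as much as several misclassified expansion months, which counteracts the tendency to fit only the majority class.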
The results confirm the finding of Blagus and Lusa (2017), who note that
the performance of a gradient boosting model can be rather poor with high
class imbalance, especially when a high-dimensional dataset is used. The
cost-sensitive extension to the gradient boosting model using class weights can
take the class imbalance problem into account and produces strong warning
signals for the U.S. recessions at different forecasting horizons. Different
types of interest rate spreads are the most important predictors at each
forecasting horizon.
1.3.2 Forecasting multinomial stock returns using machine learning methods
The five different machine learning methods presented in Section 1.2 are
compared in their ability to predict the daily returns of the S&P 500 stock
market index. The majority of previous research focuses on predicting the
actual level and, to a lesser extent, the direction of stock returns (i.e.
the signs of stock returns, indicating that we are interested in a
binary-dependent time series generated from returns). In this chapter the
returns are categorized into three classes based on the upper and lower
quartiles of the return series. The less informative and noisy fluctuation
associated with small absolute returns is isolated in one class and the other
two focus on predicting the large absolute returns. To the best of our
knowledge such a multinomial approach has not
been taken into consideration in previous economic research. 
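The three-class construction described above can be sketched as follows. This is only an illustration: the function name, the 0/1/2 class coding, the use of strict inequalities at the quartiles and the use of full-sample quartiles are all assumptions, not details given in the text.

```python
import numpy as np

def three_class_labels(returns):
    """Label returns by the lower and upper quartiles of the series.

    Assumed coding: 0 = large negative return, 1 = small absolute return,
    2 = large positive return.
    """
    returns = np.asarray(returns, dtype=float)
    q1, q3 = np.quantile(returns, [0.25, 0.75])
    return np.where(returns < q1, 0, np.where(returns > q3, 2, 1))

# Hypothetical daily returns in percent
example_returns = np.array([-3.0, -1.0, -0.5, 0.0, 0.2, 0.4, 1.0, 2.5])
labels = three_class_labels(example_returns)
```

By construction roughly half of the observations fall into the middle class and a quarter into each of the two tail classes.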
All the machine learning methods examined produce multinomial classification
results which are significant from both the statistical and economic
point of view. Among the machine learning methods considered the gradient
boosting machine turns out to be the top-performer. The results also show
how the predictability of large absolute returns tends to cluster around
certain periods of time. These periods are typically associated with high
stock market volatility. The conclusion of increased predictability during
market turmoil is in line with the findings of Krauss, Do and Huck (2017) and
Fiévet and Sornette (2018).
1.3.3 Moving forward from predictive regressions: Boosting asset 
allocation decisions 
As pointed out by Leitch and Tanner (1991), among many others, statistically
significant predictability of stock returns does not necessarily imply
economic profitability and vice versa. For an individual investor or portfolio
manager the ultimate goal is the economic value of the predictions rather
than mere statistical predictability. In this chapter we estimate the
portfolio weights directly using gradient boosting by incorporating a custom
objective function in the context of (1.7). Our contribution is strongly
connected to a more general issue in (financial) econometrics on what is the
appropriate objective (loss) function in econometric inference (see e.g.,
Elliott and Timmermann, 2016, chapter 2).
Our empirical results on the monthly U.S. market returns show that
substantial and quantitatively meaningful economic value can be obtained with
our utility boosting method. Technical indicators yield as a group the largest
benefits in out-of-sample forecasting experiments. This is generally in line
with the conclusions of Neely et al. (2014), now confirmed with a very
different methodology.
References 
Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information 
Science and Statistics). Springer-Verlag, Berlin, Heidelberg. 
Blagus, R. and Lusa, L. (2017). Gradient boosting for high-dimensional
prediction of rare events. Computational Statistics & Data Analysis, 113:19–37.
Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32. 
Bühlmann, P. (2006). Boosting for high-dimensional linear models. Annals of 
Statistics, 34(2):559–583. 
Choi, H. and Varian, H. (2012). Predicting the present with Google Trends.
Economic Record, 88(s1):2–9.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. 
Mathematics of Control, Signals, and Systems, 2(4):303–314. 
Einav, L. and Levin, J. (2014). Economics in the age of big data. Science, 
346(6210). 
Elliott, G. and Timmermann, A. (2016). Economic Forecasting. Princeton
University Press, 1st edition.
Ettredge, M., Gerdes, J., and Karuga, G. (2005). Using web-based search data 
to predict macroeconomic statistics. Commun. ACM, 48(11):87–92. 
Evgeniou, T., Pontil, M., and Poggio, T. A. (2000). Regularization networks and 
support vector machines. Advances in Computational Mathematics, 13:1–50. 
Fiévet, L. and Sornette, D. (2018). Decision trees unearth return sign
predictability in the S&P 500. Quantitative Finance, pages 1–18.
Fix, E. and Hodges, J. (1951). Discriminatory Analysis: Nonparametric Discrimi-
nation: Consistency Properties. USAF School of Aviation Medicine, Randolph 
Field, TX. 
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting 
machine. Annals of Statistics, 29(5):1189–1232. 
Galar, M., Fernández, A., Tartas, E. B., Bustince, H., and Herrera, F. (2012). A 
review on ensembles for the class imbalance problem: Bagging-, boosting-, 
and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cyber-
netics, Part C (Applications and Reviews), 42:463–484. 
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical
Learning: Data Mining, Inference and Prediction. Springer, 2nd edition.
He, H. and Garcia, E. A. (2009). Learning from imbalanced data. IEEE 
Transactions on Knowledge and Data Engineering, 21(9):1263–1284. 
Henderson, J. V., Storeygard, A., and Weil, D. N. (2012). Measuring economic 
growth from outer space. American Economic Review, 102(2):994–1028. 
Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward
networks are universal approximators. Neural Networks, 2(5):359–366.
Krauss, C., Do, X. A., and Huck, N. (2017). Deep neural networks,
gradient-boosted trees, random forests: Statistical arbitrage on the S&P 500.
European Journal of Operational Research, 259(2):689–702.
Kuan, C.-M. and Liu, T. (1995). Forecasting exchange rates using feedforward 
and recurrent neural networks. Journal of Applied Econometrics, 10(4):347–364. 
Leitch, G. and Tanner, J. E. (1991). Economic forecast evaluation: Profits
versus the conventional error measures. The American Economic Review,
81(3):580–590.
Mitchell, W. C. and Burns, A. F. (1938). Statistical Indicators of Cyclical Revivals. 
NBER. 
Mullainathan, S. and Spiess, J. (2017). Machine learning: An applied econo-
metric approach. Journal of Economic Perspectives, 31(2):87–106. 
Neely, C. J., Rapach, D. E., Tu, J., and Zhou, G. (2014). Forecasting the eq-
uity risk premium: The role of technical indicators. Management Science, 
60(7):1772–1791. 
Stock, J. H. and Watson, M. W. (1999). A Comparison of Linear and Nonlinear Uni-
variate Models for Forecasting Macroeconomic Time Series, pages 1–44. Oxford 
University Press, Oxford. 
Stock, J. H. and Watson, M. W. (2003). Forecasting output and inflation: The
role of asset prices. Journal of Economic Literature, 41(3):788–829.
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, 
Berlin, Heidelberg. 
Wheelock, D. C. and Wohar, M. E. (2009). Can the term spread predict output
growth and recessions? A survey of the literature. Federal Reserve Bank of St.
Louis Review, 91:419–440.
Wohlrabe, K. and Buchen, T. (2014). Assessing the macroeconomic forecasting
performance of boosting: Evidence for the United States, the Euro Area and
Germany. Journal of Forecasting, 33(4):231–242.
Wyner, A. J., Olson, M., Bleich, J., and Mease, D. (2017). Explaining the
success of AdaBoost and random forests as interpolating classifiers. Journal
of Machine Learning Research, 18(48):1–33.
Zhou, Z.-H. (2012). Ensemble Methods: Foundations and Algorithms. Chapman & 
Hall/CRC, 1st edition. 

Chapter 2 
Recession forecasting with big 
data 
Abstract ∗ † 
In this chapter, a large number of different financial and macroeconomic
variables are used to predict the U.S. recession periods. We propose a new
cost-sensitive extension to the gradient boosting model which can take into
account the class imbalance problem of the binary response variable. The class
imbalance, caused by the scarcity of recession periods in our application, is a
problem that is emphasized with high-dimensional datasets. Our empirical
results show that the introduced cost-sensitive extension outperforms the
traditional gradient boosting model in both in-sample and out-of-sample
forecasting. Among the large set of candidate predictors, different types
of interest rate spreads turn out to be the most important predictors when
forecasting U.S. recession periods.
∗ The author would like to thank Heikki Kauppi, Henri Nyberg, Simone Maxand and
seminar participants at the Annual Meeting of the Finnish Economic Association and FDPE
Econometrics workshop for constructive comments. The financial support from the Emil
Aaltonen Foundation is gratefully acknowledged.
† A paper based on this chapter is available in SSRN working paper series (id3630146). 
2.1 Introduction 
Recessions are painful periods with a significant and widespread decline in
economic activity. Early warning signals of recessions would be important
for different kinds of economic agents. Households, firms, policymakers
and central bankers could all utilize the information concerning upcoming
economic activity in their decision making. The probability of a recession is
fairly straightforward to interpret and can be easily taken into consideration
in all kinds of economic decision making.
But what are the indicators that consistently lead recessions? Since the
early work of Estrella and Mishkin (1998) there has been a large amount of
empirical research concerning the predictive content of different economic and
financial variables (see e.g., Nyberg, 2010; Liu and Moench, 2016). The number
of potential recession indicators is growing rapidly as the constraints related
to data availability and computational power keep diminishing. The
traditionally used binary logit and probit models can only handle small
predictor sets at a time, which makes the search for the best predictors
quite difficult.
Recent developments in the machine learning literature provide a solution
to this problem. A state-of-the-art supervised learning algorithm called
gradient boosting is able to do variable selection and model estimation
simultaneously. Non-parametric boosting can handle huge predictor sets and the
estimated conditional probability function can take basically any kind of
form. The main objective of this research is to explore how we can exploit
high-dimensional datasets when making recession forecasts with the gradient
boosting model.
The business cycle consists of positive and negative fluctuations around
the long-run growth rate of the economy. These fluctuations are also known as
expansions and recessions. The official business cycle chronology for the U.S.
is published by the National Bureau of Economic Research (NBER). Recessions
are shorter events than expansion periods, which leads to quite heavily
imbalanced binary class labels. In our dataset less than 14 percent of the
monthly observations are classified as recessions. This class imbalance and
its effects on classification are well covered in the machine learning
literature (see e.g., Galar et al., 2012). Surprisingly, the scarcity of
recession periods has not been properly taken into consideration in previous
economic research.
Two approaches are usually considered when dealing with imbalanced 
classes: resampling techniques and cost-sensitive learning methods (see e.g., 
He and Garcia, 2009). Resampling is the easiest and most commonly used 
alternative. The dataset could be balanced by drawing a random sample 
without replacement from the majority class, which is called undersampling. 
In the recession forecasting setup the size of the dataset is already very limited 
so this could create problems when estimating the model, especially with 
high-dimensional data. In the oversampling approach the idea is to sample 
with replacement from the minority class. He and Garcia (2009) argue that the
duplicate observations from the minority class can lead to overfitting.
Instead of replicating existing observations from the minority class one 
could learn the characteristics in this class and create synthetic samples based 
on feature space similarities. This synthetic minority oversampling technique 
also known as SMOTE is a popular alternative when dealing with imbalanced 
data. Blagus and Lusa (2013), however, find that variable selection is needed
before running SMOTE on high-dimensional datasets.
Cost-sensitive learning methods can take the class imbalance into account
without artificially manipulating the dataset. In a variety of real-life
classification problems, such as recession forecasting or fraud detection,
misclassifying the minority class can be considered very costly. The
cost-sensitivity can be incorporated into the model by attaching a higher
penalty for misclassifying the minority class. Several modified versions of
the AdaBoost algorithm by Freund and Schapire (1996) exist, where the weight
updating rule of the original algorithm is modified to better account for the
class imbalance (see e.g., Sun et al., 2007; Fan et al., 1999; Ting, 2000).
This is natural since weight updating is a crucial part of the AdaBoost
algorithm, which is designed purely for classification problems. However, this
is not the case with the more general gradient boosting algorithm presented by
Friedman (2001), which can handle a variety of problems beyond classification,
so the cost-sensitivity has to be incorporated in another way. We propose a
cost-sensitive extension to the gradient boosting model by introducing a
binary class weight for each observation in the dataset that reflects the
asymmetric misclassification costs. To the best of our knowledge, a
cost-sensitive gradient boosting model using class weights has not been
utilized in previous economic research.
The traditional gradient boosting model has been utilized in previous 
economic research with mixed results. Ng (2014) uses the gradient boosting 
model with stump regression trees to predict recession periods in the U.S. 
The dataset used by Ng (2014) has a fairly large predictor set and is from the 
same source as the dataset used in this paper. With this model setup Ng (2014) 
concludes that the gradient boosting model is far from perfect in forecasting 
recessions. 
Berge (2015) uses a smaller predictor set to forecast U.S. recessions with the 
gradient boosting model. The results show how boosting outperforms other 
model selection techniques such as Bayesian model averaging. Moreover, 
the results highlight the importance of non-linearity in recession forecasting 
as boosting with non-linear smoothing splines outperforms boosting with a 
linear final model. Döpke, Fritsche and Pierdzioch (2017) successfully forecast
German recession periods with the gradient boosting model using regression 
trees. Unlike Ng (2014) they build larger trees which allow for potential 
interaction terms between predictors. This approach is used in this study as 
well. 
Our results confirm the finding of Blagus and Lusa (2017) who note that
the performance of a gradient boosting model can be rather poor with high 
class imbalance, especially when a high-dimensional dataset is used. The 
out-of-sample forecasting ability of the traditional gradient boosting model 
is quite heavily deteriorated compared to the in-sample results. The cost-
sensitive extension to the gradient boosting model using class weights can 
take the class imbalance problem into account and produces strong warning 
signals for the U.S. recessions with different forecasting horizons. 
The cost-sensitive gradient boosting models estimated using huge pre-
dictor sets rely heavily on different kinds of interest rate spreads. This is 
also the case with the short and medium term forecasting horizons although 
different variables related to the real economy are also available in the dataset. 
The internal model selection capability of gradient boosting confirms that
predictors with predictive power beyond the term spread are quite hard to
find (see e.g., Estrella and Mishkin, 1998; Liu and Moench, 2016).
The results also show how the chosen lag length for a predictor can vary
substantially with the forecasting horizon considered. A similar observation
has been made by Kauppi and Saikkonen (2008) in the conventional probit 
model. The term spread is the dominant predictor when forecasting recessions 
one year ahead, which is a common finding in the previous literature (see e.g.,
Dueker, 1997; Estrella and Mishkin, 1998). 
The rest of the paper is organized as follows. The gradient boosting 
framework and the cost-sensitive extension to the gradient boosting model are 
introduced in Section 2.2. The dataset and the empirical analysis are presented 
in Section 2.3. Section 2.4 concludes. 
2.2 Methodology 
The following theoretical framework for the gradient boosting model follows 
closely the original work of Friedman (2001). 
2.2.1 Gradient boosting 
Consider two stochastic processes $y_t$ and $x_{t-k}$, of which $y_t$ is a
binary dependent variable of the form
\[
y_t =
\begin{cases}
1, & \text{if the economy is in recession at time } t \\
0, & \text{if the economy is in expansion at time } t
\end{cases}
\tag{2.1}
\]
and $x_{t-k}$ is a $p \times 1$ vector of predictive variables. The lag length
$k$ of each predictor must satisfy the condition $k \geq h$, where $h$ is the
forecasting horizon. If $E_{t-k}(\cdot)$ and $P_{t-k}(\cdot)$ denote the
conditional expectation and the conditional probability given the information
set available at time $t-k$, then under the logistic transform
$\Lambda(\cdot)$ the conditional probability can be written as
\[
E_{t-k}(y_t) = P_{t-k}(y_t = 1) = p_t = \Lambda(F(x_{t-k})). \tag{2.2}
\]
We can model this conditional probability by estimating the function
$F(x_{t-k})$ with the gradient boosting model. Exponential loss and binomial
deviance are popular alternatives for the loss function to be minimized in
binary classification problems. These are second-order equivalent (Friedman,
Hastie and Tibshirani, 2000). In this research the conditional probability is
estimated with the gradient boosting model by minimizing the binomial
deviance loss function.
In the general estimation problem the goal is to find the function
$F(x_{t-k})$ that minimizes the expected loss of some predefined loss function
\[
\hat{F}(x_{t-k}) = \arg\min_{F(x_{t-k})} E\left[ L\left(y_t, F(x_{t-k})\right) \right]. \tag{2.3}
\]
Even for a simple parametric model, where $F(x_{t-k})$ is assumed to be a
linear function of the covariates, numerical optimization techniques are
usually needed for solving the parameter vector that minimizes the expected
loss in equation (2.3). The steepest descent optimization technique is a
simple alternative. The parameter search using steepest descent can be
summarized with the following equation
\[
\beta^{*} = \sum_{m=0}^{M} \beta_m = -\sum_{m=0}^{M} \delta_m g_m, \tag{2.4}
\]
where $\beta_0$ is the initial guess and $\{\beta_m\}_{m=1}^{M}$ are steps
towards the optimal solution. The negative gradient vector $-g_m$ determines
the direction of each step and $\delta_m$ is the stepsize obtained by a line
search.
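As a rough illustration of the steepest descent search in equation (2.4), the sketch below fits a linear $F$ under the binomial deviance. The simulated data, the function names and the grid-based line search are assumptions made for illustration only and are not taken from the text.

```python
import numpy as np

def mean_deviance(beta, X, y):
    """Sample mean of the binomial deviance for a linear F(x) = x'beta."""
    F = X @ beta
    return -2.0 * np.mean(y * F - np.logaddexp(0.0, F))

def gradient(beta, X, y):
    """Gradient g_m of the mean deviance with respect to beta."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return -2.0 * X.T @ (y - p) / len(y)

def steepest_descent(X, y, n_steps=50, grid=np.logspace(-3, 1, 30)):
    beta = np.zeros(X.shape[1])              # initial guess beta_0
    for _ in range(n_steps):
        g = gradient(beta, X, y)
        # Line search for the stepsize delta_m on a fixed grid (a simplification)
        losses = [mean_deviance(beta - d * g, X, y) for d in grid]
        best = int(np.argmin(losses))
        if losses[best] >= mean_deviance(beta, X, y):
            break                            # no candidate step improves the loss
        beta = beta - grid[best] * g         # add the step -delta_m * g_m
    return beta

# Hypothetical data: intercept plus one predictor
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
prob = 1.0 / (1.0 + np.exp(-(-1.0 + 2.0 * X[:, 1])))
y = (rng.uniform(size=200) < prob).astype(float)
beta_hat = steepest_descent(X, y)
```

Each accepted step lowers the sample loss, so the final parameter vector is the initial guess plus the accumulated steps, as in (2.4).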
With gradient boosting the optimization takes place in the function space
instead of the conventional parameter space. As in the parametric case,
numerical optimization methods are needed when searching for the optimal
function. Some further assumptions are required in order to make the
numerical optimization in the function space feasible with finite datasets. By
restricting the function search to some parameterized class of functions the
solution to numerical optimization can be written as
\[
F^{*}(x_{t-k}) = \sum_{m=0}^{M} f_m(x_{t-k}) = \sum_{m=0}^{M} \delta_m b(x_{t-k}; \gamma_m), \tag{2.5}
\]
where $\delta_m$ is the stepsize obtained by line search as in equation (2.4).
Now the step "direction" is given by the function $b(x_{t-k}; \gamma_m)$, also
known as the base learner function. This can be a simple linear function or a
highly non-linear one such as splines or regression trees. In this paper
regression trees are used and the parameter vector $\gamma_m$ consists of the
splitting variables and splitpoints of the regression tree. Equation (2.5)
also incorporates the original idea of boosting. The possibly very complex
final ensemble $F(x_{t-k})$ with strong predictive ability is a sum of the
fairly simple base learner functions $f_m(x_{t-k})$.
Using the sample counterpart of the loss function in equation (2.3) and
by plugging in the additive form introduced in equation (2.5) the estimation
problem can be written as
\[
\min_{\{\delta_m, \gamma_m\}_{m=1}^{M}} \frac{1}{N} \sum_{t=1}^{N} L\left( y_t, \sum_{m=0}^{M} \delta_m b(x_{t-k}; \gamma_m) \right). \tag{2.6}
\]
This minimization problem can be approximated using the forward stagewise
additive modeling technique. This is done by adding new base learner
functions to the expansion without altering the functions already included in
the ensemble. At each step $m$ the base learner function
$b(x_{t-k}; \gamma_m)$ which best fits the negative gradient of the loss
function is selected and added to the ensemble. Using least squares as the
fitting criterion while searching for the optimal base learner function leads
to the general gradient boosting algorithm by Friedman (2001):
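Algorithm 2.1 Gradient boosting
1: $F_0(x_{t-k}) = \arg\min_{\rho} \frac{1}{N} \sum_{t=1}^{N} L(y_t, \rho)$
2: for $m \leftarrow 1$ to $M$ do:
3:     $\tilde{y}_t = -\left[ \frac{\partial L(y_t, F(x_{t-k}))}{\partial F(x_{t-k})} \right]_{F(x_{t-k}) = F_{m-1}(x_{t-k})}, \quad t = 1, \ldots, N$
4:     $\gamma_m = \arg\min_{\gamma, \delta} \sum_{t=1}^{N} \left[ \tilde{y}_t - \delta b(x_{t-k}; \gamma) \right]^2$
5:     $\rho_m = \arg\min_{\rho} \sum_{t=1}^{N} L\left( y_t, F_{m-1}(x_{t-k}) + \rho b(x_{t-k}; \gamma_m) \right)$
6:     $F_m(x_{t-k}) = F_{m-1}(x_{t-k}) + \rho_m b(x_{t-k}; \gamma_m)$
7: end for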
Friedman (2001) suggests a slight modification to Algorithm 2.1 when
regression trees are used as the base learner function. Regression trees are a
simple yet powerful tool that partition the feature space into a set of $J$
non-overlapping rectangles and attach a simple constant to each one. The base
learner function of a $J$-terminal node regression tree can be written as
\[
b\left(x_{t-k}; \{c_j, R_j\}_{j=1}^{J}\right) = \sum_{j=1}^{J} c_j I(x_{t-k} \in R_j), \tag{2.7}
\]
where the functional estimate is a constant $c_j$ in region $R_j$. According to
Friedman (2001), the additive $J$-terminal node regression tree in equation
(2.7) can be seen as a combination of $J$ separate base learner functions, one
for each terminal node of the regression tree. Therefore, after estimating the
terminal node regions $\{R_{jm}\}_{j=1}^{J}$ at the $m$th iteration with least
squares on line 4 of Algorithm 2.1, the line search step on line 5 should
produce separate estimates for each terminal node of the regression tree. This
minimization problem can be written as
\[
\{\hat{c}_{jm}\}_{j=1}^{J} = \arg\min_{\{c_j\}_{j=1}^{J}} \sum_{t=1}^{N} L\left( y_t, F_{m-1}(x_{t-k}) + \sum_{j=1}^{J} c_j I(x_{t-k} \in R_{jm}) \right). \tag{2.8}
\]
The ensemble update on the last line of Algorithm 2.1 is then a sum of these
$J$ terminal node estimates obtained in equation (2.8):
\[
F_m(x_{t-k}) = F_{m-1}(x_{t-k}) + \sum_{j=1}^{J} \hat{c}_{jm} I(x_{t-k} \in R_{jm}).
\]
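The steps of Algorithm 2.1 with regression tree base learners can be sketched in a few lines for the simplest case of two-terminal-node trees ($J = 2$) and binomial deviance loss. This is a minimal illustration, not the implementation used in the chapter: the function names and the simulated data are hypothetical, and the minimization in (2.8) is approximated by a single Newton step per terminal node, which corresponds to the terminal node estimate in Table 2.1 with all weights equal to one.

```python
import numpy as np

def deviance(y, F):
    """Mean binomial deviance for labels y in {0, 1} and scores F."""
    return -2.0 * np.mean(y * F - np.logaddexp(0.0, F))

def fit_stump(X, r):
    """Least-squares split of the pseudo-residuals r (line 4, with J = 2)."""
    best_sse, best_j, best_s = np.inf, 0, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j])[:-1]:        # keeps both regions non-empty
            left = X[:, j] <= s
            sse = ((r[left] - r[left].mean()) ** 2).sum() \
                + ((r[~left] - r[~left].mean()) ** 2).sum()
            if sse < best_sse:
                best_sse, best_j, best_s = sse, j, s
    return best_j, best_s

def boost(X, y, M=10):
    F = np.full(len(y), np.log(y.mean() / (1.0 - y.mean())))   # F_0
    for _ in range(M):
        p = 1.0 / (1.0 + np.exp(-F))
        r = y - p                                # pseudo-residuals (line 3)
        j, s = fit_stump(X, r)                   # terminal regions R_1m, R_2m
        left = X[:, j] <= s
        for region in (left, ~left):
            den = (p[region] * (1.0 - p[region])).sum()
            if den > 0:
                # One Newton step towards the minimiser in (2.8)
                F[region] += r[region].sum() / den
    return F

# Hypothetical data: only the first of two predictors is informative
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = (rng.uniform(size=300) < 1.0 / (1.0 + np.exp(-2.0 * X[:, 0]))).astype(float)
F_hat = boost(X, y)
```

The fitted scores can be mapped to probabilities with the logistic transform in (2.2). In practice the update would also be shrunk by a learning rate, as discussed in Section 2.2.3.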
2.2.2 Cost-sensitive gradient boosting with class weights 
With a high class imbalance there is a risk that the estimated binary
classifier is skewed towards predicting the majority class well (He and
Garcia, 2009). An algorithm can be made cost-sensitive by weighting the
dataspace according to the misclassification costs (Branco, Torgo and Ribeiro,
2016). This weighting approach is sometimes referred to as rescaling in the
previous literature (see e.g., Zhou and Liu, 2010). The asymmetric
misclassification costs, which are the building block of cost-sensitive
learning, are incorporated into the gradient boosting model by introducing a
binary class weight for each observation in the data. In the traditional
gradient boosting model the sample counterpart of the loss function is the
sample mean and the minimization problem can be written as in equation (2.6).
By introducing a vector of class weights we end up minimizing the weighted
average of the sample loss function
\[
\min_{\{\delta_m, \gamma_m\}_{m=1}^{M}} \frac{1}{\sum_{t=1}^{N} w_t} \sum_{t=1}^{N} w_t L\left( y_t, \sum_{m=0}^{M} \delta_m b(x_{t-k}; \gamma_m) \right). \tag{2.9}
\]
If the weights $w_t$ are equal for each observation the weighted average in
equation (2.9) reduces to the sample mean.
Elkan (2001) suggests weighting the minority class observations according
to the ratio of misclassification costs. Let $c_{10}$ denote the cost when we
fail to predict a recession and $c_{01}$ the cost when we give a false alarm
of recession. The optimal weight for the minority class observations is then
\[
w^{*} = \frac{c_{10}}{c_{01}}. \tag{2.10}
\]
In many cases the exact misclassification costs are unknown and we must rely
on rules such that misclassifying the minority class is more costly (Maloof,
2003). The class weights are basically arbitrary as they depend on unknown
preferences regarding how harmful the different types of misclassification
are considered to be. In this paper we use the data-based approach by Zhou
(2012) and choose the weights according to the class imbalance observed in
the dataset
\[
w_t =
\begin{cases}
\dfrac{\sum_{t=1}^{N} (1 - y_t)}{\sum_{t=1}^{N} y_t}, & \text{if } y_t = 1 \\
1, & \text{if } y_t = 0.
\end{cases}
\tag{2.11}
\]
As can be seen from equation (2.11), the weights depend on the ratio of the
number of datapoints in the two classes. These binary weights ensure that the
sum of weights is equal in both classes. The aim of choosing these weights
is to force the algorithm to provide a balanced degree of predictive accuracy
between the two classes.
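The weighting rule in equation (2.11) is simple to compute. The short sketch below uses a hypothetical label vector and a hypothetical function name to show that the resulting weights sum to the same value in both classes.

```python
import numpy as np

def class_weights(y):
    """Binary class weights of equation (2.11) for a 0/1 label array y."""
    y = np.asarray(y)
    ratio = (1 - y).sum() / y.sum()      # expansion count / recession count
    return np.where(y == 1, ratio, 1.0)

# Hypothetical labels: 2 recession months out of 10
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])
w = class_weights(y)
```

Here each recession observation receives the weight 8 / 2 = 4, so both classes carry a total weight of 8.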
The cost-sensitive gradient boosting algorithm with class weights follows 
the steps described in Algorithm 2.1 but the binary class weights can have an 
effect on each step of the algorithm. Table 2.1 illustrates how the class weights 
alter different parts of the gradient boosting algorithm, when J-terminal node 
regression trees are used as the base learner functions and the loss function to 
be minimized is the binomial deviance. 
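Table 2.1: The effect of class weights on the gradient boosting algorithm
Step                     Value
Loss function            $-2 \, \dfrac{\sum_{t=1}^{N} w_t \left[ y_t F(x_{t-k}) - \log\left(1 + e^{F(x_{t-k})}\right) \right]}{\sum_{t=1}^{N} w_t}$
Initial value            $F_0(x_{t-k}) = \log\left( \dfrac{\sum_{t=1}^{N} w_t y_t}{\sum_{t=1}^{N} w_t (1 - y_t)} \right)$
Gradient                 $\tilde{y}_{tm} = y_t - p_t$, where $p_t = \dfrac{1}{1 + e^{-F_{m-1}(x_{t-k})}}$
Split criterion          $i^2(R_l, R_r) = \dfrac{W_l W_r}{W_l + W_r} (\bar{g}_l - \bar{g}_r)^2$, where $W_l = \sum_{x_{t-k} \in R_l} w_t$ and $\bar{g}_l = \dfrac{1}{W_l} \sum_{x_{t-k} \in R_l} w_t \tilde{y}_{tm}$
Terminal node estimate   $\hat{c}_{jm} = \dfrac{\sum_{x_{t-k} \in R_j} w_t (y_t - p_t)}{\sum_{x_{t-k} \in R_j} w_t p_t (1 - p_t)}$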
Note that the values for each step of the ordinary gradient boosting model
can be obtained from Table 2.1 by setting all the weights equal to one. The
cost-sensitive and the traditional gradient boosting algorithms differ starting
from the initial values. As the first gradient vector is based on the initial
value the gradients are also different. The biggest differences between these
two algorithms, however, are related to the estimation of the regression tree
base learners at each iteration $m$ of the algorithm. Blagus and Lusa (2017)
argue that the class imbalance problem of the gradient boosting model with
high-dimensional data is related to the inappropriately defined terminal
regions $R_j$.
Next we will consider how class weights can have an effect on both the
estimated terminal node regions and the terminal node estimates of the
regression tree base learner. When a $J$-terminal node regression tree is used
as the base learner function, the $J - 1$ recursive binary splits into regions
$R_l$ and $R_r$ dividing the predictor space into $J$ non-overlapping terminal
node regions $\{R_j\}_{j=1}^{J}$ are obtained by maximizing the least-squares
improvement criterion. These splits are based on a slightly different
criterion if class weights are used. For this reason the estimated terminal
node regions and the terminal node estimates can be different between the two
algorithms.
From Table 2.1 we can see that the split criterion consists of two parts.
The first part, $\frac{W_l W_r}{W_l + W_r}$, illustrates how each split into
regions $R_l$ and $R_r$ in cost-sensitive gradient boosting is based on the
sum of weights in these two regions instead of the number of observations. The
latter part of the split criterion, $(\bar{g}_l - \bar{g}_r)^2$, shows that
instead of the average gradient we compare the weighted average of the
gradient in the regions when searching for the optimal split point. From the
last row in Table 2.1 one can note how the terminal node estimates are
functions of both the terminal node regions and the class weights themselves,
and hence the final estimates can differ between the two algorithms.
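The quantities in Table 2.1 can be written as small helper functions, which makes the role of the weights easy to inspect. The function names and the toy data below are hypothetical, and the sketch only evaluates the formulas for a given split; it does not search over candidate splits.

```python
import numpy as np

def initial_value(y, w):
    """Weighted initial value F_0 from Table 2.1."""
    return np.log((w * y).sum() / (w * (1 - y)).sum())

def split_gain(g, w, left):
    """Weighted split criterion i^2(R_l, R_r); `left` is a boolean mask for R_l."""
    W_l, W_r = w[left].sum(), w[~left].sum()
    g_l = (w[left] * g[left]).sum() / W_l        # weighted mean gradient in R_l
    g_r = (w[~left] * g[~left]).sum() / W_r      # weighted mean gradient in R_r
    return W_l * W_r / (W_l + W_r) * (g_l - g_r) ** 2

def terminal_estimate(y, p, w, region):
    """Weighted terminal node estimate for the observations in `region`."""
    num = (w[region] * (y[region] - p[region])).sum()
    den = (w[region] * p[region] * (1 - p[region])).sum()
    return num / den

# Hypothetical labels with weights chosen as in equation (2.11)
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1], dtype=float)
w_cost = np.where(y == 1, (1 - y).sum() / y.sum(), 1.0)
w_unit = np.ones_like(y)

F0_cost = initial_value(y, w_cost)    # log(8 / 8) = 0
F0_unit = initial_value(y, w_unit)    # log(2 / 8)
```

With the weights of (2.11) the initial value is zero, i.e. an initial recession probability of one half, whereas with unit weights it equals the log-odds of the observed recession frequency. This is the first point at which the two algorithms diverge.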
2.2.3 Regularization parameters in gradient boosting 
Friedman (2001, 2002) introduces several add-on regularization techniques
to reduce the risk of overfitting or to improve the overall performance of
the gradient boosting algorithm. The parameters related to these techniques
are often called tuning parameters since it is up to the user to fine-tune the
parameter values for the particular problem at hand. Tuning parameters
with the gradient boosting technique can be divided into two categories:
parameters related to the overall algorithm and parameters related to the
chosen base learner function.
Friedman (2001) incorporates a simple shrinkage strategy to slow down
the learning process. In this strategy each update of the algorithm is scaled
down by a constant called the learning rate. The ensemble update on the last
line of Algorithm 2.1 can then be written as
\[
F_m(x_{t-k}) = F_{m-1}(x_{t-k}) + \upsilon \rho_m b(x_{t-k}; \gamma_m),
\]
where $0 < \upsilon \leq 1$ is the learning rate. The learning rate is a
crucial part of the gradient boosting algorithm as it controls the speed of
the learning process by shrinking each gradient descent step towards zero.
Friedman (2001) suggests setting the learning rate small enough for better
generalization ability. Bühlmann and Yu (2010) reach a similar conclusion.
Breiman (1996) notes that introducing randomness when building each
tree in an ensemble can lead to substantial gains in prediction accuracy. Based
on these findings Friedman (2002) develops stochastic gradient boosting in
which subsampling is used to enhance the generalization ability of the
gradient boosting model. At each round of the algorithm a random subsample of
datapoints is drawn without replacement and the new base learner function is
fitted using this random subsample. Simulation studies show that a subsampling
fraction around one half seems to work best in most cases (Friedman,
2002).
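The two regularization devices described above, shrinkage and subsampling, each amount to a one-line change in the boosting loop. The helper names and default values in the following sketch are hypothetical and serve only to make the two operations concrete.

```python
import numpy as np

def subsample_indices(n_obs, fraction, rng):
    """Draw a random subsample of observation indices without replacement."""
    size = int(round(fraction * n_obs))
    return rng.choice(n_obs, size=size, replace=False)

def shrunken_update(F_prev, base_prediction, learning_rate):
    """Ensemble update scaled down by the learning rate (0 < rate <= 1)."""
    return F_prev + learning_rate * base_prediction

rng = np.random.default_rng(0)
idx = subsample_indices(10, fraction=0.5, rng=rng)        # indices used to fit the tree
F_new = shrunken_update(np.zeros(3), np.ones(3), learning_rate=0.1)
```

In a stochastic gradient boosting round the base learner would be fitted on the rows in `idx` only, while the shrunken update is applied to all observations.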
The number of iterations $M$ needed, however, moves in the opposite
direction to the learning rate and the subsampling fraction. Gradient boosting
is a flexible technique which can approximate basically any kind of functional
form given a sufficient amount of data. This flexibility can also come at a
cost. Overfitting the training data is a risk that must be taken into
consideration as it can lead to decreased generalization ability of the model.
The optimal number of iterations is usually chosen with early stopping methods
such as using an independent test set or cross-validation.
When observations are scarce, $K$-fold cross-validation is often the only
alternative since we cannot afford to set aside an independent test set.
$K$-fold cross-validation is based on splitting the data into $K$
non-overlapping folds. Each of these folds is used as a test set once while
the model is estimated using the remaining $K - 1$ folds. To reduce the effect
of randomness the $K$-fold cross-validation process can be repeated $R$ times
(Kim, 2009). In the repeated $K$-fold cross-validation approach the estimate
for the optimal stopping point is based on the average validation error
produced by the $K$ folds at each of these $R$ repeats.
Instead of the traditional repeated $K$-fold we use a more conservative
cross-validation approach since the risk of overfitting the data in the
high-dimensional setup is fairly high. In this conservative approach only the
validation error produced by the fold which first reaches its minimum, and
therefore first starts to show signs of overfitting, is selected out of the
$K$ folds at each repetition. By denoting the found "weakest" fold in
repetition $r$ as $k_r^{*}$, the number of observations in this fold as
$N_{k_r^{*}}$ and the model estimated without this fold as
$\hat{F}^{-k_r^{*}}(x_{t-k})$, the conservative cross-validation estimate for
the prediction error can be written as
\[
CV = \frac{1}{R} \sum_{r=1}^{R} \frac{1}{N_{k_r^{*}}} \sum_{t=1}^{N_{k_r^{*}}} L\left( y_t, \hat{F}^{-k_r^{*}}(x_{t-k}) \right), \tag{2.12}
\]
where binomial deviance is used as the loss function $L(\cdot)$. The final
estimate for the number of iterations is the point where the estimated
prediction error in (2.12) reaches its minimum. To the best of our knowledge
this simple conservative approach has not been used in previous academic
research.
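The selection rule behind (2.12) can be sketched as follows, assuming the validation losses have already been computed and stored in an array `val_err[r, k, m]` holding the mean validation loss of fold `k` in repetition `r` after `m + 1` boosting iterations. The array layout, the function name and the toy numbers are assumptions made for illustration, and ties are resolved by taking the first fold.

```python
import numpy as np

def conservative_stopping_point(val_err):
    """Return the iteration count minimising the conservative CV estimate."""
    R = val_err.shape[0]
    first_min = val_err.argmin(axis=2)       # iteration of each fold's minimum, shape (R, K)
    weakest = first_min.argmin(axis=1)       # fold whose minimum occurs first, shape (R,)
    # Average the weakest-fold error curves over the R repetitions
    cv_curve = val_err[np.arange(R), weakest, :].mean(axis=0)
    return int(cv_curve.argmin()) + 1, cv_curve

# Hypothetical validation errors: R = 2 repetitions, K = 2 folds, 4 iterations
val_err = np.array([
    [[1.0, 0.8, 0.9, 1.0],     # repetition 1, fold 1: minimum at iteration 2
     [1.0, 0.9, 0.7, 0.8]],    # repetition 1, fold 2: minimum at iteration 3
    [[1.0, 0.9, 0.8, 0.7],     # repetition 2, fold 1: minimum at iteration 4
     [0.9, 0.6, 0.8, 0.9]],    # repetition 2, fold 2: minimum at iteration 2
])
m_star, cv_curve = conservative_stopping_point(val_err)
```

In this toy example the weakest folds are fold 1 in the first repetition and fold 2 in the second, and the averaged curve is minimised at two iterations. The traditional repeated $K$-fold estimate would instead average over all folds.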
The complexity of the regression tree base learners is controlled by the
number of terminal nodes $J$ in each regression tree. The number of inner
nodes ($J - 1$) in the regression tree limits the potential amount of
interaction between predictors, as shown by the ANOVA expansion of a function
\[
F(x_{t-k}) = \sum_{j} f_j(x_j) + \sum_{j,k} f_{jk}(x_j, x_k) + \sum_{j,k,l} f_{jkl}(x_j, x_k, x_l) + \cdots. \tag{2.13}
\]
The simplest regression tree with just two terminal nodes can only capture 
the frst term in equation (2.13). Higher order interactions are needed to be 
able to capture the latter terms, which are functions of more than one variable. 
These higher-order interactions require deeper trees. Hastie, Tibshirani and 
Friedman (2009) argue that trees with more than ten terminal nodes are seldom 
needed with boosting. 
2.3 Results 
2.3.1 Data and model setup 
The dataset used in the empirical analysis is the FRED-MD monthly dataset. 
The selected timespan covers the period from January 1962 to June 2017. After 
dropping variables that are not available for the full period, the FRED-MD 
dataset consists of 130 different economic and financial variables related to 
different parts of the economy.1 Three different forecasting horizons h are 
studied in the empirical analysis: short (h = 3), medium (h = 6) and long 
(h = 12). 

1 All ISM series (Institute for Supply Management) have been removed 
from the FRED-MD dataset starting from 2016/6. These series have been 
re-obtained using Macrobond. For more general information about the dataset see 
https://research.stlouisfed.org/econ/mccracken/fred-databases/ 

All the available lag lengths k of the predictors up to 24 months are 
considered as potential predictors (assuming k ≥ h). The total number of predictors 
in the dataset is 2860, 2470 or 1690 depending on the length of 
the forecasting horizon. For example, the total number of predictors with the 
shortest forecasting horizon is 2860, which includes 22 different lags of these 
130 variables. See Christiansen, Eriksen and Møller (2014) for a similar study 
where each lag is considered as a separate predictor. 
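The predictor counts quoted above follow directly from the restriction h ≤ k ≤ 24. The short sketch below reproduces the arithmetic; the helper names are hypothetical, and the `VARIABLE_lag` naming is only assumed to mirror the labels used later in Table 2.4.

```python
N_VARIABLES = 130  # series left in FRED-MD after dropping incomplete ones
MAX_LAG = 24       # longest lag considered, in months


def n_predictors(h, n_variables=N_VARIABLES, max_lag=MAX_LAG):
    """Number of lagged predictors when only lags k with h <= k <= max_lag are allowed."""
    return n_variables * len(range(h, max_lag + 1))


def predictor_names(variables, h, max_lag=MAX_LAG):
    """Build labels such as 'PERMITS_15' for every admissible lag of every variable."""
    return [f"{name}_{k}" for name in variables for k in range(h, max_lag + 1)]


counts = {h: n_predictors(h) for h in (3, 6, 12)}
print(counts)
print(predictor_names(["PERMITS"], h=12)[:4])
```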
The term spread has been noted as the best single predictor of recessions 
and economic growth in general in the U.S. (see e.g., Dueker, 1997; Estrella and 
Mishkin, 1998; Wheelock and Wohar, 2009). To see if it is actually worthwhile 
to go through these huge predictor sets with the gradient boosting models, 
we use a simple logit model with the term spread as a benchmark model. 
Kauppi and Saikkonen (2008) note that setting the lag length k equal to the 
forecasting horizon h may not be optimal in all cases. To take this into account 
we introduce the six nearest lag lengths of the term spread as additional 
predictors. The term spread is measured as the interest rate spread between 
the 10-year government bonds and the effective federal funds rate as this is 
included in the FRED-MD dataset. 
The estimated conditional probabilities for different models are evaluated 
using the receiver operating characteristic curve (ROC). The area under the 
ROC-curve (AUC) measures the overall classification ability of the model 
without restricting to a certain probability threshold. AUC-values closer to 
one indicate better classification ability whereas values close to one half are no 
better than a simple coin toss. For a more comprehensive review of the 
AUC-measure in economics context see e.g., Berge and Jordà (2011) and Nyberg and 
Pönkä (2016). 
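To make the interpretation of the AUC concrete, the sketch below computes it as the share of recession/expansion pairs that the forecast orders correctly, which is the standard rank-based (Mann-Whitney) form of the area under the ROC curve. The function name and the toy data are illustrative assumptions, not part of the original analysis.

```python
def auc(y_true, scores):
    """Share of (recession, expansion) pairs in which the recession month
    receives the higher score; ties count one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 1 = recession, 0 = expansion (made-up probabilities).
y = [0, 0, 1, 1, 0, 1]
p_hat = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]

auc_model = auc(y, p_hat)            # 8 of the 9 pairs are ordered correctly
auc_coin = auc(y, [0.5] * len(y))    # an uninformative constant forecast
print(round(auc_model, 3), auc_coin)
```

Because only the ordering of the scores enters the calculation, no probability threshold has to be chosen, and a constant forecast lands exactly at one half.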
The gradient boosting model involves internal model selection as the 
regression trees selected at each step of the algorithm may be functions of 
different predictors. Some predictors are chosen more often than others and 
can be considered more important. Breiman et al. (1984) introduce a measure 
for the relevance of a predictor $x_p$ in a single J-terminal node regression tree 
T 

$$\hat{I}_p^2(T) = \sum_{j=1}^{J-1} \hat{\imath}_j^2 \, I(v_j = p), \qquad (2.14)$$

where $v_j$ is the splitting variable of inner node j and $\hat{\imath}_j^2$ is the empirical 
improvement in squared error as a result of this split. The least squares improvement 
criterion was introduced in Table 2.1. 
The measure in equation (2.14) is based on a single tree, but it can be 
generalized to additive tree expansions as well (Friedman, 2001). The relative 
influence of a variable $x_p$ for the entire gradient boosting ensemble is simply 
an average over all the trees $\{T_m\}_{m=1}^{M}$ in the ensemble 

$$\hat{I}_p^2 = \frac{1}{M} \sum_{m=1}^{M} \hat{I}_p^2(T_m). \qquad (2.15)$$

The relative influence measure in equation (2.15) is used to illustrate the 
most important recession indicators with the gradient boosting model. The 
relevance of a predictor $x_p$ in the recursive out-of-sample forecasting is the 
average $\hat{I}_p^2$ of the estimated models. 
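Equations (2.14) and (2.15) amount to summing the split improvements per variable within each tree and then averaging over trees. A minimal sketch follows, assuming each tree is summarized as a list of (splitting variable, squared-error improvement) pairs for its inner nodes. This representation, the function names and the toy numbers are hypothetical, and the rescaling to a total of 100 is only an assumption about how influence values are commonly reported.

```python
from collections import defaultdict


def tree_influence(splits):
    """Eq. (2.14): sum the squared-error improvements of the inner nodes
    that split on each variable."""
    influence = defaultdict(float)
    for variable, improvement in splits:
        influence[variable] += improvement
    return dict(influence)


def ensemble_influence(trees):
    """Eq. (2.15): average the per-tree influences over all M trees."""
    total = defaultdict(float)
    for splits in trees:
        for variable, value in tree_influence(splits).items():
            total[variable] += value
    return {variable: value / len(trees) for variable, value in total.items()}


def normalize_to_100(influence):
    """Rescale so that the influences sum to 100 (assumed reporting scale)."""
    scale = sum(influence.values())
    return {variable: 100.0 * value / scale for variable, value in influence.items()}


# Two toy trees with two inner nodes each (made-up improvements).
trees = [
    [("10yr - FFrate_12", 4.0), ("5yr - FFrate_15", 1.0)],
    [("10yr - FFrate_12", 2.0), ("PERMITS_15", 1.0)],
]
rel_inf = normalize_to_100(ensemble_influence(trees))
print(rel_inf)
```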
The following results are obtained using the R programming environment 
for statistical computing (R Core Team, 2017). The gbm package (Ridgeway, 
2017) with the Bernoulli loss function is used to estimate the gradient boosting 
models. With such huge predictor sets it is likely that there are interactions 
between some predictors. For this reason the maximum tree depth is set to 8, 
leading to regression trees with nine terminal nodes. Döpke et al. (2017) use 
6-terminal node regression trees while predicting recessions in Germany with 
a much smaller predictor set. 
The minimum number of observations required in each terminal node of a 
regression tree is set to one, allowing the tree building process to be as flexible 
as possible. Similar results are obtained when setting the minimum number 
of observations to five as is used by Döpke et al. (2017).2 The learning rate is set 
to a low value of 0.005 and the default value of 0.5 is used as the subsampling 
fraction. The conservative cross-validation approach presented in equation 
(2.12) is conducted using 5 folds and 5 repeats throughout this research to find 
the optimal number of iterations. In order to keep the computational time 
feasible the maximum number of iterations is set to 800. 
2.3.2 In-sample results 
Three different models are compared in the in-sample analysis using the full 
dataset. The benchmark model (bm) is a simple logit model with seven lags of 
the term spread as predictors. GBM is the ordinary gradient boosting model 
and wGBM stands for the cost-sensitive gradient boosting model with class 
weights. The class weights are formed according to equation (2.11). The binary 
response variable for each model is the business cycle chronology provided 
by the NBER. 

2 Results upon request. 

Table 2.2 summarizes the in-sample performance as measured with the 
area under the ROC-curve (AUC) of these three models for all the different 
forecasting horizons. The rows of the table present the different models and 
the columns stand for the forecasting horizons considered. The validation 
AUCs from the 5-fold cross-validation repeated five times are reported in 
parentheses. 
Table 2.2: In-sample AUC (1962/01 - 2017/06) 

                        Forecast horizon, months 
Model specification     3               6               12 
Benchmark               0.890 (0.881)   0.910 (0.902)   0.914 (0.897) 
GBM                     1.000 (0.985)   1.000 (0.980)   1.000 (0.956) 
wGBM                    1.000 (0.987)   1.000 (0.981)   1.000 (0.961) 

As expected, the non-linear gradient boosting models do a better job 
forecasting recessions in-sample. The larger information set and the more flexible 
functional form of the GBM-models allow for a more detailed in-sample fit. 
The perfect in-sample AUCs for the GBM-models can raise questions of 
overfitting. As a result of using these moderate sized regression trees as base 
learner functions the GBM-models achieve nearly perfect classification ability 
after only a few iterations. This can be confirmed by training a shallow single 
decision tree on the full dataset. The single decision tree alone is sufficient 
to produce very high in-sample AUCs, even after restricting the predictor 
space to consider only the eight different interest rate spreads (and their lag 
lengths).3 Thereby it is not completely surprising that an ensemble of trees 
yields a perfect in-sample fit as measured with AUC. For example, the 
cost-sensitive GBM-model with the shortest forecasting horizon reaches an AUC 
value of 0.997 after just five iterations. However, it should be noted that the 
estimated conditional probabilities at this point range between 0.488 and 0.512, 
values that are only slightly different from the initial value of one half because 
of the shrinkage strategy described in Section 2.2.3. It could be argued that the 
AUC may not be the most suitable criterion when evaluating the in-sample 
performance in this setup. But since the main emphasis is on the out-of-sample 
performance of the models the AUCs are reported here for comparison. 

3 The in-sample AUCs with a single decision tree are close to or well above 0.95 
depending on the forecasting horizon. On the other hand, restricting the GBM-models by 
considering only the simplest stump regression trees and / or only the interest rate spreads as 
predictors is not sufficient, as the models still produce in-sample AUCs of one or really close to it. 
Results upon request. 

The validation AUCs reported in Table 2.2 provide additional insight into 
the potential overfitting problem since large deviations between the in-sample 
and validation performance are typically seen as a sign of overfitting. The 
validation AUCs for the GBM-models are of similar magnitude as the in-sample 
AUCs and therefore do not indicate overfitting. Döpke et al. (2017) also report 
validation AUCs close to one when forecasting recessions in Germany with 
the gradient boosting model. The validity of the traditional random sampling 
techniques used in cross-validation with such a highly autocorrelated binary 
response variable should be further examined. This however is beyond the 
scope of this research. 
Table 2.2 shows how the cost-sensitive GBM-model outperforms the other 
two models as measured with the validation AUC, although the difference 
between the two GBM-models is small. The gap in validation AUCs between 
the benchmark and GBM-models decreases slightly as the forecasting horizon 
grows. Graphical illustrations are an important part of recession forecasting 
since these can give a better picture of the false alarms and other potential 
problems related to the models. The estimated conditional probabilities that 
the economy is in recession h months from now are calculated according 
to equation (2.2). These in-sample estimated conditional probabilities are 
illustrated in Figure 2.1 for each of the three models and forecasting horizons. 
Figure 2.1: In-sample estimated conditional probabilities 
[Panels: h = 3, h = 6, h = 12; x-axis: Time; y-axis: Probability; series: bm, GBM, wGBM] 
The conditional probabilities for both GBM-models can be seen to mimic 
the shaded recession periods quite nicely. The in-sample fits for the two 
GBM-models have a rather similar shape without any major differences, which is 
in line with the results in Table 2.2. However, the recession signals produced 
by the cost-sensitive GBM-model are consistently stronger compared to the 
other two models with all the forecasting horizons. It is also noteworthy 
how the benchmark logit model produces much weaker signals for the last 
three recessions compared to the GBM-models. Figure 2.1 also shows how the 
estimated conditional probabilities for the GBM-models are not exactly zero 
or one and the in-sample fit is not perfect in probability terms. Using forecast 
performance evaluation criteria other than AUC, such as the binomial 
deviance or the quadratic probability score, would not indicate a perfect in-sample 
fit. 
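The contrast between rank-based and probability-based criteria can be illustrated with a small sketch. It assumes the quadratic probability score in the form of the mean of 2(p − y)² and the binomial deviance as −2 times the mean log-likelihood; both scalings are assumptions, since this section does not spell out the exact definitions. The toy forecasts separate the two classes perfectly, so their AUC would be one, yet they stay within the 0.488 to 0.512 band mentioned earlier.

```python
import math


def quadratic_probability_score(y_true, probs):
    """Mean of 2 * (p - y)^2; zero only for a perfect probability forecast."""
    return sum(2.0 * (p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


def binomial_deviance(y_true, probs, eps=1e-12):
    """-2 times the mean Bernoulli log-likelihood (probabilities are clipped
    away from 0 and 1 to keep the logarithm finite)."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -2.0 * total / len(y_true)


y = [0, 0, 1, 1, 0, 1]
# Perfect ranking, but probabilities barely move away from one half.
compressed = [0.512 if obs == 1 else 0.488 for obs in y]
# Perfect ranking and perfectly sharp probabilities.
sharp = [float(obs) for obs in y]

qps_compressed = quadratic_probability_score(y, compressed)
dev_compressed = binomial_deviance(y, compressed)
qps_sharp = quadratic_probability_score(y, sharp)
dev_sharp = binomial_deviance(y, sharp)
print(qps_compressed, dev_compressed, qps_sharp, dev_sharp)
```

Both forecasts order the observations identically, but only the sharp one drives the two probability-based losses to (numerically) zero.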
2.3.3 Out-of-sample results 
Good in-sample results may not always reflect the out-of-sample predictive 
ability of the model. An expanding window forecasting procedure is used 
to examine the true predictive ability of the models. Both Berge (2015) and 
Ng (2014) use a rolling window when forecasting U.S. recessions. To ensure the 
maximum sample size for the estimation of each model an expanding window 
approach is used in this study. 
The out-of-sample evaluation period runs from December 1988 to June 2017. 
Because of the high computational cost the GBM-models are re-estimated 
only once a year in December. The class weights 
are updated according to equation (2.11) as the proportion of zeros and ones 
changes for the binary response. The business cycle recession and expansion 
periods are not available in real time. The publication lag of the NBER business 
cycle chronology is thus assumed to be 12 months. 
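The timing of this recursive scheme can be sketched as a simple schedule. The sketch treats the evaluation months as forecast dates and assumes that a model estimated in December uses response observations only up to 12 months earlier; both choices are interpretations made for illustration, and the helper names are hypothetical.

```python
from datetime import date


def month_range(start, end):
    """Yield the first day of every month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield date(year, month, 1)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def shift_months(d, k):
    """Move a first-of-month date k months forward (k may be negative)."""
    index = d.year * 12 + (d.month - 1) + k
    return date(index // 12, index % 12 + 1, 1)


SAMPLE_START = date(1962, 1, 1)
EVAL_START, EVAL_END = date(1988, 12, 1), date(2017, 6, 1)
PUBLICATION_LAG = 12  # months, assumed lag of the NBER chronology

schedule = []
last_fit = None
for t in month_range(EVAL_START, EVAL_END):
    if t.month == 12:
        last_fit = t  # re-estimate once a year, in December
    # Expanding window: always starts at SAMPLE_START, ends at the last
    # response observation assumed to be known at the estimation date.
    train_end = shift_months(last_fit, -PUBLICATION_LAG)
    schedule.append((t, last_fit, SAMPLE_START, train_end))

n_refits = len({fit for _, fit, _, _ in schedule})
print(len(schedule), n_refits, schedule[0], schedule[-1])
```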
The results from the recursive out-of-sample forecasting procedure are 
reported in Table 2.3. The out-of-sample performance as measured with the 
area under the ROC-curve is illustrated for the different models at each of the 
three forecasting horizons. 
Table 2.3: Out-of-sample AUC (1988/12 - 2017/06) 

                        Forecast horizon, months 
Model specification     3         6         12 
Benchmark               0.748     0.811     0.919 
GBM                     0.841     0.816     0.867 
wGBM                    0.915     0.861     0.928 

The out-of-sample AUCs show that the cost-sensitive GBM-model 
outperforms the other two models with all the forecasting horizons. The difference 
in AUCs between the traditional and cost-sensitive gradient boosting models 
is quite similar with all the forecasting horizons. The average difference of 
the AUCs between the two GBM-models is 0.06. 
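The comparison can be checked directly from the entries of Table 2.3. The snippet below only restates the reported AUCs and computes the horizon-specific and average gaps between the two gradient boosting models; the variable names are illustrative.

```python
# Out-of-sample AUCs as reported in Table 2.3, keyed by forecast horizon in months.
auc_oos = {
    "Benchmark": {3: 0.748, 6: 0.811, 12: 0.919},
    "GBM": {3: 0.841, 6: 0.816, 12: 0.867},
    "wGBM": {3: 0.915, 6: 0.861, 12: 0.928},
}
horizons = (3, 6, 12)

# Gain of the cost-sensitive model over the traditional one at each horizon.
gain = {h: round(auc_oos["wGBM"][h] - auc_oos["GBM"][h], 3) for h in horizons}
mean_gain = sum(gain.values()) / len(horizons)

# Best-performing model at each horizon.
best = {h: max(auc_oos, key=lambda model: auc_oos[model][h]) for h in horizons}

print(gain, round(mean_gain, 2), best)
```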
The out-of-sample performance of the traditional GBM-model deteriorates 
quite heavily when compared to the in-sample AUCs reported in Table 
2.2. The standard GBM-model clearly outperforms the benchmark model only at 
the shortest forecasting horizon; at the six-month horizon the two AUCs are 
nearly equal (0.816 against 0.811). This diminished out-of-sample forecasting 
ability of the traditional GBM-model could indicate problems related to the 
class imbalance of the response. Blagus and Lusa (2017) note that the 
traditional GBM-model can perform poorly on high-dimensional data with class 
imbalance. Figure 2.2 illustrates the out-of-sample estimated conditional 
probabilities calculated according to equation (2.2) for all the different forecasting 
horizons and models. 
Figure 2.2: Out-of-sample estimated conditional probabilities 
[Panels: h = 3, h = 6, h = 12; x-axis: Time; y-axis: Probability; series: bm, GBM, wGBM] 
Figure 2.2 shows how the recession probabilities for each of the models 
with the short and medium term forecasting horizons spike just before the 
actual recession period in the early nineties. Although these spikes are 
considered false alarms and decrease the out-of-sample performance of the 
models, this heightened risk of an upcoming recession could have considerable 
practical importance. 
Figure 2.2 also illustrates the problems related to the diminished 
out-of-sample performance of the traditional GBM-model. The traditional 
GBM-model provides several false alarms, especially at the short and medium term 
forecasting horizons. With the longest forecasting horizon the traditional 
GBM-model gives a rather weak signal of the upcoming recession period in 
the early nineties when compared to the other two models. 
The cost-sensitive GBM-model on the other hand provides clear warnings 
of the upcoming recession periods in the short and medium term without any 
major false alarms, although the recession signal for the second recession 
period with the shortest forecasting horizon is quite modest. It should be 
noted that the magnitude of the recession signals is diminished for each of 
the three models when compared to the in-sample probabilities in Figure 2.1. 
With the 12-month forecasting horizon the cost-sensitive GBM-model 
provides strong warning signals for each of the three recessions. The estimated 
recession probabilities of the cost-sensitive GBM-model bear a close 
resemblance to those of the benchmark model. This also includes the two false alarms that 
are typical when predicting recessions with the term spread (see e.g., Kauppi 
and Saikkonen, 2008; Nyberg, 2010). 
To further consider the composition of the estimated cost-sensitive 
GBM-models Table 2.4 presents the ten most important out-of-sample predictors 
according to the relative influence measure presented in equation (2.15). 
Table 2.4: Top-10 out-of-sample predictors for wGBM 

h = 3                        h = 6                        h = 12 
Variable           Rel.inf   Variable           Rel.inf   Variable           Rel.inf 
6mth - FFrate_4      5.464   6mth - FFrate_6      8.722   10yr - FFrate_12    18.051 
10yr - FFrate_9      4.920   10yr - FFrate_9      4.979   5yr - FFrate_15      6.347 
6mth - FFrate_6      4.744   5yr - FFrate_15      4.026   5yr - FFrate_14      3.634 
6mth - FFrate_5      4.581   1yr - FFrate_6       3.941   10yr - FFrate_13     2.778 
6mth - FFrate_7      3.103   6mth - FFrate_7      3.739   5yr - FFrate_16      2.466 
5yr - FFrate_15      2.988   3mth - FFrate_6      3.570   10yr - FFrate_14     2.400 
10yr - FFrate_8      2.757   10yr - FFrate_8      3.111   5yr - FFrate_13      1.808 
3mth - FFrate_6      2.337   1yr - FFrate_7       2.512   AAA - FFrate_12      1.787 
1yr - FFrate_6       2.310   6mth - FFrate_8      2.175   5yr - FFrate_12      1.688 
1yr - FFrate_7       2.254   10yr - FFrate_11     2.048   PERMITS_15           1.496 

The cost-sensitive GBM-models rely heavily on different kinds of interest 
rate spreads as can be seen in Table 2.4. The only non-interest rate based 
predictor is the fifteenth lag of the new private housing permits variable 
(PERMITS_15) with the longest forecasting horizon. The absence of such 
predictors is a bit surprising at the short and medium term forecasting horizons, 
since variables describing the real economy are often found useful when 
predicting recessions with these forecasting horizons (see e.g., Berge, 2015). 
The heavy usage of interest rate 
spreads confirms that predictors with forecasting ability beyond the term 
spread are quite hard to find (see e.g., Estrella and Mishkin, 1998; Liu and 
Moench, 2016). 
Models based on different kinds of interest rate spreads can be affected 
by the problems related to the predictive power of the term spread noted 
in the previous literature. Several studies show how the term spread 
forecasts U.S. output growth less accurately after the mid 1980s (see e.g., Estrella, 
Rodrigues and Schich, 2003; Stock and Watson, 2003). The slightly lower 
out-of-sample AUCs reported in Table 2.3 for each of the three models, including 
the benchmark model, are in line with this finding. 
Table 2.4 shows how the interest rate spread between the 6-month treasury 
bill and the effective federal funds rate with the fourth lag (6mth - FFrate_4) is 
the most important predictor when predicting recessions three months ahead. 
The same predictor with the sixth lag is the most important predictor with 
the medium term forecasting horizon. The composition of the top-10 
out-of-sample predictors is quite similar between the short and medium term 
horizons. 
The chosen lag lengths of the predictors with the short and medium term 
horizons can deviate quite substantially from the length of the forecasting 
horizon. For example, the spread between the 5-year treasury bond and 
the effective federal funds rate with the fifteenth lag (5yr - FFrate_15) is an 
important predictor with both of these horizons. A similar observation can be 
made with the spread between the 10-year treasury bond and the effective 
federal funds rate with the ninth lag (10yr - FFrate_9). With the longest 
forecasting horizon the term spread with lag length equal to twelve (10yr 
- FFrate_12) has a very strong impact on the models as measured with the 
relative influence. Such dominance of a single predictor is not found with the 
short and medium term horizons. 
2.4 Conclusions 
This paper introduces a new cost-sensitive gradient boosting model which can 
take into account the class imbalance of the binary response variable. The 
cost-sensitive gradient boosting model is applied to predicting binary U.S. recession 
periods with a high-dimensional dataset of financial and macroeconomic 
variables. The internal model selection of the cost-sensitive gradient boosting 
algorithm provides important information about the most useful recession 
indicators and chosen lag lengths with different forecasting horizons. 
The empirical results show how the cost-sensitive extension to the gradient 
boosting model produces stronger and more stable recession forecasts for the 
U.S. with each forecasting horizon compared to the traditional gradient 
boosting model. A logit model based on the term spread is used as a benchmark 
model to see if the more complex gradient boosting models provide 
predictive power beyond the best known simple model. The cost-sensitive model 
outperforms the benchmark model with each forecasting horizon whereas the 
traditional gradient boosting model clearly outperforms the benchmark only 
at the shortest forecasting horizon. Different kinds of interest rate spreads are 
the most important predictors, even with the short and medium term 
forecasting horizons. The term spread is the dominant predictor when forecasting 
recessions one year ahead. 
The current research can be extended in several ways. First of all, the 
binary values for the class weights were chosen so that both the minority and 
the majority class receive similar attention in the learning process. Different 
choices for the class weights could be further examined, especially in cases 
where the class imbalance is even more radical. The cost-sensitive approach 
could also be extended to multinomial classification problems, where different 
types of class imbalance problems can emerge. There could be for example 
more than one minority class with a multinomial response variable. 
Introducing model dynamics is another potential area for future research. This would 
allow iterative forecasts to be used instead of the forecast horizon-specific 
forecasts as in this study. 
References 
Berge, T. J. (2015). Predicting recessions with leading indicators: Model averaging 
and selection over the business cycle. Journal of Forecasting, 34(6):455–471. 
Berge, T. J. and Jordà, Ò. (2011). Evaluating the classification of economic 
activity into recessions and expansions. American Economic Journal: Macroeconomics, 
3(2):246–277. 
Blagus, R. and Lusa, L. (2013). SMOTE for high-dimensional class-imbalanced 
data. BMC Bioinformatics, 14(1):106. 
Blagus, R. and Lusa, L. (2017). Gradient boosting for high-dimensional prediction 
of rare events. Computational Statistics & Data Analysis, 113:19–37. 
Branco, P., Torgo, L., and Ribeiro, R. (2016). A survey of predictive modeling 
on imbalanced domains. ACM Computing Surveys, 49. 
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2):123–140. 
Breiman, L., Friedman, J., Olshen, R., and Stone, C. J. (1984). Classification and 
Regression Trees. Wadsworth, New York. 
Bühlmann, P. and Yu, B. (2010). Boosting. Wiley Interdisciplinary Reviews: 
Computational Statistics, 2(1):69–74. 
Christiansen, C., Eriksen, J. N., and Møller, S. V. (2014). Forecasting US recessions: 
The role of sentiment. Journal of Banking & Finance, 49:459–468. 
Döpke, J., Fritsche, U., and Pierdzioch, C. (2017). Predicting recessions with 
boosted regression trees. International Journal of Forecasting, 33(4):745–759. 
Dueker, M. J. (1997). Strengthening the case for the yield curve as a predictor 
of U.S. recessions. Federal Reserve Bank of St. Louis Economic Review, 79:41–51. 
Elkan, C. (2001). The foundations of cost-sensitive learning. In Proceedings 
of the Seventeenth International Joint Conference on Artificial Intelligence, pages 
973–978. 
Estrella, A. and Mishkin, F. (1998). Predicting U.S. recessions: Financial variables 
as leading indicators. The Review of Economics and Statistics, 80(1):45–61. 
Estrella, A., Rodrigues, A. R., and Schich, S. (2003). How stable is the predictive 
power of the yield curve? Evidence from Germany and the United States. The 
Review of Economics and Statistics, 85(3):629–644. 
Fan, W., Stolfo, S. J., Zhang, J., and Chan, P. K. (1999). AdaCost: Misclassification 
cost-sensitive boosting. In Proceedings of the Sixteenth International 
Conference on Machine Learning, ICML ’99, pages 97–105, San Francisco, CA, 
USA. Morgan Kaufmann Publishers Inc. 
Freund, Y. and Schapire, R. E. (1996). Experiments with a new boosting algorithm. 
In Proceedings of the Thirteenth International 
Conference on Machine Learning, ICML ’96, pages 148–156, San Francisco, CA, 
USA. Morgan Kaufmann Publishers Inc. 
Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: 
A statistical view of boosting. The Annals of Statistics, 28:337–407. 
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting 
machine. Annals of Statistics, 29(5):1189–1232. 
Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics & 
Data Analysis, 38(4):367–378. 
Galar, M., Fernández, A., Tartas, E. B., Bustince, H., and Herrera, F. (2012). A 
review on ensembles for the class imbalance problem: Bagging-, boosting-, 
and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, 
Part C (Applications and Reviews), 42:463–484. 
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The elements of statistical 
learning: data mining, inference and prediction. Springer, 2nd edition. 
He, H. and Garcia, E. A. (2009). Learning from imbalanced data. IEEE 
Transactions on Knowledge and Data Engineering, 21(9):1263–1284. 
Kauppi, H. and Saikkonen, P. (2008). Predicting U.S. recessions with dynamic 
binary response models. The Review of Economics and Statistics, 90(4):777–791. 
Kim, J.-H. (2009). Estimating classification error rate: Repeated 
cross-validation, repeated hold-out and bootstrap. Computational Statistics & 
Data Analysis, 53(11):3735–3745. 
Liu, W. and Moench, E. (2016). What predicts US recessions? International 
Journal of Forecasting, 32(4):1138–1150. 
Maloof, M. A. (2003). Learning when data sets are imbalanced and when 
costs are unequal and unknown. In ICML-2003 Workshop on Learning from 
Imbalanced Data Sets II. 
Ng, S. (2014). Viewpoint: Boosting recessions. Canadian Journal of Economics, 
47(1):1–34. 
Nyberg, H. (2010). Dynamic probit models and financial variables in recession 
forecasting. Journal of Forecasting, 29(1-2):215–230. 
Nyberg, H. and Pönkä, H. (2016). International sign predictability of stock 
returns: The role of the United States. Economic Modelling, 58(C):323–338. 
R Core Team (2017). R: A Language and Environment for Statistical Computing. R 
Foundation for Statistical Computing, Vienna, Austria. 
Ridgeway, G. (2017). gbm: Generalized Boosted Regression Models. R package 
version 2.1.3. 
Stock, J. H. and Watson, M. W. (2003). Forecasting output and inflation: The 
role of asset prices. Journal of Economic Literature, 41(3):788–829. 
Sun, Y., Kamel, M. S., Wong, A. K., and Wang, Y. (2007). Cost-sensitive boosting 
for classification of imbalanced data. Pattern Recognition, 40(12):3358–3378. 
Ting, K. M. (2000). A comparative study of cost-sensitive boosting algorithms. 
In Proceedings of the Seventeenth International Conference on Machine Learning, 
ICML ’00, pages 983–990, San Francisco, CA, USA. Morgan Kaufmann 
Publishers Inc. 
Wheelock, D. C. and Wohar, M. E. (2009). Can the term spread predict output 
growth and recessions? A survey of the literature. Federal Reserve Bank of St. 
Louis Review, Part 1(Sep/Oct):419–440. 
Zhou, Z.-H. (2012). Ensemble Methods: Foundations and Algorithms. Chapman & 
Hall/CRC, 1st edition. 
Zhou, Z.-H. and Liu, X.-Y. (2010). On multi-class cost-sensitive learning. 
Computational Intelligence, 26(3):232–257. 

Chapter 3 
Forecasting multinomial stock returns using machine learning methods 
Abstract ∗ † 
In this chapter, the daily returns of the S&P 500 stock market index are 
predicted using a variety of different machine learning methods. We propose 
a new multinomial classification approach to forecasting stock returns. The 
multinomial approach can isolate the noisy fluctuation around zero return 
and allows us to focus on predicting the more informative large absolute 
returns. Our in-sample and out-of-sample forecasting results indicate significant 
return predictability from a statistical point of view. Moreover, all the 
machine learning methods considered outperform the benchmark buy-and-hold 
strategy in a real-life trading simulation. The gradient boosting machine is 
the top-performer in terms of both the statistical and economic evaluation 
criteria. 
∗ The author would like to thank Henri Nyberg, Heikki Kauppi, Jaakko Peltonen, Martti 
Juhola, Matthijs Lof, Luis Alvarez Esteban and the seminar participants at Turku Finance 
Workshop, GSE Econometrics workshop and CFE-CMStatistics 2019 for excellent comments. 
This work was supported by the Emil Aaltonen Foundation and the Academy of Finland. 
† A paper based on this chapter has been accepted for publication in the Journal of Finance 
and Data Science (forthcoming). © KeAi Communications Co. Ltd. 
3.1 Introduction 
Forecasting stock returns has attracted a tremendous amount of interest ever 
since the introduction of computers to economic forecasting. Kendall (1953) 
was among the first to reach the conclusion of no predictability in stock prices. 
Later Fama (1970) stated in the famous efficient market hypothesis that it 
should not be possible to earn abnormal returns by using historical data. But has 
the ever-growing amount of information and computational power in recent 
decades changed this relationship? State of the art machine learning methods, 
which can handle large amounts of information and discover complex 
relationships in data, provide further insight into whether profitable trading 
strategies can be discovered using past information. 
The predictability of stock returns is a controversial subject. In a 
comprehensive study, Welch and Goyal (2008) argue that the predictability found 
for the level of stock returns in the previous literature is time inconsistent 
and does not hold when new data is introduced. More recent work by Neely 
et al. (2014), among others, challenges the view of Welch and Goyal (2008) 
by reporting statistically significant predictability using more sophisticated 
forecasting methods. 
Instead of the actual level of return another strand of literature focuses on 
predicting the binary sign of stock returns (i.e. directional predictability of 
stock returns). Leung, Daouk and Chen (2000) provide early evidence in favor 
of using the binary response variable instead of the actual level. Other studies 
reporting statistically significant predictability using monthly stock returns 
are, for example, Nyberg (2011) and Nyberg and Pönkä (2016). Christoffersen 
and Diebold (2006) show theoretically that sign predictability may exist even 
without the assumption of mean-predictability. 
Although the majority of the previous literature concerns predicting monthly 
returns, some more recent studies have reported sign predictability using 
daily returns as well (see e.g., Skabar, 2013; Zhong and Enke, 2017; Fiévet 
and Sornette, 2018; Karhunen, 2019). The main objective in this research is to 
predict daily stock returns of the U.S. stock market (more specifically S&P 500 
index returns) using different machine learning methods. 
Directional prediction of stock returns is based on forecasting whether
returns are greater than some pre-specified threshold. Previous research mainly
focuses on sign prediction, where this threshold is equal to zero (i.e. whether 
the return is positive or negative), but some other alternatives have also been 
considered. Linton and Whang (2007) use the estimated unconditional quan-
tile of the return as a threshold. Chung and Hong (2007) express the threshold 
as multiples of the estimated standard deviation when forecasting the
direction of exchange rates. Both studies find evidence of directional
predictability in asset returns using different statistical testing procedures.
Directional prediction of stock returns has a close connection to the market
timing models considered by Merton (1981) and Pesaran and Timmermann
(1995). Directional prediction of stock returns leads to simple binary trading
strategies which can be used to assess the economic significance of the
forecasting ability. Predicting the sign of stock returns involves a large
number of asset allocation decisions, and the costs related to these
transactions can be problematic when compared to the benchmark buy-and-hold
strategy. This problem is even more pronounced with daily data (Becker and
Leschinski, 2018).
By considering two different thresholds instead of just one the directional 
prediction problem becomes multinomial. The signal-to-noise ratio in stock 
returns is fairly low, especially with daily data (Becker and Leschinski, 2018). 
Chung and Hong (2007) argue that the informational content of large absolute 
returns may be more valuable whereas small returns in absolute value are 
merely noise. It is also noted that the co-movement of individual stocks 
with the market portfolio is stronger with large absolute returns (see e.g., 
Longin and Solnik, 2001; Ang and Chen, 2002; Hong, Tu and Zhou, 2007). 
The multinomial response allows us to isolate some of the noise and put 
more emphasis on predicting the large absolute returns. The multinomial 
directional prediction also enables a richer set of possible trading strategies. 
For example, one could choose between buying, holding and selling stocks
instead of the binary buy and sell decisions considered so far. To the
best of our knowledge, multinomial stock returns have not been utilized in
previous economic research.
Our results confirm the previous findings of sign predictability in stock
returns. All the machine learning methods considered in this research produce
multinomial classification results which are significant from both the
statistical and the economic point of view. Each method is able to outperform
the benchmark buy-and-hold strategy in a real-life trading simulation when
trading costs are taken into account. Among the machine learning methods
considered, an ensemble method called gradient boosting is the top performer
in terms of both the classification accuracy and the profits from a real-life
trading simulation.
The results also show how the predictability of large absolute returns tends
to cluster around certain periods of time. This is in line with the findings
of Krauss, Do and Huck (2017) and Fiévet and Sornette (2018), who notice
increased predictability during high market turmoil. A closely related
conclusion often reported in the financial literature is the higher return
predictability during business cycle recession periods (see e.g., Henkel,
Martin and Nardari, 2011; Cujean and Hasler, 2017). Events such as the
financial crisis or the European debt crisis involve high volatility in the
stock markets but also highly profitable trading opportunities. Our results
show that volatility in the stock market as measured by the VIX index is the
single most influential predictor of the next day's stock returns.
The remainder of this paper is organized as follows. The prediction prob-
lem and the different machine learning methods are presented in Section 3.2. 
The dataset and the model selection process for different machine learning 
methods are described in Section 3.3. The empirical analysis and the results 
are covered in Section 3.4. Section 3.5 concludes. 
3.2 Methodology 
3.2.1 Multinomial stock returns 
Financial literature usually focuses on the reward of holding a risky asset such
as stocks compared to the risk-free investment. This excess return is denoted
as

$$Z_t = r_t - r_{ft}, \tag{3.1}$$

where $r_t$ is the logarithmic daily return of the S&P 500 stock market index at
time $t$ and $r_{ft}$ is the 3-month Treasury bill yield.$^{1}$ In directional
prediction the binary dependent variable is created from the return series in
equation (3.1) using an indicator function

$$B_t(c) = I(Z_t > c), \tag{3.2}$$

where $c$ is a given threshold. The multinomial response variable with three
classes can be derived from the continuous stock returns using two thresholds
$c_1$ and $c_2$

$$R_t(c_1, c_2) = \begin{cases} 1, & \text{if } Z_t < c_1 \\ 2, & \text{if } c_1 \le Z_t \le c_2 \\ 3, & \text{if } Z_t > c_2. \end{cases} \tag{3.3}$$

$^{1}$ The Federal Reserve reports annualized yields using a 360-day year, also
known as the bank discount method. The daily yield is therefore calculated as
$r_{ft} = tbill_t \cdot \frac{1}{360}$.
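The mapping from returns to the response variables in equations (3.1)-(3.3) can be sketched in a few lines of Python. This is only an illustration: the function names, the example returns, the thresholds of ±0.005 and the assumption that the Treasury bill yield is quoted as a decimal are all hypothetical and not taken from the thesis.

```python
def daily_risk_free(tbill_annualized):
    """Daily yield under the 360-day bank discount convention (footnote 1).

    Assumes the annualized yield is given as a decimal, e.g. 0.018 for 1.8%.
    """
    return tbill_annualized / 360.0


def excess_return(r, rf):
    """Equation (3.1): Z_t = r_t - r_ft."""
    return r - rf


def binary_response(z, c=0.0):
    """Equation (3.2): B_t(c) = I(Z_t > c)."""
    return 1 if z > c else 0


def multinomial_response(z, c1, c2):
    """Equation (3.3): class 1 below c1, class 3 above c2, class 2 otherwise."""
    if c1 > c2:
        raise ValueError("c1 must not exceed c2")
    if z < c1:
        return 1
    if z > c2:
        return 3
    return 2


# Hypothetical daily log returns and a hypothetical annualized T-bill yield.
returns = [-0.021, 0.0004, 0.013, -0.003]
rf = daily_risk_free(0.018)
z = [excess_return(r, rf) for r in returns]
signs = [binary_response(v) for v in z]
classes = [multinomial_response(v, c1=-0.005, c2=0.005) for v in z]
```

Note that the two small returns fall into the middle class even though their signs differ, which is the sense in which the multinomial response sets small, noisy returns aside.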
A natural question is how to choose the two thresholds, which are in principle
arbitrary. To the best of our knowledge, the multinomial approach with two
thresholds as in equation (3.3) has not been considered in the previous
literature regarding directional prediction of stock returns. The majority of
the previous literature with a single threshold as in equation (3.2) focuses on
binary sign prediction, where c = 0 (see e.g., Leung et al., 2000; Karhunen,
2019). Although previous literature on directional prediction of stock returns
with a single non-zero threshold is quite scarce, some alternatives have been
considered.
Chung and Hong (2007) argue that the choice of c can be based on the
observed data or alternatively held fixed using, for example, the magnitude of
transaction costs. In their data-based approach Chung and Hong (2007) use
multiples of the estimated standard deviation as a threshold when forecasting
the direction of exchange rates. Linton and Whang (2007) consider different
unconditional quantiles of the return series when testing for directional
predictability in stock returns. They report statistically significant
predictability in daily returns for all but the most extreme quantiles, where
the amount of data is insufficient.
Maheu and McCurdy (2004) show that large price changes of individual 
stocks are driven by important news and these large changes tend to be 
clustered together. It is also noted using market level data that large absolute 
stock returns contain stronger positive autocorrelation than small absolute 
returns do and they are therefore more predictable (see e.g., Granger and Ding, 
1996). Setting the thresholds c1 and c2 in equation (3.3) further apart from zero 
may result in more predictability but also in more imbalanced classes. 
Since the main objective of this research is to compare the predictive ability 
of several different machine learning methods, we have chosen to use the
upper and lower quartiles of the return series as thresholds. This data-based
approach yields nicely balanced classes, as one half of the observations come
from the middle class in equation (3.3) and the other half from the
"abnormal" classes. Well-balanced classes also allow similar rules to be
used with each method in the classification process, where the probability
estimates are transformed into class predictions.
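To make the class balance concrete, the following sketch sets the thresholds to the lower and upper quartiles of a simulated return series and counts the resulting classes. The simulated Gaussian returns are only a stand-in for the actual data, and the names are hypothetical. Whether the quartiles are estimated from the full sample or from a training sample only is not specified in this section, so the sketch simply uses the whole simulated series.

```python
import random
import statistics
from collections import Counter

rng = random.Random(0)
# Simulated stand-in for daily excess returns; not the thesis data.
z = [rng.gauss(0.0, 0.01) for _ in range(1000)]

# With n=4, statistics.quantiles returns the three quartile cut points.
c1, _median, c2 = statistics.quantiles(z, n=4)


def to_class(value):
    """Three-class response of equation (3.3) with quartile thresholds."""
    if value < c1:
        return 1
    if value > c2:
        return 3
    return 2


counts = Counter(to_class(v) for v in z)
shares = {k: counts[k] / len(z) for k in (1, 2, 3)}
```

By construction roughly one quarter of the observations fall in each of the outer classes and one half in the middle class.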
Consider the stochastic processes $R_t$ and $x_{t-1}$, where $R_t$ is the
multinomial response variable described in equation (3.3) and $x_{t-1}$ is a
$p \times 1$ vector of predictors at time $t-1$. Conditional on the information
set we assume the response variable to follow a categorical distribution

$$R_t \mid \Omega_{t-1} \sim \mathrm{Cat}(p_t),$$

where $\Omega_{t-1}$ is the information set available at time $t-1$ and $p_t$ is
a $K \times 1$ vector of conditional probabilities. Each element of $p_t$ is the
conditional probability of class $k$ being the observed class at time $t$. More
formally the conditional probability for each class $k$ can be written as

$$p_{tk}(x_{t-1}) = P(R_t = k \mid \Omega_{t-1}), \quad k = 1, \ldots, K, \tag{3.4}$$

where $K = 3$ in this study. The conditional probabilities in equation (3.4)
must satisfy $0 \le p_{tk} \le 1$ and $\sum_{k=1}^{K} p_{tk} = 1$. These
conditions are met by the symmetric multiple logistic transform. The
conditional probabilities for each class $k$ can be constructed using the
functional estimates $F_k(x_{t-1})$ as

$$p_{tk}(x_{t-1}) = \frac{e^{F_k(x_{t-1})}}{\sum_{l=1}^{K} e^{F_l(x_{t-1})}}. \tag{3.5}$$
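As a small numerical illustration of the transform in equation (3.5), the sketch below converts a vector of functional estimates into class probabilities. The function name and the example scores are hypothetical. Subtracting the largest score before exponentiating is a standard numerical safeguard that leaves the ratios in (3.5) unchanged; it is an implementation choice of this sketch, not something stated in the text.

```python
import math


def class_probabilities(scores):
    """Equation (3.5): p_k = exp(F_k) / sum_l exp(F_l)."""
    shift = max(scores)  # cancels in the ratio, avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical functional estimates F_1, F_2, F_3 for one observation.
probs = class_probabilities([0.2, 1.0, -0.5])
```

The resulting probabilities lie between zero and one and sum to one, as required of the conditional probabilities in equation (3.4).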
In the general $K$-class classification problem the goal is to find the
functions, one for each class $k$, that minimize the expected loss of some
predefined loss function

$$\{\hat F_k(x_{t-1})\}_{k=1}^{K} = \arg\min_{\{F_k(x_{t-1})\}_{k=1}^{K}} E\left[ L\left(R_t, \{F_k(x_{t-1})\}_{k=1}^{K}\right) \right].$$
For the majority of methods used in this research the loss function considered
is the multinomial deviance

$$L\left(R_t, \{F_k(x_{t-1})\}_{k=1}^{K}\right) = -\sum_{k=1}^{K} I(R_t = k) \log p_{tk}(x_{t-1}), \tag{3.6}$$

where $p_{tk}(x_{t-1})$ is the logistic transform presented in equation (3.5).
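Because the indicator in equation (3.6) is nonzero only for the observed class, the loss for a single observation reduces to the negative log probability assigned to that class. A minimal sketch, with hypothetical function names and probabilities, and with classes labelled 1, ..., K as in equation (3.3):

```python
import math


def multinomial_deviance(true_class, probs):
    """Equation (3.6) for one observation: -sum_k I(R_t = k) log p_tk."""
    if not 1 <= true_class <= len(probs):
        raise ValueError("true_class must be in 1..K")
    return -math.log(probs[true_class - 1])


def mean_deviance(true_classes, prob_rows):
    """Sample average of the per-observation deviance."""
    losses = [multinomial_deviance(r, p) for r, p in zip(true_classes, prob_rows)]
    return sum(losses) / len(losses)


# Hypothetical probability forecasts for an observation whose true class is 3.
confident = multinomial_deviance(3, [0.1, 0.2, 0.7])
uninformed = multinomial_deviance(3, [1 / 3, 1 / 3, 1 / 3])
```

A forecast that puts more probability on the realized class receives a smaller loss than the uninformed forecast of equal probabilities.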
Accuracy is used as the evaluation metric to compare the classification
performance of different machine learning methods to be introduced in the next
section (Section 3.2.2). Accuracy is calculated as the proportion of correctly
classified data points in the considered sample

$$\mathrm{Acc} = \frac{1}{N} \sum_{t=1}^{N} I(\hat R_t = R_t),$$

where $\hat R_t$ is the predicted class and $R_t$ the true class label at time $t$.
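The accuracy measure translates directly into code. The sketch below uses a hypothetical function name and made-up class labels.

```python
def accuracy(predicted, actual):
    """Share of observations where the predicted class equals the true class."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical predictions and realized classes for four days.
acc = accuracy([1, 2, 3, 2], [1, 2, 2, 2])
```

With quartile thresholds, a naive rule that always predicts the middle class would reach an accuracy of about 0.5 in the sample used to set the thresholds, which gives a simple reference point for the reported accuracies.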
3.2.2 Machine learning methods 
k-Nearest neighbor classifier
The k-nearest neighbor method, originally presented by Fix and Hodges (1951),
can be considered a model-free classification method since the classification
of a new observation is based purely on the data points of the training set.
Our training set consists of $N$ pairs $\{(x_{t-1}, R_t)\}_{t=1}^{N}$, where
$x_{t-1}$ is the vector of feature values and $R_t$ is the multinomial response
variable given in equation (3.3). In order to classify a new data point $x_N$
we need to find the $k$ data points in the training data closest to the new
data point based on some distance measure. The Euclidean distance is the most
commonly used alternative. These $k$ data points are called the nearest
neighbors of $x_N$. The final classification is based on a majority vote of the
response values of these $k$ nearest neighbors. Ties are broken at random. This
process is repeated for each data point in the test set.
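The classification rule described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the thesis: the function name and the toy data are hypothetical, and points at exactly the same distance are ordered by their position in the training set, which is an assumption of this sketch.

```python
import math
import random
from collections import Counter


def knn_classify(train_x, train_y, new_x, k, rng=random):
    """Classify new_x by majority vote among its k nearest training points."""
    # Euclidean distance from the new point to every training point.
    distances = sorted((math.dist(x, new_x), i) for i, x in enumerate(train_x))
    neighbor_labels = [train_y[i] for _, i in distances[:k]]
    votes = Counter(neighbor_labels)
    top = max(votes.values())
    tied = sorted(label for label, n in votes.items() if n == top)
    # Ties in the vote are broken at random, as described in the text.
    return rng.choice(tied)


# Small artificial dataset with two well-separated groups.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [1, 1, 1, 3, 3, 3]
near_origin = knn_classify(train_x, train_y, (0.2, 0.2), k=3)
far_corner = knn_classify(train_x, train_y, (5.5, 5.5), k=1)
```

Since no model is estimated, all the work happens at prediction time: every new data point requires a distance calculation against the whole training set.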
Figure 3.1 illustrates how the k-nearest neighbor classification works with
a small artificial dataset containing three classes. The left-hand side of
Figure 3.1 illustrates the nearest neighbors of two new data points based on
Euclidean distance. The number of nearest neighbors k is assumed to be either
1 or 3. For two example locations marked with X, the solid gray circle shows
the one nearest neighbor and the dashed circle illustrates the neighbors when k
equals three. The right-hand side of Figure 3.1 shows the decision boundary for
this artificial data when k equals one.
Figure 3.1: k-Nearest neighbor classification
The k-nearest neighbor classification is used as a benchmark method in this
research because it is fairly easy to fine-tune. The only tuning parameter of
the method is the number of neighbors k. Larger values of k lead to smoother
and less detailed decision boundaries. Despite its simplicity, k-nearest
neighbor has shown success in different kinds of financial applications such as
forecasting foreign exchange rates or stock market volatility (see e.g., Meade,
2002; Arroyo and Maté, 2009; Andrada-Félix, Fernández-Rodríguez and Fuertes,
2016). Since the features in the dataset could have a variety of different
scales, and the distance measure would otherwise be dominated by the features
with the largest scale, each feature is typically re-scaled to have mean zero
and variance equal to one.
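The re-scaling step can be sketched as below. The function names and the example rows are hypothetical. The sketch estimates the means and standard deviations from one sample and then applies them to that sample; reusing the training-sample statistics for new data is a common choice, but which sample the statistics are taken from is an assumption here and is not stated in this section.

```python
import statistics


def fit_scaler(rows):
    """Column-wise mean and (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return means, stds


def apply_scaler(rows, means, stds):
    """Re-scale each feature to mean zero and unit variance."""
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical features on very different scales: an index level and a return.
train = [[15.0, 0.001], [20.0, -0.002], [25.0, 0.004]]
means, stds = fit_scaler(train)
scaled = apply_scaler(train, means, stds)
```

After re-scaling, both features contribute on a comparable scale to the Euclidean distance used by the k-nearest neighbor classifier.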
Gradient boosting 
The classification algorithm called adaboost was first introduced by Freund
and Schapire (1996). For a long time the classification ability of the adaboost
algorithm remained controversial. This was until Friedman, Hastie and
Tibshirani (2000) created a statistical framework for the boosting procedure and
showed how the adaboost algorithm fits an additive logistic regression model.
The more general gradient boosting algorithm was discovered as Friedman
(2001) introduced the connection to numerical optimization in function space.
Gradient boosting can be used for both classification and regression problems.
In gradient boosting the goal is to find the function minimizing the expected
loss of some predetermined loss function

$$\hat{F}(x_{t-1}) = \arg\min_{F(x_{t-1})} E\left[L(y_t, F(x_{t-1}))\right].$$

In order to keep the notation fairly simple let us consider the binary
classification problem, where $y_t \in \{0, 1\}$ and $L(y_t, F(x_{t-1}))$ is the
binomial deviance. With the multinomial response variable presented in equation
(3.3) a separate function is estimated for each class, which complicates the
notation.
Gradient boosting is an ensemble method, where the possibly very complex
final model is a combination of simple models called base learners

$$F_M(x_{t-1}) = \sum_{m=1}^{M} f_m(x_{t-1}). \tag{3.7}$$

The base learners $f(x_{t-1})$ are assumed to belong to some parameterized class
of functions. These could be for example simple linear models, spline functions
or regression trees. The base learner used in this study is the J-terminal
node regression tree, which splits the predictor space into J disjoint regions
and attaches a constant to each region. Mathematically the J-terminal node
regression tree base learner can be written as

$$f(x_{t-1}; \{c_j, R_j\}_{j=1}^{J}) = \sum_{j=1}^{J} c_j I(x_{t-1} \in R_j), \tag{3.8}$$

where $c_j \in \mathbb{R}$ is the functional estimate in region $R_j$.
Figure 3.2 illustrates J-terminal node regression trees graphically and plots
a 4-terminal node regression tree and the terminal node regions created by
this tree. The left-hand side depicts the classical tree shape. Each of the three
split points is a function of the splitting variables and split locations. The
right-hand side of Figure 3.2 shows the terminal node regions $\{R_j\}_{j=1}^{4}$
and the split locations $\{t_l\}_{l=1}^{3}$ in a 2-dimensional space.
Figure 3.2: 4-terminal node regression tree
The final ensemble in equation (3.7) is estimated in a greedy stagewise
fashion using a method called forward stagewise additive modeling. The
estimation of a gradient boosting model is described in Algorithm 3.1. The
algorithm starts with an initial value, which is a simple constant based on
the considered loss function. At each iteration m of the gradient boosting
algorithm a new base learner function which best fits the negative gradient of
the loss function is selected and added to the current ensemble $F_{m-1}$. With
the J-terminal node regression tree in equation (3.8) as the base learner this
corresponds to finding the J non-overlapping terminal node regions
$\{R_{jm}\}_{j=1}^{J}$ using a least squares criterion. After finding the
terminal node regions the functional estimates $\hat{c}_{jm}$ are obtained in a
simple minimization problem. The current ensemble $F_{m-1}$ is then updated with
the functional estimates before calculating the pseudo responses $\tilde{y}_t$
for the next round of the algorithm.
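Algorithm 3.1 Gradient boosting using J-terminal node regression trees
$F_0(x_{t-1}) = \arg\min_{\rho} \frac{1}{N} \sum_{t=1}^{N} L(y_t, \rho)$
for m ← 1 to M do:
    $\tilde{y}_t = -\left[\frac{\partial L(y_t, F(x_{t-1}))}{\partial F(x_{t-1})}\right]_{F(x_{t-1}) = F_{m-1}(x_{t-1})}, \quad t = 1, \ldots, N$
    estimate $\{R_{jm}\}_{j=1}^{J}$ using the least squares criterion
    $\hat{c}_{jm} = \arg\min_{c_{jm}} \sum_{x_{t-1} \in R_{jm}} L(y_t, F_{m-1}(x_{t-1}) + c_{jm}), \quad j = 1, \ldots, J$
    $F_m(x_{t-1}) = F_{m-1}(x_{t-1}) + \upsilon \sum_{j=1}^{J} \hat{c}_{jm} I(x_{t-1} \in R_{jm})$
end for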
Algorithm 3.1 also illustrates the tuning parameters related to the gradient
boosting method. The number of iterations M and the learning rate
$\upsilon \in (0, 1]$ control the learning process. Setting M too low can result
in underfitting whereas too many iterations can lead to overfitting. Setting the
learning rate smaller than one can be seen as a shrinkage strategy as the
parameter $\upsilon$ shrinks each functional estimate towards zero and thereby
controls the speed of the learning process. These two parameters are inversely
related to each other. A smaller learning rate usually requires more trees to
be built (Friedman, 2001).
The complexity of the J-terminal node regression tree base learner function
can be controlled by the number of terminal nodes J and the number of
observations required at each terminal node region. Requiring more
observations in each terminal node region narrows down the number of potential
split points and therefore controls the complexity of each tree. Building
larger trees with more terminal nodes results in more complex models but the
risk of overfitting also grows.
Note from the graphical illustration in Figure 3.2 that in order to build a
J-terminal node regression tree J − 1 split points are needed and the size of
the regression tree also controls the number of interactions allowed between
different predictors. Instead of requiring the exact number of terminal nodes
J some software implementations use the depth of the tree D as a tuning
parameter. The depth of the regression tree is the maximum number of split
points on a path from the root to a leaf node. The depth of the regression tree
in Figure 3.2 for example is two since there are two split points between the
root and each leaf node.
Different subsampling strategies can also be used for regularization with
the gradient boosting model. The subsampling is usually done row-wise,
where only a certain fraction $\eta_{row}$ of training samples is used when
estimating the parameters of the base learner function at each round of the
algorithm. By using row-wise subsampling the regression trees at each round
tend to be less similar. Column-wise subsampling is also available, where
only a certain fraction $\eta_{col}$ of the available predictors is used at each
round of the gradient boosting algorithm. The exact amount of subsampling used
both row-wise and column-wise is finetuned using cross-validation.
Random forest 
The random forest algorithm of Breiman (2001) has a close connection to both
bagging and the adaboost classification algorithm. The final model with each
of these three methods is an ensemble of simple models. The original idea of
random forest is to improve the classification ability of bagging by reducing
the correlation between each component in the final ensemble. This is done by
injecting additional randomness when building each component of the final
model.
Similarly as with boosting the base learner function used at each step of
the random forest algorithm is a tree-based model. Unlike the regression tree
presented in equation (3.8) the base learner with the random forest
classification algorithm is a classification tree. The graphical illustration
given in Figure 3.2 holds for the classification tree as well, but now the
functional estimate in each terminal node region of the J-terminal node
classification tree is the predicted class

$$f(x_{t-1}; \{C_j, R_j\}_{j=1}^{J}) = \sum_{j=1}^{J} C_j I(x_{t-1} \in R_j), \tag{3.9}$$

where $x_{t-1}$ is a vector of inputs at time t − 1 and $C_j$ is the predicted
class in region $R_j$.
Instead of fixing the number of terminal nodes J as with gradient boosting
the complexity of each tree in the random forest is typically controlled by
requiring a certain number of observations at each terminal node. In the
random forest algorithm the depth of each tree is increased by adding additional
split points for as long as the number of observations in the terminal node is
greater than a prespecified constant $n_{min}$. This constant is a tuning
parameter related to the random forest algorithm as all the terminal nodes must
hold at least $n_{min}$ data points. Especially with classification problems the
trees in the random forest are often grown to the full size requiring only one
observation in each terminal node region. Hastie, Tibshirani and Friedman
(2009, p. 596) argue that letting the trees in the random forest grow to the
maximum size seldom costs much and results in one less tuning parameter.
The power of random forests comes from combining the predictions of
many accurate individual trees that are as diverse as possible. In order to
make the trees in the random forest ensemble less correlated only a subset of
features is considered when new split points are added to the classification
tree. If the number of features in the dataset is p, then only m (m ≤ p)
randomly chosen features are considered as candidates when selecting a new
split point. The exact value of m depends on the problem at hand and is
treated as a tuning parameter. Especially in problems where the proportion
of relevant features in the whole feature set is small, setting m too low may
result in poor performance (Hastie et al., 2009, p. 596).
Similarly as in bagging, a bootstrap sample $Z^*$ of size N is drawn from the
training data at each round $b \in \{1, \ldots, B\}$ of the algorithm. A new
decision tree $f_b(x_{t-1}, \Theta)$ is fit using this bootstrap sample, where
the parameter vector $\Theta$ holds the parameters of the decision tree
presented in equation (3.9). The split points of this decision tree are found
recursively by considering only the m randomly chosen features at each step.
The decision tree is grown to the maximum possible size controlled by the
parameter $n_{min}$, which sets the minimum number of observations needed in
each node of the tree. This process is summarized in Algorithm 3.2.
Algorithm 3.2 Random forest classification
for b ← 1 to B do:
    draw a bootstrap sample $Z^*$ of size N from the training data
    create an empty decision tree $f_b(x_{t-1} \in Z^*, \Theta = \emptyset)$
    while (the number of observations in some node > $n_{min}$) do:
        randomly select m variables
        pick the best split point among these
        split the current node into two daughter nodes
    end while
    include $f_b(x_{t-1}, \Theta)$ in the ensemble $F_b$
end for
The final ensemble in the random forest algorithm is a combination of the
individual trees found in each round b of the algorithm

$$F(x_{t-1}) = \{f_b(x_{t-1}, \Theta)\}_{b=1}^{B}.$$

The classification of a new data point $x_N$ is based on a majority vote between
the classifications induced by each individual tree

$$\hat{C}_{rf}(x_N) = \text{majority vote}\, \{\hat{C}_b(x_N)\}_{b=1}^{B},$$

where $\hat{C}_b(x_N)$ is the predicted class given by the bth decision tree in
the random forest ensemble.
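The three sources of randomness and aggregation in the random forest, namely
the bootstrap sample, the random choice of m candidate features and the final
majority vote, can be illustrated with a short sketch. The tree-growing step
itself is left out, and the helper names below are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw N observations with replacement from a training set of size N."""
    return [rng.choice(data) for _ in data]

def candidate_features(p, m, rng):
    """Pick m of the p feature indices at random for one split search."""
    return rng.sample(range(p), m)

def majority_vote(tree_predictions):
    """Combine the per-tree class predictions into the forest prediction."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)                       # fixed seed for reproducibility
sample = bootstrap_sample(list(range(10)), rng)
features = candidate_features(p=370, m=50, rng=rng)
forest_class = majority_vote([1, 0, 1, -1, 1])   # three of five trees vote 1
```

When two classes receive the same number of votes, this sketch returns the class
that appears first among the tree predictions, which is an arbitrary
tie-breaking rule chosen only for the illustration.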
Neural networks 
Neural networks were originally designed as a tool to model the information
processing capabilities of the human brain and the earliest attempts go back as
far as the 1940s (Rojas, 1996). There is a vast number of neural network models
with different assumptions regarding the structure of the network and how
information flows through the network. The model used in this research is
one of the most commonly used neural network models called a single hidden
layer feed-forward neural network (Bishop, 2006, p. 229).
The network consists of three layers which are typically called the input
layer, the hidden layer and the output layer. Each layer in a feed-forward
network is connected with the subsequent layer through weights as is visualized
in Figure 3.3. The directed edges represent the weights and the direction of
information flow in the network.
Figure 3.3: Artificial neural network
In general there could be several hidden layers creating a deeper and
more complex network. Each unit in the hidden layer of Figure 3.3 is called
a hidden unit since these are typically unobserved. These hidden units are
linear combinations of the input variables $x = (x_1, x_2, \ldots, x_p)'$
followed by a non-linear activation function:

$$Z_m = h(\alpha_{0m} + \alpha_m' x), \quad m = 1, \ldots, M, \tag{3.10}$$

where $\alpha_{0m}$ is the weight from the bias unit, $\alpha_m$ is a p × 1
vector of weights coming into hidden unit $Z_m$ and $h(\cdot)$ is the activation
function. The sigmoid function is typically used to transform the linear
combinations of inputs into a non-linear form. Another common choice for the
activation function is the hyperbolic tangent function. Note that the total
number of weights connecting the units in the input layer and the hidden layer
is M × (p + 1), where M is the number of units in the hidden layer. The value
of M is treated as a tuning parameter of the model.
The final output for each class k is formed as a linear combination of the
hidden units $Z = (Z_1, Z_2, \ldots, Z_M)'$, which is mapped into the interval
[0, 1] using the softmax function:

$$F_k = g(\beta_{0k} + \beta_k' Z) = \frac{e^{\beta_{0k} + \beta_k' Z}}{\sum_{l=1}^{K} e^{\beta_{0l} + \beta_l' Z}}, \quad k = 1, \ldots, K.$$

Similarly as in equation (3.10) $\beta_{0k}$ is the weight from the bias unit
and $\beta_k$ is an M × 1 vector of weights connecting the units in the hidden
layer to the output unit $F_k$.
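The forward pass through the network, that is equation (3.10) followed by the
softmax output, can be sketched as below. The sigmoid activation is assumed, the
function name `forward` and the weights in the example are hypothetical, and
subtracting the largest score before exponentiating is a numerical safeguard
that leaves the softmax output unchanged.

```python
import math

def forward(x, alpha0, alpha, beta0, beta):
    """Single hidden layer feed-forward pass with a softmax output.

    alpha0[m], alpha[m]: bias weight and input weights of hidden unit m
    beta0[k],  beta[k] : bias weight and hidden-layer weights of output k
    """
    hidden = [1.0 / (1.0 + math.exp(-(a0 + sum(a * xi for a, xi in zip(am, x)))))
              for a0, am in zip(alpha0, alpha)]
    scores = [b0 + sum(b * z for b, z in zip(bk, hidden))
              for b0, bk in zip(beta0, beta)]
    top = max(scores)                     # keeps exp() from overflowing
    expo = [math.exp(s - top) for s in scores]
    total = sum(expo)
    return [e / total for e in expo]

# p = 2 inputs, M = 2 hidden units, K = 3 classes; the weights are made up.
probs = forward([1.0, 2.0],
                alpha0=[0.1, -0.2], alpha=[[0.3, 0.4], [-0.5, 0.6]],
                beta0=[0.0, 0.1, -0.1],
                beta=[[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]])
```

The returned list holds one value $F_k$ per class and the values sum to one.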
The optimal weights in the network minimize the considered loss function.
In the multinomial classification problem the loss function to be minimized is
the sample counterpart of the multinomial deviance shown in equation (3.6).
By denoting the complete set of weights in the network by a weight vector
$\theta$ the loss function can be written as

$$L(\theta, F_k) = -\sum_{t=1}^{N} \sum_{k=1}^{K} I(R_t = k) \log F_k, \tag{3.11}$$

where $F_k$ is the output for class k. The set of weights in the network can
be searched using a gradient descent based method called backpropagation.
In backpropagation the gradient of the loss function in equation (3.11) is
constructed at each iteration. The weights in the network are then updated
according to the direction given by the negative gradient. For a more detailed
description of the backpropagation algorithm see e.g. Rojas (1996).
A simple regularization strategy called weight decay has been suggested
to avoid overfitting while estimating the optimal weights in the network. In
weight decay an additional penalty term, which penalizes large weights, is
added to the loss function presented in equation (3.11)

$$\tilde{L}(\theta, F_k) = L(\theta, F_k) + \lambda J(\theta),$$

where $\lambda$ is the weight decay parameter. The penalization function
$J(\theta)$ can take various forms. A common choice is to impose quadratic
penalization, where $J(\theta) = \theta'\theta$ (Bishop, 2006, p. 256). Larger
values for $\lambda$ thereby shrink the weights towards zero unless traditional
backpropagation reinforces the weights. The exact amount of penalization needed
is finetuned using cross-validation.
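As a small illustration of the penalized loss, the sketch below evaluates the
multinomial deviance of equation (3.11) and adds the quadratic weight decay
term. It assumes that classes are coded as zero-based indices into the vector of
network outputs; the function names and the numbers are hypothetical.

```python
import math

def multinomial_deviance(probs, labels):
    """Sum of -log F_k over observations, k being the observed class index."""
    return -sum(math.log(p[k]) for p, k in zip(probs, labels))

def penalized_loss(probs, labels, theta, lam):
    """Deviance plus the quadratic weight decay penalty lam * theta'theta."""
    return multinomial_deviance(probs, labels) + lam * sum(w * w for w in theta)

# Two observations, three classes; outputs and weights are made up.
probs = [[0.5, 0.25, 0.25], [0.2, 0.6, 0.2]]
labels = [0, 1]
loss = penalized_loss(probs, labels, theta=[1.0, -2.0], lam=0.8)
```

With these numbers the penalty equals 0.8 × (1 + 4) = 4, so a larger $\lambda$
makes the same weights more expensive.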
Support vector machines 
The support vector machines originally presented by Vapnik (1995) can be
used for both classification and regression problems. The basic idea and
the terminology of support vector machines can be illustrated using a
two-class classification problem with a linear decision boundary. The left-hand
side of Figure 3.4 illustrates the case with perfect separability. The right
panel in Figure 3.4 shows the nonseparable case, where some data points are
misclassified by the linear decision boundary.
Figure 3.4: Support vector machine
The solid blue line in Figure 3.4 is the decision boundary separating the
two classes

$$F(x) = \beta_0 + x'\beta = 0,$$

where $\beta_0$ is a constant term, $\beta$ is a unit vector and x is a p × 1
vector of input variables. For notational reasons let us focus on the binary
classification case and denote the binary response as $y_i \in \{-1, 1\}$,
$i = 1, \ldots, N$, where i can be associated with time t. The one-against-one
method used in this research for K-class classification with support vector
machines is a direct extension of the binary case as the final classification
is based on a voting scheme between the K(K − 1)/2 binary classifiers
constructed for each class pair. For more information on the multiclass
classification with support vector machines see e.g. Hsu and Lin (2002).
The goal with support vector machines is to find the decision boundary
with maximum area on both sides of the boundary, also known as the margin
$M = \frac{2}{\|\beta\|}$. In Figure 3.4 the dashed lines illustrate the margin
and the points located at the dashed lines are known as support vectors. Hastie
et al. (2009, p. 132) show that instead of maximizing the margin the
optimization problem can be written in terms of minimizing $\|\beta\|$

$$\min_{\beta_0, \beta} \|\beta\| \quad \text{s.t.} \quad y_i(\beta_0 + x_i'\beta) \geq 1, \quad i = 1, \ldots, N. \tag{3.12}$$
The constraint in equation (3.12) requires each observation to be on the right
side of the margin. This constraint does not hold for the nonseparable case
shown on the right panel of Figure 3.4. For this reason we need to define
a vector of slack variables $\xi = (\xi_1, \ldots, \xi_N)$ and the minimization
problem becomes

$$\min_{\beta_0, \beta} \|\beta\| \quad \text{s.t.} \quad \begin{cases} y_i(\beta_0 + x_i'\beta) \geq 1 - \xi_i, & i = 1, \ldots, N, \\ \xi_i \geq 0, \quad \sum_{i=1}^{N} \xi_i \leq \text{constant}. \end{cases} \tag{3.13}$$
The convex optimization problem with quadratic objective and linear
inequality constraints in equation (3.13) can be solved using quadratic
programming. The Lagrange primal function can be written as

$$L_P = \frac{1}{2}\|\beta\|^2 + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \left[y_i(\beta_0 + x_i'\beta) - (1 - \xi_i)\right] - \sum_{i=1}^{N} \nu_i \xi_i, \tag{3.14}$$

where C is now in place of the predetermined constant in equation (3.13). The
parameters $\alpha_i$ and $\nu_i$ are the Lagrange multipliers. The cost
parameter C is a tuning parameter of the procedure and controls how wide the
margin is. A larger value of C puts more emphasis on the points near the
decision boundary and requires a tighter margin.
To consider non-linear decision boundaries the original input feature space
is typically transformed into an enlarged space using e.g. polynomials or
splines since the data could be linearly separable in this higher dimensional
feature space. Without specifying the exact transformation the Lagrange dual
objective function can be written using these transformed feature vectors
$h(x_i)$

$$L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \langle h(x_i), h(x_j) \rangle, \tag{3.15}$$

where $\alpha_i$ are the Lagrange multipliers and
$\langle h(x_i), h(x_j) \rangle$ is the inner product of the transformed input
vectors i and j. The solution to the dual Lagrangian in equation (3.15) depends
on the transformed higher dimensional data only through inner products. Instead
of the exact transformation $h(\cdot)$ a kernel function, which computes inner
products in the transformed space, is sufficient. A radial basis function and a
dth-degree polynomial are typical choices for the kernel function. The radial
basis function can be written as

$$K(x, x_i) = \langle h(x), h(x_i) \rangle = \exp(-\gamma \|x - x_i\|^2), \tag{3.16}$$

where $\gamma$ is a tuning parameter related to the radial basis function
kernel. The dth-degree polynomial kernel function involves one extra tuning
parameter compared to the radial basis kernel presented in equation (3.16). The
degree of the polynomial d needs to be finetuned in addition to a scale
parameter s

$$K(x, x_i) = (1 + s \langle x, x_i \rangle)^d. \tag{3.17}$$
The final classification in the support vector machine is produced by the
following equation

$$\hat{G}(x) = \text{sign}\left[\hat{F}(x)\right] = \text{sign}\left[\sum_{i=1}^{N} \hat{\alpha}_i y_i K(x, x_i) + \hat{\beta}_0\right],$$

where $K(x, x_i)$ is one of the kernel functions presented in equations (3.16)
and (3.17). $\hat{\alpha}_i$ and $\hat{\beta}_0$ are the solved coefficients
from the optimization problem. The coefficients $\hat{\alpha}_i$ are non-zero
only for the data points marked as support vectors.
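The two kernel functions in equations (3.16) and (3.17) and the final
classification rule can be sketched as follows. The coefficients are taken as
given here, since solving the quadratic programming problem is outside the
scope of the sketch. The function names, the toy support vectors and the
convention of assigning a score of exactly zero to class 1 are assumptions.

```python
import math

def rbf_kernel(x, xi, gamma):
    """Radial basis function kernel exp(-gamma * ||x - xi||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, xi)))

def poly_kernel(x, xi, scale, degree):
    """Polynomial kernel (1 + scale * <x, xi>)^degree."""
    return (1.0 + scale * sum(a * b for a, b in zip(x, xi))) ** degree

def svm_classify(x, support_x, support_y, alphas, beta0, kernel):
    """Sign of the kernel expansion over the support vectors."""
    score = beta0 + sum(a * y * kernel(x, xi)
                        for a, y, xi in zip(alphas, support_y, support_x))
    return 1 if score >= 0 else -1     # assumption: a zero score maps to +1

# Toy example with two support vectors and made-up coefficients.
support_x = [[1.0, 1.0], [-1.0, -1.0]]
support_y = [1, -1]
alphas = [1.0, 1.0]
kernel = lambda u, v: rbf_kernel(u, v, gamma=0.5)
label = svm_classify([0.9, 1.1], support_x, support_y, alphas, 0.0, kernel)
```

Only the support vectors enter the sum, which mirrors the fact that the
remaining coefficients $\hat{\alpha}_i$ are zero.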
3.3 Data and model setup 
3.3.1 Data 
The dataset used in the empirical section of this paper covers daily returns
of the S&P 500 stock market index from the beginning of the 1990s to the
end of 2018.² The goal of the empirical analysis is to study a wide spectrum
of different variables that could be used for prediction using the maximum
amount of daily data available. With such a high frequency as daily returns the
potential predictor variables are mostly based on different types of financial
market data.
Technical analysis indicators are a common choice for the input variables
of different machine learning methods (see e.g., Kim, 2003; Basak et al., 2019).
In technical analysis different types of indicators are constructed using the
historical price or return information from the stock market. Benchmark yields
and different interest rate spreads from the corporate and government bond
markets have also been extensively studied with both daily and monthly
data (see e.g., Zhong and Enke, 2017; Nyberg and Pönkä, 2016). Interest
rates express the tightness of the monetary policy set by the Federal Reserve.
Different types of interest rate spreads reflect market expectations regarding
the upcoming economic activity for example.
2 After constructing the predictor variables the exact time period is 12.2.1990 - 5.10.2018.
Lagged stock returns and returns from other stock markets are another
commonly used alternative (see e.g., Zhong and Enke, 2017). A less studied
predictor group is the volatility in different markets. A recent study by Becker
and Leschinski (2018) shows that the VIX-index, which is often called the
fear factor of stock markets, can also be a viable alternative. As in Zhong
and Enke (2017) different exchange rates and commodity indices are also
considered as predictive features (predictors) in this study. The appreciation
(or depreciation) of the dollar relative to other currencies affects the foreign
trade and international flow of funds to the U.S. Variables related to the state
of the macroeconomy are found to be important predictors when predicting
monthly stock returns (Nyberg, 2011). Unfortunately the majority of the
macroeconomic information is not available with daily frequency.
Table 3.1 summarizes the input variables using seven different categories. 
A short description and an illustrative example are shown from each category. 
The full predictor set and the exact transformations for each predictor can be 
found in Appendix A. 
Table 3.1: Predictor groups
Group               Description                                                   Example
Stock market        S&P 500 price information, Returns from other stock markets   Lagged returns, Returns from DAX or FTSE
Interest rates      Government and corporate benchmark yields                     3-month T-bill, Term spread
Exchange rates      The appreciation of dollar relative to other currencies       Dollar/British Pound, major currencies index
Commodities         Information from the commodities market                       Copper, Oil, Gold, Silver
Volatility          Volatility in the stock and bond markets                      VIX-index, MOVE-index
Technical analysis  Indicators derived from price or return information           Relative strength index
Macro               Information regarding the macroeconomy                        ADS-index
The total number of different predictors studied in this research is 37.
Following the approach of Krauss et al. (2017) various lag lengths of the
predictors are also considered. Lag lengths beyond ten trading days are found
to be uninformative in the preliminary analysis using the model selection
capability of the gradient boosting machine.³ By considering the lagged
predictors from the previous ten trading days the full predictor set consists
of 370 different inputs (37 × 10). Only data points (days) with information
available for each predictor are considered. For this reason the use of
predictor variables utilizing market data outside the U.S. is kept minimal
because of the individual holiday periods in each country. After leaving out
the data points with missing values the final dataset includes 6686 daily
observations.
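The construction of the lagged predictor set and the removal of days with
missing values can be sketched as follows. The function name, the column naming
scheme and the use of `None` as the missing-value marker are assumptions made
for the illustration, and the toy series are made up.

```python
def build_lagged_inputs(series, n_lags):
    """Stack lags 1..n_lags of every predictor into one input row per day.

    series: dict mapping a predictor name to its list of daily values,
            with None marking a missing value.
    The row for day t holds the values from days t-1, ..., t-n_lags.
    """
    names = [f"{name}_lag{lag}"
             for name in series for lag in range(1, n_lags + 1)]
    n_days = len(next(iter(series.values())))
    rows = []
    for t in range(n_lags, n_days):
        row = [series[name][t - lag]
               for name in series for lag in range(1, n_lags + 1)]
        if all(value is not None for value in row):   # keep complete days only
            rows.append(row)
    return names, rows

# 37 toy predictors observed over 20 days give 37 x 10 = 370 input columns.
series = {f"x{i}": [float(day) for day in range(20)] for i in range(37)}
names, rows = build_lagged_inputs(series, n_lags=10)
```

A single missing value removes every day whose lag window contains it, which
illustrates why predictors with many missing days are costly to include.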
From the machine learning methods considered in this study only the
tree-based classifiers gradient boosting machine (GBM) and random forest
(RF) are capable of handling such a large predictor set. Both of these methods
are able to perform model selection simultaneously with estimation as new
split points are introduced for the tree-based base learners. Support vector
machines (SVM), neural networks (ANN) and k-nearest neighbor (k-NN) end
up easily overfitting the data with a large predictor set and therefore a reduced
dataset is needed. A smaller predictor set is built using a combination of prior
knowledge and the results from the tree-based methods.
3 Results available upon request.
Both GBM and random forest select the VIX-index as the single most
influential predictor.⁴ For the random forest model all of the ten most
influential inputs are different lag lengths of the VIX-index, whereas GBM
includes six lags of VIX in the top-10. Different technical analysis indicators are
also considered as important predictors by both models. The best performing
technical analysis indicators are the stochastic oscillator (StochK), moving
average convergence divergence (MACD) and Williams %R (Rperc).⁵ The
ranking of these indicators is slightly different for the two methods.
In addition to these the spread between the daily high and low stock prices
is selected by both models. GBM also ranks the corporate interest rate spread
among the ten most influential predictors. The results from the principal
component analysis by Zhong and Enke (2017) are in favor of using the lagged
stock returns and international stock returns. Lagged stock returns from
the S&P 500 and the returns from the German stock index are thereby also
included in the reduced dataset.
The reduced dataset includes six inputs (predictors) which are the VIX-index,
MACD-indicator, spread between daily high and low prices, corporate
interest rate spread, lagged stock returns and returns from the DAX-index.
The data from the previous three trading days are used for each predictor. The
total number of inputs in the reduced dataset is thereby 18 (6 × 3). Several
other choices for both the composition of predictors and lag lengths were also
considered.
4 More detailed model selection results can be found in Appendix B. 
5 See Appendix A for further details about the indicators. 
3.3.2 Tuning parameter optimization 
Each of the machine learning methods considered involves free parameters
that affect the final output of the model. Often the parameters can affect the
results quite dramatically as is the case with neural networks for example
(Zhang, Patuwo and Hu, 1998). These parameters are usually called tuning
parameters since it is up to the end-user to finetune the optimal parameters
for the particular learning task.
Table 3.2 summarizes the tuning parameters of the machine learning methods
studied in this research. A brief description and the notation used for each
tuning parameter in the methodological part of this paper are shown in Table
3.2. The last column illustrates the considered parameter values. It should be
noted that for each method a wider grid search has been conducted in order to
find a suitable range for each parameter. Only this smaller interval is depicted
in Table 3.2.
Table 3.2: Tuning parameters for each method
Method  Description                  Notation  Values
k-NN    Number of neighbors          k         1, 11, 21, ..., 461
GBM     Number of iterations         M         1, ..., 1000
        Tree depth                   D         1, 2, 3
        Fraction of training points  η_row     0.5, 0.7, 0.9
        Fraction of predictors       η_col     0.7, 0.9
RF      Number of trees              B         100, 300, 500
        Number of predictors         m         10, 50, 100, 200, 300
        Observations in each node    n_min     10, 50, 100, 200, ..., 600
ANN     Number of hidden units       M         3, 5, 8, 10, 12, 15, 20
        Weight decay                 λ         0.1, 0.2, 0.4, 0.6, 0.8, 1
SVM     Cost parameter               C         0.01, 0.1, 0.2, ..., 9
        Radial kernel parameter      γ         0.005, 0.01
        Polynomial kernel, scale     s         0.005, 0.01
        Polynomial kernel, degree    d         2, 3
Because the number of different machine learning methods considered
in this paper is quite large some simplifications have been made in order to
keep the parameter search feasible. Additional finetuning would be available
for several methods. There are for example alternative distance measures
for the k-NN method and different learning algorithms for the ANN model.
Following the approach of Friedman (2001) the learning rate in the gradient
boosting machine algorithm is set as small as possible. The learning rate is
held fixed at a value of 0.001. For computational reasons the parameter search
is restricted to the parameter values presented in Table 3.2.
The performance of each model specification is evaluated using the validation
accuracy produced by the L-fold cross-validation procedure. In L-fold
cross-validation the training sample is split into L independent folds. Each
of these L folds is used as a hold-out test set once, while the remaining L − 1
folds are used for estimating the model. This process is repeated for each
fold and the validation accuracy is the average accuracy produced by the L
independent folds

$$CV_{Acc} = \frac{1}{L} \sum_{l=1}^{L} Acc_l,$$

where L is the number of folds and $Acc_l$ is the obtained accuracy when the
data points of fold l are used as an independent test set.
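The validation accuracy can be computed with a few lines of code. The sketch
below is illustrative only: it assumes contiguous folds of nearly equal size,
which is one of several possible ways to form the folds, and the function name
as well as the `fit` and `predict` arguments are hypothetical placeholders for
any of the considered methods.

```python
def cross_validated_accuracy(x, y, n_folds, fit, predict):
    """Average hold-out accuracy over n_folds contiguous folds.

    fit(train_x, train_y) returns a model and predict(model, xi) a class.
    x and y are lists of equal length with at least n_folds observations.
    """
    n = len(x)
    bounds = [round(i * n / n_folds) for i in range(n_folds + 1)]
    accuracies = []
    for lo, hi in zip(bounds, bounds[1:]):
        train_x, train_y = x[:lo] + x[hi:], y[:lo] + y[hi:]
        model = fit(train_x, train_y)
        hits = sum(predict(model, xi) == yi
                   for xi, yi in zip(x[lo:hi], y[lo:hi]))
        accuracies.append(hits / (hi - lo))
    return sum(accuracies) / n_folds

# Toy usage with a classifier that always predicts the training majority class.
x = list(range(20))
y = [0] * 15 + [1] * 5
majority_fit = lambda train_x, train_y: max(set(train_y), key=train_y.count)
majority_predict = lambda model, xi: model
cv_acc = cross_validated_accuracy(x, y, 10, majority_fit, majority_predict)
```

In a tuning parameter search the same function would be called once for every
parameter combination and the combination with the highest value selected.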
10-fold cross-validation is used to estimate the optimal tuning parameters 
for each method. Table 3.3 shows the parameter optimization results and 
depicts the optimal tuning parameter combination for each method. 
Table 3.3: Final model specifications
Method  Parameters
k-NN    k = 331
GBM     M = 931, D = 2, η_row = 0.9, η_col = 0.7
RF      B = 500, m = 50, n_min = 300
ANN     M = 5, λ = 0.8
SVM     d = 3, s = 0.005, C = 6.9
Relatively restricted models are chosen in the cross-validation procedure
as can be seen in Table 3.3. This is not very surprising as the noisy stock
market data combined with a complex model can easily lead to overlearning
the training data. The limitations in allowed flexibility can be seen for each
method. For example, quite a large number of nearest neighbors is used
while classifying each data point and the tree-based methods GBM and RF
seem to favor quite shallow trees. A fairly small number of neurons is used in
the hidden layer of the ANN model combined with quite a heavy penalization
through the weight decay parameter. The polynomial kernel of degree 3 is
used for the support vector machine and the cost parameter is rather small,
which results in a wider margin and also supports the finding of restricted
models.
The small number of hidden neurons reported in Table 3.3 for the ANN is
of similar magnitude to the parameter value selected using trial-and-error in
Zhong and Enke (2017). The results from a tuning parameter optimization
procedure in Kara, Boyacioglu and Baykan (2011) also favor the use of a 
polynomial kernel over the radial basis function for the SVM. The polynomial 
kernel of degree three is the optimal choice when forecasting the direction of 
the Turkish stock market as well (Kara et al., 2011). Based on prior knowledge 
Krauss et al. (2017) end up with the same tree depth for the GBM model as 
reported in Table 3.3. 
3.4 Empirical results 
3.4.1 Statistical predictive performance 
The complete data sample covering the time period from 12.2.1990 to 5.10.2018 
is split into two parts. The training set contains data before the year 2007 and 
is used for training and validating the models. The remaining data form an 
independent test set, which is used to evaluate how well each method 
performs on completely unseen data. Roughly 58 percent of the complete 
dataset is therefore used for training and the remaining 42 percent for testing. 
The test set thus contains 2787 daily observations. 
Figure 3.5 visualizes the daily returns of the S&P 500 index. The horizontal 
dashed lines show the upper and lower quartiles of returns, which are used 
as the fixed thresholds in equation (3.3) to create the multinomial response 
variable. The vertical gray dashed line illustrates the split into training and 
testing datasets. 
Figure 3.5: Daily returns of the S&P 500 index (return in percent against time, 1990 to 2018, with the training set and test set periods marked) 
In order to keep the test set completely independent, the upper and lower 
quartiles of returns are calculated using only the training data.6 The benchmark 
accuracy for training and validation is one half, as the majority class 
contains fifty percent of the observations. In the test set there are slightly 
more observations from the majority class, and the strategy of always 
predicting the majority class yields an accuracy of 0.526. This should be used 
as the benchmark accuracy when evaluating the prediction results for the test 
set. 
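The split and benchmark described above can be illustrated with a minimal sketch. It is not the code used in the study: the function names, the linear-interpolation quantile rule, the strict inequalities and the class coding (1 = large negative, 2 = middle, 3 = large positive) are assumptions made for illustration, since equation (3.3) is not reproduced here.

```python
from datetime import date


def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list (assumed rule)."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def make_classes(returns, lower, upper):
    """Assumed coding: 1 = below lower threshold, 3 = above upper, 2 = otherwise."""
    return [1 if r < lower else 3 if r > upper else 2 for r in returns]


def split_and_label(dates, returns, split_date):
    """Chronological split; thresholds are estimated from the training data only."""
    train = [r for d, r in zip(dates, returns) if d < split_date]
    test = [r for d, r in zip(dates, returns) if d >= split_date]
    ordered = sorted(train)
    lower, upper = quantile(ordered, 0.25), quantile(ordered, 0.75)
    train_cls = make_classes(train, lower, upper)
    test_cls = make_classes(test, lower, upper)
    # Benchmark: always predict the class that is most frequent in training.
    majority = max(set(train_cls), key=train_cls.count)
    benchmark = test_cls.count(majority) / len(test_cls)
    return train_cls, test_cls, (lower, upper), benchmark
```

Because the thresholds come from the training period alone, the share of the middle class in the test set need not equal one half, which is why the test benchmark differs from the training benchmark.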
The prediction results for each machine learning method are presented in 
Table 3.4. The first column shows the considered method while the next three 
columns give the classification accuracies for the training, validation and test 
sets. 
6 The upper and lower quartiles for the training set are {−0.4989%, 0.5755%} whereas the 
quartiles for the entire dataset are {−0.4656%, 0.5723%}. 
Table 3.4: Prediction results 
Method Train Validation Test 
k-NN 0.5204 0.5194 0.5457 
GBM 0.5560 0.5294 0.5597 
RF 0.6194 0.5304 0.5536 
ANN 0.5247 0.5224 0.5558 
SVM 0.5463 0.5224 0.5504 
The results indicate significant return forecastability as the classification 
accuracies for the training and validation sets are well above the benchmark 
of one half for each method. In terms of the training and validation accuracies 
the results favor the tree-based methods gradient boosting 
and random forest, which can utilize the full predictor set. The relatively 
high training accuracy of the random forest model stands out from the rest. 
However, its validation accuracy is only slightly higher than that of GBM. ANN 
and SVM have a similar validation performance. The simplicity of the nearest 
neighbor algorithm has led to the lowest training and validation accuracies. 
The classification results for the test set indicate how well the models 
generalize to new data. The conclusion of return predictability seems to hold 
even when testing on an unseen dataset as the accuracies for each method 
are well above the test set benchmark performance. Overall, the level of 
predictability reported in Table 3.4 is in line with recent literature on sign 
predictability of daily returns (see e.g., Krauss et al., 2017; Zhong and Enke, 
2017). 
The ranking between the machine learning methods based on the classification 
accuracies for the test set is slightly different from the ranking obtained 
using the accuracies for the training set. While GBM still outperforms the 
other machine learning methods, random forest reaches only the third highest 
test accuracy as neural networks show better generalization ability. The fairly 
large gap between the training and testing results for the random forest 
raises questions of potential overfitting. A similar remark can 
be made when comparing the generalization capabilities of ANN and SVM. 
The SVM model has better training accuracy but results in slightly lower 
generalization performance. 
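The comparison above can be reproduced directly from the figures in Table 3.4. The short sketch below is illustrative only; the helper name `summarize` and the two derived columns are hypothetical conveniences, while the accuracies and the test benchmark of 0.526 are copied from the text.

```python
# Accuracies copied from Table 3.4: (train, validation, test).
ACCURACY = {
    "k-NN": (0.5204, 0.5194, 0.5457),
    "GBM": (0.5560, 0.5294, 0.5597),
    "RF": (0.6194, 0.5304, 0.5536),
    "ANN": (0.5247, 0.5224, 0.5558),
    "SVM": (0.5463, 0.5224, 0.5504),
}
TEST_BENCHMARK = 0.526  # always predicting the majority class in the test set


def summarize(accuracy, benchmark):
    """Rank methods by test accuracy and report two simple diagnostics."""
    rows = []
    for method, (train, _valid, test) in accuracy.items():
        rows.append({
            "method": method,
            "train_minus_test": round(train - test, 4),  # positive: worse on new data
            "test_excess": round(test - benchmark, 4),   # margin over the benchmark
        })
    return sorted(rows, key=lambda row: row["test_excess"], reverse=True)


for row in summarize(ACCURACY, TEST_BENCHMARK):
    print(row)
```

Under this simple diagnostic the random forest is the only method whose training accuracy exceeds its test accuracy, which is the pattern the text associates with potential overfitting.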
3.4.2 Economic predictive performance 
Leitch and Tanner (1991) argue that a model performing well from a statistical 
point of view does not necessarily imply economic profitability, especially 
when trading costs are taken into account. In order to evaluate the ability to 
gain economic profits, in this section a real-life trading simulation is conducted. 
Our trading simulation is similar to those in Pesaran and Timmermann (1995) 
and Leung et al. (2000) for example. The classification patterns are turned 
into a trading strategy, which depends on the current ($\hat{R}_{t+1}$) and previous 
forecasted class ($\hat{R}_t$), as can be seen in Table 3.5. 
Table 3.5: Trading strategy 
                        $\hat{R}_t = 1$    $\hat{R}_t = 2$    $\hat{R}_t = 3$ 
$\hat{R}_{t+1} = 1$     Stay out           Sell               Sell 
$\hat{R}_{t+1} = 2$     Buy                Hold               Hold 
$\hat{R}_{t+1} = 3$     Buy                Hold               Hold 
Table 3.5 shows how the multinomial response variable presented in equation 
(3.3) enables a richer set of possible trading strategies compared to the 
more commonly studied binary response case. Here the possibility to keep 
the position unchanged (Hold) is used to reduce excessive trading activity. In 
a traditional market timing setup as presented in Pesaran and Timmermann 
(1995) an asset allocation decision is made between investing in stocks or in 
bonds. With the daily trading frequency the returns from investing in the 
bond market are fairly low and we only consider the options of staying fully 
invested in stocks or not. 
The asset allocation decision between stocks and bonds could easily be 
incorporated into the trading strategy in Table 3.5. The complexity of the asset 
allocation strategy based on the multinomial response could be increased even 
further. Depending on the predicted class one could allocate 0, 100 or, say, 70 
percent of the wealth to stocks and the remainder to the bond market. These 
more complex strategies are left for further research. 
In the benchmark buy-and-hold strategy the entire initial wealth is invested 
in stocks at the beginning of the period and sold at the end of the period. To 
get a fair comparison with the buy-and-hold strategy, additional wealth cannot 
be invested in stocks. This limits the ability to benefit from large positive 
returns. In reality the investor would certainly like to increase the amount of 
wealth invested in stocks if large positive stock returns are expected. 
Instead of just staying out of the market it is also possible to profit 
from large negative price changes through shortselling. Allowing for 
shortselling as in Becker and Leschinski (2018) can be seen as problematic for 
several reasons. The risks involved in shortselling are large as the possible 
losses are basically unlimited. The potential restrictions placed on shortselling 
by the Securities and Exchange Commission (SEC) during high market turmoil 
can also be seen as problematic. One such event took place during the financial 
crisis as shortselling restrictions were imposed on financial companies (Becker 
and Leschinski, 2018). Although such restrictions are less of an issue when 
considering investments in a major stock index, the main analysis on economic 
profitability is conducted without shortselling. 
The trading cost is assumed to be a fixed percentage rate, which has been a 
common choice in the previous literature (see e.g., Pesaran and Timmermann, 
1995; Fiévet and Sornette, 2018). A trading cost of 0.1 percent is used in this 
study. This could be regarded as relatively high since an active individual 
investor can achieve much lower costs when trading the stocks of a private 
company instead of the index. Becker and Leschinski (2018) show that the 
average bid-ask spread on individual stocks in the U.S. during the period 
2004–2017 is around 0.05 percent. This is also the level of transaction costs 
used by Fiévet and Sornette (2018). Naturally the average bid-ask spreads are 
much lower for the more liquid exchange traded funds (ETF) following the 
S&P 500 (Hsu, Hsu and Kuan, 2010). Over the past year the bid-ask spread of 
the world's largest ETF has ranged between 0.003 and 0.005 percent.7 
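To make the mechanics of the simulation concrete, the following is a minimal sketch of a trading rule in the spirit of Table 3.5 with a proportional trading cost. It is not the study's implementation. Several choices are assumptions made for illustration: the investor is taken to be in the market whenever the forecasted class is 2 or 3, the simulation starts out of the market, returns are simple daily returns, the cost is deducted multiplicatively at each Buy or Sell, and the buy-and-hold benchmark pays the cost once when entering and once when exiting.

```python
def simulate_strategy(forecasts, returns, cost=0.001, initial_wealth=100.0):
    """forecasts[i] is the class (1, 2 or 3) predicted for the day whose
    realized simple return is returns[i] (0.01 means one percent)."""
    wealth = initial_wealth
    invested = False  # assumption: the simulation starts out of the market
    trades = 0
    path = []
    for forecast, r in zip(forecasts, returns):
        target = forecast != 1      # classes 2 and 3: be invested; class 1: be out
        if target != invested:      # a Buy or a Sell; Hold and Stay out cost nothing
            wealth *= 1.0 - cost
            invested = target
            trades += 1
        if invested:
            wealth *= 1.0 + r
        path.append(wealth)
    return path, trades


def buy_and_hold(returns, cost=0.001, initial_wealth=100.0):
    """Invest everything at the start and sell at the end (two trades assumed)."""
    wealth = initial_wealth * (1.0 - cost)
    for r in returns:
        wealth *= 1.0 + r
    return wealth * (1.0 - cost)


# Toy data, purely illustrative.
toy_forecasts = [2, 2, 1, 3]
toy_returns = [0.01, -0.02, -0.05, 0.03]
toy_path, toy_trades = simulate_strategy(toy_forecasts, toy_returns)
print(round(toy_path[-1], 4), toy_trades, round(buy_and_hold(toy_returns), 4))
```

In the toy data the rule sidesteps the single large negative return at the price of three trades, which is the trade-off between avoided losses and trading costs that the simulation is meant to quantify.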
The results from the trading simulation are presented in Table 3.6. The first 
column shows the considered machine learning method. The second column 
reports the final wealth level in dollars after following the trading strategy 
presented in Table 3.5. The initial wealth of 100 grows to a final wealth of 
195.86 under the benchmark buy-and-hold strategy. 
7 See https://www.etf.com/SPY for more information on the oldest ETF following the 
S&P 500. 
Table 3.6: Results from a trading simulation 
Method Trading 
k-NN 202.95 
GBM 351.54 
RF 225.15 
ANN 244.51 
SVM 210.22 
All the considered methods beat the buy-and-hold benchmark. The ranking 
between the machine learning methods remains the same as the one based on the 
test set accuracies. The worst performing nearest neighbor algorithm produces 
a final wealth that is only slightly higher than the benchmark. The 
wealth levels of SVM and random forest are also fairly modest compared to 
the benchmark. The best performing methods based on the test set accuracy 
outperform the benchmark quite substantially. The final wealth produced by 
the neural network model is 25 percent higher than the benchmark. The 
predictions of the gradient boosting machine yield a final wealth 
80 percent higher than the benchmark. 
If we allow shortselling to be used as a tool to profit from the correctly 
predicted negative returns, the final wealth gained using the predictions from 
a GBM model is 181 percent higher than the benchmark. In shortselling the 
assets sold short are borrowed from a broker and then returned when closing 
the short position. It should however be noted that because of these borrowing 
costs the actual trading costs involved in shortselling can be higher than the 
considered level of 0.1 percent. The results based on shortselling can therefore 
be overestimated. We take the approach of Becker and Leschinski (2018) and 
ignore these additional costs related to shortselling. 
In order to see how the wealth is accumulated throughout the test period, 
Figure 3.6 shows the wealth patterns for the benchmark buy-and-hold strategy 
and for each of the machine learning methods considered. Note how each 
method starts with the initial wealth of 100 dollars and ends up with the final 
wealth level reported in Table 3.6. 
Figure 3.6: Trading simulation results for each method (wealth against date, 2008 to 2018; series: Buy&Hold, GBM, RF, k-NN, SVM, ANN) 
There are certain periods of time when the wealth produced by the machine 
learning methods deviates from the benchmark. All the machine learning 
methods are able to avoid some of the heavy losses involved in the financial 
crisis. Some do so better than others, as k-NN can avoid the majority of the losses 
while SVM is quite close to the benchmark. Another such period is the debt 
crisis in the Euro zone, which took place between the years 2010 and 2012. In 
the year 2010 the wealth level of the GBM model starts to deviate from the rest 
of the field. A similar observation can be made at the end of the year 2011. 
To have a closer look at the daily returns during these events Figure 3.7 
illustrates the correctly predicted large positive and negative returns for the 
GBM model. 
Figure 3.7: Correctly predicted large absolute returns for the GBM model (daily return in percent against time, 2008 to 2018; legend: correct large positive, correct large negative, return threshold) 
Several interesting observations can be made from Figure 3.7. First of all, 
the predictability of large positive and negative stock returns seems to cluster 
together. The financial crisis and the European debt crisis are periods of higher 
predictability. This is in line with the results of Krauss et al. (2017) and Fiévet 
and Sornette (2018). These were also the periods where the wealth levels of 
the machine learning methods started to deviate from the benchmark, as was 
seen in Figure 3.6. Some predictability is also observed at the beginning of 
year 2016 when the low oil prices caused worries in the market. All of these 
events involve high volatility. 
Figure 3.7 also illustrates how the individual correct predictions for the 
large positive and negative returns can be of very different magnitude. There 
are correct predictions with a return level right above or below the fixed 
thresholds depicted using the black dashed lines in Figure 3.7. On the other hand 
there are positive and negative daily returns close to ten percent. Naturally, in 
terms of the profits obtained from the trading simulation, the further the 
correctly predicted observation is from the fixed threshold levels the better, 
and vice versa for the incorrectly classified observations. 
Regarding the classification accuracy, all the correctly predicted observations 
are considered equally important and thus receive the same weight. This 
could be a possible explanation why the final wealth levels reported in Table 
3.6 and further illustrated in Figure 3.6 deviate quite substantially between the 
different methods. The trading strategy and the return levels could be more 
closely incorporated into the actual model estimation process. The level of 
returns or the preferences regarding different correctly and incorrectly predicted 
classes could be taken into account using case weights, for example. These 
alternative approaches are left for further research. 
It is also interesting to see that despite the daily trading frequency the 
trading strategy presented in Table 3.5 could be considered relatively passive. 
Higher trading activity is observed only during short periods of time involving 
high volatility. Becker and Leschinski (2018) argue that the assumed fixed 
trading cost can overestimate the gained profits as the actual bid-ask spreads 
tend to rise during high market turmoil. The highest bid-ask spread observed 
by Becker and Leschinski (2018) is around 0.2 percent during a short period 
of time in the financial crisis. As a sanity check, the final wealth gained using 
the predictions from the GBM model and a trading cost of 0.2 percent 
throughout the whole period is still 40 percent higher than the final wealth 
with the benchmark strategy. The maximum cost level yielding the same final 
wealth as the buy-and-hold strategy is 0.33 percent. This is significantly 
higher than the 0.21 percent break-even cost reported by Fiévet and 
Sornette (2018). 
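The idea of a break-even cost can be written down in closed form under simplifying assumptions that are ours and not necessarily those of the simulation: a proportional cost $c$ is deducted multiplicatively at each of the $N$ position changes of the strategy, $W_T(0)$ denotes the final wealth of the strategy without costs, and the benchmark wealth $W_{BH}$ is treated as fixed. The symbols $N$, $W_T(\cdot)$, $W_{BH}$ and $c^{*}$ are introduced here for illustration only.

```latex
\begin{align*}
W_T(c) &= (1 - c)^{N} \, W_T(0), \\
W_T(c^{*}) = W_{BH}
  \quad &\Longrightarrow \quad
  c^{*} = 1 - \left( \frac{W_{BH}}{W_T(0)} \right)^{1/N}.
\end{align*}
```

The break-even level $c^{*}$ therefore rises with the cost-free outperformance $W_T(0) / W_{BH}$ and falls with the number of trades $N$. If the benchmark also pays the cost on its two trades, the exponent becomes $1/(N-2)$ for $N > 2$ and $W_{BH}$ is replaced by the cost-free benchmark wealth.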
3.5 Conclusions 
This paper introduces a new multinomial classification approach to predict 
daily stock returns of the S&P 500 stock market index. The multinomial 
approach puts more emphasis on predicting large absolute stock returns 
instead of the noisy variation around zero. The multinomial approach also 
provides a larger set of possible trading strategies compared to the more 
commonly used binary response variable. The classification ability of five 
different machine learning methods is compared both in terms of 
classification accuracy and in terms of the ability to generate economic profits in 
a real-life trading simulation. 
The empirical results show that the gradient boosting model is the top performer 
among the machine learning methods based on the classification 
accuracies for both the validation and test sets. The model selection capability 
of the gradient boosting model also provides important information about 
the useful predictor variables. The volatility in the stock market as measured 
by the VIX-index turns out to be the best single predictor. Several technical 
analysis indicators are also useful when predicting multinomial stock returns. 
The validity of the efficient market hypothesis (EMH) is typically tested 
in a real-life trading simulation. The ability to generate economic profits 
beyond the passive buy-and-hold strategy when transaction costs are taken 
into account is seen as a violation of the EMH. All the machine learning 
methods considered in this research are able to beat the benchmark 
buy-and-hold strategy after accounting for the transaction cost of 0.1 percent. The best 
performing gradient boosting model produces a final wealth 80 percent higher than 
the buy-and-hold strategy. The predictability is highest during the market 
turmoil of the financial crisis and the European debt crisis, which is in line 
with recent literature. 
The current research can be extended in several directions. In this study the two 
fixed return thresholds used to create the multinomial response variable were 
based on the upper and lower quartiles of the return series. This choice 
was made to create well balanced classes, which simplify the comparison 
between different machine learning methods. Several other choices are also 
possible and are left for further research. Setting the return thresholds further 
away from zero may result in increased statistical predictability, but the number 
of predictions for the large absolute returns decreases. The trading 
strategy would thereby become increasingly passive and hence there might be no 
additional economic value despite statistically superior predictions over the 
ones presented in this research. 
There are also various alternative trading strategies that could be used to 
assess the economic profits generated by different methods. One could for 
example benefit more from the predictions indicating large positive or negative 
returns by shortselling or by increasing the wealth invested. Alternatively, 
modern financial products such as bull and bear certificates could be used 
to exploit the correctly predicted large absolute returns. Furthermore, the 
linkage between forecastability based on statistical evaluation criteria and the 
economic profitability of the trading strategy should be more closely examined. 
The trading strategy could even be incorporated into the actual model estimation 
process. 
References 
Andrada-Félix, J., Fernández-Rodríguez, F., and Fuertes, A.-M. (2016). Combining 
nearest neighbor predictions and model-based predictions of realized 
variance: Does it pay? International Journal of Forecasting, 32(3):695–715. 
Ang, A. and Chen, J. (2002). Asymmetric correlations of equity portfolios. 
Journal of Financial Economics, 63(3):443–494. 
Arroyo, J. and Maté, C. (2009). Forecasting histogram time series with 
k-nearest neighbours methods. International Journal of Forecasting, 25(1):192–207. 
Basak, S., Kar, S., Saha, S., Khaidem, L., and Dey, S. R. (2019). Predicting 
the direction of stock market prices using tree-based classifiers. The North 
American Journal of Economics and Finance, 47:552–567. 
Becker, J. and Leschinski, C. (2018). Directional Predictability of Daily Stock 
Returns. Hannover Economic Papers (HEP) dp-624, Leibniz Universität 
Hannover, Wirtschaftswissenschaftliche Fakultät. 
Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information 
Science and Statistics). Springer-Verlag, Berlin, Heidelberg. 
Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32. 
Christoffersen, P. F. and Diebold, F. X. (2006). Financial asset returns, 
direction-of-change forecasting, and volatility dynamics. Management Science, 
52(8):1273–1287. 
Chung, J. and Hong, Y. (2007). Model-free evaluation of directional 
predictability in foreign exchange markets. Journal of Applied Econometrics, 
22(5):855–889. 
Cujean, J. and Hasler, M. (2017). Why does return predictability concentrate 
in bad times? The Journal of Finance, 72(6):2717–2758. 
Fama, E. F. (1970). Efficient capital markets: A review of theory and empirical 
work. The Journal of Finance, 25(2):383–417. 
Fiévet, L. and Sornette, D. (2018). Decision trees unearth return sign 
predictability in the S&P 500. Quantitative Finance, pages 1–18. 
Fix, E. and Hodges, J. (1951). Discriminatory Analysis: Nonparametric 
Discrimination: Consistency Properties. USAF School of Aviation Medicine, Randolph 
Field, TX. 
Freund, Y. and Schapire, R. E. (1996). Experiments with a new boosting 
algorithm. In Proceedings of the Thirteenth International Conference on 
Machine Learning, ICML'96, pages 148–156, San Francisco, CA, 
USA. Morgan Kaufmann Publishers Inc. 
Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: 
A statistical view of boosting. The Annals of Statistics, 28:337–407. 
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting 
machine. Annals of Statistics, 29(5):1189–1232. 
Granger, C. W. and Ding, Z. (1996). Varieties of long memory models. Journal 
of Econometrics, 73(1):61–77. 
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The elements of statistical 
learning: data mining, inference and prediction. Springer, 2 edition. 
Henkel, S., Martin, J. S., and Nardari, F. (2011). Time-varying short-horizon 
predictability. Journal of Financial Economics, 99(3):560–580. 
Hong, Y., Tu, J., and Zhou, G. (2007). Asymmetries in stock returns: Statistical 
tests and economic evaluation. The Review of Financial Studies, 20(5):1547–1581. 
Hsu, C.-W. and Lin, C.-J. (2002). A comparison of methods for multiclass 
support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425. 
Hsu, P.-H., Hsu, Y.-C., and Kuan, C.-M. (2010). Testing the predictive ability 
of technical analysis using a new stepwise test without data snooping bias. 
Journal of Empirical Finance, 17(3):471–484. 
Kara, Y., Boyacioglu, M. A., and Baykan, Ö. K. (2011). Predicting direction of 
stock price index movement using artificial neural networks and support 
vector machines: The sample of the Istanbul Stock Exchange. Expert Systems 
with Applications, 38(5):5311–5319. 
Karhunen, M. (2019). Algorithmic sign prediction and covariate selection 
across eleven international stock markets. Expert Systems with Applications, 
115:256–263. 
Kendall, M. G. (1953). The analysis of economic time series, part I: Prices. 
Journal of the Royal Statistical Society, 96. 
Kim, K.-j. (2003). Financial time series forecasting using support vector 
machines. Neurocomputing, 55(1):307–319. 
Krauss, C., Do, X. A., and Huck, N. (2017). Deep neural networks, 
gradient-boosted trees, random forests: Statistical arbitrage on the S&P 500. European 
Journal of Operational Research, 259(2):689–702. 
Leitch, G. and Tanner, J. E. (1991). Economic forecast evaluation: Profits 
versus the conventional error measures. The American Economic Review, 
81(3):580–590. 
Leung, M. T., Daouk, H., and Chen, A.-S. (2000). Forecasting stock indices: 
A comparison of classification and level estimation models. International 
Journal of Forecasting, 16(2):173–190. 
Linton, O. and Whang, Y.-J. (2007). The quantilogram: With an application to 
evaluating directional predictability. Journal of Econometrics, 141(1):250–282. 
Longin, F. and Solnik, B. (2001). Extreme correlation of international equity 
markets. The Journal of Finance, 56(2):649–676. 
Maheu, J. M. and McCurdy, T. H. (2004). News arrival, jump dynamics, and 
volatility components for individual stock returns. The Journal of Finance, 
59(2):755–793. 
Meade, N. (2002). A comparison of the accuracy of short term foreign exchange 
forecasting methods. International Journal of Forecasting, 18(1):67–83. 
Merton, R. (1981). On market timing and investment performance. I. An 
equilibrium theory of value for market forecasts. The Journal of Business, 
54:363–406. 
Neely, C. J., Rapach, D. E., Tu, J., and Zhou, G. (2014). Forecasting the 
equity risk premium: The role of technical indicators. Management Science, 
60(7):1772–1791. 
Nyberg, H. (2011). Forecasting the direction of the US stock market with 
dynamic binary probit models. International Journal of Forecasting, 27:561–578. 
Nyberg, H. and Pönkä, H. (2016). International sign predictability of stock 
returns: The role of the United States. Economic Modelling, 58(C):323–338. 
Pesaran, M. H. and Timmermann, A. (1995). Predictability of stock returns: 
Robustness and economic significance. The Journal of Finance, 50(4):1201–1228. 
Rojas, R. (1996). Neural Networks: A Systematic Introduction. Springer-Verlag 
New York, Inc., New York, NY, USA. 
Skabar, A. (2013). Direction-of-change financial time series forecasting using a 
similarity-based classification model. Journal of Forecasting, 32(5):409–422. 
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, 
Berlin, Heidelberg. 
Welch, I. and Goyal, A. (2008). A comprehensive look at the empirical 
performance of equity premium prediction. The Review of Financial Studies, 
21(4):1455–1508. 
Zhang, G., Patuwo, B. E., and Hu, M. Y. (1998). Forecasting with artificial 
neural networks: The state of the art. International Journal of Forecasting, 
14(1):35–62. 
Zhong, X. and Enke, D. (2017). Forecasting daily stock market return using 
dimensionality reduction. Expert Systems with Applications, 67:126–139. 
Appendix A: Full predictor set 
Table 3.7: Full predictor set 
Category            Predictor                  Transformation 
Stock market        Lagged return              $r_{sp500,t} = \log(P_{sp500,t} / P_{sp500,t-1})$ 
                    DAX                        $r_{dax,t} = \log(P_{dax,t} / P_{dax,t-1})$ 
                    FTSE                       $r_{ftse,t} = \log(P_{ftse,t} / P_{ftse,t-1})$ 
                    High-Low                   $hilow_t = P_{high,t} - P_{low,t}$ 
                    Trade volume               $tvol_t = \log(Vol_t / Vol_{t-1})$ 
                    Market capitalization      $mcap_t = \log(Cap_t / Cap_{t-1})$ 
                    Lagged response            - 
                    Squared return             $r_t^2 = r_{sp500,t}^2$ 
                    Skew                       $skew_t = r_{sp500,t}^3$ 
                    Kurtosis                   $kurt_t = r_{sp500,t}^4$ 
Interest rates      Fed funds rate             $ff_t = fedf_t - fedf_{t-1}$ 
                    3-mth Tbill                $3mth_t = 3m_t - 3m_{t-1}$ 
                    10-yr bond                 $10yr_t = 10y_t - 10y_{t-1}$ 
                    30-yr bond                 $30yr_t = 30y_t - 30y_{t-1}$ 
                    Moody's Aaa                $Aaa_t = yAaa_t - yAaa_{t-1}$ 
                    Moody's Baa                $Baa_t = yBaa_t - yBaa_{t-1}$ 
                    Term spread                $ts_t = 10y_t - 3m_t$ 
                    Long spread                $ls_t = 30y_t - 10y_t$ 
                    Moody's spread             $corp_t = yBaa_t - yAaa_t$ 
                    Corporate vs Government    $Aaa10y_t = yAaa_t - 10y_t$ 
                    TED-spread                 - 
Exchange rates      Major currencies index     $curr_t = \log(Major_t / Major_{t-1})$ 
                    USD / GBP                  $usdgbp_t = \log(\$to\pounds_t / \$to\pounds_{t-1})$ 
Commodities         Copper                     $copper_t = \log(P_{cop,t} / P_{cop,t-1})$ 
                    Oil                        $oil_t = \log(P_{oil,t} / P_{oil,t-1})$ 
                    Gold                       $gold_t = \log(P_{gold,t} / P_{gold,t-1})$ 
                    Silver                     $silver_t = \log(P_{silver,t} / P_{silver,t-1})$ 
Volatility          VIX-index                  - 
                    MOVE-index                 - 
Macro               ADS-index                  - 
Technical analysis  Moving average             $ma_t = n^{-1} \sum_{i=0}^{n-1} P_{sp500,t-i}$ 
                    Momentum                   $mom_t = P_{sp500,t} - P_{sp500,t-(n-1)}$ 
                    Stochastic %K              $k_t = \frac{P_t - L_n}{H_n - L_n} \times 100$ 
                    Stochastic %D              $d_t = n^{-1} \sum_{i=0}^{n-1} k_{t-i}$ 
                    Relative strength index    $rsi_t = \frac{up}{up + down} \times 100$ 
                    LW %R                      $rperc_t = \frac{H_n - P_t}{H_n - L_n} \times 100$ 
                    MACD                       see Kara et al. (2011) 
*For simplicity n is assumed to be 10 for all the indicators. H_n is the highest high in the previous n-1 
days whereas L_n is the lowest low. up is the sum of positive price changes and down the sum of negative 
price changes in the previous n-1 days. 
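The technical indicators in Table 3.7 can be sketched in a few lines of code. The functions below are an illustrative reading of the formulas, not the study's code. Two points are assumptions: the window for H_n and L_n is taken to be days t-(n-1) through t (n days including day t), and the relative strength index sums the n-1 price changes inside that window, with `down` measured as a positive magnitude. Flat windows (H_n equal to L_n, or no price changes) are not handled.

```python
def moving_average(close, t, n=10):
    """ma_t: average of the last n closing prices up to and including day t."""
    return sum(close[t - n + 1 : t + 1]) / n


def momentum(close, t, n=10):
    """mom_t: price change over the last n-1 days."""
    return close[t] - close[t - (n - 1)]


def _range(high, low, t, n):
    """Highest high H_n and lowest low L_n over the assumed window."""
    return max(high[t - n + 1 : t + 1]), min(low[t - n + 1 : t + 1])


def stochastic_k(close, high, low, t, n=10):
    """k_t: position of the close within the recent high-low range, in percent."""
    hn, ln = _range(high, low, t, n)
    return (close[t] - ln) / (hn - ln) * 100.0


def stochastic_d(close, high, low, t, n=10):
    """d_t: n-day moving average of %K (requires t >= 2n - 2)."""
    return sum(stochastic_k(close, high, low, t - i, n) for i in range(n)) / n


def williams_r(close, high, low, t, n=10):
    """rperc_t: distance of the close from the recent high, in percent."""
    hn, ln = _range(high, low, t, n)
    return (hn - close[t]) / (hn - ln) * 100.0


def rsi(close, t, n=10):
    """rsi_t: share of upward price movement among the last n-1 price changes."""
    changes = [close[i] - close[i - 1] for i in range(t - n + 2, t + 1)]
    up = sum(c for c in changes if c > 0)
    down = sum(-c for c in changes if c < 0)
    return up / (up + down) * 100.0
```

With the table's formulas, %K and LW %R computed over the same window always sum to 100, so the two indicators carry the same information with opposite sign.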
Appendix B: Model selection results of the tree-based methods 
Table 3.8: Top-15 most important predictors for RF and GBM 
      RF                                GBM 
      Variable    Relative influence    Variable    Relative influence 
 1    VIX_1       100.000               VIX_1       100.000 
 2    VIX_2        83.542               VIX_2        64.060 
 3    VIX_5        63.202               VIX_7        21.815 
 4    VIX_3        59.166               VIX_3        18.444 
 5    VIX_4        53.447               AAA10y_1     13.928 
 6    VIX_6        53.221               StochK_1     13.351 
 7    VIX_8        37.275               VIX_5        13.039 
 8    VIX_7        33.654               VIX_4        12.232 
 9    VIX_10       32.415               HiLow_4      11.204 
10    VIX_9        27.538               MACD_3        4.368 
11    StochK_1     26.352               StochK_2      4.152 
12    RPerc_1      23.745               RPerc_1       3.740 
13    HiLow_4      23.323               MCap_5        3.085 
14    HiLow_5      17.188               HiLow_10      2.635 
15    HiLow_6      12.935               SP500_5       2.217 
*For more information about the relative influence measure, see Breiman (2001) and Friedman (2001). 

Chapter 4 
Moving forward from predictive regressions: Boosting asset allocation decisions 
Abstract ∗ † ‡ 
We introduce a flexible utility-based empirical approach to directly determine 
asset allocation decisions between risky and risk-free assets. This is in contrast 
to the commonly used two-step approach where least squares optimal 
statistical equity premium predictions are first constructed to form portfolio 
weights before economic criteria are used to evaluate the resulting portfolio 
performance. Our single-step customized gradient boosting method is specifically 
designed to find optimal portfolio weights in a direct utility maximization. 
Empirical results of the monthly U.S. data show the superiority of boosted 
portfolio weights over several benchmarks, generating interpretable results 
and profitable asset allocation decisions. 
∗ This chapter is based on a manuscript written jointly with Henri Nyberg. 
† We thank Mika Hannula, Heikki Kauppi, Matthijs Lof, Mika Vaihekoski and seminar 
participants at the University of Turku for useful comments. We gratefully acknowledge the 
financial support from the Emil Aaltonen Foundation and the Academy of Finland (grant 
321968). 
‡ A paper based on this chapter is available in SSRN working paper series (id3623956). 
4.1 Introduction 
Asset allocation decisions have always been at the heart of finance and asset 
pricing research. The focus in the existing empirical and econometric 
implementations has been to predict stock returns to generate profitable portfolio 
decisions (see, e.g., the survey of Brandt, 2010). These have fundamentally 
been two-step approaches where linear predictive regressions in particular, 
optimized via ordinary least squares, are first used to predict stock returns and 
the resulting predictions are subsequently utilized to generate portfolio weights 
and portfolio returns evaluated with economic goodness-of-fit criteria. However, 
Leitch and Tanner (1991), Kandel and Stambaugh (1996), Xu (2004), 
Guidolin and Timmermann (2007) and Cenesizoglu and Timmermann (2012), 
among others, have shown from various perspectives that statistically optimal 
return predictions do not automatically imply economic gains and that even 
weak return predictability can have important economic value in terms of 
utility- and profit-based metrics. These views challenge the foundations of 
conventional two-step modelling, as investors are ultimately interested in 
the economic gains received on their investments, not in arguably convenient 
statistical criteria such as least squares. 
In this study, we consider a classic and simple asset allocation problem 
where an investor trades between the risky market return and the risk-free rate. Our 
contribution is to introduce a flexible single-step empirical approach built 
upon direct utility maximization in finding portfolio weights. That is, we 
set our objective function in accordance with theoretical utility maximization 
premises also in the empirical implementation, instead of statistical criteria such 
as least squares used in linear predictive regressions. We show that this 
direct portfolio weight determination can be combined 
with modern machine learning, specifically a customized gradient boosting 
algorithm, enabling various advantages over past approaches. Our results 
can help investors with different risk aversion preferences to determine which 
predictive variables and their combinations are useful in optimizing their asset 
allocation decisions. 
This study follows in the footsteps of the seminal works by Welch and Goyal 
(2008) and Campbell and Thompson (2008), but importantly it is not the 
next attempt in the already extensive literature modifying (linear) predictive 
regressions to find statistical (out-of-sample) predictability in stock returns. 
Instead, our single-step approach sets portfolio weights directly by maximizing 
the underlying empirical utility function. In addition to the fundamental 
difference in the underlying objective function, we incorporate typically 
predetermined lower and upper bounds of the portfolio weights as a part of 
the method, whereas the weights obtained with predictive regressions are 
typically truncated subsequently to lie between these bounds (see, e.g., 
Campbell and Thompson, 2008; Rapach and Zhou, 2013; Neely et al., 2014). This 
'post-truncation' step has been found successful in out-of-sample stock 
return forecasting, but it does not give much economic intuition about why certain 
state variables are important predictors for portfolio weight decisions. Our 
single-step approach circumvents these complications and produces easily 
interpretable results. 
Our contribution is strongly connected to a more general issue in (financial)
econometrics, namely what the appropriate objective (loss) function in
econometric inference is: see the general discussion and arguments for a much
more detailed examination in this respect in Elliott and Timmermann (2016, Chapter 2).
Moreover, as argued by Brandt (2010) in his literature review (see also Brandt,
1999; Aït-Sahalia and Brandt, 2001; Brandt and Santa-Clara, 2006; Brandt,
Santa-Clara and Valkanov, 2009), besides the obvious intuitive fact that asset
allocation is the ultimate object of interest, there are several benefits to
focusing on portfolio weights directly with the available predictive information.
Ultimately, in contrast to direct utility maximization, the usual two-step statistical
approach requires first learning the notoriously complicated data generating
process of stock returns, with various misspecification possibilities, before
obtaining the necessary input predictions to be plugged into (ad hoc) trading rules.
Evaluating the resulting asset allocation decisions with several economic
criteria does not remove the fact that the underlying inference is not directed at
optimizing economic performance. To address this challenge, we develop a
customized gradient boosting algorithm to empirically optimize asset allocation
decisions in a flexible and single-step manner.
In economics and finance, machine learning-based methods have so far
been (rightly) somewhat criticized for a lack of intuition and interpretability.
Due to its flexibility in allowing custom objective functions, among ‘textbook’
machine learning algorithms often bound to their specific objectives, gradient
boosting is an excellent workhorse for our purposes. The usual ‘statistical’
gradient boosting algorithm with regression trees has generally been one of
the most successful machine learning algorithms so far in different applications:
see Rossi and Timmermann (2015) and Rossi (2018) as recent examples
in finance of predicting stock returns.
Our customized version diverges from the usual gradient boosting by
utilizing the (negative) utility function as the objective function. We are
hence able to establish an innovative synthesis between financial economics
and machine learning practices instead of following the conventional least squares
(mean square error) and hypothesis testing-based inference. That is, we get
interpretable results on which predictive variables are truly important for
asset allocation decisions when optimizing portfolio performance. Intuitively,
customized gradient boosting iteratively learns from training sample
asset allocation mistakes, measured by the gradient of the objective function
(i.e. now the selected utility function), before finally resulting in superior
portfolio weight forecasts for the next period. This is intuitively in line with
investors’ attempts in practice to continuously update their trading strategy
before taking future positions.
In the empirical analysis, following past (out-of-sample) return predictability
studies for the sake of comparison, we use a general form of the utility function
that resembles attaching quadratic preferences to investors’ decision making.
This is the evaluation context typically considered in past studies with an
emphasis on out-of-sample forecasting performance, thus providing a natural
building block for our approach. With the large updated dataset of macroeconomic
and technical indicator predictive variables, originally compiled
by Welch and Goyal (2008) and Neely et al. (2014), we are able to examine
which predictors are truly important for asset allocation decisions rather than
concentrating on the usual statistical predictability of stock returns.
Our empirical results on monthly U.S. market returns and predictors
show that substantial and quantitatively meaningful economic value can be
obtained with our utility boosting method. This is the case even though
monthly stock returns are, from the statistical perspective, at most only
weakly predictable. Technical indicators yield as a group the largest benefits
in out-of-sample forecasting experiments. This is generally in line with the
conclusions of Neely et al. (2014), now confirmed with a very different
methodology. In the full sample estimation and model selection results, the
gains obtained with utility boosting are broad, also involving some
specific macroeconomic variables. Interestingly, theoretically well-motivated
inflation and partly also the dividend-price ratio stand out in both in-sample and
out-of-sample predictions, whereas the past empirical findings on their usefulness,
obtained with statistical predictive regressions, have been quite inconclusive.
The rest of the paper is organized as follows. In Section 4.2, we frame and
present the starting point of our contribution, determined by the related past
out-of-sample predictive regression studies, before setting up our utility boosting
approach. Empirical results are reported in Section 4.3, followed by a discussion
of the main general findings and final conclusions in Sections 4.4–4.5. Various
additional and robustness analyses are compiled in the attached Appendix A.
4.2 Methodology 
4.2.1 Starting point and two-step statistical approach 
Consider a classic and commonly examined simple asset allocation decision
problem for an investor with a single-period horizon aiming to optimally
allocate portfolio value between the risky asset return $r_{m,t}$ (market return) and
the risk-free asset return $r_{f,t}$.¹ For a given level of (initial) wealth, the classic
asset pricing perspective on the investor’s optimization problem is to find portfolio
weights maximizing the underlying utility function. Let $w_t$ denote the proportion of
the portfolio value allocated to the risky asset at time $t$, which is based on the
predictive information available at time $t-1$ and contained in the vector $x_{t-1}$.
The resulting portfolio return (at time $t$) is hence
$$ r_{p,t} = r_{f,t} + w_t r_{e,t}, \tag{4.1} $$
where $r_{e,t}$ is the return on the (broad) stock market index in excess of
the risk-free rate $r_{f,t}$ from period $t-1$ to $t$.
Our aim is to determine the (empirical) portfolio weights $w_t$ in (4.1) directly
using one or multiple state variables (predictors) contained in $x_{t-1}$ at time $t-1$
(i.e. $w_t \equiv w(x_{t-1})$). By focusing directly on the weights $w_t$, we aim to capture
predictable and possibly time-varying patterns in weights as opposed to just
relying on predictive regressions for expected returns. Despite this objective and
the fact that we are not explicitly interested in stock return predictions, we
connect and frame our approach to the context of (out-of-sample) return
predictability examination originating from the seminal contributions by Welch
and Goyal (2008) and Campbell and Thompson (2008).
¹ An extension to the multiple asset case requires a separate treatment to extend our methods
to the multiple-equation case (see details in the Discussion in Section 4.4). This is beyond the
scope of the present study and is left for future research.
Throughout this study, and as a necessary choice for setting up our approach
concretely, we consider a general utility function which resembles attaching
quadratic (mean-variance) preferences to investor decision making.
Following (4.1) and the formulation of Marquering and Verbeek (2004) and
Fleming, Kirby, and Ostdiek (2001), among others, we consider maximizing
the ex-ante (quadratic) expected utility of the form
$$ \max_{w_t} \left\{ E_{t-1}(r_{p,t}) - \frac{1}{2} \gamma \, \mathrm{Var}_{t-1}(r_{p,t}) \right\} = \max_{w_t} \left\{ E_{t-1}(r_{p,t}) - \frac{1}{2} \gamma \, w_t^2 \, \mathrm{Var}_{t-1}(r_{e,t}) \right\}, \tag{4.2} $$
where $\gamma > 0$ is the investor’s risk aversion coefficient, representing the degree
of risk aversion, and $E_{t-1}(\cdot)$ and $\mathrm{Var}_{t-1}(\cdot)$ denote the conditional expectation
and conditional variance, given the information set at time $t-1$. The aim is
to maximize (4.2) by determining $w_t$, using the information included in $x_{t-1}$,
resulting in portfolio returns $r_{p,t}$.
The utility scheme (4.2) is directly built upon the large majority of past
(out-of-sample) return predictability studies relevant to our attempt, and the
portfolio performance evaluation therein (see, e.g., Campbell and Thompson (2008),
Rapach et al. (2010), Neely et al. (2014), Rossi (2018), among others, including
the survey of Rapach and Zhou (2013)). They, like us, do not explicitly
claim that the quadratic (mean-variance) utility function is necessarily exactly the
correct utility configuration in their two-step approaches. Our goal is simply
to use an empirical counterpart of (4.2) as the selected underlying objective
function in a single-step approach, to be developed in Sections 4.2.2–4.2.3, to
determine asset allocation decisions.
Solving the maximization problem (4.2) leads to the solution of the optimal
weights
$$ w_t^{*} = \frac{E_{t-1}(r_{e,t})}{\gamma \, \mathrm{Var}_{t-1}(r_{e,t})} = \frac{E_{t-1}(r_{m,t}) - r_{f,t}}{\gamma \, \mathrm{Var}_{t-1}(r_{e,t})}, \tag{4.3} $$
which is here written in terms of the expected excess stock return $E_{t-1}(r_{e,t})$. If
the expected return on the risky asset increases (ceteris paribus), an investor
increases his/her weight on the risky asset, whereas increasing risk (conditional
variance) is negatively related to the optimal weights. Importantly, however,
the optimal theoretical solution (4.3) does not yet tell how to
obtain the weights empirically.
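For completeness, the step from (4.2) to (4.3) can be sketched as follows. The sketch assumes, as the second line of (4.2) implies, that the risk-free rate $r_{f,t}$ is known at time $t-1$, so that only $r_{e,t}$ is random; the shorthand $U(w_t)$ for the bracketed objective is introduced here for illustration only.

```latex
\begin{align*}
U(w_t) &= r_{f,t} + w_t \, E_{t-1}(r_{e,t})
          - \tfrac{1}{2} \gamma \, w_t^2 \, \mathrm{Var}_{t-1}(r_{e,t}), \\
\frac{\partial U}{\partial w_t}
       &= E_{t-1}(r_{e,t}) - \gamma \, w_t \, \mathrm{Var}_{t-1}(r_{e,t}) = 0
          \;\Longrightarrow\;
          w_t^{*} = \frac{E_{t-1}(r_{e,t})}{\gamma \, \mathrm{Var}_{t-1}(r_{e,t})}, \\
\frac{\partial^2 U}{\partial w_t^2}
       &= -\gamma \, \mathrm{Var}_{t-1}(r_{e,t}) < 0 .
\end{align*}
```

With $\gamma > 0$ and a positive conditional variance, the negative second derivative indicates that the stationary point is a maximum.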
As the weights $w_t$ are functions of the predictors $x_{t-1}$, our goal is to determine
empirical weights and asset allocation decisions by direct utility maximization
using the ‘training’ (estimation) data $\{x_{t-1}, r_{e,t}\}_{t=1}^{T}$, where $T$ denotes the
sample size. Our utility boosting approach in Sections 4.2.2–4.2.3 follows
the path opened by Brandt (1999), Aït-Sahalia and Brandt (2001) and Brandt
and Santa-Clara (2006), who also look beyond predicting conditional moments
of excess stock returns when setting portfolio weights. Our utility boosting,
however, emphasizes somewhat different final goals than their contributions
by integrating recent advances in machine learning for genuine prediction
(forecasting) purposes in asset allocation decisions.
We develop our empirical approach as a complement to the dominant
two-step empirical practice followed in past (in- and out-of-sample) return
predictability research. The latter is built upon a simple linear predictive regression
$$ r_{e,t} = z_{t-1}' \beta + \varepsilon_t, \quad t = 1, \ldots, T, \tag{4.4} $$
where $z_t = [1 \;\; x_t']'$, $E(z_{t-1} \varepsilon_t) = 0$ and $\beta$ is the vector of unknown
parameters, containing also a constant term. The parameters $\beta$ are estimated
by the method of ordinary least squares (OLS), where the underlying objective
function is statistical, minimizing the least squares criterion given the
estimation data $\{x_{t-1}, r_{e,t}\}_{t=1}^{T}$ (including the (known) initial value $x_0$):
$$ \hat{\beta}_{OLS} = \arg\min_{\beta} \sum_{t=1}^{T} \left( r_{e,t} - z_{t-1}' \beta \right)^2. \tag{4.5} $$
The resulting expected (predicted) excess returns
$$ \hat{r}_{e,t} = z_{t-1}' \hat{\beta}_{OLS} \tag{4.6} $$
are then used as the empirical proxy for $E_{t-1}(r_{e,t})$ needed in (4.3).
In a similar fashion, an empirical proxy for the conditional variance of
excess returns (hereafter denoted by $\sigma_t^2 \equiv \mathrm{Var}_{t-1}(r_{e,t})$) is required given the
information available at time $t-1$. To allow for well-documented volatility
clustering, we follow Campbell and Thompson (2008) and the subsequent
studies (see, e.g., Rapach et al., 2010; Neely et al., 2014) by using a five-year
rolling window variance of historical excess returns. This simplification can
be relaxed by using, for example, the GARCH model or another realized
volatility-based proxy (see Appendix A.4). Therefore, given the
(pre-determined) risk aversion coefficient $\gamma$ and the result (4.3), an investor allocates
the following share of the portfolio value to risky equity for the period $t$:
$$ \hat{w}_t = \frac{\hat{r}_{e,t}}{\gamma \hat{\sigma}_t^2} = \frac{z_{t-1}' \hat{\beta}_{OLS}}{\gamma \hat{\sigma}_t^2}. \tag{4.7} $$
Expression (4.7) summarizes how the portfolio weights can empirically be
obtained as a function of the predictors $x_{t-1}$ in two steps relying on the commonly
considered linear predictive regressions (4.4) (i.e. first constructing $\hat{r}_{e,t}$ (and
$\hat{\sigma}_t^2$) before plugging them into (4.7) in the second stage).² This approach has,
however, several critical issues and complications from the final and arguably
the most important portfolio performance perspective, which we aim to tackle in
this study:
(i) The resulting estimated weights $\hat{w}_t$ in (4.7) are not necessarily anywhere
near the typically pre-determined (assumed) bounds
$$ w_t \in [w^{\min}, w^{\max}]. \tag{4.8} $$
The common practice (see the above-mentioned return predictability studies)
has been to set the lower bound to zero ($w^{\min} = 0$), i.e. no short selling
is allowed, and to consider mainly two different selections for the upper
bound ($w^{\max}$). In Campbell and Thompson (2008), Neely et al. (2014) and
Zhu (2015), $w_t$ lies between 0 and 1.5 (i.e. $w^{\max} = 1.5$). They argue that these
bounds impose realistic portfolio constraints by precluding short sales and preventing
more than 50% leverage. In contrast, Aït-Sahalia and Brandt (2001),
Marquering and Verbeek (2004) and Rossi (2018), among others, set the upper
bound to $w^{\max} = 1$ (with $w^{\min} = 0$). This selection is also essentially the
same as $w^{\max} = 0.99$ in, e.g., Kandel and Stambaugh (1996) and Cenesizoglu
and Timmermann (2012). All in all, the empirical weights constructed as in
(4.7) by no means guarantee the bounds (4.8) without strict additional and
complicated restrictions on the predictive regression (4.4).
² Several (statistical) nonlinear predictive regressions and systems have also been
considered to predict stock returns with subsequent asset allocation decision objectives. See,
e.g., the regime switching models surveyed by Guidolin (2011), including Guidolin and
Timmermann (2007), who find that asset allocations guided by the forecasts of stock and bond
returns from the Markov switching models yield utility gains relative to constant expected
excess return predictions as described in equations (4.4)–(4.6).
(ii) The empirical findings of Campbell and Thompson (2008), Rapach et al.
(2010), Neely et al. (2014) and Pettenuzzo et al. (2014) emphasize the importance
of the restrictions (4.8) on the weights (4.7) for improving both statistical and
economic out-of-sample predictive performance. Imposing these subsequent
constraints (4.8) turns out to modify the initial weights (4.7) substantially. It
is important to realize that this post-truncation is not part of the predictive
regression (4.4) by any means, and hence all the usual statistical
interpretations, such as t-test statistics, goodness-of-fit measures and model
selection conclusions on useful predictors $x_t$, are lost for further interpretation.
In other words, even if a predictive variable is deemed statistically
useful (i.e. ‘statistically significant’), this does not necessarily mean that it
really has important predictive information for asset allocation decisions when
respecting the bounds (4.8) in the end. This logic also works the other way round,
so that statistically poor predictors can be useful in direct portfolio weight
determination.
(iii) Consider a commonly used one-predictor ($x_{i,t}$) special case of (4.4)
$$ r_{e,t} = \beta_{i,0} + x_{i,t-1} \beta_i + \varepsilon_{i,t}. \tag{4.9} $$
In the vast (in-sample) return predictability research, this is the common
specification where testing the null hypothesis $\beta_i = 0$ of no (conditional) mean
predictability in stock returns with a t-test statistic has been of particular
interest. This is important because if the null $\beta_i = 0$ is (correctly) rejected, the
resulting statistical predictability is widely interpreted to imply also useful
predictive power for systematic asset allocation decisions via equations (4.3)–(4.7).
This ‘significance testing’ setting is, however, commonly reported to
suffer from ‘Stambaugh bias’, i.e. substantial size distortions when $x_{i,t}$ is highly
persistent and correlated with return innovations $\varepsilon_{i,t}$: see Stambaugh (1999),
Rapach and Zhou (2013, Section 3.1) and the recent contribution of Demetrescu
et al. (2020) as representative references of the voluminous in-sample return
predictability results and techniques aiming to improve statistical inference in
model (4.9).
(iv) Finally, and importantly, as also argued by Brandt (2010) and Elliott
and Timmermann (2016, Section 4.2) in their surveys, the two-step predictive
regressions-based approach to setting the weights (4.7) belongs to the ‘plug-in’
methods: the statistical least squares criterion (4.5) in econometric modelling
does not (most likely) line up with investors’ true preferences and the final
objective, which lies in asset allocation decisions. Even though the two-step
approach is arguably a simple way to find the required predictions to be plugged
into (4.7), it generally leads to a discrepancy between the original goal (utility
maximization) and the objective function in econometric inference.³
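The two-step practice in (4.4)–(4.8) can be illustrated with a minimal Python sketch. It is not the implementation used in the study: it assumes the one-predictor case (4.9), a 60-observation window as a stand-in for the five-year rolling variance with monthly data, and made-up toy numbers. All function names are hypothetical.

```python
def ols_one_predictor(x_lag, r_e):
    """Closed-form OLS of r_e[t] on a constant and x_lag[t], as in (4.9)."""
    n = len(r_e)
    mean_x = sum(x_lag) / n
    mean_r = sum(r_e) / n
    s_xx = sum((x - mean_x) ** 2 for x in x_lag)
    s_xr = sum((x - mean_x) * (r - mean_r) for x, r in zip(x_lag, r_e))
    slope = s_xr / s_xx
    return mean_r - slope * mean_x, slope


def rolling_variance(r_e, window=60):
    """Sample variance of the last `window` excess returns (assumed: 60 months)."""
    recent = r_e[-window:]
    mean_r = sum(recent) / len(recent)
    return sum((r - mean_r) ** 2 for r in recent) / (len(recent) - 1)


def two_step_weight(x_lag, r_e, x_now, gamma, w_min=0.0, w_max=1.5, window=60):
    """Return the raw weight (4.7) and its post-truncated version within (4.8)."""
    intercept, slope = ols_one_predictor(x_lag, r_e)   # step 1: least squares (4.5)
    r_hat = intercept + slope * x_now                  # predicted excess return (4.6)
    sigma2_hat = rolling_variance(r_e, window)         # volatility proxy
    w_raw = r_hat / (gamma * sigma2_hat)               # step 2: plug-in weight (4.7)
    w_trunc = min(max(w_raw, w_min), w_max)            # post-truncation to (4.8)
    return w_raw, w_trunc


# Toy data, made up for illustration: a perfectly linear relation.
x_lag = [1.0, 2.0, 3.0, 4.0]
r_e = [0.01, 0.02, 0.03, 0.04]
w_raw, w_trunc = two_step_weight(x_lag, r_e, x_now=5.0, gamma=5.0)
print(w_raw, w_trunc)
```

In this toy case the raw plug-in weight is about 60, far above $w^{\max} = 1.5$, so the reported weight is determined entirely by the truncation step rather than by the regression, which is the concern raised in points (i) and (ii).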
4.2.2 Objective function 
To address all the difficulties (i)–(iv) reviewed in Section 4.2.1 and
connected to the two-step approach around (linear) predictive regressions, we
introduce a flexible non-parametric and nonlinear approach which is built strictly
upon utility maximization also in the empirical implementation to obtain the
weights $w_t$. We thus integrate asset allocation decision making and machine
learning via a customized gradient boosting algorithm with specific and
important modifications over the mechanical use of existing machine learning
algorithms. Section 4.2.3 presents details of the ‘utility boosting’ algorithm,
with various favourable properties.
Our utility boosting approach follows the general lines and arguments
made by Brandt (1999), Aït-Sahalia and Brandt (2001), Brandt and Santa-Clara
(2006) and Brandt et al. (2009) in the early literature, arguing for the importance of
direct portfolio decisions instead of the two-step ‘plug-in’ statistical approach.
In line with their studies, we are not explicitly relying on specific assumptions on
the excess stock return data generating mechanism, such as the arbitrage
pricing theory (APT) or the capital asset pricing model (CAPM). Instead, our
(empirical) view is that the portfolio weight $w_t$ is a direct, potentially highly
nonlinear, function of the state variables $x_{t-1}$ maximizing the investor’s utility.
This linkage incorporates all the predictive information contained in $x_{t-1}$ to
determine the weights, irrespective of the conclusions on statistical mean
return predictability as discussed in (ii) and (iii) in Section 4.2.1, including
also the possible impact of conditional moments higher than just the mean and
variance.
³ Sentana (2005) considers formal conditions and assumptions under which least
squares-based predictions and mean-variance analyses are connected in associated market timing
strategies.
Among these past utility-based approaches, the closest to ours seems to be
Brandt and Santa-Clara (2006), who parametrize the portfolio weight
as a linear function of the predictors (state variables) $x_{t-1}$ and solve for the
optimal values of the parameters maximizing an expected quadratic utility
function similar to (4.2). Their approach is, however, empirically much more
restrictive than ours (see details in Appendix B) and designed more closely
for portfolio choice problems with a genuine cross-sectional dimension as well
(i.e. multiple risky assets). Moreover, their resulting portfolio weights can be
interpreted as being proportional to the standard OLS regression of a vector of
ones on the excess returns and, importantly, additional subsequent constraints
to maintain the bounds (4.8) are required to address the point (i) in Section
4.2.1.⁴
⁴ In addition to the OLS interpretation of the portfolio weights as obtained in Brandt
and Santa-Clara (2006), Brandt (1999) and Brandt et al. (2009) build upon the (statistical)
method of moments and ‘maximum utility estimator’, respectively. In the end, they also
emphasize statistical hypothesis testing with the aim of obtaining ‘statistically significant’ results
in the evaluation stage, whereas our perspective is different and specifically on prediction
(forecasting) performance after direct utility-based modelling.
To set a tractable empirical counterpart of (4.2), we first convert the
maximization problem to a minimization problem. We train our boosting
algorithm with the same training (estimation) data $\{x_{t-1}, r_{e,t}\}_{t=1}^{T}$ (including
also the initial values for the volatility proxy) as in the least squares-optimal
statistical two-step approach in equations (4.4)–(4.7), where $r_{e,t}$ implicitly
contains the required information on both $r_{m,t}$ and $r_{f,t}$. Our utility-based empirical
objective function (cf. (4.2)) is
$$ \arg\min_{w_t} \left\{ \frac{1}{T} \sum_{t=1}^{T} (-u_t) \right\} = \arg\min_{w_t} \left\{ \frac{1}{T} \sum_{t=1}^{T} -\left( r_{p,t} - \frac{1}{2} \gamma \, w_t^2 \sigma_t^2 \right) \right\}, \tag{4.10} $$
where $r_{p,t}$ denotes the resulting portfolio returns (see (4.1)). The (negative)
utility contribution of the $t$th observation, $-u_t$, is crucially dependent on the
weights $w_t$ constructed with the information contained in $x_{t-1}$. The selected
volatility proxy $\sigma_t^2$ was already introduced in connection with (4.7): it is the same
in (4.10) and in the two-step approaches throughout this study for comparison
reasons (cf. a different formulation in this respect in Brandt and Santa-Clara
(2006)). The form (4.10) is the one examined in various return predictability
studies as an evaluation diagnostic tool measuring portfolio performance (see
Marquering and Verbeek, 2004; Campbell and Thompson, 2008; Rapach et al.,
2010; Rossi, 2018) and hence it acts as a natural choice for our approach.
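As an illustration, the empirical objective (4.10) can be evaluated for any candidate weight sequence with a few lines of Python. This is a minimal sketch with a hypothetical function name and made-up inputs; the risk-free rates are passed explicitly so that the portfolio return follows (4.1).

```python
def negative_mean_utility(weights, r_e, r_f, sigma2, gamma):
    """Average negative utility, i.e. the quantity minimized in (4.10)."""
    total = 0.0
    for w, re_t, rf_t, s2_t in zip(weights, r_e, r_f, sigma2):
        r_p = rf_t + w * re_t                       # portfolio return (4.1)
        total += r_p - 0.5 * gamma * w ** 2 * s2_t  # utility contribution u_t
    return -total / len(weights)


# Made-up example: two periods, zero risk-free rate, gamma = 2.
value = negative_mean_utility(
    weights=[0.0, 1.0],
    r_e=[0.02, 0.02],
    r_f=[0.0, 0.0],
    sigma2=[0.01, 0.01],
    gamma=2.0,
)
print(value)
```

A lower value corresponds to a higher average realized utility, so the same function can serve both as a training objective and as an evaluation diagnostic.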
For simplicity and in accordance with past closely related studies, throughout
this study we set the lower bound in (4.8) to $w^{\min} = 0$, hence excluding
short selling.⁵ Moreover, to respect the pre-determined maximum weight $w^{\max}$
explicitly as a part of our procedure (cf. the subsequent weight truncation
needed in (4.7)), we set
$$ w_t = w^{\max} \lambda_t, \tag{4.11} $$
where the portfolio weight is essentially specified by the logistic function
$$ \lambda_t = \frac{1}{1 + \exp\left( -\frac{1}{\gamma \sigma_t^2} F(x_{t-1}) \right)}. \tag{4.12} $$
Combined with (4.11), the logistic growth curve form (4.12) guarantees that the
weights (4.11) always lie inside the interval $w_t \in [0, w^{\max}]$. In (4.12), the
essential ingredient determining the portfolio weights (cf. (4.4)) is the component
$F(x_{t-1})$, which is possibly a complex function of the predictive information
$x_{t-1}$ that we aim to learn from our training data. To this end, in Section 4.2.3,
we specifically develop a customized version of gradient boosting, which
is in principle only one, but seemingly highly relevant, empirical algorithm among the
alternatives for determining $F(x_{t-1})$ when aiming to maximize the empirical
utility.
⁵ An extension allowing also for short selling (i.e. negative weights) is possible but means
a different and slightly more complicated parametrization than the one in (4.11)–(4.12).
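The mapping (4.11)–(4.12) from $F(x_{t-1})$ to a bounded weight can be sketched in Python as follows. The function names and the numerical values are hypothetical; the branch on the sign of the argument is an implementation detail added here only to avoid floating-point overflow for large arguments.

```python
import math


def logistic(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def boosted_weight(F, sigma2, gamma, w_max=1.5):
    """Portfolio weight w_t = w_max * lambda_t from (4.11)-(4.12)."""
    lam = logistic(F / (gamma * sigma2))
    return w_max * lam


# Illustrative values: gamma = 5 and a monthly variance proxy of 0.002.
for F in (-100.0, -0.01, 0.0, 0.01, 100.0):
    print(F, boosted_weight(F, sigma2=0.002, gamma=5.0))
```

Whatever value $F(x_{t-1})$ takes, the resulting weight stays between 0 and $w^{\max}$, so no subsequent truncation is needed. With $F = 0$ the weight equals $w^{\max}/2$.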
In specification (4.12), the impact of $F(x_{t-1})$ is adjusted by the risk proxy
$\sigma_t^2$ and the risk aversion coefficient $\gamma$ along the lines of expression (4.3).⁶ We can also
strengthen the linkage to (4.7) by the following ‘payoff to stake’ representation
$$ \log\left( \frac{w^{\max} \lambda_t}{w^{\max}(1 - \lambda_t)} \right) = \log\left( \frac{\lambda_t}{1 - \lambda_t} \right) = \frac{F(x_{t-1})}{\gamma \sigma_t^2}. $$
This is the log odds ratio of $\lambda_t$, where a higher $\lambda_t$ reflects a greater likelihood that a high
portfolio weight should be chosen. In addition to this representation, the
inclusion of the volatility proxy $\sigma_t^2$ in (4.12) is strongly motivated by the early
empirical evidence of Fleming et al. (2001) and Marquering and Verbeek (2004)
on the importance of an explicit volatility component in determining portfolio
weights.
In practice, the utility boosting algorithm, to be presented in more detail
in Section 4.2.3, provides the practical method to determine the weights. It
can intuitively be interpreted so that an investor, with given risk aversion
preferences and constraints (4.8), aims to continuously optimize and update his
or her asset allocation decision mechanism with the available past predictive
information, aiming to find the optimal portfolio weight for the next period. This
emphasizes the training step of our algorithm before predicting the weight for
the next period out of sample. This thinking again diverges somewhat from
the goals of Brandt and Santa-Clara (2006), and a few closely related studies,
which concentrate on full sample portfolio policies.
4.2.3 Customized gradient boosting 
To maximize the empirical utility (i.e. to minimize (4.10)), we determine the
optimal weights directly by using a customized gradient boosting algorithm,
acknowledging all the complications (i)–(iv) of the current two-step statistical
approach presented in Section 4.2.1. Specifically, this means specifying the
ingredient $F(x_{t-1})$ in (4.12), leading to the ‘boosted’ portfolio weights
maximizing the empirical utility as the objective function instead of aiming to
predict excess stock returns with predictive regressions.
⁶ The identity $E_{t-1}(r_{e,t}) = w_t^{*} \gamma \sigma_t^2$ in (4.3) shows the linkage between the estimated
weights and implied expected excess returns. Even though it is not our main interest, the
extracted weights (4.11), obtained with (4.12), can thus also be interpreted as implying a proxy
for expected excess stock returns.
Boosting is a powerful technique originating from the machine learning 
community. The general idea is that simple weak models, also known as ‘base 
learners’, are combined in a stagewise manner to form a ‘boosting ensemble’ 
with strong predictive performance, leading to superior portfolio weights in 
our context. Friedman, Hastie and Tibshirani (2000) introduced the statistical 
framework for boosting which enables additional theoretical insights into the 
success of boosting and has led to a variety of new boosting algorithms. 
Provided with a sufficient amount of data and flexible base learners, such as
regression trees or smoothing splines, boosting can approximate essentially any
functional form to determine the weights $w_t$ (cf. the linearity assumption
in (4.4) before the weight truncation (4.8)). Boosting also performs model
selection simultaneously with estimation, as each new optimal base learner
function is found by conducting an extensive search involving all the predictor
variables. This makes boosting a viable algorithm for (relatively) large predictor
sets, such as the one of interest in Section 4.3. Another major advantage is
the interpretability of the final model: unlike in the mechanical use of existing
machine learning algorithms, as we are now building the method explicitly
on financial economics and asset pricing foundations, the final outcome provides
important insights into the most relevant predictors (state variables) specifically
for portfolio weight determination.
Boosting has shown considerable success in different fields including
robotics, medical statistics and economics. Past financial applications range
from stock return predictions (Rossi and Timmermann, 2015; Rossi, 2018) and
volatility forecasting (Mittnik, Robinzonov and Spindler, 2015) to yield curve
modelling (Audrino and Trojani, 2007) and failures in the banking sector
(Carmona, Climent and Momparler, 2019). Our context is, however, different from
that allowed by the basic setup of gradient boosting. That is, here the boosting
algorithm contains the customized objective function, the (negative) utility
function (4.10), while the usual boosting algorithms are fully statistical, containing
mean square error (MSE) and likelihood-based ingredients. This shows
that instead of the mechanical use of gradient boosting, our goal is to specifically
integrate a flexible modern machine learning algorithm with the asset
allocation objective.
Following Bühlmann and Hothorn (2007) and the bulk of the boosting
estimators, our customized algorithm can be described as follows:
Input: The training data $\{(x_{t-1}, r_{e,t})\}_{t=1}^{T}$ is the same as in the linear predictive
regressions (4.4)–(4.7). The essential part is to determine $F(x_{t-1})$ in (4.12)
using a differentiable objective function (4.10), given the fixed risk aversion
coefficient $\gamma$, the selected volatility proxy $\sigma_t^2$ and the number of iterations $M$ (see
below).
Algorithm⁷:
1. Initialize the algorithm by a constant value: We select, for simplicity,
   $F_0(x_{t-1}) = 0$.
2. For $m = 1, \ldots, M$:
   (a) Compute ‘pseudo-residuals’:
       $$ \tilde{y}_{t,m} = -\left[ \frac{\partial(-u_t)}{\partial F(x_{t-1})} \right]_{F(x_{t-1}) = F_{m-1}(x_{t-1})} \quad \text{for } t = 1, \ldots, T. $$
       The gradient is obtained by the chain rule. Using the expression
       (4.12), given the selected volatility proxy $\sigma_t^2$ and noticing the negative
       sign in (4.10), we get:
       $$ \frac{\partial(-u_t)}{\partial F(x_{t-1})} = \frac{\partial(-u_t)}{\partial w_t} \frac{\partial w_t}{\partial \lambda_t} \frac{\partial \lambda_t}{\partial F(x_{t-1})} = -\left( r_{e,t} - \gamma \, w_t \sigma_t^2 \right) w^{\max} \frac{\lambda_t (1 - \lambda_t)}{\gamma \sigma_t^2}. \tag{4.13} $$
   (b) Fit a base learner $h_m(x_{t-1})$ to the pseudo-residuals, i.e. train it using
       the training set $\{(x_{t-1}, \tilde{y}_{t,m})\}_{t=1}^{T}$.
   (c) Update the prediction: $F_m(x_{t-1}) = F_{m-1}(x_{t-1}) + \upsilon h_m(x_{t-1})$.
3. Output $F_M(x_{t-1})$, leading to the solution of (4.10), given (4.11) and
   (4.12).
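The chain-rule gradient (4.13) can be checked numerically. The Python sketch below compares the closed-form expression with a central finite difference of the negative utility contribution $-u_t$ at one arbitrary point. The function names and all numerical values are made up for illustration.

```python
import math


def logistic(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def neg_utility(F, r_e, r_f, sigma2, gamma, w_max):
    """Negative utility contribution -u_t as a function of F, via (4.11)-(4.12)."""
    w = w_max * logistic(F / (gamma * sigma2))
    return -(r_f + w * r_e - 0.5 * gamma * w ** 2 * sigma2)


def analytic_gradient(F, r_e, sigma2, gamma, w_max):
    """Closed-form derivative of -u_t with respect to F, as in (4.13)."""
    lam = logistic(F / (gamma * sigma2))
    w = w_max * lam
    return -(r_e - gamma * w * sigma2) * w_max * lam * (1.0 - lam) / (gamma * sigma2)


def numerical_gradient(F, r_e, r_f, sigma2, gamma, w_max, h=1e-6):
    """Central finite-difference approximation of the same derivative."""
    up = neg_utility(F + h, r_e, r_f, sigma2, gamma, w_max)
    down = neg_utility(F - h, r_e, r_f, sigma2, gamma, w_max)
    return (up - down) / (2.0 * h)


# Arbitrary evaluation point (illustrative values only).
args = dict(r_e=0.02, sigma2=0.002, gamma=5.0, w_max=1.5)
g_exact = analytic_gradient(0.003, **args)
g_approx = numerical_gradient(0.003, r_f=0.001, **args)
print(g_exact, g_approx)
```

The two numbers should agree up to finite-difference error. The pseudo-residual in step 2(a) is the negative of this gradient, so it is positive whenever a larger $F(x_{t-1})$, and hence a larger weight, would have raised the utility of observation $t$.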
As a whole, the above algorithm can mathematically be understood to
minimize (4.10) by iterative steepest descent in function space. Bühlmann
and Hothorn (2007) call it the functional gradient descent algorithm. After
initializing the algorithm in step 1 with a simple scalar value, the
pseudo-residuals in step 2.(a) can be constructed. These are the negative gradients
of the objective function evaluated with the boosting ensemble learnt so far,
$F_{m-1}(x_{t-1})$. The base learner model that best fits the negative gradient is
then selected and added to the ensemble. The impact of each update is shrunk
towards zero using a step-length factor $\upsilon \in (0, 1]$, which we set to 0.001
in accordance with past boosting studies. The whole process (step 2) is repeated
500 times (i.e. $M = 500$). For the performance of the boosting estimator, the
base procedure and the stopping rule in step 2 are the most important choices.
These and other tuning parts of the estimator will be presented and discussed
in more detail in Appendix C.⁸
⁷ A slightly modified version will be considered in Appendix A.3 to obtain evidence
for the robustness of our main findings.
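Steps 1–3 can be put together in a compact Python sketch. It is only an illustration under stated assumptions: the base learner is a one-split regression stump fitted by least squares (the actual base procedure is described in Appendix C and may differ), the data are made up, and the names `fit_stump` and `utility_boost` are hypothetical. The step length 0.001 and $M = 500$ follow the text.

```python
import math


def logistic(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_stump(X, y):
    """Least-squares regression stump: best single split over all predictors."""
    n = len(y)
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = 0.5 * (lo + hi)
            left = [y[i] for i in range(n) if X[i][j] <= thr]
            right = [y[i] for i in range(n) if X[i][j] > thr]
            m_left = sum(left) / len(left)
            m_right = sum(right) / len(right)
            sse = (sum((v - m_left) ** 2 for v in left)
                   + sum((v - m_right) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, m_left, m_right)
    if best is None:                       # every predictor is constant
        mean_y = sum(y) / n
        return lambda row: mean_y
    _, j, thr, m_left, m_right = best
    return lambda row: m_left if row[j] <= thr else m_right


def utility_boost(X, r_e, sigma2, gamma, w_max=1.5, n_iter=500, step=0.001):
    """Functional gradient descent on the negative utility (4.10)."""
    n = len(r_e)
    F = [0.0] * n                          # step 1: F_0 = 0
    learners = []
    for _ in range(n_iter):                # step 2
        resid = []
        for t in range(n):                 # (a) pseudo-residuals from (4.13)
            scale = gamma * sigma2[t]
            lam = logistic(F[t] / scale)
            w = w_max * lam
            grad = -(r_e[t] - gamma * w * sigma2[t]) * w_max * lam * (1.0 - lam) / scale
            resid.append(-grad)
        h = fit_stump(X, resid)            # (b) fit the base learner
        learners.append(h)
        F = [F[t] + step * h(X[t]) for t in range(n)]   # (c) shrunken update

    def weight(row, s2):
        """Step 3: weight implied by F_M via (4.11)-(4.12)."""
        F_row = step * sum(h(row) for h in learners)
        return w_max * logistic(F_row / (gamma * s2))

    return weight


# Made-up toy sample: a high predictor value precedes positive excess returns.
X = [[0.0], [0.0], [1.0], [1.0]]
r_e = [-0.03, -0.03, 0.05, 0.05]
sigma2 = [0.002] * 4
weight = utility_boost(X, r_e, sigma2, gamma=5.0)
print(weight([0.0], 0.002), weight([1.0], 0.002))
```

In this toy sample the learnt weight is pushed towards 0 after a low predictor value and towards $w^{\max}$ after a high one, while always remaining inside the bounds. Counting how often each predictor is chosen for the split is one simple way to read off variable importance from such an ensemble.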
With the above gradient boosting, we can, in theory, reach the ultimate
equilibrium with excessive boosting iterations $M$ where the algorithm is
overfitted to the training (in-sample) data. However, it should be emphasized
that especially in out-of-sample predictions, we only use the data that we
have at hand.⁹ We control the boosting iterations and the underlying
speed of learning in the estimation (learning) stage to terminate the iterative
fitting (steps 1–3) at the right time, to get portfolio weights based on their
genuinely predictable patterns of interest rather than ending up with overfitted
weights. This means that the resulting weights and asset allocation decisions
are obtained directly with the aim of maximizing (4.2) while at the same time
circumventing overfitting of the training data. These selections also enable
meaningful full sample analyses, as generally examined in empirical finance
research so far, of the importance of different predictors (state variables) in
portfolio weight determination.
8 Following various utility configurations (see, e.g., Aït-Sahalia and Brandt (2001)),
also examined within the numerical ‘full-scale optimization’ search algorithms (see Adler
and Kritzman, 2007), our method can also, in principle, be extended to various other
(differentiable) utility functions than the one in (4.2) and (4.10). The quadratic form (4.2) is
expected to provide at least a good approximation for most variations of power utility (see
also, e.g., Campbell and Thompson, 2008, footnote 11) and is, in particular, in line with the
past return predictability studies to which we mainly link our approach and empirical
analysis in Section 4.3.
9 To clarify our notation, as our estimation (training) data contains the observations
$\{(x_{t-1}, r_{e,t})\}_{t=1}^{T}$, in a forecasting situation portfolio weight forecasts are made for the time
T + 1, i.e. $w_{T+1}$, using the information up to time T. This explains the difference between ‘forecasting’
and more general ‘prediction’ goals including full sample (in-sample) results.
4.3 Empirical results 
In the empirical analysis below, we compare our utility boosting method to the
existing return predictability studies, with an emphasis on out-of-sample forecasting,
using the same dataset (Section 4.3.1) and evaluation criteria (Section 4.3.2) as
in Neely et al. (2014), Welch and Goyal (2008), Campbell and Thompson (2008),
Rapach et al. (2010) and Rapach and Zhou (2013). In Sections 4.3.3–4.3.4, we
report the full sample (in-sample) estimation results, with a special interest
in comparing the best predictors for direct portfolio weight determination.
Section 4.3.5 reports the out-of-sample asset allocation results. In Sections
4.3.3–4.3.5, we concentrate on reporting the results, whereas in Section 4.4
we summarize and discuss the obtained empirical results for more general
conclusions. Various additional results and robustness checks for our main
findings are compiled in the attached Appendix A.
4.3.1 Dataset 
Following the closely related return predictability studies, we consider an 
updated monthly dataset compiled by Welch and Goyal (2008) and extended 
by Neely et al. (2014), containing various predictive variables to determine 
portfolio weights directly in our single-step and indirectly in the conventional 
statistical two-step plug-in methods. As in Neely et al. (2014), our sample 
period starts from December 1950 and is now updated until December 2018.10 
The excess stock return (i.e. the equity risk premium) is obtained as the difference
between the return on the S&P 500 index (including dividends) and the
risk-free interest rate. The set of macroeconomic predictive variables contains 
14 variables as originally in Welch and Goyal (2008). Detailed descriptions of 
all predictors are presented in Table 4.1. Table 4.2 reports the summary statis-
tics for monthly excess stock returns (Panel A) and macroeconomic variables 
(Panel B). The average monthly equity risk premium is 0.60% with monthly 
standard deviation of 4.14%. This produces a monthly Sharpe ratio of 0.144. 
10 The Welch-Goyal dataset, containing macroeconomic variables and the S&P 500 index
(returns), is from Amit Goyal’s website at http://www.hec.unil.ch/agoyal. Technical indicators
are obtained with the S&P 500 data. For a general and extensive summary of the past
return prediction findings obtained with different predictors and existing methodological
approaches (see Section 4.2.1), see the survey of Rapach and Zhou (2013).
As Panel B shows, most of the macroeconomic variables are rather persistent
(cf. challenge (iii) in Section 4.2.1) including, as expected, the valuation
ratios, nominal interest rates and partly also interest rate spreads.
Together with the macroeconomic variables, Table 4.1 also lists 14 binary-
valued technical indicators as predictors. These are based on the same three 
representative trend-following technical indicators as in Neely et al. (2014). 
As they summarize it (see also, e.g., Rapach and Zhou, 2013), instead of 
macroeconomic fundamentals, technical indicators rely on past price and 
volume patterns with the idea that they will identify future price trends. The 
theoretical models explaining the predictive power of technical indicators are 
based on how investors can be heterogeneous regarding the availability of 
new information, their response to this information and how they view the 
overall investor sentiment (see Neely et al., 2014, and the references therein). 
The first indicators are moving average (MA) buy and sell signal rules
MA(s, l) = I(MAs,t ≥ MAl,t), (4.14) 
where I(·) is an indicator function and 
\[
MA_{j,t} = \frac{1}{j}\sum_{i=0}^{j-1} P_{t-i}, \quad j = \{s, l\}, \quad s = \{1, 2, 3\}, \quad l = \{9, 12\},
\]
where Pt is the level of the S&P 500 index. These MA rules reflect the short-
and long-run trends in stock price movements. Moreover, the second set of 
technical indicators are based on the momentum signals 
MOM(m) = I(Pt ≥ Pt−m), m = {9, 12}, (4.15) 
where a positive (negative) momentum signal means that the current value
of the index is higher (lower) than m periods ago. Finally, the third set
of technical indicators also incorporates volume data to identify market
trends. Define
\[
OBV_t = \sum_{k=1}^{t} VOL_k D_k, \quad D_k = 2\,I(P_k - P_{k-1} \ge 0) - 1,
\]
where VOLk is the trading volume during the period k. The trading signal is 
then 
\[
VOL(s, l) = I\left(MA^{OBV}_{s,t} \ge MA^{OBV}_{l,t}\right), \tag{4.16}
\]
where the volume data of the S&P 500 index is obtained from Yahoo Finance
and
\[
MA^{OBV}_{j,t} = \frac{1}{j}\sum_{i=0}^{j-1} OBV_{t-i}, \quad j = \{s, l\}, \quad s = \{1, 2, 3\}, \quad l = \{9, 12\}.
\]
Intuitively, relatively high recent trading volume combined with recent
price increases, say, indicates a strong positive market trend. As with macroeconomic
variables, the predictive information of the technical indicators (4.14)–(4.16)
at time t − 1 is used in xt−1 and in the weight determination for
period t. Due to the binary nature of (4.14)–(4.16), we do not report
descriptive statistics of the technical indicators in Table 4.2.
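As an illustration, the three families of signals in (4.14)–(4.16) can be computed as follows. This is a minimal sketch under stated assumptions: prices and volumes are plain lists indexed from 0, the on-balance volume is started at zero for the first observation (which has no previous price), and the function names are hypothetical.

```python
def moving_average(series, j, t):
    """MA_{j,t}: average of the j most recent values ending at index t."""
    return sum(series[t - i] for i in range(j)) / j


def ma_signal(prices, s, l, t):
    """MA(s, l) rule (4.14): 1 if the short MA is at or above the long MA."""
    return int(moving_average(prices, s, t) >= moving_average(prices, l, t))


def mom_signal(prices, m, t):
    """Momentum rule (4.15): 1 if the price is at or above its level m periods ago."""
    return int(prices[t] >= prices[t - m])


def on_balance_volume(prices, volumes):
    """OBV_t: cumulative volume signed by the direction of the price change."""
    obv = [0.0]  # assumption: no signed volume for the first observation
    for k in range(1, len(prices)):
        d = 1 if prices[k] - prices[k - 1] >= 0 else -1
        obv.append(obv[-1] + volumes[k] * d)
    return obv


def vol_signal(prices, volumes, s, l, t):
    """VOL(s, l) rule (4.16): MA rule applied to the on-balance volume."""
    obv = on_balance_volume(prices, volumes)
    return int(moving_average(obv, s, t) >= moving_average(obv, l, t))
```

With s in {1, 2, 3} and l in {9, 12} for the MA and VOL rules and m in {9, 12} for momentum, these functions generate the 14 binary indicators listed in Table 4.1.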
4.3.2 Evaluation and benchmarks 
Before reporting our in- and out-of-sample prediction results in Sections 4.3.3– 
4.3.5, we introduce our main evaluation criteria and benchmark approaches. 
In past research, economic evaluation criteria, such as the resulting utilities
and portfolio returns after fitting the linear predictive regression (4.4) as in
(4.5)–(4.6), have typically been treated as secondary to the statistical ones
and return predictability considerations. As briefly reviewed in Section 4.2.1,
model (4.4), and (4.9) as a typically considered special case, has potentially
difficult econometric issues related to the evaluation of the statistical significance
of persistent predictors in xt. Importantly from our point of view, we
stress that the usefulness of a certain model, with one or more predictors,
is not determined by the conventional statistical significance of individual
coefficients. Instead, what matters is directly its usefulness in asset allocation
decisions.11
The first economic and profit-based evaluation criterion is naturally the
resulting average utility (cf. the objective function (4.10), now percentages per
11 As in footnote 9, it is important to point out that below the notation t = 1, . . . , T
refers to the in-sample (training) performance. However, the same criteria will be used to 
evaluate out-of-sample portfolio performance with necessary changes in notation. In other 
words, portfolio weight forecasts for the period T + 1 are constructed using the information 
available at time T . 
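month)
\[
\bar{u}(\hat{w}) = 100 \times \frac{1}{T}\sum_{t=1}^{T}\hat{u}_t = \frac{100}{T}\sum_{t=1}^{T}\left\{\hat{r}_{p,t} - \frac{1}{2}\gamma \hat{w}_t^2 \hat{\sigma}_t^2\right\}, \tag{4.17}
\]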
where $\hat{w} = (\hat{w}_1, \ldots, \hat{w}_T)$ is the vector of estimated weights, $\hat{u}_t$ is the utility
contribution of the tth observation and $\hat{r}_{p,t}$ is the resulting portfolio return (i.e.
expression (4.1) evaluated with $\hat{w}$). To enable explicit comparison between
different methods, throughout this study we utilize the same conditional
variance estimate $\hat{\sigma}_t^2$ (i.e. the 60-month rolling window volatility) constructed
using the information at time t − 1. Robustness checks with the GARCH model
and realized variance (RVOL, see Table 4.1) based volatility proxies, reported
in Appendix A.4, lead to largely the same main empirical conclusions. The
risk aversion coefficient γ is fixed to the typical and commonly used value
γ = 5 (see, e.g., Neely et al., 2014; Cenesizoglu and Timmermann, 2012; Brandt
and Santa-Clara, 2006) or γ = 3 (see, e.g., Campbell and Thompson, 2008;
Rapach et al., 2010; Zhu, 2015).12
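The average utility (4.17) is straightforward to compute once the weights and the volatility proxy are available. The sketch below is illustrative only: it assumes that the portfolio return in (4.1) takes the usual form of the risk-free rate plus the weight times the excess return, and the function and argument names are hypothetical.

```python
def average_utility(weights, excess_returns, sigma2, gamma=5.0, risk_free=None):
    """Average realized utility as in (4.17), in percentages per month."""
    T = len(weights)
    rf = risk_free if risk_free is not None else [0.0] * T
    total = 0.0
    for w, r_e, s2, r_f in zip(weights, excess_returns, sigma2, rf):
        r_p = r_f + w * r_e  # assumed form of the portfolio return (4.1)
        total += r_p - 0.5 * gamma * w ** 2 * s2
    return 100.0 * total / T
```

For example, with constant weights of 0.5, excess returns of 2% and −1%, a variance proxy of 0.002 and γ = 5, the two utility contributions are 0.00875 and −0.00625, giving an average utility of 0.125% per month.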
An important and commonly used benchmark to our utility boosting 
method, and also for the predictive regressions with the subsequent restric-
tions (4.8), is that predictors xt−1 do not have useful predictive power to 
determine portfolio weights. For the linear predictive regressions, this implies 
that the expected excess return (4.6) is constant over time and (4.4) contains 
only the constant term. In the two-step approach, this leads to the use of the
historical average (‘HA’) return, $\bar{r}_{e,t}$, and weights
\[
\bar{w}_t = \frac{\bar{r}_{e,t}}{\gamma \hat{\sigma}_t^2}. \tag{4.18}
\]
This is the simple benchmark commonly employed in the past (out-of-sample) 
return predictability literature and subsequent portfolio performance evalua-
tion. As argued by Welch and Goyal (2008), and various subsequent studies, it 
12 Selections γ = 4 and γ = 6 have also commonly been used (see, e.g., Marquering and
Verbeek, 2004; Rossi, 2018) and hence γ = 5 turns out to be a suitable compromise between them.
Moreover, following the reasoning of Rossi (2018), moving from γ = 3 to γ = 5 provides
a guard and a reasonable check against the impact of estimation uncertainty by increasing
the value of γ. Additional arguments favouring the importance of incorporating estimation
uncertainty in asset allocation decisions can be found, e.g., in Kandel and Stambaugh (1996),
Barberis (2000) and Kan and Zhou (2007).
is a highly adequate statistical description for excess stock returns. Notice that
despite the constant expected return, the weights (4.18) are time-varying and
governed by the estimated volatility proxy $\hat{\sigma}_t^2$. Building upon the thinking
of (4.18), we consider a restricted utility boosting approach where xt−1 contains
only the constant (i.e. xt−1 = 1) for all t. This is the ‘Const’ approach (cf.
(4.12)) hereafter, where F (xt−1) in the customized gradient boosting algorithm
does not include any predictive information beyond the conditional volatility $\sigma_t^2$,
which is in line with the idea of volatility timing (see the general arguments
favouring such an approach in Fleming et al. (2001)).
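The historical average benchmark (4.18) can be sketched as below. The details are assumptions made for illustration: the historical mean uses all excess returns observed before period t, the variance proxy is the sample variance over the preceding 60 months, and the resulting weight is truncated to the bounds (4.8); the function names are hypothetical.

```python
import statistics


def rolling_variance(returns, t, window=60):
    """Sample variance of the `window` returns before index t (information up to t - 1)."""
    return statistics.variance(returns[t - window:t])


def ha_weight(returns, t, gamma=5.0, window=60, w_min=0.0, w_max=1.0):
    """Historical average weight (4.18) for period t, truncated to [w_min, w_max]."""
    r_bar = statistics.mean(returns[:t])  # historical average excess return
    w = r_bar / (gamma * rolling_variance(returns, t, window))
    return min(max(w, w_min), w_max)
```

Even though the expected return estimate moves slowly, the weight changes from month to month through the rolling variance in the denominator, which is the volatility timing element mentioned above.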
A closely related evaluation criterion to the average realized utility (4.17) is 
the certainty equivalent return gain over the historical average (HA) weights 
(4.18): 
CER gain = 1200 × (CER − CERHA), (4.19) 
where 
\[
\mathrm{CER} = \bar{r}_{p,t} - \frac{\gamma}{2}\,\mathrm{Var}(\hat{r}_{p,t}), \quad \bar{r}_{p,t} = \frac{1}{T}\sum_{t=1}^{T}\hat{r}_{p,t}, \tag{4.20}
\]
and CERHA is obtained by replacing $\hat{w}_t$ with $\bar{w}_t$ in the construction of $\hat{r}_{p,t}$. The
difference to (4.17) is that in (4.20) the volatility estimate is the resulting
portfolio variance. This change turns out not to alter our main empirical
results, providing important robustness for our findings. The CER gain
(4.19), like differences in the estimated average utilities (4.17), can be
interpreted as the additional economic value of the utility boosting
method over the historical average benchmark (4.18). We report the CER gain
(4.19) multiplied by 1200 so that it can be interpreted as the annual percentage
portfolio management fee that an investor would be willing to pay to have
access to the utility boosting-based asset allocations instead of (4.18).
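A short sketch of (4.19)–(4.20) follows. It assumes the sample variance (divisor T − 1) for Var, which the text does not specify, and uses hypothetical function names.

```python
import statistics


def cer(portfolio_returns, gamma=5.0):
    """Certainty equivalent return (4.20): mean return minus gamma/2 times its variance."""
    return (statistics.mean(portfolio_returns)
            - 0.5 * gamma * statistics.variance(portfolio_returns))


def cer_gain(model_returns, ha_returns, gamma=5.0):
    """Annualized CER gain in percent (4.19) over the historical average benchmark."""
    return 1200.0 * (cer(model_returns, gamma) - cer(ha_returns, gamma))
```

The factor 1200 converts a monthly return difference in decimals into an annual figure in percent (12 months times 100).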
In addition to the above criteria, we also consider the traditional (monthly)
Sharpe ratio
\[
\text{Sharpe} = \frac{\frac{1}{T}\sum_{t=1}^{T}\hat{r}_{ep,t}}{\sqrt{\mathrm{Var}(\hat{r}_{ep,t})}}, \tag{4.21}
\]
where $\hat{r}_{ep,t} = \hat{r}_{p,t} - r_{f,t}$ denotes the resulting portfolio return in excess of
the risk-free rate. The Sharpe ratio (4.21) is hence the mean portfolio return 
in excess of the risk-free rate divided by the standard deviation of excess 
portfolio returns. As a reward-to-variability ratio, the Sharpe ratio measures 
the additional amount of excess return that an investor receives per unit 
of increase in risk. Asset allocation decisions with a high Sharpe ratio are 
preferable to those with a low Sharpe ratio. 
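For completeness, the monthly Sharpe ratio (4.21) can be computed as in the following sketch, which again assumes the sample standard deviation and a hypothetical function name.

```python
import statistics


def sharpe_ratio(portfolio_returns, risk_free):
    """Monthly Sharpe ratio (4.21) of portfolio returns in excess of the risk-free rate."""
    excess = [r_p - r_f for r_p, r_f in zip(portfolio_returns, risk_free)]
    return statistics.mean(excess) / statistics.stdev(excess)
```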
4.3.3 In-sample (full sample) results 
Following the common practice in empirical finance and specifically in stock
return predictability studies, before examining out-of-sample forecasting
results in Section 4.3.5, we consider the full sample results for the sample period
1951:1–2018:12 (816 observations). The main interest in this and Section 4.3.4 
is to examine differences between the utility boosting method and various 
benchmarks to explore which predictive variables are the most useful ones for 
portfolio weight determination when the maximum amount of information 
(the longest data availability) is employed in the analysis. 
As explained in Section 4.2.3, it should be emphasized that our utility
boosting approach contains shrinkage-type elements and hence provides a
guard against potential overfitting concerns, even for in-sample analysis. This
means, together with the fact that the utility boosting and the two-stage
predictive regression-based weighting are not generally nested approaches
(i.e. neither is obtained as a special case of the other), that utilities
resulting from the former are not automatically higher than those from the latter,
in or out of sample. This should happen especially if the predictive power of a
certain variable xi,t for portfolio weight determination is indeed non-existent
or negligible. Moreover, in line with the arguments of Inoue and Kilian
(2005), it should be kept in mind that there are also potential complications in 
out-of-sample forecasting evaluation, such as the selection of the evaluation 
period and other potentially random (outlier-type) events, which may obscure 
forecasting results as well. 
Table 4.3 presents the in-sample (full sample) results for the single-predictor
models, and the benchmarks described in Section 4.3.2, for the investor with
the risk aversion coefficient of γ = 5. We report the average utilities (4.17) and
Sharpe ratios (4.21) for the single-step utility boosting and the conventional
two-step (statistical, ‘linear’) approaches. The resulting portfolio weights fulfil
the bounds (4.8), which are $w^{\min} = 0$ and $w^{\max} = 1$ in this table. For illustrative
reasons, we present both the average utilities (4.17) (util%) and CER gains
(4.19) for the boosting method, leading to essentially the same conclusions 
here and also in other analysis settings. Moreover, even though the statistical
significance of the individual predictors is indeed secondary in this study, to
establish comparisons to the past linear predictive regression studies, we
also report the typical heteroskedasticity-autocorrelation consistent (HAC)
t-statistics13 and adjusted-R2s (‘adj-R2’) to measure the degree of statistical
return predictability. As pointed out in Section 4.2.1, these statistics provide
only partial evidence when the final objective is in asset allocation decisions
and the economic value of predictions.
Starting with the conventional statistical criteria, given the inherently
substantial unpredictable component in monthly stock returns, Campbell
and Thompson (2008) and Neely et al. (2014) concluded that a monthly
adjusted-R2 near 0.5% might represent an economically significant degree of
equity risk premium predictability. Together with using, e.g., a t-value close
to 2 in absolute value as an indicative threshold for statistically significant
predictors (at the 5% significance level), Table 4.3 shows as a whole that
technical indicators seem to express useful but small statistical predictive
power. There are also a few macroeconomic predictors, mainly TBL, LTR
and RVOL, with statistically significant predictive content. When moving
to realized utilities (util% and CER gain), we can see that utility boosting
yields consistently higher average utility levels than the two-step ‘linear’
approach and benchmarks (Const and HA).14 It is not surprising that in the full
sample results our flexible non-parametric approach can find higher economic
predictive power in the best (mostly real-valued) macroeconomic variables
with potentially larger amount of information than the binary-valued technical
13 Throughout this study, we systematically report the HAC t-statistics by means of the
Newey-West (1987) estimator with lags determined by the rule (integer) floor(4 × (T/100)^(2/9)).
In other words, we do not take any specific standpoint on the statistical significance of the
estimated βi coefficients in (4.9), or the appropriateness of the HAC standard errors for all
the predictors, due to the potential Stambaugh bias as discussed in point (iii) in Section
4.2.1.
14 There is one exception, MA(2,12), with a very marginal difference. As pointed out
above, this is possible due to the elements controlling overfitting of the boosting algorithm.
Similarly, for DE and DFY the utility boosting does not find substantial additional value in
this setting (γ = 5 and $w^{\max} = 1$). These cases importantly show that the flexible utility
boosting does not automatically outperform the linear two-step approach in sample and hence
some overfitting concerns can already be eased.
indicators. This fact already shows, together with the analysis of Neely et al. 
(2014), that it is reasonable to treat these two categories of predictors partly 
separately at least in this section. 
The best macroeconomic predictors in terms of average utility for the investor
(when γ = 5 and $w^{\max} = 1$) are DY, LTR and RVOL. In contrast to the other
best predictors in terms of statistical criteria, DFR (default return spread) and
inflation (INFL) have, in a relative sense, much more useful information content
for the direct portfolio weight determination than obtained with the statistical
predictive regressions and statistical goodness-of-fit measures. Moving to the
results of the technical indicators, the shortest MA rules (MA(1,9)–MA(3,9)),
together with VOL(3,12), have the highest predictive power. These are largely
the best ones also in terms of statistical criteria, even though some minor
differences occur.
In Table 4.4, we provide the first of many alternative specifications to the
one considered in Table 4.3. The risk aversion parameter is now lower (γ = 3)
but still a common selection in the past related studies (see Section 4.3.2),
implying a less risk-averse investor profile. The main results are, however,
largely the same as in Table 4.3 where γ = 5. The biggest differences are the
rise of TMS and TBL in terms of received utilities, whereas LTR specifically
performs well in predictive regressions but does not stand out as a strong
predictor in the utility boosting relative to the other best predictors. As in Table
4.3, DFR and INFL are again examples of the best performing state variables
in which the utility boosting finds much more useful predictive content than
the past two-step statistical approach. As the level of risk aversion, and
hence the impact of conditional variance in (4.17), decreases due to the lower
value of γ, the realized utility levels are somewhat higher throughout than in
Table 4.3.
One of the main empirical results of this study is already evident in Tables
4.3–4.4: In terms of the Sharpe ratio (4.21), the utility boosting method
strongly outperforms the conventional two-step statistical approach. That is,
the received risk-adjusted portfolio returns are substantially and throughout
higher in the boosting method. The exact additional value varies between the 
predictors, but the Sharpe ratios of the boosted weights and resulting portfolio 
returns are mostly about 10–40% higher than in the two-step approach, but still 
within the realistic values in sample. This superior risk-return compromise 
is coming largely from the smaller portfolio return variance: The mitigated 
volatility in the utility boosting applies also to the average utility and portfolio 
weights over the sample period (t = 1, . . . , T ).15 
Another interesting predictor-specific empirical finding, together with
the rise of inflation (INFL) and default return spread (DFR), is related to the
dividend-price ratio (DP) and the dividend yield (DY). This is the variable
which has been examined especially often in the past return predictability research
due to its tight linkage to asset pricing theory and the relationship with
the present value model. It is also probably the most commonly considered
predictive variable in connection to the potential Stambaugh bias (point (iii) in
Section 4.2.1), with ambiguous empirical conclusions. Goyal and Welch (2003,
2008) specifically argue that despite their wide-ranging attempts, they could not find
robust statistical predictive ability in the dividend-price ratio (dividend yield).
The utility boosting approach clearly supports the usefulness of DY and DP
much more when the objective is in the asset allocation decisions instead of
the statistical performance.
As an illustrative example, in Figure 4.1 we depict the estimated portfolio 
weights wbt in the utility boosting and the linear predictive regression-based 
two-step approaches using the dividend-price ratio (DP) as a predictor. There 
are several periods with clearly different weights between the methods. The 
deviation in the estimated weights is especially large during the 1960s and 
1970s and from the year 2000 onwards. The impact of weight truncation to
fulfil bounds (4.8) is also evident in the two-step approach, while the utility
boosting keeps the weights automatically inside the selected interval. Weight
truncation in the predictive regression-based approach is especially needed
for the upper limit ($w^{\max} = 1$) and can be seen as multiple flat sections in
Figure 4.1.
Similarly as in various return predictability studies, instead of single-predictor
analyses, next we utilize multiple predictive variables in xt−1 simultaneously.
As in Neely et al. (2014), we first consider the macro variables
(MACRO) and technical indicators (TECH) separately and then finally jointly
(ALL, i.e. combining MACRO and TECH) to determine asset allocation de-
15 The excess portfolio returns are also higher in the utility boosting approach and the best
performing models are basically exactly the same as obtained with the presented criteria. Moreover,
the returns are in line with the conventional levels (between about 5% and 8% in annualized
terms).
cisions. One of the advantages of the utility boosting method is that no 
pre-selection is required as the algorithm performs model selection internally. 
In contrast, following Neely et al. (2014), among others, in the two-step 
predictive regression approach we first extract the principal components of
the candidate set of predictors (MACRO, TECH and ALL). Notice that the 
exact multicollinearity between some of the macro variables (see Table 4.1) 
implies that the direct use of OLS, as in (4.5), is not even possible without 
these additional steps. 
Table 4.5 reports the results of multivariate predictor models in otherwise 
the same setting as above in predictor-specifc analyses in Tables 4.3–4.4. In the 
Panels A and B, the risk aversion coefficients are γ = 5 and γ = 3, respectively.
In the linear predictive regressions, we utilize the same principal component 
approach as in Neely et al. (2014).16 This comparison strengthens the superior 
performance of the utility boosting: All the combinations outperform the 
benchmark cases reported in Tables 4.3–4.4 as well as the two-step ‘linear’ 
alternatives. 
Neely et al. (2014) conclude that the macroeconomic variables and technical
indicators provide almost completely complementary predictor sets to
predict the equity risk premium. When moving to direct asset allocation decisions,
we can confirm that this seems to be the case when evaluating linear
predictive regressions with economic goodness-of-fit measures, while in the
utility boosting macroeconomic variables dominate technical indicators in
sample. That is, in Table 4.5, the case ALL is driven by fluctuations in the
macroeconomic variables with only a minor role for technical indicators. This is
not surprising at all: As discussed above, this is largely due to the nature of
macroeconomic variables almost necessarily containing more information in 
sample than binary-valued technical indicators. Tree-based models are often 
reported to favor continuous-type predictors (see e.g., Loh and Shih, 1997). It 
is hence also important to consider out-of-sample forecasting results before 
further conclusions. 
In accordance with the above views, Table 4.6 presents the top-10 pre-
dictors chosen by the internal model selection capability of the customized 
16 This means that the maximum number of principal components is set to 3 for predictor
group MACRO, 1 for TECH and 4 for ALL, and the final selection is made with the adjusted-R2
(adj-R2).
gradient boosting. The relative influence criterion gives the normalized empirical
improvement as a result of including a particular predictor in the final
model. The most contributing macroeconomic and technical analysis predictors
are in line with the univariate results in Tables 4.3–4.4. As in Table
4.5, it is noteworthy how the final model is essentially based on different
macroeconomic predictors.
4.3.4 In-sample extensions 
As described in Section 4.2, our utility boosting method builds upon empirical
utility maximization with the explicit linkage to the investor’s preferences
implemented via the sample objective function (4.10) and bounds (4.8). This
leads to an important general point already validated in Section 4.3.3: Utility
boosting is more relevant than the past two-step approach when the final
interest is in asset allocation decisions. Empirically, this of course indicates
that the exact numbers, as already seen in Section 4.3.3 over the selections γ = 5
and γ = 3, might naturally be somewhat dependent on the risk aversion level,
the maximum weight bound $w^{\max}$ and the volatility proxy $\sigma_t^2$. Therefore, in
this section we briefly present various additional and robustness analyses,
with detailed results compiled in Appendix A.
In Section 4.3.3, we considered the setting where taking a leveraged position
$w^{\max} > 1$ was not possible. Therefore, it is meaningful to also consider
empirical results when the maximum weight is $w^{\max} = 1.5$. This has also been
a rather common selection in the past predictive regression studies (see (ii) in
Section 4.2.1). It turns out that the main findings on the best single predictors
and differences between the utility boosting and the two-step approaches are
essentially intact (see Appendix A.1). Mainly DY, LTR and TMS seem
to perform, in a relative sense, even somewhat better when an investor has
access to 50% leverage (vs. the case $w^{\max} = 1$).
One important extension to the analysis presented in Section 4.3.3 is to im-
pose transaction costs as a part of the analysis. It is, however, largely an open 
issue how much emphasis we should put on this view. In addition to the fact 
that transaction costs are seemingly becoming smaller and smaller all the time, 
from the methodological point of view imposing transaction costs means that 
otherwise optimal portfolio weight determination might be severely disrupted 
by effectively unnecessary (continuous) portfolio rebalancing activity. As an 
example, is it optimal to rebalance the portfolio at all when moving from, say, 
wt = 0.75 to wt+1 = 0.7, given the loss of portfolio value due to transaction 
costs? Following this lead, in the optimal case transaction costs should be part 
of the econometric procedure and utility maximization. This is not, however, 
clear cut to implement and requires additional complicated steps extending 
the utility boosting introduced in Sections 4.2.2–4.2.3. Therefore, in this study
we content ourselves with the same view as in past return prediction studies: We
evaluate the received portfolio returns when transaction costs are imposed at
the evaluation stage, after first constructing the same ‘optimal’ asset allocation
decisions as above.
Following Marquering and Verbeek (2004), in the Appendix A.2 we present 
the corresponding results as in Tables 4.3–4.4 but now also incorporating 
low and high transaction costs scenarios given as percentage points (low 
(0.1%) and high (0.5%)) of the value traded (see also Rossi, 2018, and the 
references therein). As in the various previous studies, utility gains naturally 
decrease due to transaction costs, but especially in the low transaction costs 
scenario the empirical results are still favourable for the utility boosting over 
the benchmarks. All in all, the results in Tables 4.3–4.4 can be interpreted as 
upper bound estimates for the received economic gains. 
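Imposing proportional transaction costs at the evaluation stage can be illustrated with the sketch below. It is a simplification under stated assumptions: the value traded in each month is approximated by the absolute change in the risky-asset weight, ignoring the drift of the weight caused by realized returns, no cost is charged in the first month, and the function name is hypothetical. The exact cost convention of Appendix A.2 may differ.

```python
def net_returns(gross_returns, weights, cost_rate):
    """Portfolio returns net of proportional transaction costs on the weight change."""
    net = []
    for t, (r, w) in enumerate(zip(gross_returns, weights)):
        traded = abs(w - weights[t - 1]) if t > 0 else 0.0  # approximate value traded
        net.append(r - cost_rate * traded)
    return net
```

In the rebalancing example above, moving from a weight of 0.75 to 0.7 under the high cost scenario (0.5%) reduces the monthly return by 0.005 × 0.05 = 0.00025, i.e. 2.5 basis points.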
To get additional robustness for our findings, we also consider alternative
volatility proxies to the well-established 5-year rolling window estimate. In
Appendix A.4, we consider the natural alternative where the conditional
mean of excess stock returns is constant and the GARCH(1,1) model equation
is assumed for the conditional variance providing the volatility proxy $\sigma_t^2$.
Another check is performed with the realized volatility series RVOL as described
in Table 4.1 and Mele (2007). Again the exact numbers in the economic
goodness-of-fit measures naturally differ and certain variables (mainly the
term spread (TMS) and earnings-price ratio (EP)) perform somewhat better
than in our main analysis, but the main conclusions are still intact.
Finally, the results in Tables 4.3–4.4, and more generally the coming Section
4.3.5, show that less persistent macroeconomic predictors perform relatively
well in the utility boosting method. This suggests considering whether
taking the first differences of the (highly) persistent predictors (i.e. all the variables
except DFR, LTR and INFL) changes the big picture. It turns out that (see
Appendix A.5), when concentrating on out-of-sample forecasting results
as in the next section, especially the lagged changes in the dividend-price ratio
contain even higher predictive power than the level of DP, emphasizing
the conclusion that this theoretically well-motivated state variable contains
useful information in direct portfolio weight determination.
4.3.5 Out-of-sample forecasting results 
In this section, we report out-of-sample asset allocation results based on
portfolio weights obtained with the (single-step) utility boosting and (two-step)
linear predictive regression-based methods using the same macroeconomic
and technical indicator predictors as in Sections 4.3.3–4.3.4. The portfolio
weight for month t is thus constructed using the information contained
in x_{t−1} and the volatility proxy σ_t^2. This applies to both methods: after
training the respective weighting algorithms, the portfolio weight forecasts
are based only on the information available at time t − 1.
Due to the additional flexibility allowed by our utility boosting method over
the simple linear predictive regressions, we believe that the initial and
expanding estimation sample in out-of-sample forecasting should be slightly longer
than in various past studies. Therefore, we use January 1951 to December 1989
as the initial estimation (training) sample to generate portfolio
weights for January 1990. When the forecast origin moves ahead, the
estimation sample expands by one new observation each time portfolio
weights are constructed one month ahead. The forecasting evaluation period hence
contains the predicted weights from January 1990 to December 2018 (348
observations).
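The expanding-window scheme can be sketched as follows. The helper name and the data layout (months as (year, month) tuples) are assumptions made for illustration; only the sample dates come from the text.

```python
def month_range(start, end):
    """Inclusive list of (year, month) tuples from start to end."""
    year, month = start
    months = []
    while (year, month) <= end:
        months.append((year, month))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


months = month_range((1951, 1), (2018, 12))
first_forecast = months.index((1990, 1))

# Each split pairs the expanding training sample with the month whose
# portfolio weight is forecast from it.
splits = [(months[:i], months[i]) for i in range(first_forecast, len(months))]

print(len(splits))          # number of out-of-sample forecasts
print(splits[0][0][-1])     # last training month behind the first forecast
```

The first training sample covers 468 months (January 1951 to December 1989), each later one adds one month, and the number of splits equals the 348 evaluation months stated above.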
We analyze portfolio (asset allocation) performance in terms of the realized
average utility, CER gain and the Sharpe ratio (see equations (4.17), (4.19) and
(4.21)), all now defined for out-of-sample forecasting evaluation. These are
the economic evaluation criteria of interest, enabling a comparison between
different approaches to determining portfolio weights. In line with past
out-of-sample return predictability studies, we do not claim that a representative
investor would have ended up with exactly the best-performing model at each
step. Instead, and again in line with past studies, we are interested in
general findings on the usefulness of our new method over the past ones in a
relatively large set of predictors. 
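As a rough illustration of the three evaluation criteria, the sketch below uses a common mean-variance convention from the return predictability literature. Equations (4.17), (4.19) and (4.21) are defined outside this section and may differ in detail (for example in scaling or in the benchmark used), so the formulas, function names and numbers here are assumptions and need not reproduce the tabulated values.

```python
from statistics import mean, stdev, variance


def average_utility(portfolio_returns, gamma):
    """Mean-variance utility: mean return minus (gamma / 2) * variance."""
    return mean(portfolio_returns) - 0.5 * gamma * variance(portfolio_returns)


def cer_gain(model_returns, benchmark_returns, gamma, periods_per_year=12):
    """Annualized utility difference between a model and a benchmark."""
    difference = (average_utility(model_returns, gamma)
                  - average_utility(benchmark_returns, gamma))
    return periods_per_year * difference


def sharpe_ratio(portfolio_excess_returns):
    """Mean excess return divided by its standard deviation."""
    return mean(portfolio_excess_returns) / stdev(portfolio_excess_returns)


# Hypothetical monthly portfolio returns, for illustration only.
model = [0.01, 0.03, -0.01, 0.02]
benchmark = [0.00, 0.04, -0.03, 0.02]

gain = cer_gain(model, benchmark, gamma=5)
```

Under this convention a positive gain means that an investor with risk aversion γ would prefer the model-based weights to the benchmark weights, even if the model's average return is not higher.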
The main questions considered in this section, together with a comparison to
the full-sample results, are at least the following:
(a) To what extent can our utility boosting method outperform the benchmarks
(‘Const’ and ‘HA’), which essentially claim that there is no useful predictive
information in the state variable (variables) x_{t−1}? Specifically, the historical
average (‘HA’), i.e. the constant expected equity premium forecast (see (4.18)),
has been a popular and very stringent out-of-sample benchmark (see, e.g.,
Welch and Goyal, 2003, 2008; Campbell and Thompson, 2008; Neely et al.,
2014). Linear predictive regressions containing individual (macroeconomic)
variables typically fail to outperform this historical average statistically.
Moreover, it is a widely accepted stylized fact in empirical finance that the
time period since 1990 (as examined here) is the most challenging among the
past decades in terms of out-of-sample predictability of stock returns (see, e.g.,
Campbell and Thompson, 2008; Lettau and Van Nieuwerburgh, 2008),
presumably making it difficult for any new method to find meaningful additional
predictive value.
(b) The post-truncation of the weights obtained with linear predictive
regressions (4.4) to fulfil the bounds (4.8) has been found very useful in
past out-of-sample forecasting studies (see (ii) in Section 4.2.1). Therefore, this
‘truncated linear’ approach is expected to be a tough and well-trained
competitor to beat when the out-of-sample economic forecasting performance
is of interest.
(c) According to the main findings of Neely et al. (2014), obtained with
the past two-step method, technical indicators outperform
macroeconomic predictors in terms of both statistical and economic
out-of-sample evaluation criteria. Whether this is also the case in our approach is of
particular interest.
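The post-truncation in item (b) and the historical average benchmark in item (a) can be sketched as below. The sketch assumes the textbook mean-variance weight, i.e. the return forecast divided by γ times the variance proxy, clipped to the bounds; equations (4.4), (4.8) and (4.18) are not reproduced in this section, so the exact form, the function names and the numbers are assumptions.

```python
def two_step_weight(return_forecast, variance_proxy, gamma,
                    w_min=0.0, w_max=1.0):
    """Mean-variance weight from a return forecast, truncated to bounds."""
    raw = return_forecast / (gamma * variance_proxy)
    return min(max(raw, w_min), w_max)


def historical_average_forecast(past_excess_returns):
    """Benchmark forecast: the mean of excess returns observed so far."""
    return sum(past_excess_returns) / len(past_excess_returns)


# Hypothetical numbers, for illustration only.
inside = two_step_weight(0.005, 0.002, gamma=5)     # raw weight 0.5, kept
capped = two_step_weight(0.030, 0.002, gamma=5)     # raw weight 3.0, cut to 1
floored = two_step_weight(-0.010, 0.002, gamma=5)   # raw weight -1.0, cut to 0

ha_forecast = historical_average_forecast([0.01, -0.02, 0.03, 0.00])
ha_weight = two_step_weight(ha_forecast, 0.002, gamma=5)
```

Truncation acts only after the regression has been estimated, which is the sense in which the two-step weights are constrained "post" estimation rather than within the objective being optimized.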
Tables 4.7–4.8 report the out-of-sample forecasting results for the
benchmarks and single-predictor models. In these results, we have assumed that
the relative risk aversion coefficient γ is fixed to either γ = 5 (Table 4.7) or
γ = 3 (Table 4.8), with the upper bound weight constraint set as w^max = 1. As
expected, the utility levels and Sharpe ratios are generally somewhat lower
than in the in-sample results, and the historical average (HA) and the
two-step linear approaches (with weight truncation to fulfil (4.8)) are indeed,
in accordance with the past studies, performing well and are tough competitors
to the utility boosting. There are also several candidate predictors for which either
the utility boosting or the two-step approach cannot find useful out-of-sample
predictive power over the historical average weights (4.18).
As a whole, the results confirm that when a certain predictor has (some)
reasonable out-of-sample asset allocation predictive power in this sample
period, the Sharpe ratios are almost uniformly higher in the boosting approach
than in the alternative methods. That is, basing the econometric method on
economic criteria produces higher risk-adjusted returns than the conventional
statistical approaches. The results are also largely robust to the change in the
risk aversion level (γ = 5 and γ = 3).
In accordance with Neely et al. (2014), technical indicators as a group
perform better than macroeconomic variables in out-of-sample forecasting,
and the best (single) indicators are the same ones (i.e. MA(1,9),
MA(2,12), VOL(1,12) and VOL(3,12)) as in the in-sample analysis, showing
their robustness as predictors. In these cases, economic gains over the
benchmarks are clear. For MA(2,12) and VOL(3,12), the linear approach performs
as well as the utility boosting, which again points to the success
of the past truncated two-step approach. The simple binary nature of
technical indicators is likely an important contributing factor to their seemingly
better out-of-sample performance over macroeconomic variables.¹⁷ The latter
are much more vulnerable to possible, and arguably realized,
time-varying instability issues in this period of interest, including, among
others, the IT bubble around the millennium and low interest rates at the end
of the evaluation sample period.
Figure 4.2 illustrates the estimated out-of-sample portfolio weights for both
methods using the well-performing technical indicator MA(1,9) (see Table
4.1). As in the in-sample results (cf. Figure 4.1), there are periods where
weight truncation is strongly binding in the two-step statistical
approach. It is noteworthy that these truncation periods can last for
several years.
As mentioned above and in line with past return predictability findings,
17 Due to the binary nature of the variables, the superiority of the utility boosting
comes solely from the different objective function, as non-linearities behind the construction of
F(x_{t−1}) in the boosting estimator play no role for binary-valued predictors.
there are only a few macroeconomic variables which seem to have reasonable
out-of-sample predictive power in this sample period. From the utility
boosting perspective, in line with the full-sample results, inflation (INFL) and
default return spread (DFR) are the best ones and perform considerably better
than in the ‘linear’ statistical approach. On the other hand, the
well-documented erratic behaviour of the dividend-price ratio (DP) around
the end of the 1990s and the beginning of the 2000s presumably hurts its
performance when compared with the simple benchmarks. However, the utility
boosting clearly outperforms the linear approach also with these predictors,
as concluded already in the in-sample results. Moreover and importantly, when considering
the monthly changes of the dividend-price ratio (ΔDP), instead of the levels,
the utility boosting is able to find substantial out-of-sample predictive power
(see details in Appendix A.5).
Tables 4.7–4.8 concentrate on univariate (single-predictor)
out-of-sample forecasting results. As mentioned, we do not claim that the
best-performing predictors were known in advance. Therefore, multivariate
out-of-sample analysis provides important information on the internal model
selection capability of the utility boosting to find useful predictors at a time
and without any prior knowledge. Table 4.9 reports the multivariate results
for both γ = 5 and γ = 3. The ‘linear’ multivariate benchmark is
built upon principal component analysis (PCA) as in Neely et al. (2014).
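To indicate what a PCA-based benchmark involves, the following minimal sketch extracts the first principal component of a set of standardized predictors by power iteration. The resulting scores could then serve as the regressor in a linear predictive regression. The number of components, the standardization and the estimation details of the actual benchmark are not given in this section, so everything below (names, data and the choice of a single component) is an assumption.

```python
from statistics import mean, pstdev


def standardize(columns):
    """Scale each predictor (a list of observations) to mean 0, variance 1."""
    scaled = []
    for column in columns:
        centre, spread = mean(column), pstdev(column)
        scaled.append([(value - centre) / spread for value in column])
    return scaled


def first_principal_component(columns, iterations=200):
    """Loadings and scores of the first principal component.

    Uses power iteration on the correlation matrix of the predictors.
    """
    z = standardize(columns)
    k, n = len(z), len(z[0])
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / n for j in range(k)]
            for i in range(k)]
    loadings = [1.0 / (i + 1) for i in range(k)]    # arbitrary start vector
    for _ in range(iterations):
        product = [sum(corr[i][j] * loadings[j] for j in range(k))
                   for i in range(k)]
        norm = sum(value ** 2 for value in product) ** 0.5
        loadings = [value / norm for value in product]
    scores = [sum(loadings[i] * z[i][t] for i in range(k)) for t in range(n)]
    return loadings, scores


# Hypothetical predictors: the second is an exact linear function of the first.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0 * value + 1.0 for value in x1]
loadings, scores = first_principal_component([x1, x2])
```

With two perfectly correlated predictors the component weights both equally, which shows how PCA compresses a group of related predictors into a single series before the regression step.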
The multivariate out-of-sample results are very much in line with the
single-predictor as well as the in-sample results: each of the utility boosting-based
predictions from different groups of predictors outperforms the benchmarks.
Models based on macroeconomic variables (MACRO) yield only slightly
higher utilities compared to the benchmarks, whereas technical indicators
(TECH) are the best-performing group of predictors also in this multivariate
setting.
When comparing the multivariate utility boosting to the two-step predictive
regressions (combined with the PCA), we can see that the former
outperforms the latter for both MACRO and ALL (all the predictors combined).
For the technical indicators (TECH), the performances of the two methods
are essentially the same. This similarity is also evident in
Figure 4.3, where the estimated portfolio weights using TECH are graphically
illustrated for both methods (given γ = 5). Despite the slightly higher utility
levels in the two-step approach, the risk-adjusted portfolio returns (and the 
Sharpe ratios) are higher in the utility boosting. 
4.4 Discussion 
The current view in empirical finance appears to be that stock returns are
time-varying and at least at times somewhat predictable. Whether these predictable
statistical patterns translate into useful asset allocation decisions is arguably of
greater interest to investors in practice than commonly used statistical
criteria. However, the past empirical findings have been quite ambiguous in
this respect.
Most academic interest and professional investment advice has so far been
directed to two-step plug-in approaches: first finding useful (macroeconomic)
predictive variables, such as the dividend-price ratio and information
contained in the term structure of interest rates, to predict excess stock returns
with linear predictive regressions, and only then constructing portfolio weights. In
contrast, the utility boosting approach developed in this
study establishes a direct relationship between the predictive variables and
portfolio weights. This linkage is econometrically feasible when combined
with a gradient boosting algorithm customized specifically for this
objective, resulting in superior and less noisy portfolio performance than
obtained with the conventional two-step statistical approach relying on attempts
to predict weakly predictable stock returns. Although only experience from even
more extensive empirical examinations will determine the extent to which
utility boosting, and related advanced econometric approaches, actually improve
investment decisions, we believe it offers a valuable prospect for dealing with
a number of complicated aspects in asset allocation decisions in an integrated
and single-step manner.
Brandt (1999), Aït-Sahalia and Brandt (2001), Brandt and Santa-Clara
(2006), and Brandt et al. (2009) have already emphasized the advantages
of focusing on portfolio weights directly. Our findings, obtained with a very
different econometric method and emphasizing prediction (forecasting)
aspects, point out the advantages of this type of general thinking as well. The
customized gradient boosting algorithm learns, like investors in their
practical decisions, from portfolio weighting mistakes in the training stage
before selecting the portfolio weight for the next period. This importantly
bypasses the intermediate construction of expected stock returns, which is
the fundamental part of the two-step approach and hence involves a greater risk
of misspecification relative to the final asset allocation objective, given the
likely difficulties in statistically determining expected returns for the at most
weakly predictable monthly excess stock returns of interest in this study.
The fact that the promising utility boosting approach also internally
respects the pre-determined lower and upper bounds of the portfolio weights is
a continuation of the past out-of-sample return predictability studies pointing out
the usefulness of theoretically motivated subsequent restrictions on the linear
predictive regressions (see Campbell and Thompson, 2008; Rapach et al., 2010;
Pettenuzzo et al., 2014, and the survey of Rapach and Zhou, 2013). Imposing
such post restrictions turns out to modify the equity premium and portfolio
weight predictions substantially: see Figures 4.1–4.3 for both in- and
out-of-sample evidence in this respect. This all implies that, even if successful from the
prediction perspective, the conventional statistical interpretations of predictive
regressions are lost, while the utility boosting produces interpretable results
on the importance of different predictors and their combinations. It is also
noteworthy that our out-of-sample forecasting evaluation sample, starting
from the beginning of the 1990s, is one that has been found very challenging
from the prediction perspective (see, e.g., Campbell and Thompson, 2008;
Lettau and Van Nieuwerburgh, 2008). This makes it particularly notable that the
utility boosting produces substantial additional value over the past methods
and benchmarks.
Our empirical results generally suggest that there is some, but nowhere
near a one-to-one, connection between conventional statistical criteria, such as
t-values or the adjusted R², and the realized utility levels and economic
goodness-of-fit criteria. This is in line with the arguments of Elliott and Timmermann
(2016, chapter 2, and the references therein) and their call for a closer look to
strengthen the linkage between the appropriate objective function and the employed
econometric method. Our general findings also coincide with the conclusions
of Leitch and Tanner (1991), Kandel and Stambaugh (1996), Xu (2004) and
Cenesizoglu and Timmermann (2012), among others, that some poor return
prediction models, producing even worse statistical (out-of-sample) forecasts
than simple benchmarks, may add economic value when used to guide portfolio
decisions. In accordance with this view, our findings generally emphasize
that constant statistical mean predictability of the equity premium over time
is not necessary for some advanced econometric methods to be useful for
investors’ decision making. Similar argumentation is made when predicting
the direction (sign) of stock returns (see, e.g., Pesaran and Timmermann, 1995;
Christoffersen and Diebold, 2006) or in the recent findings of only episodic
(‘local’ and time-varying) return predictability (‘pockets of predictability’) in
stock returns and its relationship to asset pricing (see Farmer, Schmidt and
Timmermann, 2018; Demetrescu et al., 2020).
We obtain several robust empirical conclusions in different empirical
settings specified by the investor risk aversion levels, portfolio weight upper
bounds and volatility proxies. First, as in Neely et al. (2014), technical
indicators perform the best in out-of-sample forecasting and provide more
stable performance in different settings than macroeconomic variables.
Second, utility boosting mitigates excessive volatility in portfolio weights and
the resulting portfolio performance, leading to a generally superior return-risk
trade-off compared with the past two-step statistical approach. Third, utility boosting
improves the relative performance of certain macroeconomic variables compared
with the evidence from the predictive regression models. In particular, theoretically
motivated inflation and the dividend-price ratio (cf. Welch and Goyal (2008)
and the subsequent studies, surveyed by Rapach and Zhou (2013)) benefit
substantially from the direct portfolio weight determination. Notably in the
latter case, we find that specifically the changes, rather than the level, of the
dividend-price ratio provide important information for portfolio weights.
As emphasized by, e.g., Ban, El Karoui and Lim (2018), and the references
therein, several modern academic portfolio optimization models, with large
cross-sections of assets, are intractable when applied to real data due to
difficulties in estimation, although they have solid theoretical properties. Ban et al.
(2018) address this by adapting regularization and cross-validation approaches
for portfolio optimization. We do not concentrate on large cross-sections of
assets in this study. Instead, we integrate ongoing advancements in machine
learning and financial economics practice into the customized gradient
boosting approach, which could be extended to handle multiple risky assets
after some modifications; this requires a separate effort. This advancement
is also partly dependent on the development of boosting-based methods to
multiple-equation cases which are still largely non-existent in the machine 
learning literature. 
Finally, to enable the utility boosting method, we need to assume some
form for the underlying utility function. We rely on the quadratic form
commonly used in forecast evaluation in relevant past (out-of-sample) return
predictability studies. We are well aware of its limitations and we do not claim
that the conditions behind the mean-variance analysis are strictly satisfied in
practice. In future research, in addition to exploring alternative utility schemes,
another important related point is to consider whether it is optimal to
rebalance the portfolio at all at certain points in time. This view is linked with
the impact of transaction costs, larger cross-sections of assets and investment
horizons longer than one period. All these extensions require additional steps
to be taken with the utility boosting approach introduced in this study.
4.5 Conclusions 
In contrast to commonly used linear predictive regressions, we introduce a
flexible utility-based empirical approach to directly determine asset allocation
decisions in a simple setting between risky and risk-free assets. From our
standpoint, and diverging substantially from the usual financial economics
perspective, whether stock returns are statistically predictable is not of main
interest. Instead, we focus directly on the portfolio weights and their
dependence on the predictive variables by maximizing a sample analogue of the
utility function characterizing investors’ preferences in their asset allocation
decisions.
Our utility boosting approach arises from a synthesis of practices
in financial economics and recent advancements in machine learning. It
builds upon a customized gradient boosting algorithm introduced in this study,
which selects and combines predictive variables to form optimal portfolio weights
and is designed specifically for that objective instead of using general textbook
machine learning algorithms. Methodologically, our approach contains built-in
mechanisms that circumvent overfitting and keep the portfolio weights inside
pre-specified bounds, and it does not rely on the usual statistical
significance testing framework to determine the usefulness of certain predictive
variables in asset allocation decisions.
When applied to monthly U.S. excess stock returns, the utility boosting
method generates superior and economically meaningful utility gains over the
typical and commonly used benchmarks. These gains apply to both full-sample
and out-of-sample predictions, providing systematically higher risk-adjusted
portfolio returns than the past two-step statistical approach based on linear
predictive regressions. We find that especially various technical indicators and
some specific macroeconomic variables perform better in terms of portfolio
performance than the benchmarks based on historical average stock returns
when the objective is the economic gains of asset allocation decisions.
References 
Adler, T., and M. Kritzman (2007). Mean-variance versus full-scale optimisa-
tion: In and out of sample. Journal of Asset Management 7, 302–311. 
Aït-Sahalia, Y., and M.W. Brandt (2001). Variable selection for portfolio choice. 
Journal of Finance 56(4), 1297–1351. 
Audrino, F., and F. Trojani (2007). Accurate short-term yield curve forecasting
using functional gradient descent. Journal of Financial Econometrics 5, 591–623.
Ban, G-Y., El Karoui, N., and A.E.B. Lim (2018). Machine learning and portfolio 
optimization. Management Science 64(3), 1136–1154. 
Barberis, N. (2000). Investing for the long run when returns are predictable. 
Journal of Finance 55, 225–264. 
Brandt, M.W. (1999). Estimating portfolio and consumption choice: A condi-
tional Euler equations approach. Journal of Finance 54(5), 1609–1645. 
Brandt, M.W. (2010). Portfolio Choice Problems. Handbook of Financial 
Econometrics, Yacine Aït-Sahalia and Lars Hansen (eds), Vol. 1, Chapter 5. 
Brandt, M.W., and P. Santa-Clara (2006). Dynamic portfolio selection by 
augmenting the asset space. Journal of Finance 61, 2187–2217. 
Brandt, M.W., Santa-Clara, P., and R. Valkanov (2009). Parametric portfolio 
policies: Exploiting characteristics in the cross-section of equity returns. 
Review of Financial Studies 22(9), 3411–3447. 
Bühlmann, P., and T. Hothorn (2007). Boosting algorithms: Regularization, 
prediction and model fitting. Statistical Science 22, 477–505.
Campbell, J.Y., and S.B. Thompson (2008). Predicting excess stock returns out
of sample: Can anything beat the historical average? Review of Financial 
Studies 21, 1509–1531. 
Carmona, P., Climent, F., and A. Momparler (2019). Predicting failure in the 
U.S. banking sector: An extreme gradient boosting approach. International 
Review of Economics & Finance 61, 304–323. 
Cenesizoglu, T., and A. Timmermann (2012). Do return prediction models add
economic value? Journal of Banking & Finance 36, 2974–2987. 
Christoffersen, P.F., and F.X. Diebold (2006). Financial asset returns, direction-
of-change forecasting, and volatility dynamics. Management Science 52, 
1273–1287. 
Demetrescu, M., Georgiev, I., Rodrigues, P.M.M., and A.M.R. Taylor (2020). 
Testing for episodic predictability in stock returns. Journal of Econometrics, 
in press. 
Elliott, G., and A. Timmermann (2016). Economic Forecasting. Princeton Uni-
versity Press, Princeton and Oxford. 
Farmer, L., Schmidt, L., and A. Timmermann (2018). Pockets of predictability. 
CEPR Discussion Paper 12885. 
Fleming, J., Kirby, C., and B. Ostdiek (2001). The economic value of volatility
timing. Journal of Finance 56, 329–352. 
Friedman, J.H., Hastie, T., and R. Tibshirani (2000). Additive logistic regres-
sion: a statistical view of boosting (With discussion and a rejoinder by the 
authors). Annals of Statistics 28(2), 337–407. 
Friedman, J.H. (2001). Greedy function approximation: A gradient boosting 
machine. Annals of Statistics 29(5), 1189–1232. 
Friedman, J.H. (2002). Stochastic gradient boosting. Computational Statistics & 
Data Analysis 38(4), 367–378. 
Goyal, A., and I. Welch (2003). Predicting the equity premium with dividend 
ratios. Management Science 49(5), 639–654. 
Guidolin, M. (2011). Markov switching in portfolio choice and asset pricing 
models: A survey. Advances in Econometrics 27B, 87–178. 
Guidolin, M., and A. Timmermann (2007). Asset allocation under multivariate 
regime switching. Journal of Economic Dynamics and Control 31, 3503–3544. 
Inoue, A., and L. Kilian (2005). In-sample or out-of-sample tests of predictabil-
ity: Which one should we use? Econometric Reviews 23, 371–402. 
Kan, R., and G. Zhou (2007). Optimal portfolio choice with parameter uncer-
tainty. Journal of Financial and Quantitative Analysis 42, 621–656. 
Kandel, S., and R.F. Stambaugh (1996). On the predictability of stock returns: 
An asset-allocation perspective. Journal of Finance 51, 385–424. 
Leitch, G., and J. Tanner (1991). Economic forecast evaluation: Profits versus
the conventional error measures. American Economic Review 81, 580–590. 
Lettau, M., and S. Van Nieuwerburgh (2008). Reconciling the return pre-
dictability evidence. Review of Financial Studies 21(4), 1607–1652. 
Loh, W-Y., and Y-S. Shih (1997). Split selection methods for classification trees.
Statistica Sinica 7(4), 815–840. 
Marquering, W., and M. Verbeek (2004). The economic value of predicting 
stock index returns and volatility. Journal of Financial and Quantitative Analy-
sis 39, 407–429. 
Mele, A. (2007). Asymmetric stock market volatility and the cyclical behavior 
of expected returns. Journal of Financial Economics 86, 446–478. 
Mittnik, S., Robinzonov, N., and M. Spindler (2015). Stock market volatility: 
Identifying major drivers and the nature of their impact. Journal of Banking 
& Finance 58, 1–14. 
Neely, C.J., Rapach, D.E., Tu, J., and G. Zhou (2014). Forecasting the equity 
risk premium: The role of technical indicators. Management Science 60, 
1772–1791. 
Pesaran, H., and A. Timmermann (1995). Predictability of stock returns: 
Robustness and economic significance. Journal of Finance 50, 1201–1228.
Pettenuzzo, D., Timmermann, A., and R. Valkanov (2014). Forecasting stock
returns under economic constraints. Journal of Financial Economics 114, 
517–553. 
Rapach, D., and G. Zhou (2013). Forecasting stock returns. In Handbook
of Economic Forecasting, G. Elliott and A. Timmermann (eds), vol. 2A, 
North-Holland. 
Rapach, D.E., Strauss, J.K., and G. Zhou (2010). Out-of-sample equity premium 
prediction: Combination forecasts and links to the real economy. Review of 
Financial Studies 23, 821–862. 
Rossi, A. (2018). Predicting stock market returns with machine learning. 
Manuscript, University of Maryland (August 2018). 
Rossi, A.G., and A. Timmermann (2015). Modeling covariance risk in Merton’s 
ICAPM. Review of Financial Studies 28, 1429–1461. 
Sentana, E. (2005). Least squares predictions and mean-variance analysis. 
Journal of Financial Econometrics 5, 56–78. 
Stambaugh, R.F. (1999). Predictive regressions. Journal of Financial Economics 
54, 375–421. 
Welch, I., and A. Goyal (2008). A comprehensive look at the empirical per-
formance of equity premium prediction. Review of Financial Studies 21, 
1455–1508. 
Xu, Y. (2004). Small levels of predictability and large economic gains. Journal 
of Empirical Finance 11, 247–275. 
Zhu, X. (2015). Tug-of-war: Time-varying predictability of stock returns and 
dividend growth. Review of Finance 19, 2317–2358. 
Tables and Figures 
Table 4.1: Predictive variables. 
Panel A: Macroeconomic variables 
DP Log dividend-price ratio log (D/P ) 
DY Log dividend yield log (D/Y ), where Y is the lagged P 
EP Log earnings-price ratio log (E/P ) 
DE Log dividend-payout ratio log (D/E) 
RVOL Equity risk premium volatility, 12-month moving standard deviation (Mele, 2007) 
BM Book-to-market value ratio for the DJIA (Dow Jones Industrial Average) 
NTIS Net equity expansion 
TBL Treasury bill rate (three-month Treasury bill, secondary market) 
LTY Long-term government bond yield 
LTR Return on long-term government bonds 
TMS Term spread: LTY-TBL 
DFY Default yield spread 
DFR Default return spread 
INFL Inflation (CPI inflation), lagged by one period due to the delay in CPI releases.
Panel B: Technical indicators 
MA(1,9) Moving average indicator (4.14) with s = 1 and l = 9 
MA(1,12) Moving average indicator (4.14) with s = 1 and l = 12 
MA(2,9) Moving average indicator (4.14) with s = 2 and l = 9 
MA(2,12) Moving average indicator (4.14) with s = 2 and l = 12 
MA(3,9) Moving average indicator (4.14) with s = 3 and l = 9 
MA(3,12) Moving average indicator (4.14) with s = 3 and l = 12 
MOM(9) Momentum indicator (4.15) with m = 9 
MOM(12) Momentum indicator (4.15) with m = 12 
VOL(1,9) Volume indicator (4.16) with s = 1 and l = 9 
VOL(1,12) Volume indicator (4.16) with s = 1 and l = 12 
VOL(2,9) Volume indicator (4.16) with s = 2 and l = 9 
VOL(2,12) Volume indicator (4.16) with s = 2 and l = 12 
VOL(3,9) Volume indicator (4.16) with s = 3 and l = 9 
VOL(3,12) Volume indicator (4.16) with s = 3 and l = 12
Notes: D and E refer to log of a 12-month moving sum of dividends paid (D) and earnings (E) on the
S&P 500 index (P). Net equity expansion (NTIS) is the ratio of a 12-month moving sum of net equity
issues by NYSE-listed stocks to the total end-of-year market capitalization of NYSE stocks. Default yield
spread (DFY) is the difference between BAA- and AAA-rated corporate bond yields, whereas the default
return spread (DFR) is defined as the corporate bond return minus LTR. Following Welch
and Goyal (2008), and subsequent follow-up studies, inflation is the Consumer Price Index (All Urban
Consumers) and we consider its first lag as inflation information is released only in the following month.
Binary-valued technical indicators are defined in (4.14)–(4.16) with different price window lengths.
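As an illustration of how such binary-valued indicators are typically built, the sketch below follows the convention of Neely et al. (2014): the moving average rule gives a buy signal (1) when the short moving average of the price index is at or above the long one, and the momentum rule when the current price is at or above its level m months earlier. Equations (4.14)–(4.15) are assumed to match this convention; the function names and the price series are hypothetical.

```python
def moving_average(prices, t, length):
    """Average of the `length` most recent prices up to and including t."""
    if t < length - 1:
        raise ValueError("not enough past observations")
    return sum(prices[t - length + 1:t + 1]) / length


def ma_indicator(prices, t, s, l):
    """MA(s, l): 1 if the short average is at least the long average."""
    short_avg = moving_average(prices, t, s)
    long_avg = moving_average(prices, t, l)
    return 1 if short_avg >= long_avg else 0


def mom_indicator(prices, t, m):
    """MOM(m): 1 if the price is at least its level m months earlier."""
    return 1 if prices[t] >= prices[t - m] else 0


# Hypothetical price paths, for illustration only.
rising = [float(p) for p in range(100, 120)]
falling = list(reversed(rising))

signal_up = ma_indicator(rising, t=15, s=1, l=9)
signal_down = ma_indicator(falling, t=15, s=1, l=9)
momentum_up = mom_indicator(rising, t=15, m=9)
```

Because each indicator takes only the values 0 and 1, any function of a single indicator is fully described by two numbers, which is the point made in footnote 17 about non-linearities playing no role for these predictors.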
Table 4.2: Descriptive statistics of excess stock returns (r_{e,t}) and
macroeconomic predictive variables.
Variable Mean Std Min Max ρ̂1 ρ̂2 ρ̂3
Panel A: Excess stock returns
r_{e,t} 0.60 4.14 -22.11 16.14 0.04 -0.03 0.04
Panel B: Macroeconomic variables 
DP -3.53 0.42 -4.52 -2.60 0.99 0.98 0.97 
DY -3.53 0.42 -4.53 -2.59 0.99 0.98 0.97 
EP -2.80 0.42 -4.84 -1.90 0.99 0.97 0.94 
DE -0.73 0.29 -1.24 1.38 0.99 0.95 0.90 
RVOL 0.002 0.001 0.0002 0.008 0.96 0.91 0.86 
BM 0.51 0.25 0.12 1.21 0.99 0.99 0.98 
NTIS 0.01 0.02 -0.06 0.05 0.98 0.95 0.93 
TBL 4.25 3.08 0.01 16.30 0.99 0.98 0.96 
LTY 5.95 2.76 1.75 14.82 0.99 0.99 0.98 
LTR 0.52 2.75 -11.24 15.23 0.04 -0.07 -0.02 
TMS 1.70 1.38 -3.65 4.55 0.96 0.91 0.86 
DFY 0.96 0.44 0.32 3.38 0.97 0.92 0.88 
DFR 0.02 1.40 -9.75 7.37 -0.08 -0.09 -0.02 
INFL 0.29 0.36 -1.92 1.81 0.55 0.39 0.27 
Notes: This table presents descriptive statistics for the excess stock returns (Panel A,
i.e. simple equity risk premium) in percent (i.e. returns multiplied by 100) and 14
macroeconomic variables (Panel B) where LTR, DFR, and INFL (TBL, LTY, TMS,
and DFY) are measured in percent (annual percent). The sample period is
December 1950–December 2018. The first three sample autocorrelation coefficients
of each variable are denoted by ρ̂1, ρ̂2 and ρ̂3.
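The persistence measures in the last three columns, and the effect of first differencing a highly persistent predictor, can be sketched as follows. The series is simulated from an AR(1) process with coefficient 0.99 purely for illustration; it is not the thesis data, and the function names are hypothetical.

```python
import random


def autocorrelation(series, lag):
    """Sample autocorrelation coefficient at the given lag."""
    n = len(series)
    centre = sum(series) / n
    denominator = sum((value - centre) ** 2 for value in series)
    numerator = sum((series[t] - centre) * (series[t - lag] - centre)
                    for t in range(lag, n))
    return numerator / denominator


def first_difference(series):
    """Period-to-period changes of a series."""
    return [current - previous
            for previous, current in zip(series, series[1:])]


# Simulated persistent series with 817 monthly observations
# (hypothetical data, not the thesis data).
random.seed(0)
level = [0.0]
for _ in range(816):
    level.append(0.99 * level[-1] + random.gauss(0.0, 1.0))

change = first_difference(level)
rho_level = autocorrelation(level, 1)
rho_change = autocorrelation(change, 1)
```

The level series has a first-order autocorrelation close to one, as for DP or TBL in Panel B, while its first difference is close to uncorrelated, resembling the less persistent predictors DFR and LTR.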
Table 4.3: In-sample results for different predictors with γ = 5 and w^max = 1.
Utility boosting Linear, constrained weights 
Variable Util(%) CER(%) Sharpe Util(%) Sharpe t-val adj-R2 
Panel A: Macroeconomic variables 
DP 0.684 1.706 0.185 0.581 0.137 1.76 0.28 
DY 0.731 2.296 0.201 0.581 0.138 1.87 0.33 
EP 0.656 1.573 0.181 0.569 0.135 0.77 0.04 
DE 0.593 0.727 0.159 0.587 0.139 0.69 -0.02 
RVOL 0.730 2.355 0.206 0.601 0.137 2.80 0.63 
BM 0.718 2.101 0.202 0.568 0.132 0.58 -0.07 
NTIS 0.661 1.546 0.180 0.583 0.138 0.27 -0.10 
TBL 0.668 1.617 0.185 0.634 0.164 2.26 0.55 
LTY 0.634 1.268 0.181 0.606 0.150 1.49 0.18 
LTR 0.727 2.249 0.197 0.659 0.164 2.68 0.69 
TMS 0.695 1.914 0.192 0.657 0.169 1.90 0.41 
DFY 0.603 0.816 0.163 0.587 0.138 0.38 -0.08 
DFR 0.715 2.455 0.208 0.602 0.152 0.96 0.08 
INFL 0.687 1.945 0.192 0.578 0.139 0.35 -0.09 
Panel B: Technical indicators 
MA(1,9) 0.634 1.557 0.181 0.613 0.166 1.63 0.29 
MA(1,12) 0.656 1.895 0.190 0.653 0.184 1.97 0.55 
MA(2,9) 0.642 1.631 0.184 0.628 0.172 1.83 0.39 
MA(2,12) 0.676 2.149 0.197 0.677 0.193 2.34 0.76 
MA(3,9) 0.651 1.745 0.185 0.642 0.176 1.78 0.41 
MA(3,12) 0.613 1.217 0.170 0.594 0.153 1.02 0.08 
MOM(9) 0.618 1.294 0.173 0.598 0.156 1.11 0.11 
MOM(12) 0.618 1.273 0.171 0.596 0.156 1.10 0.12 
VOL(1,9) 0.624 1.358 0.175 0.593 0.154 1.30 0.16 
VOL(1,12) 0.640 1.593 0.181 0.623 0.170 1.73 0.39 
VOL(2,9) 0.613 1.236 0.171 0.598 0.157 1.31 0.18 
VOL(2,12) 0.619 1.331 0.174 0.610 0.164 1.49 0.29 
VOL(3,9) 0.606 1.112 0.167 0.585 0.149 0.95 0.04 
VOL(3,12) 0.651 1.692 0.183 0.643 0.175 1.93 0.52 
Panel C: Benchmarks 
Const 0.570 0.264 0.141 
HA 0.570 0.135 
Notes: This table reports portfolio performance measures for an investor, with the utility 
function (4.2) and the relative risk-aversion coefficient of five (γ = 5), who allocates portfolio 
value between the risky market portfolio and the risk-free rate in each month. The bounds for 
portfolio weights are w^min = 0 and w^max = 1. The reported predictions are from the 
customized gradient boosting (Section 4.2.3) with different single macroeconomic (Panel A) 
or technical indicators (Panel B) in xt−1 at a time and two benchmarks (‘Const’ refers to the 
case where xt−1 = 1 and ‘HA’ to the historical average weights (4.18)). On the right, we 
report the results of the linear predictive regressions (‘Linear’) with the subsequent weight 
constraints (4.8). ‘Util(%)’ refers to the realized utility (4.17) and ‘CER(%)’ is the CER gain (4.19). 
Finally, adj-R2 is the adjusted R2 and ‘t-val’ is the heteroskedasticity-autocorrelation 
consistent (robust) t-statistic for the null hypothesis βi = 0 in the predictive regression (4.9). 
Table 4.4: In-sample results for different predictors with γ = 3 and w^max = 1. 
Utility boosting Linear, constrained weights 
Variable Util(%) CER(%) Sharpe Util(%) Sharpe t-val adj-R2 
Panel A: Macroeconomic variables 
DP 0.814 1.526 0.187 0.711 0.147 1.76 0.28 
DY 0.858 2.084 0.202 0.705 0.146 1.87 0.33 
EP 0.739 0.760 0.173 0.698 0.145 0.77 0.04 
DE 0.701 0.209 0.159 0.687 0.139 0.69 -0.02 
RVOL 0.829 1.801 0.201 0.698 0.139 2.80 0.63 
BM 0.823 1.628 0.183 0.702 0.143 0.58 -0.07 
NTIS 0.779 1.209 0.178 0.702 0.143 0.27 -0.10 
TBL 0.897 2.694 0.212 0.728 0.160 2.26 0.55 
LTY 0.789 1.340 0.192 0.698 0.148 1.49 0.18 
LTR 0.870 2.290 0.205 0.771 0.167 2.68 0.69 
TMS 0.921 2.958 0.223 0.777 0.171 1.90 0.41 
DFY 0.706 0.255 0.163 0.705 0.144 0.38 -0.08 
DFR 0.879 2.654 0.221 0.721 0.155 0.96 0.08 
INFL 0.881 2.577 0.210 0.693 0.142 0.35 -0.09 
Panel B: Technical indicators 
MA(1,9) 0.739 1.061 0.181 0.700 0.161 1.63 0.29 
MA(1,12) 0.766 1.405 0.190 0.751 0.185 1.97 0.55 
MA(2,9) 0.747 1.110 0.183 0.717 0.169 1.83 0.39 
MA(2,12) 0.786 1.650 0.197 0.780 0.195 2.34 0.76 
MA(3,9) 0.756 1.211 0.184 0.722 0.169 1.78 0.41 
MA(3,12) 0.716 0.680 0.169 0.685 0.147 1.02 0.08 
MOM(9) 0.721 0.795 0.172 0.683 0.148 1.11 0.11 
MOM(12) 0.723 0.778 0.171 0.687 0.149 1.10 0.12 
VOL(1,9) 0.724 0.788 0.174 0.679 0.148 1.30 0.16 
VOL(1,12) 0.746 1.106 0.181 0.719 0.169 1.73 0.39 
VOL(2,9) 0.712 0.590 0.168 0.699 0.154 1.31 0.18 
VOL(2,12) 0.729 0.892 0.175 0.717 0.165 1.49 0.29 
VOL(3,9) 0.707 0.581 0.167 0.684 0.145 0.95 0.04 
VOL(3,12) 0.759 1.230 0.184 0.745 0.178 1.93 0.52 
Panel C: Benchmarks 
Const 0.698 0.182 0.144 
HA 0.695 0.141 
Notes: See the notes to Table 4.3. 
Table 4.5: In-sample results with different predictor groups. 
Utility boosting Linear (PCA), 
constrained weights 
Variables Util(%) CER(%) Sharpe Util(%) Sharpe adj-R2 
Panel A: Selections γ = 5 and w^max = 1 
MACRO 0.975 5.480 0.303 0.648 0.166 0.54 
TECH 0.680 2.144 0.199 0.640 0.178 0.45 
ALL 0.978 5.541 0.306 0.671 0.184 0.92 
Panel B: Selections γ = 3 and w^max = 1 
MACRO 1.066 4.882 0.293 0.742 0.160 0.54 
TECH 0.835 2.215 0.211 0.768 0.187 0.45 
ALL 1.070 4.895 0.293 0.782 0.187 0.92 
Notes: In the utility boosting, all the macroeconomic variables (MACRO), all the 
technical indicators (TECH) or both of them (ALL) are simultaneously included in 
xt−1 and the customized algorithm includes only the best of them in the final 
predictive models (see also Table 4.6). In the linear predictive regressions, the 
principal components of the predictor groups are first constructed (e.g., due to the 
perfect multicollinearity between the variables) and then combined with weight restrictions 
(4.8). The number of principal components is chosen according to adjusted R2 as in 
Neely et al. (2014). 
Table 4.6: Top-10 in-sample predictors in the multivariate utility boosting 
models with γ = 5 and w^max = 1. 
MACRO TECH ALL 
Variable Rel.inf(%) Variable Rel.inf(%) Variable Rel.inf(%) 
LTR 0.158 MA(2,12) 0.432 LTR 0.159 
TMS 0.148 VOL(1,12) 0.120 TMS 0.134 
NTIS 0.130 VOL(3,12) 0.108 NTIS 0.111 
DFR 0.101 VOL(1,9) 0.051 DFR 0.093 
RVOL 0.090 MA(2,9) 0.050 RVOL 0.089 
EP 0.074 MA(3,12) 0.042 EP 0.068 
DY 0.060 MA(1,9) 0.041 DY 0.065 
BM 0.051 MA(3,9) 0.036 BM 0.054 
TBL 0.046 VOL(3,9) 0.035 TBL 0.048 
DE 0.034 VOL(2,12) 0.026 DE 0.031 
Notes: For more information on the relative influence measure, see Friedman 
(2001). 
Table 4.7: Out-of-sample predictive results with γ = 5 and w^max = 1. 
Utility boosting Linear, 
constrained weights 
Variable Util(%) CER(%) Sharpe Util(%) CER(%) Sharpe 
Panel A: Macroeconomic variables 
DP 0.427 0.058 0.133 0.245 -2.325 0.032 
DY 0.420 -0.017 0.131 0.259 -2.032 0.041 
EP 0.412 0.358 0.150 0.499 0.948 0.167 
DE 0.371 -0.236 0.124 0.299 -1.600 0.085 
RVOL 0.444 0.302 0.141 0.382 -1.123 0.104 
BM 0.376 -0.644 0.110 0.412 -0.562 0.115 
NTIS 0.451 0.656 0.152 0.372 -0.751 0.114 
TBL 0.423 0.082 0.135 0.472 0.746 0.155 
LTY 0.433 0.507 0.148 0.463 0.298 0.142 
LTR 0.421 0.019 0.133 0.413 -0.453 0.122 
TMS 0.420 0.193 0.138 0.430 0.268 0.144 
DFY 0.405 -0.191 0.125 0.417 -0.399 0.122 
DFR 0.472 1.095 0.172 0.462 0.553 0.148 
INFL 0.450 0.445 0.145 0.391 -0.647 0.117 
Panel B: Technical indicators 
MA(1,9) 0.511 1.805 0.193 0.474 1.056 0.164 
MA(1,12) 0.496 1.769 0.187 0.528 2.065 0.197 
MA(2,9) 0.494 1.745 0.189 0.478 1.349 0.174 
MA(2,12) 0.542 2.437 0.209 0.565 2.580 0.214 
MA(3,9) 0.470 1.319 0.171 0.446 0.869 0.158 
MA(3,12) 0.470 1.184 0.172 0.453 0.765 0.155 
MOM(9) 0.484 1.402 0.178 0.487 1.211 0.169 
MOM(12) 0.485 1.392 0.178 0.486 1.130 0.166 
VOL(1,9) 0.492 1.513 0.181 0.442 0.654 0.152 
VOL(1,12) 0.527 2.004 0.195 0.484 1.372 0.175 
VOL(2,9) 0.461 1.137 0.169 0.446 0.759 0.155 
VOL(2,12) 0.495 1.507 0.181 0.504 1.417 0.175 
VOL(3,9) 0.474 1.177 0.171 0.448 0.500 0.147 
VOL(3,12) 0.535 2.056 0.198 0.547 2.043 0.195 
Panel C: Benchmarks 
Const 0.435 0.222 0.139 
HA 0.445 0.133 
Notes: See the notes to Table 4.3. 
Table 4.8: Out-of-sample predictive results with γ = 3 and w^max = 1. 
Utility boosting Linear, 
constrained weights 
Variable Util(%) CER(%) Sharpe Util(%) CER(%) Sharpe 
Panel A: Macroeconomic variables 
DP 0.525 -0.408 0.137 0.296 -3.324 0.051 
DY 0.478 -0.973 0.122 0.289 -3.389 0.048 
EP 0.526 -0.118 0.155 0.589 0.514 0.165 
DE 0.426 -1.394 0.112 0.438 -1.423 0.109 
RVOL 0.509 -0.641 0.129 0.523 -0.595 0.127 
BM 0.507 -0.681 0.134 0.508 -0.704 0.127 
NTIS 0.545 0.073 0.150 0.511 -0.428 0.131 
TBL 0.561 0.222 0.147 0.590 0.689 0.153 
LTY 0.519 -0.326 0.140 0.581 0.484 0.150 
LTR 0.533 -0.157 0.140 0.505 -0.603 0.128 
TMS 0.528 -0.108 0.141 0.608 0.869 0.158 
DFY 0.456 -1.275 0.114 0.541 -0.122 0.138 
DFR 0.594 0.847 0.179 0.562 0.441 0.154 
INFL 0.597 0.640 0.163 0.527 -0.253 0.135 
Panel B: Technical indicators 
MA(1,9) 0.609 1.190 0.186 0.557 0.548 0.159 
MA(1,12) 0.609 1.352 0.192 0.621 1.649 0.196 
MA(2,9) 0.595 1.132 0.184 0.551 0.625 0.163 
MA(2,12) 0.654 1.959 0.212 0.660 2.151 0.212 
MA(3,9) 0.575 0.809 0.174 0.508 0.100 0.149 
MA(3,12) 0.555 0.439 0.163 0.522 0.094 0.147 
MOM(9) 0.596 1.023 0.180 0.544 0.425 0.157 
MOM(12) 0.601 1.039 0.178 0.559 0.516 0.158 
VOL(1,9) 0.583 0.887 0.179 0.523 0.130 0.149 
VOL(1,12) 0.616 1.349 0.191 0.566 0.839 0.171 
VOL(2,9) 0.565 0.653 0.171 0.540 0.374 0.156 
VOL(2,12) 0.603 1.093 0.179 0.594 1.011 0.171 
VOL(3,9) 0.543 0.255 0.156 0.528 0.020 0.144 
VOL(3,12) 0.650 1.663 0.200 0.638 1.650 0.192 
Panel C: Benchmarks 
Const 0.563 0.320 0.146 
HA 0.551 0.141 
Notes: See the notes to Table 4.3. 
Table 4.9: Out-of-sample predictive results for predictor groups with γ = 5 
or γ = 3 and w^max = 1. 
Utility boosting Linear (PCA), 
constrained weights 
Variables Util(%) CER(%) Sharpe Util(%) CER(%) Sharpe 
Panel A: Selections γ = 5 and w^max = 1 
MACRO 0.480 1.075 0.164 0.269 -1.658 0.075 
TECH 0.513 1.840 0.191 0.522 1.885 0.191 
ALL 0.487 1.273 0.171 0.348 -0.136 0.132 
Panel B: Selections γ = 3 and w^max = 1 
MACRO 0.566 0.322 0.159 0.377 -1.932 0.097 
TECH 0.628 1.561 0.196 0.631 1.717 0.195 
ALL 0.571 0.441 0.164 0.441 -0.729 0.148 
Notes: See the notes to Table 4.5. 
[Figure: portfolio weight w(DPt−1) (vertical axis, 0.0–1.0) plotted against time (1960–2020) for the boosting and linear approaches.]
Figure 4.1: In-sample portfolio weights using the predictor DP (dividend-price ratio). 
[Figure: portfolio weight w(MA(1,9)t−1) (vertical axis, 0.0–1.0) plotted against time (1990–2020) for the boosting and linear approaches.]
Figure 4.2: Out-of-sample portfolio weights using the predictor MA(1,9). 
[Figure: portfolio weight w(TECHt−1) (vertical axis, 0.0–1.0) plotted against time (1990–2020) for the boosting and linear approaches.]
Figure 4.3: Out-of-sample portfolio weights using the predictor group TECH. 
Appendix A: Additional empirical results 
This Appendix presents several additional analyses and robustness checks 
for the main empirical results reported in Section 4.3 of the main paper. Unless 
otherwise mentioned, we consider the same monthly sample period and 
constructed volatility proxies as in Section 4.3. In Section A.1, we explore the 
impact of allowing for leveraged positions, while Section A.2 concentrates on 
the results when low and high transaction cost scenarios are imposed in 
the evaluation stage. Section A.3 presents an alternative setting of the customized 
gradient boosting algorithm where the regression tree-based base learner 
function is replaced by a spline-based learner. Section A.4 examines the 
robustness of our findings to the selected volatility proxy by utilizing the 
GARCH model and a shorter-horizon realized variance as volatility proxies. 
Finally, Section A.5 considers the predictive power of monthly changes in the 
highly persistent macroeconomic predictor variables, such as the 
dividend-price ratio, instead of their levels. 
A.1 Leveraged weights 
In Tables A.1.1 and A.1.2, we report the results when the upper bound for the 
portfolio weights is set to w^max = 1.5. This allows an investor to take leverage 
in his or her positions. Compared with the results of Tables 4.3–4.4 and 
4.7–4.8 in the main analysis, the essential conclusions remain intact (with some 
minor predictor-specific differences). 
Both in- and out-of-sample results reflect the fact that the utility boosting 
approach takes risk awareness into account in estimation. Naturally, the 
utility boosting benefits less than the two-step ‘linear’ approach from the 
ability to take leverage, as risky leveraged positions are less frequently taken 
when γ = 5 (in the latter, the upper bound w^max = 1.5, and even levels above 
that, are often reached without restrictions on the maximum portfolio weight). 
However, the Sharpe ratios are consistently higher in the utility boosting. With 
the higher tolerance to risk (γ = 3), riskier positions are allowed and hence 
the resulting utilities are naturally also somewhat higher. 
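The role of the upper bound in the two-step approach can be sketched as follows. The sketch assumes the textbook mean-variance rule, in which the unconstrained weight is the forecasted excess return divided by γ times the forecasted variance, followed by clipping to the interval [w^min, w^max]. The exact form of the constraint (4.8) is not reproduced in this appendix, so the function name, its arguments and the input values below are hypothetical illustrations only.

```python
def constrained_weight(expected_excess_return, variance, gamma,
                       w_min=0.0, w_max=1.0):
    """Mean-variance weight for the risky asset, clipped to [w_min, w_max]."""
    # Assumed rule: unconstrained weight = forecast / (gamma * variance).
    unconstrained = expected_excess_return / (gamma * variance)
    return min(max(unconstrained, w_min), w_max)


# Hypothetical inputs: 0.6% expected monthly excess return and a
# monthly return variance of 0.0016 (i.e. 4% monthly volatility).
forecast, variance = 0.006, 0.0016
for gamma in (5, 3):
    for w_max in (1.0, 1.5):
        w = constrained_weight(forecast, variance, gamma, w_max=w_max)
        print(f"gamma={gamma}, w_max={w_max}: weight={w:.3f}")
```

With these inputs the unconstrained weight is 0.75 for γ = 5 and 1.25 for γ = 3, so the bound w^max = 1 binds only for the less risk-averse investor, while w^max = 1.5 leaves both weights unrestricted.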
Table A.1.1: In-sample results of predictor-specific models with γ = 5 and 
γ = 3 when w^max = 1.5. 
γ = 5 γ = 3 
Utility boosting Linear, Utility boosting Linear, 
constr. weights constr. weights 
Variable Util(%) Sharpe Util(%) Sharpe Util(%) Sharpe Util(%) Sharpe 
Panel A: Macroeconomic variables 
DP 0.677 0.177 0.587 0.136 0.929 0.191 0.738 0.141 
DY 0.831 0.223 0.595 0.141 0.999 0.208 0.734 0.140 
EP 0.716 0.193 0.593 0.138 0.848 0.181 0.717 0.138 
DE 0.636 0.170 0.573 0.127 0.806 0.172 0.740 0.141 
RVOL 0.695 0.185 0.616 0.136 1.114 0.243 0.758 0.138 
BM 0.719 0.186 0.566 0.128 0.890 0.183 0.712 0.135 
NTIS 0.668 0.178 0.576 0.129 0.869 0.182 0.739 0.141 
TBL 0.782 0.212 0.642 0.155 0.877 0.187 0.808 0.165 
LTY 0.694 0.193 0.600 0.140 0.866 0.195 0.766 0.151 
LTR 0.803 0.215 0.688 0.161 1.059 0.220 0.841 0.164 
TMS 0.768 0.206 0.642 0.155 1.139 0.245 0.846 0.170 
DFY 0.650 0.170 0.565 0.124 0.823 0.175 0.745 0.141 
DFR 0.652 0.180 0.593 0.144 1.069 0.237 0.766 0.155 
INFL 0.690 0.187 0.571 0.128 0.931 0.197 0.733 0.142 
Panel B: Technical indicators 
MA(1,9) 0.643 0.182 0.633 0.164 0.810 0.181 0.770 0.164 
MA(1,12) 0.667 0.190 0.672 0.179 0.845 0.190 0.837 0.184 
MA(2,9) 0.651 0.183 0.649 0.171 0.824 0.184 0.797 0.172 
MA(2,12) 0.691 0.198 0.703 0.190 0.876 0.197 0.874 0.194 
MA(3,9) 0.662 0.185 0.664 0.174 0.836 0.185 0.816 0.175 
MA(3,12) 0.616 0.171 0.605 0.150 0.777 0.170 0.741 0.152 
MOM(9) 0.622 0.173 0.612 0.153 0.784 0.172 0.745 0.153 
MOM(12) 0.622 0.172 0.612 0.153 0.786 0.171 0.745 0.154 
VOL(1,9) 0.633 0.176 0.617 0.154 0.793 0.174 0.737 0.152 
VOL(1,12) 0.648 0.181 0.643 0.167 0.820 0.181 0.788 0.169 
VOL(2,9) 0.617 0.172 0.606 0.153 0.780 0.172 0.752 0.156 
VOL(2,12) 0.622 0.174 0.611 0.156 0.792 0.176 0.773 0.164 
VOL(3,9) 0.609 0.168 0.597 0.146 0.766 0.167 0.730 0.148 
VOL(3,12) 0.659 0.182 0.654 0.169 0.838 0.183 0.824 0.176 
Panel C: Benchmarks 
Const 0.564 0.144 0.715 0.142 
HA 0.564 0.127 0.714 0.137 
Notes: See the notes to Tables 4.3 and 4.4 in the body text. 
Table A.1.2: Out-of-sample results of predictor-specific models with γ = 5 
and γ = 3 when w^max = 1.5. 
γ = 5 γ = 3 
Utility boosting Linear, Utility boosting Linear, 
constr. weights constr. weights 
Variable Util(%) Sharpe Util(%) Sharpe Util(%) Sharpe Util(%) Sharpe 
Panel A: Macroeconomic variables 
DP 0.403 0.125 0.245 0.032 0.574 0.137 0.255 0.031 
DY 0.383 0.119 0.259 0.041 0.563 0.135 0.280 0.041 
EP 0.404 0.145 0.510 0.162 0.547 0.150 0.665 0.167 
DE 0.339 0.126 0.228 0.067 0.393 0.103 0.383 0.092 
RVOL 0.414 0.135 0.381 0.097 0.578 0.139 0.502 0.110 
BM 0.421 0.124 0.398 0.102 0.487 0.116 0.536 0.118 
NTIS 0.441 0.152 0.364 0.109 0.588 0.148 0.489 0.119 
TBL 0.363 0.131 0.412 0.134 0.620 0.152 0.627 0.155 
LTY 0.436 0.156 0.441 0.126 0.520 0.135 0.615 0.144 
LTR 0.427 0.139 0.439 0.122 0.547 0.134 0.525 0.122 
TMS 0.425 0.146 0.345 0.125 0.566 0.143 0.590 0.148 
DFY 0.432 0.136 0.351 0.092 0.436 0.103 0.556 0.128 
DFR 0.471 0.167 0.460 0.136 0.659 0.179 0.618 0.152 
INFL 0.417 0.137 0.368 0.105 0.622 0.153 0.514 0.122 
Panel B: Technical indicators 
MA(1,9) 0.514 0.188 0.509 0.160 0.709 0.196 0.627 0.164 
MA(1,12) 0.531 0.197 0.560 0.187 0.680 0.193 0.706 0.197 
MA(2,9) 0.506 0.189 0.515 0.172 0.675 0.190 0.628 0.173 
MA(2,12) 0.561 0.207 0.609 0.206 0.748 0.213 0.763 0.214 
MA(3,9) 0.455 0.166 0.470 0.154 0.636 0.176 0.572 0.156 
MA(3,12) 0.466 0.169 0.480 0.149 0.645 0.176 0.586 0.154 
MOM(9) 0.482 0.175 0.516 0.161 0.685 0.186 0.635 0.167 
MOM(12) 0.476 0.172 0.515 0.159 0.689 0.185 0.638 0.165 
VOL(1,9) 0.487 0.177 0.475 0.150 0.658 0.181 0.572 0.151 
VOL(1,12) 0.509 0.183 0.520 0.170 0.700 0.193 0.634 0.173 
VOL(2,9) 0.449 0.164 0.460 0.147 0.632 0.175 0.582 0.155 
VOL(2,12) 0.486 0.176 0.523 0.163 0.689 0.185 0.670 0.175 
VOL(3,9) 0.463 0.165 0.463 0.139 0.635 0.171 0.583 0.147 
VOL(3,12) 0.543 0.194 0.566 0.179 0.736 0.199 0.742 0.196 
Panel C: Benchmarks 
Const 0.413 0.138 0.578 0.141 
HA 0.438 0.120 0.588 0.136 
Notes: See the notes to Tables 4.3 and 4.4 in the body text. 
A.2 Transaction costs involved 
In the following Tables A.2.1 and A.2.2, we impose transaction costs along 
the lines of Neely et al. (2014) in the evaluation stage when computing the 
resulting portfolio returns and the values of different evaluation metrics. We 
consider two transaction cost scenarios (see Marquering and Verbeek, 2004; 
Rossi, 2018): in the low transaction cost case, 0.1% of the value traded is 
lost as a result of rebalancing portfolio weights, whereas the high transaction 
cost case corresponds to 0.5%. 
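As a rough illustration of how proportional costs enter in the evaluation stage, consider the following minimal sketch. It makes the simplifying assumption that the value traded in each month equals the absolute change in the risky-asset weight times the portfolio value, ignoring the drift in weights caused by asset returns within the month. The function name, the weight path and the returns are hypothetical and do not reproduce the exact procedure of Neely et al. (2014).

```python
def net_portfolio_returns(weights, excess_returns, cost_rate):
    """Portfolio excess returns net of proportional transaction costs.

    weights[t] is the risky-asset weight held during period t.
    Simplifying assumption: turnover is |weights[t] - weights[t-1]|.
    """
    net = []
    previous = weights[0]  # assumption: the initial position is set up at no cost
    for w, r in zip(weights, excess_returns):
        turnover = abs(w - previous)
        net.append(w * r - cost_rate * turnover)
        previous = w
    return net


LOW_COST, HIGH_COST = 0.001, 0.005  # 0.1% and 0.5% of the value traded

# Hypothetical weight path and monthly excess returns.
weights = [0.40, 0.90, 0.90, 0.20]
returns = [0.010, -0.020, 0.015, 0.005]

net_low = net_portfolio_returns(weights, returns, LOW_COST)
net_high = net_portfolio_returns(weights, returns, HIGH_COST)
print(net_low)
print(net_high)
```

Months in which the weight does not change incur no cost in this sketch, which is why predictors that produce long runs of identical weights are affected less than predictors that trigger frequent rebalancing.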
As discussed in Section 4.3.3, here we take the standpoint that asset 
allocation decisions are first determined at each step so that the potential impact 
of transaction costs is not taken into account. This corresponds to the idea of 
continuous portfolio rebalancing. Integrating transaction costs into the optimal 
portfolio weight determination requires several additional steps and detailed 
examination, which are hence left for future research. First of all, one needs to 
determine whether it is optimal to update the portfolio weight from period t 
to t + 1 at all. As discussed in the main text, say that 
the portfolio weight for the risky asset is 0.7 at time t (i.e. wt = 0.7) and 
the optimal prediction (not acknowledging the impact of transaction costs) is 
to change it to 0.75 for the period t + 1. Is this change of 0.05 large enough 
that we should really rebalance the portfolio, or should we just continue 
with the weight 0.7? As already pointed out by Brandt (1999), 
the impact of the investment horizon must also be addressed simultaneously 
when answering this question, whereas throughout this study we concentrate 
on the one-month horizon. 
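To give a sense of the magnitudes involved in this example, the cost of moving the weight from 0.7 to 0.75 can be worked out under the simplifying assumption that the value traded equals the absolute change in the weight times the portfolio value. Here c denotes the proportional cost rate (0.001 in the low and 0.005 in the high cost scenario); the symbol c is introduced only for this illustration.

```latex
\begin{align*}
\text{cost} &= c \, |w_{t+1} - w_t| = c \, |0.75 - 0.70| = 0.05\,c, \\
\text{low cost scenario:} \quad & 0.05 \times 0.001 = 0.00005 \quad (0.005\% \text{ of portfolio value}), \\
\text{high cost scenario:} \quad & 0.05 \times 0.005 = 0.00025 \quad (0.025\% \text{ of portfolio value}).
\end{align*}
```

Rebalancing would thus pay off, in this simplified one-period view, only if the expected gain from holding 0.75 instead of 0.7 exceeds these amounts.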
The above points show that the potential impact of transaction costs in 
various stages of our analysis is too fundamental to the method to be 
dealt with fully in this paper, given all the other advancements. 
Even though these extensions are left for future research, below we report 
the results when taking the same standpoint as in the past two-step statistical 
approaches. That is, the impact of transaction costs is taken into account in the 
evaluation stage when evaluating the resulting portfolio returns and utility 
levels. This view is natural in the two-step statistical approach, where the 
predictive regressions do not contain utility optimization, whereas transaction 
costs would be an important part of a direct portfolio optimization problem. 
Table A.2.1: In-sample results with low transaction costs when γ = 5 and 
γ = 3 with the portfolio weight upper bound w^max = 1. 
γ = 5 γ = 3 
Utility boosting Linear, Utility boosting Linear, 
constr. weights constr. weights 
Variable Util(%) Sharpe Util(%) Sharpe Util(%) Sharpe Util(%) Sharpe 
Panel A: Macroeconomic variables 
DP 0.680 0.183 0.579 0.136 0.808 0.186 0.710 0.147 
DY 0.725 0.199 0.579 0.137 0.850 0.199 0.704 0.146 
EP 0.653 0.180 0.567 0.134 0.736 0.172 0.697 0.144 
DE 0.591 0.159 0.586 0.139 0.698 0.158 0.686 0.138 
RVOL 0.716 0.201 0.597 0.136 0.815 0.196 0.695 0.139 
BM 0.711 0.200 0.566 0.132 0.819 0.182 0.702 0.143 
NTIS 0.657 0.179 0.581 0.137 0.775 0.177 0.701 0.143 
TBL 0.666 0.184 0.632 0.164 0.891 0.210 0.726 0.160 
LTY 0.632 0.180 0.604 0.150 0.784 0.191 0.697 0.148 
LTR 0.702 0.188 0.632 0.155 0.845 0.198 0.748 0.161 
TMS 0.690 0.190 0.652 0.168 0.909 0.219 0.773 0.170 
DFY 0.600 0.162 0.585 0.137 0.702 0.162 0.705 0.143 
DFR 0.695 0.200 0.586 0.147 0.857 0.214 0.711 0.152 
INFL 0.672 0.186 0.574 0.138 0.863 0.204 0.690 0.141 
Panel B: Technical indicators 
MA(1,9) 0.626 0.177 0.605 0.163 0.729 0.178 0.692 0.158 
MA(1,12) 0.649 0.187 0.645 0.181 0.758 0.187 0.742 0.182 
MA(2,9) 0.635 0.181 0.620 0.169 0.739 0.180 0.710 0.166 
MA(2,12) 0.670 0.195 0.670 0.190 0.779 0.195 0.773 0.192 
MA(3,9) 0.645 0.183 0.635 0.174 0.749 0.182 0.716 0.167 
MA(3,12) 0.609 0.168 0.590 0.152 0.711 0.167 0.682 0.146 
MOM(9) 0.611 0.170 0.592 0.154 0.714 0.170 0.679 0.147 
MOM(12) 0.613 0.169 0.592 0.154 0.718 0.169 0.684 0.148 
VOL(1,9) 0.614 0.170 0.583 0.150 0.711 0.169 0.670 0.145 
VOL(1,12) 0.631 0.178 0.613 0.167 0.736 0.178 0.709 0.166 
VOL(2,9) 0.606 0.168 0.591 0.155 0.706 0.166 0.694 0.152 
VOL(2,12) 0.613 0.171 0.604 0.162 0.722 0.173 0.711 0.164 
VOL(3,9) 0.600 0.165 0.580 0.147 0.701 0.164 0.681 0.144 
VOL(3,12) 0.645 0.181 0.636 0.173 0.753 0.182 0.739 0.176 
Panel C: Benchmarks 
Const 0.570 0.141 0.698 0.144 
HA 0.569 0.134 0.695 0.141 
Notes: In these results, the low transaction cost scenario (0.10% of the traded portfolio value) 
is taken into account when computing the values of economic goodness-of-fit measures. 
Table A.2.2: In-sample results with high transaction costs when γ = 5 and 
γ = 3 with the portfolio weight upper bound w^max = 1. 
γ = 5 γ = 3 
Utility boosting Linear, Utility boosting Linear, 
constr. weights constr. weights 
Variable Util(%) Sharpe Util(%) Sharpe Util(%) Sharpe Util(%) Sharpe 
Panel A: Macroeconomic variables 
DP 0.663 0.177 0.571 0.134 0.786 0.179 0.705 0.145 
DY 0.701 0.190 0.571 0.135 0.818 0.189 0.699 0.144 
EP 0.641 0.176 0.559 0.132 0.727 0.169 0.692 0.143 
DE 0.584 0.155 0.580 0.137 0.689 0.155 0.683 0.138 
RVOL 0.662 0.180 0.578 0.130 0.757 0.177 0.683 0.135 
BM 0.686 0.190 0.559 0.130 0.801 0.177 0.699 0.142 
NTIS 0.642 0.173 0.575 0.135 0.760 0.172 0.699 0.142 
TBL 0.656 0.180 0.625 0.161 0.867 0.203 0.720 0.158 
LTY 0.624 0.176 0.599 0.148 0.766 0.184 0.692 0.146 
LTR 0.603 0.155 0.523 0.120 0.744 0.167 0.658 0.135 
TMS 0.671 0.183 0.632 0.161 0.861 0.204 0.758 0.165 
DFY 0.589 0.157 0.578 0.135 0.686 0.156 0.702 0.143 
DFR 0.616 0.171 0.524 0.126 0.771 0.186 0.670 0.141 
INFL 0.614 0.165 0.556 0.132 0.787 0.181 0.679 0.138 
Panel B: Technical indicators 
MA(1,9) 0.592 0.163 0.571 0.149 0.690 0.163 0.662 0.148 
MA(1,12) 0.621 0.175 0.614 0.168 0.729 0.176 0.710 0.170 
MA(2,9) 0.608 0.170 0.591 0.158 0.710 0.170 0.682 0.157 
MA(2,12) 0.646 0.185 0.642 0.179 0.754 0.186 0.744 0.182 
MA(3,9) 0.619 0.173 0.609 0.164 0.721 0.172 0.690 0.158 
MA(3,12) 0.590 0.161 0.574 0.146 0.692 0.161 0.670 0.143 
MOM(9) 0.587 0.160 0.571 0.146 0.686 0.159 0.663 0.142 
MOM(12) 0.594 0.162 0.573 0.147 0.696 0.161 0.668 0.143 
VOL(1,9) 0.570 0.153 0.543 0.136 0.662 0.152 0.637 0.135 
VOL(1,12) 0.593 0.163 0.572 0.151 0.695 0.163 0.669 0.152 
VOL(2,9) 0.578 0.157 0.565 0.145 0.680 0.157 0.672 0.146 
VOL(2,12) 0.589 0.162 0.579 0.152 0.697 0.164 0.689 0.157 
VOL(3,9) 0.576 0.156 0.561 0.140 0.674 0.156 0.668 0.141 
VOL(3,12) 0.621 0.172 0.611 0.164 0.728 0.173 0.714 0.168 
Panel C: Benchmarks 
Const 0.567 0.140 0.697 0.144 
HA 0.564 0.133 0.693 0.141 
The full sample results in Tables A.2.1 and A.2.2 show that, compared with 
the main conclusions of Tables 4.3 and 4.4, the best predictors are 
still essentially the same. In other words, transaction costs naturally diminish 
the economic gains of different predictors quite symmetrically. The essential 
differences between the utility boosting and the current two-step approach are 
still valid. Less persistent variables (DFR and INFL) naturally suffer somewhat 
more from the inevitably stronger impact of transaction costs (i.e. due to more active 
portfolio rebalancing), especially in the high cost scenario. On the other hand, 
as technical indicators are defined as binary variables, there is, especially in 
full sample estimations, no portfolio rebalancing in most months as the value 
of the predictor remains the same (i.e. due to persistent runs of either zeros 
or ones as defined in Section 3.1). 
Table A.2.3: Out-of-sample results with low transaction costs when γ = 5 
and γ = 3 with the portfolio weight upper bound w^max = 1. 
γ = 5 γ = 3 
Utility boosting Linear, Utility boosting Linear, 
constr. weights constr. weights 
Variable Util(%) Sharpe Util(%) Sharpe Util(%) Sharpe Util(%) Sharpe 
Panel A: Macroeconomic variables 
DP 0.420 0.131 0.242 0.030 0.518 0.134 0.293 0.049 
DY 0.413 0.128 0.256 0.039 0.470 0.119 0.285 0.046 
EP 0.403 0.146 0.497 0.166 0.514 0.151 0.586 0.164 
DE 0.366 0.121 0.296 0.084 0.418 0.109 0.435 0.108 
RVOL 0.434 0.137 0.377 0.103 0.496 0.125 0.519 0.126 
BM 0.368 0.107 0.411 0.115 0.499 0.131 0.506 0.126 
NTIS 0.446 0.150 0.369 0.113 0.539 0.148 0.508 0.131 
TBL 0.418 0.133 0.470 0.155 0.556 0.145 0.590 0.153 
LTY 0.429 0.146 0.461 0.142 0.514 0.138 0.580 0.150 
LTR 0.396 0.124 0.383 0.113 0.505 0.132 0.477 0.120 
TMS 0.412 0.135 0.426 0.143 0.519 0.138 0.605 0.157 
DFY 0.396 0.122 0.414 0.121 0.445 0.111 0.538 0.137 
DFR 0.462 0.168 0.450 0.144 0.578 0.173 0.552 0.150 
INFL 0.435 0.139 0.381 0.114 0.581 0.158 0.519 0.133 
Panel B: Technical indicators 
MA(1,9) 0.506 0.191 0.469 0.162 0.602 0.184 0.553 0.157 
MA(1,12) 0.488 0.184 0.522 0.194 0.601 0.189 0.614 0.193 
MA(2,9) 0.487 0.186 0.472 0.171 0.588 0.181 0.545 0.161 
MA(2,12) 0.536 0.207 0.559 0.212 0.647 0.210 0.654 0.210 
MA(3,9) 0.462 0.169 0.439 0.155 0.567 0.172 0.501 0.147 
MA(3,12) 0.465 0.169 0.449 0.153 0.549 0.161 0.519 0.146 
MOM(9) 0.478 0.176 0.483 0.167 0.590 0.178 0.540 0.155 
MOM(12) 0.480 0.176 0.482 0.164 0.595 0.176 0.556 0.157 
VOL(1,9) 0.481 0.177 0.433 0.148 0.572 0.175 0.514 0.146 
VOL(1,12) 0.516 0.191 0.473 0.170 0.604 0.187 0.555 0.167 
VOL(2,9) 0.454 0.166 0.440 0.153 0.558 0.168 0.534 0.154 
VOL(2,12) 0.490 0.179 0.499 0.173 0.597 0.177 0.590 0.170 
VOL(3,9) 0.468 0.169 0.444 0.145 0.536 0.154 0.524 0.143 
VOL(3,12) 0.529 0.195 0.541 0.193 0.643 0.197 0.633 0.190 
Panel C: Benchmarks 
Const 0.435 0.139 0.563 0.146 
HA 0.443 0.133 0.551 0.141 
Notes: As in Tables A.2.1 and A.2.2, transaction costs are taken into account during the 
out-of-sample forecasting evaluation. 
Table A.2.4: Out-of-sample results with high transaction costs when γ = 5 
and γ = 3 with the portfolio weight upper bound w^max = 1. 
γ = 5 γ = 3 
Utility boosting Linear, Utility boosting Linear, 
constr. weights constr. weights 
Variable Util(%) Sharpe Util(%) Sharpe Util(%) Sharpe Util(%) Sharpe 
Panel A: Macroeconomic variables 
DP 0.394 0.121 0.233 0.025 0.491 0.126 0.281 0.044 
DY 0.383 0.117 0.243 0.031 0.441 0.110 0.269 0.039 
EP 0.369 0.129 0.487 0.161 0.466 0.132 0.578 0.161 
DE 0.344 0.113 0.283 0.080 0.385 0.098 0.423 0.104 
RVOL 0.397 0.123 0.358 0.097 0.445 0.110 0.503 0.122 
BM 0.337 0.093 0.404 0.112 0.466 0.119 0.501 0.125 
NTIS 0.427 0.143 0.357 0.109 0.517 0.141 0.499 0.128 
TBL 0.401 0.128 0.465 0.153 0.534 0.139 0.588 0.153 
LTY 0.414 0.140 0.455 0.140 0.494 0.131 0.577 0.149 
LTR 0.296 0.090 0.262 0.075 0.393 0.098 0.366 0.089 
TMS 0.381 0.125 0.412 0.139 0.481 0.127 0.595 0.155 
DFY 0.364 0.109 0.404 0.117 0.399 0.096 0.530 0.135 
DFR 0.421 0.149 0.406 0.128 0.511 0.149 0.512 0.138 
INFL 0.376 0.118 0.343 0.102 0.517 0.137 0.487 0.124 
Panel B: Technical indicators 
MA(1,9) 0.483 0.181 0.449 0.154 0.574 0.173 0.535 0.151 
MA(1,12) 0.458 0.172 0.496 0.183 0.570 0.177 0.586 0.182 
MA(2,9) 0.457 0.174 0.447 0.161 0.557 0.170 0.521 0.153 
MA(2,12) 0.510 0.196 0.537 0.202 0.619 0.199 0.629 0.201 
MA(3,9) 0.432 0.157 0.412 0.145 0.537 0.161 0.471 0.137 
MA(3,12) 0.445 0.161 0.434 0.148 0.525 0.153 0.505 0.142 
MOM(9) 0.456 0.166 0.465 0.160 0.562 0.168 0.525 0.150 
MOM(12) 0.462 0.168 0.468 0.159 0.571 0.168 0.544 0.153 
VOL(1,9) 0.439 0.159 0.395 0.133 0.528 0.158 0.478 0.134 
VOL(1,12) 0.474 0.174 0.432 0.153 0.560 0.171 0.509 0.151 
VOL(2,9) 0.426 0.154 0.413 0.142 0.527 0.157 0.508 0.145 
VOL(2,12) 0.471 0.171 0.480 0.165 0.572 0.168 0.572 0.164 
VOL(3,9) 0.446 0.159 0.426 0.139 0.507 0.144 0.509 0.138 
VOL(3,12) 0.505 0.186 0.520 0.184 0.618 0.188 0.609 0.182 
Panel C: Benchmarks 
Const 0.432 0.138 0.561 0.146 
HA 0.439 0.131 0.548 0.140 
In Tables A.2.3 and A.2.4, we present the out-of-sample forecasting results 
corresponding to Tables 4.7 and 4.8 in the main text, but now also 
incorporating transaction costs in the evaluation stage. The obtained realized 
utility levels naturally drop due to transaction costs. The drop appears to 
be of about the same magnitude in both the utility boosting and the ‘linear’ 
two-step statistical approach. As in the in-sample results above, the smaller 
amount of trading activity in the case of technical indicators implies that the 
impact of transaction costs is substantially smaller than for the macroeconomic 
variables, where partly quite erratic behaviour (such as during the IT bubble 
around the year 2000) complicates their use as predictors. 
A.3 Splines as base learner function 
Regression trees, which are used as the base learner function in the 
customized gradient boosting in the main analysis, have several advantages. 
Trees are highly flexible, easily interpretable and computationally efficient. 
However, trees are only one possible alternative in this respect. This 
section presents results when the base learner function is changed to a slightly 
more constrained spline-based alternative (i.e. using so-called P-splines). 
Table A.3.1: Spline-based in-sample predictive results with γ = 5 or γ = 3 
and w^max = 1. 
γ = 5 γ = 3 
Variable Util(%) CER(%) Sharpe Util(%) CER(%) Sharpe 
Panel A: Macroeconomic variables 
DP 0.651 1.309 0.172 0.762 0.929 0.172 
DY 0.647 1.270 0.170 0.763 0.953 0.171 
EP 0.648 1.434 0.174 0.779 1.304 0.175 
DE 0.588 0.556 0.150 0.706 0.211 0.148 
RVOL 0.619 1.000 0.164 0.708 0.233 0.151 
BM 0.566 0.341 0.145 0.741 0.705 0.160 
NTIS 0.655 1.419 0.173 0.771 1.095 0.170 
TBL 0.638 1.289 0.170 0.740 0.772 0.168 
LTY 0.603 0.849 0.164 0.740 0.675 0.161 
LTR 0.649 1.244 0.168 0.780 1.207 0.170 
TMS 0.681 1.728 0.186 0.793 1.368 0.181 
DFY 0.597 0.750 0.158 0.684 -0.035 0.150 
DFR 0.605 0.909 0.162 0.720 0.514 0.160 
INFL 0.643 1.321 0.173 0.766 1.114 0.177 
Panel B: Technical indicators 
MA(1,9) 0.633 1.529 0.181 0.739 1.055 0.181 
MA(1,12) 0.655 1.835 0.189 0.766 1.405 0.190 
MA(2,9) 0.641 1.625 0.184 0.749 1.162 0.183 
MA(2,12) 0.675 2.103 0.197 0.786 1.651 0.197 
MA(3,9) 0.650 1.698 0.185 0.756 1.208 0.184 
MA(3,12) 0.612 1.163 0.169 0.717 0.722 0.169 
MOM(9) 0.616 1.255 0.172 0.721 0.788 0.172 
MOM(12) 0.617 1.232 0.171 0.724 0.800 0.170 
VOL(1,9) 0.623 1.296 0.174 0.724 0.785 0.173 
VOL(1,12) 0.639 1.555 0.181 0.748 1.130 0.181 
VOL(2,9) 0.612 1.183 0.171 0.719 0.754 0.171 
VOL(2,12) 0.618 1.291 0.173 0.731 0.931 0.175 
VOL(3,9) 0.604 1.049 0.167 0.708 0.583 0.166 
VOL(3,12) 0.649 1.627 0.182 0.759 1.226 0.183 
Table A.3.2: Spline-based out-of-sample predictive results with γ = 5 or 
γ = 3 and w^max = 1. 
γ = 5 γ = 3 
Variable Util(%) CER(%) Sharpe Util(%) CER(%) Sharpe 
Panel A: Macroeconomic variables 
DP 0.431 0.084 0.135 0.545 -0.094 0.143 
DY 0.395 -0.425 0.120 0.529 -0.311 0.137 
EP 0.446 0.804 0.164 0.584 0.678 0.174 
DE 0.348 -0.604 0.112 0.433 -1.352 0.111 
RVOL 0.434 0.147 0.136 0.521 -0.522 0.131 
BM 0.401 -0.440 0.118 0.484 -0.953 0.122 
NTIS 0.489 1.152 0.165 0.615 0.917 0.167 
TBL 0.392 -0.128 0.130 0.543 -0.041 0.141 
LTY 0.424 0.228 0.138 0.505 -0.579 0.131 
LTR 0.404 -0.237 0.127 0.535 -0.167 0.138 
TMS 0.429 0.306 0.141 0.544 0.078 0.145 
DFY 0.391 -0.457 0.117 0.462 -1.140 0.119 
DFR 0.441 0.538 0.150 0.503 -0.517 0.132 
INFL 0.460 0.525 0.147 0.576 0.320 0.153 
Panel B: Technical indicators 
MA(1,9) 0.505 1.689 0.187 0.605 1.144 0.183 
MA(1,12) 0.508 1.953 0.193 0.618 1.513 0.195 
MA(2,9) 0.481 1.514 0.180 0.589 1.067 0.181 
MA(2,12) 0.553 2.573 0.213 0.660 2.063 0.214 
MA(3,9) 0.473 1.373 0.173 0.570 0.794 0.172 
MA(3,12) 0.459 1.011 0.164 0.562 0.588 0.166 
MOM(9) 0.475 1.244 0.172 0.581 0.843 0.174 
MOM(12) 0.480 1.268 0.173 0.585 0.852 0.173 
VOL(1,9) 0.478 1.328 0.174 0.590 0.979 0.181 
VOL(1,12) 0.499 1.674 0.184 0.611 1.333 0.190 
VOL(2,9) 0.456 1.028 0.165 0.559 0.583 0.167 
VOL(2,12) 0.493 1.429 0.178 0.592 0.960 0.173 
VOL(3,9) 0.470 1.055 0.166 0.556 0.401 0.160 
VOL(3,12) 0.527 1.972 0.194 0.645 1.678 0.199 
Table A.3.3: Spline-based in-sample predictive results for the predictor 
groups with γ = 5 or γ = 3 and w^max = 1. 
γ = 5 γ = 3 
Variables Util(%) CER(%) Sharpe Util(%) CER(%) Sharpe 
MACRO 0.840 3.819 0.252 0.959 3.504 0.253 
TECH 0.686 2.227 0.200 0.822 2.040 0.207 
ALL 0.851 4.016 0.261 0.993 3.975 0.270 
Lauri Nevasalmi 
Table A.3.4: Spline-based out-of-sample predictive results for the predictor 
groups with γ = 5 or γ = 3 and w^max = 1. 
γ = 5 γ = 3 
Variables Util(%) CER(%) Sharpe Util(%) CER(%) Sharpe 
MACRO 0.501 1.139 0.166 0.582 0.411 0.158 
TECH 0.511 1.907 0.194 0.588 1.106 0.183 
ALL 0.541 1.747 0.187 0.613 0.901 0.173 
The results in Tables A.3.1–A.3.4 are largely similar to those obtained 
with regression trees in Tables 4.3–4.4 and 4.7–4.9. This holds for both single 
predictors and multivariate predictor groups. Here, using all the predictors 
(ALL) even leads to somewhat better performance than in the regression tree 
case, further supporting the usefulness of the utility boosting method and its 
internal model selection capability. 
A.4 GARCH model and realized variance based volatility proxies 
In accordance with various closely related return predictability studies (see, 
e.g., Campbell and Thompson, 2008; Rapach et al., 2010; Rapach and Zhou, 
2013; Neely et al., 2014), we have used a five-year rolling window (moving 
average) based volatility proxy computed using historical excess returns. In 
this section, we consider two alternative ways to extract the necessary 
volatility proxy $\sigma_t^2$: the Generalized Autoregressive Conditional 
Heteroskedastic (GARCH) model and an alternative realized variance-based approach. 
In this section, we concentrate on the full-sample results. As this is already 
a robustness check for the main analysis, detailed out-of-sample forecasting 
results with continuous parameter updating would depend crucially on finding 
the optimal GARCH model specification and on a detailed evaluation of its 
forecasting performance. Hence, we believe that relying on the maximum data 
availability provides the best additional value over the main results. 
In the GARCH model specification for the excess market returns, the 
conditional variance $h_t^2$ is extracted with a GARCH error term, combined with 
a constant conditional mean in line with the evidence of Welch and Goyal 
(2008), among others (i.e. no mean return predictability). Formally, we hence set 
\[
r_{e,t} = \alpha_0 + h_t \nu_t,
\]
where $\nu_t$ is an independent and identically distributed error term with zero 
mean and unit variance ($\nu_t \sim \mathrm{iid}(0, 1)$). Following the large 
majority of past GARCH specifications, normality (i.e. $\nu_t \sim \mathrm{nid}(0, 1)$) 
is assumed for maximum likelihood estimation purposes, and the conditional 
variance is assumed to follow the GARCH(1,1) process 
\[
h_t^2 = c + b_1 h_{t-1}^2 + a_1 u_{t-1}^2, \qquad c > 0,\; b_1 \geq 0,\; a_1 > 0,
\]
where $u_t = r_{e,t} - \alpha_0$. There are, of course, various extensions available 
for the above specifications of the conditional mean and conditional variance, 
including, for example, asymmetric conditional variance equations (asymmetric 
GARCH models) and non-Gaussian innovations, but here we restrict ourselves to 
this commonly used specification. Finally, the extracted estimate of the 
conditional variance $\hat{h}_t^2$ acts as our volatility proxy in the procedures 
described in Section 2. 
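The variance recursion above can be illustrated with a minimal sketch. It only filters the conditional variance for given parameter values; the maximum likelihood estimation step is not shown. The function name, the starting value (the sample variance of $u_t$) and all numbers are assumptions made for illustration and are not taken from the thesis.

```python
import numpy as np

def garch11_variance(returns, alpha0, c, b1, a1):
    """Filter h_t^2 = c + b1*h_{t-1}^2 + a1*u_{t-1}^2 with u_t = r_{e,t} - alpha0."""
    r = np.asarray(returns, dtype=float)
    u = r - alpha0                      # demeaned excess returns u_t
    h2 = np.empty_like(u)
    h2[0] = u.var()                     # assumption: start from the sample variance
    for t in range(1, len(u)):
        h2[t] = c + b1 * h2[t - 1] + a1 * u[t - 1] ** 2
    return h2

# Made-up excess returns and made-up parameter values, for illustration only.
excess_returns = np.array([0.01, -0.02, 0.03, 0.00, -0.01])
h2 = garch11_variance(excess_returns, alpha0=0.002, c=1e-5, b1=0.85, a1=0.10)
```

In practice the parameters would be replaced by their maximum likelihood estimates before the filtered series is used as a volatility proxy.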
Table A.4.1 presents the (full sample) results obtained with our benchmark 
selections γ = 5 and w^max = 1 when the GARCH-based volatility proxy is 
employed. It turns out that, among the technical indicators, the same indicators 
perform best (MA(1,12), MA(2,12) and VOL(3,12)) as with the main volatility 
proxy. Among the macroeconomic variables, some changes occur, as EP and TMS 
are now among the best-performing variables in this setting. All in all, the 
main conclusions on the superiority of utility boosting over the traditional 
two-step ‘linear’ approach remain intact (seen as almost uniformly higher 
Sharpe ratios and realized utility levels). 
Table A.4.1: In-sample predictive results with γ = 5 and w^max = 1 when 
using the extracted conditional variance from the GARCH(1,1) model as the 
underlying volatility proxy $\sigma_t^2$. 
Utility boosting Linear, 
constrained weights 
Variable Util(%) CER(%) Sharpe Util(%) Sharpe t-val adj-R2 
Panel A: Macroeconomic variables 
DP 0.664 1.692 0.186 0.562 0.142 1.76 0.28 
DY 0.656 1.633 0.184 0.563 0.144 1.87 0.33 
EP 0.706 2.375 0.204 0.554 0.140 0.77 0.04 
DE 0.579 0.711 0.157 0.537 0.134 0.69 -0.02 
RVOL 0.690 2.229 0.209 0.555 0.145 2.80 0.63 
BM 0.666 1.755 0.186 0.536 0.133 0.58 -0.07 
NTIS 0.637 1.451 0.177 0.532 0.131 0.27 -0.10 
TBL 0.637 1.503 0.184 0.584 0.155 2.26 0.55 
LTY 0.625 1.308 0.180 0.563 0.146 1.49 0.18 
LTR 0.678 1.895 0.190 0.601 0.155 2.68 0.69 
TMS 0.704 2.280 0.202 0.595 0.154 1.90 0.41 
DFY 0.667 1.848 0.191 0.526 0.129 0.38 -0.08 
DFR 0.603 1.078 0.169 0.575 0.154 0.96 0.08 
INFL 0.622 1.302 0.174 0.540 0.135 0.35 -0.09 
Panel B: Technical indicators 
MA(1,9) 0.621 1.541 0.178 0.584 0.159 1.63 0.29 
MA(1,12) 0.662 2.039 0.192 0.628 0.175 1.97 0.55 
MA(2,9) 0.642 1.751 0.184 0.606 0.166 1.83 0.39 
MA(2,12) 0.683 2.293 0.199 0.654 0.184 2.34 0.76 
MA(3,9) 0.645 1.739 0.183 0.606 0.164 1.78 0.41 
MA(3,12) 0.614 1.316 0.171 0.569 0.148 1.02 0.08 
MOM(9) 0.612 1.320 0.172 0.571 0.151 1.11 0.11 
MOM(12) 0.617 1.350 0.171 0.571 0.150 1.10 0.12 
VOL(1,9) 0.603 1.253 0.170 0.575 0.153 1.30 0.16 
VOL(1,12) 0.627 1.583 0.179 0.596 0.162 1.73 0.39 
VOL(2,9) 0.613 1.368 0.174 0.578 0.154 1.31 0.18 
VOL(2,12) 0.620 1.502 0.177 0.587 0.158 1.49 0.29 
VOL(3,9) 0.600 1.162 0.168 0.562 0.146 0.95 0.04 
VOL(3,12) 0.642 1.693 0.181 0.609 0.164 1.93 0.52 
Panel C: Benchmarks 
Const 0.556 0.425 0.143 
HA 0.529 0.131 
As another realized volatility proxy and robustness check for the 5-year 
rolling window volatility, we utilize RVOL as defined in Table 4.1 (see Mele 
(2007), without annualizing RVOL). This means that the lagged RVOL (lagged 
by one month) then acts as our volatility proxy $\sigma_t^2$. As a 12-month moving 
average volatility estimator, RVOL is also rather persistent, as is the 60-month 
(5-year) moving average in the main analysis. However, due to the shorter 
computation window, the 12-month estimator naturally reacts somewhat more 
sharply to volatility changes. 
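The difference in reaction speed between a 12-month and a 60-month window can be seen in a small sketch. It uses a plain trailing sample variance, which is an assumption made here for illustration; it is not the RVOL definition of Table 4.1 (Mele, 2007), and the artificial return series is made up.

```python
import numpy as np

def rolling_variance(returns, window):
    """Trailing sample variance that uses only observations before month t."""
    r = np.asarray(returns, dtype=float)
    out = np.full(len(r), np.nan)       # undefined until a full window exists
    for t in range(window, len(r)):
        out[t] = r[t - window:t].var(ddof=1)
    return out

# Artificial series: returns alternate in sign, and their size jumps at month 120.
t = np.arange(180)
sign = np.where(t % 2 == 0, 1.0, -1.0)
scale = np.where(t < 120, 0.01, 0.05)
r = sign * scale

v12 = rolling_variance(r, 12)           # short window
v60 = rolling_variance(r, 60)           # long window, as in the main analysis
```

Twelve months after the jump, the short window contains only high-volatility observations, whereas the long window still averages over mostly low-volatility months.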
Table A.4.2: In-sample predictive results with γ = 5 and w^max = 1 when 
RVOL is the volatility proxy. 
Utility boosting Linear, 
constrained weights 
Variable Util(%) CER(%) Sharpe Util(%) Sharpe t-val adj-R2 
Panel A: Macroeconomic variables 
DP 0.667 1.877 0.186 0.585 0.142 1.76 0.28 
DY 0.639 1.556 0.174 0.588 0.144 1.87 0.33 
EP 0.728 2.740 0.207 0.571 0.136 0.77 0.04 
DE 0.584 0.972 0.157 0.547 0.128 0.69 -0.02 
RVOL 0.587 0.982 0.160 0.563 0.144 2.80 0.63 
BM 0.707 2.396 0.205 0.550 0.129 0.58 -0.07 
NTIS 0.646 1.724 0.180 0.535 0.123 0.27 -0.10 
TBL 0.629 1.599 0.180 0.587 0.148 2.26 0.55 
LTY 0.726 2.740 0.218 0.570 0.140 1.49 0.18 
LTR 0.681 2.068 0.190 0.596 0.146 2.68 0.69 
TMS 0.735 2.897 0.215 0.596 0.147 1.90 0.41 
DFY 0.590 1.036 0.159 0.537 0.124 0.38 -0.08 
DFR 0.622 1.418 0.170 0.587 0.148 0.96 0.08 
INFL 0.601 1.165 0.164 0.550 0.129 0.35 -0.09 
Panel B: Technical indicators 
MA(1,9) 0.615 1.639 0.176 0.590 0.155 1.63 0.29 
MA(1,12) 0.654 2.122 0.190 0.628 0.172 1.97 0.55 
MA(2,9) 0.627 1.732 0.180 0.609 0.163 1.83 0.39 
MA(2,12) 0.673 2.360 0.197 0.651 0.180 2.34 0.76 
MA(3,9) 0.636 1.838 0.181 0.610 0.162 1.78 0.41 
MA(3,12) 0.606 1.403 0.168 0.571 0.142 1.02 0.08 
MOM(9) 0.601 1.373 0.168 0.577 0.145 1.11 0.11 
MOM(12) 0.623 1.587 0.173 0.587 0.148 1.10 0.12 
VOL(1,9) 0.598 1.359 0.168 0.587 0.149 1.30 0.16 
VOL(1,12) 0.621 1.711 0.178 0.601 0.161 1.73 0.39 
VOL(2,9) 0.601 1.435 0.171 0.581 0.149 1.31 0.18 
VOL(2,12) 0.608 1.559 0.174 0.588 0.155 1.49 0.29 
VOL(3,9) 0.589 1.207 0.163 0.562 0.138 0.95 0.04 
VOL(3,12) 0.631 1.759 0.178 0.607 0.160 1.93 0.52 
Panel C: Benchmarks 
Const 0.562 0.611 0.143 
HA 0.541 0.125 
Table A.4.2 shows that when relying on the RVOL-based volatility proxy, 
essentially the same empirical findings on the methods and on 
predictor-specific performance arise as already pointed out in the GARCH case. 
Among the technical indicators, the same indicators again perform best, 
whereas there is more variation among the macroeconomic variables. 
Interestingly, in both robustness checks (GARCH and RVOL), EP and TMS stand out 
much more clearly among the macroeconomic variables than in the case of the 
slowly evolving 5-year moving average volatility proxy. 
Instead of the predictor-specific considerations, let us still consider 
multivariate results with different volatility proxies. Table A.4.3 reports the 
same groups of predictors and analysis settings as in Table 4.5 in the main 
text, but now for the GARCH and RVOL-based volatility proxies. 
Table A.4.3: In-sample predictive results with γ = 5 and w^max = 1 when 
using the extracted conditional variance from the GARCH(1,1) model and 
RVOL as the volatility proxy. 
Utility boosting Linear (PCA), 
constrained weights 
Variables Util(%) CER(%) Sharpe Util(%) Sharpe adj-R2 
Panel A: GARCH-based volatility proxy 
MACRO 0.934 5.223 0.305 0.584 0.158 0.54 
TECH 0.676 2.186 0.198 0.614 0.169 0.45 
ALL 0.944 5.377 0.313 0.650 0.188 0.92 
Panel B: RVOL-based volatility proxy 
MACRO 0.896 4.866 0.292 0.605 0.159 0.54 
TECH 0.648 1.982 0.191 0.605 0.161 0.45 
ALL 0.958 5.664 0.321 0.645 0.183 0.92 
The main results are largely the same as in Table 4.5. Larger information 
sets of predictors stabilize performance relative to the noise that accompanies 
the predictor-specific results. Compared with Table 4.5, perhaps the only slight 
change is that TECH now appears to contribute marginally more to the combined 
ALL case, both in the utility boosting and in the two-step linear PCA-based 
approaches. 
A.5 First differences of macroeconomic predictors 
The descriptive statistics in Table 4.2 show that the majority of the 
macroeconomic predictive variables are highly persistent. The only exceptions 
are LTR, DFR and INFL. This section considers whether taking the first 
differences of the persistent macroeconomic predictors affects the 
out-of-sample forecasting results. 
Table A.5.1: Out-of-sample results in predictor-specific models with γ = 5 
and γ = 3 (w^max = 1) after taking the first differences of the persistent 
macroeconomic predictors. 
γ = 5 γ = 3 
Utility boosting Linear, Utility boosting Linear, 
constr. weights constr. weights 
Variable Util(%) Sharpe Util(%) Sharpe Util(%) Sharpe Util(%) Sharpe 
ΔDP 0.510 0.182 0.438 0.139 0.649 0.194 0.520 0.139 
ΔDY 0.417 0.128 0.430 0.125 0.510 0.130 0.521 0.131 
ΔEP 0.380 0.123 0.353 0.104 0.491 0.135 0.478 0.123 
ΔDE 0.492 0.172 0.440 0.138 0.624 0.184 0.546 0.145 
ΔRVOL 0.396 0.127 0.419 0.125 0.519 0.138 0.535 0.137 
ΔBM 0.441 0.142 0.456 0.139 0.536 0.141 0.554 0.146 
ΔNTIS 0.435 0.139 0.416 0.124 0.530 0.145 0.561 0.145 
ΔTBL 0.417 0.136 0.452 0.135 0.527 0.136 0.541 0.138 
ΔLTY 0.428 0.138 0.435 0.130 0.541 0.142 0.545 0.139 
ΔTMS 0.418 0.138 0.406 0.121 0.457 0.120 0.542 0.139 
ΔDFY 0.421 0.137 0.464 0.139 0.546 0.144 0.569 0.147 
Notes: In this table, we report the out-of-sample forecasting results when taking the first 
differences of the persistent macroeconomic predictive variables (see Tables 4.1 and 4.2). 
Compare the out-of-sample forecasting results (with the levels) in Tables 4.7 and 4.8. 
Table A.5.2: Out-of-sample predictive results of the predictor groups 
(MACRO and ALL) with γ = 5 and w^max = 1 using different volatility 
proxies and taking the first differences of persistent macroeconomic 
predictors. 
Utility boosting Linear (PCA), 
constrained weights 
Variables Util(%) CER(%) Sharpe Util(%) CER(%) Sharpe 
Panel A: 5-year moving average volatility proxy 
MACRO 0.538 1.795 0.186 0.375 -0.985 0.108 
ALL 0.553 2.035 0.195 0.371 -0.520 0.119 
Panel B: GARCH-based volatility proxy 
MACRO 0.571 1.444 0.205 0.377 -1.356 0.118 
ALL 0.574 1.552 0.209 0.357 -1.375 0.116 
Panel C: RVOL-based volatility proxy 
MACRO 0.564 1.454 0.199 0.416 -1.003 0.123 
ALL 0.594 1.893 0.215 0.375 -1.244 0.115 
The forecasting results with the changes in the dividend-price ratio (ΔDP), 
and partly also with the differenced payout ratio (ΔDE), stand out. 
In other words, using the changes of the strongly asset pricing-motivated 
dividend-price ratio supports its use in portfolio weight determination even 
further. Together with the fact that the less persistent inflation (INFL) and 
partly also the default return spread (DFR) perform well in Tables 4.7 and 4.8, 
this all suggests that less persistent state variables seem generally more 
useful in utility boosting than highly persistent ones. 
The multivariate results in Table A.5.2 (cf. Table 4.9) again strongly support 
the usefulness of the utility boosting method. Using all the predictors, thus 
including the technical indicators (TECH), shows that additional predictive 
value can be obtained over using just the MACRO variables (here the first 
differences of the persistent macro variables augmented with LTR, DFR and 
INFL). It is also noteworthy that the multivariate results for utility boosting 
in Panel A are slightly higher on each evaluation criterion than those 
obtained without differencing in Table 4.9 of the main text. 
Appendix B: Comparison to Brandt and Santa-Clara (2006) 
In this Appendix, we briefly state the main points of Brandt and Santa-Clara 
(2006) relevant to the comparison with our method and analysis. Their study is 
probably the closest one to our approach in terms of the general idea of direct 
portfolio weight determination. However, their approach still differs from ours 
in various ways and addresses a somewhat different need. First of all, their 
empirical context is closely connected to potentially large cross-sections of 
assets. That is also at the centre of attention in Brandt (1999), Aït-Sahalia 
and Brandt (2001) and Brandt et al. (2009), whereas here we introduce a new 
empirical approach in a simple setting with one risky asset and the risk-free 
rate. 
Methodologically, in contrast to our highly flexible, nonlinear and advanced 
machine learning-motivated approach, Brandt and Santa-Clara (2006) 
assume that the optimal portfolio weights are linear in parameters 
\[
w_t = \theta x_{t-1}, \tag{B.1}
\]
where $\theta$ is the (row) vector of parameter coefficients. As in (4.2), they 
consider the decision-making problem of an investor who maximizes the 
conditional expected value of a quadratic utility function over the next period’s 
wealth. When writing their objective function in terms of our notation, we get 
\[
\max_{\theta} \left\{ r_{f,t} + \theta x_{t-1} r_{e,t} - \frac{\gamma}{2} (\theta x_{t-1})^2 r_{e,t}^2 \right\}. \tag{B.2}
\]
In our single risky asset setting (market portfolio and risk-free rate), the 
solution of the above optimization problem leads to the estimate 
\[
\hat{\theta} = \frac{1}{\gamma} \left[ \sum_{t=1}^{T} (x_{t-1} x_{t-1}') \, r_{e,t}^2 \right]^{-1} \sum_{t=1}^{T} x_{t-1} r_{e,t}. \tag{B.3}
\]
From this solution, the (empirical) weight invested in the risky asset is 
obtained by adding the corresponding products of the elements of $\hat{\theta}$ 
and $x_{t-1}$. This solution depends only on the data and does not require any 
assumptions about the distribution of stock returns, apart from stationarity 
(as in our utility boosting approach) and that returns are assumed to be 
non-i.i.d. (not independently and identically distributed). 
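The closed-form estimate in (B.3) can be computed directly, as the following minimal sketch shows. The simulated data, the inclusion of a constant in $x_{t-1}$ and the function name are assumptions made for illustration; they do not reproduce the data or results of the thesis or of Brandt and Santa-Clara (2006).

```python
import numpy as np

def linear_policy_theta(X_lag, r_e, gamma):
    """theta_hat = (1/gamma) * [sum_t x x' r^2]^{-1} sum_t x r, as in (B.3)."""
    X = np.asarray(X_lag, dtype=float)       # row t holds x_{t-1}
    r = np.asarray(r_e, dtype=float)         # excess returns r_{e,t}
    A = (X * (r ** 2)[:, None]).T @ X        # sum_t x_{t-1} x_{t-1}' r_{e,t}^2
    b = X.T @ r                              # sum_t x_{t-1} r_{e,t}
    return np.linalg.solve(A, b) / gamma

rng = np.random.default_rng(0)
T = 200
# Assumption: x_{t-1} consists of a constant and one simulated predictor.
X_lag = np.column_stack([np.ones(T), rng.normal(size=T)])
r_e = 0.005 + 0.04 * rng.normal(size=T)      # simulated excess returns
theta_hat = linear_policy_theta(X_lag, r_e, gamma=5.0)
weights = X_lag @ theta_hat                  # w_t = theta x_{t-1}, as in (B.1)
```

Nothing in this computation restricts the resulting weights to a pre-determined interval, so they may fall outside any bounds imposed elsewhere.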
The final solution (B.3) relies on the selected (simple) conditional variance 
specification in (B.2) (cf. equation (4.2)). This does not allow a 
straightforward use of a rolling window (moving-average) based volatility proxy 
$\sigma_t^2$ (as in utility boosting), which importantly acknowledges the 
well-documented volatility clustering in asset prices and has been found to be 
an important ingredient in past two-step statistical approaches (see Section 
4.2.1). A further point that prevents a direct comparison with our empirical 
perspective, framed by the recent large branch of return predictability studies, 
is that the solution (B.3) does not automatically respect the pre-determined 
bounds (4.8), which are guaranteed in, and are an important internal ingredient 
of, the utility boosting approach. 
Appendix C: Tuning customized gradient boosting 
As was shown in the proposed algorithm in Section 4.2.3, boosting builds 
the final model in a forward stagewise manner by adding new base learner 
functions to the ensemble that best fit the negative gradient of the loss function. 
The final additive boosting ensemble can be written as a sum of the base learner 
functions 
\[
F_M(x_{t-1}) = \sum_{m=1}^{M} \upsilon h_m(x_{t-1}).
\]
This equation also summarizes the tuning parameters related to gradient 
boosting, which are the base learner function $h_m(x_{t-1})$ and the number of 
iterations $M$. In order to make the optimization in functional space feasible, 
the base learners are assumed to belong to a certain parameterized class of 
functions. Regression trees and smoothing splines are two common choices. 
Simple linear functions are another alternative, leading to a linear final model. 
Regression trees split the predictor space into $J$ disjoint rectangles and 
attach a simple constant as the functional estimate in each of these rectangles. 
This is the method that we incorporate into the practical algorithm in Section 
4.2.3. Mathematically, these $J$-terminal node regression trees can be written as 
\[
h_m\left(x_{t-1}; \{c_{jm}, R_{jm}\}_{j=1}^{J}\right) = \sum_{j=1}^{J} c_{jm} I(x_{t-1} \in R_{jm}),
\]
where $c_{jm} \in \mathbb{R}$ is the functional estimate in region $R_{jm}$ at the $m$th iteration. 
The complexity of the regression tree base learner can be controlled by the 
number of terminal nodes $J$ and the number of observations required at 
each terminal node. The simplest regression tree with two terminal nodes is 
sufficient in our predictor-specific analysis. Each of the terminal nodes must 
contain at least 10 observations, a value commonly used in previous literature. 
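To make the additive structure concrete, the following sketch builds an ensemble of two-terminal-node trees with at least 10 observations per node. It is only an illustration under stated assumptions: it uses a squared-error loss rather than the utility-based loss of Section 4.2.3, treats υ as a fixed step length, and relies on simulated data and hypothetical function names.

```python
import numpy as np

def fit_stump(x, target, min_leaf=10):
    """Least-squares regression tree with two terminal nodes on one predictor."""
    order = np.argsort(x)
    xs, ts = x[order], target[order]
    best = None
    for i in range(min_leaf, len(xs) - min_leaf + 1):
        if xs[i] == xs[i - 1]:
            continue                          # cannot split between equal values
        c_left, c_right = ts[:i].mean(), ts[i:].mean()
        sse = ((ts[:i] - c_left) ** 2).sum() + ((ts[i:] - c_right) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, 0.5 * (xs[i - 1] + xs[i]), c_left, c_right)
    if best is None:
        raise ValueError("no admissible split for the given min_leaf")
    _, split, c_left, c_right = best
    return lambda z: np.where(z <= split, c_left, c_right)

def boost(x, y, M=50, step=0.1):
    """F_M(x) = sum_m step * h_m(x); squared-error loss is an assumption here."""
    learners, fitted = [], np.zeros_like(y)
    for _ in range(M):
        residual = y - fitted                 # negative gradient of 0.5*(y - F)^2
        h = fit_stump(x, residual)
        learners.append(h)
        fitted = fitted + step * h(x)
    return lambda z: step * sum(h(z) for h in learners)

rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = np.where(x > 0, 1.0, -1.0) + 0.1 * rng.normal(size=300)   # simulated target
F = boost(x, y)
```

Replacing the residual line with the negative gradient of another differentiable loss changes the objective while leaving the forward stagewise structure unchanged.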
The optimal number of iterations $M$ can be determined by setting aside 
part of the dataset (i.e. excluding this part from fitting the model) and using 
this separate test set to evaluate the generalization ability of the model. To 
this end, K-fold cross-validation is a commonly used method in which the dataset 
is randomly split into $K$ non-overlapping folds. Each of the $K$ folds (concretely 
$K = 5$ in this study) is used as a test set once while the model is trained using 
the remaining $K - 1$ folds. Using the utility function presented in (4.17), the 
validation utility is the average utility produced by the $K$ independent folds 
\[
CV = \frac{1}{K} \sum_{k=1}^{K} \bar{u}_k^{(-k)},
\]
where $\bar{u}_k^{(-k)}$ is the average utility when the data points of fold $k$ are 
used as an independent test set and not in fitting the model. The cross-validation 
estimate for the number of iterations $M$ is the one producing the maximum 
validation utility. 
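The selection rule can be sketched as follows. The fold utilities in the example are made-up numbers, and both function names are hypothetical; in an actual application each entry would be the average utility on fold k of a model fitted to the other folds and stopped after m iterations.

```python
import numpy as np

def kfold_indices(n_obs, K=5, seed=0):
    """Randomly split the indices 0, ..., n_obs - 1 into K non-overlapping folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_obs), K)

def select_iterations(fold_utilities):
    """Row k, column m - 1: average utility on fold k after m iterations."""
    cv = np.asarray(fold_utilities, dtype=float).mean(axis=0)  # CV for each m
    return int(np.argmax(cv)) + 1, cv

folds = kfold_indices(23, K=5)

# Made-up validation utilities for K = 5 folds and m = 1, ..., 4 iterations.
fold_utilities = np.array([
    [0.40, 0.50, 0.52, 0.49],
    [0.38, 0.47, 0.50, 0.51],
    [0.41, 0.49, 0.53, 0.50],
    [0.39, 0.48, 0.51, 0.47],
    [0.42, 0.51, 0.49, 0.48],
])
best_M, cv = select_iterations(fold_utilities)   # best_M is 3 for these numbers
```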
Friedman (2002) introduces an additional randomization step to the 
algorithm in Section 4.2.3 to further enhance its generalization ability. At 
each of the $M$ iterations, a random subsample is drawn without 
replacement from the entire dataset. The pseudo-residuals and the new base 
learner are then constructed using this random subsample instead of the entire 
dataset. A subsampling rate of one half is a commonly used choice. 
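The subsampling step itself is simple to sketch. The helper below is hypothetical and only draws the indices; in a full implementation the pseudo-residuals and the base learner of the current iteration would be computed on the selected observations.

```python
import numpy as np

def draw_subsample(n_obs, rate=0.5, rng=None):
    """Indices of a random subsample drawn without replacement."""
    rng = np.random.default_rng() if rng is None else rng
    size = int(np.floor(rate * n_obs))
    return rng.choice(n_obs, size=size, replace=False)

rng = np.random.default_rng(42)
idx = draw_subsample(120, rate=0.5, rng=rng)   # a fresh draw is made at every iteration
```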