AI-Driven Portfolio Management: A Comparative Research of Deep Reinforcement Learning
Techniques Against The 1/N Portfolio Strategy
Master’s thesis
in Accounting and Finance
Author:
Joni Aarnio
Supervisor:
Prof. Luis Alvarez Esteban
18/09/2025
Turku
The originality of this thesis has been checked in accordance with the University of Turku
quality assurance system using the Turnitin Originality Check service.
Master’s thesis
Subject: Accounting and Finance
Author: Joni Aarnio
Title: AI-Driven Portfolio Management: A Comparative Research of Deep Reinforce-
ment Learning Techniques against the 1/N Portfolio Strategy
Supervisor: Luis Alvarez Esteban
Number of pages: 70
Date: 18/09/2025
Recent advances in deep reinforcement learning (DRL) for portfolio management offers promising
methods, yet their real-world edge over simple heuristic allocations remains unclear. This thesis
evaluates whether state-of-the-art DRL agents can outperform the naive but hard-to-beat 1/N
strategy. Three algorithms: Proximal Policy Optimization (PPO), Advantage Actor-Critic (A2C)
and Deep Deterministic Policy Gradient (DDPG) are trained to allocate capital across ten highly
liquid U.S. large-capitalisation equities drawn from diverse sectors. Daily total-return data from
January 2010 to December 2024 are split chronologically: 2010-2019 forms the learning window,
while 2020-2024 provides an untouched out-of-sample testing period, capturing the COVID-19
shock and subsequent regime shifts.
The study contributes a rigorously controlled, multi-algorithm comparison that integrates real-
istic costs and robust statistics. The environment frames portfolio management as a sequen-
tial Markov decision process. Each state aggregates recent price dynamics, technical indicators,
rolling fundamentals and macro variables where actions are continuous weight vectors constrained
to full investment. A risk-adjusted reward embeds a 10 bp transaction-cost penalty to discourage
excessive turnover. Hyper-parameters are tuned via grid search, and model robustness is checked
across multiple random seeds.
Out-of-sample results reveal that none of the DRL agents delivers a statistically significant
improvement over equal weighting. The 1/N benchmark achieves a compound annual growth
rate of 20.9 % and the highest annualised Sharpe ratio (1.075), marginally ahead of DDPG
(0.916), A2C (0.840) and PPO (0.805). A Ledoit-Wolf circular block bootstrap with 1 000
replications finds p-values between 0.46 and 0.51 for Sharpe-ratio differentials, confirming that
observed gaps are indistinguishable from noise at conventional significance levels. Overall, the
evidence indicates that algorithmic ingenuity alone does not guarantee superior risk-adjusted
returns in liquid equity markets.
AI disclaimer: AI-based tools, particularly ChatGPT and Grammarly AI, were used during the
research for language editing, project coding, and LaTeX formatting.
Key words: Reinforcement learning, Stock markets, Portfolio management
Gradututkielma
Oppiaine: Laskentatoimi ja rahoitus
Tekijä: Joni Aarnio
Otsikko: Tekoälyvetoinen salkunhoito: Vertaileva tutkimus syvävahvistusoppimisen me-
netelmistä suhteessa 1/N-salkkustrategiaan
Ohjaaja: Luis Alvarez Esteban
Sivumäärä: 70
Date: 18/09/2025
Viimeaikaiset edistysaskeleet syvävahvistusoppimisen saralla (DRL) salkunhoidossa tarjoaa lupaavia
menetelmiä, mutta niiden todellinen etu verrattuna yksinkertaisiin heuristisiin allokointisääntöi-
hin on yhä epäselvä. Tämä pro gradu -tutkielma selvittää, kykenevätkö huipputason DRL-
agentit päihittämään naivin mutta vaikeasti voitettavaksi tunnetun 1/N-strategian. Kolme al-
goritmia: Proximal Policy Optimization (PPO), Advantage Actor-Critic (A2C) ja Deep Determ-
inistic Policy Gradient (DDPG) koulutetaan kohdentamaan pääomaa kymmeneen hyvin likvidiin
yhdysvaltalaiseen suuryhtiöosakkeeseen useilta toimialoilta. Päivittäinen kokonaistuottodata
ajalta tammikuu 2010 – joulukuu 2024 jaetaan kronologisesti: vuodet 2010–2019 muodostavat
oppimisjakson, kun taas 2020–2024 toimii koskemattomana ulkoisen testauksen ajanjaksona,
kattaen muun muassa COVID-19-shokin ja sitä seuranneet rakennemuutokset.
Tutkimus tarjoaa tiukasti kontrolloidun, useita algoritmeja vertailevan asetelman, joka yhdistää
realistiset kustannukset ja vankan tilastollisen analyysin. Salkunhoito mallinnetaan peräkkäis-
enä Markovin päätösprosessina, jossa tilavektori koostaa viimeaikaiset hintaliikkeet, tekniset in-
dikaattorit, rullaavat fundamentit ja makromuuttujat ja jossa toiminnot ovat jatkuvia pain-
ovektoreita, joiden on täytettävä täysinvestoinnin ehto. Riskikorjattu palkkio sisältää 10 korkop-
isteen transaktiokustannuspenaltin liiallisen vaihtuvuuden hillitsemiseksi. Hyperparametrit vir-
itetään ruutuhakumenetelmällä, ja mallien kestävyys testataan useiden satunnaissiementen avulla.
Ulkoisen testiaineiston tulokset osoittavat, ettei mikään DRL-agenteista saavuta tilastollisesti
merkittävää parannusta tasapainottuvaan 1/N-strategiaan nähden. Vertailustrategia tuottaa
20,9 %:n yhdistetyn vuotuisen kasvuvauhdin ja korkeimman annualisoidun Sharpe-suhteen (1,075),
niukasti DDPG:n (0,916), A2C:n (0,840) ja PPO:n (0,805) edellä. Ledoit–Wolfin syklinen lo-
hkobootstrap (1 000 replikointia) antaa Sharpe-eroille p-arvot 0,46–0,51, mikä vahvistaa, että
havaitut erot ovat perinteisin raja-arvoin erottamattomia satunnaisvaihtelusta. Tulokset viit-
taavat siihen, että jopa kehittyneet DRL-mallit jäävät likvideillä osakemarkkinoilla yksinker-
taisen, kustannustehokkaan 1/N-strategian varjoon.
Tekoälyseloste: Tutkielman laatimisessa on hyödynnetty tekoälypohjaisia työkaluja, erityisesti
ChatGPT:tä ja Grammarly AI:ta, kielenhuoltoon, projektikoodin tuottamiseen ja LaTeX-muotoiluun.
Avainsanat: Vahvistusoppiminen, Osakemarkkinat, Salkunhoito
CONTENTS
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Objectives and structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 THEORETICAL FRAMEWORK . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1 Stock market predictability . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Portfolio theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Previous studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3 MACHINE LEARNING FRAMEWORK . . . . . . . . . . . . . . . . . . . . . . 20
3.1 Neural Networks and the Backpropagation Algorithm . . . . . . . . . . . . 20
3.2 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 DRL Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.1 Advantage Actor-Critic . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.2 Proximal Policy Optimization . . . . . . . . . . . . . . . . . . . . . 30
3.3.3 DDPG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4 METHODOLOGY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.1 Data description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2 Defining the state space . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3 Model Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.4 Training the agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.5 Performance evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5 RESULTS AND DISCUSSION . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.1 Data Characteristics and Feature Behavior . . . . . . . . . . . . . . . . . . 44
5.2 DRL Model Configuration and Hyperparameters . . . . . . . . . . . . . . . 47
5.3 Out-of-Sample Performance Comparison . . . . . . . . . . . . . . . . . . . 48
5.4 Interpretation and Critical Discussion . . . . . . . . . . . . . . . . . . . . . 56
6 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
FIGURES
1 Multilayer neural network . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2 Activation functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3 Chain rule illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4 Backward pass in a neural network . . . . . . . . . . . . . . . . . . . . . . 24
5 Agent-environment interaction . . . . . . . . . . . . . . . . . . . . . . . . . 25
6 Correlation matrix of daily log-returns for the ten equities. . . . . . . . . . 45
7 Empirical distribution of the treasury rate . . . . . . . . . . . . . . . . . . 46
8 Empirical distribution of the bond-market volatility . . . . . . . . . . . . . 46
9 Cumulative portfolio value in the test period . . . . . . . . . . . . . . . . . 49
10 Rolling 60-day annualised Sharpe ratio for each strategy. . . . . . . . . . . 51
11 Evolution of portfolio weights . . . . . . . . . . . . . . . . . . . . . . . . . 52
12 Median portfolio trajectories from multiple random seeds. . . . . . . . . . . 53
13 Median annualised Sharpe ratios from multiple seeds. . . . . . . . . . . . . 54
TABLES
1 Summary of studies on reinforcement learning in portfolio management. . . 19
2 Final stock universe by sector and style . . . . . . . . . . . . . . . . . . . . 35
3 Hyperparameter Grid Search Values . . . . . . . . . . . . . . . . . . . . . . 41
4 Key descriptive statistics for the ten constituent equities. . . . . . . . . . . 44
5 Descriptive statistics for the two macro-economic features. . . . . . . . . . 44
6 Final hyperparameters chosen for each DRL algorithm. . . . . . . . . . . . 47
7 Out-of-sample performance metrics. . . . . . . . . . . . . . . . . . . . . . . 50
71 INTRODUCTION
1.1 Background
Every rational investor wants to maximize the utility received from their investments. This
stems to the central research topic in finance about the profitability of active portfolio
management and whether it is possible to consistently outperform the market benchmarks.
One of the cornerstones in finance theory is the Efficient Market Hypothesis (EMH)
proposed by Fama (1970), which states that in efficient markets asset prices fully reflect
all available information. This theory suggests that all new information is incorporated
into stock prices instantly, making it difficult to systematically outperform benchmark
indexes even trough advanced modelling techniques. This assumption challenges the
ability of even the most advanced methods such as deep reinforcement learning (DRL),
to outperform the market benchmark if the market is truly efficient.
Alternative theories have been proposed to challenge the idea of perfectly efficient mar-
kets and to provide additional understanding about the price information. Behavioral
Finance argues that individual cognitive biases, such as overconfidence or loss aversion,
play significant role in investment decisions and can systematically differ from rational
expectations (Barberis and Thaler, 2003, 1063-1070). This notion is extended by heur-
istics that are fuelled by emotions, such as fear, that propagate these biases throughout
financial markets (Shiller, 2017, 974). These perspectives underscore that market beha-
vior are not only shaped by informational efficiency but also by investors psychological
and behavioral factors, which can create exploitable, albeit transient, anomalies. The
Adaptive Market Hypothesis (AMH), proposed by Lo (2004), takes into account prin-
ciples from EMH but also from Behavioral Finance. AMH acknowledges that market
efficiency is dynamic and instead of being static market efficiency becomes more efficient
as investors adapt to changing environments. Therefore AMH framework suggests that
markets are capable for temporary inefficiencies, as well as learning mechanisms occurring
among market participants.
Modern artificial intelligence (AI) and machine learning (ML) technologies are providing
new non-linear multiphase methods that can be used in portolio management to support
investment decisions in complex market conditions. These methods now challenge tradi-
tional portfolio management models that are widely used in both academia and institu-
tional portfolio management. The Capital Asset Pricing Model (CAPM), first introduced
by Sharpe (1964), has generally been a foundational approach in the field. CAPM is built
on the presumption that all equity risk premiums are originated from a single market risk
factor, also known as systematic risk. The model suggests that asset returns are linearly
8related to market movements, implying that equity prices are purely determined by this
single market factor. Even though CAPM and similar models have laid the foundation for
portfolio theory, they are often viewed problematic because they often make assumptions
which oversimplify the financial markets that in reality hold complex dynamics.
Recent research by Gu et al. (2020) suggests that multi-factor models and non-linear
forecasting methods can significantly enhance prediction accuracy, as they can capture a
broader range of variables influencing market behaviour. This shift has provoked a new
wave of research that focuses on leveraging advanced machine learning techniques, such
as DRL, to identify non-linear patterns within complex financial data structures and to
improve investment outcomes. An advanced branch of machine learning, DRL, combines
reinforcement learning (RL) principles with deep learning architectures. This combination
allows DRL agents to make autonomous, data driven decisions in uncertain and constantly
shifting market conditions. DRL’s adaptability is a particularly valuable attribute, as it
can optimize portfolio allocations by processing and reacting to vast quantities of real-time
data more effectively than static models. Recent study by Huang et al. (2022) has shown
DRL’s capacity to consistently optimize returns while maintaining lower transaction costs
by employing short-selling and arbitrage mechanisms. These findings underscore the po-
tential for DRL-based models to outperform traditional strategies, especially in highly
volatile environments. Additionally, another key advantage of DRL based models is its
potential for explainability, which is crucial in financial decision-making. While machine
learning models are usually considered as hardly explainable black box types, techniques
like integrated gradients, for example a study by Guan and Liu (2021), have been de-
veloped to elaborate which data features influence the agent’s decision-making process.
This transparency can help enhance investors trust in AI-driven investment strategies and
make them more understandable.
Existing research of deep reinforcement learning in finance context often involve method-
ological shortcomings. Key essential shortcome is using single algorithms against weak
baselines and lack of in-depth analysis of the results. In this thesis these gaps are ad-
dressed by providing added value on several fronts. First, a controlled framework for
multiple key DRL algorithms is deployed and evaluated against a robust 1/N portfo-
lio benchmark aligning this research with important strand of literature by DeMiguel
et al. (2009) that calls for caution against complex models. More importantly, this study
provides deeper analysis to uncover the sources of returns by linking empirical results to
financial theory. For example, agent’s success might stem from genuine market-timing
ability challenging Efficient Market Hypothesis or simply by taking on greater systematic
risk that could be explained by CAPM. Further, these findings are elevated with stat-
istical testing to assess whether any outperformance is statistically significant. Finally,
the durability of the models is verified through grid search and multiple seeds simulation
9to account for hyperparameter variations and stochastic factors to offer more robust and
repeatable findings.
In summary, DRL models represent a new promising orientation in portfolio management,
not only because they allow better advanced adaptability but also for their capabilities to
deal with complex and interconnected nature of modern financial market environments.
By employing DRL, this study aims to find out whether AI-driven strategies can de-
liver superior risk-adjusted returns than traditional allocation methods. This research
addresses a critical need in finance for sophisticated, responsive strategies that align with
the unpredictable nature of contemporary markets and investor expectations.
1.2 Objectives and structure
The aim of this research is to examine the applicability and potential of deep reinforce-
ment learning agents in portfolio management by evaluating risk-adjusted returns against
traditional strategies, especially the 1/N allocation strategy. The 1/N allocation strategy,
also known as naive diversification strategy, is a very common strategy among investors
due to its ease of use and equal asset weights (DeMiguel et al., 2009, 1916-1917). For
academic purposes naive allocation strategy provides a solid benchmark with no regard
for market variation. Since financial markets environment is very complex and volatile,
there is demand for more advanced and adaptable predictive models that can process real-
time data to dynamically adjust portfolio allocations. DRL offers a platform to optimize
long-term rewards through continuous learning, which may offer significant advantages
over such traditional methods.
Despite the principles of EMH, evidence of market inefficiencies, such as behavioral biases
and temporary mispricings can provide opportunities for DRL to potentially outperform
traditional models. This leads to a null hypothesis in this study that DRL strategies will
not systematically outperform a market benchmark, aligning with EMH’s implication of
market unpredictability. Hypothesis testing within a DRL context provides a critical eval-
uation for the EMH under practical conditions and highlighting any noticeable deviations
that DRL may capitalize on.
To investigate this hypothesis, the study is examined by a main research question that is
whether can deep reinforcement learning (DRL) achieve better portfolio performance than
the naive 1/N allocation strategy in terms of risk-adjusted returns. To provide a clear
answer, this question is addressed with two sub-questions. The first sub-question seeks to
answer which DRL algorithms are most effective in optimizing portfolio allocation. Then
the second sub-question is a direct comparison of using DRL-based portfolio management
with the 1/N strategy, evaluating their performance based on risk-adjusted returns.
10
The rest of the thesis is organized as follows. Chapter 2 delves into the foundational fin-
ance theories behind portfolio management. Chapter 3 introduces the machine learning
framework to provide solid understanding of research methods and to contextualize the
potential and limitations of AI-driven strategies in theoretical context. Together these
sections supply the conceptual formation for subsequent modelling and benchmarking.
Additionally also wrapping earlier research results in to context. Chapter 4 discusses
methodology used in this research detailing data pipeline, agent environment, state rep-
resentation, model training, and statistical validation methods. Chapter 5 discusses and
represents empirical results providing observations from the outcome. Conclusions are
presented in Chapter 6, closing the research by summarising the contributions, acknow-
ledging limitations, and suggesting avenues for further research. Additionally, appendixes
for pseudo algorithms are provided in the end along with references. By progressing from
theory to method to evidence, this structure ensures that each chapter addresses a distinct
level of inquiry while laying the groundwork for the next, thereby providing a coherent
narrative from initial motivation to actionable insights.
11
2 THEORETICAL FRAMEWORK
2.1 Stock market predictability
The Efficient Market Hypothesis (EMH) is introduced in this chapter as a foundational
theory in finance. First proposed by Fama (1970), EMH states that financial markets are
informationally efficient, meaning the prices of assets fully reflect all available information
in markets. In an informationally efficient market, all new information such as earnings
news or macroeconomic data reflects into stock prices so fast that no trading strategy can
systematically earn excess risk-adjusted returns beyond what could be expected by chance.
In other words, one cannot "beat the market" consistently except by luck or by accepting
higher risk. Closely associated with EMH is so called Random Walk Theory. Malkiel
(2003) argues that stock market prices are following so-called "random walk" which means
that stock prices exhibit random and unpredictable patterns. Under EMH, price changes
follow a near-random walk because any predictable patterns would be arbitraged away by
informed traders (Malkiel, 2003, 59-72). Jones and Netter (2008) add that because new
information is randomly favourable or unfavourable towards expectations, all changes in
stock prices in an efficient market should be random and result as random walk in stock
prices.
This information efficiency is divided into three stages by Fama (1970): weak form effi-
ciency, semi-strong efficiency, and strong efficiency. The weak form efficiency is suggesting
that current prices are reflecting all past trading information, implying that technical ana-
lysis can not provide an edge against the markets. Semi-strong efficiency suggests that all
publicly available information is already reflected in the prices. This includes any public
information about the company beyond historical price data, such as earnings reports,
market news, and analyst reports. This form suggests that fundamental based analysis
can not gain consistent advantage over the markets. Strong-form efficiency, suggests that
even the insider information is priced in the assets, meaning that no investor could be
able to achieve excess returns even through information based strategies.
For a predictive model to be able to function in the context of stock markets, there needs
to be an assumption made about market efficiency. If markets represent full efficiency,
these predictive models would not be beneficial in forecasting asset performance. Addi-
tionally, according to Random Walk Theory price changes are independent and identically
distributed and also hold no memory meaning that historical prices give no indication of
future (Fama, 1965, 34). This leads to a point where any agent that tries to find patterns
in historical price data would theoretically find none that persist, as markets have no
memory beyond random noise. Therefore, full market efficiency would mean that any
12
changes in asset prices would be caused by new information, which cannot be predicted
by financial models. Additionally, if the EMH holds, at least in its weak or semi-strong
forms, it casts doubt on the efficacy of any complex trading strategy.
The EMH provides a skeptical baseline: any observed outperformance by a DRL strategy
might indicate a challenge to market efficiency, or it might result from luck, selection
bias, or use of information not fully appreciated by the market. It is worth noting that in
practice, markets are not perfectly efficient. Numerous anomalies and behavioral factors
can create opportunities to find patterns from data that provide excessive returns. How-
ever, the persistence of anomalies remains a subject of ongoing debate, as recent empirical
studies produce mixed evidence, suggesting that market efficiency can evolve over time
in response to changing market conditions and the widespread dissemination of anomaly-
related research (Schwert, 2003, 968).
Within the EMH framework, a DRL policy has zero expected risk-adjusted excess returns
relative to an appropriate asset-pricing model, net of transaction and implementation
costs. Beating a cap weighted benchmark can arise from risk exposures, rebalancing,
or sampling variation and does not by itself violate efficiency. This reflects the joint-
hypothesis problem which means that tests of efficiency are inseparable from the asset-
pricing model used to measure excess returns.
Behavioral finance presents an alternative perspective to EMH, suggesting that markets
are not always efficient due to psychological biases and cognitive limitations that affect
investor behaviour. Barberis and Thaler (2003) provide a comprehensive overview of
these biases, highlighting how overreactions, underreactions, and crowd behavior can lead
to temporary market inefficiencies. This framework suggests that DRL models could, in
theory, exploit these inefficiencies to achieve excess returns, especially in situations where
investor behavior diverges from rational expectations.
In practice, DRL algorithms could potentially be tailored to detect and capitalize on
behavioral biases, particularly in volatile markets. By doing so, DRL may challenge
the EMH perspective, suggesting that, markets may have exploitable inefficiencies that
advanced ML algorithms can adapt to.
The Adaptive Market Hypothesis (AMH), proposed by Lo (2004), integrates EMH with in-
sights from behavioural finance, suggesting that market efficiency is not static but evolves
based on environmental conditions and the experiences of market participants. In AMH,
markets fluctuate between periods of efficiency and inefficiency as investors adapt to new
information and competitive pressures.
13
In earlier days Black (1986) argued that prices are typically “efficient within a factor of
two” because the noise supplied by uninformed traders makes markets liquid yet leaves
persistent mis-pricings. This fluid definition of efficiency anticipated the Adaptive Mar-
ket Hypothesis by recognising that rational and irrational forces coexist and vary over
time. Two centuries of data analysed by Bouchaud et al. (2017) confirm Black’s intuition:
markets exhibit medium-term trending that eventually mean-reverts on multi-year hori-
zons, a pattern the authors attribute to an adaptive tug-of-war between trend-following
“chartists” and valuation-driven “fundamentalists.” Together, these studies portray mar-
kets as noisy-efficient: liquid enough to provide abundant data, but systematically dis-
torted enough to offer transient opportunities. Additionally, a recent multi-scale research
by Safari and Schmidhuber (2025) indicates that efficiency is horizon dependent: prices
usually mean-revert at very short (under 15 minutes) and very long (over 2 year) ho-
rizons, whilst indicating trend persistence from 30 minutes up to two years. This was
measured by trend following premium, captured by linear trend coefficient β, that has de-
clined steadily since the early 1990s. This indicates that competitive capital progressively
arbitrages away patterns that were previously profitable.
This evolution allows for occasional inefficiencies that advanced models, like DRL, may
exploit. Empirical evidence by Safari and Schmidhuber (2025), shows that the linear
coefficient β flips sign outside the 30 minute to 2 year window, whilst the cubic term c
alternates accordingly. This is a pattern that they interpret as evidence of how markets
hover near a self organised critical point that peridodically shifts between efficient and
inefficient regimes. Such cyclical behavior provides concrete empirical support for Lo’s
view and implies that adaptive algorithms can bring value by continually inferring where
the market sits on the efficiency spectrum. However, in practice this means that trends in
financial markets tend to reverse before coming statistically significant, which implies that
when a trend is obvious on a price chart, it is typically already over which highlights the
short time period learning cycles for trading models. For DRL, AMH provides a theoretical
basis for developing adaptive strategies that respond dynamically to market shifts. This
aligns well with DRL’s strength in learning from and adapting to new patterns, suggesting
that DRL could be particularly effective in markets that exhibit cyclical inefficiencies.
2.2 Portfolio theory
Modern Portfolio Theory (MPT), introduced by Markowitz (1952), provides the found-
ational quantitative framework for portfolio selection. MPT formalizes the concept of
diversification: by holding a portfolio of assets, an investor can reduce unsystematic risk,
also known as asset-specific risk, and achieve a better trade-off between risk and return.
The key insight is that the risk of a portfolio, measured as variance or standard deviation
14
of returns, is not simply the weighted sum of individual volatilities, but also depends on
the correlations between asset returns. Markowitz showed that for a given level of expec-
ted return, an investor should choose the portfolio with minimum variance. Conversely,
for a given variance level, investor should choose the portfolio with maximum expected
return. The set of such optimal portfolios is known as the efficient frontier.
In the classical Markowitz mean-variance framework the decision variable is the weight
vector w = (w1, w2, . . . , wN), whose elements specify the fraction of total wealth invested
in each of the N available assets. Assuming the portfolio is fully invested with no leverage,
the weights satisfy the single budget constraint
∑︁N
i=1wi = 1.
This can be framed as a quadratic optimization:⎧⎪⎪⎪⎪⎪⎪⎨⎪⎪⎪⎪⎪⎪⎩
min
w
wTΣw
s.t. wTµ = R¯,∑︂
i
wi = 1,
where µ is the vector of expected returns for each asset, Σ is the covariance matrix of asset
returns, and R¯ is a target expected return. An equivalent single-objective formulation
uses a Lagrange multiplier λ > 0 to trade off expected return and risk:
max
w
wTµ − λ
2
wTΣw,
where λ controls the trade-off between return and risk. This framework yields a specific
optimal portfolio for given inputs µ and Σ.
A DRL-based approach does not explicitly have to compute µ or Σ, but if the policy is
conditioned to maximize the risk-adjusted return, then it should implicitly try to time
and allocate assets to maximize returns and manage risk. In effect, the DRL agent could
learn a dynamic portfolio strategy that might achieve a better risk-return trade-off than
any static weights. One may view a well-trained DRL agent as trying to approximate
the moving target of the optimal portfolio under changing market conditions. Tradi-
tional mean-variance optimisation can of course also be made dynamic by re-estimating
µ and Σ on a rolling window and rebalancing periodically, yet each update still solves a
static quadratic programme based only on first- and second-moment estimates from that
window. By contrast, a DRL policy can react at every time step, exploit higher-order or
path-dependent features, and adapt nonlinearly to structural breaks. This way any added
value is therefore most likely to emerge in nonstationary, regime-shifting markets where
window-based Markowitz reoptimizations may lag.
15
The 1/N strategy is a naive application of diversification where it ignores µ and Σ al-
together, simply allocating equally. Surprisingly, DeMiguel et al. (2009) noted that 1/N
often lies not too far from the efficient frontier in practice because estimation errors in µ
and Σ can make optimized portfolios perform worse out-of-sample. For example, if expec-
ted returns are overestimated for some stocks, a mean-variance optimizer will overweight
them, often leading to poor realized performance. The 1/N strategy avoids estimation
altogether, thereby avoiding overfitting to historical data. DeMiguel et al. (2009) showed
that only with an extremely long estimation window, with thousands of months of data for
a moderate asset universe, can optimized portfolios statistically outperform 1/N portfolio.
The Capital Asset Pricing Model (CAPM), developed by Sharpe (1964) and others, builds
on Modern Portfolio Theory (MPT) to provide an equilibrium model of asset prices.
CAPM asserts that in equilibrium, investors will hold some combination of the risk-free
asset and the market portfolio (the portfolio of all risky assets weighted by market value).
CAPM’s central equation is:
E[Ri] = Rf + βi(E[Rm]−Rf ),
where E[Ri] is the expected return of asset i, Rf is the risk-free rate, E[Rm] is the expected
return of the market portfolio, and
βi =
Cov(Ri, Rm)
Var(Rm)
.
Asset i’s beta, βi, is a measure of its systematic risk relative to the market.
This equation implies that an asset’s excess return E[Ri]−Rf is proportional to its beta,
meaning that asset-specific risk is not rewarded because it can be diversified away. The
market portfolio itself has β = 1 and provides the highest Sharpe ratio achievable by any
combination of risky assets in the CAPM world. If CAPM holds, any portfolio’s perform-
ance should be evaluated in terms of alpha, which means excess return beyond what its
beta would predict. A consistently positive alpha would indicate skill or exploitation of
market inefficiency.
For example, if a DRL strategy has a certain beta exposure to the market, part of its
returns might just reflect being leveraged to the market or certain risk factors. We might
find that a DRL agent is implicitly tilting the portfolio towards higher-beta stocks or
certain sectors to chase returns, something that 1/N does not do, since 1/N is neutral in
not favoring any stock beyond equal weighting. It will be examine whether DRL strategies
yield excess returns unexplained by market risk (alpha).
16
CAPM is a single-factor model (the market). More modern approaches like the Fama-
French multi-factor models incorporate more factors such as size, value, and momentum.
If the DRL agent exploits something like momentum, one could argue it is harvesting a
known factor rather than truly "inventing" a new strategy. In academic terms, beating the
1/N and even beating the market might simply mean the agent loaded on known factors
without taking into account the potential confounders that might have influenced the
outcome. For example, it might learn to overweight high-momentum stocks, effectively
implementing a momentum strategy. This is not necessarily trivial, it would show the
agent learned a sensible strategy but it would not violate market efficiency if those factors
are known risk premia. Long-run mutual-fund evidence likewise shows that most apparent
alpha disappears once returns are benchmarked against multi-factor models, implying that
any “outperformance” is explained almost entirely by factor exposures and expenses, not
genuine skill (Carhart, 1997, 57-58). On the other hand, if the agent’s performance cannot
be explained by exposures to common factors or higher beta, it could indicate finding some
niche inefficiency or superior timing ability.
In summary, classical finance provides normative models of how portfolios should be man-
aged under certain assumptions (rational investors, efficient markets, multivariate normal
returns, etc.). These models yield elegant solutions but often fall short in practice due to
model mis-specification, estimation error, or inability to adapt to real-world complexities
like transaction costs, changing distributions, and investor behavior biases. Data-driven
machine learning approaches, including DRL, take a different path: rather than assuming
a model for returns, they learn directly from data. This can potentially capture phenom-
ena that static models miss. For example, a DRL agent could learn a dynamic allocation
strategy that increases equity exposure in rising markets and shifts to defensive assets
during downturns, a form of market timing that static MPT do not allow since those
assume a fixed allocation to each asset or risk factor.
However, ML approaches are not guaranteed to find a truly optimal strategy. They
require large amounts of data, and they risk overfitting to historical patterns that may
not repeat. Moreover, they typically lack the clear theoretical guarantees of classical
methods. For example, one might achieve a good Sharpe ratio in backtesting with a
complex network, but it’s harder to prove why it should continue. This is where combining
insights is valuable: for instance, incorporating ideas from finance (like risk aversion or
transaction cost penalties) into the DRL reward function can guide the learning to more
sensible solutions. Indeed, recent studies have tried to merge domain knowledge with
DRL, such as using Modern Portfolio Theory within the reward design (Zhang et al.,
2019) or momentum strategies (Wang et al., 2019) as a guide.
17
2.3 Previous studies
One of the early demonstrations of deep reinforcement learning in portfolio management
was by Jiang et al. (2017), who proposed a framework called Ensemble of Identical In-
dependent Evaluators (EIIE) for cryptocurrency trading. They employed an ensemble of
deep neural networks as policy networks to allocate a portfolio over multiple cryptocur-
rencies. Their results indicated that RL-based portfolios outperformed several traditional
strategies in backtests, even when accounting for substantial transaction costs. This was
a striking result given the high volatility and noise in crypto markets, suggesting that the
DRL agent could extract useful trading signals. Around the same time, another early
practical implementation emerged by Almahdi and Yang (2017) employed a recurrent
reinforcement learning agent to trade an equity portfolio with the Calmar ratio as the
objective. They reported that on a risk-adjusted basis this approach outperformed mean-
variance optimization. These early results helped catalyze an active research area at the
intersection of finance and machine learning.
Following, Liu et al. (2018) extended the idea by applying DDPG to portfolio optimiz-
ation. Their study showed remarkable performance improvement by using a continuous
action RL method (DDPG) compared to prior policy gradient methods. This implied
that allowing the agent to finely adjust portfolio weights (rather than picking discrete ac-
tions) and to learn from off-policy data can yield better trading performance. Their work
provided evidence that DRL can handle the multi-asset allocation problem effectively and
motivated the inclusion of DDPG in this comparative study.
Another notable line of research involves incorporating domain knowledge into DRL. For
example, Wang et al. (2019) introduced AlphaStock, which combined a momentum-based
strategy with deep RL. The RL agent in AlphaStock had two components: one focusing
on buying recent winners and another on selling losers, echoing the “buy winners, sell
losers (BWSL)" momentum strategy. By optimizing for the Sharpe ratio and using an
attention-based neural network, AlphaStock’s agent was not a pure black-box but it was
guided to exploit a known anomaly (momentum) in an intelligent way. The authors also
performed a sensitivity analysis on stock features, finding that the learned strategy favored
stocks with low volatility and high long-term growth, aligning with intuitive investment
principles (avoid extremely volatile stocks, prefer fundamentally strong ones). This kind
of result is encouraging, as it shows RL can rediscover sensible patterns and also highlights
the importance of interpretability (they could identify what features were important to
the agent’s decisions).
Ye et al. (2020) took a different approach by augmenting the state space of the RL
agent with predictions of asset movements. They used an LSTM-based predictive model
18
to forecast short-term returns for each asset and fed these predictions as part of the
RL agent’s state (along with other market features). Their RL algorithm (based on
a policy gradient method) could then use this enriched state information to allocate
the portfolio. Essentially, this merges supervised learning (predictive modeling) with
reinforcement learning (portfolio decision-making). The results in their research showed
improved performance, suggesting that hybrid models that use external signals or models
can enhance a pure RL approach.
Yang et al. (2020), proposed an ensemble strategy that dynamically switches between
PPO, A2C, and DDPG based on market conditions. This is very relevant to this work
because they literally considered the same algorithms this study is researching. In their
approach, they trained separate agents with PPO, A2C, DDPG and then developed a
meta-agent that looks at market volatility/regime indicators to decide which agent’s ac-
tions to follow at a given time. They reported that this ensemble preserved robustness
across different market scenarios. For instance, in stable market periods a value-based or
deterministic strategy might do well, whereas in highly volatile times a more exploratory
strategy might cope better. Their approach successfully underscores that no single al-
gorithm may be universally best but each has strengths under certain conditions. Their
ensemble achieved better overall performance than any single algorithm by essentially
performing an algorithm selection based on regimes. This provides an interesting per-
spective: rather than seeking one champion algorithm, combining them could yield more
consistent results.
Beyond equity portfolios, similar DRL methods have been applied to other financial prob-
lems. For example studies like Nevmyvaka et al. (2006) and Beysolow II (2019) applied
RL for market making and execution, where RL algorithms (including DDPG variants)
were used to optimize order placement in limit order books. For asset allocation with
different assets, some studies looked at portfolios including bonds, commodities, etc. For
example, a method called DeepPocket by Soleymani and Paquet (2021), represented the
portfolio as a graph of assets to capture relationships, and used a graph convolutional
network with RL to manage a multi-asset portfolio. Such approaches show the flexibility
of deep RL in handling complex relationships.
Beyond standard single-agent RL, researchers have started to explore multi-agent and
meta-learning approaches. Lee et al. (2020) introduced a multi-agent reinforcement learn-
ing system called MAPS (Multi-Agent Portfolio Management System), in which multiple
agents each learn distinct portfolio strategies and collectively form an ensemble portfolio.
The agents are trained with a diversity seeking objective so that each specializes in dif-
ferent market conditions, thereby achieving a more robust overall performance through
diversification. Meanwhile, meta-reinforcement learning (meta-RL) is being examined as
19
a way to handle non-stationarity. The concept is to train an agent that can quickly adapt
to new market regimes by learning how to learn. For instance, recent work by Tian et al.
(2024) applies model agnostic meta-learning (MAML) to trading, pairing a PPO meta-
learner with a fast adaptive learner, and shows improved performance in fast-changing
markets. Though still nascent, meta-RL could allow an agent to generalize knowledge
from one market or period (say, a bull market) to another (a sudden crash) with minimal
additional training which is a valuable trait given the covariate shifts in finance.
Survey by Bai et al. (2024) reviewed RL in finance, stating that while many papers claim
positive results, the field faces issues like lack of standardized benchmarks, difficulty in
reproducing results, and insufficient testing of robustness. That said, there has been
positive moves, for example Liang et al. (2018) published a GitHub repository for their
work, and others like Liu et al. (2018) also open sourced their trading agent code. Even
with code, hyperparameter tuning is still a weakness. RL algorithms have many settings
(learning rate, exploration noise, etc.) and finance provides no clear solution to tune
against except final profit, which is a high variance metric. Liang et al. (2018) specifically
experimented with different learning rates and network structures and found that some
algorithms (like DDPG) were very sensitive to tuning, sometimes getting stuck in local
optimum. This suggests RL needs carefully calibrated hyperparameters to perform well
on financial problems, which is a drawback compared to more straightforward methods.
Table 1 below contains a summary of recent studies in the field.
Study (Year) RL Algorithm(s) Market / Context
Moody et al. (1998) PG (RRL) Single asset
Almahdi and Yang (2017) PG (RRL) Multi-asset equities
Jiang et al. (2017) AC (DDPG) Cryptocurrency
Jiang and Liang (2017) PG (direct) Cryptocurrency
Pendharkar and Cusatis (2018) VB (Q-learning) Equity indices
Liang et al. (2018) AC (DDPG, PPO) China A-shares
Xiong et al. (2018) AC (DDPG var.) U.S. equities
Buehler et al. (2019) AC (Deep Hedging) Options hedging
Jeong and Kim (2019) VB (DQN, transfer) U.S. single stocks, daily
Ye et al. (2020) AC (PPO, DDPG/PG) “HighTech” portfolio
Lee et al. (2020) Multi-agent (PPO/agent) U.S. equities
Huang et al. (2020) AC (DDPG/PPO) China A-shares
Park et al. (2020) VB (DQN) Korean equities
Wang et al. (2021) PG+PPO (two-unit) Global stock indices
Wu et al. (2021) PG (CNN/RNN policy) U.S. stocks & sector ETFs
Betancourt and Chen (2021) AC (PPO, RNN policy) Cryptocurrency
Wu et al. (2021) AC (CNN/RNN policies) Taiwan 50 and S&P 500
Table 1: Summary of representative studies on reinforcement learning in portfolio man-
agement. Abbrev: PG = Policy Gradient, AC = Actor–Critic, VB = Value-based.
20
3 MACHINE LEARNING FRAMEWORK
3.1 Neural Networks and the Backpropagation Algorithm
Deep learning is a subfield of machine learning that uses artificial neural networks with
multiple processing layers to learn data representations at increasing levels of abstraction.
In a deep neural network, each layer transforms its input (the outputs of the previous layer)
into a more abstract representation, enabling the model to capture complex patterns in
data. The concept of training such networks dates back several decades, but it was the
reintroduction of effective backpropagation combined with large datasets and computing
power that led to recent breakthroughs in speech recognition, computer vision, natural
language processing, and many other domains. The process described in this section forms
the foundation of modern deep learning methods and closely follows the work of Lecun
et al. (2015).
The backpropagation algorithm is used to efficiently train these deep models by computing
how the network’s parameters (weights and biases) should be adjusted to minimize errors.
Formally, backpropagation applies the chain rule of calculus to propagate the gradient of
a loss function (which measures prediction error) from the output layer back through the
hidden layers, accumulating partial derivatives. This allows gradient-based optimizers
such as gradient descent to incrementally update the network’s weights and improve
performance. In practice, variants like stochastic gradient descent (SGD) are used, where
network parameters θ are updated iteratively:
θ ← θ − η∇θE(θ),
where η is the learning rate and E the loss function. Through iterative adjustments, the
network learns to approximate the complex function that maps inputs to outputs, often
achieving high accuracy on training data. Importantly, deep neural networks (those with
many hidden layers) can model extremely nonlinear relationships, given sufficient data
and computational resources.
Figure 1 illustrates a fully connected feed-forward network with two hidden layers H1 and
H2 and one output layer. Let the input be x = (x1, . . . , xn). For any unit u, denoted
by zu its pre-activation (the affine input before the nonlinearity) and by yu its activation
(output). Using indexing as wab is the weight from source unit a to destination unit b
(e.g. wij connects input i to hidden unit j). Set yi = xi for input units i ∈ Input.
21
The forward pass is then
zj =
∑︂
i∈Input
wij yi + bj, yj = f(zj) (j ∈ H1),
zk =
∑︂
j∈H1
wjk yj + bk, yk = f(zk) (k ∈ H2),
zl =
∑︂
k∈H2
wkl yk + bl, yl = f(zl) (l ∈ Output).
Here f(·) is the activation function. Inputs propagate as activations y to the next layer,
each unit forms a pre-activation z via a weighted sum plus bias, and the nonlinearity
produces the next activations.
Figure 1: A multilayer neural network with two hidden layers and one output layer.
(Lecun et al., 2015)
The activation function introduces non-linearity, which is crucial. Without it, a stack
of linear neurons would collapse into an equivalent single linear model. In modern deep
networks, the most commonly used activation functions are the logistic sigmoid, the hy-
perbolic tangent (tanh), the Rectified Linear Unit (ReLU), and (for multi-class outputs)
the softmax.
In binary classification, the logistic sigmoid is defined by
σ(x) =
1
1 + e−x
.
It maps real inputs to the open interval (0, 1), enabling a probabilistic interpretation.
22
A downside is saturation near 0 and 1, where the derivative becomes small and learning
can suffer from vanishing gradients.
The hyperbolic tangent,
tanh(x) =
ex − e−x
ex + e−x
,
takes values in (−1, 1) and is zero-centered, which typically yields more balanced updates
around the origin and in practice often stronger gradients than the sigmoid.
The Rectified Linear Unit (ReLU) is defined piecewise as
ReLU(x) =
⎧⎨⎩0, x < 0,x, x ≥ 0,
where negative inputs are set to zero while positive inputs pass through linearly. ReLU
is computationally efficient and helps mitigate vanishing gradients (Glorot et al., 2011,
p. 318).
For multi-class outputs, the softmax function converts a vector x ∈ RK into a categorical
distribution:
softmax(x)i =
exi∑︁K
j=1 e
xj
, i = 1, . . . , K.
The resulting components are non-negative and sum to one, providing a probabilistic
interpretation over K classes. Figure 2 visualizes these activation functions.
Figure 2: Different activation functions. (Musiol, 2016).
23
Once the hidden layer activations are computed, they are propagated forward to the next
layer. The same process of weighted sums and nonlinear activation continues until the
final layer (the output layer), which provides the network’s prediction. In classification
tasks, for instance, the output layer often uses a softmax or sigmoid function, while in
regression tasks, the output can be linear or another suitable function.
Training the network consists of finding the set of weights {wij, wjk, . . .} that minimizes
a cost function E, which measures the discrepancy between the network’s prediction and
the target. The key to performing this minimization is the backpropagation algorithm,
which relies on the chain rule of derivatives (Figure 3).
Figure 3: An illustrative example of the chain rule. (Lecun et al., 2015).
If x affects y according to ∂y
∂x
, and y affects z according to ∂z
∂y
, then a small change ∆x in
x induces a change ∆z = ∂z
∂y
∂y
∂x
∆x. This principle naturally extends to multiple variables
and is crucial in deriving the backpropagation equations.
After the forward pass, the error at the output layer is computed by comparing the
network’s output yl with the target tl. A common choice is the mean squared error
E = 1
2
∑︁
l (yl − tl)2 , whose derivative with respect to yl is simply ∂E∂yl = yl − tl. Using
the chain rule again, we obtain the error derivative with respect to the net input zl:
∂E
∂zl
= ∂E
∂yl
· ∂yl
∂zl
.
For a hidden unit k with pre-activation zk =
∑︁
j wjkyj + bk and activation yk = f(zk),
the error with respect to its output is obtained by summing the downstream error signals
weighted by the outgoing weights to all units l in the next layer:
∂E
∂yk
=
∑︂
l
wkl
∂E
∂zl
,
24
since zl =
∑︁
wulyu+bl implies ∂zl/∂yk = wkl. This is then converted to the pre-activation
derivative via the activation slope,
∂E
∂zk
= f ′(zk)
∂E
∂yk
.
Finally, the gradient of a weight from neuron j (previous layer) to neuron k (current layer)
is
∂E
∂wjk
= yj
∂E
∂zk
.
Figure 4 shows a diagram of how these derivatives move backward in the network.
Figure 4: Schematic diagram showing the backward pass. (Lecun et al., 2015).
3.2 Reinforcement Learning
Sutton et al. (1998) introduced reinforcement learning (RL) as a paradigm of machine
learning in which an agent learns to make sequential decisions by interacting with an
environment, with the goal of maximizing cumulative reward. In contrast to supervised
learning, where the correct actions are provided for each example, an RL agent must
discover good strategies through trial and error. Sutton et al. (1998) famously defined
reinforcement learning as “learning what to do - how to map situations to actions - so as
to maximize a numerical reward signal.” In this framework, the agent is not explicitly told
which action to take at any state but instead it receives feedback in the form of rewards
and must learn from experience. Over time, through repeated interaction, the agent
improves its policy (decision making strategy) to achieve higher accumulated rewards.
25
Figure 5: The agent-environment interaction in reinforcement learning. (Sutton et al.,
2018, 48).
According to Sutton et al. (1998), the standard formalization can be presented as a Markov
Decision Process (MDP). This can be defined by a 4-tuple (S,A, P,R), where S is the set
of states, A the set of actions, P (s′ | s, a) the transition kernel, and R(s, a) the immediate
reward. Additionally, to ensure that the total reward sum converges, a discount factor
0 ≤ γ ≤ 1 that weights future rewards, is used here. At each time step t, the agent
observes state st ∈ S, chooses an action at ∈ A according to some policy π(at | st),
and the environment transitions to a new state st+1 and provides a reward according to
function R(st, at).
The value of a state under policy π is
Vπ(s) = Eπ
[︄ ∞∑︂
t=0
γtR(st, at)
⃓⃓⃓⃓
⃓S0 = s
]︄
,
and the state-action value is
Qπ(s, a) = Eπ
[︄ ∞∑︂
t=0
γtR(st, at)
⃓⃓⃓⃓
⃓S0 = s, A0 = a
]︄
.
These satisfy the Bellman expectation equations:
Vπ(s) =
∑︂
a
π(a | s)
[︄
R(s, a) + γ
∑︂
s′
P (s′ | s, a)Vπ(s′)
]︄
,
Qπ(s, a) = R(s, a) + γ
∑︂
s′
P (s′ | s, a)
∑︂
a′
π(a′ | s′)Qπ(s′, a′).
The optimality equations are
V ∗(s) = max
a
[︄
R(s, a) + γ
∑︂
s′
P (s′ | s, a)V ∗(s′)
]︄
,
Q∗(s, a) = R(s, a) + γ
∑︂
s′
P (s′ | s, a)max
a′
Q∗(s′, a′),
26
and the optimal policy chooses
π∗(s) = argmax
a
Q∗(s, a).
In most real-world problems, including portfolio management, the state and action spaces
are large (possibly continuous), and the environment dynamics are unknown. The Bellman
equations cannot be solved exactly, so instead approximate values or policies are used
and iteratively improved which is the essence of reinforcement learning. In the portfolio
application, the control problem is cast in the MDP framework. Let Gt+1 = 1 + Rt+1
denote next period gross and the portfolio weights at time t to be denoted by wt =
(wt,1, . . . , wt,n), wt,i ≥ 0 and
∑︁n
i=1wt,i = 1. Wealth evolves according to
Wt+1 = Wt (w
⊤
t Gt+1), W0 > 0.
Then the portfolio-specific state to be st = (Xt, wt−1), where Xt collects predictive fea-
tures/market signals, and the action to be at = wt. A natural per-period reward for
logarithmic growth with trading frictions C(wt, wt−1) ≥ 0 is
rt = log(w
⊤
t Gt+1) − C(wt, wt−1).
Over a finite horizon T (or in episodic tasks with γ = 1),
logWT = logW0+
T−1∑︂
t=0
log(w⊤t Gt+1) ⇒ E
[︄
T−1∑︂
t=0
rt
]︄
= E[logWT ] − E
[︄
T−1∑︂
t=0
C(wt, wt−1)
]︄
,
so maximizing the expected cumulative reward coincides with maximizing expected ter-
minal log-wealth net of costs, aligning the RL objective with the classical log optimal
criterion.
When trading costs are set to zero (C = 0), there is no market impact (actions do not
affect future return distributions or features), and Xt+1 ∼ P (· | Xt) is action-independent,
Bellman’s recursion decomposes and the optimal decision becomes myopic:
w⋆t ∈ arg max
w∈∆n
E
[︁
log(w⊤Gt+1) |Xt
]︁
.
In other words, the long-horizon “coupling” between actions is absent by construction
because current actions do not influence future states or costs and the problem reduces
to a contextual bandit. A special case is i.i.d. returns with no informative features, which
yields a time-invariant Kelly portfolio
w⋆ ∈ arg max
w∈∆n
E
[︁
log(w⊤G)
]︁
.
27
Intertemporal coupling reappears once one introduces trading costs C(wt, wt−1) > 0,
action-dependent dynamics/market impact, or path-dependent constraints (e.g., leverage,
turnover, drawdown, taxes). In those cases the problem is genuinely dynamic and long-
horizon credit assignment becomes essential.
The objective is to find a policy that maximizes expected discounted return,
J(π) = Eπ
[︄
T−1∑︂
t=0
γt rt
]︄
,
with T finite or infinite. The policy-gradient theorem provides a direct optimization route
with:
∇θJ(πθ) = E
[︂
∇θ log πθ(at | st) Aπθ(st, at)
]︂
,
where Aπ(s, a) = Qπ(s, a)− Vπ(s) is the advantage. A common estimator is Generalized
Advantage Estimation (GAE) (Schulman et al., 2015, 5), which trades bias and variance
via λ ∈ [0, 1]:
δt = rt + γ Vϕ(st+1)− Vϕ(st), ˆ︁A(λ)t = ∞∑︂
l=0
(γλ)l δt+l.
In Deep RL, neural networks are used as function approximators for the policy and/or
value functions. The neural network parameters serve as θ (for policy) and ϕ (for the
value function), which are optimized via stochastic gradient descent on appropriate loss
functions. For instance, a policy network might output a probability distribution over
actions or parameters of a distribution (e.g. mean and variance for continuous actions),
and then adjust its weights to increase the probability of actions that lead to higher
returns. The policy of a Markov Decision Process can be optimised with a wide range of
reinforcement learning algorithms. Numerous methods have been proposed, each excelling
under different trade-offs in data quality, sample efficiency, and computational cost. At
the highest level under deep reinforcement learning these algorithms are usually divided
into model-based and model-free approaches
Model-based RL learns (or is given) an explicit model of the environment’s transition
dynamics and reward function. The policy is then improved by planning, which refers
to a computational process that searches an optimal path from the state space (Sutton
et al., 2018, 160-163). In the model-free setting an agent foregoes any explicit model of the
environment’s dynamics and instead shapes its behaviour solely through sampled trans-
itions (St, At, Rt+1, St+1). The most widely used learning signal is Temporal-Difference
(TD) error, where value estimates are forwarded toward a bootstrap target that already
contains the current prediction, thus blending the low variance of one-step look-ahead
28
with the long-range credit assignment of simulation returns.
Where TD methods treat the policy as an implicit consequence of a value function,
policy-gradient theory poses control directly as stochastic optimisation in parameter space.
These policy-based methods directly learn the policy π(a | s; θ) parameterized by θ (the
weights of a neural network). These methods adjust θ to maximize J(π) using gradient
ascent. A famous example is the REINFORCE algorithm by Williams (1992). Although
unbiased, this estimator suffers from high variance. Subtracting a learned baseline greatly
reduces that variance without altering the expectation, motivating the actor-critic archi-
tecture in which a critic trained by TD stabilises the policy updates of an actor (Sutton
et al., 2018, pp. 325–332). These actor-critic methods combine both, an actor (policy)
that decides actions and a critic (value function) that critiques them. Deep reinforcement
learning became widely practical when neural networks were paired with policy-gradient
algorithms that stabilise assignments through the use of an auxiliary critic. The critic
helps reduce variance in policy gradient updates by providing an estimate of how good an
action was compared to an average baseline. Another key approach is value-based meth-
ods that learn an estimate of Q∗(s, a) (or V ∗(s)) and derive the policy from it. Classic
examples: Q-learning (Watkins and Dayan, 1992) and its deep neural network variant
Deep Q-Network (DQN) by (Mnih et al., 2016).
PPO, DDPG, and A2C were chosen as they are among the most prominent DRL methods,
each representing a different approach to policy learning. Each algorithm has been used
in prior finance research. For instance, A2C/A3C has been applied in trading scenarios
for its simplicity and parallel training capabilities, DDPG has been employed to optimize
trading strategies with continuous position sizes, and PPO has gained popularity in many
domains due to its reliability, including ensemble approaches for stock trading.
3.3 DRL Algorithms
3.3.1 Advantage Actor-Critic
Advantage Actor–Critic (A2C) is a policy-gradient method that trains two networks to-
gether and combines their results to learn effectively. These networks include an actor
that tries to estimate what should be done next, and a critic that evaluates states as how
good is the current situation. A2C may be viewed as the synchronous counterpart of
the earlier asynchronous A3C algorithm by Mnih et al. (2016), operating within a single
process rather than relying on parallel workers. This synchronization simplifies imple-
mentation and often makes training more reproducible without changing the underlying
learning signals.
29
Formally, the actor defines a stochastic policy πθ(a | s) that maps a state s to a distribution
over actions, whereas the critic approximates the state-value function Vϕ(s), which is the
expected discounted return available from s when following the current policy. Using
the critic’s value estimate as a baseline, A2C centers the policy-gradient update on the
advantage, which measures how much better or worse a chosen action performed compared
with the policy’s average expectation in that state. The advantage at time t is
Aˆt = Rˆt − Vϕ(st),
where Rˆt in A2C is an n-step bootstrap return,
Rˆt =
n−1∑︂
i=0
γirt+i + γ
nVϕ(st+n),
with discount factor γ ∈ (0, 1]. Intuitively, Aˆt > 0 signals that the taken action was better
than expected, and Aˆt < 0 that it was worse. Using advantages reduces the variance of
gradient estimates while keeping them unbiased, which typically yields faster and more
stable learning than a plain policy gradient.
The actor is updated to increase the likelihood of advantageous actions and decrease the
likelihood of disadvantageous ones. The canonical policy-gradient direction is
∇θJ(θ) = Et
[︂
Aˆt∇θ log πθ(at | st)
]︂
.
In parallel, the critic is trained as a regressor to fit the bootstrap returns, by minimizing
a squared-error loss,
Lvalue(ϕ) =
(︂
Rˆt − Vϕ(st)
)︂2
.
To avoid premature collapse of the policy’s exploration, A2C commonly adds an entropy
bonus that rewards higher-entropy action distributions. This is implemented as a regu-
larizer in the loss:
Lentropy = −β Et[H(πθ(· | st))] ,
where H is the Shannon entropy and β > 0 controls the strength of exploration.
These components are combined into a single objective optimized by stochastic gradient
methods. Writing the training criterion as a loss to be minimized,
L(θ, ϕ) = −Et
[︂
Aˆt log πθ(at | st)
]︂
+ c1Et
[︃(︂
Rˆt − Vϕ(st)
)︂2]︃
− c2Et[H(πθ(· | st))] ,
with weights c1, c2 ≥ 0 tuning the relative importance of value-function accuracy and
exploration. A typical training iteration proceeds by rolling out the current policy to
30
collect short trajectories (st, at, rt, st+1), computing n-step returns and advantages, and
then applying gradient steps that simultaneously push the actor toward actions with
positive advantages and refine the critic to make value estimates better predictors of
future returns. This synchronized loop is repeated until the policy stabilizes. Pseudocode
provided in Appendix 1.
A2C benefits lie in its balance of bias and variance. The critic’s baseline reduces the
variance of policy-gradient estimates without introducing bias, the multi-step bootstrap
returns trade off short-horizon noise and long-horizon bias, and the entropy bonus sus-
tains exploration early in training. In practice, these ingredients yield a simple, efficient
algorithm that often learns faster and more stably than vanilla policy gradients while
avoiding the engineering overhead of asynchronous methods.
3.3.2 Proximal Policy Optimization
Proximal Policy Optimization (PPO), proposed by Schulman et al. (2017), is an on-policy
policy gradient method that achieves stable and reliable training by not allowing large
updates to the policy at once. Like Advantage Actor–Critic (A2C), it trains an actor to
select actions and a critic to estimate values, but its hallmark is how it constrains each
policy update so the new policy does not drift too far from the old one in a single step.
This addresses a well-known failure mode of vanilla policy gradients, where overly large
gradient steps can push the policy into regions with poor performance and high variance.
PPO can be seen as a simplified version of the earlier Trust Region Policy Optimization
(TRPO) method, which enforced a hard constraint on the change in policy per update.
PPO instead uses a soft constraint by clipping the policy change.
PPO maximizes expected return but augments it with a safety mechanism that prevents
overly large policy changes. This is done by comparing the new policy πθ to the old policy
πθold via the probability ratio
rt(θ) =
πθ(at | st)
πθold(at | st)
.
If rt(θ) > 1, the new policy assigns higher probability to the sampled action than before
and if rt(θ) < 1, it assigns less. As in A2C, updates are guided by an advantage estimate
Aˆt, which scores how much better (positive) or worse (negative) the sampled action was
relative to the state’s baseline value.
The PPO surrogate objective maximized in practice is the clipped objective
LCLIP(θ) = Et
[︂
min
(︂
rt(θ) Aˆt, clip(rt(θ), 1− ϵ, 1 + ϵ) Aˆt
)︂]︂
.
The clip(·) operator cuts the ratio to the interval [1− ϵ, 1+ ϵ] for a small ϵ (e.g., 0.1–0.2).
31
Intuitively, when Aˆt > 0 the objective refuses to reward increases in rt beyond 1 + ϵ and
when Aˆt < 0 it refuses to reward decreases beyond 1− ϵ. In both cases the “min” selects
the pessimistic (clipped) improvement if the un-clipped term would push the policy too
far. This acts as a soft trust region: it allows helpful, local updates while discouraging
destructive jumps.
As with actor-critic methods more broadly, PPO augments the policy objective with
a value-function loss and an entropy bonus. Writing the overall training target as a
maximization problem,
LPPO(θ, ϕ) = Et
[︂
LCLIP(θ)− c1 (Rˆt − Vϕ(st))2 + c2H(πθ(· | st))
]︂
,
where Vϕ is the critic, Rˆt is a bootstrap estimate of return, H is the Shannon entropy
of the policy, and c1, c2 ≥ 0 weight value accuracy and exploration. In code this is
implemented as a loss to minimize by negating the objectives first and last terms and
leaving the squared-error term positive.
Training proceeds in short on-policy batches. The algorithm freezes θold, rolls out the
current policy for several episodes (an episode is a complete run in the environment from
an initial state to a terminal state) and computes advantages frequently using General-
ized Advantage Estimation (GAE). With the batch fixed, PPO then performs multiple
epochs of stochastic gradient ascent on LPPO using minibatches, standardizing advantage
to stabilize the scale of updates. Pseudocode provided in Appendix 3.
Compared with A2C, PPO differs in two practical ways. First, it replaces the plain
advantage-weighted policy gradient with the clipped surrogate, which explicitly prevents
large per-sample likelihood changes. Second, it reuses each on-policy batch for several
optimization epochs, substantially improving sample efficiency relative to a single update
per batch. These two design choices translate to greater stability and competitive per-
formance while keeping the implementation as simple as a standard first-order optimizer.
PPO is known for its stability and ease of use. It generally requires fewer parameter
tweaks than DDPG and avoids the complexity of a replay buffer. Being on-policy, it does
require more environment interactions to learn effectively, but with modern compute and
using parallel simulation, this is manageable. In the finance context, on-policy means
that it is always using the latest policy to generate new trajectories. One potential issue
is that financial time series are not easily resettable to random states (like game states),
instead it is often trained on one long chronological sequence. To apply PPO, one should
split the historical data into multiple segments (or use multiple parallel markets or time
periods) to simulate multiple episodes. Alternatively, treat each day as a continuing
32
episode with for example random start years. PPO has been successfully used in various
financial studies (sometimes combined with other models). For example, Yang et al.
(2020) include PPO in an ensemble strategy for stock trading and found it helpful for
adapting to different market conditions. PPO’s clipping mechanism also makes it easier
to incorporate additional objectives (like risk penalties) without destabilizing the training
too much. In this work I will implement a version of PPO with a multi-output neural
network (one output for the action distribution over assets, another for the value estimate)
often called an actor-critic PPO.
3.3.3 DDPG
Deep Deterministic Policy Gradient (DDPG) by Lillicrap et al. (2015) is an off-policy
actor-critic algorithm tailored to continuous action spaces where it learns a Q-function
by using the Bellman equation and a policy by using the Q-function. It expands the idea
of deterministic policy gradient (DPG) framework from (Silver et al., 2014) by adding
value-based stabilization techniques from deep Q-learning (Mnih et al., 2015). Core idea
of DDPG is an actor network µθ(s) that deterministically maps state to a specific action.
In continuous action spaces, this is more efficient than learning a probability distribution
over actions. A critic network Qϕ(s, a) estimates the value of state-action pairs.
DDPG also keeps target networks µθ′ and Qϕ′ , which are copies of the actor and critic
that lag behind, updated slowly via:
θ′ ← τθ + (1− τ)θ′, ϕ′ ← τϕ+ (1− τ)ϕ′, with τ ≪ 1
These targets are used for computing stable target Q-values. This lagged tracking prevents
destructive feedback loops in which rapidly changing targets destabilize the critic, which
in turn misguides the actor.
Because DDPG is off-policy, it decouples data collection from learning. Transitions
(st, at, rt, st+1) are stored in an experience replay buffer, from which the algorithm draws
random minibatches for updates. Replay reduces the temporal correlations present in se-
quential data and allows each sample to be reused across many gradient steps, improving
sample efficiency and smoothing the learning signal.
The critic is trained by minimizing a temporal-difference (TD) regression loss toward a
bootstrap target computed with the target networks:
yt = rt + γ Qϕ′(st+1, µθ′(st+1)) ,
33
L(ϕ) =
1
N
∑︂
t
(Qϕ(st, at)− yt)2 .
Here γ ∈ (0, 1) is the discount factor. The target yt uses a slowly moving critic and actor,
which makes the supervised signal for the online critic less volatile and reduces the risk
of divergence.
The actor is optimized to choose actions that the current critic deems valuable. Using
the deterministic policy-gradient theorem, the update direction follows the critic’s action-
gradient:
∇θJ(θ) ≈ 1
N
∑︂
t
[︂
∇aQϕ(st, a)|a=µθ(st)∇θµθ(st)
]︂
.
Intuitively, the critic supplies a local slope in action space to nudge the action to improve
long-run return while the actor backpropagates this signal through its own parameters to
make such actions more likely in similar states.
Because the policy is deterministic, exploration must be injected explicitly at action time.
DDPG disturbs the actor’s output with noise:
at = µθ(st) +Nt.
(Lillicrap et al., 2015) chose Ornstein-Uhlenbeck process noise by Uhlenbeck and Ornstein
(1930) for temporally correlated exploration in physical control tasks but uncorrelated
Gaussian noise is a common alternative.
A typical training iteration proceeds as follows. Agent interacts with the environment
using the noisy policy, appending each transition to the replay buffer. Periodically, it
samples a minibatch from the buffer, updates the critic by minimizing the TD loss toward
the target networks’ bootstrap, updates the actor using the critic’s action-gradients, and
then softly updates the targets toward the new online parameters. This loop improves
the quality of value estimates and the policy they guide. Pseudocode in Appendix 2.
DDPG’s off-policy nature allows efficient reuse of past experience, especially beneficial in
finance where historical data is limited. It handles continuous actions such as portfolio
weights. However, off-policy methods like DDPG can be fragile regarding exploration
or function approximator instability. It might get stuck in suboptimal policies due to
insufficient exploration or overestimate Q-values. Improvements like TD3 by Fujimoto
et al. (2018) have addressed some issues via twin critics. DDPG’s continuous action for-
mulation is particularly attractive in finance, for instance Liu et al. (2018) successfully
applied DDPG to stock trading, showing improved performance over simpler policy gradi-
ent methods. Portfolio weightsare constrained with normalization to ensure valid portfolio
configurations (sum to 1, non-negative).
34
4 METHODOLOGY
4.1 Data description
A compact yet realistic selection of U.S. large-cap equities was selected so that the RL
agents (and the 1/N benchmark) can allocate across liquid, sector-diverse assets. Chosen
stocks have high liquidity to ensure that trading signals from the DRL agent are viable in
practice meaning that there are no issues filling orders and also to minimize microstructure
noise in price data. Liquidity is typically measured by metrics such as average daily
trading volume or market capitalization. In this research the focus is on large-cap stocks
which trade millions of shares per day. Concretely, the construction starts from the S&P
500 contenders and then the chosen stocks are picked from the most traded among them
that fulfill other criteria. High liquidity also means fewer missing trading days and more
reliable pricing data.
To ensure a diversified portfolio, stocks are included from a variety of sectors, for example
technology, finance, healthcare, consumer, and industrials. This prevents the portfolio
from being too concentrated in a single industry, which could bias results and also increases
collinearity of returns. By having a mix of sectors, the DRL agent will have to learn to
allocate among different industries, which often respond differently to market conditions,
for instance tech and utilities. This tests the agent’s ability to rotate between sectors
if advantageous (a form of dynamic allocation that can add value if it predicts sectoral
trends). Stocks included with a range of volatility profiles, some more stable with lower
volatility e.g. utilities or large consumer staples, and some more volatile e.g. tech or
smaller caps. This ensures the agent deals with different risk profiles. High volatility
stocks present more opportunity and risk for the agent to exploit. Low volatility stocks
provide stability and are often favored in risk-adjusted strategies. By mixing them, the
agent might learn to overweight volatile stocks when confident and shift to stable ones in
uncertain times, similar to a market volatility timing strategy. However, to keep things
reasonable, extremely volatile or distressed stocks are avoided, because they could skew
results or present irregular data patterns due to corporate actions.
Applying these criteria a portfolio of 10 is constructed. Ten is a common number in
academic studies to keep state-space manageable for RL, for example Jiang et al. (2017)
used 11 including cash. A cash asset is sometimes included (or money market fund
proxy) as one of the "assets" in the portfolio. This would allow the agent to deallocate
from equities into cash if it predicts a broad downturn. The 1/N strategy including cash
would mean 1/(N+1) in each stock and cash, however, often 1/N refers to fully invested
in equities. For simplicity and alignment with many studies, I proceed with fully invested
35
portfolios with no cash. The agent must always distribute 100 percent among the stocks.
This also simplifies the action space constraint. The chosen stocks are listed in Table 2.
Table 2: Final stock universe by sector and style
Company Sector Formal Style
Apple (AAPL) Information Technology Sensitive
Nvidia (NVDA) Semiconductors / IT Sensitive
Google (GOOGL) Communication Services Sensitive
JPMorgan Chase (JPM) Financials Cyclical
Johnson & Johnson (JNJ) Health Care Defensive
Walmart (WMT) Consumer Staples Defensive
Coca-Cola (KO) Consumer Staples (Beverages) Defensive
McDonald’s (MCD) Consumer Discretionary Cyclical
NextEra Energy (NEE) Utilities Defensive
Lockheed Martin (LMT) Industrials / Defense Sensitive
This set ensures each asset is well traded. Each stock in the list had no major data
issues, like a merger causing a disappearance in time series. This of course leads to a
look-ahead bias, but since both DRL and 1/N benchmark portfolios are built from the
same identical survivorship universe, any residual look-ahead bias applies equally to each
strategy and thus cancels out in their relative performance comparison. Yet it is worth
noting that this may overstate absolute risk-adjusted returns and thus should therefore
be cautious when making any generalisations about attainability of these results in real
time. By diversifying sectors, some colinearity reduction is introduced: stock returns
within a sector are usually more correlated than across sectors. So multi-sector means the
correlation matrix of returns is less dominated by a single factor. However, since all are
equities, there will still be a common market factor. That is fine and reflection of reality.
Sector diversity and mixed volatility profiles encourage the RL agent to rotate between
cyclical and defensive exposures rather than memorising a single dominant factor, thereby
providing a more stringent test of dynamic portfolio skills.
Historical prices are downloaded from Refinitiv Datastream, where total return series in-
corporate splits and cash dividends. Additionally, the 3-month Treasury-bill yield, price
to earning ratios, and 1-month bond volatility index are also downloaded from Refinitiv
Datastream. The data coverage includes a daily panel spanning from January 2010 to
December 2024, which is roughly 3 800 trading days that cover the post global financial
crisis recovery, the COVID-19 turmoil, and a variety of subsequent market regimes. Ob-
servations from 2010–2019 (about 2 520 days) are reserved for model training, while the
2020–2024 window is held strictly out-of-sample for performance evaluation to maintain
the temporal flow of information.
36
Total-return indices, which automatically adjust for corporate actions such as splits and
dividends, are used to compute continuous logarithmic daily returns as
rt,i = ln
(︃
Pt,i
Pt−1,i
)︃
= ln(Pt,i) − ln(Pt−1,i),
eliminating the need for separate dividend handling. The risk-free benchmark is the three-
month U.S. Treasury bill. Its yield is converted into a constant maturity with daily series
and merged with the price panel, ensuring that excess returns, Sharpe ratios, and Jensen’s
alphas are grounded in an observable short-term rate. This forward-chaining design where
training the agent on one market era and evaluating it on genuinely unseen conditions
creates a realistic forward looking scenario. The agent is trained on one market era and
judged on future, unseen conditions, providing a harsh generalisation check.
4.2 Defining the state space
State representation for the reinforcement-learning state must relay market information
for each asset and the agent’s current allocation. In addition to historical price data, the
RL survey suggests including techinical indicators and market indicators can help (Bai
et al., 2024, 5). In line with quantitative finance practices, the feature set is designed to
be expressive yet compact.
The state vector is built around four stock specific signals. First, the daily log-returns
observed over the most recent one to five trading days capture short-term momentum and
potential mean-reversion effects. Second, a rolling 21-day standard deviation of returns
serves as a volatility proxy, linking each asset’s expected reward to its recent risk. Third,
a small set of higher-level signals such as 14-day RSI is added to accelerate learning.
Fourth, a rolling price-to-earnings (P/E) ratio presents a fundamental feature that may
help the agent discriminate between over- and undervalued stocks. To fulfill the state
vector it is augmented with two additional elements. The previous portfolio weights wt−1
are appended, letting the agent weigh the cost of re-balancing when transaction fees apply.
Lastly two macro variables are added: the three-month Treasury bill yield and a bond-
market volatility index add information about prevailing funding conditions and market
stress.
With N stocks and k stock-level features, the state dimension equals N × k. For N = 10
and k = 5 this is 50 elements. Adding the previous weights (+10) and two macro variables
still keeps the vector well below a few hundred inputs which is manageable for a fully
connected neural network. For larger universes (N = 50) the dimension rises to roughly
250–400, so redundant signals are intentionally avoided.
37
Principal component or factor representations could decorrelate highly similar stocks, but
at the cost of interpretability. However, machine-learning-based methodologies including
neural networks are highly immune to collinearity caused biases (Lindner et al., 2022,
pp. 1311–1312). Since moderate collinearity is tolerable for modern networks, direct stock
representation is retained in this work. The final state vector st for day t chaines four
stock-specific features including recent returns, 21-day volatility, a 14-day RSI, and the
rolling P/E together with the previous allocation wt−1 and two macro variables. Refreshed
each trading day, this vector is fed to the policy network, which outputs the next allocation
wt.
Neural networks train more robustly when input features share a comparable numerical
range (Sola and Sevilla, 1997, pp. 1467–1468). Daily returns, technical indicators and
fundamentals are therefore transformed as follows. Returns instead of raw prices are
calculated as log returns: rt,i = ln(Pt,i/Pt−1,i). This representations removes the scale
effect of share price. For example, a $20 and a $300 stock contribute equally if they move
by ±5%. Feature standardisation is applied by setting for every feature x its training
set mean µ and standard deviation σ, and the standard score x˜ = (x − µ)/σ is used in
all splits (train and test). The transformation yields approximate zero mean and unit
variance, which accelerates optimisation. For Outlier control, extremely volatile variables
are capped inside [−3σ, +3σ ]. Bounded indicators such as RSI are rescaled to [0, 1]. Non-
stationarity causion is handled. Because financial series are non-stationary, fixed µ and σ
from the training window may become sub-optimal in later periods. Nevertheless, static
scaling avoids data leakage and sufficed in preliminary experiments. This pipeline delivers
inputs of similar scale to the neural networks while preserving the temporal integrity of
the train–test split.
In equity markets stocks often move together, particularly during market wide events
which can lead to highly correlated returns and multicollinear feature sets. Extreme
collinearity may obscure the attribution of risk or return to individual assets, yet in a
portfolio setting an RL policy can simply treat strongly correlated stocks as a single
factor exposure by assigning them similar weights. The following measures are adopted
to keep collinearity at a manageable level: Stock choices are drawn from multiple sectors,
reducing the likelihood of high pairwise correlations. Redundant indicators are avoided.
For example, only one short-term momentum signal is chosen. Features are computed on
returns rather than raw prices, improving stationarity and relevance to re-balancing. The
training set correlation matrix is inspected as: if any asset pair or indicator pair exhibits
ρ > 0.90, one variable is dropped or substituted to maintain informational diversity.
Dimensionality reduction techniques such as PCA can transform highly correlated returns
into orthogonal factor scores. Some prior works like Shen et al. (2015) did use PCA on
stock returns to define "arms" for a bandit algorithm, essentially picking uncorrelated
38
portfolios rather than stocks. However, doing so sacrifices some transparency. In practice,
correlated price movements do not break the approach: the RL agent is free to allocate
similar weights to assets that move alike. By combining sector diversification, a non-
redundant feature set and ongoing correlation checks, severe multicollinearity is prevented
from dominating the state space.
Following this pipeline guarantees that each state presented to the RL agent contains
well-aligned technical features, macro signals and fundamentals that are clean, scaled and
strictly historical.
4.3 Model Implementation
The experiments are set up to answer the research questions. This includes the configur-
ation of portfolio management environment for the RL agents, the training procedure for
each algorithm, the definition of reward and any constraints, and the evaluation protocol
comparing against the 1/N strategy.
All methods are implemented in Python and rely exclusively on open-source software.
The implementation of the reinforcement learning algorithms were utilized with Stable
Baselines3 (SB3), which provides production-ready versions of PPO, A2C, and DDPG.
SB3 is a set of reliable implementations of reinforcement learning algorithms in PyTorch,
documented by Raffin et al. (2021). PyTorch is a machine learning library that offers
a deep learning framework that focuses on usability and speed, documented by Paszke
et al. (2019). OpenAIGym offers the environment interface under which StockTradingEnv
is defined. StockTradingEnv is name of the custom environment used in this research.
OpenAIGym is a toolkit for reinforcement learning research, documented by Brockman
et al. (2016). The complete software stack therefore combines reliable RL implementations
with a flexible simulation interface and a modern deep-learning framework.
The portfolio-allocation problem is modelled as a custom OpenAIGym environment, des-
ignated as StockTradingEnv. The environment packages the market data, state represent-
ation, admissible actions, and reward signal into a sequential decision process in which a
single trading day forms one environment step. In StockTradingEnv the state comprises
historical features and current weights, actions are allocation vectors, and rewards are
daily log returns. Simplified StockTradingEnv is presented in Appendix 4. In the follow-
ing, the state and action spaces are specified, the transition dynamics are described, and
the reward function is constructed.
At each trading day t, the state vector st gathers all information the agent needs to
choose a new portfolio allocation. The environment provides several key components
39
for this state vector. Firstly, it includes a windowed history of technical features for each
stock. A rolling window of length w = 5 days is kept. For each of the N stocks, F technical
features including daily returns, RSI and rolling volatility are collected, producing a tensor
of shape (w,N, F ), which is flattened into one vector:
Features = [xt−w+1,1, . . . , xt,1, xt−w+1,2, . . . , xt,2, . . . , xt−w+1,N , . . . , xt,N ],
where xt,i denotes the feature vector for stock i on day t.
Secondly, the previous portfolio allocation wt−1 ∈ RN is appended so the agent knows its
current positioning, which is critical when weight changes (and any transaction costs) are
calculated. Finally, to enable the policy to condition on broader market signals, the state
includes a macro vector mt ∈ RM and per-stock fundamentals ft,i are linked, enabling
the policy to condition on both technical and macro/fundamental signals. Putting these
parts together,
st =
[︂
xt−w+1,1, . . . , xt,N⏞ ⏟⏟ ⏞
flattened stock features
, wt−1,1, . . . , wt−1,N⏞ ⏟⏟ ⏞
prev. weights
, mt⏞⏟⏟⏞
macro
, ft,1, . . . , ft,N⏞ ⏟⏟ ⏞
fundamentals
]︂⊤
.
This consolidated state is passed to the policy network at every step.
The action at on day t is the new weight vector
wt = (wt,1, . . . , wt,N)
⊤,
where each component satisfies wt,i ≥ 0 and
∑︁N
i=1wt,i = 1, meaning the agent invests
100% of its capital across the N stocks. The mechanism for generating this action vector
varies depending on the reinforcement learning algorithm.
For stochastic policy-based methods like Proximal Policy Optimization (PPO) or Advant-
age Actor-Critic (A2C) the action is first represented as a continuous vector in RN . The
policy network outputs a diagonal Gaussian whichs mean is the unconstrained vector u˜.
A softmax is subsequently applied to map u˜ onto the simplex,
wt = softmax(u˜),
ensuring non–negative weights that sum to unity. The resulting wt is used directly for
rebalancing.
For deterministic algorithms like the Deep Deterministic Policy Gradient (DDPG), the
actor’s final layer is followed by a softmax, so that the network itself emits a valid weight
40
vector:
wt = softmax(u˜+ noise).
During training, exploration noise is added to u˜ before the softmax. The environment
therefore always receives an allocation that lies strictly on the simplex. In every scenario
the allocation satisfies wt ∈ [0, 1]N and
∑︁
iwt,i = 1, a requirement consistent with an
equity-only portfolio that disallows short selling and idle cash.
After the environment has been supplied with the weight vector wt, the portfolio is re-
balanced to those weights as a state transition. During the following day asset prices
evolve exogenously, and the portfolio value is updated as
Vt+1 = Vt
(︂
1 +
N∑︂
i=1
wt,iRt,i
)︂
,
where Rt,i is the realised return of asset i from day t to t + 1. The next state st+1 is
then assembled from the next window of features {xt+1−w+1,i, . . . , xt+1,i} for every stock,
the newly chosen weights wt that are now stored as the “previous-weights” block, and the
updated macro and fundamental data for day t+ 1.
The base reward is the logarithmic portfolio return,
rt = ln
(︂Vt+1
Vt
)︂
= ln
(︂
1 +
N∑︂
i=1
wt,iRt,i
)︂
.
Instead of maximizing the discounted sum, a better objective is to maximize expected
log-return, because the logarithm’s concavity makes the problem concave aligning better
with long-run terminal wealth. When turnover costs are included, the reward is reduced
by
r˜t = ln(1 +Rp,t)− c
N∑︂
i=1
|wt,i − wt−1,i|⏞ ⏟⏟ ⏞
cost_term
,
where Rp,t is the gross portfolio return and c > 0 is the proportional fee rate. The penalty
discourages frequent or large re-balancing moves and ensures that long-term policy is
updated and the model doesn’t stop early.
4.4 Training the agents
Three deep-reinforcement-learning agents PPO, DDPG and A2C are trained on the Stock-
TradingEnv introduced in Section 4.3. All agents receive identical features, episode bound-
aries and evaluation protocols so that the comparison remains fair. Hyperparameters are
41
selected either from values commonly reported in the literature or from a modest grid
search, described below.
Identical network sizes are used for all agents to ensure a consistent basis for comparison.
The actor network and, when applicapble, the critic network adopt a uniform structure.
This architecture consists of two hidden layers, each containing 64 units and utilizing
the Rectified Linear Unit (ReLU) as the activation function. This same two-layer, 64-
unit ReLU structure is employed for the critic networks in algorithms that require one.
While the hidden layers are standardized, the output layers differ based on the algorithm’s
requirements. For the stochastic agents PPO and A2C, the network outputs the mean and
standard deviation parameters that define a diagonal Gaussian distribution from which
actions are sampled. In contrast, the DDPG agent’s network emits a single deterministic
action. Regardless of the agent type, a final softmax activation function is applied to the
output logits. This critical step ensures that the resulting portfolio weights are properly
normalized, summing to one and thus representing a valid, fully invested allocation.
Each episode spans roughly 250 trading days, suggesting that a pure return-sum objective
with γ = 1 could be used. However, practical and theoretical considerations favour a
discount factor that is slightly below unity. A range of γ ∈ [0.95, 0.99] is adopted for
all agents because it preserves the standard convergence guarantees of RL algorithms, it
places a mild emphasis on near-term returns while still valuing long-term performance,
and it aligns with defaults in widely used libraries such as Stable Baselines3.
This range γ ∈ [0.95, 0.99] has frequently been used in prior financial RL work (eg.,
(Mohammed et al., 2021; Zou et al., 2024)) to maintain theoretical convergence properties
and to ensure that returns far in the future do not dominate the current updates in an
unbounded manner.
A moderate grid search explores learning rates, discount factors, step sizes and transaction-
cost rates introduced in Table 3:
Table 3: Hyperparameter Grid Search Values
Hyperparameter PPO A2C DDPG
Discount Factor (γ) {0.95, 0.99} {0.95, 0.99} {0.95, 0.99}
Learning Rate {1e−4, 3e−4} {1e−4, 3e−4} {1e−3, 5e−4}
Steps (nsteps) {256, 512} {5, 8} N/A
Value Function Coeff. N/A {0.25, 0.5} N/A
Transaction Cost Rate N/A N/A {0.0005, 0.001}
Each hyperparameter combination is trained for 20k–50k time-steps on the training set
(multiple episodes) and evaluated on a separate validation year. The configuration that
maximises a composite metric of net return, Sharpe ratio and turnover is retained.
42
All algorithms are implemented with the Stable Baselines3 library (PyTorch backend).
The custom StockTradingEnv Gym environment integrates seamlessly with the library’s
policy, replay-buffer and optimisation modules while allowing task-specific customisations
such as action projection.
On-policy agents (PPO, A2C) the training data is divided into multiple episodes of roughly
250 trading days. After each episode the portfolio value is reset to 1 and weights are
initialised to equal allocations. Episodes are sampled chronologically. For off-policy agent
(DDPG), year-long episodes are used to establish clear boundaries while the replay buffer
is filled across episodes and sampled uniformly for updates.
During the grid search each agent is trained for 15k–50k time-steps, enough to observe
performance plateaus and identify stable hyperparameter choices. The best configura-
tion is then re-trained for 200k–500k time-steps. Periodic validation checks allow early
stopping if performance ceases to improve, reducing the risk of over-fitting.
4.5 Performance evaluation
The deep-RL agents are evaluated against a transparent, equal-weight benchmark that
invests the same fraction in each of the N assets. Daily rebalancing keeps the comparison
on the same one-step-per-day schedule used by the learning agents. At the start of the
test period the capital is split equally, w0,i = 1/N for i = 1, . . . , N , and the portfolio
value is normalised to 1. At every trading day t the weights are reset to wt,i = 1/N
∀ i, fully offsetting any drift caused by price movements. In the base experiment the
benchmark is assumed frictionless, mirroring the initial DRL runs. When transaction
costs are introduced, the 1/N portfolio pays the same costs rate each day to restore the
equal weights, preserving fairness against DRL policies that may rebalance less often. The
daily portfolio return is
R
1/N
t =
1
N
N∑︂
i=1
Rt,i,
and the value update follows V 1/Nt+1 = V
1/N
t (1 + R
1/N
t ). Cumulative return, Sharpe ratio,
drawdown and turnover are computed exactly as for the learning agents.
A strict equal-weight strategy provides a low-information, diversified baseline. Although
equal-weight funds often rebalance monthly or quarterly to mitigate turnover, a daily
schedule simplifies the comparison: both benchmark and DRL agents act once per trading
day. If a DRL policy genuinely exploits predictive structure in the features, it is expected
to surpass this 1/N benchmark on risk-adjusted metrics such as annualised Sharpe and
maximum drawdown.
43
The final policies of PPO, DDPG and A2C are back-tested on a held-out test period and
compared with the 1/N benchmark. No further learning occurs during testing. In back-
testing protocol the portfolio value is set to V0 = 1 and the agent’s trained parameters
are loaded. In the daily simulation the agent observes the state st, a weight vector wt is
produced, the environment re-balances to wt, and next-day returns Rt update the value
via
Vt+1 = Vt
(︂
1 +
∑︂
i
wt,iRt,i
)︂
.
This procedure creates a time series rt of daily log-returns for both the RL agent and the
1/N benchmark.
For each strategy the following quantities are reported: compound annual growth rate
(CAGR), annualised volatility σann, maximum drawdown (MDD), total transaction costs,
and Sharpe ratio =
r¯ − rf
σann
, where r¯ is the mean daily return and rf the daily risk-free
rate.
For statistical comparison a robust Sharpe difference testing by (Ledoit and Wolf, 2008)
is adopted. The block bootstrap approach proposed by Ledoit and Wolf resamples blocks
of daily returns to preserve autocorrelation structure. Under the null hypothesis of equal
Sharpe ratios, estimation about the distribution of Sharpe differences is tested. Then a
p-value is computed to see if the observed difference in Sharpe is statistically significant.
Because RL training is partly stochastic (e.g., random initialization, exploration noise),
each DRL agent is trained multiple times (with different random seeds) to gauge variability
in outcomes. If, for instance, PPO consistently outperforms 1/N across several runs, that
provides stronger evidence of robust outperformance. If outcomes vary widely, the average
and standard deviation of the performance metrics is reported.
Final evaluation steps summarized: a set of performance metrics for each agent and for
1/N, statistical tests (Ledoit and Wolf (2008) block bootstrap) to ascertain if differences
in Sharpe are robustly significant, and lastly seed repetition to check the consistency of
learned policies. These steps provide a rigorous foundation for concluding whether or not
our DRL agents truly surpass the straightforward 1/N baseline.
44
5 RESULTS AND DISCUSSION
5.1 Data Characteristics and Feature Behavior
The investment universe includes ten large-capitalisation U.S. equities: Apple (AAPL),
JPMorgan Chase (JPM), Johnson & Johnson (JNJ), Walmart (WMT), NextEra Energy
(NEE), NVIDIA (NVDA), Alphabet (GOOGL), Lockheed Martin (LMT), Coca-Cola
(KO), and McDonald’s (MCD).1 Daily prices span several years and are split into a
training and a test sample. Descriptive statistics are presented in Table 4.
Table 4: Key descriptive statistics for the ten constituent equities. Annual figures are
geometric unless noted.
Ticker Ann. Return (%) Ann. Vol. (%) Beta (1/n) Skew Kurtosis Avg. P/E
AAPL 30.66 27.27 1.24 -0.05 5.61 20.42
GOOGL 23.92 26.85 1.22 0.41 8.68 32.73
JNJ 10.09 16.50 0.65 -0.11 9.49 21.44
JPM 19.81 27.23 1.20 0.27 10.68 11.56
KO 10.51 16.72 0.71 -0.62 9.51 25.67
LMT 18.47 20.54 0.79 -0.29 13.59 16.53
MCD 14.71 18.34 0.76 0.40 32.39 23.35
NEE 17.75 21.59 0.82 -0.21 11.72 25.13
NVDA 65.22 44.67 1.96 0.67 8.66 49.58
WMT 15.80 18.96 0.65 0.12 16.68 24.01
NVDA combines the highest annual return with the greatest volatility and a beta almost
twice that of the equal-weight benchmark. Low-beta defensives (JNJ, KO) display muted
skewness and thinner tails, whereas MCD and LMT show pronounced kurtosis, signalling
fat-tail risk.
Table 5: Descriptive statistics for the two macro-economic features.
Variable Mean Std. Min Max Skew Kurtosis
TBILL_3M 1.24 1.74 −0.05 5.36 1.37 0.46
BONDVOL_1M 78.62 25.30 36.60 198.70 0.83 0.18
The three-month T-bill rate is strongly right-skewed, capturing the abrupt hiking cycle
after a prolonged near-zero period. Bond-market volatility is likewise right-skewed, with
occasional spikes above 200 basis points.
Tech names (AAPL, NVDA, GOOGL) exhibit the tightest cluster, suggesting a sector
1Ticker conventions follow CRSP. Company names are given only for clarity; the portfolio is managed
strictly at the ticker level.
45
Figure 6: Correlation matrix of daily log-returns for the ten equities.
Darker shading denotes stronger positive correlation.
factor the DRL agents may hedge or exploit. Cross-sector correlations remain moderate,
preserving diversification potential, in fact, no pairwise correlation exceeds roughly 0.60,
so multicollinearity is limited and dimensionality reduction techniques such as PCA are
not strictly necessary for the state space construction. Descriptive statistics for these
variables appear in Figure 6.
It is worth noting differences between training and testing regimes. An essential observa-
tion is that the distribution of key features shifts between the two periods. During most
of the training sample, short-term interest rates remained near the zero lower bound,
whereas the test period contains episodes of rising rates and heightened volatility. Figure
7 illustrates the two way nature of TBILL_3M: the primary mode clusters around 0%
while a secondary mode appears between 4% and 5%, signalling a regime change toward
tighter monetary policy late in the sample.
Similarly, the distribution of BONDVOL_1M is markedly right-skewed (Figure 8). Al-
though moderate volatility prevails for most observations, rare spikes to extreme levels
occur. Some of these events lie entirely outside the range witnessed during training.
Also, DRL agents must contend with out-of-sample conditions that partially violate the
stationarity assumption, which is a known challenge in financial machine learning.
46
Figure 7: Empirical distribution of TBILL_3M across the full sample.
The primary mode clusters near 0 % (x in percentage points) while a secondary peak around 4
- 5 % marks the regime shift toward higher short-term rates.
Figure 8: Distribution of the bond-market volatility BONDVOL_1M.
The pronounced right tail (x in bp) reveals rare but extreme volatility spikes that the DRL agents
must handle out-of-sample.
47
Features such as interest-rate levels and volatility surges can exert first-order influence on
optimal portfolio choice. A financially rational policy might reduce equity exposure when
BONDVOL_1M exceeds a stress threshold, or rebalance toward defensive assets when
TBILL_3M rises sharply, reflecting an increased risk-free alternative and a potential
contraction in the equity risk premium.
If the DRL agents have genuinely internalised such relations, it is expected for their
allocations to react coherently to these signals in the test sample. By contrast, overfitting
to the low-volatility, low-rate training regime could manifest as poor generalisation once
those assumptions break down. The performance analyses will revisit these hypotheses
and examine whether the strategies adapt effectively or falter under the observed regime
shifts.
5.2 DRL Model Configuration and Hyperparameters
Before presenting performance outcomes, here is a summarize of the final model con-
figurations. Each DRL algorithm underwent a modest grid search on a validation set
to balance reward maximisation with practical considerations such as risk and trading
costs. All three agents share an identical neural network architecture for comparability:
two hidden layers of 64 ReLU units to control for capacity differences. Table 6 lists the
hyperparameters ultimately selected for PPO, A2C, and DDPG.
Table 6: Final hyperparameters chosen for each DRL algorithm.
Hyperparameter PPO A2C DDPG
Discount factor γ 0.99 0.95 0.99
Learning rate 3× 10−4 1× 10−4 1× 10−3
Rollout length / n-steps 256 5 –
Value-loss coefficient – 0.50 –
Transaction-cost rate (train) 0.001 0.001 0.001
Note: “ - ” indicates not applicable (e.g., DDPG is off-policy and does not use fixed-length rollouts
in the same way as PPO/A2C share an implicit transaction cost penalty via environment).
The above values were arrived at by a modest grid search. For example, the discount
factor was tested γ ∈ {0.95, 0.99} for each algorithm, and ultimately γ = 0.99 was chosen
for PPO and DDPG to emphasize long-term return (and because it improved validation
performance), whereas A2C slightly preferred γ = 0.95, potentially because of stability
concerns when bootstrapping with longer horizons. Learning rates were explored in the
range 10−4 to 10−3. PPO performed best with a learning rate of 3 × 10−4, while A2C,
being somewhat less sample-efficient, was kept at a conservative 1× 10−4 to avoid diver-
gence. DDPG’s actor-critic optimizer benefited from a relatively larger step size (10−3)
for faster convergence, given its off-policy training can utilize more data. The n-steps (the
48
number of steps per update) for the on-policy methods were also tuned: PPO uses 256-
step rollouts per update (roughly covering a bit more than one trading year per episode in
the setup), balancing bias and variance in advantage estimation. A2C uses 5-step returns
(which is standard for A2C and was found to work well in validation). Also the coeffi-
cient was adjusted on A2C’s value-function loss (0.5 was found slightly better than 0.25),
which helps A2C learn state values without overweighting this term. Lastly, an explicit
transaction-cost rate of 0.1% (0.001 in fractional terms) was built into the environment
for all strategies during training to penalize excessive turnover. It was confirmed that
including this modest cost in training encourages the agents to moderate their trading
frequency. Notably, in the grid search for DDPG I tried a slightly lower cost rate (0.05%),
but the higher cost rate of 0.1% yielded a better return–turnover trade-off and was chosen
as the final setting. Overall, these hyperparameters are consistent with values commonly
used in the DRL literature for trading tasks and were selected to give each algorithm the
best chance to learn a robust policy.
5.3 Out-of-Sample Performance Comparison
First examined is the cumulative portfolio value trajectories in the test period for each
strategy. Figure 9 plots the evolution of portfolio wealth (normalised to an initial value
of 1.0) over the entire test horizon for the equal-weight (1/N) strategy and the DRL-driven
portfolios (PPO, A2C, DDPG).
Several observations can be made from this figure. All strategies achieved substantial
growth over the test period, indicating positive returns in a broad bull-market trend.
However, the 1/N portfolio ultimately ends with the highest terminal value (approx-
imately 2.7 times the starting value), closely followed by PPO and A2C. The DDPG
strategy’s final wealth is a bit lower and has the highest volatility. In terms of sheer final
return, none of the DRL trategies convincingly exceeds the naive 1/N yet indeed, the
equal-weight strategy slightly outperformed all three DRL agents.
It is insightful to consider the journey of these returns, not just the endpoint. The
plot reveals periods where DRL strategies overtook the benchmark and vice versa. For
instance, in the middle of the test period (around days 500 - 600), the A2C portfolio
surged above the 1/N, peaking around a value of 2.00 when 1/N was at 1.7. However,
A2C then suffered a sharper drawdown around days 600 - 700 and 800 - 1000, erasing its
lead. As contrast, PPO’s trajectory is relatively more stable: it tracks the 1/N portfolio
closely, with only minor deviations. The DDPG path is the most volatile of the three DRL
agents, dipping more aggressively during certain downturns yet rallying strongly toward
the end to close the gap with PPO.
49
Figure 9: Cumulative portfolio value in the test period
Portfolio value normalised to 1.0 at inception. The equal-weight benchmark (blue) finishes
marginally ahead of the DRL strategies: PPO (orange) and A2C (red) tracks closely, while
DDPG (green) exhibits the greatest volatility.
These patterns indicate that the DRL strategies dynamically adjusted their exposures over
time, occasionally capturing gains or avoiding losses better than 1/N, but at other times
making missteps the static benchmark avoided. This hints a key theme: variability versus
consistency as the equal-weight portfolio, being a passive buy-and-hold with periodic
rebalancing, shows a steadier upward trend and smaller oscillations, aside from market-
wide moves.
The return performance can be quantified more formally via the compound annual growth
rate (CAGR). Over the test horizon of T years (roughly T ≈ 5 years given ∼ 1250 trading
days), the 1/N portfolio achieved a CAGR of about ∼ 21%, whereas PPO’s CAGR was
approximately 17%, DDPG’s about 20%, and A2C’s about 18%. These CAGRs (which
will be detailed in Table 7) confirm that the equal-weight strategy provided the highest
long-run growth. Nonetheless, the differences are not enormous as DDPG in particular
delivered almost the same growth rate, falling short by only a percentage point. In
absolute return terms, therefore, the advantage of the benchmark is modest and is later
tested by running models with multiple seeds. Finally, should be noted that all portfolios
benefitted from a generally rising market. This means even the “worst” strategy (PPO)
more than doubled the initial capital in the given period, which is a strong absolute result.
However, for evaluating skill or added value, risk-adjusted returns and consistency should
be more informative.
50
Table 7: Out-of-sample performance metrics.
Metric PPO A2C DDPG 1/n
Final portfolio value rounded 2.20 2.25 2.55 2.70
CAGR (%) 17.1 17.6 19.6 20.9
Sharpe ratio 0.805 0.840 0.916 1.075
Max drawdown (%) −31.9 −33.4 −32.5 −30.7
Trades (count) 421 153 1 242 1 258
Total cost 0.70519 0.31583 0.54653 0.02070
The Sharpe ratio provides a summary of risk-adjusted return (assuming reasonably stable
distributions). Over the entire test period, the annualized Sharpe ratios of the strategies
were as follows: the 1/N portfolio attained the highest Sharpe around 1.1, DDPG was
slightly lower around 0.9, A2C further behind with 0.84, and PPO roughly in 0.8. These
values are reported in Table 7. In effect, 1/N had the best trade-off between mean return
and volatility. This might be surprising at first glance since one might expect a soph-
isticated DRL agent to deliver superior Sharpe if it can reduce risk during bad times.
However, the results show that the DRL strategies did not substantially outperform the
naive diversification on a Sharpe basis. One contributing factor is volatility of returns:
the DRL strategies introduced additional variability by shifting portfolios. While they
sometimes reduced exposure successfully before a downturn, at other times they shifted
into the wrong assets and increased volatility. The equal-weight portfolio, by contrast, al-
ways holds a broad mix, which smooths out idiosyncratic fluctuations and yields relatively
stable returns (aside from systemic market moves).
To illustrate the time-varying nature of risk-adjusted performance, Figure 10 plots the
rolling 60-trading-day Sharpe ratio (annualized) for each strategy throughout the test
period. The 1/N portfolio (red line in this figure) maintains a consistently positive Sharpe
in most periods and reaches very high Sharpe values (above 5 or even 8 on occasion) during
strong market rallies with low volatility. The DRL strategies show more volatility in their
short-term Sharpe. For example, in early parts of the test, the A2C strategy (orange line)
spiked to a rolling Sharpe of 5, outperforming 1/N at that moment, but shortly thereafter
its Sharpe plunged toward 0 or negative when some trades turned bad. Similarly, DDPG
(green line) had a period around day 600 where its 60-day Sharpe fell below 2, whereas
1/N at the same time was around 0. Later in the test (day 1050-1150), all strategies
saw Sharpe ratios jump as the market rallied. Interestingly, 1/N’s Sharpe (red) soared
the highest (peaking above 8) because it was fully exposed to the rising market, whereas
PPO and A2C (blue and orange) remained more muted around 5-6. By the very end of
the test, the rolling Sharpes of all strategies converge downward as a minor drawdown
occurs, with some even turning slightly negative. These fluctuations underscore that
51
Figure 10: Rolling 60-day annualised Sharpe ratio for each strategy.
the DRL strategies’ performance was more erratic in the short run. They achieved high
risk-adjusted returns in certain windows but could not maintain that consistently.
Another critical risk metric is the maximum drawdown (the largest peak-to-valley per-
centage loss during the period). In results, the max drawdown for the 1/N portfolio was
about −30.7%. The DRL strategies fared slightly worse overall: PPO suffered a draw-
down of roughly −31.9%, A2C the largest at around −33.4%, and DDPG about −32.5%.
These values in Table 7 show that none of the DRL agents improved on the benchmark’s
downside protection. A2C increased downside risk the most, while PPO and DDPG were
only marginally deeper than the benchmark.
The portfolio weight evolution plot (Figure 11) vividly illustrates the differences in strategy
behaviour. The 1/N benchmark (bottom-right panel) is trivial: ten static bands of equal
width (≈ 10%) apart from minor drift before rebalancing. By contrast, the DRL panels
are highly dynamic: PPO (top-left) appears to cycle between two dominant states. In
one state it splits capital almost evenly between two favoured stocks whilst in the other
it reverts to an approximate equal-weight allocation across the remaining eight. Short
excursions away from these states also occur. This “bimodal” pattern is plausibly an
echo of the training phase, where portfolio weights were initialised at 1/N. Once PPO
finds a near-balanced configuration, it tends to sit there to save transaction costs. A2C
(bottom-left) shows a related but less symmetric behaviour: it also favours a handful of
quasi-balanced mixes, yet the weights inside each mix are uneven, suggesting the agent
locks onto a temporary “optimal” basket for a few days before abruptly rotating else-
52
Figure 11: Evolution of portfolio weights
Each coloured band represents the fraction of capital invested in one of the ten stocks.
where. DDPG (top-right) is the most chaotic. Its continuous-action updates lead to
almost frame-by-frame adjustments with no stable pattern, consistent with a policy that
chases short-term signals rather than settling into discrete regimes.
Thus the DRL agents continually reshuffle their portfolios whereas the 1/N strategy simply
holds everything. Active rotation is not inherently problematic, but given the 0.1% trans-
action cost penalty it erodes much of the DRL excess return, helping to explain why the
simple equal-weight portfolio ultimately outperforms on a net basis.
Figures 12 and 13 repeat the experiment with markedly shorter training horizons: roughly
30 000 steps for A2C, 50 000 for PPO and 10 000 for DDPG, and then reinitialises each
algorithm with distinct random seeds. For every seed the test period wealth path and
Sharpe ratio is recorded, then plot the cross-seed median. The median return curves
confirm the earlier pattern, not achieving sustainable overperformance but rather differing
performances due to stochastic path. Equally important, the median Sharpe ratios cluster
around the market portfolio’s level, indicating that even when training time is curtailed
and weight initialisation varies widely, no agent achieves a risk-adjusted edge. In other
words, the conclusions drawn from the full-length runs are robust to randomness in both
learning time and seed choice.
53
Figure 12: Median portfolio trajectories from multiple random seeds.
54
Figure 13: Median annualised Sharpe ratios from multiple seeds.
55
To summarise the performance metrics discussed so far, Table 7 provides a concise com-
parison. The benchmark 1/N strategy remains superior on every headline figure: it posts
the highest final value, CAGR, and Sharpe ratio, and also the shallowest maximum draw-
down. Among the DRL agents, DDPG comes closest to the benchmark in both return and
Sharpe (2.55 final value, 19.6 % CAGR, 0.916 Sharpe), albeit with a deeper drawdown (-
32.5 %) and the second-highest transaction-cost drag. A2C shows a slightly higher CAGR
and Sharpe than PPO but suffers the worst drawdown of the group (-33.4 %), indicating
an unstable risk profile despite its relatively low trading cost. PPO delivers the lowest
CAGR (17.1 %) and Sharpe (0.805) of the three DRL strategies and a drawdown of –31.9
%, so its supposed stability advantage does not translate into better risk-adjusted returns.
Although the raw trade counts in the table put the benchmark at 1 258 executed orders
(monthly rebalances of small size) versus, for example 421 for PPO, the cost column re-
veals the practical picture: the DRL agents incur an order of magnitude higher cumulative
transaction costs because their trades tend to be much larger reallocations. These costs
help explain why none of the DRL policies convert their tactical activity into a Sharpe
ratio that beats the simple equal-weight portfolio.
Given the relatively small performance differentials observed (and the inherent noise in
financial returns), it is essential to test whether any apparent advantage of one strategy
over another is statistically significant or could simply be due to random chance. This is
tested by employing a block–bootstrap hypothesis test by Ledoit and Wolf (2008).
The null hypothesis that each DRL strategy has the same Sharpe ratio as the (1/N)
benchmark is tested against the alternative that the Sharpe ratios differ. Because returns
are time-dependent and non-normal, a naive analytic test is inappropriate. Instead, the
robust procedure of Ledoit and Wolf (2008) is used, which constructs a studentised, time-
series bootstrap confidence interval for the difference in Sharpe ratios. If zero is outside
this interval (equivalently, if the bootstrap p-value ≤ 0.05), the null hypothesis of equal
Sharpe ratios is rejected.
A circular block bootstrap with a block length equal to one month (21 trading days) is
implemented to resample paired returns of each strategy and the benchmark. After 1000
bootstrap iterations, none of the DRL strategies shows a significant Sharpe-ratio difference
from 1/N at the 5% or even 10% level. For A2C vs. 1/N the p-value is ≈ 0.4670. In this
context, the p-value represents the proportion of the 1000 resampled time-series in which
a random difference in Sharpe ratios was at least as large as the one originally observed,
under the assumption that no true performance difference exists between the strategies.
PPO vs. 1/N was even less significant p ≈ 0.5010 and DDPG vs. 1/N yields p ≈ 0.4740.
Pairwise tests among the DRL agents likewise fail to reach significance at 5 %. Hence,
it cannot be concluded that any DRL method out- or underperformed the benchmark in
56
terms of Sharpe ratio.
The lack of statistically significant differences implies that there is not strong evidence
that any DRL strategy truly beats the 1/N portfolio on a risk adjusted basis. This echoes
previous literature by (DeMiguel et al., 2009) that finds sophisticated methods often fail to
outperform naive diversification out of sample. One reason is sample length: test window
covers only a few years, far too short for mean-variance or DRL models to show decisive
superiority. Moreover, the agents were trained offline and then tested without further
learning causing regime shifts in rates and volatility likely eroded their static policies.
In an online learning or meta learning setup, sustained adaptation might enlarge any
performance gap, but that lies beyond the current scope.
Finally, it is acknowledged that significance testing in finance has its pitfalls: return
distributions are non-normal, and with limited realized path of returns, the power to
detect differences is limited. Using block bootstrapping to at least account for time
dependence, which is an improvement over naive i.i.d. assumptions. The results from
Ledoit and Wolf (2008) method, in particular, gives confidence that if there were a true
difference in Sharpe of moderate size, it likely would have detected it. Since it did not, the
safest conclusion is that the DRL strategies and the 1/N strategy performed comparably
in a statistical sense.
5.4 Interpretation and Critical Discussion
The empirical results provide a rich ground for interpretation. On one hand, the DRL
agents demonstrated the ability to adjust and find some profitable opportunities (they
all achieved positive returns and average Sharpe ratios near 1, which is respectable in
absolute terms). On the other hand, none clearly dominated the simple equal-weight
strategy. When considering why the DRL didn’t significantly beat 1/N, there are several
potential reasons. First, the efficiency of the markets for these large-cap stocks might
simply be too high that is consistent with the Efficient Market Hypothesis (EMH) which
posits that in a competitive market, it’s hard to consistently attain excess risk-adjusted
returns. The agents tried to time the market, but any patterns they exploited may have
been fleeting or not sufficiently profitable after costs. It’s telling that the 1/N, which
doesn’t attempt any prediction, did as well or better. This suggests there were no easy
arbitrages available. Second, estimation error and overfitting likely played a role. The
DRL models have many parameters and were trained on a finite sample. They might
have “learned” spurious correlations that did not hold up in the test period. For example,
A2C’s aggressive trades that led to underperformance could be due to it chasing a pattern
that turned out to be noise. Overfitting is a notorious issue in quantitative strategies:
without extremely robust validation, complex models can fool themselves with in-sample
57
noise. The true market is always somewhat different from historical data.
The DRL strategies frequently rebalanced in response to market changes, effectively per-
forming a form of dynamic hedging or trend following. For instance, the agents might
have been selling during high volatility and buying in calm periods, effectively performing
a volatility-timing strategy. This could yield a positive average return, but in a sudden
volatility spike it could backfire (some of the big drawdowns). In short, the DRL strategies
may be taking on “hidden” risks like tail risk or liquidity risk. This would explain why
their Sharpe ratios didn’t surpass 1/N whilst overperforming in some time periods, any
extra return was compensation for those hidden risks. While 1/N was chosen as the bench-
mark, one might wonder if the DRL strategies would fare better against other baselines.
For example, compared to a marketcap weighted index of these stocks (which would be
dominated by the largest companies like AAPL and GOOGL), maybe the DRL perform-
ance would look different. If the 1/N portfolio happened to do exceptionally well (it often
outperforms cap-weighted indices in diversified sets (DeMiguel et al., 2009, 1930-1933)),
then beating 1/N is a high bar. However, the goal was specifically to test against the
naive diversification as that is often a tough benchmark. Indeed, the findings reinforce
the notion that 1/N is hard to beat consistently.
The practical implementation of these DRL strategies would be challenging due to their
high turnover and instability. The weight plots make clear that positions were changing
frequently and sometimes dramatically. In reality, such frequent trading could incur
market impact costs (which was was not modeled) in addition to the fixed 10 basis point
fees. It could also raise questions of whether the strategy could be executed with large
capital, would there be enough liquidity to support constant rebalancing among these
stocks without slippage. The equal-weight portfolio is straightforward to implement and
scale. This highlights a classic trade-off: a complex strategy might promise higher returns
on paper, but simplicity often wins in net terms when execution costs and constraints are
considered.
It’s informative to consider how each DRL algorithm’s nature might have influenced its
performance. PPO, with its more stable policy updates (via clipping), perhaps learned a
more robust strategy: this correlates with its relatively moderate trading count, making
big updates to portfolio less frequently. A2C, which updated more aggressively (every 5
steps) and might converge to a local optimum faster, perhaps overfit and ended up with
a even more generalizable strategy. DDPG, being an off-policy algorithm, had the ability
to learn from replay and might have found a strategy that was a bit different from the on-
policy ones. Its performance being intermediate could be due to a combination of factors:
it may have benefitted from replay (learning from more data), but as a deterministic
policy it might have struggled with exploration, getting stuck in a suboptimal trading
58
pattern at times. Indeed, the noisiness of DDPG’s weights suggests it was tweaking
continuously, possibly reflecting the inherent instability of training a continuous control
agent in a non-stationary financial environment.
This outcome has several implications for financial theory. The results lend some support
to the EMH, particularly its assertion that finding excess risk-adjusted returns in liquid
markets is extremely difficult. If markets were highly inefficient, it would be expected that
a complex algorithm finds patterns and consistently beats a simplistic strategy. Instead,
the DRL agents struggled to surpass 1/N, suggesting that the market pricing of these
assets was efficient enough that even nonlinear, dynamic strategies could not easily exploit
mispricings. One could argue this is consistent with at least a weak-form EMH: past price
patterns alone were not sufficient to guarantee better performance, as the DRL (which
has access to historical data and technical features) did not create a winning strategy
out-of-sample. While EMH provides one lens, the Adaptive Market Hypothesis offers
another perspective that is quite relevant. The results can be interpreted through AMH
as follows: The DRL agents possibly did find some inefficiencies during training (patterns
that worked in that period), but once deployed in the test period, those patterns may
have dissipated or new conditions emerged. Markets in the test period were different,
for example the regime shift in interest rates. In an AMH sense, the market adapted,
and the agents needed to adapt further. They were not online updated (rolling), so they
were somewhat “static” in their learned behavior, which could not fully capitalize on
new anomalies. Nevertheless, there was short episodes of outperformance, for example
A2C during a certain rally which could hint at temporary inefficiencies that the agent
exploited until they vanished. The mixed performance aligns with an AMH view that
profit opportunities are episodic: sometimes present, sometimes not. The fact that the
agents didn’t consistently win might indicate that any edge they had was quickly nullified
by market changes or other participants. If one looks at the high turnover and occasional
suboptimal trades by the DRL agents, it raises an analogy to human behavioral biases. At
times the agents overreacted to short-term fluctuations (not unlike a human trader who
chases noise). This wasn’t programmed, but emerged from the learning process essentially,
the agents can exhibit “behavioral” traits too (like trend chasing or fear based selling) if
the reward signal fools them. In a way, the DRL strategies underperformance could hint
that they sometimes fell for patterns that weren’t truly there, drawing a parallel to how
investors fall for hot streaks or panic sell. This is speculative but an interesting cross-link
between AI behavior and behavioral finance.
The study also highlights many practical challenges in applying DRL to real-world port-
folio management. There was a transaction cost of 0.1 % which is reasonable for retail
trading of large-cap stocks, but for institutional scale or for assets with less liquidity, costs
can be higher. The DRL strategies, especially with their high turnover, would face signi-
59
ficant performance erosion from trading costs. Indeed, even that small cost was enough to
take away their potential edge. Moreover, the simulation did not include market impact
(the price slippage caused by one’s own trades), which could be substantial given how
often the agents trade. In reality, a strategy that reallocates 20 % of a portfolio daily in
large stocks could move the market if done with enough capital.
Study showed that results can vary with hyperparameter choices and due to stochasti-
city. Even with chosen hyperparameters, these algorithms have inherent variability: two
training runs might yield somewhat different policies due to different weight initializations
or sampling randomness. This stochasticity can be problematic for a practitioner: one
cannot be sure if this particular trained agent is the optimal one or if another training
run would do better. In contrast, a fixed strategy like 1/N has no such uncertainty. Ad-
ditionally, the non-convex optimization in DRL means there’s a chance the agent finds
a locally optimal but globally suboptimal strategy. Financial data, especially if restric-
ted to a single market or a handful of assets, is limited in quantity and rife with regime
changes. Non-stationarity (the fact that the statistical properties of returns change over
time) is a deep challenge. A strategy trained on 2010–2018 data might not be ready for
2020’s pandemic crash or 2022’s rate hikes. The test showed some cracks in the strategies
when facing such new events. To use DRL in practice, one might need to continuously
retrain or adapt the models, which introduces its own difficulties. This highlights how
data limitations constrain what DRL can achieve in finance. With more data or an abil-
ity to simulate realistic scenarios, perhaps the agents could learn more robust strategies.
Another practical issue is that DRL policies are largely black-box. If a portfolio manager
wants to know “Why is the agent investing 50% in JPM today?”, it’s hard to get a straight
answer from the network’s weights. This opacity can reduce trust. These insights will be
carried into the concluding chapter, summarizing this thesis and proposing what future
improvements might be pursued.
60
6 CONCLUSIONS
This thesis was set out to investigate whether advanced Deep Reinforcement Learning
(DRL) techniques can improve upon a simple equal-weight (1/N) portfolio strategy in a
quantitative finance setting. Portfolio management was framed as a sequential decision
problem with a reward that emphasizes risk adjusted returns under realistic trading fric-
tions. The empirical evidence did not show statistically significant outperformance of
the DRL agents over 1/N in the test period. Final wealth and Sharpe ratios were very
close across strategies, with the benchmark marginally ahead, and the differences were
not significant at conventional levels.
In this setting there is no decisive evidence that DRL improves upon 1/N. These findings
are consistent with the view that extracting stable excess risk adjusted returns in liquid
equities is difficult. Under a weak form EMH reading, historical and technical features
alone were insufficient to deliver persistent gains out of sample. The agents may have
exploited patterns during training that did not survive the regime changes in the test
period, and the fixed policies used here were not updated online to track those shifts.
Practical factors likely contributed to the outcome. Turnover amplifies the effect of trans-
action costs, even at modest assumed fees. The return generating process is nonstation-
ary, and the quantity of informative data is limited relative to model capacity. Training
is stochastic and nonconvex, which complicates stability and reproducibility. The opa-
city of neural policies also poses challenges for oversight and governance compared with
transparent rules such as 1/N.
The main limitations of the study are a small asset universe, static rather than online
policies, a restricted feature set, and simplified execution that does not model market
impact or slippage. These choices support clarity and comparability but they constrain
external validity. Answering the research question, DRL did not deliver superior risk ad-
justed performance to 1/N in the experiments. Equal weighting remains a very demanding
benchmark, consistent with prior literature such as DeMiguel et al. (2009).
Future work could prioritize continual or online adaptation to regime change, explicit
risk objectives and constraints including drawdown and CVaR, richer and economically
motivated features with broader universes and controlled use of shorting and leverage,
realistic execution with impact and turnover regularization during training, and stronger
interpretability and benchmarking against factor and rule based strategies. Naive diver-
sification remains robust, and whether DRL can consistently surpass it likely depends on
progress along these fronts.
61
References
Almahdi, S. – Yang, S. Y. (2017) An adaptive portfolio trading system: A risk-return
portfolio optimization using recurrent reinforcement learning with expected
maximum drawdown. Expert Systems with Applications, vol. 87, 267–279.
Bai, Y., Gao, Y., Wan, R., Zhang, S. – Song, R. (2024) A review of reinforcement learning
in financial applications. Annual Review of Statistics and Its Application,
vol. 12.
Barberis, N. – Thaler, R. (2003) A survey of behavioral finance. Handbook of the Eco-
nomics of Finance, vol. 1, 1053–1128.
Betancourt, C. – Chen, W.-H. (2021) Deep reinforcement learning for portfolio man-
agement of markets with a dynamic number of assets. Expert Systems with
Applications, vol. 164, 114002.
Beysolow II, T. (2019) Market making via reinforcement learning. In Applied Reinforce-
ment Learning with Python: With OpenAI Gym, Tensorflow, and Keras,
77–94, Springer.
Black, F. (1986) Noise. The journal of finance, vol. 41 (3), 528–543.
Bouchaud, J.-P., Ciliberti, S., Lempérière, Y., Majewski, A., Seager, P. – Ronia, K. S.
(2017) Black was right: Price is within a factor 2 of value. arXiv preprint
arXiv:1711.04717.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J. – Za-
remba, W. (2016) Openai gym. In Proceedings of Deep Reinforcement Learn-
ing Workshop, NIPS.
Buehler, H., Gonon, L., Teichmann, J. – Wood, B. (2019) Deep hedging. Quantitative
Finance, vol. 19 (8), 1271–1291.
Carhart, M. M. (1997) On persistence in mutual fund performance. The Journal of finance,
vol. 52 (1), 57–82.
DeMiguel, V., Garlappi, L. – Uppal, R. (2009) Optimal versus naive diversification: How
inefficient is the 1-n portfolio strategy? The Review of Financial Studies,
vol. 22 (5), 1915–1953, URL: https://EconPapers.repec.org/RePEc:oup:
rfinst:v:22:y:2009:i:5:p:1915-1953.
62
Fama, E. F. (1965) The behavior of stock-market prices. The journal of Business,
vol. 38 (1), 34–105.
Fama, E. F. (1970) Efficient capital markets: A review of theory and empirical work. The
journal of Finance, vol. 25 (2), 383–417.
Fujimoto, S., Hoof, H. – Meger, D. (2018) Addressing function approximation error in
actor-critic methods. arXiv preprint arXiv:1802.09477.
Glorot, X., Bordes, A. – Bengio, Y. (2011) Deep sparse rectifier neural networks. In
Proceedings of the fourteenth international conference on artificial intelligence
and statistics, 315–323, JMLR Workshop and Conference Proceedings.
Gu, S., Kelly, B. – Xiu, D. (2020) Empirical asset pricing via machine learning. The
Review of Financial Studies, vol. 33 (5), 2223–2273, URL: https://doi.
org/10.1093/rfs/hhaa009.
Guan, M. – Liu, X.-Y. (2021) Explainable deep reinforcement learning for portfolio man-
agement: an empirical approach. In Proceedings of the second ACM interna-
tional conference on AI in finance, 1–9.
Huang, G., Zhou, X. – Song, Q. (2020) Deep reinforcement learning for long-short portfolio
optimization. arXiv e-prints, arXiv–2012.
Huang, G., Zhou, X. – Song, Q. (2022) Deep reinforcement learning for portfolio manage-
ment. URL: https://arxiv.org/abs/2012.13773.
Jeong, G. – Kim, H. Y. (2019) Improving financial trading decisions using deep q-learning:
Predicting the number of shares, action strategies, and transfer learning.
Expert Systems with Applications, vol. 117, 125–138.
Jiang, Z. – Liang, J. (2017) Cryptocurrency portfolio management with deep reinforce-
ment learning. In 2017 Intelligent systems conference (IntelliSys), 905–913,
IEEE.
Jiang, Z., Xu, D. – Liang, J. (2017) A deep reinforcement learning framework for the
financial portfolio management problem. arXiv preprint arXiv:1706.10059.
Jones, S. L. – Netter, J. M. (2008) Efficient capital markets. The Concise Encyclopedia
of Economic, vol. 15 (14), 87–98.
Lecun, Y., Bengio, Y. – Hinton, G. (2015) Deep learning. Nature, vol. 521 (7553), 436–
444, publisher Copyright: © 2015 Macmillan Publishers Limited. All rights
63
reserved.
Ledoit, O. – Wolf, M. (2008) Robust performance hypothesis testing with the sharpe
ratio. Journal of Empirical Finance, vol. 15 (5), 850–859.
Lee, J., Kim, R., Yi, S.-W. – Kang, J. (2020) Maps: Multi-agent reinforcement learning-
based portfolio management system. arXiv preprint arXiv:2007.05402.
Liang, Z., Chen, H., Zhu, J., Jiang, K. – Li, Y. (2018) Adversarial deep reinforce-
ment learning in portfolio management. URL: https://arxiv.org/abs/
1808.09940.
Lillicrap, T. P. et al. (2015) Continuous control with deep reinforcement learning. arXiv
preprint arXiv:1509.02971.
Lindner, T., Puck, J. – Verbeke, A. (2022) Beyond addressing multicollinearity: Robust
quantitative analysis and machine learning in international business research.
Journal of International Business Studies, vol. 53 (7), 1307–1314.
Liu, X.-Y., Xiong, Z., Zhong, S., Yang, H. – Walid, A. (2018) Practical deep reinforcement
learning approach for stock trading. arXiv preprint arXiv:1811.07522.
Lo, A. W. (2004) The adaptive markets hypothesis: Market efficiency from an evolutionary
perspective. Journal of Portfolio Management, Forthcoming.
Malkiel, B. G. (2003) The efficient market hypothesis and its critics. Journal of economic
perspectives, vol. 17 (1), 59–82.
Markowitz, H. (1952) Portfolio selection. The Journal of Finance, vol. 7 (1),
77–91, URL: https://onlinelibrary.wiley.com/doi/abs/10.1111/j.
1540-6261.1952.tb01525.x.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D. – Kavuk-
cuoglu, K. (2016) Asynchronous methods for deep reinforcement learning. In
International conference on machine learning, 1928–1937, PmLR.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves,
A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G. et al. (2015) Human-level
control through deep reinforcement learning. nature, vol. 518 (7540), 529–533.
Mohammed, S., Bealer, R. – Cohen, J. (2021) Embracing advanced ai/ml to help investors
achieve success: Vanguard reinforcement learning for financial goal planning.
arXiv preprint arXiv:2110.12003.
64
Moody, J., Wu, L., Liao, Y. – Saffell, M. (1998) Performance functions and reinforcement
learning for trading systems and portfolios. Journal of forecasting, vol. 17 (5-
6), 441–470.
Musiol, M. (2016) Speeding up deep learning computational aspects of machine learning.
Nevmyvaka, Y., Feng, Y. – Kearns, M. (2006) Reinforcement learning for optimized trade
execution. In Proceedings of the 23rd international conference on Machine
learning, 673–680.
Park, H., Sim, M. K. – Choi, D. G. (2020) An intelligent financial portfolio trading
strategy using deep q-learning. Expert Systems with Applications, vol. 158,
113573.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin,
Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito,
Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai,
J. – Chintala, S. (2019) Pytorch: An imperative style, high-performance
deep learning library. In Advances in Neural Information Processing Systems
(NeurIPS).
Pendharkar, P. C. – Cusatis, P. (2018) Trading financial indices with reinforcement learn-
ing agents. Expert Systems with Applications, vol. 103, 1–13.
Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, N. – Dormann, P. (2021) Stable-
baselines3: Reliable reinforcement learning implementations. Journal of Ma-
chine Learning Research, https://github.com/DLR-RM/stable-baselines3.
Safari, S. A. – Schmidhuber, C. (2025) Trends and reversion in financial markets on time
scales from minutes to decades. arXiv preprint arXiv:2501.16772.
Schulman, J., Moritz, P., Levine, S., Jordan, M. – Abbeel, P. (2015) High-dimensional
continuous control using generalized advantage estimation. arXiv preprint
arXiv:1506.02438.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A. – Klimov, O. (2017) Proximal policy
optimization algorithms. URL: https://arxiv.org/abs/1707.06347.
Schwert, G. W. (2003) Anomalies and market efficiency. Handbook of the Economics of
Finance, vol. 1, 939–974.
Sharpe, W. F. (1964) Capital asset prices: A theory of market equilibrium under condi-
tions of risk. The journal of finance, vol. 19 (3), 425–442.
65
Shen, W., Wang, J., Jiang, Y.-G. – Zha, H. (2015) Portfolio choices with orthogonal
bandit learning. In IJCAI, vol. 15, 974–980.
Shiller, R. J. (2017) Narrative economics. The American Economic Review, vol. 107 (4),
967–1004, URL: http://www.jstor.org/stable/44251584.
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D. – Riedmiller, M. (2014) Determ-
inistic policy gradient algorithms. In International conference on machine
learning, 387–395, Pmlr.
Sola, J. – Sevilla, J. (1997) Importance of input data normalization for the application
of neural networks to complex industrial problems. IEEE Transactions on
Nuclear Science, vol. 44 (3), 1464–1468.
Soleymani, F. – Paquet, E. (2021) Deep graph convolutional reinforcement learning for fin-
ancial portfolio management–deeppocket. Expert Systems with Applications,
vol. 182, 115127.
Sutton, R. S., Barto, A. G. et al. (1998) Reinforcement learning: An introduction, vol. 1.
MIT press Cambridge.
Sutton, R. S., Barto, A. G. et al. (2018) Reinforcement learning: An introduction 2nd ed.
MIT press Cambridge, vol. 1 (2), 25.
Tian, Y., Gao, M., Gao, Q. – Peng, X.-H. (2024) Trading in fast-changing markets
with meta-reinforcement learning. Intelligent Automation & Soft Computing,
vol. 39 (2).
Uhlenbeck, G. E. – Ornstein, L. S. (1930) On the theory of the brownian motion.
Phys. Rev., vol. 36, 823–841, URL: https://link.aps.org/doi/10.1103/
PhysRev.36.823.
Wang, J., Zhang, Y., Tang, K., Wu, J. – Xiong, Z. (2019) Alphastock: A buying-winners-
and-selling-losers investment strategy using interpretable deep reinforcement
attention networks. In Proceedings of the 25th ACM SIGKDD international
conference on knowledge discovery & data mining, 1900–1908.
Wang, Z., Huang, B., Tu, S., Zhang, K. – Xu, L. (2021) Deeptrader: a deep reinforcement
learning approach for risk-return balanced portfolio management with market
conditions embedding. In Proceedings of the AAAI conference on artificial
intelligence, vol. 35, 643–650.
Watkins, C. J. – Dayan, P. (1992) Q-learning. Machine learning, vol. 8, 279–292.
66
Williams, R. J. (1992) Simple statistical gradient-following algorithms for connectionist
reinforcement learning. Machine learning, vol. 8, 229–256.
Wu, M.-E., Syu, J.-H., Lin, J. C.-W. – Ho, J.-M. (2021) Portfolio management system
in equity market neutral using reinforcement learning. Applied Intelligence,
vol. 51 (11), 8119–8131.
Xiong, Z., Liu, X.-Y., Zhong, S., Yang, H. – Walid, A. (2018) Practical deep reinforcement
learning approach for stock trading. arXiv preprint arXiv:1811.07522, vol. 25,
27.
Yang, H., Liu, X.-Y., Zhong, S. – Walid, A. (2020) Deep reinforcement learning for
automated stock trading: An ensemble strategy. In Proceedings of the first
ACM international conference on AI in finance, 1–8.
Ye, Y., Pei, H., Wang, B., Chen, P.-Y., Zhu, Y., Xiao, J. – Li, B. (2020) Reinforcement-
learning based portfolio management with augmented asset movement predic-
tion states. In Proceedings of the AAAI conference on artificial intelligence,
vol. 34, 1112–1119.
Zhang, Z., Zohren, S. – Roberts, S. (2019) Deep reinforcement learning for trading. arXiv
preprint arXiv:1911.10107.
Zou, J., Lou, J., Wang, B. – Liu, S. (2024) A novel deep reinforcement learning based
automated stock trading system using cascaded lstm networks. Expert Sys-
tems with Applications, vol. 242, 122801.
67
Appendices
APPENDIX I
Algorithm 1: Advantage Actor-Critic (A2C)
1: Initialize policy network πθ(s) and value network Vϕ(s) with random weights.
2: for iteration = 1 to M (M updates) do
3: for each environment worker k = 1 to K (in parallel) do
4: Run policy πθ for n steps or until episode ends:
5: Collect states st, actions at, rewards rt for t = 1, . . . , n.
6: if episode ended at step n then
7: Set R = 0
8: else
9: Set R = Vϕ(sn+1) {bootstrap from last state}
10: end if
11: for t = n down to 1 do
12: R← rt + γR
13: Calculate advantage: At ← R− Vϕ(st)
14: Accumulate gradients: gθ ← gθ + ∇θ[− log πθ(at|st)At] {policy gradient (max-
imize advantage)}
15: Accumulate gradients: gϕ ← gϕ +∇ϕ[A2t ] {value gradient (MSE loss)}
16: end for
17: end for
18: θ ← θ + αθ gθK {update policy weights; ascent on advantage}
19: ϕ← ϕ− αϕ gϕK {update value weights; descent on loss}
20: end for
68
APPENDIX II
Algorithm 2: Deep Deterministic Policy Gradient (DDPG)
1: Initialize actor network µθ(s), critic network Qϕ(s, a) with random weights
2: Initialize target networks: θ′ ← θ, ϕ′ ← ϕ
3: Initialize replay buffer D
4: for episode = 1 to E do
5: Receive initial state s0
6: for t = 0 to T − 1 (until end of episode or max steps) do
7: // Actor selects action with exploration noise
8: at = µθ(st) + noiset {noise ∼ N (0, σ) or Ornstein-Uhlenbeck}
9: Execute action at, observe reward rt and new state st+1
10: Store (st, at, rt, st+1) in replay buffer D
11: st ← st+1
12: // Update step (every step or every few steps)
13: Sample minibatch of B experiences (si, ai, ri, s′i) from D
14: for each sample i in batch do
15: yi = ri + γQϕ′(s′i, µθ′(s
′
i)) {compute target Q}
16: end for
17: Update critic by minimizing loss:
L =
1
B
∑︂
i
(Qϕ(si, ai)− yi)2
18: Update actor using deterministic policy gradient:
∇θJ ≈ 1
B
∑︂
i
∇θµθ(si)∇aQϕ(si, a)|a=µθ(si)
19: Perform gradient ascent on θ (e.g., Adam optimizer)
20: Soft update target networks:
θ′ ← τθ + (1− τ)θ′, ϕ′ ← τϕ+ (1− τ)ϕ′
21: end for
22: end for
69
APPENDIX III
Algorithm 3: Proximal Policy Optimization (PPO)
1: Initialize policy network πθ(s) and value network Vϕ(s)
2: for iteration = 1 to Niter do
3: Collect set of trajectories D = {τ} by running policy πθ in environment (with T steps,
possibly parallel environments)
4: Compute rewards-to-go Rt and advantage estimates Aˆt for each timestep in D (using
current Vϕ, e.g., via GAE)
5: Save current policy parameters as θold ← θ
6: for epoch = 1 to K (multiple epochs per batch) do
7: for each minibatch M ⊂ D do
8: Compute ratios:
rt(θ) =
πθ(at|st)
πθold(at|st)
, ∀(st, at) ∈M
9: Compute policy loss:
Lpg = −EM
[︂
min
(︂
rt(θ)Aˆt, clip(rt(θ), 1− ϵ, 1 + ϵ)Aˆt
)︂]︂
10: Compute value loss:
Lvf = EM
[︁
(Vϕ(st)−Rt)2
]︁
11: Compute entropy bonus (encourage exploration):
Lent = −EM [β H(πθ(st))]
12: Compute total loss:
L = Lpg + c1Lvf + c2Lent
13: Update θ, ϕ by gradient descent/ascent on total loss
14: end for
15: end for
16: end for
70
APPENDIX IV
Algorithm 4: Simplified Pseudocode for the StockTradingEnv
1: Input: Price data, feature data, macro/fundamental data, and config parameters.
2:
3: // — Environment Initialization —
4: Store data arrays and configuration parameters.
5: Define continuous action space A for portfolio weights.
6: Define observation space S based on the concatenated dimensions of all input data over
the lookback window.
7:
8: // — Reset Function Logic —
9: function Reset()
10: Set ‘current_step‘ to its initial value (‘window_size‘ - 1).
11: Set ‘portfolio_value‘ to 1.0.
12: Set ‘current_weights‘ to the initial distribution (e.g., 1/N).
13: ‘observation‘ ← Construct observation for the initial state.
14: return ‘observation‘.
15: end function
16:
17: // — Step Function Logic —
18: function Step(‘action‘)
19: ‘previous_weights‘ ← ‘current_weights‘.
20: ‘target_weights‘ ← Normalize ‘action‘ so that weights sum to 1.
21: ‘turnover‘ ←∑︁|target_weights− previous_weights|.
22: ‘transaction_cost‘ ← ‘portfolio_value‘ × ‘turnover‘ × ‘cost_rate‘.
23:
24: // Calculate return based on weights held during the period
25: ‘asset_returns‘ ← Get returns for the current step from price data.
26: ‘portfolio_return‘ ← ‘previous_weights‘ · ‘asset_returns‘.
27:
28: // Update value and compute reward
29: ‘previous_value‘ ← ‘portfolio_value‘.
30: ‘portfolio_value‘ ← ‘portfolio_value‘ × (1 + ‘portfolio_return‘) -‘cost‘.
31: ‘reward‘ ← log(portfolio_value/previous_value).
32:
33: // Transition to the next state
34: ‘current_weights‘ ← ‘target_weights‘.
35: ‘current_step‘ ← ‘current_step‘ + 1.
36: ‘done‘ ← (‘current_step‘ ≥ ‘max_steps‘).
37: ‘next_observation‘ ← Construct observation for the new state.
38: return ‘next_observation‘, ‘reward‘, ‘done‘.
39: end function