1
Which update target does SARSA use during learning?
2
Why is Q‑Learning considered off‑policy?
3
In a stochastic FrozenLake environment, which algorithm tends to produce a safer policy for a physical robot and why?
4
What is the main source of overestimation bias in standard Q‑Learning?
5
How does Double Q‑Learning reduce the maximization bias?
6
When both SARSA and Q‑Learning are applied to a deterministic 3×3 grid world, which statement is true about the learned actions from the safe row to the goal row?
7
Which of the following best describes the role of the ε‑greedy behavior policy in SARSA?
Explanation
<p>The correct answer is <strong>It introduces stochasticity so the update target reflects possible mistakes</strong> because the ε‑greedy policy adds randomness: the agent sometimes picks a suboptimal action, so the update target accounts for possible mistakes and the agent avoids getting stuck on a single trajectory. The trap is to think that ε‑greedy always selects the optimal action (answer 1), when its purpose is precisely to explore. Picture exploring a maze and occasionally taking unexpected detours; those detours show you where you could go wrong and help you learn more robustly.</p>
<em>Which aspect of ε‑greedy seems most important to you? 1️⃣ Introducing randomness, 2️⃣ Guaranteeing immediate optimality, 3️⃣ Removing the learning rate</em>
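The point made in the explanation can be illustrated with a minimal Python sketch. The function names (`epsilon_greedy`, `sarsa_target`), the Q-values, and the parameter values are all hypothetical and chosen only for illustration. Because SARSA bootstraps from the action actually sampled by the ε‑greedy policy, its average target falls below the purely greedy target whenever exploration picks a worse action.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return q_values.index(max(q_values))


def sarsa_target(reward, gamma, next_q_values, next_action):
    """SARSA target: bootstrap from the action actually chosen in the next state."""
    return reward + gamma * next_q_values[next_action]


# Hypothetical Q-values for the next state S' (illustrative numbers only).
next_q = [1.0, 0.2, -0.5]
gamma, epsilon, reward = 0.9, 0.3, 0.0

rng = random.Random(0)
actions = [epsilon_greedy(next_q, epsilon, rng) for _ in range(1000)]
targets = [sarsa_target(reward, gamma, next_q, a) for a in actions]

mean_target = sum(targets) / len(targets)
greedy_target = reward + gamma * max(next_q)  # what a purely greedy target would use

print(f"mean SARSA target: {mean_target:.3f}, greedy target: {greedy_target:.3f}")
```

With `epsilon` set to 0 the two quantities coincide, since every sampled next action is then the greedy one.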
8
In Double Q‑Learning, what determines which of the two Q‑tables (QA or QB) is updated after a transition?
9
Why does SARSA tend to achieve higher win rates during training on a slippery FrozenLake compared to Q‑Learning?
10
Which scenario would show negligible difference between Q‑Learning and Double Q‑Learning?
11
What is the effect of setting the learning rate α to 1 in the Q‑Learning update rule?
12
In the SARSA update equation Q(S,A) ← Q(S,A) + α[ R + γ Q(S',A') – Q(S,A) ], what does the term R + γ Q(S',A') represent?
13
Which of the following best explains why Double Q‑Learning can be advantageous in environments with many actions and high stochasticity?
14
When applying Q‑Learning to a slippery FrozenLake, why might the learned policy recommend paths that pass near holes?
15
What is the primary purpose of the interpolation form Q ← (1‑α) Q + α target in TD learning?
16
In the context of Double Q‑Learning, what does the notation QA(S,A) ← QA(S,A) + α[ R + γ QB(S',A*) – QA(S,A) ] signify?
17
Which algorithm explicitly separates the selection and evaluation steps to mitigate the maximization bias?
18
When the discount factor γ is set to 0, what effect does this have on the TD target for both SARSA and Q‑Learning?
19
Which of the following statements about the ε‑greedy policy used in Q‑Learning is accurate?