Control Algorithms in Reinforcement Learning

1

Which update target does SARSA use during learning?

2

Why is Q‑Learning considered off‑policy?

3

In a stochastic FrozenLake environment, which algorithm tends to produce a safer policy for a physical robot and why?

4

What is the main source of overestimation bias in standard Q‑Learning?

5

How does Double Q‑Learning reduce the maximization bias?

6

When both SARSA and Q‑Learning are applied to a deterministic 3×3 grid world, which statement is true about the learned actions from the safe row to the goal row?

7

Which of the following best describes the role of the ε‑greedy behavior policy in SARSA?

8

In Double Q‑Learning, what determines which of the two Q‑tables (QA or QB) is updated after a transition?

9

Why does SARSA tend to achieve higher win rates during training on a slippery FrozenLake compared to Q‑Learning?

10

Which scenario would show negligible difference between Q‑Learning and Double Q‑Learning?

11

What is the effect of setting the learning rate α to 1 in the Q‑Learning update rule?

12

In the SARSA update equation Q(S,A) ← Q(S,A) + α[ R + γ Q(S',A') – Q(S,A) ], what does the term R + γ Q(S',A') represent?

13

Which of the following best explains why Double Q‑Learning can be advantageous in environments with many actions and high stochasticity?

14

When applying Q‑Learning to a slippery FrozenLake, why might the learned policy recommend paths that pass near holes?

15

What is the primary purpose of the interpolation form Q ← (1‑α) Q + α target in TD learning?

16

In the context of Double Q‑Learning, what does the notation QA(S,A) ← QA(S,A) + α[ R + γ QB(S',A*) – QA(S,A) ] signify?

17

Which algorithm explicitly separates the selection and evaluation steps to mitigate the maximization bias?

18

When the discount factor γ is set to 0, what effect does this have on the TD target for both SARSA and Q‑Learning?

19

Which of the following statements about the ε‑greedy policy used in Q‑Learning is accurate?
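
For reference, the update rules quoted in the questions above can be written out as a small tabular sketch. This is only an illustration under stated assumptions, not a definitive implementation: the function names, the list-of-lists Q-table layout, and the fair coin flip used to pick which table to update in Double Q‑Learning are choices made here (the coin flip follows the usual textbook formulation). Terminal-state handling is left out for brevity.

```python
import random

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy TD update: the target uses the action A' actually chosen next."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy TD update, written in interpolation form:
    Q <- (1 - alpha) * Q + alpha * target, with a greedy (max) target."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target

def double_q_update(QA, QB, s, a, r, s_next, alpha, gamma, rng=random):
    """Double Q-Learning: one table selects A*, the other evaluates it.
    Assumption: a fair coin flip decides which table is updated."""
    if rng.random() < 0.5:
        selector, evaluator = QA, QB   # update QA, evaluate with QB
    else:
        selector, evaluator = QB, QA   # update QB, evaluate with QA
    n_actions = len(selector[s_next])
    a_star = max(range(n_actions), key=lambda i: selector[s_next][i])
    target = r + gamma * evaluator[s_next][a_star]
    selector[s][a] += alpha * (target - selector[s][a])
```

With α = 1 the interpolation form replaces the old estimate with the target outright, and with γ = 0 both the SARSA and Q‑Learning targets reduce to the immediate reward R.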
