In reinforcement learning (RL), there are model-based and model-free algorithms. In short, model-based algorithms use a transition model (e.g. a probability distribution over next states) and the reward function, even though they do not have to compute (or estimate) them. On the other hand, model-free algorithms do not use such a transition model or reward function, but directly estimate e.g. the state or state-action value functions by interacting with the environment, without explicitly modelling the dynamics of the environment.


Given that model-based RL algorithms do not necessarily estimate or compute the transition model or reward function, in the case where these are unknown, how can they be computed or estimated (so that they can be used by the model-based algorithms)? In general, what are examples of algorithms that can be used to estimate the transition model and reward function of the environment (represented as either an MDP, POMDP, etc.)?


reinforcement-learning reference-request model-based-methods model-free-methods inverse-rl

2 Answers


Given that model-based RL algorithms do not necessarily estimate or compute the transition model or reward function, in the case where these are unknown, how can they be computed or estimated (so that they can be used by the model-based algorithms)?

A generally reliable approach to creating learned models from interaction with the environment, then using those models internally for planning or explicitly model-based learning, is still something of a holy grail in RL. An agent that can do this across multiple domains would be considered a significant step in autonomous AI. Sutton & Barto write in Reinforcement Learning: An Introduction (Chapter 17.5):

More work is needed before planning with learned models can be effective. For example, the learning of the model needs to be selective because the scope of a model strongly affects planning efficiency. If a model focuses on the key consequences of the most important options, then planning can be efficient and rapid, but if a model includes details of unimportant consequences of options that are unlikely to be selected, then planning may be almost useless. Environment models should be constructed judiciously with regard to both their states and dynamics with the goal of optimizing the planning process. The various parts of the model should be continually monitored as to the degree to which they contribute to, or detract from, planning efficiency. The field has not yet addressed this complex of issues or designed model-learning methods that take their implications into account.

This was written in 2019, so as far as I know it still stands as a summary of the state of the art. There is ongoing research into this - for instance, the paper Model-Based Reinforcement Learning via Meta-Policy Optimization considers using multiple learned models to assess reliability. I have seen a similar recent paper which also assesses the reliability of the learned model and chooses how much to trust it over a simpler model-free prediction, but I cannot recall the name or find it currently.

One very simple kind of learned model is to memorise transitions that have already been experienced. This is functionally very similar to the experience replay table used in DQN. The classic RL algorithm for this kind of model is Dyna-Q, where the data stored about known transitions is used to perform background planning. In its simplest form the algorithm is almost indistinguishable from experience replay in DQN. However, this memorised set of transition records is a learned model, and is used as such in Dyna-Q.
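For concreteness, here is a minimal tabular Dyna-Q sketch. It assumes a discrete environment exposing a classic Gym-style `reset()`/`step()` interface returning the 4-tuple `(next_state, reward, done, info)`; the hyperparameter names such as `n_planning_steps` are illustrative, not from any particular source.

```python
import random
from collections import defaultdict

def dyna_q(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1, n_planning_steps=10):
    """Tabular Dyna-Q: direct Q-learning updates plus planning from a memorised model."""
    Q = defaultdict(float)     # Q[(state, action)] -> value estimate
    model = {}                 # model[(state, action)] -> (reward, next_state, done)
    actions = list(range(env.action_space.n))

    def epsilon_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(s)
            s2, r, done, _ = env.step(a)

            # Direct RL step (identical to one-step Q-learning).
            target = r + (0 if done else gamma * max(Q[(s2, b)] for b in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])

            # Learn the model: simply memorise the observed transition.
            model[(s, a)] = (r, s2, done)

            # Planning: replay randomly chosen remembered transitions through the same update.
            for _ in range(n_planning_steps):
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                ptarget = pr + (0 if pdone else gamma * max(Q[(ps2, b)] for b in actions))
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])

            s = s2
    return Q
```

With `n_planning_steps=0` this reduces to plain Q-learning, which is what makes the comparison with experience replay so direct.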

The basic Dyna-Q approach creates a tabular model. It does not generalise to predicting outcomes for previously unseen state, action pairs. However, this is relatively easy to address - simply feed the experience gathered so far as training data into a function approximator and you can create a learned model of the environment that attempts to generalise to new states. This idea has been around a long time. Unfortunately, it has problems - planning accuracy is strongly influenced by the accuracy of the model. This applies to both background planning and looking forward from the current state. Approximate models like this have to date typically performed worse than simple replay-based approaches.
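A sketch of that idea, under the assumption that states and actions can be encoded as fixed-length vectors; the scikit-learn regressor is just one possible choice of function approximator, and all names here are illustrative rather than a reference implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_dynamics_model(transitions):
    """Fit approximate next-state and reward models from a list of
    (state_vec, action_onehot, reward, next_state_vec) tuples."""
    X = np.array([np.concatenate([s, a]) for s, a, r, s2 in transitions])
    y_next = np.array([s2 for s, a, r, s2 in transitions])
    y_reward = np.array([r for s, a, r, s2 in transitions])

    next_state_model = RandomForestRegressor(n_estimators=100).fit(X, y_next)
    reward_model = RandomForestRegressor(n_estimators=100).fit(X, y_reward)
    return next_state_model, reward_model

def simulate_step(next_state_model, reward_model, state, action_onehot):
    """One imagined step, usable for background planning or short look-ahead rollouts."""
    x = np.concatenate([state, action_onehot]).reshape(1, -1)
    return next_state_model.predict(x)[0], reward_model.predict(x)[0]
```

The weakness mentioned above shows up exactly here: any error in `simulate_step` compounds over multi-step rollouts used for planning.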

This general approach - learn the model statistically from observations - can be refined and may work well if there is any decent prior knowledge that restricts the model. For example, if you want to model a physical system that is affected by current air pressure and local gravity, you could have free parameters for those unknowns, starting with some standardised guesses, and then refine the model of the dynamics as observations are made, with strong constraints on the form it will take.
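As an illustration of such a constrained model, here is a hedged sketch that fixes the functional form of the dynamics (gravity plus linear drag, a hypothetical choice for this example) and only fits the unknown constants from observed transitions using SciPy.

```python
import numpy as np
from scipy.optimize import least_squares

def predicted_next_velocity(v, dt, params):
    """Assumed model form: constant gravity plus linear drag; only g and k are free."""
    g, k = params
    return v + (-g - k * v) * dt

def fit_physics_params(observed_v, observed_v_next, dt, initial_guess=(9.8, 0.1)):
    """Refine the free parameters so the constrained model matches observed transitions."""
    observed_v = np.asarray(observed_v)
    observed_v_next = np.asarray(observed_v_next)

    def residuals(params):
        return predicted_next_velocity(observed_v, dt, params) - observed_v_next

    result = least_squares(residuals, x0=initial_guess)
    return result.x  # best-fit (g, k)
```

Because the model can only vary within that fixed form, a handful of observations is usually enough to pin it down, which is the benefit of the prior knowledge.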

Similarly, in games of chance with hidden state, you may be able to model the unknowns within a broader well-understood model, and use e.g. Bayesian inference to add constraints and best guesses. This is typically what you would do for a POMDP with a "belief state".
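A minimal sketch of that belief update, assuming a discrete POMDP whose transition and observation probabilities are already known or estimated; the array layout below is one common convention, not something prescribed by the answer.

```python
import numpy as np

def update_belief(belief, action, observation, T, O):
    """
    Bayesian belief update for a discrete POMDP.
    belief: prior over hidden states, shape (S,)
    T[a]:   transition matrix for action a, shape (S, S), T[a][s, s2] = P(s2 | s, a)
    O[a]:   observation matrix, shape (S, Z), O[a][s2, z] = P(z | s2, a)
    """
    predicted = belief @ T[action]                         # predict step: P(s2 | belief, a)
    unnormalised = predicted * O[action][:, observation]   # weight by observation likelihood
    return unnormalised / unnormalised.sum()               # normalise to get the posterior belief
```

The resulting belief vector is then what a model-based planner reasons over, in place of the unknown true state.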


Both of the domain-specific approaches in the last two paragraphs can be made to work better than model-free algorithms alone, but they require deep understanding/analysis of the problem being solved by the researcher, in order to set up a parametric model that is flexible enough to match the environment being learned, but also constrained enough that it cannot become too inaccurate.