### Assignments

**See the exercises page for assignment results!**

**Assignment 1: Search. **The objective of the assignment is to familiarize with state-space search algorithms. **Given out:** January 19, **Return:** Thursday February 2 at 23:59

- Implement state representation for 8-puzzle: a typical representation would be a vector for 3 X 3 grid cells, indicating what tile is in each cell (the number of the tile) or if the cell is empty (denoted by 0). Include the possibility of using any number of tiles from 1 to 8.
- Implement a h-function for the 8-puzzle (the one based on Manhattan distances from the lecture.)
- Implement A*, WA*, and Greedy Best-First Search.
- You can use any programming language you prefer.
- Test each of the three algorithms with different number of tiles (1 to 8), from randomly generated initial states (have a look at the Wikipedia article on the 8-puzzle: not all initial configurations are solvable!)
- Collect statistics on the runs: number of nodes expanded and runtime.
- Report your results in a short document (text file or PDF), 1 or 2 pages.
- Return the report and all of your source code in a single archive file (tar, .gz, .Z, .bz2).
- The
**name**of the archive file is**your student ID**followed by the extensions of the archive file format .tar, .gz, .Z, or .bz2. - Email the archive file to
**jussi.rintanen@aalto.fi**. - The deadline is
**23:59 on Thursday, February 2**.

- The

** Assignment 2: Logic and Constraint Programming** The objective of the assignment is to introduce to the use of logic and constraints in problem solving.

**Assignment 3: Automated Planning. **The objective of the assignment is to introduce model-based methods: modeling languages, model-based problem solving with general-purpose search methods.

**Assignment 4: Decision-Theoretic Planning and Reinforcement Learning.** The assignment introduces to decision-making with stochastic models. **Deadline:** March 31

**The objective of the assignment is to introduce Constraint Satisfaction as a general framework for formalizing and solving both static (untimed) and dynamic (timed) problems.****Given out:**February 2. A**ssignment emailed to all registered participants, and also accessible as http://rintanen.info/ASS2/***YOURPERSONALSTUDENTID*.pdf.**Return:**Sunday February 19 at midnight 23:59:59**Downloads: grounder (64-bit Linux binary executable) grounder (Mac executable), zebra.fma, sudoku.fma, robotmoves.fma, MathSAT (website) language documentation (PDF)**Write**Instructions:****solution specification as instructed in the lecture sl**ides. Feed the specification (example:*zebra.fma*) to the grounder to produce a corresponding SMT file, which is fed to MathSAT to obtain a solution valuation (truth-values for all atomic propositions).**.****/grounder zebra.fma > zebra.smt****./mathsat -model < zebra.smt > zebra.output****There may be a lot of output, and to interpret the**valuation given by MathSAT,**typically it is us**eful to look at only those lines that correspond to true atomic propositions.**grep true zebra.output****Returning the assignment:**Submit your .fma file, your .smt file, and a short document that shows the solution produced by MathSAT in human-readable form, including an explanation of what the solution means.*Submit the files separately (not as an archive!)*, with your student ID as the prefix of the filenames.The goal of the assignment is to familiarize with model-based problem solving and specification languages that express transition system models. Planning / sequential decision-making problem is formalized in the NDL language, problem is solved with the

*C**elebes*planner, and the solutions are reported as before (NDL specification with comments, output plan, explanation/illustration of the solution plan.)**Obtaining the assignment:**The assignment (personal) was emailed to all registered participants. It is also accessible as http://rintanen.info/ASS3/*YOURPERSONALSTUDENTID*.pdf.**Returning the assignment:**Through the Assignments page on MyCourses, as**separate files**(ndl + PDF or txt file(s)).**Deadline:**Thursday March 9, 2017 at 23:55 (for bonus points)**Instructions:**The NDL specifications are written with a text editor, and then fed to the Celebes planner by writing e.g../celebes rushhour.ndl

Celebes will parse and simplify the specification, and then repeatedly call MathSAT to find a plan, and output it. The search strategy is affected by the maximum amount of time allowed: the default is 1800 seconds, and often plans can be found (much) faster by reducing the max time limit, e.g. with the command line option

*-T 200*.When Celebes finds a plan, it outputs it. Each line of the plan output first contains the "step" of the plan. The steps are numbered from 0 up, and there may be multiple actions at one step (or a step does not necessarily have any actions.) Actions within a step can be ordered arbitrarily, as they are independent (according to the interference condition explained in the lecture.)

MathSAT does not know that the number of actions in a plan should be minimized, so it returns any plan of a given horizon-length that reaches the goals. So the plan may contain unnecessary actions. You force it to find a plan with the minimal horizon length with the option -s 1 (so that horizon length is increased always by 1, not by 5.) This tends to reduce the number of unnecessary actions.

If you encounter problems getting Celebes to solve your problem, you can investigate what happens in Celebes' front end, for example to make sure that some actions or state variables are eliminated because of irrelevance or unreachability (possibly indicating a modeling error.) You can inspect the grounded and simplified specifications by adding the command line option

*-v 2*(for verbosity level 2). If you want to see the formulas fed to MathSAT, use the command line options*-s 3 -t 5 -O*to generate .smt files*output.{0,1,2,3,4}.smt*for horizon lengths 0, 3, 6, 9, 12.If the planner quickly generates a long sequence of "UNSAT" (for a problem that should have short and simple solutions), there is most likely no solutions, i.e. the goals cannot be reached for some reason (modeling error?). See the planner intermediate results (grounded and simplified actions with -v 2) to see what might be going wrong, or try out simpler goals (with correspondingly simpler plans) to see that things work as expected.

The assignment can be solved with a narrow subset of NDL features. For some of the assignments it is enough to use Boolean state variables, some others benefit from integers. All can be solved without IF-THEN-ELSE statements, but some may benefit from them.

The solutions are not overly complicated: In general, if you think you need very many actions (more than 10) or the actions are getting complicated (several lines, complex constructs, ...), you are probably doing something wrong or in a too complex way. Think about what the essence of the solution is, and try to come with a small number of understandable actions.

**Notice:***The celebes binary is going to be updated occasionally (bug fixes etc.), so if you encounter problems, please check if there is a new version available. If you encounter bugs, cryptic error messages, other unexplainable behavior, please report so that I can investigate (and fix).***Downloads:**celebes (64-binary for Linux March 13 17:17; write "chmod a+x celebes" to make the file executable), NDL documentation (PDF), examples: rushhour.ndl, simpleelevator.ndl , 8puzzle.ndlThe components of this assignment are:

- Model a grid navigation problem as a Markov Decision Process
- Implement the Value Iteration algorithm, and test it with the navigation problem.
- Implement the Q-learning algorithm, and test it with the grid navigation problem (the algorithm only knows about the state-space and the actions, not about the transition probabilities and rewards in the MDP.)

The grid is the one familiar from Exercise 6, with a small modification. All move actions in the NE corner cell result in teleporting the agent to the START-cell in the SW corner with probability 1.0. Reaching the cells marked with +1 and -1 have the respective immediate rewards, and all other cells have immediate reward 0. There are 4 move actions, for North, South, East and West, with which the desired move is achieved with probability 0.8 (if move in that direction is possible), and otherwise with probability 0.1 one ends up to the left, and with probability 0.1 to the right to a position 90 degrees from the desired direction (as shown in the figure below.) If move to some of these directions is not possible, the move action has no effect.

First, implement Value Iteration and calculate the values of all states (cells) with sufficient accuracy.

Then implement the Q-learning algorithm to investigate its convergence, that is, how many steps you have to run the algorithm to learn Q-values with which the best action for each cell/state is close to the values of the corresponding MDP (as calculated by the Value Iteration algorithm.) Consider the values to be "close" if they differ by no more than 0.1. (You may have to experiment with this parameter, to terminate the computation in a reasonable amount of time and to get a good quality policy as a result.)

The execution of the Q-learning process starts from the START-cell. For every cell/state encountered, an action is chosen using a decision rule (see next paragraph), and then your code randomly chooses one of the possible next states with this action according to the probilities of the possible successor states (you must use the random number generator to choose the successor state), and then update the Q-value of the state-action pair.

For the decision rule in choosing which action to take, use the optimal Multi-Arm-Bandit rule from Lecture 9. For this you have to store the number of times each action has been taken so far in each cell/state, and the sum of the values of that action has had in s (in order to compute their average). The value of the action a in state s with successor state s' is R(s,a,s') + γ·max

_{a'}Q(s',a') (immediate reward + highest possible future discounted rewards.)For both Value Iteration and Q-learning use discount factor γ=0.95. For Q-learning use λ=0.01.

Optional: Experiment with different values for λ in Q-learning.

Reporting:

- Illustrate the optimal MDP policy you found with VI (and that should be the same as the one you found with Q-learning), indicating which way to go in each cell, and the value of every grid cell.
- To illustrate convergence of Q-learning, plot the Q-value of the best action for the START-state as a function of the number of iterations of the Q-learning algorithm (and indicate the value of the START-state in the optimal MDP policy in the plot, too). The number of iterations needed for Q-learning is high, tens of thousands.
- Return the assignment as a PDF document that includes the above plots, and all the files for your program code. (ALL FILES SUBMITTED SEPARATELY, NOT AS AN ARCHIVE!)

- Model a grid navigation problem as a Markov Decision Process