## Abstract

State-space reduction is one of the most important acceleration techniques in robotics solvers. Techniques like sub-policies, learning from demonstration, qualitative physics and natural language instructions are used together with classical RRT planning. This blog post introduces some of these concepts for a general audience interested in programming dexterous robots with OpenAI gym.

## Table of Contents

1 Reinforcement Learning

1.1 Learning vs. Planning

1.2 Hidden Markov Model

1.3 Reinforcement Learning plus natural language instructions

Literature

1.4 Q-Learning

1.5 Function approximation for inverted pendulum

1.6 Control policy explained

2 Neural Network

2.1 How powerful is deep learning?

2.2 Qualitative physics with neural networks

2.3 Recurrent neural networks

2.4 Combining planning with neural networks

associative memory

3 OpenAI gym

3.1 Very basic OpenAI gym agent

3.2 OpenAI gym with sub-policy

4 Learning from Demonstration

4.1 Does learning from demonstration make sense?

4.2 Micromanipulation planning

4.3 Learning of a dynamic system

4.4 Robobarista

4.5 Static motion primitive

4.6 Reward function in Inverse Reinforcement learning

4.7 Learning from Demonstration

4.8 Sensor-Motor-Primitives

## 1 Reinforcement Learning

## 1.1 Learning vs. Planning

In the robotics domain the word “learning” is used often, for example in papers about “trajectory learning”, “biped balance learning” or “motion learning”. It seems that everybody uses the word, but only few can explain why. It is possible to explain the term in detail. The word learning comes from early AI history in the 1950s. At that time, AI was called cybernetics and was a sub-discipline of psychology. The question of that era was how humans and animals think. The famous example was a maze experiment with a mouse: the mouse sits in a maze, walks around, searches for the cheese, and while doing so it learns the labyrinth. That means the environment is stored in the memory of the mouse’s brain.

The early AI scientists tried to reproduce this behaviour with robots and computers. The idea is to replace the mouse with a robot, let it take random steps in the maze, and store the sensor information inside an LSTM network. “Trajectory learning” and “motor control learning” amount to treating a robot like a mouse.

But the concept has a huge drawback. It is not focused on results or technical problems; instead it is based on neural networks. Other problem-solving techniques which could also bring the robot to the goal are missing. The alternative to learning is planning. The aim is the same, to bring the robot to the goal, but planning is a technique realized by algorithms, not by psychological models. A planning algorithm could be, for example, A*. A* is not derived from biology or psychology; it was invented from scratch. It doesn’t exist in nature, so it is not really part of the natural sciences.

The contrast between learning and planning is a historical relict of the contrast between GOFAI and Narrow AI. GOFAI is learning-based, which corresponds to cybernetics, while modern Narrow AI is done with planning and engineering techniques invented from scratch. At many universities today, robot learning is very popular. But not because it is so superior; rather because the professors who teach it are very old. They learned AI in the 1950s together with psychology and can’t imagine that something different is possible.

The question that has to be answered is: what do we want? Is the aim to understand a mouse while it runs through a labyrinth, or do we want to build robots which are competitive?

In theory it is possible to create a maze-learning robot, a robot which memorizes the obstacles and possible moves in a neural network and retrieves the information to find the way out. But in reality a working robot was never presented, and perhaps it is too difficult to build such a system. Instead it is easier to build a planning robot which runs a normal algorithm. Such systems are reliable and can be bug-fixed.

GOFAI and Narrow AI work on the same subject: artificial intelligence. The difference is the precondition. GOFAI tries to rebuild nature: first human and animal intelligence is studied and then reproduced with machines. Narrow AI starts with a machine and then tries to make it intelligent. The difference between the two is how new knowledge is generated. In GOFAI, inspiration is based on an understanding of nature and on other scientific disciplines like biology. Narrow AI, in contrast, is a social discipline detached from the natural sciences. A typical way of finding new knowledge in Narrow AI are robot challenges, in which different teams program their robots and try to be better than the opponents. The standard Lego Mindstorms competition has nothing to do with mathematics, physics or biology in the classical sense; it is more like a poem-writing challenge.

## 1.2 Hidden Markov Model

Until now the open question is how to program the motion primitives in a robot control system. One possibility is procedural animation, which is the same as what Craig Reynolds described under the term “steering”. [6] From a computational point of view, steering is indeed the best idea: there is a C++ method called “drive”, this function calculates something, and as a result the robot moves forward.

The steering function is normally used for controlling the direction of a car. The car has a position and a direction, and there is a goal at a certain angle. A formula calculates the new angle of the wheel and the car drives to the goal. There is only one problem: the formula must be programmed, and in most cases this is very difficult.

I want to give another example: biped walking. According to the literature, the best-practice method here is the ZMP (zero moment point) method. This is a physical model to calculate the servo motor commands for a walking robot, comparable to the steering of a car. But the problem is that the formula is very general, which means it cannot calculate exactly the right parameters. To do so, the model would have to be more complicated. The formula is only an approximation.

The question is how to deal with this uncertainty, and here the hidden Markov model (HMM) comes into play. In general, an HMM is a probabilistic algorithm, which means that in every run the result is different. An HMM is a pseudorandom generator. Ok, let us go a step back. A random generator prints a random number to the screen. Pseudorandom means that randomness is there, but only in a small portion. An example of a pseudorandom generator is:

```python
from random import randint

print(randint(3, 9))
```

This is the Python source code for printing out numbers between 3 and 9. On one hand it is unclear what the next number will look like; on the other hand it is clear that it will be within a range. An HMM can be called an advanced pseudorandom generator. Inside the model we have a table which is similar to the q-matrix in q-learning, for example this one:

         to 1   to 2   to 3
from 1    0      0      0
from 2    0      0      0
from 3    0      0      0

Instead of the zeros, the transition probabilities for switching between the states are given. After executing the table, the system gives us random numbers, but they also have a structure. With HMMs and other stochastic models like LSTMs it is possible to generate noisy output. It is the same principle as in the Python range randomizer, in which the time series stays within a certain band.

The remarkable aspect is that on one hand we have modelled the system, and on the other hand we have not, because important aspects of the model remain unspecified. Instead they are determined by randomness, which is equal to “we don’t know”. So a pseudorandom generator is a hybrid model which is partly specified by meaning and partly probabilistic.
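To make this idea of structured randomness concrete, here is a minimal sketch of such a generator in Python. The states, the transition probabilities and the function name are invented for illustration:

```python
import random

# Toy transition table: row = current state, entry = probability of
# switching to state 0, 1 or 2 next. All numbers are made up.
TRANSITIONS = [
    [0.8, 0.2, 0.0],   # from state 0
    [0.1, 0.6, 0.3],   # from state 1
    [0.0, 0.4, 0.6],   # from state 2
]

def sample_sequence(start, steps):
    """Draw a random state sequence which still follows the table's structure."""
    state, sequence = start, [start]
    for _ in range(steps):
        state = random.choices([0, 1, 2], weights=TRANSITIONS[state])[0]
        sequence.append(state)
    return sequence

print(sample_sequence(0, 10))  # random, but forbidden jumps like 0 -> 2 never occur
```

Every run prints a different sequence, yet every transition respects the table: exactly the hybrid of “specified by meaning” and “probabilistic” described above.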

## 1.3 Reinforcement Learning plus natural language instructions

The reinforcement learning technique is used to generate a policy on the fly. That means that after the learning step the agent can solve the task on its own. For generating the policy, only some CPU resources and a reward function are needed. The easiest example is an agent inside a maze, but the same principle can be used for a grasping task; there, the optimal control problem plays the role of the maze.

The advantage of reinforcement learning is that the task does not have to be specified by hand, so reinforcement learning can be called a function approximation. The main disadvantage of the concept is that for bigger problems the state space grows rapidly, so that even the fastest CPUs are unable to find the right q-table.

But it is not necessary to solve bigger problems with reinforcement learning. High-level aspects of a game can be handled with another technique called “natural language instructions”; only the motion primitives must be calculated automatically. That means the agent does not have to learn how to clean the kitchen, it only needs to learn simpler tasks like grasping objects, releasing objects and walking to a place. In the literature the concept is often called multimodal or hierarchical reinforcement learning, which means that on the top layer the user types in commands like “grasp” or “open gripper”, and on the lower layer every command is connected to a q-table which executes the action.

Let us go into the details of the motion primitive. A simple motion primitive is a push action. The robot finger has a position (x, y) and the aim is to push an object. The question is: what are the right parameters? One trajectory of the robot could be (10,0) → (12,0) → (14,0), another could be (10,0) → (10,2) → (12,4), but other trajectories are also possible. The abstract problem description is that the robot finger can do something in the x-y space and the system is affected by this. Somewhere at the end, the reward function signals that the goal was reached. It is a classical reinforcement learning problem, and all motion primitives can be described in that way: the robot hand always has some degrees of freedom to act, this affects the system, and at the end the reward is given for completing the task.

A normal robot hand consists of more than one finger, so we have a multi-agent system. The amount of literature on this topic is smaller, but it seems that the task can still be solved; the difference is that the state space is bigger.

The classical example in q-learning is the inverted pendulum. Instead of defining the control rule explicitly, the algorithm determines the q-table by itself. The remarkable thing is that it works: after some trials a stable q-table is found. If the length of the pendulum is different, the q-table has to be calculated from scratch.

But I want to go a step back. The inverted pendulum problem consists first of an algorithm, that is, a rule for what to do in which situation. For example, the pendulum is on the left and it is falling downward; the player must react properly. The final q-table takes such decisions: it stores the rules for every situation. The second aspect of the problem is how to find the q-table. This is done by the reinforcement learning algorithm, which is mostly a search algorithm for maximizing the reward function.

Let us take a look at a perfect q-table. The q-table controls the pendulum. Surprisingly, the q-table does not consist of mathematical equations or source code, but of state-action pairs.

**Literature**

I’m not the first author to describe a mixture of natural language instructions plus reinforcement learning. In the Robobarista project this idea was implemented on the PR2 robot. [5] The first half of the paper is nothing special: the author explains how a recurrent network works. The innovative aspect of the paper is that the network is connected to 230 words in a dataset, which is a kind of multimodal learning. With that feature it is possible not only to generate a single trajectory but to solve complex tasks.

It is not the first paper which connects a neural network with “natural language instructions”. The same principle was used for solving Atari games [3], which was also published in the year 2017. But the Robobarista paper goes a step further, because in that project a real robot was used in addition.

The principle does not only solve internal problems of artificial intelligence, like generating a trajectory with a neural network; this time the PR2 robot solves a problem in a real environment. Even for people who are not interested in robotics, the video looks amazing.

## 1.4 Q-Learning

The short answer to what q-learning is has to do with genetic programming: a given small program is improved by an algorithm to solve a problem. In the following chapter the inner workings are explained in detail.

A Markov decision process (MDP) is equivalent to a probabilistic Turing machine. It is an automaton which does the same as a C++ program does. Normally an MDP has the structure of a q-table, which means there are states and for every state some possible actions. If an action has a probability of 0.5, that means that in 50% of all cases the action is called. The principle can be called a learned program, because it is not necessary to code it manually; instead the q-learning algorithm finds the answer with data mining.

+--------+----------+----------+----------+
| state  | action 1 | action 2 | action 3 |
+--------+----------+----------+----------+
| 1      | 0.2      | 0.5      | 0.3      |
+--------+----------+----------+----------+
| 2      | 1.0      | 0.0      | 0.0      |
+--------+----------+----------+----------+
| 3      | 0.33     | 0.33     | 0.33     |
+--------+----------+----------+----------+
Figure 1: q-table

Let’s take a look at the figure “q-table”. The automaton has three possible states. If the current situation equals state 2, the automaton always selects action 1, because it is the only one with a nonzero probability. In the other states, the automaton uses a random generator weighted according to the probabilities in the q-table.
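This selection rule can be sketched directly from the figure; the helper name select_action is my own:

```python
import random

# Rows of Figure 1: for every state, the probabilities of actions 1, 2 and 3.
Q_TABLE = {
    1: [0.2, 0.5, 0.3],
    2: [1.0, 0.0, 0.0],
    3: [0.33, 0.33, 0.33],
}

def select_action(state):
    """Pick an action at random, weighted by the probabilities of the row."""
    return random.choices([1, 2, 3], weights=Q_TABLE[state])[0]

print(select_action(2))  # always 1: the other actions have probability 0
```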

So what can we do with this principle? The same as with genetic programming: evolving a program until a goal is reached. Like in genetic programming, the learning algorithm needs an extreme amount of CPU time to find an answer. But for small problems which are not too complicated, like steering to a point in a maze, the system works great.

A good way to recognize the power of q-learning is to compare it with procedural animation as described by Reynolds [6]. Reynolds manually programmed source code which controls the steering wheel of a robot. The source code consists of an equation like “goal = targetangle - sourceangle” plus some additional if-then statements. If the Reynolds car doesn’t steer correctly, the programmer must improve the algorithm at the source code level. In contrast, the q-learning algorithm stores the steering equation in a q-table, and this table is updated automatically.

So if q-learning is so powerful, why isn’t all computer software programmed with this technique? The answer has to do with complexity. Steering a car is an easy task which consists of a few steps; the q-learning algorithm is able to find the solution in a small amount of time. On the other hand, most problems in computing, like programming an operating system, are complex tasks. The state space is much bigger, and the q-learning algorithm wouldn’t find a solution in a short time.

What today’s artificial intelligence researchers are trying is to use q-learning and similar techniques as much as they can, because programming a q-learning system is easier than programming the control program by hand. Another example is the cartpole problem. In theory it is possible to solve the problem with procedural animation: first we need a mathematical equation which calculates the balance, and this is used to control the game. The disadvantage is that testing such a formula is very complicated, and if the problem is slightly different, perhaps a double inverted pendulum, the equation is wrong and must be found again. In contrast, the q-learning concept can be adapted to nearly all problems.
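To show the inner working without any framework, here is a minimal tabular q-learning sketch on a toy corridor world. The environment, the constants and the update loop are invented for illustration, not taken from a specific library:

```python
import random

random.seed(0)  # reproducible runs

# Corridor with states 0..4; action 0 = left, 1 = right; reward at state 4.
N_STATES, GOAL = 5, 4
ALPHA, GAMMA = 0.5, 0.9
Q = [[0.0, 0.0] for _ in range(N_STATES)]

def step(state, action):
    nxt = max(0, min(GOAL, state + (1 if action == 1 else -1)))
    return nxt, (1.0 if nxt == GOAL else 0.0)

for episode in range(200):
    state = 0
    while state != GOAL:
        action = random.randrange(2)   # explore with random actions
        nxt, reward = step(state, action)
        # classic q-learning update: move towards reward + discounted best next value
        Q[state][action] += ALPHA * (reward + GAMMA * max(Q[nxt]) - Q[state][action])
        state = nxt

policy = [max((0, 1), key=lambda a: Q[s][a]) for s in range(N_STATES)]
print(policy[:GOAL])  # every non-goal state should prefer action 1, "right"
```

The q-table is found by search, not programmed by hand; changing the corridor length only means re-running the loop, which is exactly the adaptability argued for above.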

## 1.5 Function approximation for inverted pendulum

The inverted pendulum problem is well known in the reinforcement learning community. It is an example of an optimal control problem. The policy for solving the task can be described as a state-action vector, which means that if the pendulum is in state 1 the correct action is “right”, in state 2 it is also “right”, and so on.

In q-learning terminology, the policy is written into the q-table; this can also be expressed in a graphical notation. Some videos show a graphical representation which has the current angle on the x-axis and a color code on the y-axis, symbolizing the correct action. A more abstract way to talk about the subject is to call the problem a function approximation task: in an x-y diagram some points are given and the task is to draw a line through them. If the function is executed on the inverted pendulum, it will stand upright.

And here it is possible to explain how to transfer this technique to more sophisticated problems. Normally the inverted pendulum problem is not very advanced; it is the standard problem given in every beginner tutorial about q-learning. A more advanced task is to control a robot hand. The fascinating aspect is that the technique is the same. The movements of the hand are captured by a data glove, and like in the q-learning task the next step is function approximation, that means finding a compact representation to describe the function and to interpolate between unknown points. There are many mathematical techniques for doing this. The simplest form is q-learning, better known as a q-table, but Fourier transformation, radial basis networks, neural networks or dynamic movement primitives are also possible. Sometimes the function approximation is done with principal component analysis. All of these techniques work on the same principle: first, points in an x-y coordinate system are given, and the algorithm searches for a function to connect the dots.

Now follows an explanation of how the robot itself works after it has learned the function. We go back to the inverted pendulum problem because it is easier to explain. The robot has some input: the current speed of the pendulum, the direction in which it is falling and the current angle. This is a description of the current situation, or more generally the input vector. Now it is up to the robot to take a decision. It can do nothing, or move the cart to the left or to the right. To get this information, the robot looks into a lookup table, which is also called a q-table. It searches for the state and sees in the row which action is the right one. This action is then executed.
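A pure lookup-table policy of this kind can be sketched in a few lines. The discretization bins and the stored actions are invented for illustration:

```python
# Actions the cart can execute.
LEFT, NOTHING, RIGHT = -1, 0, 1

# The "q-table" after learning: state -> action, nothing else.
POLICY = {
    ("left", "falling_left"):     LEFT,
    ("left", "falling_right"):    NOTHING,
    ("upright", "falling_left"):  LEFT,
    ("upright", "falling_right"): RIGHT,
    ("right", "falling_left"):    NOTHING,
    ("right", "falling_right"):   RIGHT,
}

def discretize(angle, velocity):
    """Map the continuous input vector onto the table's states."""
    region = "left" if angle < -0.1 else ("right" if angle > 0.1 else "upright")
    direction = "falling_left" if velocity < 0 else "falling_right"
    return region, direction

def act(angle, velocity):
    # no further calculation: look up the state, execute the stored action
    return POLICY[discretize(angle, velocity)]

print(act(-0.3, -0.5))  # pendulum on the left and falling left -> push left (-1)
```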

## 1.6 Control policy explained

In the area of reinforcement learning the term “policy” is used quite often. Normally it is some kind of function approximation between input state and output action. For example: if, in the inverted pendulum problem, the pendulum is on the left at angle -20, then the action is -1. The aim is to find the correct action for the complete state space, so that the robot knows in every situation what to do. In contrast to a normal computer program there is no further calculation; the policy is similar to a lookup table. A neural network is able to store the table with a high compression rate, so that millions of input-output situations can be stored.

## 2 Neural Network

## 2.1 How powerful is deep learning?

At first glance, not very. Even the newly invented Tensor Processing Units from Google only have a capacity in the teraflop range. On the hardware level they are very fast, but compared to the problems which have to be solved in robotics, the performance is not enough. I want to give an example. Suppose we want to calculate the shortest path between 100 cities. For finding the optimal solution, every currently available CPU is overwhelmed; the algorithm cannot run to completion. The reason is that the state space of the travelling salesman problem is huge.

But to call deep learning a waste of time is too pessimistic; instead, the brute-force power has to be used wisely. That means that before calculating the neural network itself, some pre-decisions have to be taken to reduce the problem space. Normally this is done with a high-level symbolic planner. In the computing literature this is often called PDDL planning or natural language instructions and means subdividing a problem into smaller parts. Instead of calculating what the robot should do for the next 60 minutes, the task is subdivided into an action like “grasp the object”. This motion primitive has to be solved with deep learning, and this works great. A small problem with a minimal state space is exactly the kind of task a deep learning GPU can solve.

Solving is another word for avoiding manual programming. Instead of entering a rule or formula which drives the robot arm to the object, a genetic algorithm is used for reaching the goal. Only the reward function has to be defined manually.

## 2.2 Qualitative physics with neural networks

Qualitative physics is a semantic description of physical events. For example, an inverted pendulum can have the event “is-falling-down”. In most papers qualitative physics is simply ignored because it seems too complicated. But it can be used to improve the learning speed of neural networks.

Normally an event in a qualitative physics model is given by the system. The event “pendulum-is-falling-down” can be derived from the angle. Such a semantic description can guide the learning progress of the neural network, in the sense that the network decides whether the event is important or not. So the idea is to give the neural network not only the minimum information, but all known information.

The concept was described in an early paper from 1992. [2] The paper is not very well written; many details remain unclear. I want to present the overall idea a bit better. First, we control a system manually. The example is to steer a car on a road with top-down physics. The result is that we drive the car to the goal. While we are driving, a logfile is generated. For every second it stores the position of the car, its direction and the steering wheel angle.

In the next step we add qualitative physics variables. The first event is called “car-is-near-border” and the second event is “car-is-in-curve”. Both events are optional; the game can be solved without knowing them, and in theory it is possible to derive them from the given information. But we decide to store them explicitly in the logfile. The result is that in the second case, in which more information is given, it is easier for the system to construct a model. Perhaps the reinforcement controller would reduce the speed automatically if the car is inside a curve. Driving the car with only the two qualitative events is not a good idea: the amount of information is too low, since every event can only be true or false, and that is not enough to drive a car. But in combination with the absolute positions and the other information it is possible to construct a controller.
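Appending such optional events to the logfile rows could look like the following sketch; the thresholds and names are invented for illustration:

```python
# Derive the two qualitative events from the raw signals. The thresholds
# are placeholders, not measured values.
def qualitative_events(y_offset, curvature, road_half_width=1.0):
    car_is_near_border = abs(y_offset) > 0.8 * road_half_width
    car_is_in_curve = abs(curvature) > 0.05
    return [float(car_is_near_border), float(car_is_in_curve)]

def logfile_row(x, y, direction, wheel, curvature):
    # absolute information first, then the optional qualitative events
    return [x, y, direction, wheel] + qualitative_events(y, curvature)

print(logfile_row(12.0, 0.9, 0.1, -0.2, 0.08))  # ends with 1.0, 1.0: near border, in curve
```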

The reason why this works is hard to understand. Let us first examine what a neural network is. A neural network uses data mining techniques to generate a controller. The relationships inside the data mining table are unknown; the neural network explores them by changing its weights. So the network decides which values are important and how to reach a decision.

In the literature the concept is sometimes labelled “linguistic variables”. It is possible to combine linguistic knowledge with neural networks:

“The neuro-fuzzy system uses the linguistic knowledge of fuzzy inference system and the learning capability of neural network” [8]

## 2.3 Recurrent neural networks

As an example I want to use the inverted pendulum problem, which is relatively easy to control. The plain vanilla strategy is that the input values like angle and velocity of the pendulum are fed into the neural network, the weights calculate something, and the output neuron prints 0 for moving the cartpole to the left or 1 for moving it to the right.

How can we improve the neural network? The only information provided to the network is the angle and the velocity. This is not enough. What the network really needs is information from the past and even from the future. Information from the past is easy to get: we take the angle and velocity from the previous time step. For example, the angle from the last frame was 44 degrees and the current one is 47 degrees, so the network knows in which direction the pendulum is moving.

The information from the future step is a bit more difficult to get. The angle of the next step is not known yet, because no control impulse has been sent to the system. But we can make a forward simulation and try both possibilities: we run the simulation one step forward with output neuron 0 for left and 1 for right, and we measure the angle in both cases. So we have the information of what the pendulum will do. The new input vector for the neural network is:

previous angle, current angle, future angle on action left, future angle on action right, velocity.

With this rich amount of information it is very easy for the neural network to solve the problem. It is informed about the past, present and future and knows the values of the angle and the velocity. The only task left is to bring all the information into an order and calculate the correct output for controlling the system.
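The enriched input vector can be built with a tiny forward model, as in this sketch. The one-step dynamics below are deliberately simplified placeholders, not a real cartpole simulator:

```python
import math

DT, GRAVITY, PUSH = 0.02, 9.8, 0.3   # toy constants, invented for illustration

def simulate_step(angle, velocity, action):
    """One step of a toy pendulum; action -1 pushes left, +1 pushes right."""
    velocity = velocity + (GRAVITY * math.sin(angle) + PUSH * action) * DT
    return angle + velocity * DT, velocity

def build_input_vector(prev_angle, angle, velocity):
    future_left, _ = simulate_step(angle, velocity, -1)
    future_right, _ = simulate_step(angle, velocity, +1)
    # past, present, both possible futures, plus the velocity
    return [prev_angle, angle, future_left, future_right, velocity]

print(build_input_vector(0.05, 0.07, 0.1))
```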

## 2.4 Combining planning with neural networks

So-called bidirectional neural networks use data from future states of a system for calculating the control signals. The future states can be retrieved with a forward simulation, which is normally done with RRT. Such a neural network is in reality both a classical physics planner and a neural network. Here are some details:

In the cartpole problem of OpenAI gym, two actions are possible: 0 = left, 1 = right. If we move the cart left, the system will be in a new state. How do we decide which direction is right? The best decision is based on as much information as possible: the current system, the past system, and what the system will look like if we perform a certain action. Additional “qualitative physics” information is useful.[footnote:[4] describes on page 9 a car which is controlled by a neuro-fuzzy system.] So our control policy is fed with many different pieces of information. Some of them are easy to get, for example the current angle of the pendulum. Others are a bit tricky to retrieve, for example the angle from the step before, and some are really hard to get, for example the state the system will be in if we use the 0 = left action. This information can only be obtained with a forward simulation.

The overall input signals are stored in a long array. This is far bigger than the standard observation array normally used in OpenAI gym. The question is how exactly we use this information for calculating the control signal. First we store the information in a database. Then we try out some policy, which can be random or generated with learning from demonstration. This enormous dataset is now fed into the neural network. As a result it will learn how to combine the information into the perfect policy.

I do not believe that a complex neural network architecture like LSTM or deep q-learning is necessary. Instead, the minimal example is a single neuron which has 100 input signals and one output signal. The input signals are the rich information described above, and the output is the control signal to the system.
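A minimal sketch of that single neuron; the weights here are random placeholders, in practice they would have to be trained:

```python
import random

def single_neuron(inputs, weights, bias=0.0):
    """Sum up the weighted inputs and threshold: 1 = push right, 0 = push left."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0

rich_inputs = [random.uniform(-1, 1) for _ in range(100)]   # the 100 input signals
weights = [random.uniform(-1, 1) for _ in range(100)]       # untrained placeholders
print(single_neuron(rich_inputs, weights))
```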

Why are the input data so important? Let us examine an example. The pendulum is nearly on top and we must decide what to do next. If we have a simulation environment for testing what-if scenarios, the answer is simple: we test action 0 and action 1, and if one of the actions generates a higher score we take it. This forward simulation can be done with RRT. Here is the result of the simulation:

action0 -> 50 points

action1 -> 100 points

If we take this information as an input, the neural network can be very simple: because 100 is greater than 50, action1 is better.
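With the simulated scores as input, the “network” collapses to a comparison; the function name is my own:

```python
def pick_action(simulated_scores):
    """simulated_scores[a] = points the forward simulation predicts for action a."""
    return max(range(len(simulated_scores)), key=lambda a: simulated_scores[a])

print(pick_action([50, 100]))  # action1 scores higher, so the result is 1
```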

The interesting thing is that in a standard neural network this information is not available. Usually a neural network knows only the current situation and can’t access the data of an RRT simulation. So the neural network must internally, with lots of weights, compute some policy to determine the right action, even though it doesn’t know what happens next. Is this assumption meaningful? Is it necessary to withhold the information? No, it is a rhetorical question; mostly this is done without any discussion about it.

In the literature the concept is not very common. The neural network type which uses information from the future is called a bidirectional neural network. If information from the past is used, it is a recurrent neural network. If information from qualitative physics is used, there is no established name for the network, and the capability of language understanding is called “hierarchical reinforcement learning”. If we combine all of them, our network could be called:

“hierarchical bidirectional recurrent neural network with linguistic variables”

The general idea is to extend the number of input values and to use a standard perceptron as the neural network. It is known that a perceptron can solve easy problems with a bit of training, and we as programmers must only ensure that the problem stays easy. I want to give another example of how to make the bidirectional neural network work.

For simplicity, let the cartpole problem in OpenAI gym be reduced to one input variable: the current angle of the pendulum. So the structure is:

angle -> neural network -> output neuron

The task of the neural network is to calculate the output according to the current angle. This task is very complex. We can make it easier if we help the network a bit: we test our game engine with different actions and measure the future angle. So the structure is:

(angle current, future angle on action left, future angle on action right) -> neural network -> output neuron

This time the task is easier to solve. The number of weights can be smaller, and the learning process takes less time. We can help the network even more, for example by also calculating the angle two steps into the future. Then it is no longer a classical neural network; it is more a planning algorithm which is tuned by a neural network.

**associative memory**

It is possible to increase the abstraction level a little bit. The input neurons of a neural network can be imagined as an associative memory, on whose contents the neural network performs simple operations. Instead of using a perceptron-like neural network, the more general idea is to use a stack-based Turing machine. The input signals are stored on the stack in a linear order; then the program runs, does something, and at the end a result is printed out. Finding such a computer program is done with genetic algorithms which test many possibilities. The reason why machine learning uses neural networks and not Turing machines has to do with the fact that a neural network can calculate the result more quickly: instead of executing a complex algorithm, every neuron sums up its inputs and that’s it.

Neural networks are not fully Turing-complete, but they are good enough for easy tasks. They have more in common with function approximators.

## 3 OpenAI gym

## 3.1 Very basic OpenAI gym agent

In its standard version, the OpenAI gym software has only a random-action agent. That means there is no policy and the task is not solved. The alternative is to use a table of weights and multiply them with the current state vector. The easiest way of finding the weights is a random-sampling algorithm: in every step the weights are initialized at random, and if the reward is better than in the last episode, the new weight vector is used as the policy.[footnote:[https://www.pinchofintelligence.com/getting-started-openai-gym/]]

The concept is not totally new; it is the simplest form of a neural network. Four input numbers are mapped to four weights, and no further layer is used. The interesting thing is that after some iterations a better weight combination is found. How can the policy be improved? The first step is to replace the weight matrix with a more powerful Turing machine. The simplest one is used in the busy-beaver challenge. A Turing machine can also be imagined as a weight matrix, but it can do more tasks.
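A minimal sketch of this random-sampling agent, assuming a toy stand-in environment instead of the real gym CartPole (the dynamics inside `run_episode` are invented for illustration; with the real library you would call `env.step` instead):

```python
import random

def run_episode(weights, rng):
    """Run one episode with a linear policy: the dot product of
    weights and state chooses the action. Toy dynamics stand in
    for gym's CartPole."""
    state = [rng.uniform(-0.05, 0.05) for _ in range(4)]
    reward = 0
    for _ in range(200):
        action = 1 if sum(w * s for w, s in zip(weights, state)) > 0 else 0
        # invented dynamics: the pole angle (state[2]) drifts, the action tilts it
        state[2] += 0.01 * (1 if action == 1 else -1) + 0.002 * state[2]
        if abs(state[2]) > 0.2:   # pole fell over, episode ends
            break
        reward += 1
    return reward

rng = random.Random(0)
best_weights, best_reward = None, -1
for _ in range(50):
    weights = [rng.uniform(-1, 1) for _ in range(4)]   # fresh random guess
    reward = run_episode(weights, rng)
    if reward > best_reward:                           # keep only improvements
        best_weights, best_reward = weights, reward
print(best_reward)
```

Note that there is no gradient anywhere: the "learning" is nothing more than keeping the best random guess so far.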

The next question is how to find the parameters for a Turing machine. The problem itself is called genetic programming, because on every iteration the performance gets better. The main problem is that it only works for small problems with small input vectors. What the reinforcement learning and genetic programming communities are doing is searching for algorithms, weights and solvers that find the solution faster. Mostly this is not possible, because machine learning itself has limits.

On one hand it is right that a randomly initialized busy-beaver machine evolved by brute force is not very powerful. On the other hand, even sophisticated deep learning algorithms are not much better. They can learn faster, but not as much faster as needed. I would guess that if a naive brute-force solver which adapts the weights at random needs 100 seconds, then the best deep learning algorithm needs perhaps 20 seconds for the same task.

The conclusion is that machine learning can be reduced to a simple parametrized Turing machine whose parameters are evolved by an inefficient algorithm.

## 3.2 OpenAI gym with sub-policy

The best environment for researching in detail how to implement a robot control system is the OpenAI gym software. A broad community has formed around this tool, and the games are standardized. As a consequence, discussions about problems and solvers are easier. If somebody wants to tune his neural network to solve Pacman inside OpenAI gym, he doesn't need to explain what a neural network or Pacman is; it is immediately clear what he wants. The disadvantage of OpenAI gym is that only Python is supported, not C++. But that small shortcoming can be ignored.

So what has the OpenAI gym community found out about how to solve the games? The more advanced techniques are called sub-policies. This is a form of multi-modal learning in which a bot supports different commands. For example, a biped walker can stand up, walk forward and jump. Each command is learned separately, and at the end an additional layer decides which sub-policy has to be activated. With that strategy it is possible to build more complex bots.
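The sub-policy idea can be sketched as follows. All policy names, actions and the gating rule here are hypothetical placeholders; in a learned system each sub-policy and the gating layer would be separate trained networks:

```python
# Each command is its own policy (placeholder implementations).
def stand_up(state):
    return "push_legs"

def walk_forward(state):
    return "swing_leg"

def jump(state):
    return "extend_both_legs"

SUB_POLICIES = {"standup": stand_up, "walk": walk_forward, "jump": jump}

def gating_layer(state):
    """Decide which sub-policy is active. In a learned system this
    would be an extra network; here it is a hand-written rule."""
    if state["fallen"]:
        return "standup"
    if state["obstacle_ahead"]:
        return "jump"
    return "walk"

def act(state):
    name = gating_layer(state)
    return name, SUB_POLICIES[name](state)

print(act({"fallen": False, "obstacle_ahead": True}))
```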

Another idea for improving the AI is to use "qualitative physics". That means the observations from the game are enhanced with linguistic variables; the programmer tries to encode additional knowledge into the game. An example: in the Pacman domain, the raw data is fed into an event parser. Possible events are enemy-is-near and pacman-on-border. An event can be true or false, and the calculation is done manually in source code. The event variables are fed into the neural network alongside the raw data, so the network can learn quickly. The third option which improves the performance of bots dramatically is dedicated GPU hardware with teraflop-range performance. All three strategies combined (sub-policies, qualitative physics and GPUs) produce high-end AI which can solve many games.
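A sketch of such an event parser for the Pacman example. The coordinate-based rules below are my own guesses at how `enemy-is-near` and `pacman-on-border` could be hand-coded; the grid size and distance threshold are invented:

```python
def parse_events(pacman_pos, enemy_pos, grid_size=10):
    """Hand-coded linguistic variables (hypothetical rules)."""
    # Manhattan distance of 2 or less counts as "near"
    enemy_is_near = (abs(pacman_pos[0] - enemy_pos[0]) +
                     abs(pacman_pos[1] - enemy_pos[1])) <= 2
    pacman_on_border = (pacman_pos[0] in (0, grid_size - 1) or
                        pacman_pos[1] in (0, grid_size - 1))
    return [int(enemy_is_near), int(pacman_on_border)]

def network_input(raw_observation, pacman_pos, enemy_pos):
    # raw data plus the linguistic variables as extra inputs
    return list(raw_observation) + parse_events(pacman_pos, enemy_pos)

x = network_input([0.2, 0.7, 0.1], pacman_pos=(0, 4), enemy_pos=(1, 5))
print(x)
```

The network never has to rediscover the concept "near"; it gets it as a ready-made binary input.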

The reason why reinforcement learning is so powerful is that manually coded heuristics can be combined with machine learning. A plain machine learning bot is not very efficient: it uses a trial-and-error strategy and must try many weights until the neural network is able to solve a problem. But if a programmer gives some short hints, like possible actions and linguistic variables, the neural network will learn much more quickly. The interesting aspect is that the human programmer does not have to understand the game in detail, and there is no need to program the bot in detail. It is enough to give only part of the solution; the rest is calculated by the neural network with brute force.

The best example is the famous game "Montezuma's Revenge". A vanilla neural network is not able to solve the game, because the state space of all weights is too huge. Even after training for many weeks on GPU hardware, the bot fails to walk to the door. But with a little extra information the learning process can be improved dramatically. First, sub-policies are defined, like "walk-to-door", "down-ladder" and "pick-up-key". Then linguistic variables are defined, like "enemy-left" and "bot-in-middle". All this information is fed into the neural network. Now the search for weights starts again, and this time the bot is able to play the game. This is called guided policy search and means that the brute-force search for the right policy is supported by some simple manual programming which divides the problem space into smaller pieces.

## 4 Learning from Demonstration

## 4.1 Does learning from demonstration make sense?

In the robotics literature there is a huge number of papers which explain the "learning from demonstration" (LfD) paradigm. For everybody not familiar with the concept, first a short introduction. We have a map with a robot, an obstacle and a goal. The normal idea for bringing the robot to the goal is to use the brute-force RRT algorithm, which randomly samples possible trajectories; once it has found a way to the goal, it is finished. According to the LfD paradigm this idea is wrong. Instead it is necessary to guide the search for a trajectory. The first step is that a human operator draws some trajectories into the map, and the search for the final trajectory is guided around that data.

But why is it necessary to guide the search? Why must the human operator draw examples into the map? Answering this question is not easy, because it is a precondition of LfD that this behaviour is right. According to my knowledge about path planning, it makes no difference whether a human operator teaches trajectories before the solver searches for a solution. The LfD process can be cancelled. Instead, a brute-force open-horizon solver which samples all possible trajectories from scratch is the better alternative.

The reason why some people think this is not optimal has to do with the huge state space. Normally an RRT solver is not able to find a solution, because the number of possible trajectories is too big. So in real-life applications RRT is often combined with a heuristic. Learning from demonstration is such a heuristic, which can accelerate the search. But inside the RRT universe, other heuristics are also possible. Mostly they use cost functions which are customized to the domain. Other heuristics divide the map into small submaps and solve each sector separately.

I think we should differentiate between the algorithm itself and possible tweaks to make it run faster. The algorithm itself for finding a trajectory is brute-force sampling. That means the robot selects with a random generator whether it wants to go north, south or east and tests whether there is a way or not. Tweaking the algorithm means storing the graph in a database, searching with a heuristic or using an LfD routine. But all these tweaks are not important; for easy problems they can be left out.

Let's take a look at how a standard path planner in the RRT context works. Normally the operator must define the goal, for example that the robot should go to position (10,20). Then the operator defines some conditions, like that the robot should not collide with an obstacle and must stay away from the corner. Now the operator presses the run button and the solver presents a solution. The trajectory to the goal is calculated by trying out possible alternatives and evaluating each with a score. The inner working is that the operator defines an abstract goal, and with a CPU-intensive search a solution is found.
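The workflow can be sketched with a toy grid planner. Breadth-first search stands in here for the sampling-based solver, and the map size, goal position and obstacle wall are made up:

```python
from collections import deque

def find_path(start, goal, obstacles, size=25):
    """Brute-force search on a grid: expand moves in all four
    directions until the goal cell is reached."""
    frontier = deque([(start, [start])])
    visited = {start}
    while frontier:
        (x, y), path = frontier.popleft()
        if (x, y) == goal:
            return path
        for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
            nxt = (x + dx, y + dy)
            if (0 <= nxt[0] < size and 0 <= nxt[1] < size
                    and nxt not in obstacles and nxt not in visited):
                visited.add(nxt)
                frontier.append((nxt, path + [nxt]))
    return None   # goal unreachable

# operator input: goal (10,20) and a wall of obstacle cells (invented)
path = find_path(start=(0, 0), goal=(10, 20),
                 obstacles={(5, y) for y in range(15)})
print(len(path))
```

The operator only states the goal and the obstacle set; the trajectory itself falls out of the CPU-intensive search.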

Somebody may argue that we can optimize the algorithm to calculate the solution in a shorter amount of time and with less CPU power. But that is not a must-have. In general a trivial brute-force solver works reasonably well. And I would go a step further and state that explaining the inner workings of an improved RRT solver, which uses mathematical tricks to find the trajectory faster, is an anti-pattern in explaining the overall task. In my opinion the LfD paradigm explains how to make the trajectory search faster and leaves out the explanation of what the goal is.

## 4.2 Micromanipulation planning

The most effective technique in AI which was ever invented is brute-force search. This technique is able to solve every game and every robotics problem. The solver finds a solution not only for chess and path-planning problems, but also for Mario AI, Starcraft AI and dexterous grasping. The only problem is the high CPU consumption. In reality, current hardware is barely fast enough, and the robot must find a solution in under one minute. Under these constraints a brute-force solver is difficult to implement.

Somebody may argue that a brute-force solver is the wrong way and that other techniques have to be found. No, it is not. The question is only how to use the solver in the right way. There are two possibilities for saving CPU time:

• calculate only micromanipulation tasks

• using heuristics

I want to discuss both in detail. A complete pick-and-place task can't be solved with brute force alone. The trajectory would consist of many seconds of continuous actions, and today's computing power is not sufficient. But what happens if we subdivide the task into smaller subactions? For example, the robot hand has contact with an object and the solver has to calculate the next step. That means it should answer the question whether it must push the robot arm strongly or lightly against the object to bring it to the goal. Such a detailed question can be answered by a brute-force solver, because the time horizon in which the push action takes place is small.

Additionally, the action space can be reduced further with dedicated heuristics. The programmer can specify what the push action looks like in general, reducing the number of possibilities. These can be calculated on normal hardware in a few seconds.
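A minimal sketch of such a short-horizon brute-force search. The one-dimensional push physics and all constants below are invented for illustration; the point is only that a horizon of a few steps over a reduced action set is cheap to enumerate:

```python
import itertools

PUSH_STRENGTHS = [0.0, 0.5, 1.0]   # reduced action space, per heuristic

def simulate(object_pos, pushes, friction=0.8):
    """Toy 1-D physics: each push adds velocity, friction damps it."""
    velocity = 0.0
    for push in pushes:
        velocity = velocity * friction + push
        object_pos += velocity
    return object_pos

def best_push_sequence(object_pos, goal_pos, horizon=4):
    best, best_error = None, float("inf")
    # brute force: try every strength combination over the short horizon
    for pushes in itertools.product(PUSH_STRENGTHS, repeat=horizon):
        error = abs(simulate(object_pos, pushes) - goal_pos)
        if error < best_error:
            best, best_error = pushes, error
    return best, best_error

plan, error = best_push_sequence(object_pos=0.0, goal_pos=3.0)
print(plan, round(error, 3))
```

With 3 strengths and a horizon of 4 there are only 81 candidate plans; a full pick-and-place task would explode combinatorially, which is exactly the argument for small subactions.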

In the literature, the concept of subactions with only a small time horizon is known, and many approaches were discussed for addressing the problem. The easiest form is procedural animation. But this technique has the problem that the source code is static, and for most problems it is not known what the mathematical formula looks like. "Learning from demonstration" is also discussed in the literature. But this idea has the disadvantage that it is unclear how to store the policy: a hidden Markov model, an LSTM or a Q-matrix are all possible. In my opinion the most powerful and easiest to understand solution is a brute-force planner. That means there is no policy; instead the physics engine is tested with trial and error, in the same way an RRT path planner finds a trajectory through a maze. The programmer has the task of defining where the goal is and of defining some constraints. The rest is done by the solver.

To sum the principle up, the system consists of the following elements:

• high-level “natural language instructions”

• motion primitive which consists of a goal and constraints

• a high-level GOAP-like solver

• a motion primitive brute-force RRT solver

In my opinion these techniques will allow finding a solution with low CPU usage. The only disadvantage is that until now I haven't tried it out in reality, so it is only a guess. Implementing such a system in working C++ code is a bit complicated. Perhaps for the beginning only the motion primitive solver would be enough, so that not the overall robot system has to be programmed, but only one motion primitive like "closegripper".

What does "closegripper" mean? Normally it has a certain goal: the robot arm must get in touch with the object and reach a certain amount of pressure. How exactly the gripper has to be closed depends on the position and the object size. So there is much room for a solver to try out different alternatives.

## 4.3 Learning of a dynamic system

Instead of describing a standard optimal control problem like the inverted pendulum, I want to give an easier example from the area of path planning. The advantage is that this problem is well understood in computing. A maze is given which consists of obstacles, and the robot must find a trajectory through the maze. With a path planner like RRT it finds a way. Now we take the robot and place it a small distance left of the original position. The old trajectory is no longer valid; a new one has to be computed.

This problem is similar to a grasping problem where the object is a bit further left than in the demonstration, so the robot must adapt its grasp trajectory. How to deal with the situation? I asked Google Scholar, but nobody seems to have an answer. There is no algorithm available which modifies a given trajectory slightly so that the robot can drive to the goal. Instead, in real life a different strategy is used, which is called anytime planning. That means calculating the trajectory from scratch.

I selected the path-planning problem in a maze because it is easier to observe what is going on there. If the robot is moved only a small amount and should go to the same goal, the resulting trajectory is completely different. It is the same problem as in a chess game, where we only remove one piece from the board and then every player must recalculate his strategy from scratch.

If it's not possible to modify a given trajectory for a path planner, it is also not possible to modify a trajectory for optimal control problems. The problem space there is bigger and the complexity grows. So from my point of view, in optimal control problems a complete replanning is necessary.

## 4.4 Robobarista

For the Robobarista project there are some papers online, for example [7]. The authors describe a very advanced robotics system. The most interesting aspect is the combination of deep learning for the robot trajectory with a grounded vocabulary for activating a single trajectory. According to my knowledge, this is the first paper worldwide which describes this in such a detailed form. The Robobarista project itself additionally consists of other elements, for example some kind of crowdsourcing and an image-recognition engine, but these are not so exciting.

In deep learning research it has been known for many years that neural networks are powerful function approximators which can be trained for many problems. The famous paper about Atari game playing has shown this in detail, but neural networks can also be used for trajectory generation. For a long time the problem was that it was not possible to integrate the neural network into a robot control system, and using a neural network for controlling complex tasks was not possible because the training episodes would take too much time. In the Robobarista project a solution was found. On the high level, a "natural language instruction" system is used for describing a task. This is similar to the PDDL paradigm, and every command is connected to a neural network which is robust against errors on the low level. A nice extra feature is that the overall system is no longer a black box, like most neural networks; instead it can be controlled interactively like a text adventure. A possible interaction with such a robot would be:

1. open gripper

2. grasp object

3. close gripper

4. move to place A

5. open gripper

and so forth

The advantage is that the human operator has control over the system, because the robot does exactly what he types in. On the other hand, the neural network is capable of learning the task on the fly and acts like a black box, where nobody is sure how a trajectory is generated. I think this combination is the key ingredient for future robotics systems.

The most remarkable aspect of the Robobarista project is that a YouTube demonstration is available today which shows the robot in front of a coffee machine. So the technology is not pure science fiction; it is working. Robobarista is not completely new: the idea of grounding with natural language and the idea of deep learning were described in earlier papers. But the Robobarista project was the first time a functional, integrative system was described in detail. According to Google Scholar the papers were published between 2015 and 2017. As far as I know, this is the most advanced robotics project ever documented as open access. It was done by Cornell University as group work, not by a single person.

## 4.5 Static motion primitive

Usually, motion primitives are the part of a robot control system which is created with deep learning. So-called adaptive PID control or neural networks for optimal control are used for regulating a dynamic system. The best example is the inverted pendulum problem, in which a policy is used for bringing the system into a goal state. The assumption is that the task is so complicated that the actions must be generated from scratch, and that this can only be done with a neural network.

This assumption is wrong. Surprisingly, static motion primitives work perfectly in the domain of manipulation. A corpus of static motion primitives consists of around 50 possible actions, and each one has a fixed trajectory. For example, "pushleft" means that the robot arm moves 10 cm left, and "pushsmallleft" means that the arm moves 5 cm left. Normally it seems impossible to use static trajectories for something useful, but it depends on the right order. A simple experiment in the simulator has shown that with only 50 primitives it is possible to bring an object into every desired position. No additional parameters were used to modify a trajectory; instead a planner works only on the high-level symbolic layer. That means a motion pattern like "pushleft, pushright, pushleftsmall" is different from "pushright, pushright".

How exactly can a static motion primitive be defined? Normally it consists of a name in natural language. This makes it easier for the human operator to memorize the primitive, and he can enter the name interactively for controlling the robot. Additionally, the motion primitive consists of the action itself. In the easiest form it is a relative movement; for example (-10,0) means moving the robot arm 10 cm left. That's all. No complicated spline trajectories are needed, and also no policy which uses additional state information as in the reinforcement learning area. Instead, the success of the overall system depends on the symbolic planner: it must bring the motion primitives into the right order. For doing so, a physics-based RRT planner is the right choice. The physics simulation is tested with different plans, and the plan with the highest score is printed out.
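A sketch of a purely symbolic planner over static primitives. The primitive table below is hypothetical (only 5 primitives instead of 50), and breadth-first search over the primitive sequences stands in for the physics-based scoring:

```python
from collections import deque

# Static motion primitives: a natural-language name mapped to a fixed
# relative displacement in cm (table is invented for illustration).
PRIMITIVES = {
    "pushleft": (-10, 0),
    "pushsmallleft": (-5, 0),
    "pushright": (10, 0),
    "pushup": (0, 10),
    "pushdown": (0, -10),
}

def plan(start, goal):
    """Symbolic planner: search for the shortest sequence of primitive
    names that moves the object from start to goal. No trajectory
    parameters are adapted, only the order of primitives."""
    frontier = deque([(start, [])])
    visited = {start}
    while frontier:
        pos, sequence = frontier.popleft()
        if pos == goal:
            return sequence
        for name, (dx, dy) in PRIMITIVES.items():
            nxt = (pos[0] + dx, pos[1] + dy)
            if nxt not in visited and all(-100 <= c <= 100 for c in nxt):
                visited.add(nxt)
                frontier.append((nxt, sequence + [name]))
    return None

print(plan(start=(0, 0), goal=(-15, 10)))
```

The output is a sequence of names the operator could type in by hand, which is exactly the interactive, text-adventure-like quality described above.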

## 4.6 Reward function in Inverse Reinforcement learning

Ironically, inverse reinforcement learning (IRL) is often described as an alternative to a handcrafted reward function. But in reality both are the same. According to the literature, in IRL the reward function is defined with feature expectations, that is, probabilities over which sensor inputs should occur. Let us investigate this in detail.

The idea can be dated back to computer chess, where a subproblem occurs which is called "board evaluation". The question is how good a certain position is. Board evaluation in computer chess is done by counting the bishops and all the other pieces to calculate a score. It is common to weight the different features according to their importance. The same is done inside IRL. The difference is that in computer chess it is clear that the reward function is handcrafted, while the IRL papers state that it is derived automatically.

In computer chess a board evaluation is used because it reduces CPU consumption dramatically. The chess engine knows whether a state in the game tree is good or not and can search regions selectively. The same is true for reinforcement learning tasks like the OpenAI gym environments. In theory it is possible to solve all the games without a board evaluation, but it takes much time to find a policy. The better approach is to use at least a hand-coded reward function which guides the policy search in the right direction. The consequence is that the search for the policy doesn't take hours but is done in seconds. The best way to understand the importance of a reward function is to not use it. This results in a training phase in which the neural net can't improve its reward.
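A hand-coded reward function in the spirit of board evaluation might look like this for a pendulum-on-a-cart state. The features and their weights are invented, in analogy to counting chess pieces and weighting them by importance:

```python
# Weighted features, analogous to piece values in chess (weights invented).
FEATURE_WEIGHTS = {
    "upright": 10.0,   # pendulum close to vertical matters most
    "centered": 2.0,   # cart close to the middle of the track
    "slow": 1.0,       # low angular velocity
}

def reward(state):
    """Score a state as a weighted feature sum; each feature is
    normalized so that 1.0 is ideal and 0.0 is worst."""
    features = {
        "upright": 1.0 - abs(state["angle"]) / 3.14,
        "centered": 1.0 - abs(state["cart_pos"]) / 2.4,
        "slow": 1.0 - min(abs(state["angular_vel"]), 1.0),
    }
    return sum(FEATURE_WEIGHTS[k] * v for k, v in features.items())

good = reward({"angle": 0.0, "cart_pos": 0.0, "angular_vel": 0.0})
bad = reward({"angle": 3.14, "cart_pos": 2.4, "angular_vel": 1.0})
print(good, bad)
```

Such a dense score lets the policy search rank intermediate states, instead of waiting for a sparse end-of-episode signal.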

A good introduction using the example "inverted pendulum" is given by an older paper [1]. On page 4 a chart is shown with the timestep on the x-axis and the angle on the y-axis. First, a human expert demonstrates the swing-up task. A certain curve is plotted into the chart, which shows the angle over 2 seconds. Then the robot must repeat the task. The aim is to follow the chart. For example, at timecode 1 second the angle must be -4 radians.

The interesting feature of such a reward function is that it is robust against mistakes. For example, if the robot can't reach the optimal -4 radians at the timecode, it is not totally lost; only the overall score is lower.

Let us compare learning from demonstration with a fixed, programmed trajectory. A fixed trajectory means that the robot repeats the demonstration exactly: it follows the waypoints even if that makes no sense. Learning from demonstration instead has the aim of obtaining a reward function, and this can be used by a solver. And if a solver is in the loop, the trajectory is generated from scratch and is able to react to obstacles. It is the same principle by which a chess engine works.

The reason why a reward function is created and LfD is used has to do with speed. In theory it is possible to solve a game without a reward function. This results in the non-improvement of the neural network mentioned above, even if the fastest CPU is involved. With LfD and a reward function, the CPU usage in the training phase is smaller. That means the solver finds a good policy, and this will drive the robot to the goal. So LfD is to robotics what alpha-beta pruning is to computer chess: a way to accelerate the search in the game tree.

The paper [1] was published before the famous paper by Andrew Y. Ng in which he described the apprenticeship idea. In my opinion the paper by Schaal et al. is easier to understand, because it reduces the task dramatically and gives a concrete example. To sum up the paper, it is enough to know that a value is measured along a time axis, and the reward is measured by reaching that value. In the paper the value was the angle of the pendulum. At timecode 0.5 seconds it has one value, at timecode 1.0 seconds another, and so forth. Doing the task right means that the robot reaches the same value at the same timecode.

The first impression is that this strict condition makes no sense, because it is possible to swing up the pendulum on a different timescale, for example faster or with another swing-up period. Yes, that is right, but then we have the problem that we don't know what the reward over the timescale is. The only feedback the robot gets is whether it solved the problem at the end. Learning from such a delayed reward, after a long period without any reward, is very difficult. It is possible with a brute-force solver and also with a neural network which is trained over many episodes, but the CPU consumption is high.

Let us investigate this in detail. The robot starts the swing-up task. In the OpenAI gym it would have two possible actions: move the cart left or right. A possible action sequence could be:

0.0=left, 0.1=left, 0.2=right, 0.3=left, 0.4=right, 0.5=left

The first parameter is the timecode and the second the action of the robot. The consequence is that the pendulum starts to move, and only at the last moment is it possible to answer the question whether the task is solved or not. How does the robot know at a point in time, perhaps at timecode 0.2, whether it is on the right track? Usually this can be answered with a brute-force solver: from the current state all future states are calculated, and a path to the goal can be found. The problem with RRT and similar algorithms is that the nodes in the graph have no score. We don't know how far they are from the goal; every node looks equal. What is missing is a reward function, that is, an algorithm which can tell for every state how good it is. Then it is possible to identify a right path before the goal has been reached. And here the idea of LfD comes into the game. A demonstration given by an expert can answer the question of how to score an RRT node. The algorithm works in such a way that for every state, the feature difference to the expert demonstration is calculated. For example, according to the demonstration the angle should be 30 degrees at timecode 0.2. If the robot reaches an angle of 32 degrees at the same timecode, it is doing very well.

The question remains how to reach a certain angle at a certain timecode. This can be answered by a solver, either an RRT solver or a neural network which has learned a policy. With an RRT solver it is easier to explain the idea. We start at timecode 0.0. The robot has two options: it can do the left action or the right action. According to the RRT paradigm both actions are tested. And now comes the magic: the follow-up state with the higher score is better and will be sampled with a higher priority. If some CPU time is available, the solver can also check out the other RRT node, but this is optional. So it is possible to follow only the RRT nodes which have a high reward.
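The idea of preferring the child node whose feature is closer to the demonstration can be sketched as follows. The dynamics in `step` and the demonstrated angle values are invented; the point is only the scoring and the greedy expansion:

```python
# Expert angle per timestep (invented demonstration values).
DEMO_ANGLES = [0.0, 0.05, 0.12, 0.21, 0.32, 0.45]

def step(angle, velocity, action, dt=0.1):
    """Toy dynamics: pushing the cart changes the angular velocity."""
    force = 1.0 if action == "right" else -1.0
    velocity += force * dt
    return angle + velocity * dt, velocity

def score(angle, t):
    # smaller feature difference to the demonstration = better score
    return -abs(angle - DEMO_ANGLES[t])

def guided_rollout():
    angle, velocity, actions = 0.0, 0.0, []
    for t in range(1, len(DEMO_ANGLES)):
        candidates = []
        for action in ("left", "right"):   # expand both possible actions
            a2, v2 = step(angle, velocity, action)
            candidates.append((score(a2, t), action, a2, v2))
        _, best_action, angle, velocity = max(candidates)  # follow higher score
        actions.append(best_action)
    return actions, angle

actions, final_angle = guided_rollout()
print(actions, round(final_angle, 3))
```

Here the lower-scored sibling is simply discarded; a full RRT solver would keep it in the tree and revisit it if spare CPU time is available.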

Again, the robot does not do the same actions as the human demonstrator. Perhaps the human executed the action sequence left, left, right, left, but the robot will decide to do right, left, right, left. What counts is only the difference in the feature, that is, the angle at a certain timestep.

## 4.7 Learning from Demonstration

Most newbies in the area of robotics have heard about learning from demonstration. In the associated YouTube videos, mostly a human operator takes a robot arm and makes a move with it. Then the robot handles the task on its own. But how does this magic trick work? Or is it only a show?

First we must distinguish between teach-in robot programming and LfD. Teach-in programming was invented in the 1980s and amounts to setting up waypoints at fixed values: p1 is (10,20), p2 is (40,20) and so forth. The robot arm moves along the line of waypoints. If something happens, for example an obstacle is in the line, then the robot is blocked.

In contrast, LfD works with a trajectory solver. That means the task is seen like a game of chess, and the robot wants to win. In the LfD training phase the constraints of the task are defined. For example, if the human operator guides the robot arm to pick and place an object, then it is clear that winning means moving the object to the goal. From a mathematical point of view, the initial demonstration generates a cost function. In the second, autonomous phase, the robot generates the complete game tree and goes in a direction which keeps it near the demonstration.

Now a short example: the human operator moves the robot arm to pick and place an object. He grasps the object, moves the arm and then releases the object. The robot knows that the goal of the game is to release the object at the goal place. If we now block the robot with an obstacle, it will try to solve the game in any case: it internally calculates a way around the obstacle. That is the difference from teach-in programming.

We can increase the abstraction level a bit. Every task a robot can do can be described as an optimal control problem. That means that from the current state some actions are executed to reach the goal state. With this principle every game can be solved; the question is only which actions should be done in which order. A naive solution is a brute-force solver: the robot calculates the complete game tree until one node is equal to the goal state. The disadvantage is that most robotic problems are too complex to be solved this way. Learning from demonstration is a technique for reducing the game tree. It evaluates the nodes of the game tree with a score, so the solver needs to compute only a fraction of the game tree. In the domain of computer chess this is sometimes called branch-and-bound, sometimes alpha-beta pruning, and it means speeding up the algorithm. The remarkable aspect is not that the robot can solve the task; the interesting feature is minimizing the CPU usage.

On YouTube another video is available, called "Teaching a Robot to Roll Pizza Dough: A Learning from Demonstration Approach (short)". What is the task? The current state is a ball of dough and the goal state is rolled-out dough. The transition to the goal state is done by actions. It would make no sense to use teach-in to set up a fixed trajectory, because the dough ball can be bigger or smaller. Instead a solver is needed which calculates the trajectory from scratch. Every LfD system has such a solver. It uses the data from the demonstration to bring the system into the goal state. A short example: the robot starts to roll out the dough. Its RRT solver generates a new random action, and the dough falls from the table down to the ground. According to the learned demonstration this behaviour isn't normal, so this RRT node gets a negative score and follow-up actions don't have to be calculated. Instead the solver takes one of the remaining RRT nodes and tries out what happens if it does a random action there. So it is similar to computer chess, where a certain node in the game tree can be ignored, and all its following nodes too.

## 4.8 Sensor-Motor-Primitives

Reinforcement learning and "motor learning" have been described in many papers. Unfortunately, most papers move forward very fast and have no time for longer explanations. What the authors try to do is invent advanced motor-learning strategies which are interesting from a technical point of view, but they forget that some newbies might be interested in the general concept. In the following section I want to explain the idea as an introduction without too much complexity.

First, it is important to know that learning from demonstration can be done without neural networks and without Q-learning. It is only a scoring strategy for faster RRT sampling. But let us start at the beginning with an RRT solver. RRT can bring a system into a goal state; for example, the robot controls a balancing board which holds a marble, and the aim is to bring the marble into the hole. What RRT does is test out different plans. In a time period random steps are possible, and after executing them the board changes its orientation, and with it the marble on top.

If our computer were unlimitedly fast, we would now have found the solution: we simply try out all 100 billion possibilities, and one of them brings the marble into the hole. But an unlimitedly fast computer is not available, so we need some performance tricks. Sensor-motor learning is one of them. The idea is that first a human expert demonstrates the task, and in the background all movements are tracked. We store the sensor information, which is equal to the system state, and we store the movements, which are equal to the actions.

Now we want to replay the recording. The easiest trial is to place the marble exactly on the same start position and bring the board into exactly the same pose. Then we execute our logfile and the marble runs into the hole. But what happens if the start position of the marble is slightly different? Executing the same actions would no longer solve the puzzle, but that is not the plan anyway. Instead the RRT solver cited above helps us. We sample the action space, but this time according to the expert demonstration. That means all actions which are similar to the actions from the demonstration get a good score, and all sensor data in which the marble has the same or a similar position as in the demonstration get a high priority too.

Our RRT solver now consists of two modules: first, a random sampling technique, and second, a scoring module which gives every RRT node a value between 0 and 1. A score of 1 means a perfect match; 0 means the node is very different from the demonstration. The RRT solver first samples the RRT nodes with the maximum score. If the marble is on the same place as, or near, the start position of the demonstration, this will bring the marble to the goal. If not, the RRT sampler tries out nodes which have a smaller similarity, and so forth.
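A possible shape for that scoring module is sketched below. The similarity measure (an exponential of the state distance, restricted to matching actions) is my own assumption for illustration, not a fixed part of the method; any function mapping nodes to [0, 1] would do.

```python
import math

# Fake expert log, mirroring the recording sketch above: the marble moved
# from 0.0 to 4.5 in steps of 0.5 while the board was tilted (+1).
demonstration = [(0.5 * i, +1) for i in range(10)]

def score(state, action):
    """Return a value in [0, 1]: 1 near a demonstrated (state, action) pair,
    0 when no demonstrated step used the same action."""
    best = 0.0
    for demo_state, demo_action in demonstration:
        if action == demo_action:
            best = max(best, math.exp(-abs(state - demo_state)))
    return best

def pick_node(candidates):
    """The solver expands the highest-scoring candidate node first."""
    return max(candidates, key=lambda node: score(*node))

# A node close to the demonstration beats a far-away or wrong-action node.
best_node = pick_node([(0.4, +1), (9.0, +1), (2.0, -1)])
```

Plugged into the RRT loop, `pick_node` replaces the uniform node choice, so the tree grows first along the corridor the expert demonstrated and only falls back to blind sampling when nothing similar is available.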

What happens if the repetition is completely different from the demonstration? Then nearly all RRT nodes will get a score of 0. That means it is unclear which path is correct, so the RRT solver must test them all, which can take a long time. But if the start position is equal or nearly equal, the solver will find the solution much quicker: it has clear information about the direction in which the RRT path must be sampled and which actions might be useful.

Here ends my short explanation of Learning from Demonstration. It is possible to extend the idea with many demonstrations instead of one, and to parametrize them with the aim of speeding up the RRT solver further. And yes, it is indeed possible to also use deep learning, which helps to reduce the workload. But even the vanilla LfD algorithm described above, which uses only an RRT solver, works reasonably well. The idea is to bring some kind of robustness into the game.
