Multimodal planning is here to stay

After reading some of the latest papers and watching a few of the videos on the internet that show the algorithm in action, it is easy to say that so-called “multimodal planning” is a high-performance robotics planner. Explaining the principle is easy: a graph is created as in PRM planning, with the addition that symbolic actions are allowed. A symbolic action allows a jump to any point in the state space; for example, it could be “move object to the middle”. With these symbolic actions it is possible to create keyframes and then to create transitions between the keyframes, all with the help of the same graph data structure. Multimodal planning can be called a knowledge-based, enhanced planner which can plan complex tasks like “moving all the bottles into a line, including regrasping bottles if necessary”.
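The graph idea can be illustrated with a small sketch. It is not taken from any of the planners mentioned here; the node names, the two modes and the edge list are made up for illustration. Each node is a (mode, place) pair, ordinary edges stay inside one mode, and a symbolic action such as grasp or release is simply an edge that changes the mode.

```python
from collections import deque

# Hypothetical toy graph: each node is a (mode, place) pair.
# Edges inside one mode are ordinary roadmap motions; edges that
# change the mode stand for symbolic actions (grasp, release).
EDGES = {
    ("free", "start"): [("free", "bottle")],
    ("free", "bottle"): [("free", "start"), ("holding", "bottle")],  # grasp
    ("holding", "bottle"): [("holding", "middle")],
    ("holding", "middle"): [("free", "middle")],  # release
    ("free", "middle"): [],
}


def plan(start, goal):
    """Breadth-first search over the combined (mode, place) graph."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in EDGES.get(node, []):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None  # goal not reachable


for step in plan(("free", "start"), ("free", "middle")):
    print(step)
```

The same search routine handles motions and mode switches, which is the point of keeping everything in one graph data structure.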

Unfortunately, I have not programmed a working prototype yet. The main reason is that no tutorial is available right now, so it would be a trial-and-error task which can take a lot of time. But in theory, such a prototype should work well: it allows searching a large state space without much CPU consumption.

It is a bit hard to find source code from previous projects. What is available on GitHub right now are multimodal traffic planners. The idea is that a person can walk, take the car or take the bus, and the planner makes use of this knowledge behind a REST API. What I also found is an extension to the OMPL planning library (ROS) which implements a multi-modal robotics planner, but this project is badly documented. So it seems that the idea itself makes sense and some researchers are playing with it, but a wider audience isn’t aware of it. In theory, such a multimodal planner can solve lots of everyday problems in the Lego Mindstorms community (following a black line), in grasping challenges (Amazon Picking Challenge) or in computer animation.

The main problem in multimodal planning is that the resulting graph is hard to visualize. In one paper I found some kind of 3D map which shows the different layers of a graph. Other authors have tried to explain the graph as an automaton, in which a mode is equal to a procedure in a computer program. But I would guess that the better way to communicate the concept to the mainstream robotics community is a YouTube video which shows the resulting actions of the robot together with the graph. This video would look similar to a normal RRT path planner, except for the additional modes which make the system more powerful.


Learning from Halle 54 at VW

The project “Halle 54” at the German company VW is a good example of a failed robotics project from the 1980s. The aim was to increase productivity with newly developed robots, and the programmers failed to fulfill the task. What went wrong? The answer is that the challenge was too complicated: the gap between goal and technology was too big. The answer is not to avoid robotics projects but to reduce the challenge. A better example of a present-day robotics project is the Amazon Picking Challenge. Here the goal is not to improve productivity directly; instead a synthetic benchmark is created which the teams must beat. The benchmark consists of the rules of the Amazon Picking Challenge, and it is possible to fulfill the challenge.

The remarkable insight is that neither at Halle 54 nor at the Amazon challenge is a robot used for doing useful things. In the case of VW the installed robots did nothing but cost money, and the same is true for the Amazon project. Even if the winning team successfully picks a box from the shelf, that is not enough for using the robot under real conditions. So it is a lose-lose game, as at VW’s Halle 54. But something is different: this time even a lost challenge brings technology forward. Such competitions are not held to increase the annual income of VW or Amazon, but to produce new papers about robotics and to write new software which is hosted on GitHub.

Newly founded robotics projects are no longer focused on productivity. Instead, other synthetic rules are applied. One benchmark is the score in a challenge, another is the page count of the resulting paper which describes the system. I would call these measurements soft goals because they are not connected to raw productivity. They are derived from the future goal of using robots in a real factory. The main difference to the Halle 54 example is that the soft goals can be reached. Even without reaching the main goal, the team was successful.

For future engineers it is important to invent such challenges. They are games inside the game with reduced difficulty. The subgoals can be reached within weeks and with a small amount of effort. I would guess that inventing a robotics challenge is more important than inventing a robot control system which fulfills the task.

Let us analyze the idea of a robotics challenge in detail. The general idea is that person A invents a game with the aim that other people will play it. The game could be to grasp a box with the Baxter robot. As with any game, the game itself and its rules have no purpose outside the game. It cannot be used for anything practical; it is a leisure activity. The interesting aspect is that this changes the focus. Instead of dealing with robotics itself, the participants try to solve the game. They do so because they believe that they can learn something from it.

The question which remains open is what a good robotics teaching game looks like. One constraint is that it shouldn’t be too hard, another is that it should bring robotics forward. In any case, the robotics game takes place outside of normal work or normal challenges. It has nothing to do with increasing the stock price of VW. Instead the game follows its own rules.

Let us explain what solving a robotics game means. In all cases the solution has to do with programming, so in general it is an agent programming competition. The team members take a predefined API and use it to get the maximum score. From a certain point of view it is possible to criticize such challenges: the inventor of the game has himself failed to program a robot and is now creating software with no purpose except that other programmers should program against it.

A good example is the Robocode project, which is mainly a Java program which can run other Java programs. The Robocode challenge is far away from a project like Halle 54. No agent in the simulation can improve the productivity of the German car maker, and none will do so in the future. Instead the aim of Robocode is the Robocode game itself. By default it has no purpose.

The aim of Halle 54 at VW was to invent a robot which does a task (the assembly of cars). The aim of the Amazon challenge and the Lego Mindstorms challenges is to invent a game in which the teams can do tasks. The first is oriented towards machines and AI, which means the idea is to do something with a robot, while modern robotics challenges are educationally motivated. The idea is to teach people how to program robots. Halle 54 failed, and would fail today. But educational robotics can be called a success. Many people have attended the games and they will do so tomorrow.

Is it possible to improve productivity in a present-day car factory with robots? No, because no one is able to program the software. The task is too complicated, and there are no papers available which describe it. It is not possible to use Artificial Intelligence for something that works. But apart from real-life applications it is possible to program robots that work in Robocode, the Lego Mindstorms league or RoboCup. The reason is that the difficulty is reduced there, and each year some new papers are published which describe interesting inventions.

Manual control

It is important to define the difference between working technology and wishful thinking. Working technology is limited to remote control. For example, a manually driven car or a manually operated cable crane can be used productively in a car manufacturing company. It is a useful work tool and helps to improve shareholder value. Everything which goes beyond manual control is not available. That includes autonomous cars and autonomous robots. Such technology is not available, and projects which aim to implement it will fail. The bad news is that robotics cannot be researched in real car assembly. The only place where it can be researched is outside of practical usage, in artificial games with reduced difficulty.

The result is that in real factories we see manually driven cars and assembly lines with robots, while in game environments and in education we see lots of robots. The gap between them has to do with a lack of knowledge. Robots in education are not real robots; they simulate reality on an easier model. Lack of knowledge means that it is not possible to increase the game difficulty. In theory it is possible to invent complex games which go beyond the Robocode challenge; the problem is that nobody knows how to play them autonomously.

On the other hand, it is not possible to recreate the workplace in a robot-friendly way, because car assembly has certain constraints which cannot be reduced. This results in the gap described above. The gap has to do with future technologies, which means that in 10 years or so it will be smaller. Filling the gap means using robots for practical things: in the household and in car assembly. One day this will be possible. Until then we will see progress in robotics games; for example, the Mindstorms EV3 is more advanced than its predecessor, the NXT.

Peer-review wiki makes progress

My own peer-review wiki is working well, because I have slightly modified the aim of the project. Instead of simply calling it a “peer-review wiki” for summarizing current robotics papers, the better description is an “annotated bibliography”. That means it is a list of important literature in the field which is categorized into groups and annotated with short subjective comments. This set-up makes it easier to extend the wiki, because it is clear what is on-topic. It is primarily a bookmark directory, comparable to BibSonomy, but in a wiki format with additional information. The aim is to extend the document in the future, so the wiki format is well suited.

Around 28 literature sources are part of the wiki, plus some general information about robotics. Printed out, the wiki has around 8 DIN A4 pages. The main idea is not to put the references at the end of a paper but to create the bibliography for its own sake, that is, to focus entirely on the literature list.

Good example for a failed software project

I can report on a failed software project. It was initiated by myself. Here is the URL. The idea was to create a wiki for peer-reviewing papers about robotics. So far, so good. The first version was created by myself. The problem was that after the first draft I had no further ideas how to improve it. The section “Learning from Demonstration” needs triage, but my own knowledge is not sufficient for the task. So I searched a bit in social networks like Google Plus and Facebook for help, but I couldn’t find anybody. So the “Wiki” project stopped at this early stage. Forking it is technically possible, but no one has done so. So what went wrong?

The first problem was that it was a one-man show, and my own knowledge wasn’t sufficient. The hope was that the group of contributors would grow, but this didn’t happen. As a consequence I would call the project a failure. It started with a good idea, but after a while nothing else happened. Even today I think it would be a great idea to build a wiki for reviewing robotics papers, but it seems that the community which is interested in that goal is missing. Perhaps it would be better to start such projects from an existing community with clear responsibility for the success and a higher authority which is interested in the progress of the project.

Trajectory planning speed up with Learning from Demonstration


State-space reduction is the most important acceleration technique in every robotics solver. Techniques like sub-policies, Learning from Demonstration, qualitative physics and natural language instructions are used together with classical RRT planning. This blog post introduces some of the concepts for anybody who is interested in programming dexterous robots with OpenAI Gym.

Table of Contents

1 Reinforcement Learning
1.1 Learning vs. Planning
1.2 Hidden Markov Model
1.3 Reinforcement Learning plus natural language instructions
1.4 Q-Learning
1.5 Function approximation for inverted pendulum
1.6 Control policy explained
2 Neural Network
2.1 How powerful is deeplearning?
2.2 Qualitative physics with neural networks
2.3 Recurrent neural networks
2.4 Combining planning with neural networks
associative memory
3 OpenAI gym
3.1 Very basic OpenAI gym agent
3.2 OpenAI gym with sub-policy
4 Learning from Demonstration
4.1 Makes learning from demonstration sense?
4.2 Micromanipulation planning
4.3 Learning of a dynamic system
4.4 Robobarista
4.5 Static motion primitive
4.6 Reward function in Inverse Reinforcement learning
4.7 Learning from Demonstration
4.8 Sensor-Motor-Primitives

1 Reinforcement Learning

1.1 Learning vs. Planning

In the robotics domain the word “learning” is used often, for example in papers about “trajectory learning”, “biped balance learning” or “motion learning”. It seems that everybody uses the word, but only a few can explain why. It is possible to explain the term in detail. The word learning comes from the early AI history of the 1950s. At that time, AI was called cybernetics and was a sub-discipline of psychology. The question of that time was how humans and animals think. The famous example was a maze experiment with a mouse. The mouse sits in a maze. It walks around, searches for the cheese, and while it is doing so, the mouse learns the labyrinth. That means that the environment is stored in the memory of the mouse’s brain.

The early AI scientists tried to reproduce this behaviour with robots and computers. The idea here is to replace the mouse with a robot, let it take random steps in the maze, and store the sensor information inside an LSTM network. “Trajectory learning” and “motor control learning” amount to seeing a robot as a mouse.

But the concept has a major drawback. It is not focused on results or technical problems; instead it is based on neural networks. Other problem-solving techniques which could also bring the robot to the goal are missing. The alternative to learning is planning. The aim is the same, to bring the robot to the goal. But planning is a technique which is realized by algorithms and not by psychological models. A planning algorithm could be, for example, A*. A* is not derived from biology or psychology; it was invented from scratch. It doesn’t exist in nature, so it is not really part of core science.

The contrast between learning and planning is a historical relic of the contrast between GOFAI and Narrow AI. GOFAI is learning-based, which is equal to cybernetics, while modern Narrow AI is done with planning and engineering techniques which were invented from scratch. At many universities today robot learning is very popular. But not because it is so superior, but because the professors who teach it are very old. They learned AI in the 1950s together with psychology and can’t imagine that something different is possible.

The question which has to be answered is: what do we want? Is the aim to understand a mouse while it is running through a labyrinth, or do we want to build robots which are competitive?

In theory it is possible to create a maze-learning robot. This is a robot which memorizes the obstacles and possible moves in a neural network and retrieves the information for finding the way out. But in reality a working robot was never presented, and perhaps it is too difficult to build such a system. Instead it is easier to build a planning robot which runs a normal algorithm. Such systems are reliable and can be debugged.

GOFAI and Narrow AI are working on the same subject: Artificial Intelligence. The difference is the precondition. GOFAI tries to rebuild nature: at first human and animal intelligence is studied, and then it is reproduced with machines. Narrow AI starts with a machine and then tries to make it intelligent. The difference between the two is how new knowledge is generated. In GOFAI, inspiration is based on an understanding of nature and on other scientific disciplines like biology, while Narrow AI is a social discipline which is separated from alchemy. A typical way of finding new knowledge in Narrow AI is the robot challenge, in which different teams program their robots and try to be better than the opponent. The standard Lego Mindstorms competition has nothing to do with mathematics, physics or biology in the classical sense; it is more like a poem-writing challenge.

1.2 Hidden Markov Model

So far the open question is how to program the motion primitives in a robot control system. One possibility is procedural animation, which is the same as what Craig Reynolds has described under the term “steering”. [6] From a computational point of view, steering is indeed the best idea. There is a C++ method called “drive”; this function calculates something and as a result the robot moves forward.

The steering function is normally used for controlling the direction of a car. The car has a position and a direction, and there is a goal at a certain angle. A formula is used which calculates the new angle of the wheels, and the car drives to the goal. There is only one problem: the formula must be programmed, and in most cases this is very difficult.
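For the simplest case such a formula can be sketched in a few lines. This is not Reynolds’ code; the function name, the parameters and the steering limit of 0.5 rad are assumptions for illustration. The idea is only “steer by the difference between the target angle and the current heading”.

```python
import math


def steer(x, y, heading, goal_x, goal_y, max_angle=0.5):
    """Return a wheel angle in radians that turns the car towards the goal.

    max_angle is an assumed mechanical limit of the steering, not a
    value from the text.
    """
    target = math.atan2(goal_y - y, goal_x - x)
    # Wrap the heading error into [-pi, pi] so the car takes the short turn.
    error = math.atan2(math.sin(target - heading), math.cos(target - heading))
    # Clamp to what the wheels can physically do.
    return max(-max_angle, min(max_angle, error))


print(steer(0.0, 0.0, 0.0, 10.0, 0.0))  # goal straight ahead
print(steer(0.0, 0.0, 0.0, 0.0, 10.0))  # goal to the left
```

Everything beyond this toy case (obstacles, speed, vehicle dynamics) is what makes the real formula hard to write by hand.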

I want to give another example: biped walking. According to the literature, the best-practice method here is the ZMP method (zero moment point). This is a physical model for calculating the servo motor commands of a walking robot, which is comparable to the steering of a car. But the problem is that the formula is very general, which means it cannot calculate exactly the right parameters. For that, the model would have to be more complicated. The formula is only an approximation.

The question is how to deal with uncertainty. And here the hidden Markov model (HMM) comes into play. In general, an HMM is a probabilistic model, which means the result can be different in every run. An HMM can be seen as a pseudorandom generator. OK, let us go a step back. A random generator prints a random number to the screen. Pseudorandom means here that the randomness is there, but only to a small degree. An example of such a pseudorandom generator is:

from random import randint
print(randint(3, 9))

This is the Python source code for printing a random integer between 3 and 9 (both inclusive). On the one hand it is unclear what the next number will be; on the other hand it is clear that it will lie within a range. An HMM can be called an advanced pseudorandom generator. Inside the model we have a table which is similar to the q-matrix in q-learning, for example this one:

|   | 1 | 2 | 3 |
| 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 |

Instead of the zeros, the transition probabilities for switching between the states are given. After executing the table, the system gives us random numbers, but they also have a structure. With an HMM and other stochastic models like LSTM it is possible to generate noisy output. It is the same principle as in the Python range randomizer, in which the time series stays within a certain band.
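A minimal sketch of “executing the table” could look like this. Strictly speaking it is a plain Markov chain, that is, only the state-transition part of an HMM without emission probabilities, and the numbers in the table are invented for illustration.

```python
import random

# Assumed example probabilities; each row gives the chance of moving
# from the row state to state 1, 2 or 3 and sums to 1.
TRANSITIONS = {
    1: {1: 0.8, 2: 0.2, 3: 0.0},
    2: {1: 0.1, 2: 0.8, 3: 0.1},
    3: {1: 0.0, 2: 0.2, 3: 0.8},
}


def sample_chain(start, steps, seed=None):
    """Walk through the table and return the visited states."""
    rng = random.Random(seed)
    state = start
    sequence = [state]
    for _ in range(steps):
        row = TRANSITIONS[state]
        state = rng.choices(list(row), weights=list(row.values()))[0]
        sequence.append(state)
    return sequence


print(sample_chain(start=1, steps=20))
```

The output is random, but structured: with these numbers the chain tends to stay in its current state and never jumps directly between state 1 and state 3.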

The remarkable aspect is that on the one hand we have modelled the system, and on the other hand we have not, because important aspects of the model are unspecified. Instead they are determined by randomness, which is equal to “we don’t know”. So the pseudorandom generator is a hybrid model which is partly specified by meaning and partly probabilistic.

1.3 Reinforcement Learning plus natural language instructions

The reinforcement technique is used to generate a policy on the fly. That means that after the learning step the agent can solve the task on its own. For generating the policy only some CPU resources and a reward function are needed. The easiest example is an agent inside a maze, but the same principle can be used for grasping tasks. Here the maze is the optimal control problem.

The advantage of reinforcement learning is that the task does not have to be specified by hand. So reinforcement learning can be called a function approximation. The main disadvantage of the concept is that for bigger problems the state space grows rapidly, so that even the fastest CPUs are unable to find the right q-table.

But it is not necessary to solve bigger problems with reinforcement learning. High-level aspects of a game can be handled with another technique called “natural language instructions”. Only the motion primitives must be calculated automatically. That means the agent does not have to learn how to clean the kitchen; it only needs to learn simpler tasks like grasping objects, releasing objects and walking to a place. In the literature the concept is often called multimodal or hierarchical reinforcement learning, which means that on the top layer the user types in commands like “grasp” or “open gripper”, and on the lower layer every command is connected to a q-table which executes the action.
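The two layers can be sketched as a dictionary of tables. The command names come from the text; the table contents, the gripper states and the function names are hypothetical, and the tables are filled in by hand here instead of being learned.

```python
# Hypothetical low-level tables: one state -> action lookup per command.
# In a real system each table would be learned, not written by hand.
SKILLS = {
    "open gripper": {"closed": "open", "open": "done"},
    "grasp": {"open": "close", "closed": "done"},
}


def execute(command, gripper_state):
    """Run one high-level command until its table reports 'done'."""
    table = SKILLS[command]
    actions = []
    while True:
        action = table[gripper_state]
        if action == "done":
            return actions, gripper_state
        actions.append(action)
        # Toy world model: the action directly sets the gripper state.
        gripper_state = "open" if action == "open" else "closed"


def run_program(commands, gripper_state="closed"):
    """Top layer: the user supplies the sequence of commands."""
    log = []
    for command in commands:
        actions, gripper_state = execute(command, gripper_state)
        log.extend(actions)
    return log, gripper_state


print(run_program(["open gripper", "grasp"]))
```

The top layer never searches; it only dispatches, so each table has to cover only the small state space of its own primitive.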

Let us go into the details of the motion primitive. A simple motion primitive is a push action. The robot finger has a position x,y and the aim is to push an object. The question is: what are the right parameters? One trajectory of the robot could be 10,0–12,0–14,0, another would be 10,0–10,2–12,4, but different trajectories are also possible. The abstract problem description is that the robot finger can do something in the x-y space and the system is affected by this. Somewhere at the end, the reward function signals that the goal was reached. It is a classical reinforcement learning problem. All motion primitives can be described in that way. The robot hand always has some degrees of freedom to act, and this affects the system. At the end the reward is given for completing the task.

A normal robot hand consists of more than one finger, which means we have a multi-agent system. There is less literature about this topic, but it seems that it is also possible to solve the task. The difference is that the state space is bigger.

The classical example in q-learning is the inverted pendulum. Instead of defining the control rule explicitly, the algorithm determines the q-table by itself. The remarkable thing is that it works: after some trials a stable q-table is found. If the length of the pendulum is different, then the q-table has to be calculated from scratch.

But I want to go a step back. The inverted pendulum problem consists first of an algorithm, that is, a rule for what to do in which situation. For example, the pendulum is on the left and it is falling downward. The player must react properly. The final q-table takes such a decision. It stores the rules for every situation. The second aspect of the problem is how to find the q-table. This is done by the reinforcement algorithm, which is mostly a search algorithm for maximizing the reward function.

Let us take a look at a perfect q-table. The q-table controls the pendulum. Surprisingly, the q-table does not consist of mathematical equations or source code but of state-action pairs.
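How such a table is found can be sketched with tabular q-learning. To keep the example short and reproducible, the pendulum is swapped for a much simpler toy problem, a corridor of five cells with the goal at the right end; the environment, the parameter values and the function names are all assumptions for illustration.

```python
import random

N_STATES = 5        # cells 0..4 of a toy corridor (not the pendulum)
GOAL = 4            # reaching the right end gives reward 1
ACTIONS = (-1, +1)  # step left, step right


def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular q-learning; returns a dict keyed by (state, action)."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                # Exploit; equal values are broken at random.
                action = max(ACTIONS, key=lambda a: (q[(state, a)], rng.random()))
            nxt = min(max(state + action, 0), N_STATES - 1)
            reward = 1.0 if nxt == GOAL else 0.0
            best_next = max(q[(nxt, a)] for a in ACTIONS)
            # Standard q-learning update rule.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = nxt
    return q


def greedy_policy(q):
    """Read the finished table as a list of state-action pairs."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)])
            for s in range(N_STATES) if s != GOAL}


print(greedy_policy(train()))
```

The returned policy contains no equation at all, only one preferred action per state, which is exactly the “state-action pairs” view of the table.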


I’m not the first author who describes a mixture of natural language instructions and reinforcement learning. In the Robobarista project this idea was implemented on the PR2 robot. [5] The first half of the paper is nothing special: the authors explain how a recurrent network works. The innovative aspect of the paper is that the network is connected to 230 words in a dataset, which is a kind of multimodal learning. With that feature it is possible not only to generate a single trajectory but also to solve complex tasks.

It is not the first paper which connects a neural network with “natural language instructions”. The same principle was used for solving Atari games [3], which was also published in 2017. But the Robobarista paper goes a step further, because in that project a real robot was used as well.

The principle does not only solve internal problems of Artificial Intelligence, like generating a trajectory with a neural network; this time the PR2 robot solves a problem in a real environment. Even for people who are not interested in robotics, the video looks amazing.

1.4 Q-Learning

The short answer to what q-learning is has to do with genetic programming: a given small program is improved by an algorithm in order to solve a problem. In the following, the inner workings are explained in detail.

A Markov decision process (MDP) can be compared to a probabilistic Turing machine. It is an automaton which does the same thing a C++ program does. Normally an MDP has the structure of a q-table: there are states, and for every state some actions are possible. If an action has a probability of 0.5, then the action is chosen in 50% of all cases. The principle can be called a learning program, because it is not necessary to code it manually; instead the q-learning algorithm finds the answer with data mining.

| state  | action 1  | action 2  | action 3 |
|   1    |   0.2     |   0.5     |   0.3    |
|   2    |   1.0     |    0      |    0     |
|   3    |   0.33    |   0.33    |   0.33   |
Figure 1: q-table

Let’s take a look at the figure “q-table”. The automaton has three possible states. If the current situation is equal to state 2, then the automaton selects action 1, because it is the only one with a probability above zero. In the case of the other states, the automaton uses a random generator according to the probabilities in the q-table.
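The selection step can be written down directly from the figure. The probabilities are the ones from the table; the function and variable names are made up for this sketch.

```python
import random

# Rows of the q-table from Figure 1: probabilities for action 1, 2, 3.
Q_TABLE = {
    1: [0.2, 0.5, 0.3],
    2: [1.0, 0.0, 0.0],
    3: [0.33, 0.33, 0.33],
}


def select_action(state, rng=random):
    """Draw one action number according to the row of the given state."""
    return rng.choices([1, 2, 3], weights=Q_TABLE[state])[0]


print([select_action(1) for _ in range(10)])  # mixed, mostly action 2
print([select_action(2) for _ in range(10)])  # always action 1
```

random.choices only needs relative weights, so the row for state 3 works although it sums to 0.99 and not exactly to 1.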

So what can we do with this principle? The same as we can do with genetic programming: evolve a program until a goal is reached. As in genetic programming, the learning algorithm needs a great deal of CPU time to find an answer. But for small problems which are not too complicated, like steering to a point in a maze, the system works well.

A good way to recognize the power of q-learning is to compare it with procedural animation as described by Reynolds [6]. Reynolds manually programmed source code which controls the steering wheel of a robot. The source code consists of an equation like “goal=targetangle-sourceangle” plus some additional if-then statements. If the Reynolds car doesn’t steer correctly, the programmer must improve the algorithm at the source code level. In contrast, the q-learning algorithm uses a q-table for storing the steering equation, and the table is updated automatically.

So, if q-learning is so powerful, why isn’t all computer software programmed with this technique? The answer has to do with complexity. Steering a car is an easy task which consists of a few steps. The q-learning algorithm is able to find the solution in a small amount of time. On the other hand, most problems in computing, like programming an operating system, are complex tasks. The state space is much bigger, and the q-learning algorithm wouldn’t find a solution in a short time.

What today’s Artificial Intelligence researchers are trying to do is to use q-learning and similar techniques as much as they can, because programming a q-learning system is easier than programming the control program by hand. Another example is the cartpole problem. In theory it is possible to solve the problem with procedural animation. At first we need a mathematical equation which calculates the balance, and this is used to control the game. The disadvantage is that testing such a formula is very complicated, and if the problem is slightly different, perhaps a double inverted pendulum, the equation is wrong and must be found again. In contrast, the q-learning concept can be adapted to nearly all problems.

1.5 Function approximation for inverted pendulum

The inverted pendulum problem is well known in the reinforcement learning community. It is an example of an optimal control problem. The policy for solving the task can be described as a state-action vector, which means that if the pendulum is in state 1 the correct action is “right”, in state 2 it is also “right”, and so on.

In q-learning terminology the policy is written into the q-table, and this can also be expressed in a graphical notation. Some videos show a graphical representation which has the current angle on the x-axis and a color code on the y-axis which stands for the correct action. A more abstract way to talk about the subject is to call the problem a function approximation task: in an x-y diagram some points are given and the task is to draw a line through them. If the function is executed on the inverted pendulum, it will stand upright.

Here it is possible to explain how to transfer this technique to more sophisticated problems. The inverted pendulum problem is not very advanced. It is the standard problem which is given in every beginner tutorial about q-learning. A more advanced task is to control a robot hand. The fascinating aspect is that the technique is the same. The movements of the hand are captured by a data glove, and as in the q-learning task the next step is to do a function approximation. That means finding a compact representation which describes the function and interpolates between unknown points. For doing this there are many mathematical techniques. The simplest form is the q-table known from q-learning, but Fourier transformation, radial basis networks, neural networks or dynamic movement primitives are also possible. Sometimes the function approximation is done with principal component analysis. All of these techniques work on the same principle: at first, points in an x-y coordinate system are given, and the algorithm searches for a function which connects the dots.
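The “connect the dots” step can be sketched with the simplest possible approximator, piecewise linear interpolation. It is used here only as a stand-in for the techniques listed above, and the sample points are invented; a real recording from a data glove would have many more dimensions.

```python
from bisect import bisect_right

# Hypothetical demonstration samples: (angle in degrees, recorded action).
SAMPLES = [(-20.0, -1.0), (-5.0, -0.25), (0.0, 0.0), (5.0, 0.25), (20.0, 1.0)]


def approximate(angle):
    """Interpolate linearly between the two nearest recorded points."""
    xs = [x for x, _ in SAMPLES]
    if angle <= xs[0]:
        return SAMPLES[0][1]
    if angle >= xs[-1]:
        return SAMPLES[-1][1]
    i = bisect_right(xs, angle)
    (x0, y0), (x1, y1) = SAMPLES[i - 1], SAMPLES[i]
    return y0 + (y1 - y0) * (angle - x0) / (x1 - x0)


print(approximate(2.5))    # between two recorded points
print(approximate(-20.0))  # exactly on a recorded point
```

Swapping this function for a neural network or a radial basis network changes the quality of the interpolation, but not the interface: a state goes in, an action comes out.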

Now follows an explanation of how the robot itself works after it has learned the function. We go back to the inverted pendulum problem, because it is easier to explain. The robot has some input: the current speed of the pendulum, the direction in which it is falling and the current angle. This is a description of the current situation or, more generally, the input vector. Now it is up to the robot to take a decision. It can do nothing, or move the cart to the left or to the right. To get this information, the robot looks into a lookup table, which is also called the q-table. It searches for the state and sees in that row which action is the right one. This action is executed.
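The lookup can be sketched as follows. The table is filled in by hand with plausible entries and is not a learned result; the bin thresholds and all names are assumptions for illustration.

```python
# Hypothetical hand-filled q-table: (lean, motion) -> action.
# "lean" is the side the pendulum tilts to, "motion" the side it falls to.
Q_TABLE = {
    ("left", "left"): "left",
    ("left", "none"): "left",
    ("left", "right"): "nothing",
    ("none", "left"): "left",
    ("none", "none"): "nothing",
    ("none", "right"): "right",
    ("right", "left"): "nothing",
    ("right", "none"): "right",
    ("right", "right"): "right",
}


def sign_bin(value, dead_zone):
    """Turn a continuous value into one of three coarse bins."""
    if value < -dead_zone:
        return "left"
    if value > dead_zone:
        return "right"
    return "none"


def choose_action(angle_deg, angular_velocity):
    """Discretize the input vector and read the action from the table."""
    state = (sign_bin(angle_deg, 1.0), sign_bin(angular_velocity, 0.1))
    return Q_TABLE[state]


print(choose_action(-20.0, -0.5))  # leaning and falling to the left
```

At run time no calculation beyond the discretization takes place; the whole control knowledge sits in the nine table entries.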

1.6 Control policy explained

In the area of reinforcement learning the term “policy” is used quite often. Normally it is some kind of function approximation between input state and output action. For example: if in the inverted pendulum problem the pendulum is on the left at an angle of -20, then the action is -1. The aim is to find the correct action for the complete state space, so that the robot knows what to do in every situation. In contrast to a normal computer program, no further calculation is done; instead the policy is similar to a lookup table. A neural network is able to store the table with a high compression rate, so that millions of input-output situations can be stored.

2 Neural Network

2.1 How powerful is deeplearning?

At first glance, not very. Even the newly invented Tensor Processing Units from Google only have a capacity in the teraflop range. On the hardware level they are very fast, but in comparison to the problems which have to be solved in robotics the performance is not enough. I want to give an example. Suppose we want to calculate the shortest route through 100 cities. Finding the optimal solution by trying out every route overwhelms every currently available CPU. That means the algorithm cannot be run until it stops. The reason is that the state space of the travelling salesman problem is huge.
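The size of that state space can be checked with a few lines. This assumes the symmetric version of the problem, in which a round trip and its reverse count as the same tour; the function name is made up.

```python
import math


def tour_count(n_cities):
    """Number of distinct round trips through n cities (symmetric case)."""
    # Fix the start city, order the remaining ones, ignore the direction.
    return math.factorial(n_cities - 1) // 2


for n in (5, 10, 20, 100):
    print(n, tour_count(n))
```

Five cities give only 12 tours, but for 100 cities the number has more than 150 digits, so enumerating all of them is hopeless no matter how many teraflops are available.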

But calling deep learning a waste of time is too pessimistic; instead, the brute-force power has to be used wisely. That means that before training the neural network itself, some preliminary decisions have to be taken to reduce the problem space. Normally this is done with a high-level symbolic planner. In the computing literature this is often called PDDL planning or natural language instructions, and it means subdividing a problem into smaller parts. Instead of calculating what the robot should do for the next 60 minutes, the task is subdivided into actions like “grasp the object”. This motion primitive is then solved with deep learning, and this works great. A small problem with a minimal state-space is exactly the kind of task which a deep learning GPU can solve.

Solving is another word for avoiding manual programming. Instead of entering a rule or formula which drives the robot arm to the object, a genetic algorithm is used for reaching the goal. Only the reward function has to be defined manually.

2.2 Qualitative physics with neural networks

Qualitative physics is a semantic description of physical events. For example, an inverted pendulum can have the event “is-falling-down”. In most papers, qualitative physics is simply ignored because it seems too complicated. But it can be used to improve the learning speed of neural networks.

Normally an event in a qualitative physics model is derived from the system state. The event “pendulum-is-falling-down” can be derived from the angle. Such a semantic description can guide the learning progress of the neural network, in the sense that the network decides whether the event is important or not. So the idea is to give the neural network not only the minimum information, but all known information.
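A minimal sketch of such an event, assuming the observation consists of an angle in degrees and an angular velocity. The function names and the 15 degree threshold are arbitrary choices for this illustration and are not taken from any paper.

```python
def is_falling_down(angle_deg, angular_velocity, threshold_deg=15.0):
    """Hypothetical event: the pole is tilted beyond the threshold
    and is still moving further away from the upright position."""
    return abs(angle_deg) > threshold_deg and angle_deg * angular_velocity > 0


def enrich(observation):
    """Append the qualitative event to the raw observation as 0.0 or 1.0."""
    angle_deg, angular_velocity = observation
    event = 1.0 if is_falling_down(angle_deg, angular_velocity) else 0.0
    return [angle_deg, angular_velocity, event]


print(enrich((20.0, 3.0)))   # tilted and moving away from upright
print(enrich((20.0, -3.0)))  # tilted but already recovering
```

The event is redundant in the sense that it is computed from the same raw values, but the network no longer has to discover the relationship on its own.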

The concept was described in an early paper from 1992. [2] The paper is not very well written and most details remain unclear, so I want to present the overall idea a bit better. At first, we control a system manually. The example is steering a car on a road in a top-down physics simulation. The result is that we drive the car to the goal. While we are driving, a logfile is generated. For every second it stores the position of the car, its direction and the steering wheel position.

In the next step we add qualitative physics variables. The first event is called “car-is-near-border” and the second event is “car-is-in-curve”. Both events are optional; the game can be solved without knowing them. And in theory it is possible to derive them from the given information. But we decide to store them explicitly in the logfile. The result is that in the second case, in which more information is given, it is easier for the system to construct a model. Perhaps the reinforcement controller would reduce the speed automatically if the car is inside a curve. Driving the car with the two qualitative events alone is not a good idea. The amount of information is too low: every event can only be true or false, and that is not enough to drive a car. But in combination with the absolute positions and the other information it is possible to construct a controller.

The reason why this works is hard to understand. Let us first look at what a neural network is. A neural network uses data mining techniques to generate a controller. The relationships inside the data table are unknown; the neural network explores them by changing its weights. So the network decides which values are important and how to reach a decision.

In the literature the concept is sometimes labelled “linguistic variables”. It is possible to combine linguistic knowledge with neural networks:

“The neuro-fuzzy system uses the linguistic knowledge of fuzzy inference system and the learning capability of neural network” [8]

2.3 Recurrent neural networks

As an example I want to use the inverted pendulum problem, which is relatively easy to control. The plain vanilla strategy is that input values like the angle and velocity of the pendulum are fed into the neural network, the weights calculate something, and the output neuron prints 0 for moving the cart to the left or 1 for moving it to the right.

How can we improve the neural network? The only information which is provided to the network is the angle and the velocity. This is not enough. What the network really needs is information from the past and even from the future. Information from the past is easy to get. We take the angle and velocity from the previous time step. For example, the angle at the last time step was 44 degrees and the current one is 47 degrees. So the network knows in which direction the pendulum is moving.

The information about the future step is a bit more difficult to get. The angle of the next step is not known yet, because no control impulse has been sent to the system. But we can run a forward simulation and try both possibilities. We let the simulation run one step forward with output 0 for left and with output 1 for right, and we measure the angle in both cases. So we have the information about what the pendulum will do. The new input vector for the neural network is:

previous angle, current angle, future angle on action left, future angle on action right, velocity.

With this rich amount of information it is very easy for the neural network to solve the problem. It is informed about the past, present and future and knows the values for the angle and the velocity. The only task left is to bring all the information into order and calculate the correct output for controlling the system.
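The construction of this input vector can be sketched as follows. The `step` function is a hypothetical toy model in which the action acts as a direct torque on the pendulum; it is an assumption made for this example and not the physics of any real simulator.

```python
import math

DT = 0.02  # assumed length of one simulation step in seconds


def step(angle, velocity, action):
    """Toy forward model (assumption): action 0 pushes the pendulum
    towards the left, action 1 towards the right. Angle in radians."""
    push = -1.0 if action == 0 else 1.0
    velocity = velocity + (9.8 * math.sin(angle) + 2.0 * push) * DT
    angle = angle + velocity * DT
    return angle, velocity


def build_input(previous_angle, angle, velocity):
    """Return: previous angle, current angle, future angle on action left,
    future angle on action right, velocity."""
    future_left, _ = step(angle, velocity, 0)
    future_right, _ = step(angle, velocity, 1)
    return [previous_angle, angle, future_left, future_right, velocity]


print(build_input(0.09, 0.10, 0.5))
```

The forward model is called twice per time step, once per possible action, so the extra cost grows with the number of actions.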

2.4 Combining planning with neural networks

So-called bidirectional neural networks use data from future states of a system for calculating the control signals. The future states can be retrieved with a forward simulation, which is normally done with RRT. Such a system is in reality both a classical physics planner and a neural network. Here are some details:

In the Cartpole problem of OpenAI gym two actions are possible: 0=left, 1=right. If we move the cart to the left, the system will be in a new state. How do we decide which direction is the right one? The best decision is based on as much information as possible: the current state, the past state and what the system will look like if we take a certain action. Additional “qualitative physics” information is useful.[footnote:[4] describes on page 9 a car which is controlled by a neuro-fuzzy system.] So our control policy is fed with many different kinds of information. Some of them are easy to get, for example the current angle of the pendulum. Others are a bit tricky to retrieve, for example the angle from the step before, and some are really hard to get, for example the state the system will be in if we use the 0=left action. This information can only be obtained with a forward simulation.

The overall input signal is stored in a long array. It is far bigger than the standard observation array which is normally used in the OpenAI gym. The question is how exactly we use this information for calculating the control signal. At first we store the information in a database. Then we try out some policy, which can be random or generated with learning from demonstration. This enormous dataset is now fed into the neural network. As a result, the network learns how to combine the information into the best possible policy.

I do not believe that a complex neural network architecture like LSTM or deep Q-learning is necessary. Instead, the minimal example is a single neuron which has 100 input signals and one output signal. The input signals are the rich information described above and the output is the control signal to the system.

Why is the input data so important? Let us look at an example. The pendulum is nearly on top and we must decide what to do next. If we have a simulation environment for testing what-if scenarios, the answer is simple. We test action 0 and action 1, and if one of the actions generates a higher score we take it. This forward simulation can be done with RRT. Here is the result of the simulation:

action0 -> 50 points
action1 -> 100 points

If we take this information as an input, the neural network can be very simple. Because 100 is greater than 50, action1 is better.

The interesting thing is that in a standard neural network this information is not available. Usually a neural network knows only the current situation and cannot access the data of an RRT simulation. So the neural network must internally calculate some policy with lots of weights to determine the right action, even though it does not know what happens next. Is this restriction meaningful? Is it necessary to withhold the information? No; the question is rhetorical. Mostly the information is withheld without any discussion.

In the literature the concept is not very common. The neural network type which uses information from the future is called a bidirectional neural network. If information from the past is used, it is a recurrent neural network. For a network which uses information from qualitative physics I do not know an established name, and the capability of language understanding is called “hierarchical reinforcement learning”. If we combine all of them, our network can be called:
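A minimal sketch of such a what-if test, assuming a toy pendulum model in which the action acts as a direct torque. The model, the scoring rule (10 points per step with the angle inside ±0.2 rad) and all names are assumptions for this illustration; a plain rollout stands in for the RRT simulation.

```python
import math


def rollout_score(angle, velocity, action, steps=10, dt=0.02):
    """Hold one action for several steps in the toy model and count
    10 points for every step in which the pendulum stays near upright."""
    push = -1.0 if action == 0 else 1.0
    score = 0
    for _ in range(steps):
        velocity += (9.8 * math.sin(angle) + 2.0 * push) * dt
        angle += velocity * dt
        if abs(angle) < 0.2:
            score += 10
    return score


def best_action(angle, velocity):
    """Score both actions by forward simulation and take the better one."""
    scores = {action: rollout_score(angle, velocity, action) for action in (0, 1)}
    return max(scores, key=scores.get), scores


action, scores = best_action(0.15, 0.5)  # pendulum is falling to the right
print(action, scores)
```

Once the two scores are available, the decision itself is a single comparison, which is the point of the argument above.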

“hierarchical bidirectional recurrent neural network with linguistic variables”

The general idea is to extend the number of input values and to use a standard perceptron as the neural network. A perceptron is known to solve easy problems with a bit of training, and we as programmers must only ensure that the problem is easy. I want to give another example of how to make the bidirectional neural network work.

For simplicity, assume that the Cartpole problem in OpenAI gym is described by a single input variable, the current angle of the pendulum. So the structure is:

angle -> neural network -> output neuron

The task of the neural network is to calculate the output from the current angle. This task is very complex. We can make it easier if we help the network a bit. We test our game engine with the different actions and measure the resulting future angle. So the structure is:

(angle current, future angle on action left, future angle on action right) -> neural network -> output neuron

This time the task is easier to solve. The number of weights can be smaller, and the learning process takes less time. We can help the network even more, for example by also calculating the angle 2 steps into the future. Then it is no longer a classical neural network; it is rather a planning algorithm which is tuned by a neural network.

associative memory

It is possible to raise the abstraction level a little bit. The input neurons of a neural network can be imagined as an associative memory. The neural network performs simple operations on the information in this memory. Instead of using a perceptron-like neural network, the more general idea is to use a stack-based Turing machine. The input signals are stored on the stack in linear order. Then the program runs, does something, and at the end a result is printed out. Finding such a computer program is done with a genetic algorithm which tests many possibilities. The reason why machine learning uses neural networks and not Turing machines is that a neural network can calculate the result more quickly. Instead of executing a complex algorithm, every neuron sums up its inputs and that’s it.

Neural networks of this kind are not fully Turing-complete, but they are good enough for easy tasks. They have more in common with function approximators.

3 OpenAI gym

3.1 Very basic OpenAI gym agent

In its standard version, the OpenAI gym software ships only with a random-action agent. That means there is no policy and the task is not solved. The alternative is to use a table of weights and multiply them with the current state vector. The easiest way to find the weights is a random-sampling algorithm: in every episode the weights are initialized at random, and if the reward is better than before, the new weight vector is used as the policy.

The concept is not totally new; it is the simplest form of a neural network. 4 input numbers are mapped to 4 weights and no further layer is in use. The interesting thing is that after some iterations a better weight combination is found. How can the policy be improved? At first it is important to replace the weight matrix with a more powerful Turing machine. The simplest one is used in the busy beaver challenge. A Turing machine can also be imagined as a weight matrix, but it can perform more tasks.

The next question is how to find the parameters for a Turing machine. The search technique is called genetic programming, because the performance improves from iteration to iteration. The main problem is that it works only for small problems with small input vectors. What the reinforcement learning and genetic programming communities are doing is searching for algorithms, weights and solvers which find the solution faster. Mostly this is not possible, because machine learning itself has limits.

On one hand it is right that a randomly initialized busy beaver machine which is evolved by brute force is not very powerful. On the other hand, even sophisticated deep learning algorithms are not much better. They can learn faster, but not as much as needed. I would guess that if a naive brute-force solver which adapts the weights at random needs 100 seconds, then the best deep learning algorithm needs perhaps 20 seconds for the same task.

The conclusion is that machine learning can be reduced to a simple Turing machine which has parameters, and these are evolved by an inefficient algorithm.
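The random-sampling idea can be sketched without the gym library. The example below assumes a hypothetical toy balancing task with only two state values (angle and velocity) instead of the four of the real Cartpole; the dynamics and all names are invented for the illustration.

```python
import math
import random


def run_episode(weights, max_steps=200, dt=0.02):
    """Run the toy balancing task and return the number of steps survived.
    The policy multiplies the weights with the state vector and
    chooses action 1 (right) if the sum is positive, else action 0 (left)."""
    angle, velocity = 0.05, 0.0
    for t in range(max_steps):
        action = 1 if weights[0] * angle + weights[1] * velocity > 0 else 0
        push = 1.0 if action == 1 else -1.0
        # Toy dynamics (assumption): moving the cart right tilts the pole back left.
        velocity += (9.8 * math.sin(angle) - 2.0 * push) * dt
        angle += velocity * dt
        if abs(angle) > 0.2:
            return t
    return max_steps


def random_search(iterations=50, seed=0):
    """Sample random weight vectors and keep the one with the best reward."""
    rng = random.Random(seed)
    best_weights, best_reward = None, -1
    for _ in range(iterations):
        weights = [rng.uniform(-1.0, 1.0) for _ in range(2)]
        reward = run_episode(weights)
        if reward > best_reward:
            best_weights, best_reward = weights, reward
    return best_weights, best_reward


print(random_search())
```

Nothing is learned from a failed sample; every iteration starts from scratch, which is why the approach only scales to small weight vectors.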

3.2 OpenAI gym with sub-policy

The best environment for researching in detail how to implement a robot control system is the OpenAI gym software. A broad community has formed around this tool and the games are standardized. As a consequence, discussions about problems and solvers are easier. If somebody wants to tune a neural network for solving Pacman inside OpenAI gym, there is no need to explain what a neural network or Pacman is; instead it is clear what is wanted. The disadvantage of OpenAI gym is that only Python is supported, not C++. But that small drawback can be ignored.

So what has the OpenAI gym community found out about how to solve the games? One of the more advanced techniques is called sub-policy. That means a form of multi-modal learning in which a bot supports different commands. For example, a biped walker can stand up, walk forward and jump. Each command is learned separately, and at the end an additional layer decides which sub-policy has to be activated. With that strategy it is possible to generate more complex bots.

Another idea for improving the AI is to use “qualitative physics”. That means that the observations from the game are enhanced with linguistic variables. The programmer tries to encode additional knowledge into the game. An example: in the Pacman domain, the raw data is fed into an event parser. Possible events are enemy-is-near and pacman-on-border. Each event can be true or false. The calculation is done manually in the source code. The event variable is fed into the neural network and supplements the raw data, so the neural network can learn quickly. The third option which improves the performance of bots dramatically is dedicated GPU hardware which provides teraflop-range performance. All three strategies combined (sub-policy, qualitative physics and GPU) result in a high-end AI which can solve many games.

The reason why reinforcement learning is so powerful is that manually coded heuristics can be combined with machine learning. The normal machine learning bot is not very efficient. It uses a trial-and-error strategy and must try many weights until the neural network is able to solve a problem. But if a programmer gives some short hints like possible actions and linguistic variables, the neural network will learn much more quickly. The interesting aspect is that the human programmer does not need to understand the game in detail, and there is no need to program the bot in detail. It is enough to give only a part of the solution; the rest is calculated by the neural network with brute force.

The best example is the famous game “Montezuma’s Revenge”. A vanilla neural network is not able to solve the game, because the state space of all weights is too large. Even after training for many weeks on GPU hardware, the bot fails to reach the door. But with simple additional information the learning process can be improved dramatically. At first, sub-policies like “walktodoor”, “down-ladder” and “pick-up-key” are defined. Then linguistic variables like “enemy-left” and “bot-in-middle” are defined. All this information is fed into the neural network. Now the search for weights starts again, and this time the bot is able to play the game. This is called guided policy search and means that the brute-force search for the right policy is supported by some simple manual programming, which divides the problem space into smaller pieces.
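The layering can be sketched as follows. In a real bot each sub-policy and the selection layer would be learned; here all of them are hand-written placeholders, and the state keys and action names are invented for the illustration.

```python
# Placeholder sub-policies: each maps a state to a low-level action.
def stand_up(state):
    return "extend_legs"


def walk_forward(state):
    return "swing_left_leg" if state["step_count"] % 2 == 0 else "swing_right_leg"


def jump(state):
    return "push_off"


SUB_POLICIES = {"stand_up": stand_up, "walk_forward": walk_forward, "jump": jump}


def select_sub_policy(state):
    """Additional layer (hand-written stand-in for a learned selector)
    which decides which sub-policy has to be activated."""
    if state["fallen"]:
        return "stand_up"
    if state["obstacle_ahead"]:
        return "jump"
    return "walk_forward"


def act(state):
    """Return the chosen sub-policy name and the low-level action it produces."""
    name = select_sub_policy(state)
    return name, SUB_POLICIES[name](state)


print(act({"fallen": False, "obstacle_ahead": True, "step_count": 3}))
```

Because the selector only sees three commands, its search space is far smaller than the space of all low-level actions.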

4 Learning from Demonstration

4.1 Does learning from demonstration make sense?

In the robotics literature there is a huge number of papers which explain the “learning from demonstration” (LfD) paradigm. For all who are not familiar with the concept, here is a short introduction. We have a map with a robot, an obstacle and a goal. The normal idea for bringing the robot to the goal is to use the brute-force RRT algorithm, which randomly samples possible trajectories and is finished as soon as it has found a way to the goal. According to the LfD paradigm, this idea is wrong. Instead it is necessary to guide the search for a trajectory. The first step is that a human operator draws some trajectories into the map, and the search for the final trajectory is guided by this data.

But why is it necessary to guide the search? Why must the human operator draw examples into the map? Answering the question is not easy, because LfD takes it as a precondition that this behaviour is right. According to my knowledge about path planning, it makes no difference whether a human operator teaches trajectories before the solver searches for a solution. The LfD process can be dropped. Instead, a brute-force open-horizon solver which samples all possible trajectories from scratch is the better alternative.

The reason why some people think that this is not optimal has to do with the huge state space. Normally an RRT solver is not able to find a solution, because the number of possible trajectories is too big. So in real-life applications, RRT is often combined with a heuristic. Learning from demonstration is such a heuristic which can accelerate the search. But inside the RRT universe, other heuristics are also possible. Mostly they use cost functions which are customized to the domain. Other heuristics divide the map into small submaps and solve each sector separately.

I think we should differentiate between the algorithm itself and possible tweaks which let it run faster. The algorithm itself for finding a trajectory is brute-force sampling. That means the robot selects with a random generator whether it wants to go north, south or east, and tests whether there is a way or not. Tweaking the algorithm means storing the graph in a database, searching with a heuristic or using an LfD routine. But all these tweaks are not important; in easy problems they can be left out.

Let’s take a look at how a standard path planner in the RRT context works. Normally the operator must define the goal, for example that the robot should go to position (10,20). Then the operator defines some conditions, for example that the robot should not collide with an obstacle and must stay away from the corner. Now the operator presses the run button and the solver presents a solution. The trajectory to the goal is calculated by trying out possible alternatives and evaluating each with a score. The inner working is that the operator defines an abstract goal, and a solution is found with a CPU-intensive search.

Somebody may argue that we can optimize the algorithm to calculate the solution in a shorter amount of time and with less CPU power. But that is not a must-have. In general a trivial brute-force solver works reasonably well. I would go a step further and state that explaining the inner workings of an improved RRT solver, which uses mathematical tricks to find the trajectory faster, is an anti-pattern when explaining the overall task. IMHO the LfD paradigm explains how to make the trajectory search faster and leaves out the explanation of what the goal is.
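A minimal sketch of such a brute-force sampler on a small grid map. It is plain random sampling of whole walks, not a real RRT (no tree is built); the map, the step limit and all names are assumptions for this illustration.

```python
import random

# Hypothetical map: S = start, G = goal, # = obstacle, . = free cell.
GRID = [
    "S.#.",
    ".##.",
    "...G",
]
MOVES = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}


def find(symbol):
    """Return the (row, column) of the first cell containing the symbol."""
    for r, row in enumerate(GRID):
        c = row.find(symbol)
        if c >= 0:
            return (r, c)
    raise ValueError(symbol)


def random_walk(rng, max_steps=30):
    """Sample one trajectory with a random generator; return it if it reaches G."""
    pos = find("S")
    path = [pos]
    for _ in range(max_steps):
        dr, dc = rng.choice(list(MOVES.values()))
        r, c = pos[0] + dr, pos[1] + dc
        # Test whether there is a way: stay inside the map and avoid obstacles.
        if 0 <= r < len(GRID) and 0 <= c < len(GRID[0]) and GRID[r][c] != "#":
            pos = (r, c)
            path.append(pos)
            if GRID[r][c] == "G":
                return path
    return None


def brute_force_plan(tries=5000, seed=0):
    """Sample many walks from scratch and keep the shortest successful one."""
    rng = random.Random(seed)
    best = None
    for _ in range(tries):
        path = random_walk(rng)
        if path is not None and (best is None or len(path) < len(best)):
            best = path
    return best


print(brute_force_plan())
```

There is no graph database, no heuristic and no demonstration in this sketch; on a map of this size the plain sampler is sufficient.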

4.2 Micromanipulation planning

The most effective technique in AI which was ever invented is brute-force search. This technique is able to solve every game and every robotics problem. The solver finds a solution not only for chess and path planning problems but also for Mario AI, Starcraft AI and dexterous grasping. The only problem is the high CPU consumption. In reality, only current hardware is available and the robot must find a solution in under 1 minute. With these constraints a brute-force solver is difficult to implement.

Somebody may argue that a brute-force solver is the wrong way, and that other techniques have to be found. No, it is not. The question is only how to use the solver in the right way. There are two possibilities for saving CPU time:

• calculate only micromanipulation tasks
• use heuristics

I want to discuss both in detail. A complete pick&place task cannot be solved with brute force alone. The trajectory would consist of many seconds of continuous actions, and today’s computing power is not sufficient for that. But what would happen if we subdivide the task into smaller subactions? For example, the robot hand is in contact with an object and the solver has to calculate the next step. That means it should answer the question whether the robot arm must push strongly or gently against the object to bring it to the goal. Such a detailed question can be answered by a brute-force solver, because the time horizon in which the push action takes place is small.

Additionally, the action space can be reduced further with dedicated heuristics. The programmer can specify in a general way what the push action looks like, which cuts down the number of possibilities. These can be calculated on normal hardware in a few seconds.

In the literature the concept of subactions which have only a small time horizon is known. Many concepts have been discussed for addressing the problem. The easiest form is to use procedural animation. But this technique has the problem that the source code is static, and for most problems it is not known what the mathematical formula looks like. “Learning from demonstration” is also discussed in the literature. But this idea has the disadvantage that it is unclear how to store the policy. For example, a hidden Markov model, an LSTM or a q-matrix is possible. In my opinion the most powerful and most easily understandable solution is a brute-force planner. That means that there is no policy; instead the physics engine is tested with trial and error, in the same way as an RRT path planner finds a trajectory through a maze. The programmer has the task of defining where the goal is and defining some constraints. The rest is done by the solver.

To sum the principle up, it is right to say that the system consists of the following elements:

• high-level “natural language instructions”
• motion primitive which consists of a goal and constraints
• a high-level GOAP like solver
• a motion primitive brute-force RRT solver

In my opinion this technique will allow finding a solution with low CPU usage. The only disadvantage is that I have not tried it out in reality so far, so it is only a guess. Implementing such a system in working C++ code is a bit complicated. Perhaps for the beginning the motion primitive solver alone would be enough, so that not the overall robot system has to be programmed but only one motion primitive like “closegripper”.

What does “closegripper” mean? Normally it has a certain goal. The robot arm must get in touch with the object and reach a certain amount of pressure. How exactly the gripper has to be closed depends on the position and the object size. So there is much room for a solver to try out different alternatives.

4.3 Learning of a dynamic system

Instead of describing a standard optimal control problem like the inverted pendulum, I want to give an easier example from the area of path planning. The advantage is that this problem is well understood in computing. A maze is given which consists of obstacles, and the robot must find a trajectory through the maze. With a path planner like RRT it finds a way. Now we take the robot and place it at a position slightly to the left of the original position. The old trajectory is no longer valid; instead a new one has to be computed.

This problem is similar to a grasping problem in which the object is a bit further to the left than in the demonstration, so that the robot must adapt its grasp trajectory. How to deal with the situation? I asked Google Scholar, but nobody seems to have an answer. I could not find an algorithm which modifies a given trajectory slightly so that the robot can drive to the goal. Instead, in real life a different strategy is used which is called anytime planning. That means calculating the trajectory from scratch.

I selected the path planning problem in a maze because it is easier to observe what is going on there. If the robot is moved only a small amount and should go to the same goal, the resulting trajectory can be completely different. It is the same problem as in a chess game, where we remove only one piece from the board and now every player must recalculate the strategy from scratch.

If it is not possible to modify a given trajectory for a path planner, it is also not possible to modify a trajectory for optimal control problems. The problem space there is bigger and the complexity grows. So from my point of view, complete replanning is necessary in optimal control problems.

4.4 Robobarista

For the Robobarista project there are some papers online, for example this one: [7]. The authors describe a very advanced robotics system. The most interesting aspect is the combination of deep learning for the robot trajectory and a grounded vocabulary for activating a single trajectory. According to my knowledge, this is the first paper worldwide which describes this in such detail. The Robobarista project itself additionally consists of other elements, for example some kind of crowdsourcing and an image recognition engine, but these are not so exciting.

In deep learning research it has been known for many years that neural networks are powerful function approximators which can be trained for every problem. The famous paper about Atari game playing has shown this in detail, but neural networks can also be used for trajectory generation. For a long time the only problem was that it was not possible to integrate the neural network into a robot control system. And using a neural network for controlling complex tasks was not possible, because the training episodes would take too much time. In the Robobarista project a solution was found. On the high level, a “natural language instruction” system is used for describing a task. This is similar to the PDDL paradigm, and every command is connected to a neural network, which is robust against errors on the low level. A nice extra feature is that the overall system is no longer a black box like most neural networks; instead it can be controlled interactively like a text adventure. A possible interaction with such a robot would be:

1. open gripper
2. grasp object
3. close gripper
4. move to place A
5. open gripper
and so forth

The advantage is that the human operator has control over the system, because the robot does exactly what the operator types in. On the other hand, the neural network is capable of learning the task on the fly and acts like a black box, where nobody is sure how a trajectory is generated. I think this combination is the key ingredient for future robotics systems.

The most remarkable aspect of the Robobarista project is that a YouTube demonstration is available which shows the robot in front of a coffee machine. So the technology is not pure science fiction; it is working. Robobarista is not completely new. The idea of grounding with natural language and the idea of deep learning were described in earlier papers. But in the Robobarista project a functional integrated system was described in detail for the first time. According to Google Scholar, the papers were published between 2015 and 2017. As far as I know, this is the most advanced robotics project ever documented as open access. It was done at Cornell University as a group effort, not by a single person.

4.5 Static motion primitive

Usually, motion primitives are the part of a robot control system which is created with deep learning. So-called adaptive PID control or neural networks for optimal control are used for regulating a dynamic system. The best example is the inverted pendulum problem, in which a policy is used for bringing the system into a goal state. The assumption is that the task is so complicated that the actions must be generated from scratch, and that this can only be done with a neural network.

This assumption is wrong. Surprisingly, static motion primitives work perfectly in the domain of manipulation. A corpus of static motion primitives consists of around 50 possible actions, and each one has a fixed trajectory. For example, “pushleft” means that the robot arm moves 10 cm to the left, and “pushsmallleft” means that the arm moves 5 cm to the left. At first it does not seem possible to use static trajectories for something useful, but it depends on the right order. A simple experiment in the simulator has shown that with only 50 primitives it is possible to bring an object into every desired position. No additional parameters were used to modify a trajectory; instead a planner works only on the high-level symbolic layer. That means a motion pattern like “pushleft, pushright, pushsmallleft” is different from “pushright, pushright”.

How exactly can a static motion primitive be defined? Normally it has a name in natural language. This makes it easier for the human operator to memorize the primitive, and the name can be entered interactively for controlling the robot. Additionally, the motion primitive consists of the action itself. In the easiest form it is a relative movement, for example (-10,0) means moving the robot arm 10 cm to the left. That’s all. No complicated spline trajectories are needed, and neither is a policy which uses additional state information as in reinforcement learning. Instead, the success of the overall system depends on the symbolic planner. It must bring the motion primitives into the right order. For doing so, a physics-based RRT planner is the right choice. The physics simulation is tested with different plans, and the plan with the highest score is printed out.
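A minimal sketch of this idea, with two simplifications that are my assumptions for the example: a handful of primitives instead of 50, and an exhaustive enumeration over short plans with a purely kinematic model standing in for the physics-based RRT planner. Only “pushleft” and “pushsmallleft” are taken from the text above; the other names and the scoring function are invented.

```python
from itertools import product

# Each static primitive is a name plus a fixed relative movement in cm.
PRIMITIVES = {
    "pushleft": (-10, 0),
    "pushsmallleft": (-5, 0),
    "pushright": (10, 0),
    "pushsmallright": (5, 0),
    "pushup": (0, 10),
    "pushdown": (0, -10),
}


def execute(start, plan):
    """Apply the relative movements of the plan one after another."""
    x, y = start
    for name in plan:
        dx, dy = PRIMITIVES[name]
        x, y = x + dx, y + dy
    return (x, y)


def score(position, goal):
    """Higher is better; 0 means the goal is reached exactly."""
    return -(abs(position[0] - goal[0]) + abs(position[1] - goal[1]))


def plan_primitives(start, goal, max_length=3):
    """Test all orderings up to max_length and return the best-scoring plan."""
    best_plan, best_score = (), score(start, goal)
    for length in range(1, max_length + 1):
        for candidate in product(PRIMITIVES, repeat=length):
            s = score(execute(start, candidate), goal)
            if s > best_score:
                best_plan, best_score = candidate, s
    return list(best_plan), best_score


print(plan_primitives((0, 0), (-15, 10)))
```

The planner never modifies a primitive; it only searches over names, which keeps the search on the symbolic layer.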

4.6 Reward function in Inverse Reinforcement learning

Ironically Inverse Reinforcement Learning (IRL) is often described as an alternative to a handcrafted reward function. But in reality both is the same. According to the literature, in IRL the reward function is defined with feature expectation. That is the probability, about which sensor inputs should be there. Let us investigate this in detail.

The idea can be dated back to computer chess. There is a subproblem happen, which is called “board evaluation”. The question is, how good a certain position is. Board evaluation in computer-chess is done, with counting the bishops, and all the other figures for calculating a score. It is common to weight the different features according their importance. The same is done inside IRL. The difference is, that in computerchess it is clear, that the reward function is handcrafted, while in IRL in the papers is written that it is done automatically.
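A minimal sketch of such a handcrafted board evaluation, counting material only. The weights are the conventional textbook piece values (pawn 1, knight and bishop 3, rook 5, queen 9), which are my choice for the example; the board encoding as a string of piece letters is also an assumption.

```python
# Conventional material weights; the king is not counted.
PIECE_VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9, "K": 0}


def evaluate(board):
    """Weighted piece count from White's point of view.
    Uppercase letters are white pieces, lowercase letters are black pieces;
    any other character (digits, slashes) is ignored."""
    total = 0
    for piece in board:
        value = PIECE_VALUES.get(piece.upper())
        if value is None:
            continue
        total += value if piece.isupper() else -value
    return total


start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR"
print(evaluate(start))                     # balanced material
print(evaluate(start.replace("q", "", 1)))  # Black has lost the queen
```

Every number in `PIECE_VALUES` is a human decision, which is what makes the function handcrafted.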

In computer chess a board evaluation is done because it reduces the CPU consumption dramatically. The chess engine knows whether a state in the game tree is good or not and can search regions selectively. The same is true for reinforcement learning tasks like the OpenAI Gym environments. In theory it is possible to solve all the games without a board evaluation, but it takes a long time to find a policy. The better approach is to use at least a hand-coded reward function which guides the policy search in the right direction. The consequence is that the search for the policy doesn’t take hours but is done in seconds. The best way to understand the importance of a reward function is to leave it out. This results in a training phase in which the neural net can’t improve its reward.

A good introduction with the example “inverted pendulum” is given by an older paper [1]. On page 4 a chart is shown with the time on the x-axis and the angle on the y-axis. At first, a human expert demonstrates the swing-up task. The resulting curve is plotted into the chart and shows the angle over 2 seconds. Then the robot must repeat the task. The aim is to follow the curve. For example, at timecode 1 second the angle must be -4 radians.

The interesting feature of such a reward function is that it is robust against mistakes. For example, if the robot can’t reach the optimal -4 radians at that timecode, it is not totally lost. Only the overall score is lower.
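One simple way to write down such a tolerant reward is a sum of squared deviations from the demonstrated curve. The squared error is my own choice for this sketch and not necessarily the formula used in [1]; the function name `tracking_reward` is hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Returns the negative sum of squared angle errors (in radians) between the
// robot's trajectory and the demonstration, sampled at the same timecodes.
// 0 is a perfect match; the larger the deviation, the lower the score.
double tracking_reward(const std::vector<double>& demo,
                       const std::vector<double>& robot) {
    if (demo.size() != robot.size()) {
        throw std::invalid_argument("trajectories must have equal length");
    }
    double reward = 0.0;
    for (std::size_t t = 0; t < demo.size(); ++t) {
        const double error = robot[t] - demo[t];
        reward -= error * error;
    }
    return reward;
}
```

A robot that reaches -3.5 radians instead of the demonstrated -4 at one timecode loses only 0.25 points; the rest of the trajectory still counts.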

Let us compare learning from demonstration with a fixed, programmed trajectory. A fixed trajectory means that the robot repeats the demonstration exactly. It follows the waypoints even if that makes no sense. Learning from demonstration, in contrast, has the aim of obtaining a reward function which can be used by a solver. And if a solver is in the loop, the trajectory is generated from scratch and can take obstacles into account. It is the same principle by which a chess engine works.

The reason why a reward function is created and LfD is used has to do with speed. In theory it is possible to solve a game without a reward function. This results in the non-improving neural network mentioned above, even if the fastest CPU is involved. With LfD and a reward function the CPU usage in the training phase is smaller. That means the solver finds a good policy, and this will drive the robot to the goal. So LfD is to robotics what alpha-beta pruning is to computer chess: a way to accelerate the search in the game tree.

The paper [1] was published before the famous paper by Andrew Y. Ng in which he described the apprenticeship idea. In my opinion, the paper by Atkeson and Schaal is easier to understand because it reduces the task dramatically and gives a concrete example. To sum up the paper, it is enough to know that a value is measured along a time axis, and the reward is determined by how well that value is reached. In the paper the value was the angle of the pendulum. At timecode 0.5 seconds it has one value, at timecode 1.0 seconds another, and so forth. Doing the task right means that the robot reaches the same value at the same timecode.

The first impression is that this strict condition makes no sense, because it is possible to swing up the pendulum on a different timescale, for example faster or with another swing-up period. Yes, that is right, but then we have the problem that we don’t know what the reward over time looks like. The only feedback the robot gets is whether it has solved the problem at the end. Such a delayed reward after a long period without any reward makes a solution very difficult to find. It is possible with a brute-force solver and also with a neural network which is trained over many episodes, but the CPU consumption is high.

Let us investigate this in detail. The robot starts the swing-up task. In OpenAI Gym it would have two possible actions: move the cart left or right. A possible action sequence could be:

0.0=left, 0.1=left, 0.2=right, 0.3=left, 0.4=right, 0.5=left

The first value is the timecode and the second the action of the robot. The consequence is that the pendulum starts to move, and only at the last moment is it possible to answer the question of whether the task is solved or not. How does the robot know at some point in time, perhaps at timecode 0.2, whether it is on the right track? Usually this can be answered with a brute-force solver. From the current state all future states are calculated until a path to the goal is found. The problem with RRT and similar algorithms is that the nodes in the graph have no score. We don’t know how far they are away from the goal. Every node looks equal. What is missing is a reward function, that is, an algorithm which can tell for every state how good it is. Then it is possible to identify a promising path before the goal is reached. And here the idea of LfD comes into play. A demonstration given by an expert can answer the question of how to score an RRT node. The algorithm works in such a way that for every state the feature difference to the expert demonstration is calculated. For example, according to the demonstration the angle should be 30 degrees at timecode 0.2. If the robot reaches an angle of 32 degrees at the same timecode, it is doing very well.

The question remains how to reach a certain angle at a certain timecode. This can be answered by a solver, either an RRT solver or a neural network which has learned a policy. With an RRT solver it is easier to explain the idea. We start at timecode 0.0. The robot has two options: it can do the left action or the right action. According to the RRT paradigm both actions are tested. And now comes the magic. The follow-up state with the higher score is better and will be sampled with a higher priority. If some CPU time is available, the solver can also check the other RRT node, but this is optional. So it is possible to follow only the RRT nodes which have a high reward.
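The priority idea can be sketched with a small tree search. To keep the example short it uses a plain best-first search with a priority queue as a stand-in for the RRT sampler, and a toy dynamics in which every action turns the angle by 10 degrees. Both are assumptions for illustration, as are the names `Node`, `step` and `plan`; a real pendulum needs a physics simulation.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// One node of the search tree.
struct Node {
    double angle;              // toy state: a single angle in degrees
    double score;              // accumulated negative squared error
    std::vector<int> actions;  // -1 = left, +1 = right

    // Lets std::priority_queue return the node with the highest score first.
    bool operator<(const Node& other) const { return score < other.score; }
};

// Toy dynamics (assumption): each action turns the angle by 10 degrees.
double step(double angle, int action) { return angle + 10.0 * action; }

// demo_angles[t] is the demonstrated angle after action number t.
// Always expands the best-scored node; low-scored nodes stay in the queue
// and are only visited if nothing better is left.
std::vector<int> plan(const std::vector<double>& demo_angles) {
    std::priority_queue<Node> open;
    open.push(Node{0.0, 0.0, {}});
    while (!open.empty()) {
        Node best = open.top();
        open.pop();
        const std::size_t t = best.actions.size();
        if (t == demo_angles.size()) {
            return best.actions;  // scores only decrease, so this is the best
        }
        for (int action : {-1, +1}) {
            Node child = best;
            child.angle = step(best.angle, action);
            const double error = child.angle - demo_angles[t];
            child.score -= error * error;
            child.actions.push_back(action);
            open.push(child);
        }
    }
    return {};
}
```

For a demonstration of 10, 20 and 10 degrees the search follows right, right, left directly and never expands the badly scored siblings.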

Again, the robot does not perform the same actions as the human demonstrator. Perhaps the human executed the action sequence left, left, right, left, but the robot decides to do right, left, right, left. What counts is only the difference in the feature, that is, the angle at a certain timestep.

4.7 Learning from Demonstration

Most newbies in the area of robotics have heard about Learning from Demonstration. In the typical YouTube video a human operator takes a robot arm and makes a move with it. Then the robot handles the task on its own. But how does this magic trick work? Or is it only a show?

At first we must distinguish between teach-in programming and LfD. Teach-in programming is a classic technique from industrial robotics and amounts to setting up waypoints with fixed values: p1 is (10,20), p2 is (40,20) and so forth. The robot arm moves along the line through the waypoints. If something happens, for example an obstacle is on the line, the robot is blocked.
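The limitation of teach-in can be made visible with a tiny replay routine. It is only a sketch: `replay_waypoints` and the `blocked` callback are hypothetical names, and the callback stands in for whatever collision sensor a real robot has.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

struct Point {
    double x;
    double y;
};

// Replays fixed waypoints in order and stops at the first blocked one.
// Returns how many waypoints were reached.
std::size_t replay_waypoints(const std::vector<Point>& waypoints,
                             const std::function<bool(const Point&)>& blocked) {
    std::size_t reached = 0;
    for (const Point& p : waypoints) {
        if (blocked(p)) {
            break;  // no solver in the loop, so the arm simply stops here
        }
        ++reached;
    }
    return reached;
}
```

There is no branch in this loop that looks for an alternative path, and that is the whole point of the comparison with LfD.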

In contrast, LfD works with a trajectory solver. That means the task is seen like a game of chess which the robot wants to win. In the LfD training phase the constraints of the task are defined. For example, if the human operator guides the robot arm to pick and place an object, then it is clear that winning means moving the object to the goal. From a mathematical point of view, the initial demonstration generates a cost function. In the second, autonomous phase the robot generates the game tree and moves in the direction in which it stays close to the demonstration.

Now a short example: the human operator moves the robot arm to pick and place an object. He grasps the object, moves the arm and then releases the object. The robot knows that the goal of the game is to release the object at the goal position. If we now block the robot with an obstacle, it will still try to solve the game. It internally calculates a way around the obstacle. That is the difference to teach-in programming.

We can increase the abstraction level a bit. Every task a robot can do can be described as an optimal control problem. That means that, starting from the current state, some actions are executed to reach the goal state. With this principle every game can be solved. The question is only which actions should be done in which order. A naive solution to the problem is a brute-force solver. The robot calculates the complete game tree until one node is equal to the goal state. The disadvantage is that most robotic problems are too complex to be solved this way. Learning from demonstration is a technique for reducing the game tree. It assigns a score to the nodes of the game tree, so the solver only needs to compute a fraction of it. In computer chess the comparable techniques are called branch and bound or alpha-beta pruning, and their purpose is to speed up the search. The remarkable aspect is not that the robot can solve the task; the interesting feature is the reduced CPU usage.

On YouTube there is another video called “Teaching a Robot to Roll Pizza Dough A Learning from Demonstration Approach (short)”. What is the task? The current state is a ball of dough and the goal state is rolled-out dough. The transition to the goal state is done by actions. It would make no sense to use teach-in to set up a fixed trajectory, because the dough ball can be bigger or smaller. Instead a solver is needed which calculates the trajectory from scratch. Every LfD system has such a solver. It uses the data from the demonstration to bring the system into the goal state. Perhaps a short example: the robot starts to roll out the dough. Its RRT solver generates a new random action and the dough falls from the table to the ground. According to the learned demonstration this behaviour isn’t normal. So this RRT node gets a negative score and its follow-up actions don’t have to be calculated. Instead the solver takes one of the remaining RRT nodes and tries out what happens if it performs a random action there. So it is similar to computer chess, where a certain node in the game tree can be ignored together with all following nodes.

4.8 Sensor-Motor-Primitives

Reinforcement learning and “motor learning” have been described in many papers. Unfortunately most papers move forward very fast and have no time for longer explanations. The authors try to invent advanced motor-learning strategies which are interesting from a technical point of view, but they forget that some newbies might be interested in the general concept. In the following section I want to explain the idea as an introduction without too much complexity.

At first it is important to know that Learning from Demonstration can be done without neural networks and without Q-learning. It is only a scoring strategy for faster RRT sampling. But let us start at the beginning, with an RRT solver. RRT can bring a system into a goal state. For example, the robot controls a balancing board which holds a marble, and the aim is to bring the marble into the hole. What RRT does is test out different plans. In each time period random actions are possible, and after executing them the tilt of the board changes and with it the position of the marble.

If our computer were infinitely fast, we would now have found the solution. We would simply try out all 100 billion possibilities, and one of them brings the marble into the hole. But an infinitely fast computer is not available, so we need some performance tricks. Sensor-motor learning is one of them. The idea is that at first a human expert demonstrates the task, and in the background all movements are tracked. We store the sensor information, which is equal to the system state, and we store the movements, which are equal to the actions.

Now we want to replay the recording. The easiest trial is to place the marble on exactly the same start position and bring the board into exactly the same orientation. Then we execute our logfile and the marble runs into the hole. But what happens if the start position of the marble is slightly different? Executing the same actions would no longer solve the puzzle, but that is not the plan anyway. Instead the RRT solver mentioned above helps us. We sample the action space, but this time according to the expert demonstration. That means all actions which are similar to the actions from the demonstration get a good score, and all sensor data in which the marble has the same or a similar position as in the demonstration get a high priority too.

Our RRT solver now consists of two modules: first, a random sampling technique, and second, a scoring module which gives every RRT node a score between 0 and 1. 1 means perfect, 0 means that this node is very different from the demonstration. The RRT solver first samples the RRT nodes with the maximum score. If the marble is at the same place as or near the demonstrated start position, this will bring the marble to the goal. If not, the RRT sampler tries out RRT nodes which have a smaller similarity, and so forth.
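The scoring module can be sketched as a similarity function plus a sort. The exponential mapping from distance to score and the `scale` parameter are assumptions I chose for the illustration, and `Candidate`, `similarity` and `rank_candidates` are hypothetical names. Note that with this mapping the score only approaches 0 for very different nodes and never reaches it exactly.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Candidate {
    double marble_x;  // marble position after the candidate action
    double marble_y;
    double score;     // filled in by rank_candidates
};

// Maps the distance to the demonstrated marble position into (0, 1]:
// 1 means identical, values near 0 mean very different.
double similarity(double x, double y, double demo_x, double demo_y,
                  double scale = 1.0) {
    const double dist = std::hypot(x - demo_x, y - demo_y);
    return std::exp(-dist / scale);
}

// Scores all candidates and sorts them so the most similar one comes first.
void rank_candidates(std::vector<Candidate>& nodes,
                     double demo_x, double demo_y) {
    for (Candidate& n : nodes) {
        n.score = similarity(n.marble_x, n.marble_y, demo_x, demo_y);
    }
    std::sort(nodes.begin(), nodes.end(),
              [](const Candidate& a, const Candidate& b) {
                  return a.score > b.score;
              });
}
```

The sampler then walks through the sorted list from the front, which is the “maximum score first” behaviour described above.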

What happens if the repetition is completely different from the demonstration? Then nearly all RRT nodes will get a score of 0. That means it is unclear which path is correct, so the RRT solver must test them all, which can take a long time. But if the start position is equal or nearly equal, the solver will find the solution much more quickly. It has clear information about the direction in which the RRT path must be sampled and which actions might be useful.

So here ends my short explanation of Learning from Demonstration. It is possible to extend the idea with many demonstrations instead of one and to parametrize them with the aim of speeding up the RRT solver further. And yes, it is indeed possible to also use deep learning, which helps to reduce the workload. But even the vanilla LfD algorithm described above, which uses only an RRT solver, works reasonably well. The idea is to bring some kind of robustness into the game.


[1] Atkeson, Christopher G and Schaal, Stefan, “Robot learning from demonstration”, in ICML vol. 97 (1997), pp. 12–20.

[2] Hsu, Yih Yuan and Yu, Cheng Ching, “A self-learning fault diagnosis system based on reinforcement learning”, Industrial & engineering chemistry research 31, 8 (1992), pp. 1937–1946.

[3] Kaplan, Russell and Sauer, Christopher and Sosa, Alexander, “Beating Atari with Natural Language Guided Reinforcement Learning”, arXiv preprint arXiv:1704.05539 (2017).

[4] Lin, C-T and Lee, C. S. George, “Neural-network-based fuzzy logic control and decision system”, IEEE Transactions on computers 40, 12 (1991), pp. 1320–1336.

[5] Mus, DA, “Generating Robotic Manipulation Trajectories with Neural Networks”, (2017).

[6] Reynolds, Craig W, “Steering behaviors for autonomous characters”, in Game developers conference vol. 1999 (1999), pp. 763–782.

[7] Sung, Jaeyong and Jin, Seok Hyun and Lenz, Ian and Saxena, Ashutosh, “Robobarista: Learning to Manipulate Novel Objects via Deep Multimodal Embedding”, arXiv preprint arXiv:1601.02705 (2016).

[8] Zaidi, Abdallah and Rokbani, Nizar and Alimi, Adel M, “A hierarchical fuzzy controller for a biped robot”, in Individual and Collective Behaviors in Robotics (ICBR), 2013 International Conference on (2013), pp. 126–129.

Tools for robotics programming

Recently, the number of postings dedicated to robotics has been small. The reason was the transition from German to English and, as a result, the lower overall output. It is still the case that writing in English is a bit slower than writing the same text in German. So I have an excuse why the main topic of this blog, artificial intelligence, is currently in the background.

Today I want to make up for this with an introduction to general techniques for programming a robot. I’m not sure if the following tools are widely known, but repetition is always good. I want to start with a survey of programming languages. To keep it short: C++ is the best programming language. The syntax is easy to read, the execution speed is extremely high, the language can be used for low-level and high-level tasks, and last but not least the object-oriented style is fully supported. I admit that learning it isn’t easy for beginners. The tutorial on is a bit longer and ranges from bit manipulation up to STL templates, everything the C++ expert needs. For most beginners this tutorial is perhaps a nightmare, and I understand everybody who prefers Python over C++. On the other hand it is possible to use only a subset of the language which is similar to normal Python code, so that in reality even newbies will see little difference between both languages.

Here is a short example of a minimal C++ program which looks friendly:

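#include <string>
#include <vector>
#include <SFML/Graphics.hpp>  // written against the SFML 2.x API

class GUI {
public:
  void run() {
    window.create(sf::VideoMode(800, 600), "SFML");
    while (window.isOpen()) {
      sf::Event event;
      while (window.pollEvent(event)) {  // handle pending window events
        if (event.type == sf::Event::Closed) window.close();
      }
      out();
    }
  }
private:
  void out() {
    window.clear(sf::Color::White);  // clear
    guimessage.assign(1, "selecttool " + selecttool);  // current status line
    window.display();  // show the finished frame
  }
  sf::RenderWindow window;
  std::vector<std::string> guimessage;
  std::string selecttool = "left";
};

int main() {
  GUI mygui;
  mygui.run();
  return 0;
}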

I have not tested the code for compilation and perhaps it will produce errors, but at first glance it is not more complicated than Java, Python or C#. At the bottom is the main function, and above it is a GUI class which consists of some methods. Thanks to SFML most routines for drawing windows are already implemented, so it is possible to describe C++ as an easy-to-learn programming language. It is not more complicated to build the first application from scratch than with any other language.

What makes programming hard is not the syntax of a concrete language but rather the right usage of external libraries. For example, if not SFML but a 3D framework is needed, plus a physics engine and some kind of animation routines, this results in more complicated software. But this is also true for non-C++ languages. In conclusion, my advice for novices is to switch to C++ as fast as possible, because there is no language out there which is more powerful.

I have read some online discussions about what the best programming language is. Most arguments, for example in “Perl vs. Python”, are right, and it is fun to follow such discussions. One detail in such comparisons is remarkable. Until now, I have never read a comparison between C++ and another language in which C++ came out at a disadvantage. Sometimes the discussion is held with “C++ vs. C#” in mind, but the arguments for C# are weak. Even hardcore C# users in the context of Unity3D are not so bold as to claim that C# is superior. So it seems that, until now, C++ is the queen of all languages and every other language accepts this. I would say it is 99.9% certain that in the next ten years nobody will start a thread in which he seriously doubts the dominance of the brainchild of Bjarne Stroustrup.

Speaking of the most dominant alpha tool: another piece of software which is known as nearly perfect is the version control system “git”. Even programmers who use something different are secretly fans of it. And it is not only used in Linux environments; under Mac OS and Windows, too, git is the most powerful software out there. Explaining the benefits is not as easy as it looks. I would describe the software as a backup tool where every snapshot gets a comment in natural language. Especially the multi-user feature of git makes it indispensable for every serious programmer.

After this short introduction to programming software in general, now comes the part which has to do with robotics. What is robotics? The best answer to this question was given by Brosl Hasslacher and Mark W Tilden. Their essay “Living machines” from 1995 is the definitive introduction to the subject. What they left out is a detailed description of the nv-neurons which control the BEAM robots. An nv-neuron looks similar to a positronic brain in Star Trek; it is some kind of science fiction which is not clearly defined in science. Some people say a nervous network is a neural network like the LSTM network, which was invented by Sepp Hochreiter and Jürgen Schmidhuber; others say that it is a robot control system like ROS from Willow Garage. Perhaps the answer is somewhere in between?

One thing is sure: with a neural network plus a robot control system plus a robot development environment it is possible to master robotics. That starts with walking robots, continues with flying robots, working robots and household robots, and ends with social robots which look very cute. The discipline overall is fascinating because it is so broad. Realizing a robot consists of endless subtasks which span all academic topics. Mechanics, electronics, programming, language, humanities, algorithms, human-brain interfaces and even biology are all necessary. So robotics and artificial intelligence form a meta-science which consists of everything. In the 1950s this claim went under the term cybernetics, which was a synonym for everything and nothing at the same time. Today’s robotics has the same goal in mind, but this time it works.

At the end I want to reveal what dedicated newbies can do who have no experience with robotics or programming but want to create something. A good starting point is to program an aimbot with the AutoIt language. How this can be done is explained in YouTube clips. The interesting aspect is that aimbots are normally not recognized as real robots, but they are. I would call them highly developed examples of artificial intelligence, because it is easy to extend an aimbot into an advanced robot.


In computing, and especially in artificial intelligence, the dominant form of feedback is negative. That means that what a programmer strives for is a bug or an error. That is the only way to learn. Writing a program normally means producing a compiler error. And if the compiler says the program is working, then you need at least one problem with a feature, so that the programmer can write a bug report. Or to describe the dilemma colloquially: nobody at Stack Overflow wants to read about a working project; what the guys are interested in is a robot which doesn’t walk, a for-loop which doesn’t work or a problem which is unsolved.

Recent progress in robotics

A search query at Google Scholar for the well-known “LSTM neural network” shows that in the last 1-2 years the number of papers has exploded. More than 10k papers were published, and perhaps even more, because some of them are behind paywalls. But LSTM isn’t the only hot topic in AI; another interesting subject is “language grounding”. Both topics combined realize nothing less than the first working Artificial Intelligence. This kind of software is capable of controlling a robot.

But why is the community so fascinated by LSTM and language grounding? First, the LSTM network is the most developed neural network to date. It is more powerful than any other neural network. LSTMs are not as exact as manually programmed code, and some problems like finding prime numbers are difficult to formulate with LSTMs, but for many daily life problems like speech recognition, image recognition and event parsing LSTM is good enough. LSTM is not the same as a neural Turing machine, so it is not a wonder weapon for solving all computer science problems, but it is possible to do remarkable things with it.

The main reason why LSTM networks are often used together with language grounding is that natural language makes it possible to divide complex problems into smaller ones. I want to give an example: if in a research project the robot should be trained with a neural network to grasp the ball, move around the obstacle and put the ball into the basket, the project will perhaps fail, because it takes too many training steps and the problem space is too big for finding the right LSTM parameters. But with language grounding it is possible to solve each task separately. The network must only learn how to grasp, how to move and how to ungrasp, and then the modules can be combined. Sometimes this concept is called multi-modal learning.

Another side effect is that the system, even if it was trained with neural networks, remains under the control of the human operator. Without any natural language commands the network does nothing. Only if the operator types in “grasp” is the neural network activated. So the system is not a really autonomous one which must be stopped with the red emergency button; instead it can communicate with the operator via the language which the LSTM network has learned. That makes it easier to find problems. And if one subtask is too complex to be mastered with an LSTM network, that part can be programmed manually in normal C++.
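The control flow described here, in which a typed command activates exactly one module, can be sketched as a small dispatcher. All names (`SkillDispatcher`, `add`, `execute`) are hypothetical, and the skills are plain placeholder functions; in a real system each one would wrap a separately trained network or a hand-written C++ routine.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Maps operator commands such as "grasp" to skill modules.
class SkillDispatcher {
public:
    using Skill = std::function<std::string()>;

    // Registers a skill under a command word.
    void add(const std::string& command, Skill skill) {
        skills_[command] = std::move(skill);
    }

    // Runs the skill that belongs to the command and returns its report.
    // Without a known command nothing is activated and "idle" is returned.
    std::string execute(const std::string& command) const {
        const auto it = skills_.find(command);
        if (it == skills_.end()) {
            return "idle";
        }
        return it->second();
    }

private:
    std::map<std::string, Skill> skills_;
};
```

Because every skill sits behind the same interface, a module that is too hard to learn can be swapped for manual code without touching the rest.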

“Beating Atari with Natural Language Guided Reinforcement Learning”, 2017

The last paper (Beating Atari) describes a project which is capable of solving the game “MONTEZUMA’S REVENGE”, which was not solvable by AI in former deep learning projects. What the researchers have done is combine a neural network with language grounding, and voilà, they get a well-trainable and highly intelligent bot.