Neural Turing Machine

Neural networks have a long history. Their most visible recent success was in image recognition challenges, which were won with so-called convolutional neural networks (CNNs). Besides CNNs there are many other architectures, for example RNNs, LSTMs, the Neural Turing Machine (NTM) and the “Differentiable Neural Computer”. The last one is among the most advanced memory-augmented networks published so far. In principle it can carry out the same kinds of tasks as an ordinary Turing machine, but it is trainable: the program is learned from examples instead of being written by hand. In that respect it is similar to genetic programming.

So what can we do with a Neural Turing Machine, which has memory? Not that much. The arXiv paper gives only simple examples, such as copying bits from place A to place B. That is not very impressive. A Neural Turing Machine is capable of inferring a single part of a broader algorithm. The question is how to make that usable for solving complex tasks. One possibility is to combine an NTM with “language grounding”. Grounding means connecting a problem-solving technique with natural language. For example, task 1 is called “grasp”, task 2 is called “ungrasp” and task 3 is called “move”. Every task is represented by its own NTM and is learned separately. Solving a complex task then means putting the verbs together into a sentence like “grasp, move, ungrasp”.
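The composition of verbs into a sentence can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the three skills are plain hand-written functions standing in for separately trained NTMs, and the names `SKILLS`, `run_sentence` and the dictionary-based world state are hypothetical, not taken from any paper.

```python
# Hypothetical sketch: each verb names one skill. In the setting described
# above every skill would be a separately trained NTM; plain functions
# stand in for them here so the example stays runnable.

def grasp(state):
    # Close the gripper; the object is now held.
    return {**state, "holding": True}

def move(state):
    # Move the gripper (and anything it holds) to the target position.
    return {**state, "position": state["target"]}

def ungrasp(state):
    # Open the gripper; the object is released.
    return {**state, "holding": False}

# The "grammar": a lookup table from words to skills.
SKILLS = {"grasp": grasp, "move": move, "ungrasp": ungrasp}

def run_sentence(sentence, state):
    """Execute a comma-separated sentence of verbs, one skill after another."""
    for verb in (word.strip() for word in sentence.split(",")):
        state = SKILLS[verb](state)
    return state

start = {"position": 0, "target": 5, "holding": False}
final = run_sentence("grasp, move, ungrasp", start)
```

The high-level layer here is nothing more than the lookup table and the loop; all of the difficulty sits inside the individual skills.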

The main problem with all neural networks is the high computational cost. In theory they can learn long programs; in practice the training would take far too long, even on GPU hardware from Nvidia. Only if the problem is small and compact can the network be trained in a short time. For example, a problem like “add 2 to the input” can be learned by a neural network easily. After a few examples such as “0+2=2, 10+2=12, -5+2=-3” the network has found the algorithm on its own. The learned function represents the word “add”. The problem is that a normal computer program consists of many such subtasks.
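How little is needed for the “add 2” problem can be shown with a minimal sketch. It deliberately swaps in something simpler than a neural network: a single linear unit y = w * x + b, fitted by closed-form least squares instead of gradient descent. The names `fit_line` and `add` are made up for this illustration.

```python
# Assumption: one linear unit y = w * x + b stands in for the network,
# and ordinary least squares stands in for the training procedure.
examples = [(0.0, 2.0), (10.0, 12.0), (-5.0, -3.0)]  # the pairs from the text

def fit_line(pairs):
    """Ordinary least squares for one input and one output."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b

w, b = fit_line(examples)

def add(x):
    # The fitted function that the word "add" refers to.
    return w * x + b
```

With these three examples the fit recovers a slope close to 1 and an offset close to 2, so `add` also works on inputs that were never shown. A real program needs many such functions at once, which is where the cost explodes.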

A good testbed for implementing Neural Turing Machines is multimodal games. Such games have to be solved with reinforcement learning in combination with language grounding. Instead of only going to a certain place (which can be learned easily by all kinds of neural networks), the agent has to do abstract tasks like climbing a ladder, opening a door and so forth. The result is a hierarchical architecture that consists of a low-level and a high-level layer. The low-level tasks are solved with neural networks, the high-level tasks with language.


In the recently published paper “Beating Atari with Natural Language Guided Reinforcement Learning”, a sophisticated neural network is presented that can solve the game “Montezuma’s Revenge”. From a technical point of view, the authors (all from Stanford University) call this principle “reinforcement learning”. But the main reason why the game was solved lies in the grammar. The appendix of the paper lists the 18 different commands, for example “go to the right side of the room”. These commands work as options, which is equivalent to a multimodal neural network. Did the neural network find the commands on its own? No, the commands were coded by hand. So the solver is not really an example of reinforcement learning but a hand-programmed hierarchical macro. A more accurate title for the paper would be “Scripting AI for language grounding in Atari Games”.

According to the paper, the subactions were learned by the network. But it would also be possible to code them manually. For example, the task “climb down the ladder” can be expressed as a short piece of code that navigates the agent down the ladder. Even without any kind of reinforcement learning it is possible to solve the Atari game.
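A hand-coded version of “climb down the ladder” could look roughly like the following sketch. Everything in it is an assumption made for illustration: the game is reduced to a grid in which y grows downwards, the agent and ladder coordinates are given directly, and the action names `LEFT`, `RIGHT` and `DOWN` are placeholders rather than the interface of a real Atari emulator.

```python
# Hypothetical grid world: y grows downwards, the ladder occupies one column.

def climb_down_ladder(agent, ladder):
    """Return the primitive actions and the final position of the agent."""
    actions = []
    x, y = agent
    # Step 1: walk horizontally until the agent stands on the ladder column.
    while x != ladder["x"]:
        step = 1 if x < ladder["x"] else -1
        x += step
        actions.append("RIGHT" if step == 1 else "LEFT")
    # Step 2: press DOWN until the bottom rung is reached.
    while y < ladder["bottom"]:
        y += 1
        actions.append("DOWN")
    return actions, (x, y)

actions, position = climb_down_ladder(agent=(2, 0), ladder={"x": 4, "bottom": 3})
```

The macro contains no learning at all. The knowledge that a ladder is something to walk to and then descend was put in by the programmer.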

The question remains of how to find the commands and their implementations for a given problem. If the game is not “Montezuma’s Revenge” but Tetris, the grammar looks different. Finding such a grammar without human programmers seems impossible. To state the problem in detail: suppose we remove one command from the task list. Instead of 18 commands, the agent has only 17, and the task is to find the removed command. It seems difficult or even impossible to do so. For example, if we remove both climb commands, how should the neural network know that “climb” is the right word for using the ladder? The only place where such information can be found is a walkthrough tutorial, which is normally written in English. So before building a reinforcement learning solver, a natural language parser is necessary, and in addition a search engine for finding such walkthrough tutorials.

Another possibility is plan traces, which are used in “Learning to Interpret Natural Language Navigation Instructions from Observations”. On page 2 the authors give some details about how the learning procedure works. As input the system gets:

– natural language instructions

– observed action sequence

– world state

From a technical point of view it is possible to formalize the learning steps into an authoring system. Because the observed action sequence and the world state are given by the game, only the language instruction is missing. So if the human operator gets a microphone and says aloud what he is doing at the moment, a huge corpus can be generated in a short period of time. This corpus is fed into the neural network …
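How such a corpus could be assembled is sketched below. The sketch rests on several assumptions that are not in the cited paper: the spoken comments are already transcribed to text with a timestamp, the game writes timestamped logs of actions and world states, and the state log starts no later than the first utterance. The names `Sample` and `build_corpus` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Sample:
    instruction: str              # what the operator said into the microphone
    actions: List[str]            # observed action sequence logged by the game
    world_state: Tuple[int, int]  # agent position when the utterance started

def build_corpus(utterances, action_log, state_log):
    """Pair each utterance with the actions recorded until the next utterance."""
    corpus = []
    for i, (start, text) in enumerate(utterances):
        # The segment ends where the next utterance begins.
        end = utterances[i + 1][0] if i + 1 < len(utterances) else float("inf")
        actions = [a for t, a in action_log if start <= t < end]
        # World state: the last logged state at or before the utterance.
        earlier = [s for t, s in state_log if t <= start]
        corpus.append(Sample(text, actions, earlier[-1]))
    return corpus

# Toy logs; in practice they would come from the game and a speech recognizer.
utterances = [(0, "go to the ladder"), (3, "climb down the ladder")]
action_log = [(0, "RIGHT"), (1, "RIGHT"), (3, "DOWN"), (4, "DOWN")]
state_log = [(0, (2, 0)), (1, (3, 0)), (2, (4, 0)), (3, (4, 0)), (4, (4, 1))]

corpus = build_corpus(utterances, action_log, state_log)
```

Each `Sample` holds exactly the three inputs listed above, so the operator only has to play and talk while the alignment happens automatically.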