Inside Genie: Google DeepMind’s Super Model that can Generate Interactive Games from Text and Images

The model represents a major breakthrough in generative AI.

Jesus Rodriguez
Towards AI


Created Using DALL-E

I recently started an AI-focused educational newsletter that already has over 165,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:

The pace of research in generative AI is nothing short of remarkable. Even so, from time to time a paper comes along that genuinely challenges our imagination about how far the generative AI space can go. Last week, Google DeepMind published one such piece of work with the release of Genie, a model that can generate interactive game environments from text and images.

If we think video generation with models like Sora is impressive, imagine also inferring the interactive actions that drive those videos. Imagine a scenario where the vast array of videos available on the Internet could serve as a training ground for models to not only create new images and videos but also to forge entire interactive environments. This is the vision that Google DeepMind has turned into reality with Genie, a groundbreaking approach to generative AI. Genie can craft interactive environments from a mere text or image prompt, thanks to its training on over 200,000 hours of publicly available gaming videos from the Internet. What makes Genie stand out is its ability to be controlled on a frame-by-frame basis through a learned latent action space, even though it was trained without action or text annotations.

The Architecture

Genie’s architecture takes inspiration from the latest advancements in video generation technology, incorporating spatiotemporal (ST) transformers as a core building block across all of its components. The system begins by processing a series of video frames, breaking them down into discrete tokens with a dedicated video tokenizer, and then inferring the actions occurring between frames through a causal action model. These elements feed into a dynamics model that predicts the next frames in the sequence, enabling the generation of interactive experiences.
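
To make that data flow concrete, here is a minimal sketch in Python. Everything in it is a toy stand-in: the function names, tensor shapes, and random placeholders are assumptions chosen purely for illustration, not DeepMind's actual interfaces; they only mimic the shapes of the data moving between the three components described above.

import torch

# Toy stand-ins for the three components. The real models are large transformers;
# these placeholders only reproduce the shape of the data flowing between them.
def toy_tokenizer_encode(frames):
    return torch.randint(0, 1024, (frames.shape[0], frames.shape[1], 16))

def toy_action_infer(tokens):
    return torch.randint(0, 8, (tokens.shape[0], tokens.shape[1] - 1))

def toy_dynamics_predict(tokens, actions):
    return torch.randint(0, 1024, (tokens.shape[0], 16))

def toy_tokenizer_decode(tokens):
    return torch.rand(tokens.shape[0], 3, 64, 64)

def next_frame(frames):
    # 1. Discretize each frame into a grid of tokens.
    tokens = toy_tokenizer_encode(frames)               # (batch, time, tokens_per_frame)
    # 2. Infer the latent action taken between consecutive frames.
    actions = toy_action_infer(tokens)                  # (batch, time - 1)
    # 3. Predict the tokens of the next frame from past tokens and actions.
    predicted = toy_dynamics_predict(tokens, actions)   # (batch, tokens_per_frame)
    # 4. Decode the predicted tokens back into pixels.
    return toy_tokenizer_decode(predicted)              # (batch, 3, 64, 64)

frames = torch.rand(1, 4, 3, 64, 64)                    # one clip of four RGB frames
print(next_frame(frames).shape)                         # torch.Size([1, 3, 64, 64])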

To put things in context, the following video illustrates some of the games generated by Genie:

Source: https://www.tomsguide.com/ai/ai-image-video/are-we-close-to-the-holodeck-google-unveils-genie-an-ai-model-creating-playable-virtual-worlds-from-a-single-image

A significant aspect of Genie’s design is the integration of ST transformers, which allow for nuanced processing of video data while balancing model capacity against computational cost. Rather than attending over every token in every frame at once, these transformers factorize attention: spatial attention operates over the tokens within a single frame, while temporal attention lets each spatial position attend across frames, which keeps the cost from exploding as videos grow longer. This approach enables Genie to pay attention to the right details at the right times, facilitating the generation of coherent and contextually relevant video sequences.
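
As a rough illustration of that factorized attention pattern, the sketch below (in PyTorch) interleaves a spatial attention layer with a temporal one. The layer sizes, normalization placement, and absence of causal masking are simplifying assumptions; this is not DeepMind's exact block, only the general shape of the idea.

import torch
import torch.nn as nn

class STBlock(nn.Module):
    # Simplified spatiotemporal block: spatial attention, then temporal attention, then an MLP.
    def __init__(self, dim, heads):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, time, tokens_per_frame, dim)
        B, T, N, D = x.shape
        # Spatial attention: tokens within the same frame attend to each other.
        s = x.reshape(B * T, N, D)
        s = s + self.spatial_attn(self.norm1(s), self.norm1(s), self.norm1(s))[0]
        # Temporal attention: each spatial position attends to itself across time steps.
        t = s.reshape(B, T, N, D).permute(0, 2, 1, 3).reshape(B * N, T, D)
        t = t + self.temporal_attn(self.norm2(t), self.norm2(t), self.norm2(t))[0]
        # Position-wise feed-forward network.
        out = t + self.mlp(self.norm3(t))
        return out.reshape(B, N, T, D).permute(0, 2, 1, 3)

block = STBlock(dim=64, heads=4)
video = torch.randn(2, 8, 16, 64)      # 2 clips, 8 frames, 16 tokens per frame
print(block(video).shape)              # torch.Size([2, 8, 16, 64])

In a generative setting like Genie's, the temporal attention would also be causally masked so that a frame can only attend to the frames that came before it.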

Inside Genie

Genie’s system is composed of three main components: a latent action model, a video tokenizer, and a dynamics model. The process begins with the video tokenizer, which translates raw video frames into a structured format that the model can understand. Next, the latent action model interprets the actions occurring between frames. These components work in concert with the dynamics model to predict future frames based on past data and inferred actions. This two-phase training process, starting with the video tokenizer and followed by the joint training of the latent action model and dynamics model, ensures that Genie can generate interactive videos with precision.

Google DeepMind has developed a method to make video generation interactive and controllable, focusing on the concept of latent actions. The Latent Action Model (LAM) is a critical component of this approach, allowing the prediction of future video frames based on actions identified in preceding ones. Recognizing the challenge of obtaining action labels from internet videos, which are seldom annotated and expensive to label, Google DeepMind opted for an unsupervised learning strategy. This strategy enables the Latent Action Model to infer the actions occurring between video frames without direct supervision, representing them as a small set of discrete latent actions, while the Video Tokenizer converts the raw frames themselves into discrete tokens. The Dynamics Model then uses both to anticipate the next frame, supporting a two-phase training regimen that first trains the video tokenizer and then co-trains the Latent Action Model and the Dynamics Model.

Image Credit: Google DeepMind
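
The sketch below mimics that two-phase schedule with deliberately tiny linear models, purely to show the order of operations: train the tokenizer on reconstruction first, then freeze it and co-train the action and dynamics components. The losses, module shapes, and soft action distribution are illustrative assumptions, not the paper's actual objectives.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-ins: a linear "tokenizer" autoencoder, a latent action head, and a linear
# dynamics model. The real components are large ST transformers trained at scale.
enc = nn.Linear(3 * 16 * 16, 32)         # frame -> latent
dec = nn.Linear(32, 3 * 16 * 16)         # latent -> reconstructed frame
lam = nn.Linear(64, 8)                   # pair of frame latents -> scores over 8 latent actions
dyn = nn.Linear(32 + 8, 32)              # (current latent, action) -> next latent

frames = torch.rand(16, 2, 3 * 16 * 16)  # toy batch of (frame_t, frame_t+1) pairs, flattened

# Phase 1: train the video tokenizer alone on a reconstruction objective.
opt1 = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
for _ in range(100):
    z = enc(frames)
    loss = F.mse_loss(dec(z), frames)
    opt1.zero_grad(); loss.backward(); opt1.step()

# Phase 2: freeze the tokenizer, then co-train the latent action model and the dynamics model.
for p in enc.parameters():
    p.requires_grad_(False)
opt2 = torch.optim.Adam(list(lam.parameters()) + list(dyn.parameters()), lr=1e-3)
for _ in range(100):
    z = enc(frames)                                      # frozen frame latents
    action_probs = F.softmax(lam(z.flatten(1)), dim=-1)  # soft stand-in for discrete latent actions
    pred_next = dyn(torch.cat([z[:, 0], action_probs], dim=-1))
    loss = F.mse_loss(pred_next, z[:, 1])                # predict the next frame's latent
    opt2.zero_grad(); loss.backward(); opt2.step()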

The Latent Action Model is ingeniously designed to control video generation. It works by analyzing the sequence of frames, identifying latent actions without the need for explicit labeling. This process involves two stages: encoding, where the model assesses both past and upcoming frames to determine actions, and decoding, where it predicts the subsequent frame based on these actions. By employing a VQ-VAE-based approach, the model limits action possibilities to a concise set, ensuring that each action captures significant changes within the video sequence. These actions, while pivotal during training, are replaced by user inputs during actual use, allowing for customized interaction.

Image Credit: Google DeepMind
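
To give a feel for the VQ-VAE idea at the heart of the latent action model, here is a toy quantizer in PyTorch: a continuous action embedding is snapped to the nearest entry in a small codebook (eight codes here, as an illustrative choice), with a straight-through estimator so gradients still reach the encoder. The class name, dimensions, and the encoder output it consumes are hypothetical.

import torch
import torch.nn as nn

class LatentActionQuantizer(nn.Module):
    # Toy VQ step: snap a continuous action embedding to the nearest of a few codes.
    def __init__(self, num_actions=8, dim=32):
        super().__init__()
        self.codebook = nn.Embedding(num_actions, dim)   # e.g. 8 discrete latent actions

    def forward(self, action_embedding):
        # action_embedding: (batch, dim), produced by an encoder that sees frame t and frame t+1.
        distances = torch.cdist(action_embedding, self.codebook.weight)   # (batch, num_actions)
        indices = distances.argmin(dim=-1)                                # discrete action ids
        quantized = self.codebook(indices)
        # Straight-through estimator: gradients flow to the encoder as if quantization were identity.
        quantized = action_embedding + (quantized - action_embedding).detach()
        return quantized, indices

q = LatentActionQuantizer()
emb = torch.randn(4, 32)              # hypothetical encoder output for 4 frame transitions
actions, ids = q(emb)
print(ids)                            # e.g. tensor([3, 0, 7, 3]): one of 8 latent actions each

Keeping the codebook small is what pushes each latent action to capture a meaningful, reusable change between frames rather than arbitrary pixel-level differences.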

The training of the Video Tokenizer follows a similar encoder-decoder mechanism, efficiently translating video data into a manipulable latent space. This component underpins the Dynamics Model, which takes the tokenized history of the environment together with the inferred actions and predicts the next frame.

Image Credit: Google DeepMind
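
The dynamics model can be pictured as a transformer over frame tokens that is additionally conditioned on the chosen latent action. In the toy below, the vocabulary size, the additive action conditioning, and the layer counts are all assumptions for illustration; it simply outputs logits over the token vocabulary for every position of the next frame.

import torch
import torch.nn as nn

class ToyDynamicsModel(nn.Module):
    # Sketch: predict next-frame token logits from the current frame's tokens plus a latent action.
    def __init__(self, vocab_size=1024, num_actions=8, dim=64, tokens_per_frame=16):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        self.action_emb = nn.Embedding(num_actions, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, frame_tokens, action):
        # frame_tokens: (batch, tokens_per_frame) discrete ids; action: (batch,) latent action ids.
        x = self.token_emb(frame_tokens) + self.action_emb(action).unsqueeze(1)
        x = self.backbone(x)
        return self.head(x)            # (batch, tokens_per_frame, vocab_size) logits for the next frame

model = ToyDynamicsModel()
tokens = torch.randint(0, 1024, (2, 16))
action = torch.tensor([3, 5])
print(model(tokens, action).shape)     # torch.Size([2, 16, 1024])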

In practice, users initiate the interactive video generation by selecting a starting frame and an action from a predefined set of options. This input is processed by the Dynamics Model to produce the next sequence of frames, with the user continuously guiding the unfolding narrative through their selections. This interactive loop empowers users to shape their video experiences dynamically.

Image Credit: Google DeepMind
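
In code, that interactive loop is very simple. The sketch below uses random placeholders for the trained dynamics model and the tokenizer's decoder, and the mapping of action ids to moves is purely hypothetical; the point is only the structure: the user picks an action, the model advances the token state, and the decoder renders the new frame.

import torch

# Toy stand-ins so the loop runs end to end; in Genie these are the trained
# dynamics model and the video tokenizer's decoder.
def toy_predict_next_tokens(tokens, action_id):
    return torch.randint(0, 1024, tokens.shape)       # placeholder "prediction"

def toy_decode(tokens):
    return torch.rand(3, 64, 64)                      # placeholder frame in pixels

tokens = torch.randint(0, 1024, (16,))                # tokens of the user-chosen starting frame
user_actions = [2, 2, 5, 1]                           # e.g. "right, right, jump, left", chosen each step

frames = []
for action_id in user_actions:
    tokens = toy_predict_next_tokens(tokens, action_id)   # dynamics model step
    frames.append(toy_decode(tokens))                     # render the new frame for the user

print(len(frames), frames[0].shape)                   # 4 torch.Size([3, 64, 64])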

Training Genie

The training data for this system is meticulously curated from a vast collection of internet videos, specifically filtered for content related to 2D platformer games, speedruns, and playthroughs while excluding unrelated material like movies or unboxing videos. Videos are segmented into manageable clips, which are then evaluated for quality, with a trained classifier ensuring that only high-quality content feeds into the training process. This curation distilled the dataset to approximately 30,000 hours of video spread across 6.8 million short clips, significantly improving the model’s performance and efficiency.

Image Credit: Google DeepMind
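
The curation step itself is conceptually a simple filter. The toy below assumes a keyword match on titles plus a score from a quality classifier; the keywords, the 0.8 threshold, and the score function are invented for illustration and are not the paper's actual filtering criteria.

# Toy version of the curation idea: keep only clips whose metadata matches the target
# keywords and whose quality score (from a trained classifier) clears a threshold.
def quality_score(clip):
    return clip["score"]              # stand-in for a learned quality classifier

KEYWORDS = ("platformer", "speedrun", "playthrough")

def curate(clips, threshold=0.8):
    kept = []
    for clip in clips:
        title = clip["title"].lower()
        if any(k in title for k in KEYWORDS) and quality_score(clip) >= threshold:
            kept.append(clip)
    return kept

clips = [
    {"title": "2D platformer speedrun", "score": 0.93},
    {"title": "Unboxing a new console", "score": 0.99},
    {"title": "Platformer playthrough, part 3", "score": 0.41},
]
print(curate(clips))                  # only the first clip survives both filters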

Through this innovative approach, Google DeepMind sets a new benchmark in generative AI, enabling the creation of interactive video experiences tailored by user inputs, all grounded in a vast and meticulously prepared dataset.

In essence, Google DeepMind’s Genie represents a significant leap forward in the field of generative AI, offering an innovative way to create interactive environments from simple prompts. By leveraging a massive dataset and employing sophisticated architectural components, Genie opens up new possibilities for interactive experience generation, setting a new standard for what is possible in the realm of AI-driven content creation.
