Introduction

The content of this article is based on the course by Prof. Prithviraj at UC San Diego. This whitepaper about AI Agents by Google is also a good read.

What is an agent?

An agent is an entity with agency. A Minecraft agent? Agents find applications in the workplace in the form of workflow automations, household or commercial robotics, software development, and personal assistants. Generally, the theme is that agents take actions.

Historically, the use of agents started in the early 1900s in the field of control theory. They were used for dynamic control of flight systems, and in the 1940s this expanded to flight guidance and related problems. By the 1950s, the concepts of MDPs and dynamic programming were being applied to many use cases. Surprisingly, one of the first natural language chatbots, ELIZA, was created as a psychotherapist simulator in the 1960s! Finally, reinforcement learning became a field of study for sequential decision making in the 1990s.

Sequential Decision Making

These tasks are different from other ML problems like classification: errors compound over a trajectory. A model that has an accuracy of 99% at each step has a cumulative accuracy of only ~30% after 120 steps (\(0.99^{120} \approx 0.3\))!
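A quick back-of-the-envelope check of that compounding effect, in Python (the step counts are just illustrative):

```python
# Per-step accuracy compounds multiplicatively over a trajectory:
# a 1% error rate per step leaves only ~30% of 120-step episodes error-free.
per_step_accuracy = 0.99
for steps in (10, 60, 120):
    print(steps, round(per_step_accuracy ** steps, 3))
# 10 0.904
# 60 0.547
# 120 0.299
```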

These problems are formalized as a Markov Decision Process (MDP): an agent performs actions in an environment and in turn receives rewards as feedback. The configurations of the environment are called states, and the whole process can be seen as sequential decision making.

The core components of an agent, often agreed on, are

  • Grounding - Language is anchored to concepts in the world. Language can be grounded to different forms of information systems - images, actions and cultural norms.
  • Agency (ability to act) - At each state, an agent needs to have multiple choices to act. If an agent has to select what tools to use but there's always only one tool, is that agency? The action space has to be well-defined to look for agency. Even with a single tool call, different parameters for that call can arguably be considered different actions. An action can be defined as something the agent does that changes the environment. The distinction between an agent and its environment is not very clear in many cases; still, our approximations mostly serve us well.
  • Planning (long horizon)
  • Memory -
    • Short-term - What is the relevant information around the agent that it needs to use to act now?
    • Long-term - What information has the agent already gathered that it can retrieve to take an action?
  • Learning (from feedback) - Doesn't necessarily always mean backpropagation.
  • Additional -
    • Embodiment (physically acting in the real world). The embodiment hypothesis: embodiment is necessary for AGI.
    • Communication - Can the agent communicate its intentions to other agents? A very necessary prerequisite for multi-agent scenarios.
    • World Modeling - Given the state of the world and an action, predict the next state of the world. Are Veo/Sora world models? They are attempts at world models, but they have no verifiability. Genie is another such attempt. So is Genesis - which is much better, if it works.
    • Multi-modality - The clean text on the internet is only a few terabytes, and our models have consumed it (it took us two decades though). YouTube receives 4.3 petabytes of new video a day. CERN generates 1 petabyte a day (modalities outside vision and language). Some people believe this form of scaling is the way to go. There are more distinctions -
Model                          | AI System               | Agent
GPT-4                          | ChatGPT                 | ChatGPT computer use
Forward passes of a neural net | Mixing models together  | Has agency

It is important to remember that not every use case needs an agent; most use cases just need models or AI systems. Occam's razor.

Simulated Environments and Reality

Why do we need simulations? Most tasks have many ways of completing them. There is no notion of a globally optimal solution ahead of time, but one is usually known once the task is complete.

The agent needs to explore to find many solutions, compare them, and see which is the most efficient. However, exploration in the real world is expensive - wear and tear on robots, excessive compute, danger to humans, etc.

Simulations offer an easy solution to these problems: assign a set of rules, and let a world emerge. One of the early examples of this is Conway's Game of Life, which demonstrated that complicated behaviors can emerge from just a few rules.
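To make "a few rules, emergent behavior" concrete, here is a minimal sketch of a single Game of Life update step in plain Python (the glider pattern at the end is just one well-known example):

```python
# One step of Conway's Game of Life on a set of live cells.
# Rules: a live cell survives with 2 or 3 live neighbours;
# a dead cell becomes alive with exactly 3 live neighbours.
from itertools import product

def step(live_cells: set[tuple[int, int]]) -> set[tuple[int, int]]:
    neighbour_counts: dict[tuple[int, int], int] = {}
    for (x, y) in live_cells:
        for dx, dy in product((-1, 0, 1), repeat=2):
            if (dx, dy) != (0, 0):
                key = (x + dx, y + dy)
                neighbour_counts[key] = neighbour_counts.get(key, 0) + 1
    return {
        cell
        for cell, count in neighbour_counts.items()
        if count == 3 or (count == 2 and cell in live_cells)
    }

# A "glider" keeps moving across the grid forever from these five cells.
world = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
for _ in range(4):
    world = step(world)
```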

From an MDP perspective, a simulation contains \(\langle S, A, T \rangle\), where

  • \(S\) is the set of all states. A state consists of propositions that are true/false. Example: you are in a house, the door is open, the knife is in the drawer.
  • \(A\) is the set of all actions. Example: take the knife from the drawer, walk through the door.
  • \(T\) is the transition function - (you are in the house, you walk out of the door) -> you are outside the house.

A simulation need not have an explicit reward.
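A minimal sketch of such a reward-free simulation in Python, reusing the hypothetical house/knife propositions from the example above (all state and action names are illustrative):

```python
# Tiny deterministic simulation as <S, A, T>: states are sets of true
# propositions, actions are names, and T maps (state, action) -> next state.
# Note there is no reward anywhere -- a simulation does not need one.
from typing import Dict, FrozenSet, Tuple

State = FrozenSet[str]
Action = str

def make_state(*props: str) -> State:
    return frozenset(props)

T: Dict[Tuple[State, Action], State] = {
    (make_state("in_house", "door_open", "knife_in_drawer"), "walk through door"):
        make_state("outside_house", "door_open", "knife_in_drawer"),
    (make_state("in_house", "door_open", "knife_in_drawer"), "take knife from drawer"):
        make_state("in_house", "door_open", "holding_knife"),
}

state = make_state("in_house", "door_open", "knife_in_drawer")
state = T[(state, "walk through door")]
print(sorted(state))  # ['door_open', 'knife_in_drawer', 'outside_house']
```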

Sim2Real Transfer

How well an agent trained in simulation transfers to reality depends on how well the model extrapolates out of distribution. At the current stage of agents, the simulation is made as close to reality as possible to reduce the Sim2Real gap.

How do we measure closeness to reality? The tasks in the real world have different types of complexities -

  1. Cognitive complexity - Problems that require long chains of reasoning - puzzles, math problems or moral dilemmas
  2. Perceptive complexity - Requires high levels of vision and/or precise motor skills - bird watching, threading a needle, Where’s Waldo

Examples of simulations -

  1. Grid world - low cognitive, almost zero perceptive complexity. However, this idea can arbitrarily scale to test algorithms for their generalization potential in controllable settings.
  2. Atari - low perceptive, medium cognitive. Atari games became very popular in 2013, when DeepMind released their Deep Q-Network (DQN) paper that achieved human-level skill on these games.
  3. Zork, NetHack - low perceptive, high cognitive. These are worlds that you interact with purely through text. The worlds are actually so complex that no agent has been able to finish the challenge!
  4. CLEVR - medium perceptive, low cognitive. This simulation procedurally generates images with a certain set of objects and has reasoning questions for each image.
  5. AI2 THOR - medium perceptive, medium cognitive. Worlds with ego-centric views for robotics manipulation and navigation simulations
  6. AppWorld - medium perceptive, medium cognitive. A bunch of different apps that you would generally use in daily life. The agents can access the apps, and the simulation also has human simulators. This is the closest to reality of the simulations discussed so far!
  7. Minecraft - medium perceptive, high cognitive. A voxel based open-world game that lets players take actions similar to early-age humans.
  8. Mujoco - high perceptive, low cognitive. It is a free and open source physics engine to aid the development of robotics.
  9. Habitat - high perceptive, medium cognitive. A platform for research in embodied AI that contains indoor-world ego-centric views similar to AI2 THOR, but with much better graphics. They have recently added sound in the environment too!
  10. High perceptive, high cognitive - the real world, and whoever gets this simulation right wins the race to AGI. It requires people to sit down and enumerate all kinds of rules. Game engines like Unreal and Unity are incredibly complex, and are the closest we've gotten.

    Some researchers try to “learn” the simulations from real-world demonstrations.

In each of these simulators, think about the complexity and the reward sparsity of the environment. It is easier to build a simulator that gives a reward only at a goal state than one that gives a reward for each action (see the sketch after the list below). There are some open lines of research in this domain -

  1. Which dimensions of complexity transfer more easily? Curriculum learning.
  2. Can you train on lower complexity and switch to a higher complexity?
  3. Can we learn the world model - the holy grail?
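Coming back to reward sparsity, here is a small sketch contrasting a sparse and a dense reward for a hypothetical grid-world navigation task (the goal cell and the distance-based shaping are illustrative choices, not from any specific benchmark):

```python
# Sparse vs. dense rewards for reaching a goal cell on a grid.
# The sparse version only needs to check the goal condition;
# the dense version has to score every intermediate step.
GOAL = (4, 4)

def sparse_reward(state: tuple[int, int]) -> float:
    # Easy to specify: +1 at the goal, 0 everywhere else.
    return 1.0 if state == GOAL else 0.0

def dense_reward(state: tuple[int, int], next_state: tuple[int, int]) -> float:
    # Harder to specify: reward every step by how much closer it gets to the
    # goal (Manhattan distance), which requires extra domain knowledge.
    def dist(s: tuple[int, int]) -> int:
        return abs(s[0] - GOAL[0]) + abs(s[1] - GOAL[1])
    return float(dist(state) - dist(next_state))
```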

How to make simulations?

As we've seen, simulations can range from games to real-world replications with physics involved. Most simulations are not designed with AI in mind. However, with the current state of AI, this is an important factor to consider.

Classical environments like Zork/AI2 THOR/Mujoco have something known as PDDLs. Some simulations are built with AI, like AI Dungeon, which spins up worlds for role-play games.

Planning Domain Definition Language (PDDL)

A standard encoding for classical planning tasks. Many languages for creating simulations have similarities with PDDL.

A PDDL Task consists of the following

  • Objects - things in the world that interest us
  • Predicates - Properties of objects that we are interested in, can be true or false
  • Initial state - The state of the world that we start in
  • Goal specification - Things that we want to be true
  • Actions/Operators - Ways of changing the state of the world.

These are split across two files - a domain .pddl file and a problem .pddl file.
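As a rough sketch, the two files for the toy house/knife example from earlier might look like this (the domain, predicates, and action names are illustrative, not taken from a real benchmark):

```pddl
;; domain.pddl -- types, predicates, and operators
(define (domain house)
  (:requirements :strips :typing)
  (:types item location)
  (:predicates (at-agent ?l - location)
               (in-drawer ?i - item)
               (holding ?i - item)
               (door-open))
  (:action take-from-drawer
    :parameters (?i - item)
    :precondition (in-drawer ?i)
    :effect (and (holding ?i) (not (in-drawer ?i)))))

;; problem.pddl -- objects, initial state, and goal
(define (problem get-knife)
  (:domain house)
  (:objects knife - item kitchen - location)
  (:init (at-agent kitchen) (in-drawer knife) (door-open))
  (:goal (holding knife)))
```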

Classic symbolic planners read PDDL files and output possible plans. Check out Planning.wiki. In many cases these planners are used over reinforcement learning because RL lacks algorithmic guarantees.

There were other attempts