An Analysis of Generative AI Agents in Automatic Task Completion in Interactive Web Environments

Sarah Li; Ziyu Yao

Authors

Sarah Li, Department of Computer Science, George Mason University, Fairfax, VA
Ziyu Yao, Department of Computer Science, George Mason University, Fairfax, VA

Abstract

Autonomous AI agents are software programs that can independently complete multi-step tasks given a prompt (e.g., “Tell me the full address of the nearest airport to George Mason University”). In the near future, they are expected to handle mundane daily tasks that do not require human input. Autonomous agents have become more sophisticated in recent years owing to advances in machine learning and natural language processing. However, these agents are still susceptible to mistakes in comprehension, reasoning, and action execution. WebArena is a standalone web environment that can host autonomous agents. In this project, we analyzed WebArena and the agent it provides, following the procedural guidelines and task prompts released by the WebArena team. We ran the agent on 100 unique prompts covering domains from shopping and store management to mapping and navigation, then categorized the resulting errors and identified which occurred most frequently. The agent completed its task successfully in 23% of runs. Many runs suffered from an incorrect, missing, or unnecessary action. Common causes of these errors included a lack of world knowledge or human-like visual understanding, ignoring part of the prompt, and hallucination. Future training should ensure that autonomous agents have sufficient world knowledge and experience to navigate the web successfully. Developers could also weight parts of a prompt so that agents focus on the most relevant criteria, and add memory so that agents can learn from past actions. With more time and research, autonomous agents may soon improve the lives of many.

assip