AesopAgent: Agent-driven Evolutionary System on Story-to-Video Production

Jiuniu Wang*, Zehua Du*, Yuyuan Zhao*, Bo Yuan, Kexiang Wang, Jian Liang,
Yaxi Zhao, Yihen Lu, Gengliang Li, Junlong Gao, Xin Tu, Zhenyu Guo†
DAMO Academy, Alibaba Group * Equal Contribution.   † Corresponding author.


Agent and AIGC (Artificial Intelligence Generated Content) technologies have recently made significant progress. We propose AesopAgent, an Agent-driven Evolutionary System on Story-to-Video Production. AesopAgent is a practical application of agent technology for multimodal content generation. The system integrates multiple generative capabilities within a unified framework, so that individual users can leverage these modules easily. It converts user story proposals into scripts, images, and audio, and then integrates these multimodal contents into videos. Additionally, animating units (e.g., Gen-2 and Sora) can make the videos more engaging. AesopAgent orchestrates the task workflow for video generation, ensuring that the generated video is both rich in content and coherent.

The system mainly contains two layers: the Horizontal Layer and the Utility Layer. In the Horizontal Layer, we introduce a novel RAG-based evolutionary system that optimizes both the overall video generation workflow and the individual steps within it. It continuously evolves and iteratively optimizes the workflow by accumulating expert experience and professional knowledge, including optimizing the LLM prompts and the usage of utilities. The Utility Layer provides multiple utilities that lead to consistent image generation, visually coherent in terms of composition, characters, and style. It also provides audio and special effects, integrating them into expressive and logically arranged videos. Overall, AesopAgent achieves state-of-the-art performance compared with many previous works in visual storytelling. AesopAgent is designed as a convenient service for individual users.

Main Body


Figure 1: Overview of AesopAgent. The system converts the user story proposal into a video assembled from images, audio, narration, and special effects. The video generation workflow suggested by AesopAgent, which uses agent-based approaches and RAG techniques, encompasses script generation, image generation, and video assembly.

AesopAgent combines agent-based approaches and RAG techniques with expert insights to drive an iterative evolutionary process that results in an efficient workflow. This workflow automatically creates high-quality videos from user story proposals. As illustrated in Figure 1, upon receiving a user's “Dragon Story” proposal, AesopAgent runs the workflow, designed by agents implemented with RAG, through script generation, image generation, and video assembly, and ultimately generates a high-quality dragon story video.
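The three workflow steps named above can be pictured as a simple chain of stages. The following is a minimal sketch only, not the authors' implementation: the names `Shot`, `generate_script`, `generate_images`, `assemble_video`, and `run_workflow` are hypothetical, and each stage is a placeholder standing in for the LLM, image generator, and video renderer that a real system would call.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Shot:
    """One scripted shot (hypothetical structure): narration plus an image prompt."""
    narration: str
    image_prompt: str
    image_path: str = ""


def generate_script(proposal: str) -> List[Shot]:
    # Placeholder for the script generation step: one shot per sentence.
    # A real system would call an LLM here instead of splitting on periods.
    sentences = [s.strip() for s in proposal.split(".") if s.strip()]
    return [Shot(narration=s, image_prompt=f"illustration of: {s}") for s in sentences]


def generate_images(shots: List[Shot]) -> List[Shot]:
    # Placeholder for the image generation step: only records the file name
    # each image would be written to, without rendering anything.
    for index, shot in enumerate(shots):
        shot.image_path = f"shot_{index:03d}.png"
    return shots


def assemble_video(shots: List[Shot]) -> dict:
    # Placeholder for the video assembly step: returns a manifest that pairs
    # frames with narration instead of producing a video file.
    return {
        "frames": [shot.image_path for shot in shots],
        "narration": [shot.narration for shot in shots],
    }


def run_workflow(proposal: str) -> dict:
    # Chain the three stages in the order described in the text.
    return assemble_video(generate_images(generate_script(proposal)))


if __name__ == "__main__":
    manifest = run_workflow("A dragon guards a village. A child befriends the dragon.")
    print(manifest)
```

Keeping each stage behind its own function means a stage can be swapped or re-tuned without touching the others, which is the property an iteratively optimized workflow relies on.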


Figure 2: Illustration of the AesopAgent framework. The bottom part of the figure shows the workflow from the user story proposal to the video, and the top part shows the main components of our method: the Horizontal Layer and the Utility Layer. The Horizontal Layer is responsible for leveraging agent and RAG techniques, optimizing workflow and prompts, and optimizing utilities usage, and the Utility Layer is responsible for providing utilities for image generation and video assembly steps.
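One way to picture how the Horizontal Layer could reuse accumulated expert experience when optimizing prompts is a small retrieval step placed in front of prompt construction. The sketch below is an illustration under stated assumptions, not the paper's method: `Experience`, `ExperienceStore`, and `build_prompt` are hypothetical names, and the word-overlap ranking is a stand-in for whatever retriever a real RAG system would use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Experience:
    """A piece of expert feedback (hypothetical structure)."""
    step: str  # workflow step the note applies to, e.g. "script" or "image"
    note: str  # the feedback itself, in plain text


class ExperienceStore:
    """In-memory store; a real system would likely use a vector database."""

    def __init__(self) -> None:
        self._items: List[Experience] = []

    def add(self, step: str, note: str) -> None:
        self._items.append(Experience(step, note))

    def retrieve(self, step: str, query: str, k: int = 2) -> List[str]:
        # Rank notes for this step by word overlap with the query.
        # This is an assumption made for illustration, not a real retriever.
        query_tokens = set(query.lower().split())
        candidates = [item for item in self._items if item.step == step]
        ranked = sorted(
            candidates,
            key=lambda item: len(query_tokens & set(item.note.lower().split())),
            reverse=True,
        )
        return [item.note for item in ranked[:k]]


def build_prompt(base_prompt: str, step: str, task: str, store: ExperienceStore) -> str:
    # Augment the base prompt with retrieved feedback before calling the LLM.
    notes = store.retrieve(step, task)
    if not notes:
        return f"{base_prompt}\nTask: {task}"
    guidance = "\n".join(f"- {note}" for note in notes)
    return f"{base_prompt}\nGuidance from past feedback:\n{guidance}\nTask: {task}"


if __name__ == "__main__":
    store = ExperienceStore()
    store.add("script", "keep each dragon scene under two sentences")
    store.add("image", "use a consistent style across shots")
    print(build_prompt("Write a script.", "script", "dragon story", store))
```

Each round of expert feedback adds a note to the store, so later prompts for the same step pick it up automatically; this is the sense in which such a loop "evolves" without retraining a model.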


Figure 3: Illustration of the Utility Layer. The Utility Layer contains four modules: image composition rationality, multiple characters consistency, image style consistency, and dynamic video assembly. Each module draws on utilities from the utilities library, and these utilities are created and optimized by the Horizontal Layer.


Figure 4: Qualitative results of different methods on image generation. We show the images generated for two stories (i.e., "Goldilocks" and "Epaminondas and Auntie") by SDXL, ComicAI, and our AesopAgent. Specifically, the images generated by AesopAgent's different utility modules are shown in Columns 3 to 6 (note that the modules are cumulative: each module listed lower also includes the improvements of the modules above it).


Figure 5: Qualitative comparison with other methods. We compare the keyframes of our AesopAgent with those of three other methods (i.e., Human Design, NUWA-XL, and AutoStory) on three stories. The semantic meaning of each frame is listed above the corresponding generated image.

Generated Videos

A Story of Little Red Riding Hood and the Wolf

Tang Poems

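《池上》 小娃撑小艇,偷采白莲回。不解藏踪迹,浮萍一道开。
("On the Pond": A small child poles a little boat and secretly picks white lotus before heading back. Not knowing how to hide his tracks, he leaves a trail parted through the duckweed.)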


The Cat and the Dog