Agent and AIGC (Artificial Intelligence Generated Content) technologies have recently made significant progress. We
propose AesopAgent, an Agent-driven Evolutionary System on Story-to-Video Production. AesopAgent is a practical
application of agent technology for multimodal content generation. The system integrates multiple generative
capabilities within a unified framework, so that individual users can leverage these modules easily.
The system converts user story proposals into scripts, images, and audio, and then integrates these multimodal
contents into videos. Additionally, animating units (e.g., Gen-2 and Sora) can make the videos more engaging.
AesopAgent orchestrates the task workflow for video generation, ensuring that the generated video is both rich in
content and coherent.
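The story-to-video flow described above (story proposal, then script, then per-shot images and audio, then an assembled video) can be illustrated with a minimal sketch. Everything here is hypothetical: `Shot`, `Video`, `write_script`, `generate_image`, `generate_audio`, and `story_to_video` are placeholder names, and the generators are stubs that return file names rather than calling any real model. It is meant only to show the order of the steps, not how AesopAgent implements them.

```python
from dataclasses import dataclass, field


@dataclass
class Shot:
    """One scene of the script plus its generated assets (placeholders)."""
    description: str
    image: str = ""
    audio: str = ""


@dataclass
class Video:
    """The assembled result: an ordered list of shots."""
    title: str
    shots: list = field(default_factory=list)


def write_script(story: str) -> list:
    # Stand-in for an LLM call: one shot per sentence of the story proposal.
    sentences = [s.strip() for s in story.split(".") if s.strip()]
    return [Shot(description=s) for s in sentences]


def generate_image(shot: Shot, index: int) -> str:
    # Stand-in for an image-generation utility; returns a fake file name.
    return f"shot_{index:02d}.png"


def generate_audio(shot: Shot, index: int) -> str:
    # Stand-in for a narration or audio utility; returns a fake file name.
    return f"shot_{index:02d}.wav"


def story_to_video(title: str, story: str) -> Video:
    # Orchestrate the steps in order: script first, then assets per shot.
    shots = write_script(story)
    for i, shot in enumerate(shots):
        shot.image = generate_image(shot, i)
        shot.audio = generate_audio(shot, i)
    return Video(title=title, shots=shots)


video = story_to_video(
    "The Fox and the Crow",
    "A crow finds some cheese. A fox flatters the crow. The crow drops the cheese.",
)
```

In this sketch each shot carries its own image and audio, so a later assembly step could align narration with visuals shot by shot.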
The system mainly contains two layers: the Horizontal Layer and the Utility Layer. In the Horizontal Layer, we
introduce a novel RAG-based evolutionary system that optimizes both the whole video generation workflow and the
individual steps within it. It evolves continuously, iteratively optimizing the workflow by accumulating expert
experience and professional knowledge, including optimizing the LLM prompts and the usage of utilities. The Utility
Layer provides multiple utilities that enable consistent image generation, visually coherent in terms of composition,
characters, and style. It also provides audio and special effects, integrating them into expressive and logically
arranged videos.
Overall, AesopAgent achieves state-of-the-art performance compared with many previous works in visual storytelling.
It is designed as a convenient service for individual users.
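The idea behind the Horizontal Layer, accumulating experience and retrieving it to refine LLM prompts, can be sketched as a toy retrieval-augmented loop. This is an illustration under stated assumptions, not the system's actual design: `ExperienceStore`, `build_prompt`, and the stored advice strings are hypothetical, and simple word overlap (Jaccard similarity) stands in for whatever retrieval method the real system uses.

```python
def tokenize(text):
    """Lowercase whitespace tokenization; a stand-in for real text embedding."""
    return set(text.lower().split())


class ExperienceStore:
    """Toy store of (task, advice) pairs retrieved by word overlap."""

    def __init__(self):
        self.entries = []

    def add(self, task, advice):
        # Record a lesson learned while working on a given kind of task.
        self.entries.append((task, advice))

    def retrieve(self, task, k=2):
        # Score each stored task by Jaccard similarity with the query task.
        query = tokenize(task)
        scored = []
        for past_task, advice in self.entries:
            past = tokenize(past_task)
            union = query | past
            score = len(query & past) / len(union) if union else 0.0
            if score > 0:
                scored.append((score, advice))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [advice for _, advice in scored[:k]]


def build_prompt(base_prompt, task, store):
    # Append retrieved lessons to the base prompt; leave it unchanged if none match.
    advice = store.retrieve(task)
    if not advice:
        return base_prompt
    lines = "\n".join(f"- {a}" for a in advice)
    return f"{base_prompt}\nLessons from earlier runs:\n{lines}"


store = ExperienceStore()
store.add("generate character image",
          "Repeat the character description in every image prompt.")
store.add("write narration audio",
          "Keep each narration line under fifteen words.")

prompt = build_prompt("Draw the next shot.",
                      "generate image of the main character", store)
```

Each new piece of feedback added with `store.add` changes what later prompts retrieve, which is the sense in which such a loop "evolves" over repeated runs.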