https://youtu.be/vmt_WVBJMj4?si=OZlzEqfEvWjPakYV
I developed a system that uses generative AI to automatically create presentation videos from blog articles and upload them to YouTube.
With some ingenuity, generative AI can not only devise the presentation's storyline but also create the presentation materials.
Furthermore, by having generative AI create a script for the presentation and then having a text-to-speech generative AI read that script aloud, audio data can also be generated.
Combining the presentation materials and audio data then produces a video.
By automating this series of tasks, I've made it possible to automatically generate presentation videos with a single click.
Mechanism
The most crucial part of the process is the generation of the presentation materials.
Generative AI is good at generating images, but mostly photographs and drawings. Image-generating AI struggles with documents built around text and figures, such as presentation materials.
Therefore, I generate text and figure-centric materials in a text-based format, similar to a programming language.
Several formats can be used for creating such materials.
Initially, I tried Marp, a Markdown-based format designed specifically for creating slides, but its capabilities were too limited for this purpose. So I switched to SVG, a more general-purpose format for vector graphics.
With a text-based format like SVG, a standard chat-based generative AI can create the materials if prompted with a request like, "Please create presentation materials in SVG format introducing the content of this blog article."
Text Overflow Problem
The problem here is that text frequently extends beyond the outer frame of the document or the frames of figures within the document.
A human would immediately notice text overflow when viewing the rendered document. However, detecting it from the SVG source text alone, without visually inspecting the rendered result, is difficult.
Consequently, chat-based generative AI often produces documents with frequent text overflow.
Of course, the AI generates most of the content well, and I could simply correct the text overflow manually. However, this would introduce a manual step every time.
Therefore, it became necessary to incorporate measures to prevent text overflow when generating SVG documents, and to develop a mechanism to automatically detect if any text overflow is present in the generated SVG.
To prevent text overflow, I adopted an approach of providing the generative AI with basic rules, operating procedures, and cautionary notes when instructing it to create presentation materials.
As rules, I instructed it not to use complex figures and to fix the font size of the text.
Furthermore, I instructed it to follow a procedure: count the characters in each piece of text in the document, multiply that count by the font size to estimate the text's width and height, and confirm in advance that the text does not overflow the document frame or any figure.
During this process, I instructed the AI to record the checked process and results as pre-check comments within the SVG file.
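The estimate described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the actual prompt or program: the function names, the `char_width_ratio` and `line_height_ratio` parameters, and the comment format are all assumptions made for the example. A ratio of 1.0 treats every character as one font-size wide, which is reasonable for full-width characters and pessimistic for Latin text.

```python
def estimate_text_size(text, font_size, char_width_ratio=1.0, line_height_ratio=1.2):
    """Estimate (width, height) in pixels from character count and font size.

    char_width_ratio and line_height_ratio are assumed values, not measured ones.
    """
    lines = text.split("\n")
    # Width is driven by the longest line, height by the number of lines.
    width = max(len(line) for line in lines) * font_size * char_width_ratio
    height = len(lines) * font_size * line_height_ratio
    return width, height


def precheck_comment(text, font_size, box_width):
    """Build a hypothetical pre-check comment for one single-line text."""
    width, _ = estimate_text_size(text, font_size)
    verdict = "OK" if width <= box_width else "OVERFLOW"
    return (
        f"<!-- pre-check: {len(text)} chars x {font_size}px = {width:g}px, "
        f"box {box_width:g}px: {verdict} -->"
    )
```

For example, `precheck_comment("Hello world", 20, 200)` reports an overflow, because 11 characters at 20px are estimated at 220px against a 200px box.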
Adding these instructions brought some improvement, but the initial accuracy was not satisfactory. So, I repeatedly generated several variations, added common error points as cautionary notes to the instructions, and emphasized rules and instructions repeatedly within the prompt text if they were not being followed.
By iterating on these prompt improvements through trial and error, text overflow can be suppressed to a certain extent.
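As a rough sketch of how such a prompt could be assembled from rules, a procedure, and a growing list of cautionary notes, consider the following. The function name, the section labels, and the exact wording of each instruction are hypothetical. Only the kinds of instructions (no complex figures, a fixed font size, the pre-check procedure, recording checks as comments) come from the description above.

```python
def build_slide_prompt(article: str, font_size: int, cautions: list[str]) -> str:
    """Assemble a prompt from rules, a procedure, and cautionary notes (illustrative)."""
    rules = [
        "Do not use complex figures.",
        f"Use a fixed font size of {font_size}px for all text.",
    ]
    procedure = [
        "Count the characters in each piece of text.",
        "Multiply the count by the font size to estimate its width and height.",
        "Confirm that the text fits inside the document frame and any figure that contains it.",
        "Record each check and its result as a pre-check comment in the SVG.",
    ]
    sections = [
        "Please create presentation materials in SVG format introducing "
        "the content of this blog article.",
        "Rules:\n" + "\n".join(f"- {r}" for r in rules),
        "Procedure:\n" + "\n".join(f"{i}. {s}" for i, s in enumerate(procedure, 1)),
    ]
    # Cautionary notes accumulate as common failures are found across iterations.
    if cautions:
        sections.append("Cautions:\n" + "\n".join(f"- {c}" for c in cautions))
    sections.append("Blog article:\n" + article)
    return "\n\n".join(sections)
```

Keeping the cautionary notes as a separate list makes each round of trial and error a matter of appending one more observed failure, rather than rewriting the whole prompt.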
However, even with all these efforts, perfection is unattainable, so I decided to implement a check in a later stage.
For this post-generation check, I tried using a generative AI capable of visually inspecting images, but it couldn't detect text overflow effectively, so I abandoned that approach.
Next, I attempted another method: inputting the SVG text back into a chat-based generative AI for checking.
This method was better at detecting text overflow than the visual inspection AI, but its detection accuracy was still not very high. Here again, by iteratively improving the instructions for detecting overflow, I could achieve a certain level of accuracy, but not a perfect one.
Therefore, I decided to write a program to detect text overflow more rigorously. It uses the same calculation I had instructed the generative AI to perform: it computes the width and height of each piece of text from its length and font size, then checks whether the text overflows the document frame or any figure inside it.
Creating this program was laborious, but it ultimately became capable of accurate detection.
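A heavily simplified sketch of such a checker is shown below. It is not the actual program, which had to handle many more cases. It assumes that the SVG root has numeric `width` and `height` attributes, that every `<text>` element is a single left-aligned line with numeric `x`, `y`, and `font-size` attributes, that figures are plain `<rect>` elements, and that no transforms are used. The function name `find_overflows` and the `char_width_ratio` parameter are made up for this example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

SAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="100">
  <rect x="10" y="10" width="80" height="40"/>
  <text x="20" y="30" font-size="10">short</text>
  <text x="20" y="45" font-size="10">this one is too long</text>
</svg>"""


def find_overflows(svg_source, char_width_ratio=1.0):
    """Return (text, region) pairs for text estimated to overflow a region."""
    root = ET.fromstring(svg_source)
    doc_w, doc_h = float(root.get("width")), float(root.get("height"))
    rects = [
        (float(r.get("x", 0)), float(r.get("y", 0)),
         float(r.get("width")), float(r.get("height")))
        for r in root.iter(SVG_NS + "rect")
    ]
    problems = []
    for t in root.iter(SVG_NS + "text"):
        content = "".join(t.itertext()).strip()
        x, y = float(t.get("x", 0)), float(t.get("y", 0))
        size = float(t.get("font-size", 16))
        # Estimated right edge: start position plus characters times font size.
        right = x + len(content) * size * char_width_ratio
        top = y - size  # y is the baseline, so the text extends upward from it
        if x < 0 or top < 0 or right > doc_w or y > doc_h:
            problems.append((content, "document frame"))
        for rx, ry, rw, rh in rects:
            # Treat the text as belonging to a rect if its anchor point lies inside.
            inside = rx <= x <= rx + rw and ry <= y <= ry + rh
            if inside and right > rx + rw:
                problems.append((content, "rect"))
    return problems
```

Running `find_overflows(SAMPLE_SVG)` flags only the second text element, once against the document frame and once against the rectangle that contains it.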
In addition to text overflow, there were also cases where the AI attempted to create complex charts and produced distorted outputs. For such aspects, I retained the approach of having the chat-based generative AI perform a rule violation check.
This check determines if the AI created figures that are more complex than defined in the rules, marking them as unacceptable.
With the program checking for overflow and the generative AI checking for rule violations, most problems can now be detected.
Subsequent Processing
If either check rejects the material, the generated SVG is discarded and regenerated from scratch. Pointing out the problem areas and asking for corrections often introduces other issues and ends up taking more time.
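The discard-and-regenerate policy can be sketched as a simple loop. The names `generate_until_accepted`, `generate`, and `checks`, as well as the `max_attempts` limit, are hypothetical: `generate` stands in for the call to the generative AI, and each entry in `checks` stands in for one of the checks described above, returning True when the material passes.

```python
from typing import Callable, Iterable, Optional


def generate_until_accepted(
    generate: Callable[[], str],
    checks: Iterable[Callable[[str], bool]],
    max_attempts: int = 5,
) -> Optional[str]:
    """Regenerate from scratch until every check passes, or give up."""
    checks = list(checks)
    for _ in range(max_attempts):
        svg = generate()
        # A rejected result is thrown away rather than sent back for correction.
        if all(check(svg) for check in checks):
            return svg
    return None
```

Returning `None` after a fixed number of attempts is a design choice for this sketch, so that a persistently failing prompt does not loop forever.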
Once a presentation material free of text overflow is complete, the next step is to input this material and the original blog article to the generative AI to create the narration script. No particular ingenuity was required here.
Then, the narration script is converted into audio data using a text-to-speech generative AI. Again, no special techniques were needed for this.
Finally, the SVG-format presentation material is converted into PNG images, which are then combined with the audio into an mp4 video using a tool called ffmpeg. This completes the process.
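One common way to turn a still image and an audio file into an mp4 with ffmpeg looks like the sketch below. The exact options used in this system are not stated here, so the choice of flags, the helper names, and the one-image-per-audio-file layout are assumptions. The SVG-to-PNG step is left out because the conversion tool is not specified.

```python
import subprocess


def build_ffmpeg_command(png_path, audio_path, mp4_path):
    """Build an ffmpeg command that shows one PNG for the length of one audio file."""
    return [
        "ffmpeg", "-y",              # overwrite the output file if it exists
        "-loop", "1", "-i", png_path,  # repeat the still image as the video input
        "-i", audio_path,            # narration audio
        "-c:v", "libx264", "-tune", "stillimage",
        "-pix_fmt", "yuv420p",       # widely playable pixel format
        "-c:a", "aac",
        "-shortest",                 # stop when the audio ends
        mp4_path,
    ]


def render_slide_video(png_path, audio_path, mp4_path):
    """Run ffmpeg; requires ffmpeg to be installed and on PATH."""
    subprocess.run(build_ffmpeg_command(png_path, audio_path, mp4_path), check=True)
```

Note that libx264 with `yuv420p` expects even image dimensions, so the PNG should be rendered at a size such as 1280x720.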
The series of processes following the creation of SVG-format slides can easily be automated by writing programs while consulting with generative AI.
Conclusion
Having successfully built and refined this automated presentation video generation system myself, I began publishing videos on YouTube last week.
However, shortly after this system was completed, Google's NotebookLM, an AI tool, also gained a similar feature for automatically generating videos to explain text documents.
Therefore, I expect that companies offering AI services will release similar services in the future, removing the need for individuals to build such systems from scratch.
Nevertheless, building a practical program that uses generative AI in earnest was a significant achievement, and it allowed me to grasp the key principles of using generative AI effectively.