Giving Text Animation a Semantic Soul: HKUST Open Sources "Dynamic Typography", Adding a Touch of Romance to Every Word

[New Wisdom Editor's Note] A team from the Hong Kong University of Science and Technology and Tel Aviv University has open-sourced "Dynamic Typography", a technique built on large text-to-video models. Just pick a letter and provide a short text description, and it generates an SVG animation that makes the letter come vividly alive on the page.

The “M” in ROMANTIC turns into a couple holding hands and walking back and forth.


The "h" in Father is interpreted as a father patiently leading his child on a walk.

The "N" in PASSION transforms into a couple kissing.


The "S" in SWAN turns into a swan gracefully stretching its neck.

The "P" in TELESCOPE becomes an actual telescope, slowly turning toward the camera.

This is the latest work brought to us by the research team from Hong Kong University of Science and Technology and Tel Aviv University: Dynamic Typography.

Paper link: https://arxiv.org/abs/2404.11614

Project home page: https://animate-your-word.github.io/demo/

Make the text move

Text animation is an expressive medium that transforms static communication into dynamic experiences to evoke emotions, emphasize the meaning of text, and build engaging narratives. It is widely used in memes, videos, and advertising production. However, producing such semantic animations requires expertise in graphic design and animation.

To address this, the researchers propose a new automated text-animation approach, "Dynamic Typography", which tightly integrates text and animation.

This solution can be broken down into two steps:

1. Based on the user's description, the letter is deformed so that its shape conveys the semantics of the text.

2. The deformed letter is then animated with the vivid motion the user describes, producing the final text animation.

Keeping text readable while keeping its motion smooth is extremely challenging. Current text-to-video models cannot guarantee legible text, let alone "deform" the text according to its semantics to better convey motion. Retraining such a model would require a large, hard-to-obtain dataset of stylized text videos.

Instead, the researchers use Score Distillation Sampling (SDS) to distill the prior knowledge of a large pretrained text-to-video model, predicting the per-frame displacement of the control points in the text's vector representation. Additional legibility constraints and structure-preservation techniques maintain readability and appearance while the text moves.

The researchers demonstrate the generalizability of the proposed framework across various text-to-video models and highlight its superiority over baseline methods. Experimental results show that the technique generates text animations that are coherent and consistent with the user's description, while preserving the readability of the original text.


1. Data representation

In this work, the outline of each letter is represented as a set of connected cubic Bezier curves, whose shapes are determined by their control points. The proposed method predicts the displacement of every control point in every frame: these displacements "deform" the letter to convey semantic information, and the differing displacements across frames produce the motion.

The outlines of the letters are extracted as connected cubic Bezier curves
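The representation above is easy to picture in code. The sketch below (a minimal illustration, not the authors' implementation) evaluates one cubic Bezier segment from its four control points, then shows how adding a per-frame displacement to those control points moves the rendered curve; the specific coordinates and offsets are made up for the example.

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier segment at parameters t in [0, 1]."""
    t = np.asarray(t)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

# A letter outline is a list of such connected segments; animating it
# amounts to adding a per-frame displacement to every control point.
segment = np.array([[0.0, 0.0], [0.3, 1.0], [0.7, 1.0], [1.0, 0.0]])
per_frame_disp = np.array([[0.0, 0.1]] * 4)  # hypothetical frame-1 offsets

t = np.linspace(0.0, 1.0, 5)
static = cubic_bezier(*segment, t)                      # original outline
moved = cubic_bezier(*(segment + per_frame_disp), t)    # displaced outline
```

Because the curve is a fixed polynomial of its control points, predicting control-point displacements is enough to animate the whole outline smoothly.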

2. Model framework

Given a letter represented by Bezier curves, the researchers first use a coordinate-based MLP (the Base Field) to deform the letter into a base shape that conveys its semantics; for example, as shown in the figure, the "M" in "CAMEL" is deformed into a camel.

The base shape is then copied to every frame, and another coordinate-based MLP (the Displacement Field) predicts the displacement of each control point in each frame, adding motion on top of the base shape.

Each frame is then rendered into a pixel image by a differentiable renderer, and the frames are concatenated into the output video. The base field and displacement field are jointly optimized end to end, driven by the prior knowledge of a text-to-video model together with the other constraints described below.
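The two-field design can be sketched as follows. This is a toy stand-in, not the paper's network: tiny numpy MLPs replace the real coordinate-based neural fields, the hidden size and frame count are arbitrary, and the renderer is omitted. It only shows the data flow: control points → base field → per-frame displacement field.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(in_dim, hidden, out_dim):
    """Random parameters for a one-hidden-layer MLP."""
    return [rng.normal(0, 0.1, (in_dim, hidden)), np.zeros(hidden),
            rng.normal(0, 0.1, (hidden, out_dim)), np.zeros(out_dim)]

def mlp(params, x):
    """Coordinate-based MLP: maps input coordinates to 2-D offsets."""
    w1, b1, w2, b2 = params
    return np.tanh(x @ w1 + b1) @ w2 + b2

base_field = init_mlp(2, 32, 2)   # (x, y) -> semantic deformation
disp_field = init_mlp(3, 32, 2)   # (x, y, frame) -> per-frame motion

control_pts = rng.uniform(size=(10, 2))            # original letter points
base_shape = control_pts + mlp(base_field, control_pts)

n_frames = 4
frames = []
for k in range(n_frames):
    # Copy the base shape to frame k and add its predicted displacement.
    coords = np.hstack([base_shape, np.full((10, 1), k / (n_frames - 1))])
    frames.append(base_shape + mlp(disp_field, coords))
```

In the actual method, each `frames[k]` would be rasterized by a differentiable renderer, and gradients from the video-level losses would flow back into both fields' parameters.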

3. Optimization

Current diffusion-based text-to-image models such as Stable Diffusion are trained on large-scale 2D pixel images and contain rich prior knowledge. Score Distillation Sampling (SDS) distills this prior out of the diffusion model to train other generators in other modalities, for example training the MLP parameters of a NeRF to produce 3D models.

In this work, the researchers distill a diffusion-based text-to-video model via SDS and use the resulting prior to train the parameters of the base field and the displacement field.
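The core SDS update can be sketched in a few lines. This is a simplified illustration under stated assumptions: a lambda stands in for the frozen, text-conditioned video denoiser, the weighting `w = 1 - alpha_bar` is one common choice rather than necessarily the paper's, and the "video" is a random tensor rather than a differentiable rendering.

```python
import numpy as np

rng = np.random.default_rng(1)

def sds_grad(render, denoiser, t, alpha_bar):
    """Score Distillation Sampling gradient: nudge the rendering toward
    regions the diffusion prior considers likely, without backpropagating
    through the (frozen) denoiser itself."""
    eps = rng.normal(size=render.shape)                    # injected noise
    noisy = np.sqrt(alpha_bar) * render + np.sqrt(1 - alpha_bar) * eps
    eps_pred = denoiser(noisy, t)                          # prior's guess
    w = 1.0 - alpha_bar                                    # common weighting
    return w * (eps_pred - eps)                            # d(loss)/d(render)

# Toy stand-in for a pretrained text-to-video denoiser (the real one is a
# large frozen network conditioned on the user's text prompt).
dummy_denoiser = lambda x, t: 0.1 * x

video = rng.normal(size=(4, 8, 8))     # 4 rendered frames of 8x8 "pixels"
g = sds_grad(video, dummy_denoiser, t=500, alpha_bar=0.5)
video -= 0.01 * g                      # one gradient step on the renders
```

In the full pipeline this gradient with respect to the rendered frames is pushed through the differentiable renderer into the base-field and displacement-field parameters.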

In addition, to ensure that every frame of the generated video keeps the letter itself legible (for example, the "M" in "CAMEL" must retain the shape of the letter M while resembling a camel, so that users can still recognize it), this work constrains the perceptual similarity between the base shape and the original letter with a loss based on Learned Perceptual Image Patch Similarity (LPIPS).
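The shape of that constraint is a distance between deep features of two renderings. The sketch below is a crude stand-in: the real work uses a pretrained LPIPS network, while here a hypothetical `pool` function plays the feature extractor and a hand-drawn block of ones plays the glyph.

```python
import numpy as np

def perceptual_loss(img_a, img_b, feature_fn):
    """Stand-in for an LPIPS-style constraint: mean squared distance
    between feature maps of two images. feature_fn is a placeholder for
    a pretrained perceptual network."""
    fa, fb = feature_fn(img_a), feature_fn(img_b)
    return float(np.mean((fa - fb) ** 2))

# Hypothetical feature extractor: 2x2 local averages as crude "features".
pool = lambda img: img.reshape(4, 2, 4, 2).mean(axis=(1, 3))

original_m = np.zeros((8, 8))
original_m[:, 2:6] = 1.0               # stand-in rendering of the letter
base_shape_render = original_m.copy()
base_shape_render[0, 0] = 1.0          # small semantic deformation

legibility = perceptual_loss(base_shape_render, original_m, pool)
```

Minimizing this term alongside the SDS objective lets the base shape drift toward the prompt's semantics only as far as the letter remains recognizable.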

To alleviate the severe flickering observed when Bezier curves frequently intersect, this work adds a triangulation-based structure-preservation constraint that maintains a stable skeleton during deformation and motion.

Frequent intersections of Bezier curves cause severe flickering

Structure-preservation loss based on triangulation
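One way to picture the structure-preservation idea: triangulate the glyph's control points once, then penalize frames whose triangles distort relative to the base shape. The sketch below uses a simple edge-length term over a fixed triangulation; the exact per-triangle loss in the paper may differ, so treat this as an assumption-labeled illustration.

```python
import numpy as np

def structure_loss(pts, ref_pts, triangles):
    """Penalize changes in triangle edge lengths relative to the base
    shape, discouraging control points from crossing (which causes
    flicker). The edge-length formulation here is illustrative only."""
    loss = 0.0
    for i, j, k in triangles:
        for a, b in ((i, j), (j, k), (k, i)):
            d_new = np.linalg.norm(pts[a] - pts[b])
            d_ref = np.linalg.norm(ref_pts[a] - ref_pts[b])
            loss += (d_new - d_ref) ** 2
    return loss

ref = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0], [1.5, 1.0]])
tris = [(0, 1, 2), (1, 3, 2)]        # fixed triangulation of the glyph

rigid = structure_loss(ref + 0.2, ref, tris)     # pure translation: ~0
stretch = structure_loss(ref * 1.5, ref, tris)   # scaling distorts edges
```

Rigid motion leaves the loss near zero while distortions are penalized, which is exactly the behavior needed to keep the letter's skeleton stable as it moves.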


In the experiments, the researchers evaluated both the legibility of the text and the consistency between the user-provided description and the generated video.

This work is compared against two classes of approaches: text-to-video models that operate on pixel images, and general animation schemes for vector graphics.

Among the pixel-based approaches, the comparison includes the leading text-to-video model Gen-2 and the image-to-video model DynamiCrafter.

The qualitative and quantitative comparisons show that most other methods either struggle to keep letters readable in the generated video or fail to produce semantically meaningful motion. The proposed method keeps the letters readable throughout the motion while generating movement that matches the user's text description.

Qualitative comparison with other methods

Quantitative comparison with other methods

To further demonstrate the role of each module, the researchers conducted thorough ablation studies. The results show that the base-shape design and the triangulation-based structure preservation effectively improve video quality, while the perceptual-similarity legibility constraint keeps the letters readable during motion.

Qualitative results of ablation experiments

Ablation experiment quantitative results

The researchers further demonstrated the versatility of the proposed framework across various text-to-video models, meaning it remains compatible with future video-generation models and will produce increasingly appealing text animations as those models improve.

Comparison of results of distilling different video generation models