TokensGen: Harnessing Condensed Tokens for Long Video Generation

¹S-Lab, Nanyang Technological University, ²SenseTime Research, ³Wangxuan Institute of Computer Technology, Peking University

ICCV 2025

Abstract

Generating consistent long videos is a complex challenge: while diffusion-based generative models produce visually impressive short clips, extending them to longer durations often leads to memory bottlenecks and long-term inconsistency. In this paper, we propose TokensGen, a novel two-stage framework that leverages condensed tokens to address these issues. Our method decomposes long video generation into three core tasks: (1) inner-clip semantic control, (2) long-term consistency control, and (3) inter-clip smooth transition.

First, we train To2V (Token-to-Video), a short video diffusion model guided by text and video tokens, with a Video Tokenizer that condenses short clips into semantically rich tokens. Second, we introduce T2To (Text-to-Token), a video token diffusion transformer that generates all tokens at once, ensuring global consistency across clips. Finally, during inference, an adaptive FIFO-Diffusion strategy connects adjacent clips, reducing boundary artifacts and yielding smoother transitions.
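The data flow described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names (`t2to_generate_tokens`, `to2v_generate_clip`, `blend_boundaries`, `generate_long_video`), the token shapes, and the scalar "frames" are hypothetical stand-ins, not the authors' models or API. In particular, a plain linear cross-fade takes the place of the adaptive FIFO-Diffusion step, whose details are not given on this page. The only point it tries to show is the ordering: tokens for all clips are produced in one call, each clip is then generated from its own tokens, and adjacent clips are joined at their boundaries.

```python
import random
from typing import List

Vector = List[float]


def t2to_generate_tokens(prompt: str, num_clips: int, tokens_per_clip: int,
                         dim: int, seed: int = 0) -> List[List[Vector]]:
    """Hypothetical stand-in for T2To: returns tokens for ALL clips in one call."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(dim)]
             for _ in range(tokens_per_clip)]
            for _ in range(num_clips)]


def to2v_generate_clip(prompt: str, clip_tokens: List[Vector],
                       frames_per_clip: int) -> List[float]:
    """Hypothetical stand-in for To2V: one short clip conditioned on its tokens.

    Each "frame" is a single float derived from the token mean, so the toy
    clip content depends on the tokens it was given.
    """
    count = len(clip_tokens) * len(clip_tokens[0])
    mean = sum(sum(token) for token in clip_tokens) / count
    return [mean + 0.01 * i for i in range(frames_per_clip)]


def blend_boundaries(clips: List[List[float]], overlap: int) -> List[float]:
    """Linear cross-fade over `overlap` frames (NOT FIFO-Diffusion; assumption)."""
    video = list(clips[0])
    for clip in clips[1:]:
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)          # weight of the incoming clip
            idx = len(video) - overlap + i       # frame shared by both clips
            video[idx] = (1 - w) * video[idx] + w * clip[i]
        video.extend(clip[overlap:])
    return video


def generate_long_video(prompt: str, num_clips: int = 4,
                        frames_per_clip: int = 8, overlap: int = 2) -> List[float]:
    """Tokens for every clip first, then per-clip generation, then joining."""
    if not 0 <= overlap < frames_per_clip:
        raise ValueError("overlap must be in [0, frames_per_clip)")
    tokens = t2to_generate_tokens(prompt, num_clips, tokens_per_clip=3, dim=4)
    clips = [to2v_generate_clip(prompt, tok, frames_per_clip) for tok in tokens]
    return blend_boundaries(clips, overlap)


if __name__ == "__main__":
    video = generate_long_video("a rider on horseback in the countryside")
    print(len(video))
```

In this sketch the final length is `frames_per_clip + (num_clips - 1) * (frames_per_clip - overlap)` frames, because each clip after the first shares `overlap` frames with its predecessor. Generating all tokens before any clip is what lets every clip be conditioned on a globally planned sequence rather than only on its immediate neighbour.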

Experimental results demonstrate that our approach significantly enhances long-term temporal and content coherence without incurring prohibitive computational overhead. By leveraging condensed tokens and pre-trained short video models, our method provides a scalable, modular solution for long video generation, opening new possibilities for storytelling, cinematic production, and immersive simulations.

Visual Results

Generation

A journey through a picturesque countryside on horseback. The rider, atop a well-groomed horse, traverses a variety of landscapes, including open fields, dirt paths lined with tall stone walls, and areas dense with lush trees. The setting is serene and bucolic, with the natural beauty of the environment highlighted by the soft, warm lighting of what appears to be either a sunrise or sunset. The atmosphere is tranquil, and the video captures the essence of a peaceful ride through rural scenery.

A serene journey along a dirt road meandering through a lush forest. The road is flanked by a variety of trees and shrubbery, with the sun casting dynamic shadows across the path. As the video progresses, the perspective travels down the road, offering changing views of the forested landscape. The natural setting is vibrant with greenery and the sky is a clear blue with occasional fluffy white clouds, suggesting a peaceful, sunny day.

A person riding a horse on a dirt path through a lush forest environment. The rider appears to be wearing a hat and a coat, suggesting a setting that could be from a historical or adventure context. The horse moves steadily along the path, surrounded by dense greenery, including tall trees and underbrush. As the journey progresses, the scenery opens up to reveal a stunning backdrop of majestic mountains, with the path leading the rider closer to a serene blue river. The lighting changes throughout the video, with the initial scenes bathed in soft daylight and later scenes capturing the golden hues of the sun low on the horizon, creating a warm, atmospheric glow that enhances the natural beauty of the landscape.

A serene and natural landscape during twilight. It begins with a view of a dirt path meandering through a lush grassy area, flanked by trees and wildflowers. As the video progresses, the path leads the viewer towards a tranquil body of water visible in the distance. The lighting is soft and dim, suggesting either dawn or dusk, with the sky painted in muted blues and the first hints of warmer tones near the horizon.

A lone rider guides a horse along a rustic dirt track that weaves through tranquil countryside. Dressed in understated, dark clothing, the rider maintains a calm, measured pace while passing beneath leafy trees and across stretches of emerald grass. The route unfolds in gentle stages—first a canopy‑shaded lane, then an open meadow shimmering with morning (or dusk) light, and at last a quiet expanse of water whose mirrored surface captures the day’s soft glow—together painting a serene pastoral journey.

A serene and expansive beach on a cloudy day, with the overcast sky casting a cool, blue-toned hue over the scene. The shoreline is gently caressed by the lapping waves of the ocean, creating a rhythmic and calming presence. In the distance, a row of houses can be seen, adding a sense of habitation and life to the otherwise tranquil and natural landscape. The beach itself appears to be mostly deserted, emphasizing the quiet and peaceful atmosphere.

A lone swordsman—seemingly a samurai—makes his way down an exquisitely crafted trail inside a video game world. He wears traditional Japanese garb: a midnight‑hued kimono traced with delicate motifs, a conical straw hat, and a katana resting at his hip. With each deliberate step, the landscape transforms: shadowed woodland ablaze in autumn reds and golds slips into a sunlit glade that opens onto a tranquil stretch of coast. His calm, purposeful stride speaks of a quest woven deep into the game’s narrative.

The serene beauty of a river flowing through a rugged landscape at sunset. The river, filled with clear, rushing water, meanders through a terrain scattered with rocks and boulders. The surrounding mountains, bathed in the warm hues of the setting sun, create a majestic backdrop. The sky, painted with streaks of clouds, reflects a spectrum of colors from orange to blue, adding to the tranquil and picturesque setting.

A man's journey down a foggy, tree-lined road. The man is dressed in a period-appropriate outfit, suggesting a setting from the past, with a white shirt, vest, and dark pants, complemented by a hat. His pace is steady and deliberate, as he traverses the uneven, muddy path. The fog adds a layer of mystery and serenity to the scene, with the sunlight struggling to pierce through the dense mist, creating a soft, diffused light that bathes the landscape.

The exhilarating experience of a mountain biker navigating a rugged dirt trail in a dense forest. The trail is uneven and strewn with natural obstacles such as rocks, roots, and occasional puddles, reflecting the challenging conditions of a typical mountain biking path. The surrounding woods are lush with a variety of green foliage, indicating a healthy, vibrant forest ecosystem. As the biker progresses, the trail presents varying degrees of difficulty, including narrow passages flanked by trees and larger rocks that require careful maneuvering.

The video depicts a sequence from a video game where the player is driving a red car along a winding rural highway at dawn. The environment is richly detailed, featuring rolling hills, distant farmhouses, and occasional wooden fences. The setting evokes a nostalgic mid-20th-century countryside. As the video progresses, the player navigates the car through different landscapes, including open fields bathed in golden sunlight and shaded forested sections with dappled light filtering through the trees.

A serene maritime scene at dusk, with several small boats gently swaying on the calm surface of a body of water. The boats, moored by ropes, appear to be traditional, with a simple design and painted in shades of blue, creating a harmonious visual with the water. The backdrop features a rugged coastline, with rocks and landforms that suggest a natural harbor or cove. The lighting is soft and warm, indicative of the golden hour, which adds a tranquil and picturesque quality to the scene.

A serene and picturesque snowy mountain path, flanked by dense evergreen trees and a clear blue sky. The ground is blanketed in a thick layer of pristine snow, which sparkles under the bright sunlight. As the video progresses, the viewer is taken on a tranquil journey along the path, which meanders through the rugged terrain, offering glimpses of the majestic mountain landscape. The natural beauty of the scene is accentuated by the contrast between the white snow, the dark green of the pine trees, and the rocky outcrops that occasionally jut out from the snow-covered slopes.

The serene and atmospheric setting of a river meandering through a dense forest. The forest is lush with a variety of trees, some with thick canopies and others more sparse and skeletal. The ground is covered with grass and foliage, glistening with moisture, suggesting recent rain or morning dew. The sky is overcast, with fog hanging low, diffusing the light and creating a soft, ethereal glow that permeates the scene. As the video progresses, the perspective moves gently along the riverbank, revealing the tranquil flow of water and the rich textures of the environment.

The serene beauty of cherry blossoms in full bloom, gently swaying in the foreground, with a tranquil lake stretching out towards a stately building in the background. The overcast sky casts a soft, diffused light, enhancing the delicate pink hues of the flowers. Throughout the video, the focus remains on the cherry blossoms, while the building and lake provide a constant, picturesque backdrop. The ambiance is peaceful, with the subtle movement of the branches suggesting a light breeze.

A journey along a dirt road leading up to a hill with a panoramic view of a distant city. The progression of the frames shows the road winding through a natural landscape, with the cityscape gradually coming into clearer view as the camera moves forward. The sky is overcast, with a blanket of clouds diffusing the light, and the vegetation on either side of the road suggests a semi-arid region. As the video advances, the presence of human habitation becomes more apparent, with buildings and infrastructure like power lines entering the frame.


Editing

A child, likely a little girl, wanders down a narrow, earthen corridor carved through a sea of crimson poppies. The child is dressed in a golden princess dress with fluttering sleeves, and her sun‑kissed hair is gathered into a loose braid tied with a golden ribbon. As she steps forward, the towering blossoms sway at her shoulders, their vibrant reds giving way to softer greens and distant hues as the scene drifts toward a hazy, light‑washed horizon. Her pace is unhurried and inquisitive, fingertips grazing petals, suggesting a quiet journey of discovery amid the gentle rustle of summer blooms.

A white off-road vehicle navigates a snow-covered mountain road, surrounded by frosted pines and rugged hills. Snow kicks up behind it, hinting at the winter adventure ahead as it continues deeper into the silent wilderness.


Comparisons

The serene transition from late afternoon to dusk in a pastoral setting. Initially, the sun casts a warm glow over a grassy field, with trees dotting the landscape and a backdrop of majestic mountains. As time progresses, the light softens, and the colors become more muted, with the sky transitioning through shades of orange, pink, and purple before settling into the cooler tones of twilight. The presence of a rustic settlement with buildings and fences becomes more pronounced as the natural light fades, suggesting a harmonious blend of nature and rural life.

A man traversing through various terrains in a video game environment. He is dressed in a traditional Western outfit, complete with a hat, coat, and boots, suggesting a setting reminiscent of the American frontier era. The man moves with purpose, walking steadily down paths on open grassy fields. The paths finally take him to rocky canyons. Throughout his journey, the landscape changes, but his determined stride remains constant, indicating a sense of adventure or a mission to be accomplished within the game's narrative.

The video features a person exploring a vast, dimly lit cave with a flowing water body. The individual is equipped with a backpack and appears to be on an adventurous journey, navigating through the natural underground environment. The cave's interior is illuminated by natural light filtering in from an opening, revealing intricate rock formations and a diverse range of flora. As the person progresses, the cave transitions into a more open area with signs of advanced technology, contrasting the natural elements with a futuristic structure that stands prominently within the cave.

The serene and majestic Golden Gate Bridge at night, illuminated by a series of orange and white lights that outline its iconic structure. The scene is set against a dusky blue sky, with the dark silhouettes of hills faintly visible in the background. The foreground features the restless, dark ocean with waves crashing onto the shore, adding a dynamic element to the otherwise tranquil nocturnal landscape. The overall atmosphere is one of calmness and grandeur, with the bridge standing as a testament to human engineering amidst the natural beauty of the sea and sky.


Ablation Study

The video features a video game character traversing a lush, grassy field. The character appears to be a warrior or knight, clad in dark armor, with a sword sheathed at their side and a shield on their back. They have a helmet with horns, giving them a formidable appearance. As the character moves through the environment, they pass by an array of green pine trees and approach an ancient stone fortress with tall towers and crumbling walls, partially enveloped in a light mist. The setting conveys a medieval fantasy world, rich with natural beauty and a sense of adventure.

The video depicts a character navigating through a post-apocalyptic environment within a video game. The character is equipped for survival, carrying weapons and gear on their back, and is dressed in rugged attire suitable for harsh conditions. The journey takes them along a damaged road with signs of destruction and abandonment. The setting is realistic and immersive, with dynamic weather and lighting that contribute to the game's atmosphere.


BibTeX

@misc{ouyang2025tokensgenharnessingcondensedtokens,
  title={TokensGen: Harnessing Condensed Tokens for Long Video Generation},
  author={Wenqi Ouyang and Zeqi Xiao and Danni Yang and Yifan Zhou and Shuai Yang and Lei Yang and Jianlou Si and Xingang Pan},
  year={2025},
  eprint={2507.15728},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.15728},
}