Sora is not yet in public beta, and Peking University is already releasing an open-source Open Sora - 武穆逸仙 In July 2025


Data source: https://github.com/trending


Name: PKU-YuanGroup/Open-Sora-Plan

URL: https://github.com/PKU-YuanGroup/Open-Sora-Plan

Forks: 189 Stars: 2,869 Language: Python

Description: This project aims to reproduce Sora (OpenAI's T2V model), but we only have limited resources. We deeply wish that the whole open-source community can contribute to this project.

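In February 2024, OpenAI introduced Sora, regarded at its release as the strongest text-to-video AI model and a major leap in video generation technology. Sora can generate detailed, high-definition videos up to one minute long from short text descriptions. Its release suggests that AI development has entered a new phase.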

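Sora has not entered public testing yet, so most people still cannot use it. There is no need to worry, though: the Peking University-Rabbitpre (兔展) joint lab has launched the Open Sora project, which aims to reproduce OpenAI's video generation model. Many users in China and abroad have said they are looking forward to it.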

Open Sora framework

The Open Sora framework consists of the following components:

Video VQ-VAE.
Denoising Diffusion Transformer.
Condition Encoder.

Framework diagram

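Open Sora implementation details

Variable aspect ratio

Following FiT, Open Sora implements a dynamic masking strategy that allows parallel batch training while keeping aspect ratios flexible. Specifically, it downsamples each high-resolution video so that its longest side is 256 pixels while preserving the aspect ratio, then zero-pads on the right and bottom to a uniform 256×256 resolution. This lets the video VAE encode videos in batches and lets the diffusion model denoise the batched latents using attention masks.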
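
The resize-and-pad geometry described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions, not the project's code: the function names `resized_shape` and `padding_mask` are hypothetical, the actual pixel resampling (which an image or video library would do) is left out, and the mask is built at pixel level, whereas the attention mask in the project would presumably apply to latent tokens.

```python
def resized_shape(height, width, max_side=256):
    """Scale (height, width) so the longest side equals max_side, keeping the aspect ratio."""
    longest = max(height, width)
    # Multiply before dividing so exact ratios stay exact; never collapse a side to zero.
    new_h = max(1, round(height * max_side / longest))
    new_w = max(1, round(width * max_side / longest))
    return new_h, new_w


def padding_mask(height, width, canvas=256):
    """Return a canvas x canvas grid: 1 marks real pixels, 0 marks zero padding.

    Content sits in the top-left corner, so padding lands on the right and bottom.
    """
    return [[1 if (r < height and c < width) else 0 for c in range(canvas)]
            for r in range(canvas)]


h, w = resized_shape(1080, 1920)   # a 16:9 source frame
mask = padding_mask(h, w)
print(h, w)                        # 144 256
print(sum(map(sum, mask)))         # 36864 valid pixels (144 * 256)
```

Every sample in a batch then has the same 256×256 shape, and the mask tells later stages which positions carry real content.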


Dynamic training strategy

Variable resolution

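Although Open Sora is trained at a fixed 256×256 resolution, it uses position interpolation at inference time to sample at variable resolutions. It rescales the position indices of the variable-resolution noisy latents from [0, seq_length-1] down to [0, 255] so that they align with the pretraining range. This adjustment lets the attention-based diffusion model handle higher-resolution sequences.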
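
As a rough illustration of the index rescaling, the sketch below linearly maps positions 0..seq_length-1 onto [0, 255]. The function name `interpolate_positions` and the `trained_length` parameter are hypothetical, and treating the sequence as one-dimensional is an assumption; the project may interpolate per axis or inside its positional-embedding code.

```python
def interpolate_positions(seq_length, trained_length=256):
    """Linearly map indices 0..seq_length-1 onto [0, trained_length - 1]."""
    if seq_length <= 1:
        return [0.0] * seq_length
    # Fractional positions are expected: the positional embedding is evaluated
    # at these rescaled coordinates instead of at raw integer indices.
    return [i * (trained_length - 1) / (seq_length - 1) for i in range(seq_length)]


positions = interpolate_positions(1024)   # a sequence 4x longer than in training
print(positions[0], positions[-1])        # 0.0 255.0
```

Because every rescaled position stays inside the range seen during training, the model is never asked to attend at an index it has no embedding experience with.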

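Variable duration

Open Sora uses the Video VQ-VAE from VideoGPT to compress videos into a latent space, which supports variable-duration generation. It also extends spatial position interpolation to the spatiotemporal dimensions so that videos of varying duration can be handled.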
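
One way to picture the spatiotemporal extension is to rescale the time, height, and width axes of the latent grid independently. The sketch below does that under stated assumptions: the helper names and the trained grid size `(16, 32, 32)` are hypothetical values chosen for illustration, not figures from the project.

```python
def scaled_axis(length, trained_length):
    """Spread `length` positions evenly over [0, trained_length - 1]."""
    if length <= 1:
        return [0.0] * length
    return [i * (trained_length - 1) / (length - 1) for i in range(length)]


def spatiotemporal_positions(frames, height, width, trained=(16, 32, 32)):
    """Return one (t, y, x) coordinate per latent token, each axis rescaled independently."""
    ts = scaled_axis(frames, trained[0])
    ys = scaled_axis(height, trained[1])
    xs = scaled_axis(width, trained[2])
    return [(t, y, x) for t in ts for y in ys for x in xs]


# A clip with twice as many latent frames as the assumed training grid.
coords = spatiotemporal_positions(frames=32, height=32, width=32)
print(len(coords))   # 32768
print(coords[-1])    # (15.0, 31.0, 31.0)
```

A longer clip only changes the spacing along the time axis; the spatial coordinates stay exactly as they were during training.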

10s video reconstruction (256x)

18s video reconstruction (196x)

TODO LIST

Set up the codebase and train an unconditional model on landscape dataset

Setup repo-structure.
Add Video-VQGAN model, which is borrowed from VideoGPT.
Support variable aspect ratios, resolutions, durations training on DiT.
Support dynamic mask input inspired by FiT.
Add class-conditioning on embeddings.
Incorporating Latte as main codebase.
Add VAE model, which is borrowed from Stable Diffusion.
Joint dynamic mask input with VAE.
Make the codebase ready for the cluster training. Add SLURM scripts.
Add sampling script.
Incorporating SiT.

Train models that boost resolution and duration

Add PI to support out-of-domain size.
Add frame interpolation model.

Conduct text2video experiments on landscape dataset.

Finish data loading, pre-processing utils.
Add CLIP and T5 support.
Add text2image training script.
Add prompt captioner.

Train the 1080p model on video2text dataset

Looking for a suitable dataset, welcome to discuss and recommend.
Finish data loading, pre-processing utils.
Support memory friendly training.

Add flash-attention2 from pytorch.
Add xformers.
Add accelerate to automatically manage training, e.g. mixed precision training.
Add gradient checkpoint.
Train using the deepspeed engine.

Control model with more condition

Load pretrained weight from PixArt-α.
Incorporating ControlNet.

Repository structure

├── README.md
├── docs
│   ├── Data.md                    -> Datasets description.
│   ├── Contribution_Guidelines.md -> Contribution guidelines description.
├── scripts                        -> All training scripts.
│   └── train.sh
├── sora
│   ├── dataset                    -> Dataset code to read videos
│   ├── models
│   │   ├── captioner
│   │   ├── super_resolution
│   ├── modules
│   │   ├── ae                     -> compress videos to latents
│   │   │   ├── vqvae
│   │   │   ├── vae
│   │   ├── diffusion              -> denoise latents
│   │   │   ├── dit
│   │   │   ├── unet
│   ├── utils.py
│   ├── train.py                   -> Training code