Bidhan Roy committed
Commit 0b94e29 · Parent(s): 73872ef

Add README updates and images with Git LFS

Browse files
- .gitattributes +1 -0
- README.md +157 -67
- bagel_labs_logo.png +3 -0
- generated_images.png +3 -0
- training_architecture.png +3 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
README.md
CHANGED
@@ -6,108 +6,198 @@ tags:
 - multi-expert
 - dit
 - laion
 ---
-- **Router**: dit-based routing network
-- **Hidden Size**: 1152
-- **Layers**: 28
-- **Attention Heads**: 16
-- **Parameters per Expert**: ~0M
-- **Total Parameters**: ~3M
-- **Text Conditioning**: ✓ (CLIP ViT-L/14)
-- **Training Dataset**: LAION-Aesthetic
 ```python
-from
 # Load the pipeline
-pipeline =
 # Generate images
 images = pipeline(
     prompt="A beautiful sunset over Paris, oil painting style",
     num_inference_steps=50,
     guidance_scale=7.5,
-for i, img in enumerate(images):
-    img.save(f"output_{i}.png")
 ```
-- **Batch Size**: 16 per expert
-- **Learning Rate**: 2e-05
-- **Image Size**: 256x256 (32x32 latent space)
-- **VAE**: SD VAE (8x downsampling)
-- **Text Encoder**: CLIP ViT-L/14
-- **EMA**: True
-- **Mixed Precision**: True
-- The router network analyzes the noisy latent and timestep
-- Selects the most appropriate expert for denoising
-- Enables better quality and diversity compared to single models
-- Best results at 256x256 resolution
-- Requires GPU for inference (8GB+ VRAM recommended)
 ```bibtex
-@misc{
-year
-publisher
-url
 }
 ```
- multi-expert
- dit
- laion
- distributed
- decentralized
- flow-matching
---

<div align="center">

<img src="bagel_labs_logo.png" alt="Bagel Labs" width="120"/>

# Paris: A Decentralized Trained Open-Weight Diffusion Model

<a href="https://huggingface.co/bageldotcom/paris">
  <img src="https://img.shields.io/badge/%F0%9F%A4%97%20Like%20this-model-yellow?style=for-the-badge" alt="Like on Hugging Face">
</a>
<a href="https://github.com/bageldotcom/paris">
  <img src="https://img.shields.io/github/stars/bageldotcom/paris?style=for-the-badge&logo=github&label=Star%20on%20GitHub" alt="Star on GitHub">
</a>
<a href="https://github.com/bageldotcom/Paris/blob/main/paper.pdf">
  <img src="https://img.shields.io/badge/📄%20Read-Technical%20Report-red?style=for-the-badge" alt="Read Technical Report">
</a>

</div>
<br>

Paris is the world's first diffusion model trained entirely through decentralized computation. It consists of 8 expert diffusion models (129M-605M parameters each) trained in complete isolation, with no gradient, parameter, or intermediate-activation synchronization, and it achieves higher parallelism efficiency than traditional methods while using 14× less data and 16× less compute than baselines. [Read our technical report](https://github.com/bageldotcom/Paris/blob/main/paper.pdf) to learn more.

# Key Characteristics

- 8 independently trained expert diffusion models (605M parameters each, 4.84B total)
- No gradient synchronization, parameter sharing, or activation exchange among nodes during training
- Lightweight transformer router (~158M parameters) for dynamic expert selection
- 11M LAION-Aesthetic images across 120 A40 GPU-days
- 14× less training data than prior decentralized baselines
- 16× less compute than prior decentralized baselines
- Competitive generation quality (FID 12.45)
- Open weights for research and commercial use under the MIT license
---

# Examples

![Generated Images](generated_images.png)

*Text-conditioned image generation samples using Paris across diverse prompts and visual styles*

---

# Architecture Details

| Component | Specification |
|-----------|--------------|
| **Model Scale** | DiT-XL/2 |
| **Parameters per Expert** | 605M |
| **Total Expert Parameters** | 4.84B (8 experts) |
| **Router Parameters** | ~158M |
| **Hidden Dimensions** | 1152 |
| **Transformer Layers** | 28 |
| **Attention Heads** | 16 |
| **Patch Size** | 2×2 (latent space) |
| **Latent Resolution** | 32×32×4 |
| **Image Resolution** | 256×256 |
| **Text Conditioning** | CLIP ViT-L/14 |
| **VAE** | sd-vae-ft-mse (8× downsampling) |
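To make these dimensions concrete, here is a minimal configuration sketch for a single DiT-XL/2 expert. The `ExpertConfig` name and its fields are illustrative only, not the repository's actual API:

```python
from dataclasses import dataclass

@dataclass
class ExpertConfig:
    # Illustrative field names for the DiT-XL/2 dimensions listed above
    input_size: int = 32        # 32×32 latent grid (256px image / 8× VAE downsampling)
    in_channels: int = 4        # VAE latent channels
    patch_size: int = 2         # 2×2 latent patches
    hidden_size: int = 1152
    depth: int = 28             # transformer layers
    num_heads: int = 16
    text_embed_dim: int = 768   # CLIP ViT-L/14 text features

cfg = ExpertConfig()
num_tokens = (cfg.input_size // cfg.patch_size) ** 2
print(num_tokens)  # 256 tokens per 256×256 image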
---

# Training Approach

Paris implements fully decentralized training in which:

- Each expert trains independently on a semantically coherent data partition (DINOv2-based clustering; see the sketch below)
- No gradients, parameters, or activations are exchanged between experts during training
- Experts train asynchronously, at different speeds, across AWS, GCP, local clusters, and Runpod instances
- The router is trained post hoc on the full dataset for expert selection during inference
- Complete computational independence eliminates the need for specialized interconnects (InfiniBand, NVLink)
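A minimal sketch of the partitioning step described above, assuming DINOv2 image embeddings clustered with k-means into 8 groups; the `facebook/dinov2-base` checkpoint, `dataset_images`, and helper names are illustrative, not the exact pipeline used:

```python
import numpy as np
import torch
from transformers import AutoImageProcessor, AutoModel
from sklearn.cluster import KMeans

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
dinov2 = AutoModel.from_pretrained("facebook/dinov2-base").eval()

@torch.no_grad()
def embed(pil_images):
    """Return one DINOv2 CLS embedding per image."""
    inputs = processor(images=pil_images, return_tensors="pt")
    return dinov2(**inputs).last_hidden_state[:, 0].numpy()

# `dataset_images` stands in for batches of LAION-Aesthetic images (PIL).
features = np.concatenate([embed(batch) for batch in dataset_images])
labels = KMeans(n_clusters=8, random_state=0).fit_predict(features)

# Expert k trains only on images with labels == k; the same labels later serve
# as the router's cross-entropy targets.
```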
![Training Architecture](training_architecture.png)

*Paris training phase showing complete asynchronous isolation across heterogeneous compute clusters. Unlike traditional parallelization strategies (Data/Pipeline/Model Parallelism), Paris requires zero communication during training.*

This zero-communication approach enables training on fragmented compute resources without specialized interconnects, eliminating the dedicated GPU cluster requirement of traditional diffusion model training.

**Comparison with Traditional Parallelization**

| **Strategy** | **Synchronization** | **Straggler Impact** | **Topology Requirements** |
|--------------|---------------------|---------------------|---------------------------|
| Data Parallel | Periodic all-reduce | Slowest worker blocks iteration | Latency-sensitive cluster |
| Model Parallel | Sequential layer transfers | Slowest layer blocks pipeline | Linear pipeline |
| Pipeline Parallel | Stage-to-stage per microbatch | Bubble overhead from slowest stage | Linear pipeline |
| **Paris** | **No synchronization** | **No blocking** | **Arbitrary** |

---

# Usage

```python
from diffusers import DiffusionPipeline
import torch

# Load the pipeline
pipeline = DiffusionPipeline.from_pretrained(
    "bageldotcom/paris",
    torch_dtype=torch.float16,
    use_safetensors=True
)
pipeline.to("cuda")

# Generate images
images = pipeline(
    prompt="A beautiful sunset over Paris, oil painting style",
    num_inference_steps=50,
    guidance_scale=7.5,
    height=256,
    width=256
).images

images[0].save("output.png")
```

### Routing Strategies

- **`top-1`** (default): Single best expert per step. Fastest inference, competitive quality.
- **`top-2`**: Weighted ensemble of the top-2 experts. Often the best quality, at 2× inference cost.
- **`full-ensemble`**: All 8 experts weighted by the router. Highest compute (8× cost). See the sketch below for how the weighting works.
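A minimal sketch of how top-k routing can combine expert predictions at a denoising step, assuming the router returns per-expert logits for the current noisy latent and timestep; function and variable names are illustrative, not the repository's API:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def routed_prediction(router, experts, x_t, t, text_emb, k=2):
    """Combine the top-k experts' predictions, weighted by router probabilities."""
    logits = router(x_t, t, text_emb)            # (num_experts,) for a single sample
    probs = F.softmax(logits, dim=-1)
    weights, idx = probs.topk(k)                 # k most likely experts for this latent
    weights = weights / weights.sum()            # renormalize over the selected experts

    pred = torch.zeros_like(x_t)
    for w, i in zip(weights, idx):
        pred += w * experts[int(i)](x_t, t, text_emb)  # weighted noise/velocity estimate
    return pred                                  # k=1 -> top-1, k=8 -> full ensemble
```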
|
---

# Performance Metrics

**Multi-Expert vs. Monolithic on LAION-Art (DiT-B/2)**
| **Inference Strategy** | **FID-50K ↓** |
|------------------------|---------------|
| Monolithic (single model) | 29.64 |
| Paris Top-1 | 30.60 |
| **Paris Top-2** | **22.60** |
| Paris Full Ensemble | 47.89 |

*Top-2 routing achieves a 7.04-point FID improvement over the monolithic baseline, validating that targeted expert collaboration outperforms both single models and naive ensemble averaging.*
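For context on the metric, here is a rough sketch of how an FID score like those above is typically computed with `torchmetrics`; it is illustrative only, not the evaluation harness used for this table, and `real_loader`/`generated_loader` are placeholders:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# FID compares Inception-feature statistics of a real and a generated image set.
fid = FrechetInceptionDistance(feature=2048)

for real_batch in real_loader:          # uint8 tensors, shape (N, 3, H, W)
    fid.update(real_batch, real=True)
for fake_batch in generated_loader:     # e.g. 50k samples from the model
    fid.update(fake_batch, real=False)

print(f"FID: {fid.compute().item():.2f}")
```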
---

# Training Details

**Hyperparameters (DiT-XL/2)**
| **Parameter** | **Value** |
|---------------|-----------|
| Dataset | LAION-Aesthetic (11M images) |
| Clustering | DINOv2 semantic features |
| Batch Size | 16 per expert (effective 32 with 2-step accumulation) |
| Learning Rate | 2e-5 (AdamW, no scheduling) |
| Training Steps | ~120k total across experts (asynchronous) |
| EMA Decay | 0.9999 |
| Mixed Precision | FP16 with automatic loss scaling |
| Initialization | ImageNet-pretrained DiT-XL/2 |
| Conditioning | AdaLN-Single (23% parameter reduction) |
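Putting a few of these rows together, a condensed sketch of what one expert's optimization step could look like (AdamW at 2e-5, FP16 loss scaling, 2-step gradient accumulation, EMA at 0.9999). This only illustrates the listed hyperparameters; `expert`, `partition_loader`, and `diffusion_loss` are placeholders, not the project's training script:

```python
import torch

optimizer = torch.optim.AdamW(expert.parameters(), lr=2e-5)    # no LR scheduling
scaler = torch.cuda.amp.GradScaler()                           # automatic loss scaling
ema_params = [p.detach().clone() for p in expert.parameters()] # EMA copy of weights
ACCUM, EMA_DECAY = 2, 0.9999

for step, (latents, text_emb) in enumerate(partition_loader):  # this expert's cluster only
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = diffusion_loss(expert, latents, text_emb) / ACCUM
    scaler.scale(loss).backward()

    if (step + 1) % ACCUM == 0:                                # effective batch size 32
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
        with torch.no_grad():                                  # update EMA weights
            for p, p_ema in zip(expert.parameters(), ema_params):
                p_ema.mul_(EMA_DECAY).add_(p, alpha=1 - EMA_DECAY)
```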
**Router Training**

| **Parameter** | **Value** |
|---------------|-----------|
| Architecture | DiT-B (smaller than experts) |
| Batch Size | 64 with 4-step accumulation (effective 256) |
| Learning Rate | 5e-5 with cosine annealing (25 epochs) |
| Loss | Cross-entropy on cluster assignments |
| Training | Post-hoc on full dataset |
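Since the router is trained with cross-entropy against the cluster assignments, the post-hoc phase might look roughly like the following sketch; the `router` model and `full_dataset_loader` are assumptions for illustration, and the 4-step gradient accumulation is omitted for brevity:

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(router.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=25)  # 25 epochs

for epoch in range(25):
    for latents, timesteps, text_emb, cluster_id in full_dataset_loader:
        logits = router(latents, timesteps, text_emb)   # (batch, 8) expert scores
        loss = F.cross_entropy(logits, cluster_id)      # target = DINOv2 cluster label
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
    scheduler.step()                                    # cosine annealing per epoch
```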
---

# Citation
```bibtex
@misc{paris2025,
  title={Paris: A Decentralized Trained Open-Weight Diffusion Model},
  author={Jiang, Zhiying and Seraj, Raihan and Villagra, Marcos and Roy, Bidhan},
  year={2025},
  publisher={Bagel Labs},
  url={https://huggingface.co/bageldotcom/paris}
}
```
---

# License

MIT License – Open for research and commercial use.

<div align="center">

Made with ❤️ by [Bagel Labs](https://bagel.com)

</div>
bagel_labs_logo.png ADDED (Git LFS)
generated_images.png ADDED (Git LFS)
training_architecture.png ADDED (Git LFS)