Originally published at LinkedIn Pulse.
I recently came across an interesting paper from Google (GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding), which presents their work on scaling giant language translation models (600B parameters trained on 2048 TPU v3 cores).
I liked this paper because it not only describes the system and model innovations for distributed training, but also discusses how to make it easy for users to develop distributed model training programs. I believe this is a critical but often overlooked problem; existing approaches usually require