Originally published at LinkedIn Pulse.
I recently came across an interesting paper from Google (GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding), which presents their work on scaling giant language translation models (600B parameters trained on 2048 TPU v3 cores).
I liked this paper because it not only describes the system and model innovations for distributed training, but also discusses how to make it easy for users to develop distributed model training programs. I believe this is a critical but often overlooked problem; existing approaches usually require