LW - Scaling of AI training runs will slow down after GPT-5 by Maxime Riché

Welcome to The Nonlinear Library, where we use text-to-speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Scaling of AI training runs will slow down after GPT-5, published by Maxime Riché on April 26, 2024 on LessWrong.

My credence: 33% confidence in the claim that the growth in the number of GPUs used for training SOTA AI will slow down significantly directly after GPT-5. It is not higher because (1) decentralized training may be possible, (2) GPT-5 may increase hardware efficiency significantly, (3) GPT-5 may be smaller than assumed in this post, and (4) race dynamics.

TLDR: because of a bottleneck in the energy available to data centers and the need to build data centers that are OOMs larger.

Update: See Vladimir_Nesov's comment below for why this claim is likely wrong, since decentralized training seems to be solved.

The reasoning behind the claim: Current large data centers consume around 100 MW of power, while a single nuclear power plant generates 1 GW. The largest data center seems to consume around 150 MW. An A100 GPU uses 250 W, or around 1 kW with overhead; a B200 GPU uses ~1 kW without overhead. Thus a 1 MW data center can support at most 1k to 2k GPUs. GPT-4 used something like 15k to 25k GPUs to train, thus around 15 to 25 MW, and large data centers are around 10-100 MW. This is likely one of the reasons why top AI labs are mostly still using roughly GPT-4 levels of FLOPs to train new models.

GPT-5 will mark the end of the fast scaling of training runs. A 10-fold increase in the number of GPUs above GPT-5 would require a 1 to 2.5 GW data center, which doesn't exist and would take years to build, OR would require decentralized training across several data centers (a rough sketch of this arithmetic appears below). Thus GPT-5 is expected to mark a significant slowdown in the scaling of training runs. The power consumption required to continue scaling at the current rate is becoming unsustainable, as it would require the equivalent of multiple nuclear power plants. I think this is basically what Sam Altman, Elon Musk and Mark Zuckerberg are saying in public interviews.

The main focus for increasing capabilities will once again be on improving software efficiency. In the next few years, investment will also focus on scaling at inference time and on decentralized training across several data centers. If GPT-5 doesn't unlock research capabilities, then after GPT-5 the scaling of capabilities will slow down for some time towards historical rates, with most gains coming from software improvements, a bit from hardware improvements, and significantly less than currently from scaling spending. Scaling GPUs will be slowed down by regulations on land, energy production, and build time. Training data centers may be located and built in low-regulation countries, e.g. the Middle East, for cheap land, fast construction, low regulation, and cheap energy, which may explain some talks with Middle East investors.

Unrelated to the claim: Hopefully, GPT-5 is still insufficient for self-improvement: Research involves pretty long-horizon tasks that may require several OOMs more compute. More accurate world models may be necessary for longer-horizon tasks and especially for research (hopefully requiring the use of compute-inefficient real, non-noisy data, e.g. real video). "Hopefully", moving above human level requires RL. "Hopefully", RL training to finetune agents is still several OOMs less efficient than pretraining and/or is currently too noisy to improve the world model (this is different from simply shaping propensities) and doesn't work in the end.
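To make the power arithmetic in the reasoning above concrete, here is a minimal Python sketch. The ~1 kW per GPU figure (with overhead) is taken from the post; the GPT-5 cluster size of roughly 100k-250k GPUs is an assumption inferred from the quoted "1 to 2.5 GW for a 10-fold increase" and is not stated in the episode.

```python
# Rough check of the power arithmetic above. The ~1 kW per GPU (with overhead)
# figure comes from the post; the GPT-5 cluster size of ~100k-250k GPUs is an
# assumption inferred from the quoted "1 to 2.5 GW for a 10x increase".

WATTS_PER_GPU = 1_000  # ~1 kW per GPU including overhead


def cluster_power_mw(num_gpus: int, watts_per_gpu: float = WATTS_PER_GPU) -> float:
    """Power draw of a training cluster, in megawatts."""
    return num_gpus * watts_per_gpu / 1e6


# GPT-4-scale run: ~15k-25k GPUs -> ~15-25 MW, i.e. one large data center.
print(cluster_power_mw(15_000), cluster_power_mw(25_000))  # 15.0 25.0

# Hypothetical GPT-5-scale run (assumed 100k-250k GPUs), then a further 10x:
for gpus in (100_000, 250_000):
    mw = cluster_power_mw(gpus)
    gw_10x = cluster_power_mw(10 * gpus) / 1_000
    print(f"{gpus:,} GPUs -> {mw:.0f} MW; 10x more -> {gw_10x:.1f} GW")
# -> 100 MW and 250 MW today; a 10x jump needs 1.0-2.5 GW,
#    i.e. on the order of one to a few nuclear power plants.
```

Under these assumptions, a GPT-5-scale cluster already sits at the upper end of today's 10-100 MW data centers, which is the crux of the slowdown claim.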
Guessing that GPT-5 will be at expert human level on short-horizon tasks but not on long-horizon tasks nor on doing research (improving SOTA), and that we can't scale as fast as currently above that, how big is that effect going to be? Using values from https://epochai.org/blog/the-longest-training-run, we have estimates that in a year, effective compute is increased by:
Software efficiency: x1.7/year (1 OOM in 3.9 years)
Hardware efficiency: x1.3/year ...
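As a minimal sketch of how such yearly multipliers compose into effective-compute growth, the snippet below uses only the two factors quoted above; the list is truncated here, so the combined rate omits whatever factors follow in the original post.

```python
# Compose the quoted yearly multipliers into an effective-compute growth rate.
# Only software (x1.7/year) and hardware (x1.3/year) efficiency are used; the
# episode's list is truncated ("..."), so further factors are omitted.
import math

software_per_year = 1.7  # x1.7/year software efficiency
hardware_per_year = 1.3  # x1.3/year hardware efficiency

combined = software_per_year * hardware_per_year  # multipliers compose multiplicatively
years_per_oom = 1 / math.log10(combined)          # years to gain one order of magnitude

print(f"Software x hardware: x{combined:.2f}/year, i.e. 1 OOM in ~{years_per_oom:.1f} years")
# -> x2.21/year, roughly 1 OOM in ~2.9 years from these two factors alone
```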