DeepSeek V4: the world's largest open-source model, at a price that breaks the market
This article summarizes information originally published on github.com. For the full context, the original statements, and the details not included here, see the source indicated.
Context
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token.
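To put the two parameter counts in proportion, here is a minimal arithmetic sketch. The figures come from the sentence above; the function name is illustrative only.

```python
# Parameter counts quoted in the text above.
TOTAL_PARAMS = 671e9
ACTIVE_PARAMS = 37e9


def active_fraction(active, total):
    """Share of the model's parameters used for a single token."""
    return active / total


print(f"{active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.1%}")  # prints "5.5%"
```

In other words, roughly one parameter in eighteen takes part in processing any given token, which is where the inference savings of a sparse MoE design come from.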
What happened
To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2.
Details
Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance.
We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities.
Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models.
Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training.
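The GPU-hour figure is easier to judge once converted into money and calendar time. The sketch below does that conversion. The rental price and the cluster size are assumptions chosen for illustration and are not stated in this article; only the 2.788M GPU hours come from the text.

```python
# GPU hours quoted in the text above.
GPU_HOURS = 2_788_000

# Assumptions for illustration only (not from this article).
ASSUMED_USD_PER_GPU_HOUR = 2.0
ASSUMED_CLUSTER_SIZE = 2048


def training_cost_usd(gpu_hours, usd_per_gpu_hour):
    """Total rental cost for the given number of GPU hours."""
    return gpu_hours * usd_per_gpu_hour


def wall_clock_days(gpu_hours, num_gpus):
    """Calendar days if the work is spread evenly over num_gpus GPUs."""
    return gpu_hours / num_gpus / 24


cost = training_cost_usd(GPU_HOURS, ASSUMED_USD_PER_GPU_HOUR)
days = wall_clock_days(GPU_HOURS, ASSUMED_CLUSTER_SIZE)
print(f"${cost:,.0f}")      # prints "$5,576,000"
print(f"{days:.1f} days")   # prints "56.7 days"
```

Changing either assumed constant rescales the result linearly, so the sketch is meant for quick what-if comparisons rather than as a statement of actual spend.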
Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks.
On top of the efficient architecture of DeepSeek-V2, we pioneer an auxiliary-loss-free strategy for load balancing, which minimizes the performance degradation that arises from encouraging load balancing.
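One way to balance expert load without an auxiliary loss is to add a per-expert bias to the routing scores and nudge that bias after each step according to the observed load. The toy simulation below sketches that idea under stated assumptions: the scores are random numbers rather than learned affinities, the bias affects only which experts are selected (not the gate weights), and every name and constant is hypothetical. It is not the DeepSeek implementation.

```python
import random


def route_top_k(scores, bias, k):
    """Select k experts by biased score; gate weights use the raw scores."""
    ranked = sorted(range(len(scores)),
                    key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:k]
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}


def update_bias(bias, load, step_size):
    """Lower the bias of overloaded experts, raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            b -= step_size
        elif n < mean_load:
            b += step_size
        new_bias.append(b)
    return new_bias


def simulate(num_experts=8, k=2, tokens_per_step=512, steps=200,
             step_size=0.01, seed=0):
    """Return (load in the final step, final bias) for a toy router."""
    rng = random.Random(seed)
    # Skewed base scores: higher-indexed experts are favoured by default.
    skew = [0.1 * e for e in range(num_experts)]
    bias = [0.0] * num_experts
    load = [0] * num_experts
    for _ in range(steps):
        load = [0] * num_experts
        for _ in range(tokens_per_step):
            scores = [0.01 + rng.random() + skew[e]
                      for e in range(num_experts)]
            for e in route_top_k(scores, bias, k):
                load[e] += 1
        bias = update_bias(bias, load, step_size)
    return load, bias


balanced_load, _ = simulate(step_size=0.01)
unbalanced_load, _ = simulate(step_size=0.0)  # bias never changes
print("with bias updates:   ", balanced_load)
print("without bias updates:", unbalanced_load)
```

Because the bias only changes which experts are picked, no extra gradient term competes with the language-modeling loss, which is the intuition behind the claim that this approach limits the performance cost of balancing.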
We investigate a Multi-Token Prediction (MTP) objective and prove it beneficial to model performance.
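To make the multi-token idea concrete, the sketch below builds training targets in which each position is asked to predict several upcoming tokens instead of just the next one, and then combines the per-depth losses. The helper names, the averaging of the extra losses, and the weight are assumptions for illustration; the source does not spell out these details.

```python
def mtp_targets(tokens, depth):
    """Pair each position with the next `depth` tokens as targets.

    The first target is the usual next-token target; the remaining
    ones are the additional multi-token targets.
    """
    examples = []
    for i in range(len(tokens) - depth):
        examples.append((tokens[i], tokens[i + 1:i + 1 + depth]))
    return examples


def combined_loss(next_token_loss, extra_losses, weight):
    """Next-token loss plus a weighted mean of the extra-depth losses.

    `weight` is a hypothetical hyperparameter for this sketch.
    """
    if not extra_losses:
        return next_token_loss
    return next_token_loss + weight * sum(extra_losses) / len(extra_losses)


for current, targets in mtp_targets(["the", "cat", "sat", "on", "mat"], 2):
    print(current, "->", targets)
```

With a depth of 2, the final two positions are dropped because they lack enough following tokens to fill both targets.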
We design an FP8 mixed precision training framework and, for the first time, validate the feasibility and effectiveness of FP8 training on an extremely large-scale model.
Original source
Read the full article at github.com.
If you work in this field and want to share your perspective, write to me at [email protected].