In January 2025, DeepSeek launched an LLM, R1, that the startup claimed is on par with OpenAI's o1. The app topped the App Store charts in record time, capturing the imagination of not just the tech industry but the wider world too. One claim was particularly startling: that the model had been trained for under $6m (as opposed to the $100m reportedly spent on GPT-4 by OpenAI). This created waves in the stock market and the news media.
听
While the media and financial storm is hard to avoid, what's more interesting to us at Thoughtworks is to try to understand what DeepSeek did, exactly, and how. We'll start with key facts and first impressions, then explore what we know so far from DeepSeek's papers about the model architecture, training data, approach to evals and the techniques used. Finally, we'll look at attempts to reproduce their results and what comes next.
听
DeepSeek's new models: What are the facts?
听
Who built it?
听
Founded in May 2023, DeepSeek is a Chinese AI startup based in Hangzhou and Beijing. It's backed by the Chinese hedge fund High-Flyer, which is located in Hangzhou. Both High-Flyer and DeepSeek were founded by Liang Wenfeng.
听
On January 10, 2025, DeepSeek released their mobile app; on January 20, 2025, the company released the R1 and R1-Zero models along with an accompanying technical report.
听
What did DeepSeek actually build?
听
DeepSeek built two types of models, plus apps to use them. The latest versions of the two types are V3 and R1. V3, as the name suggests, is version three of a general-purpose language model; R1 is a reasoning model built on top of V3-Base. They also provide distilled versions of their models so they can fit on your laptop.
听
There are two flavors of the distilled models. Some are built on Llama (Meta's open-weight model) and some on Qwen (Alibaba's open-weight model).
听
While they have released weights for the R1 model and the code to run it for inference, they have not released any training code, nor the code for the optimizations they made closer to the hardware.
Our first impressions of using DeepSeek
听
A few of us at Thoughtworks used DeepSeek via the company's website, while others used Ollama to get an R1 distilled model running on their laptops. We then spent some time using the model as you would other models, for tasks ranging from coding to reasoning questions.
听
Based on our experience in recent days, here are some of our initial impressions and thoughts:
听
Multilingual performance showed good results in English and Mandarin but was less smooth in French, with unintended Chinese or Arabic characters appearing and occasional reversion to English during complex reasoning.
Its reasoning style can be overly verbose; sometimes it went in circles when we were using it.
We'd like to learn more about how DeepSeek is thinking about security and privacy, particularly from a user perspective.
Distilled models are available in various sizes and can run on a range of consumer-grade hardware, including relatively low-powered, energy-efficient machines.
The hosted version appears to have guardrails aligned with the Chinese government's worldview. The model itself may reflect perspectives consistent with that alignment.
We don't have visibility into the data that was used to train it (although it's important to note the same is true of Llama and of OpenAI's and Anthropic's models). This makes some governments and enterprises nervous.
How can you use DeepSeek?
听
You can try out the model on the DeepSeek website or mobile app. Alternatively, you can use ollama run deepseek-r1:32b to run a distilled version of the model locally. To do that, you'll need to install Ollama first.
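If you're running the distilled model locally, you can also call it programmatically. Here's a minimal sketch (not from DeepSeek's documentation) that queries a local Ollama instance over its HTTP API; it assumes Ollama is serving on its default port, 11434, and that you've already pulled deepseek-r1:32b:

```python
import requests

# Query a locally running Ollama instance serving the deepseek-r1:32b distill.
# Assumes Ollama is listening on its default port (11434) and the model is pulled.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:32b",
        "prompt": "Explain the difference between supervised fine-tuning and RL post-training.",
        "stream": False,  # return a single JSON object instead of a token stream
    },
    timeout=300,
)
print(response.json()["response"])
```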
听
Cloud service providers have also jumped on this: you can deploy DeepSeek models through several major providers' model catalogs, or host them yourself.
听
DeepSeek's models are certainly intriguing enough to consider adding to your AI platform's toolbox along with other open-weight models, as application builders will want to experiment with or use different models for different purposes.
听
Can we trust DeepSeek's reported performance results?
听
DeepSeek's results have not yet been reproduced. We are closely following Hugging Face's Open R1 effort to reproduce them.
听
We would also like to understand whether the model was exposed to the benchmark data during training, and whether the eval methods used in the paper are appropriate.
听
That said, we do not have any specific reason to believe that the results are not real.
听
One thing that has made waves is the 2.788M GPU hours (estimated at $5.576m) reported for training (see the first table in the V3 technical report). The V3 paper makes the assumptions behind this price point clear, but also offers caveats, noting it only represents the final training run. Given how quickly the industry has leapt at covering this particular family of models, we suspect this number has been presented out of context many times in that coverage.
DeepSeek's technical components
听
R1 was trained using a combination of SFT and RL on top of V3-Base. These are transformer models that have been heavily optimized for specific hardware and software frameworks, based on the limitations imposed by their environment (specifically the US export controls on advanced GPUs). DeepSeek has also combined new and old techniques in some interesting ways. Let's start by looking at V3-Base.
听
V3-Base uses a mixture-of-experts approach, similar in spirit to Mixtral but with finer-grained and shared experts. V3-Base has 671B total parameters, of which only a fraction (37B) are activated per token, while Llama 3.1's largest version has 405B parameters, all of which are used for every token. DeepSeek trained V3-Base in FP8 mixed precision, and Meta offers an FP8-quantized variant of Llama 3.1 405B. V3-Base was trained on 14.8T tokens while Llama was trained on 15T. Both support a 128K context window.
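To make the mixture-of-experts idea concrete, here's a deliberately simplified, illustrative sketch of top-k expert routing. The dimensions, number of experts and gating below are invented and far simpler than DeepSeek's actual MoE layer; the point is only that each token touches a small subset of the total parameters:

```python
import numpy as np

def moe_layer(x, experts, gate_w, top_k=2):
    """Toy top-k mixture-of-experts routing for a single token vector x.

    Only the top_k experts selected by the gate are evaluated, which is why
    an MoE model can have huge total parameters but modest per-token compute.
    """
    scores = x @ gate_w                      # one gating score per expert
    top = np.argsort(scores)[-top_k:]        # indices of the selected experts
    weights = np.exp(scores[top])
    weights = weights / weights.sum()        # softmax over the selected experts only
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Toy setup: 8 experts, each a small linear map over a 16-dimensional hidden state.
rng = np.random.default_rng(0)
experts = [lambda x, W=rng.normal(size=(16, 16)): x @ W for _ in range(8)]
gate_w = rng.normal(size=(16, 8))

out = moe_layer(rng.normal(size=16), experts, gate_w)
print(out.shape)  # (16,) -- same shape as the input token vector
```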
听
The key difference is that the V3 paper mentions they used only 2.788M GPU hours, while Meta's Llama 3.1 documentation reports 39.3M cumulative GPU hours. Here's where the nuance lies: from what we understand, the 2.788M GPU hours used to train V3 cover only the final full training run, while the number Llama reports is cumulative. The details of exactly how to parse the wording will come out eventually, but for now we are still unclear whether there is a like-for-like comparison to be had here. As an example, V3 was trained on some data that was generated by a then-unreleased R1; should the training costs calculated for V3 therefore include the training costs of R1?
听
R1 was built on V3-Base using supervised fine-tuning (SFT) as well as reinforcement learning (RL) to build reasoning into the model. R1 uses the long chain-of-thought pattern to perform reasoning, and was then further distilled into smaller dense models, released in both Llama- and Qwen-based versions. They have also released R1-Zero, which skips the SFT stage and has some limitations, such as poor readability and language mixing, though it shows some intriguing reasoning behaviors. These limitations mean R1-Zero is probably more interesting for researchers than for users. To overcome them in R1, DeepSeek applied multi-stage training as well as cold-start data before RL.
听
V3 was then built by using data created through R1's reasoning, verification and reflection patterns to further improve on V3-Base, resulting in a more well-rounded model, V3.
听
All these models were trained using NVIDIA H800 GPUs. These are versions of the H100 made for the Chinese market and are, as mentioned earlier, limited in ways designed to comply with US export controls. Specifically, H800 chips have half the chip-to-chip interconnect bandwidth of H100s (around 400GB/s vs 900GB/s over NVLink).
听
The cost of training R1 is being widely misreported. We know the numbers circulating are wrong, but it's unclear how wrong they are. The much-quoted calculation comes from the V3 technical report and is the cost of training DeepSeek-V3. CNN qualified its reporting by saying the cost was for the base model; however, that doesn't help people understand the difference between the two.
听
R1 was trained on top of V3-Base, so the cumulative cost of training R1 would certainly be more than training the base model alone. The numbers in table one of the V3 technical report appear to cover one full run, likely the final full training run. If you were to try to replicate the training process, you'd probably need more than one full training run.
听
There are also conflicting reports suggesting DeepSeek has access to something on the order of 50,000 GPUs, which is more in line with what OpenAI is supposed to have used to train its frontier models.
If you were to rent 50,000 A100 GPUs in the US today, you'd probably pay about $1.35/GPU-hour (if you could find that many available). That's about $11.34m per week. In DeepSeek's case, it appears they could have used GPUs that their hedge fund backer, High-Flyer, had acquired for high-frequency trading purposes.
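For what it's worth, the back-of-envelope arithmetic behind both of those numbers is simple enough to spell out. The $2/GPU-hour rental assumption is the one the V3 paper uses; the $1.35/GPU-hour A100 rate is our own rough estimate above:

```python
# DeepSeek's headline figure: the final V3 training run.
gpu_hours = 2.788e6            # H800 GPU hours reported in the V3 technical report
assumed_rate = 2.00            # USD per GPU hour, the rental assumption in the paper
print(gpu_hours * assumed_rate)        # 5_576_000.0, i.e. the much-quoted $5.576m

# Hypothetically renting 50,000 A100s in the US instead.
gpus = 50_000
rate = 1.35                    # USD per GPU hour (our rough estimate)
hours_per_week = 24 * 7
print(gpus * rate * hours_per_week)    # 11_340_000.0, i.e. about $11.34m per week
```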
Diving deeper into what makes DeepSeek distinctive
听
There are a number of sophisticated ways in which DeepSeek modified the model architecture, training techniques and data to get the most out of the limited hardware available to them. Let's now look at these from the bottom up.
听
Optimizing for available hardware
听
There are two key limitations of the H800s DeepSeek had to use compared to H100s: first, they have half the GPU-to-GPU interconnect bandwidth of H100s; second, they have much less memory (80GB vs 188GB).
听
Interestingly, DeepSeek appears to have turned these limitations into an advantage. "[T]he economical training costs of DeepSeek-V3… [was] achieved through our optimized co-design of algorithms, frameworks, and hardware," DeepSeek's team wrote. In other words, they made decisions that would allow them to extract the most out of what they had available.
听
For example, they used FP8 mixed precision training to significantly reduce the amount of memory required. The V3 paper says "low-precision training has emerged as a promising solution for efficient training". However, prior to this work, FP8 was seen as efficient but less effective; DeepSeek demonstrated how it can be used effectively. "In this work, we introduce an FP8 mixed precision training framework and, for the first time, validate its effectiveness on an extremely large-scale model. Through the support for FP8 computation and storage, we achieve both accelerated training and reduced GPU memory usage."
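As a rough illustration of the memory side of this trade-off, here's a minimal PyTorch sketch of per-tensor FP8 (E4M3) storage with on-the-fly dequantization. This is not DeepSeek's framework, which performs genuine FP8 computation with fine-grained scaling; it only shows why one byte per weight instead of two matters, and it assumes a PyTorch build with float8_e4m3fn support:

```python
import torch

# A toy weight matrix in the usual BF16 training precision.
w_bf16 = torch.randn(4096, 4096, dtype=torch.bfloat16)

# Per-tensor scale so the values fit into the FP8 E4M3 range (max magnitude ~448).
scale = w_bf16.abs().max().float() / 448.0
w_fp8 = (w_bf16.float() / scale).to(torch.float8_e4m3fn)   # 1 byte per element

# Dequantize on the fly when the weight is actually needed.
w_deq = w_fp8.to(torch.bfloat16) * scale.to(torch.bfloat16)

print(w_bf16.element_size(), w_fp8.element_size())  # 2 bytes vs 1 byte per element
print((w_bf16 - w_deq).abs().max())                 # the quantization error introduced
```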
听
They've further optimized for the constrained hardware at a very low level. The V3 paper also states: "we also develop efficient cross-node all-to-all communication kernels to fully utilize InfiniBand (IB) and NVLink bandwidths. Furthermore, we meticulously optimize the memory footprint, making it possible to train DeepSeek-V3 without using costly tensor parallelism. Combining these efforts, we achieve high training efficiency." This is some seriously deep work to get the most out of the hardware they were limited to.
听
Further, the paper talks about something we find particularly interesting: "As for the training framework, we design the DualPipe algorithm for efficient pipeline parallelism, which has fewer pipeline bubbles and hides most of the communication during training through computation-communication overlap. This overlap ensures that, as the model further scales up, as long as we maintain a constant computation-to-communication ratio, we can still employ fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead." The constant computation-to-communication ratio and near-zero all-to-all communication overhead are striking relative to "normal" ways of scaling distributed training, which typically just mean "add more hardware to the pile".
听
This is a clear case of necessity being the mother of invention.
听
The impact of reinforcement learning in post-training on benchmark performance
听
DeepSeek applied reinforcement learning with GRPO (group relative policy optimization) in V2 and V3. But, apparently, reinforcement learning had a big impact on the reasoning model, R1; its impact on benchmark performance is notable.
听
By using GRPO to apply the reward to the model, DeepSeek avoids using a large "critic" model; this again saves memory. However, GRPO takes a rules-based approach to rewards which, while it works well for problems that have an objective answer, such as coding and math, might struggle in domains where answers are subjective or variable. It will be interesting to track the trade-offs as more people use it in different contexts.
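The memory saving comes from where the baseline lives: instead of a learned critic, GRPO samples a group of responses per prompt, scores them (often with simple rules), and normalizes rewards within the group. Here's a minimal sketch of that advantage calculation, under the simplifying assumption that the reward is just answer correctness; real GRPO then feeds these advantages into a PPO-style clipped policy-gradient update with a KL penalty:

```python
import numpy as np

def group_relative_advantages(rewards):
    """Advantage of each sampled response relative to its own group.

    No value network ("critic") is needed: the group mean acts as the baseline.
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: 4 responses to one math prompt, rewarded 1.0 if the final answer is correct.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))  # correct answers get positive advantage
```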
听
Multi-head latent attention (MLA)
听
Multi-head latent attention is a variation on multi-head attention that was introduced by DeepSeek in their V2 paper. While previous efficient attention variants, such as multi-query and grouped-query attention, were considered a trade-off (you reduce model quality to get better scale in large model training), DeepSeek says that MLA not only allows scale, it also improves the model. We're looking forward to digging deeper into this.
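The easiest piece of MLA to see in isolation is the KV-cache compression: rather than caching full per-head keys and values for every token, you cache one much smaller latent vector per token and reconstruct keys and values from it with learned up-projections. Here's a toy sketch with invented dimensions; the real MLA also handles rotary position embeddings separately and folds some of these projections into the attention computation:

```python
import numpy as np

d_model, n_heads, d_head, d_latent = 1024, 16, 64, 128
rng = np.random.default_rng(0)

W_down = rng.normal(size=(d_model, d_latent)) * 0.02            # hidden state -> latent
W_up_k = rng.normal(size=(d_latent, n_heads * d_head)) * 0.02   # latent -> all heads' keys
W_up_v = rng.normal(size=(d_latent, n_heads * d_head)) * 0.02   # latent -> all heads' values

h = rng.normal(size=d_model)      # hidden state for one token

latent = h @ W_down               # this 128-dim vector is all we cache per token
k = (latent @ W_up_k).reshape(n_heads, d_head)  # reconstructed keys, per head
v = (latent @ W_up_v).reshape(n_heads, d_head)  # reconstructed values, per head

print(latent.size, n_heads * d_head * 2)  # 128 cached floats vs 2048 for full K and V
```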
听
Distillation vs reinforcement learning
听
The R1 paper has an interesting discussion about distillation vs reinforcement learning. The DeepSeek team writes that their work makes it possible to "draw two conclusions: First, distilling more powerful models into smaller ones yields excellent results, whereas smaller models relying on the large-scale RL mentioned in this paper require enormous computational power and may not even achieve the performance of distillation. Second, while distillation strategies are both economical and effective, advancing beyond the boundaries of intelligence may still require more powerful base models and larger-scale reinforcement learning."
听
The first conclusion is interesting and actually intuitive. The second is reassuring: they haven't, at least, completely upended our understanding of how deep learning works in terms of significant compute requirements.
听
What can we learn from what didn't work?
听
This is always interesting to read. What did DeepSeek try that didn't work?
听
First, using a process reward model (PRM) to guide reinforcement learning was untenable at scale. However, it could still be used for re-ranking the top-N responses (see the short sketch after the next point).
Second, Monte Carlo tree search (MCTS), which was used by AlphaGo and AlphaZero, doesn't scale to general reasoning tasks, because the problem space is not as "constrained" as chess or even Go. Remember when, before AlphaGo, the Go space was considered to be too complex to be computationally feasible? Now, it's "constrained".
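As promised above, here's what re-ranking with a reward model can look like in its simplest best-of-N form. The generate and score functions below are stand-ins we've made up for illustration; in the PRM setting, score would be a learned model judging the quality of intermediate reasoning steps:

```python
import random

def best_of_n(prompt, generate, score, n=8):
    """Sample n candidate responses and keep the one the reward model likes best.

    `generate` and `score` are stand-ins: any sampling function and any reward
    (e.g. a process reward model scoring intermediate reasoning steps) will do.
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy usage with dummy generate/score functions.
answers = ["42", "41", "forty-two", "I don't know"]
pick = best_of_n(
    "What is 6 x 7?",
    generate=lambda p: random.choice(answers),
    score=lambda ans: 1.0 if "42" in ans or "forty-two" in ans else 0.0,
)
print(pick)
```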
听
Other interesting things
听
There's obviously a huge amount of interesting stuff we could comment on. However, there are a couple of things worth pointing out:
听
Very impressive coding benchmark results
Post-training combined with scaling inference-time compute looks like a viable strategy for building a very effective model
What comes next?
听
Breaking the circularity of benchmarks and models
听
After every new and better model is released, we're left wondering if it was exposed to the benchmark data at training time. "Did it just study for the exam or did it actually learn the subject?"
听
This is because of the perverse circularity of benchmark datasets; it's a never-ending spiral of misleading hype. You create a good benchmark dataset, the next model games it to win and gets hype, then you need to create another "fair" benchmark… it adds value until the next model games it, and so on. Even the newest "definitive" benchmark will only be what it says it is until the next model is released.
听
To put it another way, when an LLM confidently generates the right answers on current benchmarks, that's great, provided the real data it's applied to has similar complexity. On the other hand, when the LLM fails on a newer benchmark (or on the domain it's applied to), it's usually because it's confident in answers that are wrong; the new benchmark data has complexity it didn't see at training time.
This cycle needs to stop, and we need better and more generic evaluation mechanisms and informative metrics that are not dependent on new benchmarks every few weeks. (We touched on this elsewhere.)
听
Replicating DeepSeek R1's results
听
One big thing left to do is to replicate R1's results and validate the findings independently. We're keenly following Hugging Face's Open R1 project as the open-source community attempts to reproduce the results. Reproducing them will need a few things:
听
- GPUs: 2,048 isn't a huge number, just as $5.5m isn't a big amount, per training run.
- Training code: DeepSeek has not open-sourced theirs.
- Training data: perhaps the biggest gap.
听
DeepSeek probably won't release their entire training dataset, just as OpenAI and Anthropic won't release theirs. As far as we can tell, DeepSeek hasn't released samples of the data used for the long chain-of-thought training, for example. So the intrepid open-source community has started creating its own reasoning datasets to fill the gap.
听
In the meantime, Berkeley researchers claim to have reproduced R1-Zero-style reasoning behavior in a small model for a tiny fraction of the cost.
听
...I think I need to take a nap.
Thanks to my colleagues Shayan Mohanty, Ben O'Mahony, Chris Kramer, Sahger Lad, Prathamesh Kalamkar, Lauris Jullien, Emily Gorcenski, Karrtik Iyer, Runyan Tan, Parag Mahajani and Andy Yates, who all contributed to this piece.