In January 2025, DeepSeek launched an LLM, R1, that the startup claimed is on par with OpenAI's o1. The app topped the App Store charts in record time, capturing the imagination of not just the tech industry but the wider world too. One claim was particularly startling: that the model had been trained for under $6m (as opposed to the $100m reportedly spent on GPT-4 by OpenAI). This created waves in the stock market and the news media.
听
While the media and financial storm is hard to avoid, what's more interesting to us at Thoughtworks is to try to understand what DeepSeek did, exactly, and how. We'll start with key facts and first impressions, then explore what we know so far from DeepSeek's papers about the model architecture, training data, approach to evals and the techniques used. Finally, we'll look at attempts to reproduce their results and what comes next.
听
DeepSeek's new models: What are the facts?
听
Who built it?
听
Founded in May 2023, DeepSeek is a Chinese AI startup based in Hangzhou and Beijing. It's backed by the Chinese hedge fund High-Flyer, which is located in Hangzhou. Both High-Flyer and DeepSeek were founded by Liang Wenfeng.
听
On January 10, 2025, DeepSeek released their mobile app; on January 20, 2025, the company released the R1 and R1-Zero models along with an accompanying technical report.
听
What did DeepSeek actually build?
听
DeepSeek built two types of models, plus apps to use them. The latest versions of the two types are V3 and R1. V3, as the name suggests, is version three of a general-purpose language model; R1 is a reasoning model built on top of V3-Base. They also provide distilled versions of their models so they can fit on your laptop.
听
There are two flavors of the distilled models. Some are built on Llama (Meta's open-weight model) and some on Qwen (Alibaba's open-weight model).
听
While they have released weights for the R1 model and the code to run it for inference, they have not released any training code, nor the code for the optimizations they made closer to the hardware.
Our first impressions of using DeepSeek
听
A few of us at Thoughtworks used DeepSeek via the company's website, while others used Ollama to get an R1 distilled model running on their laptops. We then spent some time using the model as you would other models, for tasks ranging from coding to reasoning questions.
听
Based on our experience in recent days, here are some of our initial impressions and thoughts:
听
Multilingual performance showed good results in English and Mandarin but was less smooth in French, with unintended Chinese or Arabic characters appearing and occasional reversion to English during complex reasoning.
Its reasoning style can be overly verbose; sometimes it went in circles when we were using it.
We'd like to learn more about how DeepSeek is thinking about security and privacy, particularly from a user perspective.
Distilled models are available in various sizes and can run on a range of consumer-grade hardware, including relatively low-powered, energy-efficient machines.
The hosted version appears to have guardrails aligned with the Chinese government's worldview. The model itself may reflect perspectives consistent with that alignment.
We don't have visibility into the data that was used to train it (although it's important to note the same is true of Llama and of OpenAI's and Anthropic's models). This makes some governments and enterprises nervous.
How can you use DeepSeek?
听
You can try out the model on the DeepSeek website or mobile app. Alternatively, you can use ollama run deepseek-r1:32b to run a distilled version of the model locally. To do that, you'll need to install Ollama first.
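If you're running the distilled model locally, you can also call it programmatically. Here's a minimal sketch (not from DeepSeek's documentation) that queries a local Ollama instance over its HTTP API; it assumes Ollama is serving on its default port, 11434, and that you've already pulled deepseek-r1:32b:

```python
import requests

# Query a locally running Ollama instance serving the deepseek-r1:32b distill.
# Assumes Ollama is listening on its default port (11434) and the model is pulled.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:32b",
        "prompt": "Explain the difference between supervised fine-tuning and RL post-training.",
        "stream": False,  # return a single JSON object instead of a token stream
    },
    timeout=300,
)
print(response.json()["response"])
```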
听
Cloud service providers have also jumped on this: you can deploy DeepSeek models through several major providers' model catalogs, or host them yourself.
听
DeepSeek's models are certainly intriguing enough to consider adding to your AI platform's toolbox along with other open-weight models, as application builders will want to experiment with or use different models for different purposes.
听
Can we trust DeepSeek's reported performance results?
听
DeepSeek's results have not yet been reproduced. We are closely following Hugging Face's Open R1 effort to reproduce them.
听
We would also like to understand whether the model was exposed to the benchmark data during training, and whether the eval methods used in the paper are appropriate.
听
That said, we do not have any specific reason to believe that the results are not real.
听
One thing that has made waves is the 2.788M GPU hours (estimated at $5.576m) reported for training (see the first table in the V3 technical report). The V3 paper makes the assumptions behind this price point clear, but also offers caveats, noting it only represents the final training run. Given how quickly the industry has leapt at covering this particular family of models, we suspect this number has been presented out of context many times in that coverage.
DeepSeek's technical components
听
R1 was trained using a combination of SFT and RL on top of V3-Base. These are transformer models that have been heavily optimized for specific hardware and software frameworks, based on the limitations imposed by their environment (specifically the US export controls on advanced GPUs). DeepSeek has also combined new and old techniques in some interesting ways. Let's start by looking at V3-Base.
听
V3-Base uses a mixture-of-experts approach, similar in spirit to Mixtral but with finer-grained and shared experts. V3-Base has 671B total parameters, of which only a fraction (37B) are activated per token, while Llama 3.1's largest version has 405B parameters, all of which are used for every token. DeepSeek trained V3-Base in FP8 mixed precision, and Meta offers an FP8-quantized variant of Llama 3.1 405B. V3-Base was trained on 14.8T tokens while Llama was trained on 15T. Both support a 128K context window.
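To make the mixture-of-experts idea concrete, here's a deliberately simplified, illustrative sketch of top-k expert routing. The dimensions, number of experts and gating below are invented and far simpler than DeepSeek's actual MoE layer; the point is only that each token touches a small subset of the total parameters:

```python
import numpy as np

def moe_layer(x, experts, gate_w, top_k=2):
    """Toy top-k mixture-of-experts routing for a single token vector x.

    Only the top_k experts selected by the gate are evaluated, which is why
    an MoE model can have huge total parameters but modest per-token compute.
    """
    scores = x @ gate_w                      # one gating score per expert
    top = np.argsort(scores)[-top_k:]        # indices of the selected experts
    weights = np.exp(scores[top])
    weights = weights / weights.sum()        # softmax over the selected experts only
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Toy setup: 8 experts, each a small linear map over a 16-dimensional hidden state.
rng = np.random.default_rng(0)
experts = [lambda x, W=rng.normal(size=(16, 16)): x @ W for _ in range(8)]
gate_w = rng.normal(size=(16, 8))

out = moe_layer(rng.normal(size=16), experts, gate_w)
print(out.shape)  # (16,) -- same shape as the input token vector
```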
听
The key difference is that the V3 paper mentions they used only 2.788M GPU hours, while Meta's Llama 3.1 documentation reports 39.3M cumulative GPU hours. Here's where the nuance lies: from what we understand, the 2.788M GPU hours used to train V3 cover only the final full training run, while the number Llama reports is cumulative. The details of exactly how to parse the wording will come out eventually, but for now we are still unclear whether there is a like-for-like comparison to be had here. As an example, V3 was trained on some data that was generated by a then-unreleased R1; should the training costs calculated for V3 therefore include the training costs of R1?
听
R1 was built on V3-Base using supervised fine-tuning (SFT) as well as reinforcement learning (RL) to build reasoning into the model. R1 uses the long chain-of-thought pattern to perform reasoning, and was then further distilled into smaller dense models, released in both Llama- and Qwen-based versions. They have also released R1-Zero, which skips the SFT stage and has some limitations, such as poor readability and language mixing, though it shows some intriguing reasoning behaviors. These limitations mean R1-Zero is probably more interesting for researchers than for users. To overcome them in R1, DeepSeek applied multi-stage training as well as cold-start data before RL.
听
V3 was then built by using data created through R1's reasoning, verification and reflection patterns to further improve on V3-Base, resulting in a more well-rounded model, V3.
听
All these models were trained using NVIDIA H800 GPUs. These are versions of the H100 made for the Chinese market and are, as mentioned earlier, limited in ways designed to comply with US export controls. Specifically, H800 chips have half the chip-to-chip interconnect bandwidth of H100s (around 400GB/s vs 900GB/s over NVLink).
听
The cost of training R1 is being widely misreported. We know the numbers circulating are wrong, but it's unclear how wrong they are. The much-quoted calculation comes from the V3 technical report and is the cost of training DeepSeek-V3. CNN qualified its reporting by saying the cost was for the base model; however, that doesn't help people understand the difference between the two.
听
R1 was trained on top of V3-Base, so the cumulative cost of training R1 would certainly be more than training the base model alone. The numbers in table one of the V3 technical report appear to cover one full run, likely the final full training run. If you were to try to replicate the training process, you'd probably need more than one full training run.
听
There are also conflicting reports suggesting DeepSeek has access to something on the order of 50,000 GPUs, which is more in line with what OpenAI is supposed to have used to train its frontier models.
If you were to rent 50,000 A100 GPUs in the US today, you'd probably pay about $1.35/GPU-hour (if you could find that many available). That's about $11.34m per week. In DeepSeek's case, it appears they could have used GPUs that their hedge fund backer, High-Flyer, had acquired for high-frequency trading purposes.
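For what it's worth, the back-of-envelope arithmetic behind both of those numbers is simple enough to spell out. The $2/GPU-hour rental assumption is the one the V3 paper uses; the $1.35/GPU-hour A100 rate is our own rough estimate above:

```python
# DeepSeek's headline figure: the final V3 training run.
gpu_hours = 2.788e6            # H800 GPU hours reported in the V3 technical report
assumed_rate = 2.00            # USD per GPU hour, the rental assumption in the paper
print(gpu_hours * assumed_rate)        # 5_576_000.0, i.e. the much-quoted $5.576m

# Hypothetically renting 50,000 A100s in the US instead.
gpus = 50_000
rate = 1.35                    # USD per GPU hour (our rough estimate)
hours_per_week = 24 * 7
print(gpus * rate * hours_per_week)    # 11_340_000.0, i.e. about $11.34m per week
```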
Diving deeper into what makes DeepSeek distinctive
听
There are a number of sophisticated ways in which DeepSeek modified the model architecture, training techniques and data to get the most out of the limited hardware available to them. Let's now look at these from the bottom up.
听
Optimizing for available hardware
听
There are two key limitations of the H800s DeepSeek had to use compared to H100s: first, they have half the GPU-to-GPU interconnect bandwidth of H100s; second, they have much less memory (80GB vs 188GB).
听
Interestingly, DeepSeek appears to have turned these limitations into an advantage. "[T]he economical training costs of DeepSeek-V3… [was] achieved through our optimized co-design of algorithms, frameworks, and hardware," DeepSeek's team wrote. In other words, they made decisions that would allow them to extract the most out of what they had available.
听
For example, they used FP8 mixed precision training to significantly reduce the amount of memory required. The V3 paper says "low-precision training has emerged as a promising solution for efficient training". However, prior to this work, FP8 was seen as efficient but less effective; DeepSeek demonstrated how it can be used effectively. "In this work, we introduce an FP8 mixed precision training framework and, for the first time, validate its effectiveness on an extremely large-scale model. Through the support for FP8 computation and storage, we achieve both accelerated training and reduced GPU memory usage."
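As a rough illustration of the memory side of this trade-off, here's a minimal PyTorch sketch of per-tensor FP8 (E4M3) storage with on-the-fly dequantization. This is not DeepSeek's framework, which performs genuine FP8 computation with fine-grained scaling; it only shows why one byte per weight instead of two matters, and it assumes a PyTorch build with float8_e4m3fn support:

```python
import torch

# A toy weight matrix in the usual BF16 training precision.
w_bf16 = torch.randn(4096, 4096, dtype=torch.bfloat16)

# Per-tensor scale so the values fit into the FP8 E4M3 range (max magnitude ~448).
scale = w_bf16.abs().max().float() / 448.0
w_fp8 = (w_bf16.float() / scale).to(torch.float8_e4m3fn)   # 1 byte per element

# Dequantize on the fly when the weight is actually needed.
w_deq = w_fp8.to(torch.bfloat16) * scale.to(torch.bfloat16)

print(w_bf16.element_size(), w_fp8.element_size())  # 2 bytes vs 1 byte per element
print((w_bf16 - w_deq).abs().max())                 # the quantization error introduced
```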
听
They've further optimized for the constrained hardware at a very low level. The V3 paper also states: "we also develop efficient cross-node all-to-all communication kernels to fully utilize InfiniBand (IB) and NVLink bandwidths. Furthermore, we meticulously optimize the memory footprint, making it possible to train DeepSeek-V3 without using costly tensor parallelism. Combining these efforts, we achieve high training efficiency." This is some seriously deep work to get the most out of the hardware they were limited to.
听
Further, the paper talks about something we find particularly interesting: "As for the training framework, we design the DualPipe algorithm for efficient pipeline parallelism, which has fewer pipeline bubbles and hides most of the communication during training through computation-communication overlap. This overlap ensures that, as the model further scales up, as long as we maintain a constant computation-to-communication ratio, we can still employ fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead." The constant computation-to-communication ratio and near-zero all-to-all communication overhead are striking relative to "normal" ways of scaling distributed training, which typically just mean "add more hardware to the pile".
听
This is a clear case of necessity being the mother of invention.
听
The impact of reinforcement learning in post-training on benchmark performance
听
DeepSeek applied reinforcement learning with GRPO (group relative policy optimization) in V2 and V3. But, apparently, reinforcement learning had a big impact on the reasoning model, R1; its impact on benchmark performance is notable.
听
By using GRPO to apply the reward to the model, DeepSeek avoids using a large "critic" model; this again saves memory. However, GRPO takes a rules-based approach to rewards which, while it works well for problems that have an objective answer, such as coding and math, might struggle in domains where answers are subjective or variable. It will be interesting to track the trade-offs as more people use it in different contexts.
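The memory saving comes from where the baseline lives: instead of a learned critic, GRPO samples a group of responses per prompt, scores them (often with simple rules), and normalizes rewards within the group. Here's a minimal sketch of that advantage calculation, under the simplifying assumption that the reward is just answer correctness; real GRPO then feeds these advantages into a PPO-style clipped policy-gradient update with a KL penalty:

```python
import numpy as np

def group_relative_advantages(rewards):
    """Advantage of each sampled response relative to its own group.

    No value network ("critic") is needed: the group mean acts as the baseline.
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: 4 responses to one math prompt, rewarded 1.0 if the final answer is correct.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))  # correct answers get positive advantage
```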
听
Multi-head latent attention (MLA)
听
Multi-head latent attention is a variation on multi-head attention that was introduced by DeepSeek in their V2 paper. While previous efficient attention variants, such as multi-query and grouped-query attention, were considered a trade-off (you reduce model quality to get better scale in large model training), DeepSeek says that MLA not only allows scale, it also improves the model. We're looking forward to digging deeper into this.
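The easiest piece of MLA to see in isolation is the KV-cache compression: rather than caching full per-head keys and values for every token, you cache one much smaller latent vector per token and reconstruct keys and values from it with learned up-projections. Here's a toy sketch with invented dimensions; the real MLA also handles rotary position embeddings separately and folds some of these projections into the attention computation:

```python
import numpy as np

d_model, n_heads, d_head, d_latent = 1024, 16, 64, 128
rng = np.random.default_rng(0)

W_down = rng.normal(size=(d_model, d_latent)) * 0.02            # hidden state -> latent
W_up_k = rng.normal(size=(d_latent, n_heads * d_head)) * 0.02   # latent -> all heads' keys
W_up_v = rng.normal(size=(d_latent, n_heads * d_head)) * 0.02   # latent -> all heads' values

h = rng.normal(size=d_model)      # hidden state for one token

latent = h @ W_down               # this 128-dim vector is all we cache per token
k = (latent @ W_up_k).reshape(n_heads, d_head)  # reconstructed keys, per head
v = (latent @ W_up_v).reshape(n_heads, d_head)  # reconstructed values, per head

print(latent.size, n_heads * d_head * 2)  # 128 cached floats vs 2048 for full K and V
```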
听
Distillation vs reinforcement learning
听
The R1 paper has an interesting discussion about distillation vs reinforcement learning. The DeepSeek team writes that their work makes it possible to "draw two conclusions: First, distilling more powerful models into smaller ones yields excellent results, whereas smaller models relying on the large-scale RL mentioned in this paper require enormous computational power and may not even achieve the performance of distillation. Second, while distillation strategies are both economical and effective, advancing beyond the boundaries of intelligence may still require more powerful base models and larger-scale reinforcement learning."
听
The first conclusion is interesting and actually intuitive. The second is reassuring: they haven't, at least, completely upended our understanding of how deep learning works in terms of significant compute requirements.
听
What can we learn from what didn't work?
听
This is always interesting to read. What did DeepSeek try that didn't work?
听
First, using a process reward model (PRM) to guide reinforcement learning was untenable at scale. However, it could still be used for re-ranking the top-N responses (see the short sketch after the next point).
Second, Monte Carlo tree search (MCTS), which was used by AlphaGo and AlphaZero, doesn't scale to general reasoning tasks, because the problem space is not as "constrained" as chess or even Go. Remember when, before AlphaGo, the Go space was considered to be too complex to be computationally feasible? Now, it's "constrained".
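As promised above, here's what re-ranking with a reward model can look like in its simplest best-of-N form. The generate and score functions below are stand-ins we've made up for illustration; in the PRM setting, score would be a learned model judging the quality of intermediate reasoning steps:

```python
import random

def best_of_n(prompt, generate, score, n=8):
    """Sample n candidate responses and keep the one the reward model likes best.

    `generate` and `score` are stand-ins: any sampling function and any reward
    (e.g. a process reward model scoring intermediate reasoning steps) will do.
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy usage with dummy generate/score functions.
answers = ["42", "41", "forty-two", "I don't know"]
pick = best_of_n(
    "What is 6 x 7?",
    generate=lambda p: random.choice(answers),
    score=lambda ans: 1.0 if "42" in ans or "forty-two" in ans else 0.0,
)
print(pick)
```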
听
Other interesting things
听
There's obviously a huge amount of interesting stuff we could comment on. However, there are a couple of things worth pointing out:
听
Very impressive coding benchmark results
Post-training combined with scaling inference-time compute looks like a viable strategy for building a very effective model
What comes next?
听
Breaking the circularity of benchmarks and models
听
After every new and better model is released, we're left wondering if it was exposed to the benchmark data at training time. "Did it just study for the exam or did it actually learn the subject?"
听
This is because of the perverse circularity of benchmark datasets; it's a never-ending spiral of misleading hype. You create a good benchmark dataset, the next model games it to win and gets hype, then you need to create another "fair" benchmark… it adds value until the next model games it, and so on. Even the newest "definitive" benchmark will only be what it says it is until the next model is released.
听
To put it another way, when an LLM confidently generates the right answers on current benchmarks, that's great, provided the real data it's applied to has similar complexity. On the other hand, when the LLM fails on a newer benchmark (or on the domain it's applied to), it's usually because it's confident in answers that are wrong; the new benchmark data has complexity it didn't see at training time.
This cycle needs to stop, and we need better and more generic evaluation mechanisms and informative metrics that are not dependent on new benchmarks every few weeks. (We touched on this elsewhere.)
听
Replicating DeepSeek R1's results
听
One big thing left to do is to replicate R1's results and validate the findings independently. We're keenly following Hugging Face's Open R1 project as the open-source community attempts to reproduce the results. Reproducing them will need a few things:
听
- GPUs: 2,048 isn't a huge number, just as $5.5m isn't a big amount, per training run.
- Training code: DeepSeek has not open-sourced theirs.
- Training data: perhaps the biggest gap.
听
DeepSeek probably won't release their entire training dataset, just as OpenAI and Anthropic won't release theirs. As far as we can tell, DeepSeek hasn't released samples of the data used for the long chain-of-thought training, for example. So the intrepid open-source community has started creating its own reasoning datasets to fill the gap.
听
In the meantime, Berkeley researchers claim to have reproduced R1-Zero-style reasoning behavior in a small model for a tiny fraction of the cost.
听
...I think I need to take a nap.
Thanks to my colleagues Shayan Mohanty, Ben O'Mahony, Chris Kramer, Sahger Lad, Prathamesh Kalamkar, Lauris Jullien, Emily Gorcenski, Karrtik Iyer, Runyan Tan, Parag Mahajani and Andy Yates, who all contributed to this piece.