diff --git a/Understanding-DeepSeek-R1.md b/Understanding-DeepSeek-R1.md
new file mode 100644
index 0000000..5739178
--- /dev/null
+++ b/Understanding-DeepSeek-R1.md
@@ -0,0 +1,92 @@
+DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model in numerous benchmarks, but it also comes with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.
+
+What makes DeepSeek-R1 especially exciting is its openness. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper.
+The model is also remarkably cost-effective, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).
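+
+To put those per-million-token prices in perspective, here is a quick back-of-the-envelope comparison in Python, using the upper end of R1's input price; the request sizes are made-up numbers purely for illustration.
+
+```python
+# Rough per-request cost comparison using the prices quoted above.
+R1_INPUT, R1_OUTPUT = 0.55, 2.19    # USD per 1M tokens
+O1_INPUT, O1_OUTPUT = 15.00, 60.00  # USD per 1M tokens
+
+def cost(input_tokens: int, output_tokens: int, in_price: float, out_price: float) -> float:
+    """Price of one request in USD."""
+    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000
+
+# Hypothetical request: a 2,000-token prompt with a 4,000-token reasoning-heavy answer.
+print(f"R1: ${cost(2000, 4000, R1_INPUT, R1_OUTPUT):.4f}")  # ~$0.0099
+print(f"o1: ${cost(2000, 4000, O1_INPUT, O1_OUTPUT):.4f}")  # ~$0.2700
+```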
+
+Until ~GPT-4, the conventional wisdom was that better models required more data and compute. While that's still valid, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
+
+The Essentials
+
+The DeepSeek-R1 paper presented multiple models, but the main ones among them are R1 and R1-Zero. Following these are a series of distilled models that, while interesting, I won't discuss here.
+
+DeepSeek-R1 uses two major ideas:
+
+1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
+2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
+
+R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a <think> tag, before answering with a final summary.
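+
+As a small illustration of that output format, here is a minimal sketch (assuming the <think>...</think> convention described above) of how you could split a raw completion into its reasoning trace and final answer:
+
+```python
+import re
+
+def split_r1_output(completion: str) -> tuple[str, str]:
+    """Split an R1-style completion into (reasoning, final answer).
+
+    Assumes the reasoning is wrapped in <think>...</think> and the
+    final summary follows the closing tag.
+    """
+    match = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
+    if match is None:
+        return "", completion.strip()  # no reasoning block found
+    return match.group(1).strip(), completion[match.end():].strip()
+
+example = "<think>2 + 2 is 4, and 4 - 1 is 3.</think>The answer is 3."
+print(split_r1_output(example))  # ('2 + 2 is 4, and 4 - 1 is 3.', 'The answer is 3.')
+```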
+
+R1-Zero vs R1
+
+R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base without any supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.
+R1-Zero achieves excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both accuracy and readability.
+
+It is interesting how some languages may express certain concepts better, which leads the model to select the most expressive language for the task.
+
+Training Pipeline
+
+The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It shows how they built such strong reasoning models, and what you can expect from each stage, including the problems that the resulting model of each stage has and how they fixed them in the next stage.
+
+It's interesting that their training pipeline differs from the usual one:
+
+The usual training approach: Pretraining on a big dataset (training to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
+R1-Zero: Pretrained → RL
+R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages
+
+Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This yields a good model to begin RL from.
+First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing the chain-of-thought into thinking tags). When they were close to convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model but with weak general capabilities, e.g., poor formatting and language mixing.
+Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model (see the sketch after this list). They collected around 600k high-quality reasoning samples.
+Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
+Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
+They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.
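+
+A minimal sketch of the rejection-sampling step above: sample several completions per prompt from the RL checkpoint, keep only the ones a verifier accepts, and add them to the new SFT dataset. The generate and is_correct callables are hypothetical stand-ins for the model call and the answer checker.
+
+```python
+from typing import Callable
+
+def rejection_sample(prompts: list[str],
+                     generate: Callable[[str, int], list[str]],
+                     is_correct: Callable[[str, str], bool],
+                     samples_per_prompt: int = 8) -> list[dict]:
+    """Build SFT data by keeping only completions that pass verification."""
+    sft_data = []
+    for prompt in prompts:
+        # Sample several candidate completions from the RL checkpoint.
+        candidates = generate(prompt, samples_per_prompt)
+        # Keep only candidates whose final answer checks out.
+        accepted = [c for c in candidates if is_correct(prompt, c)]
+        sft_data.extend({"prompt": prompt, "completion": c} for c in accepted)
+    return sft_data
+```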
+
+Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model.
+The teacher is typically a larger model than the student.
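+
+In its simplest form, the distillation step described above just collects teacher outputs as supervised targets for the student. A rough sketch, where teacher_generate is a hypothetical call to the large reasoning model:
+
+```python
+def build_distillation_set(prompts: list[str], teacher_generate) -> list[dict]:
+    """Collect teacher reasoning traces as SFT targets for a smaller student.
+
+    teacher_generate(prompt) is assumed to return the full completion,
+    including the chain-of-thought, from the larger teacher model.
+    """
+    return [{"prompt": p, "completion": teacher_generate(p)} for p in prompts]
+
+# The student (e.g., a Qwen or Llama base model) is then fine-tuned with
+# ordinary supervised learning on these (prompt, completion) pairs.
+```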
+
+Group Relative Policy Optimization (GRPO)
+
+The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers.
+They used a reward system that checks not only for accuracy but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.
+
+In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO.
+Rather than adding a separate module at inference time, the training process itself pushes the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
+
+What makes their method particularly interesting is its reliance on straightforward, rule-based reward functions.
+Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it may give a higher reward if the answer is correct, if it follows the expected format, and if the language of the answer matches that of the prompt.
+Not relying on a reward model also means you don't need to spend time and effort training it, and it doesn't take memory and compute away from your main model.
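+
+To make that concrete, here is a minimal sketch of such a rule-based reward function, assuming the <think> tag format and a simple substring answer check; the actual reward rules in the paper are more involved.
+
+```python
+import re
+
+def rule_based_reward(prompt: str, completion: str, reference_answer: str) -> float:
+    """Toy rule-based reward: formatting + correctness + language consistency."""
+    reward = 0.0
+
+    # 1. Formatting: did the model wrap its reasoning in <think>...</think>?
+    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
+        reward += 0.5
+
+    # 2. Correctness: does the text after the reasoning contain the reference answer?
+    final_part = completion.split("</think>")[-1]
+    if reference_answer.strip() in final_part:
+        reward += 1.0
+
+    # 3. Language consistency (crude proxy): if the prompt is ASCII-only,
+    #    penalize answers that mix in non-ASCII scripts.
+    if prompt.isascii() and not final_part.isascii():
+        reward -= 0.5
+
+    return reward
+```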
+
+GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:
+
+1. For each input prompt, the model generates a group of different responses.
+2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.
+3. Rewards are adjusted relative to the group's performance, essentially measuring how much better each response is compared to the others (see the sketch after this list).
+4. The model updates its policy slightly to favor responses with higher relative advantages. It only makes small adjustments, using techniques like clipping and a KL penalty, to make sure the policy doesn't stray too far from its original behavior.
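+
+A minimal numerical sketch of the group-relative part (steps 2-3): rewards within a group are standardized, so each response's advantage reflects how much better or worse it is than its siblings for the same prompt. This simplifies the full GRPO objective, which also includes the clipped probability ratio and the KL penalty.
+
+```python
+import statistics
+
+def group_relative_advantages(rewards: list[float]) -> list[float]:
+    """Standardize rewards within one prompt's group of sampled responses."""
+    mean = statistics.mean(rewards)
+    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
+    return [(r - mean) / std for r in rewards]
+
+# Four sampled responses to one prompt; only the second earned the full reward.
+rewards = [0.5, 1.5, 0.0, 0.5]
+print(group_relative_advantages(rewards))
+# -> approximately [-0.23, 1.61, -1.15, -0.23]; the best response gets the
+#    largest positive advantage, so the policy is nudged toward it.
+```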
+
+A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance, awarding a bonus when the model correctly uses the expected syntax, to guide the training.
+
+While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
+
+For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
+Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
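+
+For orientation, here is roughly what training with TRL's GRPO support looks like. This sketch follows the library's documented GRPOTrainer/GRPOConfig interface, but the exact argument names, dataset, and defaults may differ between TRL versions, and the length-based reward is only a placeholder for a real rule-based reward like the one sketched earlier.
+
+```python
+from datasets import load_dataset
+from trl import GRPOConfig, GRPOTrainer
+
+# Any dataset with a "prompt" column works; this one is just an example.
+dataset = load_dataset("trl-lib/tldr", split="train")
+
+def reward_len(completions, **kwargs):
+    """Toy reward: prefer completions close to 200 characters."""
+    return [-abs(200 - len(completion)) for completion in completions]
+
+trainer = GRPOTrainer(
+    model="Qwen/Qwen2.5-0.5B-Instruct",  # small model purely for illustration
+    reward_funcs=reward_len,
+    args=GRPOConfig(output_dir="grpo-demo"),
+    train_dataset=dataset,
+)
+trainer.train()
+```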
+
+Is RL on LLMs the path to AGI?
+
+As a final note on explaining DeepSeek-R1 and the methodologies they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.
+
+These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.
+
+In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely already present in the pretrained model.
+
+This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities.
+Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.
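+
+A toy numerical illustration of this "sharpening" argument, with invented probabilities purely to show the mechanism: if RL only shifts probability mass toward an answer the base model could already produce, top-1 accuracy improves even though the set of reachable answers is unchanged.
+
+```python
+# Probability the model assigns to each candidate answer for one question.
+# "42" is the correct answer; the numbers are made up for illustration.
+base_policy = {"41": 0.40, "42": 0.35, "43": 0.25}
+rl_policy   = {"41": 0.15, "42": 0.80, "43": 0.05}  # same support, reweighted
+
+def greedy_correct(policy: dict[str, float], correct: str) -> bool:
+    """Is the correct answer the single most likely one (top-1)?"""
+    return max(policy, key=policy.get) == correct
+
+def reachable(policy: dict[str, float], correct: str) -> bool:
+    """Could the correct answer be found at all, e.g. by sampling many times?"""
+    return policy.get(correct, 0.0) > 0.0
+
+print(greedy_correct(base_policy, "42"), greedy_correct(rl_policy, "42"))  # False True
+print(reachable(base_policy, "42"), reachable(rl_policy, "42"))            # True True
+```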
+
+It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
+
+Running DeepSeek-R1
+
+I've used DeepSeek-R1 through the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.
+
+Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.
+
+I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.
+The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.
+
+671B via Llama.cpp
+
+DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:
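+
+As a rough sketch of what such an invocation can look like, driven from Python: the binary name, model path, and prompt below are assumptions, and the exact flags may vary across llama.cpp versions.
+
+```python
+import subprocess
+
+# Hypothetical paths; point these at your llama.cpp build and GGUF shards.
+cmd = [
+    "./llama-cli",
+    "--model", "DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",
+    "--n-gpu-layers", "29",    # partial offloading: 29 layers on the GPU
+    "--cache-type-k", "q4_0",  # 4-bit quantized KV cache
+    "--prompt", "Why is the sky blue?",
+]
+subprocess.run(cmd, check=True)
+```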
+
+29 layers seemed to be the sweet spot given this setup.
+
+Performance:
+
+A r/localllama user described being able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup.
+Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.
+
+As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these big models on accessible hardware.
+
+What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than that of other models, but their usefulness is also usually higher.
+We need to both maximize usefulness and minimize time-to-usefulness.
+
+70B via Ollama
+
+70.6b params, 4-bit KM quantized DeepSeek-R1 running via Ollama:
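+
+If you want to try something similar, here is a minimal sketch using the Ollama Python client; the model tag deepseek-r1:70b and the exact response fields are assumptions, so check them against your Ollama version.
+
+```python
+import ollama  # pip install ollama; requires a running Ollama server
+
+response = ollama.chat(
+    model="deepseek-r1:70b",  # 4-bit quantized 70B distill tag on Ollama
+    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
+)
+
+text = response["message"]["content"]
+# R1-style models emit their reasoning in <think>...</think> before the answer.
+reasoning, sep, answer = text.partition("</think>")
+print(answer.strip() if sep else text.strip())
+```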
+
+GPU usage shoots up here, as expected when compared to the mostly CPU-powered run of 671B that I showed above.
+
+Resources
+
+DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
+[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
+DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube).
+DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs.
+The Illustrated DeepSeek-R1 - by Jay Alammar.
+Explainer: What's R1 & Everything Else? - Tim Kellogg.
+DeepSeek R1 Explained to your grandmother - YouTube
+
+DeepSeek
+
+- Try R1 at chat.deepseek.com.
+GitHub - deepseek-ai/DeepSeek-R1.
+deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
+DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (January 2025) This paper introduces DeepSeek-R1, an open-source reasoning model that matches the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques.
+DeepSeek-V3 Technical Report (December 2024) This report discusses the implementation of an FP8 mixed precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
+DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024) This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
+DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024) This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to improve code generation and infilling.
+DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024) This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
+DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024) This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
+
+Interesting events
+
+- Hong Kong University replicates R1 results (Jan 25, '25).
+- Hugging Face announces huggingface/open-r1: Fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25).
+- OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1.
+
+Liked this post? Join the newsletter.
\ No newline at end of file