Understanding DeepSeek R1

DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, but it also comes with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.

What makes DeepSeek-R1 particularly exciting is its openness. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper. The model is also remarkably cheap to run, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).

Until roughly GPT-4, the conventional wisdom was that better models required more data and compute. While that's still valid, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.

The Essentials

The DeepSeek-R1 paper presented several models, but the main ones are R1 and R1-Zero. Following these are a series of distilled models that, while interesting, I won't discuss here.

DeepSeek-R1 builds on two major ideas:

1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.

R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of reasoning inside a `<think>` tag, before answering with a final summary.
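
For illustration, here is a minimal sketch (my own, not DeepSeek's code) of how such a response could be split into its reasoning and its final answer, assuming the `<think>...</think>` tag format:

```python
import re

def split_reasoning(response: str) -> tuple[str, str]:
    """Split an R1-style response into (chain-of-thought, final answer).

    Assumes the reasoning is wrapped in <think>...</think>, followed by the summary.
    """
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    return reasoning, answer

example = "<think>2 + 2 is simple addition, so the sum is 4.</think>The answer is 4."
print(split_reasoning(example))
# ('2 + 2 is simple addition, so the sum is 4.', 'The answer is 4.')
```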

R1-Zero vs R1

R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base, with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward. R1-Zero attains excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by adding a small amount of supervised fine-tuning and multiple RL passes, which improves both correctness and readability.

It is interesting how some languages may express certain concepts better, which leads the model to pick the most expressive language for a given task.

Training Pipeline

The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they created such strong reasoning models, and what you can expect from each stage. This includes the problems that the models from each stage have, and how they solved them in the next stage.

It's interesting that their training pipeline diverges from the usual one:

- The usual training approach: pretraining on a large dataset (training to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
- R1-Zero: pretrained → RL
- R1: pretrained → multi-stage training pipeline with several SFT and RL stages

- Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This gives a good model to start RL from.
- First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing the chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved on to the next step. The result of this step is a strong reasoning model, but with weak general abilities, e.g., poor formatting and language mixing.
- Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.
- Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step produced a strong reasoning model with general capabilities.
- Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.

They also performed model distillation for several Qwen and Llama models on the reasoning traces to obtain the distilled-R1 models.

Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model. The teacher is typically a larger model than the student.

Group Relative Policy Optimization (GRPO)

The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers. They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.

In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO. Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.

What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions. Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 relies on simple criteria: it might give a higher reward if the answer is correct, if it follows the expected formatting, and if the language of the answer matches that of the prompt. Not relying on a reward model also means you don't have to spend time and effort training one, and it doesn't take memory and compute away from your main model.
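
As a concrete illustration, a rule-based reward along those lines could look like the sketch below. The specific checks and weights are my assumptions, not DeepSeek's published reward code:

```python
import re

def rule_based_reward(prompt_lang: str, response: str, reference_answer: str) -> float:
    """Toy rule-based reward combining correctness, format, and language consistency."""
    reward = 0.0

    # Accuracy: does the part after the reasoning contain the expected answer?
    final_answer = response.split("</think>")[-1].strip()
    if reference_answer in final_answer:
        reward += 1.0

    # Format: was the reasoning wrapped in <think>...</think> tags?
    if re.search(r"<think>.+?</think>", response, flags=re.DOTALL):
        reward += 0.5

    # Language consistency: crude check that an English prompt gets an English answer
    # (here: no CJK characters). A real check would be far more thorough.
    if prompt_lang == "en" and not re.search(r"[\u4e00-\u9fff]", final_answer):
        reward += 0.25

    return reward

print(rule_based_reward("en", "<think>1 + 1 = 2</think>The answer is 2.", "2"))  # 1.75
```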

GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:

1. For each input prompt, the model generates a group of different responses.
2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.
3. Rewards are adjusted relative to the group's performance, essentially measuring how much better each response is compared to the others.
4. The model updates its policy slightly to favor responses with higher relative advantages. It only makes small adjustments, using techniques like clipping and a KL penalty, to make sure the policy doesn't stray too far from its original behavior.
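
Here is a minimal sketch of step 3, the group-relative advantage (normalizing by the group mean and standard deviation, as in the GRPO formulation from the DeepSeekMath paper). Steps 1, 2, and 4, i.e. sampling, reward scoring, and the clipped, KL-penalized policy update, are omitted:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Score each sampled response against its own group.

    No learned critic is needed; the advantage is simply how far a response's
    reward sits above or below the group mean, in units of the group's spread.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example: four sampled responses to one prompt, scored by a rule-based reward.
rewards = [1.75, 0.5, 1.75, 0.0]
print([round(a, 2) for a in group_relative_advantages(rewards)])
```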

A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance awarding a bonus when the model correctly uses the expected syntax, to guide the training.

While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).

For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
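
With TRL, a minimal GRPO setup looks roughly like the following (adapted from the TRL quick-start; double-check the current TRL documentation, since the API may have evolved, and note that the tiny model and length-based reward here are just placeholders):

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder reward: prefer completions close to 20 characters.
# Swap in your own rule-based checks (accuracy, format, language, ...).
def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

dataset = load_dataset("trl-lib/tldr", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="grpo-demo", logging_steps=10),
    train_dataset=dataset,
)
trainer.train()
```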

Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.

Is RL on LLMs the path to AGI?

As a final note on explaining DeepSeek-R1 and the methodologies they present in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video:

These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust; in other words, the improvement seems to come from boosting the correct response from TopK rather than from an enhancement of fundamental capabilities.

In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely already present in the pretrained model.

This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities. Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.
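
To make that distinction concrete, here is a toy illustration (my own, not from either paper): pass@k is a rough proxy for what the pretrained model can reach at all, while the majority/top answer is what RL fine-tuning tends to sharpen.

```python
from collections import Counter

def pass_at_k(samples: list[str], correct: str) -> bool:
    """Capability proxy: is the correct answer reachable at all within k samples?"""
    return correct in samples

def top_answer(samples: list[str]) -> str:
    """The answer the model most frequently commits to (what RL tends to sharpen)."""
    return Counter(samples).most_common(1)[0][0]

# Toy example: the correct answer "42" is reachable both before and after RL
# (pass@k unchanged), but only after RL does it become the dominant answer.
before_rl = ["41", "42", "7", "41"]
after_rl = ["42", "42", "42", "41"]

print(pass_at_k(before_rl, "42"), top_answer(before_rl))  # True 41
print(pass_at_k(after_rl, "42"), top_answer(after_rl))    # True 42
```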

It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!

Running DeepSeek-R1

I've used DeepSeek-R1 through the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.

Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.

I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments. The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.

671B via llama.cpp

DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:

29 layers seemed to be the sweet spot given this configuration.
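
For reference, a roughly equivalent setup through the llama-cpp-python bindings might look like the sketch below. The GGUF filename and the context size are illustrative assumptions, not the exact configuration used for the run above:

```python
from llama_cpp import Llama

# Illustrative only: the GGUF path is an assumption, not the exact file used above.
llm = Llama(
    model_path="./DeepSeek-R1-UD-IQ1_S.gguf",  # Unsloth's 1.58-bit dynamic quant
    n_gpu_layers=29,   # partial offload: 29 layers on the GPU, the rest on CPU
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
    max_tokens=1024,
)
print(out["choices"][0]["message"]["content"])
```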

Performance:

- A r/localllama user reported that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup.
- Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.

As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these big models on accessible hardware.

What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than that of other models, but their usefulness is also usually higher. We need to both maximize usefulness and minimize time-to-usefulness.

70B via Ollama

70.6b params, 4-bit KM quantized DeepSeek-R1 running via Ollama:
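
A minimal way to reproduce a comparable run programmatically is the Ollama Python client, assuming the `deepseek-r1:70b` tag has already been pulled (adjust to whatever tag you use):

```python
import ollama

# Assumes `ollama pull deepseek-r1:70b` has been run locally beforehand.
response = ollama.chat(
    model="deepseek-r1:70b",
    messages=[{"role": "user", "content": "Is 9.11 larger than 9.9? Think it through."}],
)
print(response["message"]["content"])
```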

GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of 671B that I showcased above.

Resources

- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- [2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube)
- DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs
- The Illustrated DeepSeek-R1 - by Jay Alammar
- Explainer: What's R1 & Everything Else? - Tim Kellogg
- DeepSeek R1 Explained to your grandmother - YouTube

DeepSeek

- Try R1 at chat.deepseek.com.
- GitHub - deepseek-ai/DeepSeek-R1.
- deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
- DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025): This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques.
- DeepSeek-V3 Technical Report (December 2024): This report discusses the implementation of an FP8 mixed-precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024): This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source settings. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
- DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024): This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to improve code generation and infilling.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024): This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024): This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.

Interesting events

- Hong Kong University replicates R1 results (Jan 25, '25).
- Huggingface announces huggingface/open-r1: a fully open reproduction of DeepSeek-R1, completely open source (Jan 25, '25).
- An OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1.

Liked this post? Join the newsletter.