<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Incredible.AI</title>
    <description>Artificial Intelligence and BigData</description>
    <link>http://incredible.ai/</link>
    <atom:link href="http://incredible.ai/feed.xml" rel="self" type="application/rss+xml" />
    <pubDate>Thu, 05 Mar 2026 15:05:40 +0000</pubDate>
    <lastBuildDate>Thu, 05 Mar 2026 15:05:40 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>OpenClaw on Ubuntu + Docker Isolation</title>
        <description>&lt;h1 id=&quot;1-installation&quot;&gt;1. Installation&lt;/h1&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;git clone git@github.com:openclaw/openclaw.git 

&lt;span class=&quot;c&quot;&gt;# go to openclaw directory&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;./docker-setup.sh
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Change docker-compose.yml.&lt;br /&gt;
I added &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;extra_hosts : &quot;host.docker.internal:host-gateway&quot;&lt;/code&gt;.&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;services&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;openclaw-gateway&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;${OPENCLAW_IMAGE:-openclaw:local}&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;extra_hosts&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;host.docker.internal:host-gateway&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;install brew manually.&lt;br /&gt;
it seesm the docker container doesn’t have brew and during installation it fails if brew is not installed.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;mkdir&lt;/span&gt; ~/.linuxbrew &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; curl &lt;span class=&quot;nt&quot;&gt;-L&lt;/span&gt; https://github.com/Homebrew/brew/tarball/master | &lt;span class=&quot;nb&quot;&gt;tar &lt;/span&gt;xz &lt;span class=&quot;nt&quot;&gt;--strip&lt;/span&gt; 1 &lt;span class=&quot;nt&quot;&gt;-C&lt;/span&gt; ~/.linuxbrew

&lt;span class=&quot;c&quot;&gt;# set up environment variables &lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;export PATH=&quot;$HOME/.linuxbrew/bin:$HOME/.linuxbrew/sbin:$PATH&quot;&apos;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.bashrc
&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;export MANPATH=&quot;$HOME/.linuxbrew/share/man:$MANPATH&quot;&apos;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.bashrc
&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;export INFOPATH=&quot;$HOME/.linuxbrew/share/info:$INFOPATH&quot;&apos;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.bashrc

&lt;span class=&quot;c&quot;&gt;# 3. 현재 세션에 적용&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;source&lt;/span&gt; ~/.bashrc

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;model-provider&quot;&gt;Model Provider&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;vLLM&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;you can serve local LLM like this.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vllm serve openai/gpt-oss-20b &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--host&lt;/span&gt; 127.0.0.1 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--port&lt;/span&gt; 8045 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--tensor-parallel-size&lt;/span&gt; 1 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--gpu-memory-utilization&lt;/span&gt; 0.3 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--trust-remote-code&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--async-scheduling&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--max-num-batched-tokens&lt;/span&gt; 8192 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--max-model-len&lt;/span&gt; 35096 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--api-key&lt;/span&gt; 1234
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;connect to Docker terminal and then change the connection information like this.&lt;br /&gt;
no need to restart the openclaw container.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# change address to host.docker.internal&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;sed&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-i&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;s/127.0.0.1/host.docker.internal/&apos;&lt;/span&gt; ~/.openclaw/openclaw.json

&lt;span class=&quot;c&quot;&gt;# change vLLM API key (1234)&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;sed&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-i&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;s/&quot;VLLM_API_KEY&quot;/&quot;1234&quot;/&apos;&lt;/span&gt; ~/.openclaw/openclaw.json
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;setup-openclaw-gateway&quot;&gt;Setup OpenClaw Gateway&lt;/h3&gt;

&lt;p&gt;connect to Docker terminal and you can get a gateway token&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$OPENCLAW_GATEWAY_TOKEN&lt;/span&gt;
05c5df07ef01&amp;lt;eradicated&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;ol&gt;
  &lt;li&gt;http://127.0.0.1:18789/overview&lt;/li&gt;
  &lt;li&gt;in Gateway Access, put the token in Gateway Token&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;token-issue&quot;&gt;Token Issue&lt;/h3&gt;

&lt;p&gt;If you happen to run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;./docker-setup.sh&lt;/code&gt; multiple times, it generates a new random gateway token. &lt;br /&gt; 
this creates a mismatch between &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.env&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;~/.openclaw/openclaw.json&lt;/code&gt;&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;grep &lt;/span&gt;OPENCLAW_GATEWAY_TOKEN .env
&lt;span class=&quot;nv&quot;&gt;OPENCLAW_GATEWAY_TOKEN&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;05c5df0&amp;lt;eradicated&amp;gt;

&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;grep&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;&quot;token&quot;: &quot;[^&quot;]*&quot;&apos;&lt;/span&gt; ~/.openclaw/openclaw.json
&lt;span class=&quot;s2&quot;&gt;&quot;token&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;05c5df0&amp;lt;eradicated&amp;gt;&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;if both configurations are different, you need to update ~/.openclaw/openclaw.json file to match the .env token&lt;br /&gt;
and then &lt;strong&gt;restart docker&lt;/strong&gt;&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker restart openclaw-openclaw-gateway-1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;slack&quot;&gt;Slack&lt;/h3&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;./openclaw.mjs pairing list &lt;span class=&quot;nt&quot;&gt;--channel&lt;/span&gt; slack
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
</description>
        <pubDate>Sun, 01 Feb 2026 01:00:00 +0000</pubDate>
        <link>http://incredible.ai/openclaw/2026/02/01/OpenClaw-On-Ubuntu/</link>
        <guid isPermaLink="true">http://incredible.ai/openclaw/2026/02/01/OpenClaw-On-Ubuntu/</guid>
        
        
        <category>openclaw</category>
        
      </item>
    
      <item>
        <title>Nvidia RTX 6000 Pro Blackwell Workstation Settings</title>
        <description>&lt;h1 id=&quot;1-install-pytorch&quot;&gt;1. Install Pytorch&lt;/h1&gt;

&lt;h2 id=&quot;11-pytorch-with-cuda-130&quot;&gt;1.1 Pytorch with CUDA 13.0&lt;/h2&gt;

&lt;p&gt;create &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;requirements.txt&lt;/code&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-txt&quot;&gt;--index-url https://pypi.org/simple
--extra-index-url https://download.pytorch.org/whl/cu130
nvidia-cudnn-cu13
nvidia-cublas
nvidia-cufft
nvidia-curand
nvidia-cusolver
nvidia-cusparse
nvidia-nccl-cu13
torch 
torchvision 
transformers
&lt;/code&gt;&lt;/pre&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;uv pip &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-r&lt;/span&gt; requirements.txt &lt;span class=&quot;nt&quot;&gt;--system&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;12-pytorch-with-cuda-128&quot;&gt;1.2 Pytorch with CUDA 12.8&lt;/h2&gt;

&lt;p&gt;create &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;requirements.txt&lt;/code&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-txt&quot;&gt;--index-url https://pypi.org/simple
--extra-index-url https://download.pytorch.org/whl/cu128
nvidia-cudnn-cu12
nvidia-cublas-cu12
nvidia-cufft-cu12
nvidia-curand-cu12
nvidia-cusolver-cu12
nvidia-cusparse-cu12
nvidia-nccl-cu12
torch 
torchvision 
transformers
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;또는&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-txt&quot;&gt;uv pip install nvidia-cudnn-cu12 \
    nvidia-cublas-cu12 \
    nvidia-cufft-cu12 \
    nvidia-curand-cu12 \
    nvidia-cusolver-cu12 \
    nvidia-cusparse-cu12 \
    nvidia-nccl-cu12 \
    torch \
    torchvision \
    transformers \
    --index-url https://pypi.org/simple \
    --extra-index-url https://download.pytorch.org/whl/cu128
&lt;/code&gt;&lt;/pre&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;uv pip &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-r&lt;/span&gt; requirements.txt &lt;span class=&quot;nt&quot;&gt;--system&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;13-setting-bashrc&quot;&gt;1.3 setting .bashrc&lt;/h2&gt;

&lt;p&gt;modify ~/.bashrc&lt;br /&gt;
NVIDIA_HOME is different, depending on your python version&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# CUDA&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;NVIDIA_HOME&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$HOME&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/.pyenv/versions/3.12.10/lib/python3.12/site-packages/nvidia&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# 2. Add necessary libraries into LD_LIBRARY_PATH에&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$LD_LIBRARY_PATH&lt;/span&gt;:&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;NVIDIA_HOME&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;/cu13/lib
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$LD_LIBRARY_PATH&lt;/span&gt;:&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;NVIDIA_HOME&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;/cudnn/lib
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$LD_LIBRARY_PATH&lt;/span&gt;:&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;NVIDIA_HOME&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;/cublas/lib
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$LD_LIBRARY_PATH&lt;/span&gt;:&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;NVIDIA_HOME&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;/cufft/lib
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$LD_LIBRARY_PATH&lt;/span&gt;:&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;NVIDIA_HOME&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;/curand/lib
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$LD_LIBRARY_PATH&lt;/span&gt;:&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;NVIDIA_HOME&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;/cusolver/lib
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$LD_LIBRARY_PATH&lt;/span&gt;:&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;NVIDIA_HOME&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;/cusparse/lib
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$LD_LIBRARY_PATH&lt;/span&gt;:&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;NVIDIA_HOME&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;/nccl/lib
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;14-test&quot;&gt;1.4 Test&lt;/h2&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;python &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;import torch; print(torch.__version__)&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;time&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;os&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Suppress TF logs for cleaner output
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;environ&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;TF_CPP_MIN_LOG_LEVEL&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;3&apos;&lt;/span&gt; 

&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;test_pytorch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;--- PyTorch Check ---&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;torch&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cuda&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;is_available&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;device&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;cuda&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;gpu_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cuda&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_device_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;vram&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cuda&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_device_properties&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;total_memory&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1e9&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;✅ GPU Detected: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gpu_name&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; (&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vram&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; GB VRAM)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Simple Compute Test
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;randn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;randn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;start&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matmul&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cuda&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;synchronize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# Wait for compute to finish
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;✅ Matrix Mul (5k x 5k) Time: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;s&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;❌ PyTorch cannot see the GPU.&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;test_pytorch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h1 id=&quot;2-install-unsloth&quot;&gt;2. Install Unsloth&lt;/h1&gt;

&lt;ul&gt;
  &lt;li&gt;my conclusion, Unsloth doesn’t work with Tensorflow Nightly version.&lt;/li&gt;
  &lt;li&gt;I decided not to install tensorflow in system level (I will use Tensorflow in virtualenv)&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;pip &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;unsloth
&lt;span class=&quot;c&quot;&gt;#$ pip install tf-keras transformers --no-deps&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# must install it again to upgrade torch and torchvision. &lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# as of this writing, torch==2.10.0, torchao==0.15.0, torchaudio==2.9.1, torchvision==0.25.0&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# these versions work well with unsloth==2026.1.4, unsloth_zoo==2026.1.4&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;pip &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--upgrade&lt;/span&gt; torch torchvision
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;테스트&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;torch&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;unsloth&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FastLanguageModel&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;test_unsloth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# 설정
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;max_seq_length&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2048&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;dtype&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# None으로 설정 시 자동으로 bfloat16 (RTX 6000 지원) 감지
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;load_in_4bit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# Unsloth의 핵심인 4bit QLoRA 로딩 테스트
&lt;/span&gt;    
    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;🔹 GPU Check: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cuda&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_device_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;🔹 VRAM: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cuda&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_device_properties&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;total_memory&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1024&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; GB&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# 1. 모델 및 토크나이저 로드 테스트
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;[1/3] Loading Llama-3.2-1B model...&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;try&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FastLanguageModel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_pretrained&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;model_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;unsloth/Llama-3.2-1B-Instruct&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
            &lt;span class=&quot;n&quot;&gt;max_seq_length&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_seq_length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;dtype&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dtype&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;load_in_4bit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;load_in_4bit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;✅ Model loaded successfully.&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;except&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;❌ Model load failed: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# 2. Inference 테스트 (FastLanguageModel 최적화 동작 확인)
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;[2/3] Running Inference...&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;FastLanguageModel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;for_inference&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# Native 2x faster inference
&lt;/span&gt;    
    &lt;span class=&quot;n&quot;&gt;inputs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
            &lt;span class=&quot;s&quot;&gt;&quot;unsloth 라이브러리의 주요 장점은 무엇인가요? 짧게 요약해주세요.&quot;&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;return_tensors&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;pt&quot;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;cuda&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;outputs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;generate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inputs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_new_tokens&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;use_cache&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;decoded_output&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch_decode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;outputs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;# 결과 출력 (내용보다는 에러 없이 생성되었는지가 중요)
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;✅ Inference completed.&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;# 3. LoRA 어댑터 부착 테스트 (학습 준비 상태 확인)
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;[3/3] Testing LoRA Adapter attachment...&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;try&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FastLanguageModel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_peft_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;target_modules&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;q_proj&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;k_proj&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;v_proj&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;o_proj&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;lora_alpha&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;lora_dropout&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;bias&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;none&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;use_gradient_checkpointing&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;✅ LoRA Adapters attached. Trainable parameters: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;print_trainable_parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;except&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;❌ LoRA attachment failed: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;test_unsloth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h1 id=&quot;3-install-tlr&quot;&gt;3. Install TLR&lt;/h1&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;torch&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;transformers&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;transformers&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AutoConfig&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AutoTokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TrainingArguments&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;datasets&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Dataset&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;peft&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LoraConfig&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;get_peft_model&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;trl&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SFTTrainer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DPOTrainer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PPOConfig&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AutoModelForCausalLMWithValueHead&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SFTConfig&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DPOConfig&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;os&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;tempfile&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;warnings&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Suppress minor warnings for clean output
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;warnings&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;filterwarnings&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;ignore&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;


&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;print_status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;message&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;[TRL TEST] &amp;gt;&amp;gt; &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;message&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;


&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;run_trl_health_check&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;print_status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Starting TRL Health Check...&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# 1. Environment &amp;amp; Device Check
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;device&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;cuda&quot;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cuda&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;is_available&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;cpu&quot;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;print_status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Device: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# 2. Setup Resources (Correct Vocab Size Match)
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;print_status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Initializing dummy model and tokenizer...&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# 먼저 토크나이저를 로드합니다.
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AutoTokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_pretrained&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;gpt2&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pad_token&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;eos_token&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# 모델 설정: vocab_size를 토크나이저와 동일하게 맞춤 (매우 중요)
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;model_config&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AutoConfig&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_pretrained&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;gpt2&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;model_config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n_layer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;model_config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n_head&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;model_config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n_embd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;model_config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vocab_size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# &amp;lt;--- 여기가 수정되었습니다. (50257)
&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model_config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# 3. Test SFTTrainer (Supervised Fine-Tuning)
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;print_status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Testing SFTTrainer (1 training step)...&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;sft_data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&quot;text&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
                    &lt;span class=&quot;s&quot;&gt;&quot;User: Hello&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Assistant: Hi there!&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                    &lt;span class=&quot;s&quot;&gt;&quot;User: Code for me&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Assistant: Sure, here is python code.&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                    &lt;span class=&quot;s&quot;&gt;&quot;User: Bye&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Assistant: Goodbye!&quot;&lt;/span&gt;
                &lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;sft_dataset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Dataset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sft_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tempfile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;TemporaryDirectory&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tmp_dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;sft_config&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SFTConfig&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;output_dir&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tmp_dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;dataset_text_field&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;text&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;max_length&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;per_device_train_batch_size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;max_steps&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;learning_rate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;1e-4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;logging_steps&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;report_to&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;none&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;save_strategy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;no&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;sft_trainer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SFTTrainer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;train_dataset&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sft_dataset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sft_config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;processing_class&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# it was &quot;tokenizer&quot; previously
&lt;/span&gt;        &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;sft_trainer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;print_status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;SFTTrainer executed successfully.&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# 4. Test DPOTrainer (Direct Preference Optimization)
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;print_status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Testing DPOTrainer initialization...&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;dpo_data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&quot;prompt&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Question 1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Question 2&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&quot;chosen&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Good answer 1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Good answer 2&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&quot;rejected&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Bad answer 1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Bad answer 2&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;dpo_dataset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Dataset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dpo_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tempfile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;TemporaryDirectory&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tmp_dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;dpo_config&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DPOConfig&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;output_dir&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tmp_dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;per_device_train_batch_size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;max_steps&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;report_to&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;none&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;learning_rate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;1e-5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;c1&quot;&gt;# DPO용 새 모델 (동일 설정)
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;dpo_model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model_config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;dpo_trainer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DPOTrainer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dpo_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;ref_model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dpo_config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;train_dataset&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dpo_dataset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;processing_class&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# it was &quot;tokenizer&quot; previously
&lt;/span&gt;        &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;dataloader&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dpo_trainer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_train_dataloader&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;next&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dataloader&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;print_status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;DPOTrainer initialized and data processed successfully.&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# 5. Test PPO Integration (Value Head)
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;print_status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Testing PPO ValueHead Model Wrapper...&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;try&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# PPO 모델은 기본 모델 위에 Value Head를 얹는 구조
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;ppo_base_model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model_config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;ppo_model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AutoModelForCausalLMWithValueHead&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_pretrained&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ppo_base_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;hasattr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ppo_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;v_head&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;print_status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;AutoModelForCausalLMWithValueHead successfully attached &apos;v_head&apos;.&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;raise&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;ValueError&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;v_head not found in PPO model.&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;inputs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Test input&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;return_tensors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;pt&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;no_grad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;output&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ppo_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inputs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;print_status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;PPO Model forward pass successful.&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;except&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;PPO Test Failed: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# PPO is tricky with dummy models sometimes, but we proceed if minor error
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;pass&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# 6. Test PEFT Integration (LoRA)
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;print_status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Testing PEFT (LoRA) Integration with TRL...&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;peft_config&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LoraConfig&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;lora_alpha&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;lora_dropout&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.05&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;bias&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;none&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;task_type&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;CAUSAL_LM&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;base_model_peft&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model_config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;peft_model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;get_peft_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;base_model_peft&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;peft_config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;print_status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;PEFT Model created. Trainable params: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;peft_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;print_trainable_parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;print_status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;---------------------------------------------------&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;print_status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;SUCCESS: All TRL components checked thoroughly.&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;print_status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;---------------------------------------------------&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;


&lt;span class=&quot;n&quot;&gt;run_trl_health_check&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h1 id=&quot;3-install-tensorflow&quot;&gt;3. Install Tensorflow&lt;/h1&gt;

&lt;ul&gt;
  &lt;li&gt;you need to install latest version of tensorflow “tf-nightly” with cuda option.&lt;/li&gt;
  &lt;li&gt;Also it doesn’t work with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Unsloth&lt;/code&gt; well.&lt;/li&gt;
  &lt;li&gt;I decided not to install Tensorflow nightly version on system&lt;/li&gt;
  &lt;li&gt;instead, I will use tensorflow in virtualenv&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Create tensorflow env&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;pyenv virtualenv tensroflow-nightly
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;pyenv activate tensroflow-nightly

&lt;span class=&quot;c&quot;&gt;# Install tensorflow nightly version with cuda support&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;pip &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;tf-nightly[and-cuda]&quot;&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# 테스트&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;python &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;import tensorflow as tf; print(tf.config.list_physical_devices(&apos;GPU&apos;))&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;modify ~/.bashrc&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# CUDA for Tensorflow&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;NVIDIA_HOME&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$HOME&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/.pyenv/versions/3.12.10/lib/python3.12/site-packages/nvidia&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# 2. 필요한 라이브러리 경로들을 LD_LIBRARY_PATH에 추가&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$LD_LIBRARY_PATH&lt;/span&gt;:&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;NVIDIA_HOME&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;/cudnn/lib
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$LD_LIBRARY_PATH&lt;/span&gt;:&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;NVIDIA_HOME&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;/cublas/lib
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$LD_LIBRARY_PATH&lt;/span&gt;:&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;NVIDIA_HOME&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;/cufft/lib
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$LD_LIBRARY_PATH&lt;/span&gt;:&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;NVIDIA_HOME&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;/curand/lib
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$LD_LIBRARY_PATH&lt;/span&gt;:&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;NVIDIA_HOME&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;/cusolver/lib
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$LD_LIBRARY_PATH&lt;/span&gt;:&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;NVIDIA_HOME&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;/cusparse/lib
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$LD_LIBRARY_PATH&lt;/span&gt;:&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;NVIDIA_HOME&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;/nccl/lib
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;here’s a bit more complex test.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;time&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;os&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;test_tensorflow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;--- TensorFlow Check ---&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;tensorflow&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;
    
    &lt;span class=&quot;n&quot;&gt;gpus&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;list_physical_devices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;GPU&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gpus&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;✅ GPU Detected: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gpus&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; device(s)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gpu&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gpus&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;   - &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gpu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;device_type&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gpu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            
        &lt;span class=&quot;c1&quot;&gt;# Simple Compute Test
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;/GPU:0&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;start&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matmul&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;numpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# Force execution
&lt;/span&gt;            &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;✅ Matrix Mul (5k x 5k) Time: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;s&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;❌ TensorFlow cannot see the GPU.&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;


&lt;span class=&quot;n&quot;&gt;test_tensorflow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h1 id=&quot;4-install-vllm&quot;&gt;4. Install vLLM&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;WARNING! here we can’t install both torch and vllm at the same time!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;when you need to run, you need to run in virtualenv. 
if you install both torch and vllm, vllm downgrade your torch -&amp;gt; the downgraded torch will not work on RTX 6000 PRO.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Run in virtualenv&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;pyenv virtualenv 3.12.12 vllm
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;pyenv activate vllm
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;pip &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;vllm 

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;vllm serve openai/gpt-oss-20b &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--host&lt;/span&gt; 127.0.0.1 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--port&lt;/span&gt; 8082 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--tensor-parallel-size&lt;/span&gt; 1 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--gpu-memory-utilization&lt;/span&gt; 0.3 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--trust-remote-code&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--async-scheduling&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--max-num-batched-tokens&lt;/span&gt; 8192 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--max-model-len&lt;/span&gt; 35096
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h1 id=&quot;5-stable-diffusion-webui&quot;&gt;5. Stable Diffusion WebUI&lt;/h1&gt;

&lt;ul&gt;
  &lt;li&gt;python: 3.10.17&lt;/li&gt;
  &lt;li&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;STABLE_DIFFUSION_REPO&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;https://github.com/joypaul162/Stability-AI-stablediffusion.git

&lt;span class=&quot;c&quot;&gt;# Install specific setuptools&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;source&lt;/span&gt; ./venv/bin/activate
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;pip &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;https://github.com/openai/CLIP/archive/d50d76daa670286dd6cacf3bcd80b5e4823fc8e1.zip &lt;span class=&quot;nt&quot;&gt;--no-build-isolation&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;pip &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;setuptools&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;==&lt;/span&gt;69.5.1 wheel
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;deactivate 

&lt;span class=&quot;c&quot;&gt;# 설치&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;./webui.sh
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;disk error issue (this is just a personal issue. just skip it)&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;lsblk &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;umount &lt;span class=&quot;nt&quot;&gt;-l&lt;/span&gt; /dev/nvme1n1p2
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;ntfsfix &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt; /dev/nvme1n1p2
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;mount /dev/nvme1n1p2 /media/anderson/HynixP41
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h1 id=&quot;6-continue-on-pycharm&quot;&gt;6. CONTINUE on Pycharm&lt;/h1&gt;

&lt;p&gt;CONTINUE is a plugin for llm in Pycharm.&lt;/p&gt;

&lt;p&gt;config.yaml&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-aiexclude&quot;&gt;name: Local Config
version: 1.0.0
schema: v1
models:
  - name: Llama 3.1 8B
    provider: ollama
    model: llama3.1:8b
    roles:
      - chat
      - edit
      - apply
  - name: Qwen2.5-Coder 1.5B
    provider: ollama
    model: qwen2.5-coder:1.5b-base
    roles:
      - autocomplete
  - name: Nomic Embed
    provider: ollama
    model: nomic-embed-text:latest
    roles:
      - embed
  - name: Qwen3-Coder-30B (Local)
    provider: openai
    model: Qwen/Qwen3-Coder-30B-A3B-Instruct
    apiBase: http://localhost:8045/v1
#    apiKey: my-secret-key
    roles:
      - chat
      - edit
      - apply
      - autocomplete
  - name: Nomic Embed
    provider: ollama
    model: nomic-embed-text:latest
    roles:
      - embed
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;you can create vllm&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;python3 &lt;span class=&quot;nt&quot;&gt;-m&lt;/span&gt; vllm.entrypoints.openai.api_server &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--model&lt;/span&gt; Qwen/Qwen3-Coder-30B-A3B-Instruct &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--dtype&lt;/span&gt; auto &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--tensor-parallel-size&lt;/span&gt; 1 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--gpu-memory-utilization&lt;/span&gt; 0.95 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--trust-remote-code&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--max-model-len&lt;/span&gt; 50000 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--port&lt;/span&gt; 8045
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
</description>
        <pubDate>Thu, 29 Jan 2026 01:00:00 +0000</pubDate>
        <link>http://incredible.ai/format/2026/01/29/Nvidia-RTX-6000-Pro-Blackwell-Setting/</link>
        <guid isPermaLink="true">http://incredible.ai/format/2026/01/29/Nvidia-RTX-6000-Pro-Blackwell-Setting/</guid>
        
        <category>pytorch</category>
        
        <category>tensorflow</category>
        
        <category>cuda</category>
        
        <category>continue</category>
        
        
        <category>format</category>
        
      </item>
    
      <item>
        <title>Precision and Kernel Selection - Only Hopper swizzling is supported for values</title>
        <description>&lt;h1 id=&quot;1-only-hopper-swizzling-is-supported-for-values&quot;&gt;1. Only Hopper swizzling is supported for values&lt;/h1&gt;

&lt;p&gt;Recently I got this error, while running Unsloth with GPT-OSS-120B on Nvidia 6000 Pro blackwell. (workstation).&lt;br /&gt;
This is due to using incorrect dtype.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;unsloth&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FastLanguageModel&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FastLanguageModel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_pretrained&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;model_name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;unsloth/gpt-oss-20b-BF16&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;dtype&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# &quot;bfloat16&quot;, &amp;lt;- it works for Nvidia RTX 6000 PRO Blackwell.
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;max_seq_length&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;load_in_4bit&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;full_finetuning&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;low_cpu_mem_usage&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;device_map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;cuda&quot;&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Explicitly load to CUDA
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;When I run the inference code, it raises an error, saying “only Hopper swizzling is supported.”
This is because the runtime entered &lt;strong&gt;Hopper-only kernel path&lt;/strong&gt; &lt;br /&gt; 
but my GPU (Nvidia RTX 6000 PRO Blackwell) does not support Hopper Kernel.&lt;br /&gt;
in short, the safest workaround is to avoid “MXFP4” and use “BF16”.&lt;/p&gt;

&lt;h1 id=&quot;2-kernel-selection-flow&quot;&gt;2. Kernel Selection Flow&lt;/h1&gt;

&lt;p&gt;This is a simplified internal routing view.&lt;br /&gt;
The actual checks are more granular and framework-specific.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Kernel Selection
├─ Op: GEMM / Linear / Matmul
│  ├─ Check device capability (SM, architecture)
│  │  ├─ Hopper? -&amp;gt; allow TMA/Swizzle paths
│  │  ├─ Blackwell? -&amp;gt; allow FP4/NVFP4 paths
│  │  └─ Other -&amp;gt; generic Tensor Core or CUDA paths
│  ├─ Check precision / format
│  │  ├─ FP32/TF32 -&amp;gt; cuBLAS / TF32-enabled GEMM
│  │  ├─ BF16/FP16 -&amp;gt; Tensor Core GEMM
│  │  ├─ FP8 -&amp;gt; FP8 kernels (often Hopper-optimized)
│  │  ├─ FP4/NVFP4 -&amp;gt; FP4 kernels (often Blackwell-optimized)
│  │  └─ MXFP4 -&amp;gt; MXFP4 kernels (specialized, high constraints)
│  ├─ Check library availability
│  │  ├─ Triton kernel exists? -&amp;gt; Triton path
│  │  ├─ CUTLASS/TensorRT path? -&amp;gt; vendor path
│  │  └─ Fallback -&amp;gt; cuBLAS / default GEMM
│  └─ Check runtime flags
│     ├─ load_in_4bit / quant config -&amp;gt; quantized kernel
│     ├─ use_cache -&amp;gt; cache-aware kernel
│     └─ debug/disable flags -&amp;gt; safe fallback
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Practical takeaway: &lt;strong&gt;selection is multi-stage&lt;/strong&gt;.&lt;br /&gt;
If any stage assumes unsupported hardware, compilation can fail early.&lt;/p&gt;

&lt;h1 id=&quot;3-precision&quot;&gt;3. Precision&lt;/h1&gt;

&lt;p&gt;Depending on the precision (BF16/FP16/FP8/MXFP4/etc…), it selects different kernels.&lt;br /&gt;
Here I summarized the precisions and kernels.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Precision/Format
├─ FP32
│  └─ GEMM (CUDA/cuBLAS default, dtype=&quot;float32&quot;)
├─ TF32
│  └─ GEMM (Tensor Core path, dtype=&quot;float32&quot; + TF32 enabled)
├─ FP16
│  └─ GEMM (Tensor Core path, dtype=&quot;float16&quot;)
├─ BF16
│  └─ GEMM (Tensor Core path, dtype=&quot;bfloat16&quot;, stable default)
├─ FP8 (E4M3/E5M2)
│  ├─ GEMM (FP8 kernels, dtype=&quot;float8_e4m3fn&quot;/&quot;float8_e5m2&quot;, Hopper-optimized)
│  └─ Swizzle/TMA (Hopper-only optimization)
├─ FP4 (NVFP4)
│  ├─ GEMM (FP4 kernels, NVFP4 format, dtype=&quot;float4&quot;/&quot;nvfp4&quot;, Blackwell-centered)
│  └─ Micro-tensor scaling (Blackwell-only flavor)
├─ MXFP4
│  ├─ GEMM (MXFP4 kernels, MXFP4 format)
│  └─ Swizzle/TMA (often assumes Hopper)
└─ INT8/INT4/NF4
   ├─ GEMM (int kernels, dtype=&quot;int8&quot;/&quot;int4&quot;, includes NF4)
   └─ Dequant (scale restore path)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
</description>
        <pubDate>Sun, 04 Jan 2026 01:00:00 +0000</pubDate>
        <link>http://incredible.ai/format/2026/01/04/Precision/</link>
        <guid isPermaLink="true">http://incredible.ai/format/2026/01/04/Precision/</guid>
        
        <category>unsloth</category>
        
        <category>triton</category>
        
        <category>dtype</category>
        
        
        <category>format</category>
        
      </item>
    
      <item>
        <title>Unsloth - Lora Fine-Tuning Hyperparameters</title>
        <description>&lt;h1 id=&quot;1-lora-핵심-개념&quot;&gt;1. LoRA 핵심 개념&lt;/h1&gt;

&lt;h2 id=&quot;11-문제-full-fine-tuning&quot;&gt;1.1 문제: Full Fine-Tuning&lt;/h2&gt;

&lt;p&gt;Full Fine-Tuning은 모든 파라미터 \(W \in \mathbb{R}^{d \times k}\)를 업데이트한다.&lt;/p&gt;

\[W&apos; = W_0 + \Delta W\]

&lt;p&gt;&lt;strong&gt;문제점&lt;/strong&gt;: 70B 모델 기준 \(\Delta W\)만 해도 140GB+ 메모리 필요.&lt;/p&gt;

&lt;h2 id=&quot;12-lora의-핵심&quot;&gt;1.2 LoRA의 핵심&lt;/h2&gt;

&lt;blockquote&gt;
  &lt;p&gt;Fine-tuning시 weight 변화량 \(\Delta W\)는 &lt;strong&gt;Low-Rank&lt;/strong&gt; 구조를 가진다. &lt;br /&gt;
&lt;strong&gt;Low-Rank = 압축 가능하다&lt;/strong&gt;는 뜻&lt;br /&gt;
JPEG 압축처럼 원본 10MB → 500KB로 줄여도 품질이 비슷한 이유는, &lt;br /&gt; 
정보에 &lt;strong&gt;중복과 패턴&lt;/strong&gt;이 있기 때문.&lt;/p&gt;

  &lt;p&gt;LLM Fine-tuning도 마찬가지. &lt;br /&gt;
연구 결과, weight 변화량이 엄청 복잡하게 변하는 게 아니라 &lt;br /&gt; 
&lt;strong&gt;몇 개의 주요 방향으로만 변한다&lt;/strong&gt;는 것을 발견 
이미 언어를 잘 아는 LLM에게 “의료 용어 좀 더 잘 알아듣게” 같은 미세 조정만 하면 되기 때문&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;즉, \(\Delta W\)를 두 개의 작은 행렬로 분해 가능:&lt;/p&gt;

\[\Delta W = BA\]

\[\begin{align}
B &amp;amp;\in \mathbb{R}^{d \times r} \\
A &amp;amp;\in \mathbb{R}^{r \times k} \\
r &amp;amp;\ll \min(d, k)
\end{align}\]

&lt;h2 id=&quot;13-파라미터-비교&quot;&gt;1.3 파라미터 비교&lt;/h2&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: center&quot;&gt; &lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Full Fine-Tuning&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;LoRA (r=8)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;파라미터 수&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;\(d \times k\)&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;\(d \times r + r \times k\)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;예시 (4096×4096)&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;16.7M&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;65K&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;&lt;strong&gt;압축률&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;1x&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;&lt;strong&gt;~256x&lt;/strong&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;2-수학적-구조&quot;&gt;2. 수학적 구조&lt;/h1&gt;

&lt;h2 id=&quot;21-forward-pass&quot;&gt;2.1 Forward Pass&lt;/h2&gt;

\[h = W_0 x + \frac{\alpha}{r} \cdot BAx\]

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;
Input x
    │
    ├─────────────────┐
    ▼                 ▼
┌───────┐         ┌───────┐
│  W₀   │         │   A   │ (r × k)
│frozen │         └───┬───┘
└───┬───┘             ▼
    │             ┌───────┐
    │             │   B   │ (d × r)
    │             └───┬───┘
    │                 │ × (α/r)
    ▼                 ▼
    └────────(+)──────┘
              │
              ▼
           Output h

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;matrix-크기-예제&quot;&gt;Matrix 크기 예제&lt;/h3&gt;

&lt;p&gt;구체적인 숫자로 이해해보자. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;d=64, k=32, r=4&lt;/code&gt;로 설정한 경우:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;행렬&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;크기&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;파라미터 수&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;설명&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;\(W_0\)&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;64 × 32&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;2,048&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;원본 weight (frozen)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;\(A\)&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;4 × 32&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;128&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;LoRA down-projection&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;\(B\)&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;64 × 4&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;256&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;LoRA up-projection&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;\(\Delta W = BA\)&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;64 × 32&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;-&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;\(W_0\)와 같은 크기로 복원&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;strong&gt;LoRA 파라미터&lt;/strong&gt;: \(A + B = 128 + 256 = 384\) (Full의 &lt;strong&gt;18.75%&lt;/strong&gt;)&lt;/p&gt;

&lt;h2 id=&quot;22-초기화&quot;&gt;2.2 초기화&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;A&lt;/strong&gt;: Kaiming/Gaussian 초기화&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;B&lt;/strong&gt;: &lt;strong&gt;Zero 초기화&lt;/strong&gt; → 학습 시작 시 \(\Delta W = BA = 0\)&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;init&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kaiming_uniform_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lora_A&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;math&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sqrt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;init&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;zeros_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lora_B&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# 핵심: 0으로 시작
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;23-scaling-factor-α&quot;&gt;2.3 Scaling Factor α&lt;/h2&gt;

\[\text{scaling} = \frac{\alpha}{r}\]

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;r&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;α&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;α/r&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;효과&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;8&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;8&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;1.0&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;기본&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;8&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;16&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;2.0&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;LoRA 효과 2배&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;16&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;16&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;1.0&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;rank↑, 스케일 유지&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;3-적용-위치-target-modules&quot;&gt;3. 적용 위치 (Target Modules)&lt;/h1&gt;

&lt;p&gt;Transformer에서 LoRA 적용 가능한 위치:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[Attention]           [MLP (SwiGLU)]
├── q_proj  ✓         ├── gate_proj  ✓
├── k_proj  ✓         ├── up_proj    ✓
├── v_proj  ✓         └── down_proj  ✓
└── o_proj  ✓
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;권장&lt;/strong&gt;: 전부 적용 (Unsloth 기본값)&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;4-hyperparameters-정리&quot;&gt;4. Hyperparameters 정리&lt;/h1&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Parameter&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;권장값&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;설명&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;r&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;16~64&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;rank. 높을수록 표현력↑, 메모리↑&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lora_alpha&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;r과 동일&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;scaling = α/r&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lora_dropout&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;필요시 0.05&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;target_modules&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;all&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Attention + MLP 모두&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;5-qlora&quot;&gt;5. QLoRA&lt;/h1&gt;

&lt;p&gt;Base model을 &lt;strong&gt;4-bit 양자화&lt;/strong&gt; + LoRA:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Base: 4-bit (frozen)&lt;/li&gt;
  &lt;li&gt;LoRA adapters: 16-bit (trainable)&lt;/li&gt;
&lt;/ul&gt;

\[\text{메모리}: 140\text{GB} \rightarrow \sim24\text{GB}\]

&lt;hr /&gt;

&lt;h1 id=&quot;6-unsloth-code&quot;&gt;6. Unsloth Code&lt;/h1&gt;

&lt;p&gt;실제 Text-to-SQL 파인튜닝을 위한 코드 구현입니다.&lt;br /&gt; 
앞서 다룬 LoRA Hyperparameters가 코드에 어떻게 적용되는지 확인합니다.&lt;/p&gt;

&lt;h2 id=&quot;61-model--lora-configuration&quot;&gt;6.1 Model &amp;amp; LoRA Configuration&lt;/h2&gt;

&lt;p&gt;Scaling Factor($\frac{\alpha}{r}$)를 설정하는 단계입니다.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Initialize Model with LoRA settings
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GptOssModel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;model_config&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ModelConfig&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;model_name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;unsloth/gpt-oss-20b&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;max_seq_length&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4096&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;lora_config&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LoRAConfig&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;           &lt;span class=&quot;c1&quot;&gt;# Rank (r): SQL 로직 학습을 위해 8보다 높은 16 설정
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;lora_alpha&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;# Alpha (α): Scaling factor = 32/16 = 2.0
&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;ul&gt;
  &lt;li&gt;r=16: SQL 쿼리 생성과 같은 복잡한 논리 구조를 학습하기 위해 기본값(8)보다 Rank를 높여 표현력(Expressiveness)을 확보&lt;/li&gt;
  &lt;li&gt;lora_alpha=32: $\Delta W$의 영향력을 2배로 설정하여($\text{scaling}=2.0$), 새로운 데이터셋(SQL)의 특징을 더 강하게 반영&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;62-trainer-configuration-sft&quot;&gt;6.2 Trainer Configuration (SFT)&lt;/h2&gt;

&lt;p&gt;Unsloth의 장점인 메모리 효율성을 극대화하기 위한 SFTTrainer 설정입니다.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;trainer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SFTTrainer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;processing_class&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;train_dataset&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;train_dataset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SFTConfig&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# Memory Optimization
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;per_device_train_batch_size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;gradient_accumulation_steps&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Effective Batch Size = 8 * 32 = 256
&lt;/span&gt;        
        &lt;span class=&quot;c1&quot;&gt;# Optimizer &amp;amp; Precision
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;optim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;adamw_8bit&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;              &lt;span class=&quot;c1&quot;&gt;# Optimizer State 메모리 절약
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;fp16&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;bf16&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;                       &lt;span class=&quot;c1&quot;&gt;# Ampere(RTX 30/40/6000) 이상에서 필수
&lt;/span&gt;        
        &lt;span class=&quot;c1&quot;&gt;# Learning Rate Schedule
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;learning_rate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;2e-4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;warmup_steps&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;max_steps&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;output_dir&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;outputs&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;gradient_accumulation_steps=32:
    &lt;ul&gt;
      &lt;li&gt;물리적 메모리 한계로 Batch Size를 작게(8) 가져가는 대신, 32번의 step 동안 gradient를 누적해 업데이트&lt;/li&gt;
      &lt;li&gt;결과적으로 대용량 배치(256)로 학습하는 것과 유사한 수렴 안정성을 확보&lt;/li&gt;
      &lt;li&gt;per_device_train_batch_size=8 이게 실제 batch size&lt;/li&gt;
      &lt;li&gt;수식: \(\text{Effective Batch Size} = \text{Micro Batch Size} \times \text{Accumulation Steps} \times \text{Num GPUs}\)
        &lt;ul&gt;
          &lt;li&gt;
\[\text{Total Batch Size} = 8 \times 32 \times 1 = \mathbf{256}\]
          &lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;메모리는 배치 8만큼만 쓰면서, 학습 효과는 배치 256인 것처럼 낼 수 있음&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;optim=”adamw_8bit”:
    &lt;ul&gt;
      &lt;li&gt;일반 AdamW(32-bit) 대비 Optimizer state가 차지하는 VRAM을 1/4 수준으로 줄여 OOM(Out of Memory)을 방지&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;bf16=True:
    &lt;ul&gt;
      &lt;li&gt;FP16보다 표현 가능한 수의 범위(Dynamic Range)가 넓어 학습 중 발산(NaN)할 확률이 낮습니다.&lt;/li&gt;
      &lt;li&gt;RTX 6000 Pro 환경에 최적화&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;63-training--saving&quot;&gt;6.3 Training &amp;amp; Saving&lt;/h2&gt;

&lt;p&gt;전체 파라미터($W$)가 아닌 LoRA Adapter($A, B$)만 학습을 진행합니다.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Start Training (Updates only A and B matrices)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;trainer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Save LoRA Adapters
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;save&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;./text2sql_lora_model&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;ul&gt;
  &lt;li&gt;학습이 끝나면 원본 모델(GB 단위)은 그대로 두고, 학습된 &lt;strong&gt;LoRA weight(MB 단위)&lt;/strong&gt;만 저장합니다.&lt;/li&gt;
  &lt;li&gt;추론 시에는 원본 모델에 이 Adapter를 동적으로 로드하여 사용하게 됩니다.&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Sat, 03 Jan 2026 01:00:00 +0000</pubDate>
        <link>http://incredible.ai/unsloth/2026/01/03/Lora_Finetuning/</link>
        <guid isPermaLink="true">http://incredible.ai/unsloth/2026/01/03/Lora_Finetuning/</guid>
        
        <category>lora</category>
        
        <category>fine-tuning</category>
        
        <category>unsloth</category>
        
        <category>llm</category>
        
        
        <category>unsloth</category>
        
      </item>
    
      <item>
        <title>Unsloth - Model Visualization</title>
        <description>&lt;h1 id=&quot;model-visualization&quot;&gt;Model Visualization&lt;/h1&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;transformers&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BatchEncoding&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TextStreamer&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;unsloth&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FastLanguageModel&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FastLanguageModel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_pretrained&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;model_name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;unsloth/gpt-oss-20b&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;dtype&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# torch.bfloat16,  # None for auto detection
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;max_seq_length&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;load_in_4bit&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;full_finetuning&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;low_cpu_mem_usage&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;device_map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;cuda&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Explicitly load to CUDA
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;VIsualization Codes&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;torch&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;rich.tree&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tree&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;rich&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rprint&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;visualize_model_structure&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# 1. Create root node
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;tree&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tree&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;🏗️ [bold blue]Model: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;getattr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;_name_or_path&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;Unknown&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;[/bold blue]&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;# Dictionary to keep track of created nodes: {path_string: rich_tree_node}
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;node_lookup&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tree&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;module&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;named_modules&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;continue&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Split path: &apos;model.layers.0.self_attn&apos; -&amp;gt; [&apos;model&apos;, &apos;layers&apos;, &apos;0&apos;, &apos;self_attn&apos;]
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;parts&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;.&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;parent_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;.&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[:&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;current_part&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;parts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

        &lt;span class=&quot;c1&quot;&gt;# Calculate Size Info
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# Get parameter count for this specific module
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;params_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;numel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;recurse&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Get shape if it&apos;s a leaf layer (like Linear or Embedding)
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;shape_info&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;&quot;&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;hasattr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;weight&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;isinstance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;shape_info&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot; [yellow](&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;)[/yellow]&quot;&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;elif&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;params_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;shape_info&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot; [green](&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;params_count&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; params)[/green]&quot;&lt;/span&gt;

        &lt;span class=&quot;c1&quot;&gt;# 2. Find or Create Node
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;parent_path&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;node_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;parent_node&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;node_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parent_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
            &lt;span class=&quot;c1&quot;&gt;# Add new node with style and size info
&lt;/span&gt;            &lt;span class=&quot;n&quot;&gt;new_node&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;parent_node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;[bold magenta]&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;current_part&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;[/bold magenta]&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape_info&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;node_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;new_node&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;rprint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tree&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Execution
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;visualize_model_structure&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;here’s the result&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;🏗️ Model: unsloth/gpt-oss-20b
├── model
│   ├── embed_tokens &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;201088, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   ├── layers
│   │   ├── 0
│   │   │   ├── self_attn &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;64 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── q_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;4096, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── k_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── v_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── o_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880, 4096]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── mlp
│   │   │   │   ├── router &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;32, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── experts &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;796,538,880 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── input_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   └── post_attention_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   ├── 1
│   │   │   ├── self_attn &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;64 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── q_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;4096, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── k_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── v_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── o_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880, 4096]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── mlp
│   │   │   │   ├── router &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;32, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── experts &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;796,538,880 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── input_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   └── post_attention_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   ├── 2
│   │   │   ├── self_attn &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;64 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── q_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;4096, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── k_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── v_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── o_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880, 4096]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── mlp
│   │   │   │   ├── router &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;32, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── experts &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;796,538,880 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── input_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   └── post_attention_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   ├── 3
│   │   │   ├── self_attn &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;64 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── q_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;4096, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── k_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── v_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── o_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880, 4096]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── mlp
│   │   │   │   ├── router &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;32, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── experts &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;796,538,880 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── input_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   └── post_attention_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   ├── 4
│   │   │   ├── self_attn &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;64 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── q_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;4096, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── k_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── v_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── o_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880, 4096]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── mlp
│   │   │   │   ├── router &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;32, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── experts &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;796,538,880 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── input_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   └── post_attention_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   ├── 5
│   │   │   ├── self_attn &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;64 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── q_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;4096, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── k_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── v_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── o_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880, 4096]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── mlp
│   │   │   │   ├── router &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;32, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── experts &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;796,538,880 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── input_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   └── post_attention_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   ├── 6
│   │   │   ├── self_attn &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;64 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── q_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;4096, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── k_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── v_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── o_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880, 4096]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── mlp
│   │   │   │   ├── router &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;32, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── experts &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;796,538,880 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── input_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   └── post_attention_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   ├── 7
│   │   │   ├── self_attn &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;64 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── q_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;4096, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── k_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── v_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── o_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880, 4096]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── mlp
│   │   │   │   ├── router &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;32, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── experts &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;796,538,880 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── input_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   └── post_attention_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   ├── 8
│   │   │   ├── self_attn &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;64 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── q_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;4096, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── k_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── v_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── o_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880, 4096]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── mlp
│   │   │   │   ├── router &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;32, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── experts &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;796,538,880 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── input_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   └── post_attention_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   ├── 9
│   │   │   ├── self_attn &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;64 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── q_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;4096, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── k_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── v_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── o_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880, 4096]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── mlp
│   │   │   │   ├── router &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;32, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── experts &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;796,538,880 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── input_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   └── post_attention_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   ├── 10
│   │   │   ├── self_attn &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;64 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── q_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;4096, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── k_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── v_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── o_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880, 4096]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── mlp
│   │   │   │   ├── router &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;32, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── experts &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;796,538,880 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── input_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   └── post_attention_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   ├── 11
│   │   │   ├── self_attn &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;64 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── q_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;4096, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── k_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── v_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── o_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880, 4096]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── mlp
│   │   │   │   ├── router &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;32, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── experts &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;796,538,880 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── input_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   └── post_attention_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   ├── 12
│   │   │   ├── self_attn &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;64 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── q_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;4096, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── k_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── v_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── o_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880, 4096]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── mlp
│   │   │   │   ├── router &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;32, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── experts &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;796,538,880 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── input_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   └── post_attention_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   ├── 13
│   │   │   ├── self_attn &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;64 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── q_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;4096, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── k_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── v_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── o_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880, 4096]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── mlp
│   │   │   │   ├── router &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;32, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── experts &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;796,538,880 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── input_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   └── post_attention_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   ├── 14
│   │   │   ├── self_attn &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;64 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── q_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;4096, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── k_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── v_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── o_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880, 4096]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── mlp
│   │   │   │   ├── router &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;32, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── experts &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;796,538,880 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── input_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   └── post_attention_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   ├── 15
│   │   │   ├── self_attn &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;64 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── q_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;4096, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── k_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── v_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── o_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880, 4096]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── mlp
│   │   │   │   ├── router &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;32, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── experts &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;796,538,880 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── input_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   └── post_attention_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   ├── 16
│   │   │   ├── self_attn &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;64 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── q_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;4096, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── k_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── v_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── o_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880, 4096]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── mlp
│   │   │   │   ├── router &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;32, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── experts &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;796,538,880 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── input_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   └── post_attention_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   ├── 17
│   │   │   ├── self_attn &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;64 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── q_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;4096, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── k_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── v_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── o_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880, 4096]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── mlp
│   │   │   │   ├── router &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;32, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── experts &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;796,538,880 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── input_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   └── post_attention_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   ├── 18
│   │   │   ├── self_attn &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;64 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── q_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;4096, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── k_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── v_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── o_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880, 4096]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── mlp
│   │   │   │   ├── router &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;32, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── experts &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;796,538,880 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── input_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   └── post_attention_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   ├── 19
│   │   │   ├── self_attn &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;64 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── q_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;4096, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── k_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── v_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── o_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880, 4096]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── mlp
│   │   │   │   ├── router &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;32, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── experts &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;796,538,880 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── input_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   └── post_attention_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   ├── 20
│   │   │   ├── self_attn &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;64 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── q_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;4096, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── k_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── v_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── o_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880, 4096]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── mlp
│   │   │   │   ├── router &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;32, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── experts &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;796,538,880 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── input_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   └── post_attention_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   ├── 21
│   │   │   ├── self_attn &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;64 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── q_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;4096, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── k_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── v_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── o_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880, 4096]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── mlp
│   │   │   │   ├── router &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;32, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── experts &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;796,538,880 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── input_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   └── post_attention_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   ├── 22
│   │   │   ├── self_attn &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;64 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── q_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;4096, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── k_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   ├── v_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── o_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880, 4096]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── mlp
│   │   │   │   ├── router &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;32, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   │   └── experts &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;796,538,880 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   ├── input_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   │   └── post_attention_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │   └── 23
│   │       ├── self_attn &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;64 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │       │   ├── q_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;4096, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │       │   ├── k_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │       │   ├── v_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;512, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │       │   └── o_proj &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880, 4096]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │       ├── mlp
│   │       │   ├── router &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;32, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │       │   └── experts &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;796,538,880 params&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │       ├── input_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   │       └── post_attention_layernorm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   ├── norm &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│   └── rotary_emb
└── lm_head &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;201088, 2880]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
</description>
        <pubDate>Fri, 02 Jan 2026 01:00:00 +0000</pubDate>
        <link>http://incredible.ai/unsloth/2026/01/02/Unsloth_Model_visualization/</link>
        <guid isPermaLink="true">http://incredible.ai/unsloth/2026/01/02/Unsloth_Model_visualization/</guid>
        
        
        <category>unsloth</category>
        
      </item>
    
      <item>
        <title>Unsloth - Odds Ratio Preference Optimization (ORPO)</title>
        <description>&lt;h1 id=&quot;1-what-is-post-training&quot;&gt;1. What is Post Training&lt;/h1&gt;

&lt;p&gt;Pre-training을 통해서 언어 자체를 배웁니다. &lt;br /&gt;
간단하게 설명하면, GPT계열 (Llama 3, Mistral, Gemma) 등은, 앞의 문장을 보고 다음 단어를 맞추는 방식으로 학습합니다.
보통 생성형 계열에서 많이 합니다. &lt;br /&gt;
Masked Language Modeling 방식은 빈칸 채우기 인데 Bert, RoBERTa 계열에서 많이 쓰입니다. &lt;br /&gt;
문장의 의미를 파악하거나, 분류하는데 좋지만, 긴 글을 지어내는 능력은 떨어집니다.&lt;/p&gt;

&lt;p&gt;Post Training 은 이렇게 언어 자체에 대해서 학습한 모델에 대해서&lt;br /&gt;
원하는 목적에 맞게 튜닝하는 과정입니다.&lt;/p&gt;

&lt;h2 id=&quot;11-supervised-fine-tuning&quot;&gt;1.1 Supervised Fine Tuning&lt;/h2&gt;

&lt;p&gt;초기 이렇게 Post Training을 했습니다.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Supervised Fine-Tuning (SFT)
    &lt;ul&gt;
      &lt;li&gt;Instruction (질문) 과 Response (모범 답안) 쌍으로 이루어진 데이터로 학습&lt;/li&gt;
      &lt;li&gt;Next token prediction 을 사용하되, 데이터 품질과 도메인 지식을 따라하도록 학습&lt;/li&gt;
      &lt;li&gt;Limitation: 해보면, 흉내는 냄. 근데 이게 안전한지, 유용한지에 대한 가치 판단을 하지 못함,&lt;/li&gt;
      &lt;li&gt;Hallucination 심함&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;12-llm-alignment&quot;&gt;1.2 LLM Alignment&lt;/h2&gt;

&lt;p&gt;이전에는 지식을 그냥 넣었다면, 사람이 원하는 목적에 맞게 다듬는 과정&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;[1세대] Reinforcement Learning from Human Feedback (RLHF)
    &lt;ul&gt;
      &lt;li&gt;ChatGPT 에서 만든 정석 방법&lt;/li&gt;
      &lt;li&gt;사람이 매긴 점수를 바탕으로 보상 모델 (Reward Model) 을 만들고, 이를 통해 강화학습 (PPO) 수행&lt;/li&gt;
      &lt;li&gt;성능은 정말 압도적. 실제 인간과 대화하는 듯한 자연스러운 성능&lt;/li&gt;
      &lt;li&gt;Limitation
        &lt;ul&gt;
          &lt;li&gt;학습 과정 자체가 너무 복잡&lt;/li&gt;
          &lt;li&gt;GPU 메모리를 많이 사용 -&amp;gt; 개인이 시도하기에는 어려움&lt;/li&gt;
          &lt;li&gt;학습시 모4개의 모델을 메모리에 올려야 함
            &lt;ul&gt;
              &lt;li&gt;Policy Model (학습 대상)&lt;/li&gt;
              &lt;li&gt;Reference Model (비교 대상)&lt;/li&gt;
              &lt;li&gt;Reward Model (점수 매김)&lt;/li&gt;
              &lt;li&gt;Critical Model (가치 판단 - PPO 내부용)&lt;/li&gt;
            &lt;/ul&gt;
          &lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;[2세대] Direct Preference Optimization (DPO)
    &lt;ul&gt;
      &lt;li&gt;선호하는 답변 (Choosen), 거부된 답변 (Rejected) 쌍을 이용해서 학습&lt;/li&gt;
      &lt;li&gt;현재 업계 표준. 학습이 훨씬 안정적으로 빠름&lt;/li&gt;
      &lt;li&gt;Limitation
        &lt;ul&gt;
          &lt;li&gt;여전히 SFT 가 선행되야 함 (2-stage)&lt;/li&gt;
          &lt;li&gt;학습시 2개의 모델이 필요
            &lt;ul&gt;
              &lt;li&gt;Policy Model (실제 학습하는 모델)&lt;/li&gt;
              &lt;li&gt;Reference Model (비교 대상)&lt;/li&gt;
            &lt;/ul&gt;
          &lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;[3세대] Odds Ratio Preference Optimization (ORPO)
    &lt;ul&gt;
      &lt;li&gt;SFT + Alignment 를 하나로 합쳤음 (1-Stage)&lt;/li&gt;
      &lt;li&gt;Reference Model 이 필요없음&lt;/li&gt;
      &lt;li&gt;한정된 자원으로 (GPU VRAM) 으로 학습시킬때 이 방식이 최고&lt;/li&gt;
      &lt;li&gt;Loss Function 안에서 직접 Rejected Answer에 대한 Penalty를 사용&lt;/li&gt;
      &lt;li&gt;Unsloth 추천&lt;/li&gt;
      &lt;li&gt;Limitation
        &lt;ul&gt;
          &lt;li&gt;SFT + Alignment를 동시에 하기 때문에 -&amp;gt; Chosen 답변이 정말 모범 답안이어야 함&lt;/li&gt;
          &lt;li&gt;수렴이 잘 안되기도 함, hyperparameter에 따라서 학습 불안정&lt;/li&gt;
          &lt;li&gt;&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;[4세대] Group Relative Policy Optimization (GRPO)
    &lt;ul&gt;
      &lt;li&gt;DeepSeek-V3 그리고 R1 으로 대중화 됨&lt;/li&gt;
      &lt;li&gt;RLHF의 강력한 성능은 유지&lt;/li&gt;
      &lt;li&gt;하나의 질문에 여러개의 답변 &lt;strong&gt;그룹&lt;/strong&gt;을 생성 -&amp;gt; 그 안에서 상대적인 점수를 매김&lt;/li&gt;
      &lt;li&gt;평균보다 잘한 답변은 강화, 못한 답변은 멀어지게 만듬&lt;/li&gt;
      &lt;li&gt;메모리 효율
        &lt;ul&gt;
          &lt;li&gt;RLHF처럼 4개의 모델을 띄울 필요 X&lt;/li&gt;
          &lt;li&gt;Policy Model 1개 학습시 필요&lt;/li&gt;
          &lt;li&gt;비교를 위해서 Reference Model을 vLLM등으로 따로 띄워서 사용 (Optional)&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;Limitation
        &lt;ul&gt;
          &lt;li&gt;Group Size 에 따른 VRAM 부담 -&amp;gt; 8개 16개 답변을 동시 생성해야됨 (많은 자원 소모)&lt;/li&gt;
          &lt;li&gt;명확한 보상 함수 필요
            &lt;ul&gt;
              &lt;li&gt;친절한 말투 -&amp;gt; 이런 모호한건 학습 잘 안됨&lt;/li&gt;
              &lt;li&gt;수학문제처럼 명확한거를 잘함&lt;/li&gt;
            &lt;/ul&gt;
          &lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;가장 중요한 메모리 필요 부분을 정리하면&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;RLHF: 🟥🟥🟥🟥 (모델 4개 분량) - Out of Memory!&lt;/li&gt;
  &lt;li&gt;DPO: 🟦🟦 (모델 2개 분량) - Heavy&lt;/li&gt;
  &lt;li&gt;ORPO: 🟩 (모델 1개 분량) - Lightweight &amp;amp; Fast&lt;/li&gt;
  &lt;li&gt;GRPO: 🟨🟨 (모델 1~2개 수준 - RLHF 대비 훨씬 가벼움)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
  &lt;p&gt;해당 문서에서는 ORPO 를 기술적 방법을 설명합니다.&lt;/p&gt;
&lt;/blockquote&gt;
</description>
        <pubDate>Thu, 01 Jan 2026 01:00:00 +0000</pubDate>
        <link>http://incredible.ai/unsloth/2026/01/01/Odds-Ratio-Preference-Optimization/</link>
        <guid isPermaLink="true">http://incredible.ai/unsloth/2026/01/01/Odds-Ratio-Preference-Optimization/</guid>
        
        
        <category>unsloth</category>
        
      </item>
    
      <item>
        <title>K3s + Nvidia Container Toolkit + vLLM</title>
        <description>&lt;h1 id=&quot;1-installation&quot;&gt;1. Installation&lt;/h1&gt;

&lt;h2 id=&quot;11-install-k3s&quot;&gt;1.1 Install K3s&lt;/h2&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;curl &lt;span class=&quot;nt&quot;&gt;-sfL&lt;/span&gt; https://get.k3s.io | sh -

&lt;span class=&quot;c&quot;&gt;# 이후 설정&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;mkdir&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; ~/.kube
&lt;span class=&quot;nb&quot;&gt;sudo cp&lt;/span&gt; /etc/rancher/k3s/k3s.yaml ~/.kube/config
&lt;span class=&quot;nb&quot;&gt;sudo chown&lt;/span&gt; &lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-u&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;:&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-g&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt; ~/.kube/config
&lt;span class=&quot;nb&quot;&gt;chmod &lt;/span&gt;600 ~/.kube/config
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;KUBECONFIG&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$HOME&lt;/span&gt;/.kube/config
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;.bashrc 에 저장&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# K3s&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;KUBECONFIG&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$HOME&lt;/span&gt;/.kube/config
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;12-install-helm&quot;&gt;1.2 Install Helm&lt;/h2&gt;

&lt;p&gt;helm은 package manager로서 apt, yum, brew 같은 것&lt;/p&gt;

&lt;p&gt;Ubuntu 설치시&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Snap 으로 설치시&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;snap &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;helm &lt;span class=&quot;nt&quot;&gt;--classic&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Apt 로 설치시 (위에 걸로 하면됨)&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;curl gpg apt-transport-https &lt;span class=&quot;nt&quot;&gt;--yes&lt;/span&gt;
curl &lt;span class=&quot;nt&quot;&gt;-fsSL&lt;/span&gt; https://packages.buildkite.com/helm-linux/helm-debian/gpgkey | gpg &lt;span class=&quot;nt&quot;&gt;--dearmor&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;sudo tee&lt;/span&gt; /usr/share/keyrings/helm.gpg &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; /dev/null
&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;deb [signed-by=/usr/share/keyrings/helm.gpg] https://packages.buildkite.com/helm-linux/helm-debian/any/ any main&quot;&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;sudo tee&lt;/span&gt; /etc/apt/sources.list.d/helm-stable-debian.list
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get update
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;helm
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;설치 확인&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;helm version
helm list
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;13-nvidia-container-tookit&quot;&gt;1.3 Nvidia Container Tookit&lt;/h2&gt;

&lt;blockquote&gt;
  &lt;p&gt;K3s는 기본적으로 containerd를 사용합니다. &lt;br /&gt;
따라서 “Docker 컨테이너에서 GPU 사용”과 “K3s Pod에서 GPU 사용”은 설정 위치가 다릅니다.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Install prerequisites&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get update &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--no-install-recommends&lt;/span&gt; curl gnupg2

&lt;span class=&quot;c&quot;&gt;# Configure the production repository&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;curl &lt;span class=&quot;nt&quot;&gt;-fsSL&lt;/span&gt; https://nvidia.github.io/libnvidia-container/gpgkey | &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;gpg &lt;span class=&quot;nt&quot;&gt;--dearmor&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; curl &lt;span class=&quot;nt&quot;&gt;-s&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-L&lt;/span&gt; https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;sed&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;s#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g&apos;&lt;/span&gt; | &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;sudo tee&lt;/span&gt; /etc/apt/sources.list.d/nvidia-container-toolkit.list

&lt;span class=&quot;c&quot;&gt;# Install&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get update

&lt;span class=&quot;c&quot;&gt;# 설치&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;NVIDIA_CONTAINER_TOOLKIT_VERSION&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1.18.1-1
  &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
      nvidia-container-toolkit&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;NVIDIA_CONTAINER_TOOLKIT_VERSION&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
      nvidia-container-toolkit-base&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;NVIDIA_CONTAINER_TOOLKIT_VERSION&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
      libnvidia-container-tools&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;NVIDIA_CONTAINER_TOOLKIT_VERSION&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
      libnvidia-container1&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;NVIDIA_CONTAINER_TOOLKIT_VERSION&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;아래처럼 설정하면 Docker Container 가 Nvidia GPU 를 사용 하도록 설정해 줍니다. &lt;br /&gt;
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nvidia-ctk runtime configure --runtime=docker&lt;/code&gt; 실행시 /etc/docker/daemon.json 에 세팅값이 설정됩니다&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Docker 에서 GPU 사용 가능하게 해줌&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Docker runtime config&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;nvidia-ctk runtime configure &lt;span class=&quot;nt&quot;&gt;--runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;docker
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;systemctl restart docker

&lt;span class=&quot;c&quot;&gt;# Verify (Docker)&lt;/span&gt;
docker run &lt;span class=&quot;nt&quot;&gt;--rm&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--gpus&lt;/span&gt; all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Containerd 에서 GPU 사용은 다음과 같이 합니다. &lt;br /&gt; 
K3s는 Containerd 를 사용 (반드시 설치 해야 함)&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# K3s 에서 GPU 사용 가능하게 해줌 (K3s 는 Containerd 사용)&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;nvidia-ctk runtime configure &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;--runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;containerd &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;--config&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/var/lib/rancher/k3s/agent/etc/containerd/config.toml

&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;systemctl restart containerd
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;systemctl restart k3s
kubectl rollout restart daemonset nvdp-nvidia-device-plugin &lt;span class=&quot;nt&quot;&gt;-n&lt;/span&gt; nvidia-device-plugin

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;14-nvidia-device-plugin--gpu-feature-discovery-gfd&quot;&gt;1.4 Nvidia Device Plugin + GPU Feature Discovery (GFD)&lt;/h2&gt;

&lt;p&gt;Kubernetes 에서 GPU를 자원(resource)로 인식하게 만드는 컴포넌트&lt;br /&gt; 
cpu, memory 지정은 기본적으로 지원되는데, gpu: 2 이런건 Nvidia device plugin을 설치해야지 됨&lt;/p&gt;

&lt;h3 id=&quot;runtimeclass-생성&quot;&gt;RuntimeClass 생성&lt;/h3&gt;

&lt;p&gt;K3s의 팟이 nvidia-container-runtime을 사용하도록 RuntimeClass를 먼저 생성합니다.&lt;/p&gt;
&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;cat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;EOF&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt; | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;실제-설치&quot;&gt;실제 설치&lt;/h3&gt;

&lt;p&gt;(여기서는 helm 으로 설치 방법을 알려줌)&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gfd.enabled=true&lt;/code&gt;
    &lt;ul&gt;
      &lt;li&gt;GPU Feature Discovery 활성화&lt;/li&gt;
      &lt;li&gt;GPU가 존재하는 노드에 자동으로 라벨 부착&lt;/li&gt;
      &lt;li&gt;nodeAffinity 조건을 만족시켜 Device Plugin DaemonSet이 정상적으로 실행됨&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

&lt;span class=&quot;c&quot;&gt;# 어떤 버젼 있는지 확인&lt;/span&gt;
helm search repo nvdp &lt;span class=&quot;nt&quot;&gt;--devel&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# 위에서 확인한 버젼을 사용해서 설치/업그레이드&lt;/span&gt;
helm upgrade &lt;span class=&quot;nt&quot;&gt;-i&lt;/span&gt; nvdp nvdp/nvidia-device-plugin &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;--namespace&lt;/span&gt; nvidia-device-plugin &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;--create-namespace&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;--set&lt;/span&gt; gfd.enabled&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;--set&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;runtimeClassName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;nvidia &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;--version&lt;/span&gt; 0.18.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;설치 확인&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# DaemonSet 확인&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;kubectl get pods &lt;span class=&quot;nt&quot;&gt;-n&lt;/span&gt; nvidia-device-plugin &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; wide
NAME                                                    READY   STATUS    RESTARTS        AGE    IP           NODE                   NOMINATED NODE   READINESS GATES
nvdp-node-feature-discovery-gc-6476cc6bf4-t655p         1/1     Running   0               55m    10.42.0.23   anderson-ubuntu-3090   &amp;lt;none&amp;gt;           &amp;lt;none&amp;gt;
nvdp-node-feature-discovery-master-58788687cc-tzl9v     1/1     Running   0               55m    10.42.0.22   anderson-ubuntu-3090   &amp;lt;none&amp;gt;           &amp;lt;none&amp;gt;
nvdp-node-feature-discovery-worker-849zk                1/1     Running   1 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;2m25s ago&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;   55m    10.42.0.26   anderson-ubuntu-3090   &amp;lt;none&amp;gt;           &amp;lt;none&amp;gt;
nvdp-nvidia-device-plugin-gpu-feature-discovery-m2fz8   1/1     Running   0               2m4s   10.42.0.34   anderson-ubuntu-3090   &amp;lt;none&amp;gt;           &amp;lt;none&amp;gt;
nvdp-nvidia-device-plugin-prt9m                         1/1     Running   0               2m4s   10.42.0.33   anderson-ubuntu-3090   &amp;lt;none&amp;gt;           &amp;lt;none&amp;gt;

&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;kubectl get ds &lt;span class=&quot;nt&quot;&gt;-n&lt;/span&gt; nvidia-device-plugin
NAME                                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                 AGE
nvdp-node-feature-discovery-worker                1         1         1       1            1           &amp;lt;none&amp;gt;                        72m
nvdp-nvidia-device-plugin                         1         1         1       1            1           &amp;lt;none&amp;gt;                        72m
nvdp-nvidia-device-plugin-gpu-feature-discovery   1         1         1       1            1           &amp;lt;none&amp;gt;                        72m
nvdp-nvidia-device-plugin-mps-control-daemon      0         0         0       0            0           nvidia.com/mps.capable&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;true   &lt;/span&gt;72m

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;15-kubernetes-operation-tools&quot;&gt;1.5 Kubernetes Operation Tools&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;k9s&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;터미널용 툴&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wget https://github.com/derailed/k9s/releases/latest/download/k9s_Linux_amd64.tar.gz
&lt;span class=&quot;nb&quot;&gt;tar&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-xzf&lt;/span&gt; k9s_Linux_amd64.tar.gz
&lt;span class=&quot;nb&quot;&gt;sudo mv &lt;/span&gt;k9s /usr/local/bin/

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h1 id=&quot;2-vllm&quot;&gt;2. vLLM&lt;/h1&gt;

&lt;h2 id=&quot;21-vllm-namespace&quot;&gt;2.1 vllm namespace&lt;/h2&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kubectl create ns vllm &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;21-persistent-volume-생성&quot;&gt;2.1 Persistent Volume 생성&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;PVC 생성&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;k3s 디렉토리 만들고 그 안에다가 다음의 파일들을 생성합니다.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;cat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;EOF&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos; | kubectl apply -f -
# pvc-gpt-oss-20b.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gpt-oss-20b
  namespace: vllm
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi  # Adjust based on model size
&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;확인은 다음과 같이 합니다.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kubectl get pvc &lt;span class=&quot;nt&quot;&gt;-n&lt;/span&gt; vllm
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;22-secret-생성-hugging-face-token&quot;&gt;2.2 Secret 생성 (Hugging Face Token)&lt;/h2&gt;

&lt;p&gt;Gated model이거나 인증이 필요한 경우 Secret 설정이 필수입니다. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CreateContainerConfigError&lt;/code&gt;가 발생한다면 이 설정이 빠졌는지 확인하세요.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# 터미널에서 직접 생성하는 것이 편합니다.&lt;/span&gt;
kubectl create secret generic hf-token-secret &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;-n&lt;/span&gt; vllm &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;--from-literal&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;token&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;YOUR_HUGGINGFACE_TOKEN&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;23-deploy-vllm&quot;&gt;2.3 Deploy vLLM&lt;/h2&gt;

&lt;p&gt;실제 vLLM을 배포합니다. &lt;br /&gt;
vLLM의 주요 옵션들을 정리하면 다음과 같습니다.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Option&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Description&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Usage&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--model&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;사용할 모델의 이름 또는 경로&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vllm serve facebook/opt-125m&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--tensor-parallel-size&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-tp&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Tensor Parallelism을 위한 GPU 개수&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-tp 2&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--pipeline-parallel-size&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-pp&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Pipeline Parallelism을 위한 GPU 개수&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-pp 2&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--gpu-memory-utilization&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;GPU 메모리 사용률 (기본값 0.9)&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--gpu-memory-utilization 0.95&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--max-model-len&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;모델의 최대 컨텍스트 길이 제한&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--max-model-len 4096&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--trust-remote-code&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;모델의 커스텀 코드 실행 허용 (필요시)&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--trust-remote-code&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--dtype&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;데이터 타입 설정 (auto, float16, bfloat16 등)&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--dtype bfloat16&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--quantization&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-q&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;양자화 설정 (awq, gptq, bitsandbytes 등)&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-q awq&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--enforce-eager&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;CUDA Graph 대신 Eager 모드 사용 (디버깅용)&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--enforce-eager&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--served-model-name&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;API에서 노출될 모델 이름 변경&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--served-model-name my-gpt&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--api-key&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;API 접속을 위한 API Key 설정&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--api-key mysecret&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--enable-lora&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;LoRA 어댑터 사용 활성화&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--enable-lora&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# vllm-deployment.yaml&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;apiVersion&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;apps/v1&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;kind&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Deployment&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;gpt-oss-20b&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;namespace&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;vllm&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;labels&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;gpt-oss-20b&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;replicas&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;selector&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;matchLabels&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;gpt-oss-20b&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;template&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;labels&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;gpt-oss-20b&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;runtimeClassName&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nvidia&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# RuntimeClass 추가 (K3s에서 필수)&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;volumes&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;cache-volume&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;persistentVolumeClaim&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;na&quot;&gt;claimName&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;gpt-oss-20b&lt;/span&gt;
      &lt;span class=&quot;c1&quot;&gt;# vLLM needs to access the host&apos;s shared memory for tensor parallel inference.&lt;/span&gt;
      &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;shm&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;emptyDir&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;na&quot;&gt;medium&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Memory&lt;/span&gt;
          &lt;span class=&quot;na&quot;&gt;sizeLimit&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;8Gi&quot;&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;containers&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;gpt-oss-20b&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;vllm/vllm-openai:latest&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;command&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;pi&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;/bin/sh&quot;&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;-c&quot;&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;]&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;pi&quot;&gt;[&lt;/span&gt;
          &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;vllm&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;serve&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;openai/gpt-oss-20b&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\
&lt;/span&gt;            &lt;span class=&quot;s&quot;&gt;--trust-remote-code&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\
&lt;/span&gt;            &lt;span class=&quot;s&quot;&gt;--enable-chunked-prefill&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\
&lt;/span&gt;            &lt;span class=&quot;s&quot;&gt;--max-model-len&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;4096&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\
&lt;/span&gt;            &lt;span class=&quot;s&quot;&gt;--gpu-memory-utilization&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;0.90&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\
&lt;/span&gt;            &lt;span class=&quot;s&quot;&gt;--dtype&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;auto&quot;&lt;/span&gt;
        &lt;span class=&quot;pi&quot;&gt;]&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;HUGGING_FACE_HUB_TOKEN&lt;/span&gt;
          &lt;span class=&quot;na&quot;&gt;valueFrom&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;na&quot;&gt;secretKeyRef&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
              &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;hf-token-secret&lt;/span&gt;
              &lt;span class=&quot;na&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;token&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;ports&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;containerPort&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;8000&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;na&quot;&gt;limits&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;na&quot;&gt;cpu&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;8&quot;&lt;/span&gt;
            &lt;span class=&quot;na&quot;&gt;memory&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;32Gi&lt;/span&gt;
            &lt;span class=&quot;na&quot;&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;1&quot;&lt;/span&gt;
          &lt;span class=&quot;na&quot;&gt;requests&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;na&quot;&gt;cpu&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;2&quot;&lt;/span&gt;
            &lt;span class=&quot;na&quot;&gt;memory&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;16Gi&lt;/span&gt;
            &lt;span class=&quot;na&quot;&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;1&quot;&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;volumeMounts&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;mountPath&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;/root/.cache/huggingface&lt;/span&gt;
          &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;cache-volume&lt;/span&gt;
        &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;shm&lt;/span&gt;
          &lt;span class=&quot;na&quot;&gt;mountPath&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;/dev/shm&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;livenessProbe&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;na&quot;&gt;httpGet&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;na&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;/health&lt;/span&gt;
            &lt;span class=&quot;na&quot;&gt;port&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;8000&lt;/span&gt;
          &lt;span class=&quot;na&quot;&gt;initialDelaySeconds&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;300&lt;/span&gt;
          &lt;span class=&quot;na&quot;&gt;periodSeconds&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;readinessProbe&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;na&quot;&gt;httpGet&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;na&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;/health&lt;/span&gt;
            &lt;span class=&quot;na&quot;&gt;port&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;8000&lt;/span&gt;
          &lt;span class=&quot;na&quot;&gt;initialDelaySeconds&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;300&lt;/span&gt;
          &lt;span class=&quot;na&quot;&gt;periodSeconds&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;설치가 완료되었으면 실제로 API가 동작하는지 테스트해 봅니다. &lt;br /&gt;
K3s 환경에서는 Cluster IP에 직접 접근이 가능하므로, Pod IP를 확인하여 바로 테스트할 수 있습니다.&lt;/p&gt;

&lt;p&gt;먼저 Pod의 IP를 확인합니다.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;kubectl get pod &lt;span class=&quot;nt&quot;&gt;-n&lt;/span&gt; vllm &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; wide

NAME                           READY   STATUS    RESTARTS   AGE   IP            NODE                   NOMINATED NODE   READINESS GATES
gpt-oss-20b-6b44fc66b6-kkr8t   1/1     Running   0          28m   10.42.0.114   anderson-ubuntu-3090   &amp;lt;none&amp;gt;           &amp;lt;none&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;출력된 IP를 확인한 후 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;curl&lt;/code&gt; 명령어로 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;health&lt;/code&gt; 엔드포인트를 호출합니다.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# 예: IP가 10.42.0.114 인 경우&lt;/span&gt;
curl http://10.42.0.114:8000/health &lt;span class=&quot;nt&quot;&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;아무런 내용 없이 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;200 OK&lt;/code&gt; 응답이 오면 정상적으로 실행 중인 상태입니다.&lt;/p&gt;

&lt;h2 id=&quot;24-production-level-access-ingress--service&quot;&gt;2.4 Production-Level Access (Ingress &amp;amp; Service)&lt;/h2&gt;

&lt;p&gt;앞서 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Test&lt;/code&gt; 단계에서는 Pod IP로 직접 테스트했지만, &lt;br /&gt;
실제 운영 환경(Production)에서는 &lt;strong&gt;절대 Pod IP를 직접 사용하지 않습니다.&lt;/strong&gt; &lt;br /&gt;
Pod는 언제든 죽었다 살아나며 IP가 바뀌기 때문입니다.&lt;/p&gt;

&lt;p&gt;이를 해결하기 위해 &lt;strong&gt;Service&lt;/strong&gt;와 &lt;strong&gt;Ingress&lt;/strong&gt;를 사용합니다.&lt;/p&gt;

&lt;h3 id=&quot;1-service-load-balancer&quot;&gt;1) Service (Load Balancer)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Service&lt;/strong&gt;는 쉽게 말해 &lt;strong&gt;“내부 로드 밸런서”&lt;/strong&gt; 입니다.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;여러 개의 Pod(replica)가 떠 있을 때, 이들을 하나의 &lt;strong&gt;고정된 IP(ClusterIP)&lt;/strong&gt; 로 묶어줍니다.&lt;/li&gt;
  &lt;li&gt;트래픽이 들어오면 살아있는 Pod들 중 하나로 &lt;strong&gt;부하 분산(Load Balancing)&lt;/strong&gt; 을 해줍니다.&lt;/li&gt;
  &lt;li&gt;즉, Pod가 죽고 새로 태어나도 Service의 주소는 변하지 않습니다.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# vllm-service.yaml&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;apiVersion&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;v1&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;kind&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Service&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;gpt-oss-20b-service&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;namespace&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;vllm&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;selector&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;gpt-oss-20b&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Deployment의 label과 일치해야 함&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;ports&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;protocol&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;TCP&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;port&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;80&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# Service가 노출할 포트&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;targetPort&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;8000&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# 실제 Pod가 떠 있는 포트&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;ClusterIP&lt;/span&gt;     &lt;span class=&quot;c1&quot;&gt;# 외부 노출 없이 내부 전용&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;kubectl get service &lt;span class=&quot;nt&quot;&gt;-n&lt;/span&gt; vllm

NAME                  TYPE        CLUSTER-IP     EXTERNAL-IP   PORT&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;S&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;   AGE
gpt-oss-20b-service   ClusterIP   10.43.61.210   &amp;lt;none&amp;gt;        80/TCP    81s

&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;curl httP://10.43.61.210/health  &lt;span class=&quot;nt&quot;&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;2-ingress-gateway&quot;&gt;2) Ingress (Gateway)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Ingress&lt;/strong&gt;는 클러스터 &lt;strong&gt;“외부에서 들어오는 대문”&lt;/strong&gt; 역할을 합니다.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;llm.incredible.ai&lt;/code&gt; 처럼 도메인 주소를 보고 적절한 Service로 연결해 줍니다.&lt;/li&gt;
  &lt;li&gt;SSL/TLS 인증서 처리나 라우팅 규칙을 담당합니다.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# vllm-ingress.yaml&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;apiVersion&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;kind&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Ingress&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;gpt-oss-20b-ingress&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;namespace&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;vllm&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;rules&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;host&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;llm.incredible.ai&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# 1. 사용자가 이 도메인으로 접속하면&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;http&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;paths&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;/&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;pathType&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Prefix&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;backend&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;na&quot;&gt;service&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;gpt-oss-20b-service&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# 2. 이 서비스(내부 로드밸런서)로 보냄&lt;/span&gt;
            &lt;span class=&quot;na&quot;&gt;port&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
              &lt;span class=&quot;na&quot;&gt;number&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;3-local-test-가짜-도메인-사용&quot;&gt;3) Local Test (가짜 도메인 사용)&lt;/h3&gt;
&lt;p&gt;로컬 개발 환경에서는 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;llm.incredible.ai&lt;/code&gt; 같은 도메인이 실제 인터넷에 없으므로, 내 컴퓨터가 이를 인식하도록 속여야 합니다.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Linux/Mac&lt;/strong&gt;의 경우 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/hosts&lt;/code&gt; 파일을 수정합니다.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# /etc/hosts 파일 수정&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;vim /etc/hosts

&lt;span class=&quot;c&quot;&gt;# 맨 아래에 다음 줄 추가 (K3s가 설치된 로컬 IP)&lt;/span&gt;
127.0.0.1 llm.incredible.ai
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;이제 주소창이나 터미널에서 진짜 도메인처럼 호출할 수 있습니다.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# 이제 IP 대신 도메인으로 호출 가능!&lt;/span&gt;
curl http://llm.incredible.ai/health
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h1 id=&quot;uninstall&quot;&gt;Uninstall&lt;/h1&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;helm uninstall nvdp &lt;span class=&quot;nt&quot;&gt;-n&lt;/span&gt; nvidia-device-plugin
kubectl delete namespace nvidia-device-plugin
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;확인&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bash
kubectl get pods -A | grep -i nvidia || true
kubectl get ds -A | grep -i nvidia || true&lt;/code&gt;&lt;/p&gt;
</description>
        <pubDate>Sat, 20 Dec 2025 01:00:00 +0000</pubDate>
        <link>http://incredible.ai/kubernetes/2025/12/20/K3s_vLLM/</link>
        <guid isPermaLink="true">http://incredible.ai/kubernetes/2025/12/20/K3s_vLLM/</guid>
        
        <category>gpu</category>
        
        <category>pod</category>
        
        <category>docker</category>
        
        <category>helm</category>
        
        <category>kubernetes</category>
        
        <category>vllm</category>
        
        
        <category>kubernetes</category>
        
      </item>
    
      <item>
        <title>Antigravity Personal Settings</title>
        <description>&lt;ul&gt;
  &lt;li&gt;혹시 윈도우면, wsl –install 이걸로 우분투 사용 가능함. 이후 wsl 로 ubuntu 처럼 사용&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;1-hot-keys&quot;&gt;1. Hot Keys&lt;/h1&gt;

&lt;h2 id=&quot;11-hot-keys&quot;&gt;1.1 Hot Keys&lt;/h2&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Category&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Title&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Hot Key&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Description&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Antigravity&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Open Command&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;CTRL + SHIFT + P&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt; &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt; &lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Terminal&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;CTRL + `&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Open/focus terminal&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt; &lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Word Wrap&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;ALT + Z&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Toggle Word Wrap&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h2 id=&quot;12-command&quot;&gt;1.2 Command&lt;/h2&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Category&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Command&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Description&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Python&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Python: Select Interpreter&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;특정 버젼 Python 선택&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt; &lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Python: Configure Test&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Pytest등의 test 툴을 선택&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h1 id=&quot;2-extensions&quot;&gt;2. Extensions&lt;/h1&gt;

&lt;h2 id=&quot;21-python-support-in-antigravity&quot;&gt;2.1 Python Support in Antigravity&lt;/h2&gt;

&lt;p&gt;Python 실행하고 하려면 해당 extensions 도 설치해야 함. &lt;br /&gt;
실제 Python을 설치하는게 아니라, Python을 실행할수 있도록 도와주는 extension&lt;/p&gt;

&lt;p&gt;아래와 같이 검색&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;@category:debuggers Python
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/antigravity-sftp-04.png&quot; class=&quot;img-responsive img-rounded img-fluid center&quot; style=&quot;border: 2px solid #333333&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;22-remote-ssh-connect-to-ssh-host&quot;&gt;2.2 Remote-SSH: Connect to SSH Host…&lt;/h2&gt;

&lt;p&gt;먼저 &lt;strong&gt;~/.ssh/config&lt;/strong&gt; 에 다음을 작성&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Host oracle
    HostName 134.185.117.137
    Port 22
    User ubuntu
    IdentityFile C:&lt;span class=&quot;se&quot;&gt;\U&lt;/span&gt;sers&lt;span class=&quot;se&quot;&gt;\a&lt;/span&gt;nderson&lt;span class=&quot;se&quot;&gt;\.&lt;/span&gt;ssh&lt;span class=&quot;se&quot;&gt;\i&lt;/span&gt;d_ed25519
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlPersist 10m
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;윈도우에서는 Multiplexing (한번 만든 SSH연결을 여러 세션이 같이 쓰게 하는 기능) 끄는게 좋아&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Host oracle
    HostName 134.185.117.137
    Port 22
    User ubuntu
    IdentityFile C:&lt;span class=&quot;se&quot;&gt;\U&lt;/span&gt;sers&lt;span class=&quot;se&quot;&gt;\a&lt;/span&gt;nderson&lt;span class=&quot;se&quot;&gt;\.&lt;/span&gt;ssh&lt;span class=&quot;se&quot;&gt;\i&lt;/span&gt;d_ed25519
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Bastion 에서는 다음과 같이 설정&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Bastion 호스트 서버&lt;/span&gt;
Host oracle-bastion
    HostName 1.2.3.4
    Port 22
    User ec2-user
    IdentityFile C:/Users/anderson/.ssh/bastion.pem
    IdentitiesOnly &lt;span class=&quot;nb&quot;&gt;yes
    &lt;/span&gt;ServerAliveInterval 60
    ServerAliveCountMax 3
    
&lt;span class=&quot;c&quot;&gt;# 실제 내부 서버 정의&lt;/span&gt;
Host oracle
    HostName 10.0.1.15
    Port 22
    User ubuntu
    IdentityFile C:/Users/anderson/.ssh/id_ed25519
    IdentitiesOnly &lt;span class=&quot;nb&quot;&gt;yes
    
    &lt;/span&gt;ProxyJump oracle-bastion

    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlPersist 10m
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;23-ftpsftpssh-sync-tool&quot;&gt;2.3 FTP/SFTP/SSH Sync Tool&lt;/h2&gt;

&lt;p&gt;FTP/SFTP/SSH Sync 툴에서 + 를 클릭&lt;br /&gt;
여기서 해당 remote를 대표하는 이름을 적어 넣습니다.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/antigravity-sftp-01.png&quot; class=&quot;img-responsive img-rounded img-fluid center&quot; style=&quot;border: 2px solid #333333&quot; /&gt;&lt;/p&gt;

&lt;p&gt;SFTP 선택 (따로 FTP 21을 오픈할 필요없이 22번 SSH로 접속 가능)&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/antigravity-sftp-02.png&quot; class=&quot;img-responsive img-rounded img-fluid center&quot; style=&quot;border: 2px solid #333333&quot; /&gt;&lt;/p&gt;

&lt;p&gt;다음을 선택&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Real-time submission after saving&lt;/li&gt;
  &lt;li&gt;is this the default configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;다른거 선택하면 안됨!!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/antigravity-sftp-03.png&quot; class=&quot;img-responsive img-rounded img-fluid center&quot; style=&quot;border: 2px solid #333333&quot; /&gt;&lt;/p&gt;

&lt;p&gt;이후에 sync_config.jsonc 가 나오고 여기서 실제 설정.&lt;br /&gt;
다음을 반드시 설정&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;host&lt;/li&gt;
  &lt;li&gt;port&lt;/li&gt;
  &lt;li&gt;privateKeyPath&lt;/li&gt;
  &lt;li&gt;remotePath&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;oracle_server&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;sftp&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;host&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;134.185.117.137&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;port&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;22&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;username&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ubuntu&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;privateKeyPath&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;C:/Users/anderson/.ssh/id_ed25519&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;proxy&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;upload_on_save&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;watch&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;submit_git_before_upload&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;submit_git_msg&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;build&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;compress&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;remote_unpacked&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;delete_remote_compress&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;delete_local_compress&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;deleteRemote&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;upload_to_root&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;distPath&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;remotePath&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;/home/ubuntu/projects&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;excludePath&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;downloadPath&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;downloadExcludePath&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;default&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

</description>
        <pubDate>Mon, 01 Dec 2025 01:00:00 +0000</pubDate>
        <link>http://incredible.ai/development/2025/12/01/Antigravity-Personal-Settings/</link>
        <guid isPermaLink="true">http://incredible.ai/development/2025/12/01/Antigravity-Personal-Settings/</guid>
        
        <category>sfpt</category>
        
        <category>oracle</category>
        
        
        <category>development</category>
        
      </item>
    
      <item>
        <title>ToolFormer - Language Models Can Teach Themselves to Use Tools</title>
        <description>&lt;h1 id=&quot;1-introduction&quot;&gt;1. Introduction&lt;/h1&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Key&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Value&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Paper&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;https://arxiv.org/pdf/2302.04761&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Publication Date&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2023&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h2 id=&quot;11-tldr&quot;&gt;1.1 TL;DR&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;대형 언어 모델은 언어 생성 능력은 강하지만, 정확한 산술 계산, 최신 사실 조회, 날짜 연산 등에서는 취약합니다.&lt;br /&gt; 
가장 단순한 해결책은 검색엔진·계산기·캘린더 같은 외부 도구를 쓰게 하는 것이지만, 기존 접근은&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;(1) 대량의 사람 주석에 의존&lt;/li&gt;
  &lt;li&gt;(2) 특정 과제 전용 설계에 묶여 일반성이 떨어진다는 한계 존재&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Brief Solution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ToolFormer는 최소한의 데모만으로 모델이 스스로 도구 사용을 학습하도록 합니다. 핵심 요건은 다음과 같습니다.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;LLM이 스스로 외부 API 를 호출할지, self supervised 학습.
    &lt;ul&gt;
      &lt;li&gt;데이터 레이블을 스스로 직접 생성하고 자동으로 미세 조정&lt;/li&gt;
      &lt;li&gt;실제 NLL(negative log likelihood)을 낮추는 호출을 남김&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;계산기·검색·번역·캘린더 등 다양한 도구를 하나의 통일된 포맷으로 다룸&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;12-examples&quot;&gt;1.2 Examples&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;계산기&lt;/strong&gt;: [Calculator(400 / 1400) → 0.29]&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;QA&lt;/strong&gt;: [QA(“Who is the publisher of The New England Journal of Medicine?”) → Massachusetts Medical Society]&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Wiki Search&lt;/strong&gt;: [WikiSearch(“BrownAct”) → The Ralph M. Brown Act is an act of the California State Legislature that guarantees the public’s right to attend and participate in meetings of local legislative bodies.]&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/toolformer_example.png&quot; class=&quot;img-responsive img-rounded img-fluid center&quot; style=&quot;border: 2px solid #333333&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;13-pipeline&quot;&gt;1.3 Pipeline&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/toolformer-pipeline.png&quot; class=&quot;img-responsive img-rounded img-fluid center&quot; style=&quot;border: 2px solid #333333&quot; /&gt;&lt;/p&gt;

&lt;h1 id=&quot;2-how-it-works&quot;&gt;2. How it works&lt;/h1&gt;

&lt;h2 id=&quot;21-format--notation&quot;&gt;2.1 Format &amp;amp; Notation&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;API 호출 표현: \(c = (a_c, i_c)\)
    &lt;ul&gt;
      &lt;li&gt;a_c: API 이름 (예: Calculator, WikiSearch)&lt;/li&gt;
      &lt;li&gt;i_c: API arguments (예: 질의, 수식 등)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;위의 수식적 표현은 Tuple 로 이루어져 있습니다. (수학적 관점에서)&lt;br /&gt;
하지만 Language Model에서는 문자열 즉 Token Sequence 를 다룰수 있기 때문에 Tuple 같은 구조체를 다룰수 없습니다.&lt;br /&gt;
따라서 \((a_c, i_c, r)\) 같은 구조를 -&amp;gt; 일렬의 문자열로 바꿔야 합니다.&lt;br /&gt; 
이것을 논문에서는 &lt;strong&gt;Linearized Sequence&lt;/strong&gt; 라고 부릅니다.&lt;/p&gt;

\[\begin{aligned}
\mathbf{e}(c) &amp;amp;= \langle \text{API} \rangle\, a_c(i_c)\, \langle /\text{API} \rangle \\
\mathbf{e}(c, r) &amp;amp;= \langle \text{API} \rangle\, a_c(i_c) \to r\, \langle /\text{API} \rangle
\end{aligned}\]

&lt;p&gt;여기서 &lt;API&gt; 는 special token입니다.&lt;/API&gt;&lt;/p&gt;

&lt;h2 id=&quot;22-sampling-api-calls&quot;&gt;2.2 Sampling API Calls&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Prompt 를 만듬 (P(x))&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;각 API 마다 prompt 템플릿을 만듬&lt;/li&gt;
      &lt;li&gt;prompt는 모델한테 해당 텍스트 안에 API call을 생성하도록 힌트를 줌&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Question Answering API Prompt 예시&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Your task is to add calls to a Question Answering API to a piece of text.
The questions should help you get information required to complete the text. 
You can call the API by writing &quot;[QA(question)]&quot; where &quot;question&quot; is the
question you want to ask. 

Here are some examples of API calls:

Input: Joe Biden was born in Scranton, Pennsylvania.
Output: Joe Biden was born in [QA(&quot;Where was Joe Biden born?&quot;)] Scranton, [QA(&quot;In which state is Scranton?&quot;)] Pennsylvania.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Calculator API Prompt 예시&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Your task is to add calls to a Calculator API to a piece of text.
The calls should help you get information required to complete the text. 
You can call the API by writing &quot;[Calculator(expression)]&quot; 
where &quot;expression&quot; is the expression to be computed. 

Here are some examples of API calls:

Input: The number in the next term is 18 + 12 x 3 = 54.
Output: The number in the next term is 18 + 12 x 3 = [Calculator(18 + 12 * 3)] 54.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;API Call 위치 후보 찾기&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Input Text의 i위치 마다, &amp;lt;API&amp;gt; 토큰이 나올 확률을 모델한테 계산&lt;/li&gt;
      &lt;li&gt;
\[p_i = P_M (\langle \text{API} \rangle | P(x), x_{1:i-1})\]
      &lt;/li&gt;
      &lt;li&gt;여기서 나온 확률이 특정 threshold 보다 크면, 해당 positions 을 들고 있음.&lt;/li&gt;
      &lt;li&gt;관련 positions 이 너무 많으면 상위 K 개만 선택함&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;다음은 예제 코드. (이렇게 돌리면 안됨. 그냥 예제)&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# ------------------------------
# 1. Prompt P(x) 준비
# ------------------------------
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Joe Biden was born in Scranton, Pennsylvania.&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;prompt&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Add helpful API calls to clarify this text:&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# P(x)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inputs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;prompt&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;return_tensors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;pt&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# ------------------------------
# 2. 각 위치별로 &amp;lt;API&amp;gt; 시작 확률 계산
# ------------------------------
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_ids&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inputs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;input_ids&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# shape: [seq_len]
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq_len&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;input_ids&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# special token 정의 (논문에서는 실제론 &apos;[&apos; 사용)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;API_START&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;convert_tokens_to_ids&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;[&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;threshold&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.01&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;top_k&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;api_probs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;no_grad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;seq_len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;  
        &lt;span class=&quot;n&quot;&gt;prefix_ids&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;input_ids&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unsqueeze&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# prefix
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;outputs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;prefix_ids&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;logits&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;outputs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;logits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[:,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;:]&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# last token logits
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;probs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;softmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;logits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;p_api&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;probs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;API_START&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;item&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;api_probs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p_api&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# threshold filtering + top-k
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;candidates&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;api_probs&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;threshold&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;candidates&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;sorted&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;candidates&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;reverse&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)[:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;top_k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;23-api-호출&quot;&gt;2.3 API 호출&lt;/h2&gt;

&lt;p&gt;여기서 전체 API 호출을 합니다.&lt;br /&gt; 
구현에 달려 있는 것이기 때문에, python 함수를 실행할지, API 를 호출할지, 다른 model 을 호출할지 등등 모두  자유&lt;/p&gt;

&lt;h2 id=&quot;24-filtering-api-calls&quot;&gt;2.4 Filtering API Calls&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;\(\mathcal{L}_{\text{plain}}\) : API 호출 없이, plain LM loss&lt;/li&gt;
  &lt;li&gt;\(\mathcal{L}_{\text{aug}}\) : API call + 결과를 포함한 LM Loss&lt;/li&gt;
  &lt;li&gt;\(\Delta^{(i)}\) : loss 의 개선 정도
    &lt;ul&gt;
      &lt;li&gt;\(\Delta^{(i)} \gt 0\) : API Call 이 실제로 두움이 됨&lt;/li&gt;
      &lt;li&gt;\(\Delta^{(i)} \ge \gamma_f\) : API Call 이 충분히 도움이 된다고 판단 -&amp;gt; 해당 API 호출을 유지&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

\[\begin{aligned}
\mathcal{L}_{\text{plain}}^{(i)} &amp;amp;= - \sum_{t \in \mathcal{W}_i} \log p_\theta(x_t \mid x_{\le i}) \\
\mathcal{L}_{\text{aug}}^{(i)} &amp;amp;= - \sum_{t \in \mathcal{W}_i} \log p_\theta(x_t \mid x_{\le i}, \mathbf{e}(c_i, r)) \\
\Delta^{(i)} &amp;amp;= \mathcal{L}_{\text{plain}}^{(i)} - \mathcal{L}_{\text{aug}}^{(i)} \\
\end{aligned}\]

&lt;p&gt;이중에서 Delta 값이 특정 threshold 이상인 것만 남김니다.&lt;/p&gt;

\[\mathcal{L}_{\text{plain}}^{(i)} - \mathcal{L}_{\text{aug}}^{(i)} \ge \gamma_f\]

&lt;h2 id=&quot;25-model-finetuning-after-sampling--filtering&quot;&gt;2.5 Model Finetuning After Sampling &amp;amp; Filtering&lt;/h2&gt;

&lt;p&gt;이렇게 필터링된 API 호출들을 포함한 결과 (\mathcal{C}^*) 를 가지고, 모델을 파인튜닝합니다.&lt;br /&gt;
삽입 과정은 다음과 같이 합니다.&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Input text
User: What’s the weather in Seoul today?

# Internal API call
[API CALL] weather_api(&quot;seoul&quot;, &quot;today&quot;)
[API_RESULT] {&quot;temperature&quot;: &quot;25°C&quot;, &quot;condition&quot;: &quot;Sunny&quot;}

# 다음과 같이 결과가 만들어 집니다. 
&amp;lt;API&amp;gt; weather_api(&quot;seoul&quot;, &quot;today&quot;) -&amp;gt; {&quot;temperature&quot;: &quot;25°C&quot;, &quot;condition&quot;: &quot;Sunny&quot;} &amp;lt;/API&amp;gt;
 
# 위의 API 호출 결과를 LLM이 참조후에 최정 답변을 생성
[Target Label] IT&apos;s 25°C and Sunny in Seoul today.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h1 id=&quot;3-implementation&quot;&gt;3. Implementation&lt;/h1&gt;

&lt;h2 id=&quot;31-open-webui&quot;&gt;3.1 Open WebUI&lt;/h2&gt;

&lt;p&gt;예제 구현 코드.&lt;/p&gt;

&lt;p&gt;“dragon fire 343 는 몇이야?” 이런식으로 질문하면, 아래의 함수를 실행하고 답변을 줍니다. &lt;br /&gt;
답변은 테스트 용이기 때문에, 지정된 답변을 줍니다.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;
title: Dragon Fire Bitcoin Analyzer
author: Anderson
description: A tool that analyzes Bitcoin&apos;s dragon fire intensity level
version: 1.0.0
license: MIT
&quot;&quot;&quot;&lt;/span&gt;

&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;typing&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Optional&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Optional manifest
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;manifest&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;title&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Dragon Fire Bitcoin Analyzer&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;author&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Anderson&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;version&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;1.0.0&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;license&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;MIT&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;description&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;A tool that analyzes Bitcoin&apos;s dragon fire intensity level&quot;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;


&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Tools&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;pass&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;dragon_fire&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
        &lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;__event_emitter__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Optional&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;object&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;
        Calculate dragon fire intensity by squaring the input value.
        
        Args:
            value: Integer value to calculate dragon fire for
            __event_emitter__: Optional event emitter for status updates
        
        Returns:
            Dragon fire result with squared value
        &quot;&quot;&quot;&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;__event_emitter__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;__event_emitter__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;({&lt;/span&gt;
                &lt;span class=&quot;s&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;status&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;s&quot;&gt;&quot;data&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;description&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Calculating dragon fire for &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;...&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;done&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;
        
        &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;value&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;**&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;__event_emitter__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;__event_emitter__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;({&lt;/span&gt;
                &lt;span class=&quot;s&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;status&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;s&quot;&gt;&quot;data&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;description&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Dragon fire calculation completed&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;done&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;dragon fire &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;아래는 실제로 돌려본 예제 입니다.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/dragon_fire_severity.png&quot; class=&quot;img-responsive img-rounded img-fluid center&quot; style=&quot;border: 2px solid #333333&quot; /&gt;&lt;/p&gt;

</description>
        <pubDate>Sun, 10 Aug 2025 01:00:00 +0000</pubDate>
        <link>http://incredible.ai/llm/2025/08/10/ToolFormer/</link>
        <guid isPermaLink="true">http://incredible.ai/llm/2025/08/10/ToolFormer/</guid>
        
        <category>post-training</category>
        
        <category>self-supervised</category>
        
        <category>tools</category>
        
        <category>api-calls</category>
        
        <category>instruction-tuning</category>
        
        
        <category>llm</category>
        
      </item>
    
      <item>
        <title>Switch Transformer - Scaling to Trillion Parameter Models with Simple and Efficient Sparsity</title>
        <description>&lt;h1 id=&quot;switch-transformer---scaling-to-trillion-parameter-models-with-simple-and-efficient-sparsity&quot;&gt;Switch Transformer - Scaling to Trillion Parameter Models with Simple and Efficient Sparsity&lt;/h1&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Key&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Value&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Paper&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;https://arxiv.org/pdf/2101.03961&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;publication Date&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2021&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h1 id=&quot;problem&quot;&gt;Problem&lt;/h1&gt;

&lt;p&gt;MoE (Mixture of Experts) 모델은 inputs 마다 서로다른 parameters 사용하기 때문에,
전체 parameters 수는 폭발적으로 늘어나도, 실제 사용하는 부분은 소수라 연산량은 일정합니다.&lt;/p&gt;

&lt;p&gt;하지만 실제 적용은 어렵습니다.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;기존 MoE 모델의 문제점
    &lt;ul&gt;
      &lt;li&gt;복잡한 Routing 알고리즘이 필요&lt;/li&gt;
      &lt;li&gt;communication overhead가 큼&lt;/li&gt;
      &lt;li&gt;학습 불안정성 gradient exploding 현상 발생&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Switch Transformer 는 이러한 기존의 문제들을 해결하였음.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;model&quot;&gt;Model&lt;/h1&gt;

&lt;p&gt;핵심은&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Sparse Training&lt;/li&gt;
  &lt;li&gt;궁극적으로 parameter 갯수 자체를 maximize 하는 것. (매우 효율적인 방식으로)&lt;/li&gt;
  &lt;li&gt;이렇게 parameters 를 늘리지만, Floating point operations (FLOPs) 를 constant 값으로 바꾸는 것.
    &lt;ul&gt;
      &lt;li&gt;즉 parameters 는 늘리지만, 연산은 constant 하게 됨.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/switch_transformer_model.png&quot; class=&quot;img-responsive img-rounded img-fluid center&quot; style=&quot;border: 2px solid #333333&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;기존-moe-모델-방식---mixture-of-experts-routing&quot;&gt;기존 MOE 모델 방식 - Mixture of Experts Routing&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;MOE (Mixture of Experts) 개념은 이미 2017년 Shazeer et al에 의해 제안되었음.&lt;/li&gt;
  &lt;li&gt;route 에 해당하는 \(W_r\) 값이 존재하고, input값과 곱해져서 logits을 생성함 \(h(x) = W_r \cdot x\)&lt;/li&gt;
  &lt;li&gt;이후에 softmax를 적용하고, top-k 개의 experts 를 선택함. 포인트는 여러개의 experts를 선택함&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;아래는 구체적인 공식&lt;/p&gt;

\[h(x) = W_r \cdot x\]

&lt;ul&gt;
  &lt;li&gt;W_r : route matrix (이게 학습이 되면서 experts를 선택하는 역활을 함)&lt;/li&gt;
  &lt;li&gt;x: input vector&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;이후에 softmax를 적용하여, top-k개의 experts를 선택함.&lt;/p&gt;

\[p_i(x) = \frac{e^{h(x)_i}}{\sum^N_j e^{h(x)_j}}\]

&lt;p&gt;만약 T가 선택된 experts의 집합이라면, 최종 output은 다음과 같이 계산됩니다.&lt;/p&gt;

\[y = \sum_{i \in T} p_i(x) E_i(x)\]

&lt;ul&gt;
  &lt;li&gt;E_i(x): 선택된 experts에서 i번째 expert의 output&lt;/li&gt;
  &lt;li&gt;그냥 곱하기 하면 됨&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;이걸 Pytorch 로 구현하면 다음과 같습니다.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;torch&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;torch.nn&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;torch.nn.functional&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;F&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;MoELayer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;input_dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;output_dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_experts&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;nb&quot;&gt;super&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;k&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_experts&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_experts&lt;/span&gt;

        &lt;span class=&quot;c1&quot;&gt;# Router: logits = W_r · x
&lt;/span&gt;        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;router&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Linear&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_experts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;c1&quot;&gt;# Experts: each is a small MLP (or Linear here)
&lt;/span&gt;        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;experts&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ModuleList&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Linear&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;output_dim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_experts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;forward&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;
        x: (batch_size, input_dim)
        returns: (batch_size, output_dim)
        &quot;&quot;&quot;&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# Compute routing logits
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;logits&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;router&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# (B, N)
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;gate_probs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;softmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;logits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# (B, N)
&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# Select top-k experts per example
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;topk_vals&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;topk_idx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;topk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gate_probs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# (B, k)
&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# Initialize output
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;output&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;zeros&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;experts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out_features&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;idx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;topk_idx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[:,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# expert index
&lt;/span&gt;            &lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;topk_vals&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[:,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unsqueeze&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# (B, 1)
&lt;/span&gt;
            &lt;span class=&quot;c1&quot;&gt;# For batch-wise selection
&lt;/span&gt;            &lt;span class=&quot;n&quot;&gt;expert_outputs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stack&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;
                &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;experts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;expert_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;](&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unsqueeze&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;expert_id&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;enumerate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;idx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;]).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;squeeze&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# (B, output_dim)
&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;output&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;expert_outputs&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# weighted sum
&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;output&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;the-problem-of-moe-routing&quot;&gt;The problem of MOE Routing&lt;/h2&gt;

&lt;p&gt;2017년도에 나온 Mixture of Experts의 모델은 복잡한 routing알고리즘이 필요했습니다. 
특히 1개 이상의 expersts를 선택하는 방식을 취했는데, 그 이유는 non-trivial gradients 를 갖기 때문이라고 합니다.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;trivial gradients: gradients 값이 0에 가까운 매우 작은 수 -&amp;gt; 학습이 안됨&lt;/li&gt;
  &lt;li&gt;non-trivial gradients: 실제로 의미있는 (학습이 가능한) gradients&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;만약 softmax 의 결과값 중에서 하나만 취하게 된다면, argmax 를 사용할수 있습니다.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;idx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;argmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;logits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;문제는 argmax를 사용하게 되면 non-differentiable 연산이기 때문에, backpropagation이 불가능하게 됩니다.&lt;br /&gt;
(그냥 큰 값을 선택해서 index를 리턴하는 것기 때문에 discrete operation 이며, discontinuous function임 -&amp;gt; 미분 안됨)
즉 해당 값이 왜 선택되었는지, gradient 정보가 없습니다.&lt;/p&gt;

&lt;p&gt;따라서 여러개를 선택하는 방식이 필요했습니다.&lt;/p&gt;

&lt;h2 id=&quot;expert-capacity&quot;&gt;Expert Capacity&lt;/h2&gt;

&lt;p&gt;Switch Transformer 를 배우기 전에, 먼저 Expert Capacity 개념을 이해해야 합니다.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Imbalance는 expert 가 선택되는 빈도에 따라서 발생하기도 합니다.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;예를 들어서 “Mixture of Experts is AWESOME” 이라는 문장이 Router를 거쳤을때, 특정 Expert 만 선택된다면 undertraining이 발생할수 있습니다.&lt;br /&gt;&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Token      Softmax(logits)                  Selected Expert
------------------------------------------------------------
Mixture    [0.7, 0.1, 0.1, 0.1]             Expert 0
of         [0.8, 0.1, 0.0, 0.1]             Expert 0
Experts    [0.6, 0.1, 0.1, 0.2]             Expert 0
is         [0.5, 0.2, 0.1, 0.2]             Expert 0
AWESOME    [0.1, 0.1, 0.7, 0.1]             Expert 2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;즉 어떤 expert가 선택 되느냐가 아니라, 얼마나 많이 특정 expert가 선택되느냐가 중요합니다.&lt;br /&gt;
해당 문제를 해결하기 위해서, Expert Capacity 개념이 도입되었습니다.&lt;br /&gt;
특정 expert의 capacity (token의 갯수)를 정해놓고, 해당 capacity를 초과하는 경우에는 다른 expert를 선택하도록 합니다.&lt;br /&gt;
이것을 “token overflow” 라고 합니다.&lt;/p&gt;

&lt;p&gt;Transformer 모델 쓰는데, 어떻게 이게 가능하냐 라고 생각이 들면 아키텍쳐를 보면 이해가 됩니다.&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Input Sentence (문장)
    ↓
Tokenization
    ↓
Embedding
    ↓
Multi-Head Attention (공통 처리)
    ↓
FFN (token마다 expert 선택!)
    ↓
Output
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;즉 FFN 부분에서 expert가 선택되고 FFN가 처리되기 때문에, token마다 다른 expert가 선택될 수 있습니다.&lt;br /&gt;
그래서 expert capacity를 정해놓고, 해당 capacity를 초과하는 경우에는 다른 expert를 선택하도록 정할수 있습니다.
또는 dropped token 이라고 해서, 해당 token을 처리하지 않고, 넘어가도록 할수도 있습니다.&lt;/p&gt;

\[\text{Expert Capacity} = \left( \frac{\text{tokens_per_batch}}{\text{num_experts}} \times \text{capacity_factor} \right)\]

&lt;ul&gt;
  &lt;li&gt;tokens_per_batch: 배치당 토큰의 갯수&lt;/li&gt;
  &lt;li&gt;num_experts: 전체 expert의 갯수&lt;/li&gt;
  &lt;li&gt;capacity_factor: expert의 capacity를 조정하는 factor (예를 들어서 1.0 이면, 각 expert가 배치당 토큰의 갯수 / 전체 expert의 갯수 만큼의 capacity를 갖게 됨)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;예제&lt;/p&gt;

\[\text{Expert Capacity} = \left( \frac{100}{4} \times 1.0 \right) = 25\]

&lt;p&gt;각 Expert는 배치당 25개의 토큰을 처리할 수 있습니다. &lt;br /&gt;
만약 어떤 expert가 25개를 초과하는 경우에는 다른 expert를 선택하거나, dropped token 으로 그냥 스킵합니다.&lt;br /&gt;
보통 capacity_factor는 1.0 에서 1.5 사이로 설정합니다.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;capacity_factor 가 높을수록: dropped token이 줄어듬 -&amp;gt; 반면에 연산량이 늘어남&lt;/li&gt;
  &lt;li&gt;capacity_factor 가 낮을수록: dropped token이 늘어남 -&amp;gt; 반면에 연산량이 줄어듬&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;아래는 Switch Transformer 사용시 capacity factor가 줄어들면서 , 연산량이 줄어드는 것을 보여줍니다.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/capacity-factor-switch-transformer.png&quot; class=&quot;img-responsive img-rounded img-fluid center&quot; style=&quot;border: 2px solid #333333&quot; /&gt;&lt;/p&gt;

&lt;h1 id=&quot;switch-transformer&quot;&gt;Switch Transformer&lt;/h1&gt;

&lt;h2 id=&quot;switch-routing-rethinking-mixture-of-experts&quot;&gt;Switch Routing: Rethinking Mixture of Experts&lt;/h2&gt;

&lt;p&gt;Switch Transformer 는 이러한 복잡한 Routing 알고리즘을 단순화 하면서, 더 높은 성능을 보여줍니다.&lt;br /&gt;
즉 k=1 routing은 Switch Routing이라고 부릅니다.&lt;/p&gt;

&lt;p&gt;이로인 인한 장점은 다음과 같습니다.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Router computation 이 단순해지며, 오직 single expert만 선택합니다.&lt;/li&gt;
  &lt;li&gt;expert capacity가 반으로 줄어들수 있습니다.&lt;/li&gt;
  &lt;li&gt;routing implementaion이 단순해집니다.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;1번은 single expert만 선택하기 때문에, routing computation이 단순해집니다.&lt;br /&gt;
2번은 expert capacity에서 쉽게 말하면, dropped token이 발생하면서 연산량이 줄어들수 있습니다.&lt;br /&gt;
3번은 그냥 구현이 단순해집니다.&lt;/p&gt;

&lt;h2 id=&quot;differentiable-load-balancing-loss&quot;&gt;Differentiable Load Balancing Loss&lt;/h2&gt;

\[\begin{align}
P_i &amp;amp;= \frac{1}{T} \sum_{x \in \beta} p_i(x) \\
fi &amp;amp;= \frac{1}{T} \sum_{x \in \beta} \mathbf{1} \{ argmax p(x) = i \} \\
loss &amp;amp;= \alpha \cdot N \cdot \sum^N_{i=1} f_i \cdot P_i \\
\end{align}\]

&lt;ul&gt;
  &lt;li&gt;P_i
    &lt;ul&gt;
      &lt;li&gt;\(P_i (X)\): token x가 expert i로 갈 확률. (softmax의 결과값)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;f_i
    &lt;ul&gt;
      &lt;li&gt;T: batch안의 token 갯수&lt;/li&gt;
      &lt;li&gt;argmax p(x): token x를 라우팅할때 선택된 expert의 index&lt;/li&gt;
      &lt;li&gt;쉽게 설명하면, batch 안의 token이 expert i로 라우팅 됐는지 비율&lt;/li&gt;
      &lt;li&gt;예를 들어서 batch안에 token이 100개가 있고, 그중에 20개가 expert i로 라우팅 됐다면, f_i = 0.2 가 됩니다.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;loss
    &lt;ul&gt;
      &lt;li&gt;a (alpha): scaling factor (hyperparameter 이며 보통 0.01로 설정)&lt;/li&gt;
      &lt;li&gt;N: number of experts&lt;/li&gt;
      &lt;li&gt;f_i: 실제 routing된 token 중에서 expert i에 할당된 비율 (fraction of tokens dispatched to expert i)&lt;/li&gt;
      &lt;li&gt;P_i: expert i에 할당된 확률 총합 (평균)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;즉, 실제 token 분배 분포 f_i 와 router (softmax)의 분포 P_i 의 dot product를 계산 -&amp;gt; scaling 계수를 곱한것&lt;br /&gt;
f_i (token 분배 분포) 와 P_i (softmax의 분포)가 일치할수록 loss가 작아집니다.&lt;br /&gt;
(이때 gradient계산 할때 f 는 non-differentiable 이며, P는 differentiaible 입니다.)&lt;/p&gt;

&lt;p&gt;즉 다음은 “완벽히 균형 잡힌 상태 (ideal case)” 입니다.&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Expert:       E1     E2     E3     E4
f_i:          0.25   0.25   0.25   0.25   (실제 라우팅 분포)
P_i:          0.25   0.25   0.25   0.25   (softmax 기대 확률)

f_i * P_i:    0.0625 0.0625 0.0625 0.0625
Sum(f ⋅ P):   0.25   (최소값!)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;만약 편향된 상태가 된다면, 다음과 같습니다.&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Expert:       E1     E2     E3     E4
f_i:          0.70   0.10   0.10   0.10   (거의 E1만 쓰임)
P_i:          0.40   0.20   0.20   0.20   (softmax도 약간 E1 치우침)

f_i * P_i:    0.28   0.02   0.02   0.02
Sum(f ⋅ P):   0.34   (커짐 → loss ↑)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Router Implementation&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;torch&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;torch.nn.functional&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;F&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;torch&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Router&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;
    The router module determines which expert each token is sent to.
    It also computes the load balancing loss.
    &quot;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_experts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;nb&quot;&gt;super&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_experts&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_experts&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gate&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Linear&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_experts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;forward&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# x: (batch_size, seq_len, d_model)
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;logits&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# logits: (batch_size, seq_len, num_experts)
&lt;/span&gt;        
        &lt;span class=&quot;c1&quot;&gt;# Get the top-1 expert for each token
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;top1_logits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;top1_indices&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;logits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# top1_indices: (batch_size, seq_len)
&lt;/span&gt;        
        &lt;span class=&quot;c1&quot;&gt;# Create a one-hot encoding of the expert indices
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# This will be used to dispatch tokens to the correct expert
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;expert_mask&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;one_hot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;top1_indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_experts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# expert_mask: (batch_size, seq_len, num_experts)
&lt;/span&gt;        
        &lt;span class=&quot;c1&quot;&gt;# Calculate the load balancing loss
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# This loss encourages all experts to be used equally.
&lt;/span&gt;        
        &lt;span class=&quot;c1&quot;&gt;# Count how many tokens are sent to each expert
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;tokens_per_expert&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;expert_mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# (num_experts)
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# Total number of tokens
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;total_tokens&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Calculate the fraction of tokens sent to each expert
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;fraction_tokens_per_expert&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokens_per_expert&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;total_tokens&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Calculate the expert probabilities from the logits
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;expert_probs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;softmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;logits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# (num_experts)
&lt;/span&gt;        
        &lt;span class=&quot;c1&quot;&gt;# The load balancing loss is the dot product of these two quantities
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;load_balancing_loss&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_experts&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fraction_tokens_per_expert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;expert_probs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;expert_mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;load_balancing_loss&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;expert-capacity-1&quot;&gt;Expert Capacity&lt;/h2&gt;

&lt;h1 id=&quot;implementation&quot;&gt;Implementation&lt;/h1&gt;

&lt;p&gt;여기서 전체 코드가 아닌 핵심이 되는 코드만 공유 합니다.&lt;/p&gt;

&lt;h2 id=&quot;switch-transformer-model&quot;&gt;Switch Transformer Model&lt;/h2&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Input Tokens
     │
     ▼
[Embedding Layer] ──► nn.Embedding(ntoken, d_model)
     │
     ▼
[Positional Encoding] ──► PositionalEncoding(d_model, dropout)
     │
     ▼
[Transformer Encoder Stack]
     │
     └─► (L layers of)
           └─► [Multi-head Attention]
           └─► [LayerNorm + Residual]
           └─► [Switch FFN (MoE Layer)]
               └─► Router → Expert selection
               └─► Dispatch token to Expert
               └─► Gather output
           └─► [LayerNorm + Residual]
     │
     ▼
[Final Linear Decoder] ──► nn.Linear(d_model, ntoken)
     │
     ▼
Vocabulary Logits (for language modeling)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;SwitchTransformerLM&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ntoken&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nhead&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d_ff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_experts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_layers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                 &lt;span class=&quot;n&quot;&gt;dropout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;nb&quot;&gt;super&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model_type&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;Transformer&apos;&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d_model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d_model&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;encoder&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Embedding&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ntoken&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pos_encoder&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PositionalEncoding&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dropout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;encoder_layer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SwitchTransformerEncoderLayer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nhead&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d_ff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_experts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dropout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transformer_encoder&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SwitchTransformerEncoder&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;encoder_layer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_layers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;decoder&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Linear&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ntoken&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;switchtransformer-encoder-layer---moe-layer&quot;&gt;SwitchTransformer Encoder Layer  + MoE Layer&lt;/h2&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Input: Token Hidden States (src)
     │
     ▼
[Multi-Head Self-Attention]
     │
     ▼
🟦 MoELayer  ← ← ← ← ← ← ← ← ← ← ← ← ← 🧠 핵심 포인트!
     │
     ├─► Router(x)
     │     └─► Routing logits (W_r · x)
     │     └─► Softmax → Top-1 Expert 선택
     │     └─► expert_mask 생성 (one-hot)  
     │
     ├─► Capacity 계산 (token overflow 방지)
     │
     ├─► 각 Expert 별로:
     │     └─ Token dispatch (x[expert_indices])
     │     └─ Expert FFN 처리
     │     └─ Output scatter to original positions
     │
     └─► 최종 Output (sparse computation 결과)
             + Load Balancing Loss
     ▼
Dropout + Residual
     ▼
LayerNorm #2
     ▼
Output Hidden States, Load Balancing Loss
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;SwitchTransformerEncoderLayer 에서 가장 중요한 부분은 아래 코드이며,&lt;br /&gt;
기존 TransformerEncoderLayer와 MoELayer를 결합한 것입니다.&lt;br /&gt;
Attention 연산 후에 MoE layer를 적용하여, 각 토큰마다 다른 expert를 선택하고 처리합니다.&lt;br /&gt;&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;SwitchTransformerEncoderLayer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nhead&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d_ff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_experts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dropout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;nb&quot;&gt;super&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self_attn&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MultiheadAttention&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nhead&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dropout&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dropout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;moe_layer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MoELayer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d_ff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_experts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;norm1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LayerNorm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;norm2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LayerNorm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dropout&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Dropout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dropout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;forward&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;src&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;src_mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Optional&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;src_key_padding_mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Optional&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tuple&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# Multi-head self-attention
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;src2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self_attn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;src&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;src&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;src&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;attn_mask&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;src_mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;key_padding_mask&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;src_key_padding_mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;src&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;src&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dropout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;src2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;src&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;norm1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;src&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;c1&quot;&gt;# MoE layer
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;src2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;load_balancing_loss&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;moe_layer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;src&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;src&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;src&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dropout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;src2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;src&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;norm2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;src&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;src&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;load_balancing_loss&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;moe-layer&quot;&gt;MoE Layer&lt;/h2&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Input (x): (batch_size, seq_len, d_model)
   │
   ▼
Router (W_r · x → softmax → top-1 expert 선택)
   │
   ▼
Expert Mask (one-hot): (batch_size, seq_len, num_experts)
   │
   ▼
Flatten: x → (B×T, d_model), mask → (B×T, num_experts)
   │
   ▼
for each expert i in num_experts:
    - token 선택 (해당 expert로 라우팅된 것만)
    - capacity 초과 시 overflow drop
    - expert_i(input) → FFN 처리
    - 결과를 final_output 에 다시 index_add

   ▼
Reshape: (B×T, d_model) → (batch_size, seq_len, d_model)

Return: final_output, load_balancing_loss
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;
&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;MoELayer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;
    The Mixture-of-Experts (MoE) layer, which replaces the FFN layer in a standard Transformer.
    &quot;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d_ff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_experts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;capacity_factor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1.25&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;nb&quot;&gt;super&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d_model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d_model&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_experts&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_experts&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;capacity_factor&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;capacity_factor&lt;/span&gt;

        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;router&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Router&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_experts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;experts&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ModuleList&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Expert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d_ff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_experts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)])&lt;/span&gt;
        
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;forward&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tuple&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;
        Example: Let&apos;s say we have:
        - batch_size=2, seq_len=4, d_model=8, num_experts=2
        - Input x: shape (2, 4, 8) - 2 sequences, each with 4 tokens of 8 dimensions
        &quot;&quot;&quot;&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# x: (batch_size, seq_len, d_model)
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# Example: x.shape = (2, 4, 8)
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;seq_len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d_model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Get the expert mask and load balancing loss from the router
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# The router decides which expert should process each token
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;expert_mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;load_balancing_loss&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;router&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# expert_mask: (batch_size, seq_len, num_experts)
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# Example: expert_mask.shape = (2, 4, 2)
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;#   expert_mask[0, 0, :] = [1, 0] means token 0 goes to expert 0
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;#   expert_mask[0, 1, :] = [0, 1] means token 1 goes to expert 1
&lt;/span&gt;        
        &lt;span class=&quot;c1&quot;&gt;# Determine the capacity of each expert
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;capacity&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq_len&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_experts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;capacity_factor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Reshape tensors for easier processing - flatten batch and sequence dimensions
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;x_flat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;view&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# (batch_size * seq_len, d_model)
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;expert_mask_flat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;expert_mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;view&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_experts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# (batch_size * seq_len, num_experts)
&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# Initialize output tensor with zeros - same shape as flattened input
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;final_output&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;zeros_like&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_flat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# Example: final_output.shape = (8, 8)
&lt;/span&gt;        
        &lt;span class=&quot;c1&quot;&gt;# Process each expert separately
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;expert&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;enumerate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;experts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
            &lt;span class=&quot;c1&quot;&gt;# Find which tokens should go to this expert (non-zero entries in the mask)
&lt;/span&gt;            &lt;span class=&quot;n&quot;&gt;expert_indices&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;where&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;expert_mask_flat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[:,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
            &lt;span class=&quot;c1&quot;&gt;# Example for expert 0: expert_indices might be [0, 2, 4] (tokens 0, 2, 4)
&lt;/span&gt;            &lt;span class=&quot;c1&quot;&gt;# Example for expert 1: expert_indices might be [1, 3, 5, 6, 7] (tokens 1, 3, 5, 6, 7)
&lt;/span&gt;            
            &lt;span class=&quot;c1&quot;&gt;# Apply capacity constraint - if too many tokens assigned, keep only the first &apos;capacity&apos; tokens
&lt;/span&gt;            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;expert_indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;capacity&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;expert_indices&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;expert_indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;capacity&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
                &lt;span class=&quot;c1&quot;&gt;# Example: if expert 1 has 5 tokens but capacity=2, keep only [1, 3]
&lt;/span&gt;
            &lt;span class=&quot;c1&quot;&gt;# Process tokens assigned to this expert
&lt;/span&gt;            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;expert_indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
                &lt;span class=&quot;c1&quot;&gt;# Extract the tokens that should go to this expert
&lt;/span&gt;                &lt;span class=&quot;n&quot;&gt;expert_input&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_flat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;expert_indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
                &lt;span class=&quot;c1&quot;&gt;# Example: if expert_indices=[1, 3], expert_input.shape = (2, 8)
&lt;/span&gt;                
                &lt;span class=&quot;c1&quot;&gt;# Process the tokens through the expert network
&lt;/span&gt;                &lt;span class=&quot;n&quot;&gt;expert_output&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;expert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;expert_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                &lt;span class=&quot;c1&quot;&gt;# Example: expert_output.shape = (2, 8) - same as expert_input
&lt;/span&gt;                
                &lt;span class=&quot;c1&quot;&gt;# Ensure dtype consistency
&lt;/span&gt;                &lt;span class=&quot;n&quot;&gt;expert_output&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;expert_output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;final_output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dtype&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

                &lt;span class=&quot;c1&quot;&gt;# Place the expert&apos;s output back to the final output at the correct positions
&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;# Switch Transformer uses exclusive routing: each token goes to exactly ONE expert
&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;# So we use assignment (=) not addition (+=)
&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;# Example: if expert_indices=[1, 3] and expert_output has 2 rows
&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;#   final_output[1] = expert_output[0]  # Place expert&apos;s output for token 1
&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;#   final_output[3] = expert_output[1]  # Place expert&apos;s output for token 3
&lt;/span&gt;                &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;token_idx&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;enumerate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;expert_indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;final_output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;token_idx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;expert_output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

        &lt;span class=&quot;c1&quot;&gt;# Reshape back to original dimensions
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;final_output&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;final_output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;view&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;seq_len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# Example: final_output.shape = (2, 4, 8) - back to original shape
&lt;/span&gt;        
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;final_output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;load_balancing_loss&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
</description>
        <pubDate>Sun, 01 Jun 2025 01:00:00 +0000</pubDate>
        <link>http://incredible.ai/nlp/2025/06/01/Switch-Transformer/</link>
        <guid isPermaLink="true">http://incredible.ai/nlp/2025/06/01/Switch-Transformer/</guid>
        
        
        <category>nlp</category>
        
      </item>
    
  </channel>
</rss>
