Tag: moe

CS336 Lecture 4, Mixture of Experts: Same Compute, More Parameters

June 26, 2026

llm moe mixture-of-experts deepseek cs336 language-modeling

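Notes on Stanford CS336 Lecture 4. MoE replaces the dense FFN with a router plus multiple experts, growing the parameter count while keeping per-token FLOPs roughly constant. Covers token-choice top-k routing, fine-grained and shared experts, load-balancing losses and DeepSeek V3's auxiliary-loss-free balancing, plus MLA and MTP.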
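To make the "same compute, more parameters" idea concrete, here is a minimal sketch in plain Python. It is illustrative only and not code from the lecture: the function names, the toy sizes (`d_model=512`, `d_ff=2048`, 8 experts), the choice to renormalize gates over the selected experts, and the decision to ignore biases are all assumptions made for this example.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of router logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    """Token-choice routing: one token picks its k highest-scoring experts.

    Returns (expert_index, gate_weight) pairs. Gates are renormalized over
    the selected experts, which is one common convention (assumed here).
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def ffn_params(d_model, d_ff):
    # Up-projection plus down-projection weights; biases ignored (assumption).
    return 2 * d_model * d_ff

def moe_param_counts(d_model, d_ff, num_experts, k):
    # Total parameters grow with num_experts; parameters touched per token
    # grow only with k, the number of experts each token is routed to.
    per_expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts
    total = num_experts * per_expert + router
    active = k * per_expert + router
    return total, active

# Hypothetical router scores of one token over four experts.
routes = top_k_route([2.0, 0.5, 1.0, -1.0], k=2)

# Hypothetical layer sizes: 8 experts, each token uses 1 of them.
dense = ffn_params(512, 2048)
total, active = moe_param_counts(d_model=512, d_ff=2048, num_experts=8, k=1)

print("selected experts and gates:", routes)
print("dense FFN params:", dense)
print("MoE total params:", total)
print("MoE active params per token:", active)
```

With these toy numbers the MoE layer holds about 8x the parameters of the dense FFN, while each token still touches roughly one dense FFN's worth of weights plus a small router.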