
Introduction
Powerful AI models are heavily post-trained for reasoning, coding, and agentic capabilities such as instruction following and tool calling. In fact, one hypothesis for why Claude Code (CLI) performs so well is that the model is post-trained on the very tools the harness (Claude Code) uses, making the full stack (harness + model) better than either part alone. This capability is the result of deliberate post-training using techniques like Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO).
I wanted to understand this process hands-on. In this blog, I'll walk you through my journey fine-tuning Qwen3-0.6B on a Hugging Face CodeForces dataset, using Hugging Face cloud infrastructure, the HF model-trainer Claude skill, and Claude Code.
We will try Supervised Fine-Tuning (SFT) today.
What We’re Doing
- Base Model: Qwen3-0.6B
- CodeForces Competition Problems Dataset: The open-r1/codeforces-cots dataset contains 47,000 competitive programming problems with solutions and chain-of-thought reasoning traces. Each example has a problem description (a competitive programming challenge) and a messages field: a conversational format with the problem as a user message and the solution plus reasoning as an assistant message (see the loading sketch after this list). Example:
Problem:
You are given an array a of n integers, where n is odd. In one operation, you will remove two adjacent elements from the array a, and then concatenate the remaining parts. For example, given the array [4,7,4,2,9], we can obtain [4,2,9] by removing [4,7]. You will repeatedly perform this operation until exactly one element remains in a. Find the maximum possible value of the remaining element.
Messages:

```json
{
  "messages": [
    {
      "role": "user",
      "content": "You will be given a competitive programming problem. Please reason step by step about the solution, then provide a complete implementation in C++17..."
    },
    {
      "role": "assistant",
      "content": "Let's think through this step by step...\n\n```cpp\n\n```"
    }
  ]
}
```
- Using HuggingFace Infrastructure & Tools (model-trainer skill and HF MCP)
- Hugging Face Jobs: Cloud GPU infrastructure (requires HF Pro account)
- TRL (Transformer Reinforcement Learning): The library that actually handles the training loop, LoRA integration, and optimization
- HF Model-Trainer Skill: https://github.com/huggingface/skills/tree/main/hf-llm-trainer/skills/model-trainer
Claude Code + HF MCP helps write the fine-tuning script, monitor and push metrics, and run jobs on the HF GPU infrastructure.
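To make the data format concrete, here is a minimal sketch of loading the dataset and inspecting one conversation (the default subset is assumed; the Hub page lists the available configs):

```python
from datasets import load_dataset

# Load the CodeForces chain-of-thought dataset (default subset assumed).
ds = load_dataset("open-r1/codeforces-cots", split="train")

# Each row carries a conversational `messages` list: a user turn with the
# problem statement and an assistant turn with reasoning plus a C++ solution.
for turn in ds[0]["messages"]:
    print(turn["role"], "->", turn["content"][:80])
```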
Keywords
| Terms | Definitions |
|---|---|
| Supervised Fine-Tuning (SFT) | The pre-trained model (like Qwen3) has broad knowledge from internet text. SFT teaches it task-specific skills by showing it examples of correct behavior. Here, that means: understand competitive programming problems, reason through solution approaches, generate correct C++ code, and explain the solution clearly |
| LoRA: Efficient Fine-Tuning | Qwen3-0.6B has 600 million parameters. LoRA (Low-Rank Adaptation) doesn't modify the original model weights. Instead, it adds small "adapter" layers that learn task-specific transformations. As a result, we train only about 1% of the total parameters but still achieve effective learning: the adapters capture the task-specific patterns while the frozen base model retains general knowledge. |
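As a rough sketch of what LoRA looks like in code (the rank and target modules here are illustrative assumptions, not the exact run config):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")

# Low-rank adapters are injected into the attention projections;
# the base weights stay frozen.
peft_config = LoraConfig(
    r=16,                  # rank of the adapter matrices (assumed)
    lora_alpha=32,         # scaling applied to the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# prints something like: trainable params ≈ 1% of all params
```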
Key Hyper-parameters
| Parameter | Value | What It Does | Why This Value? |
|---|---|---|---|
| Batch Size | 2 (per GPU) | Examples processed simultaneously | Memory-efficient for 24GB GPU |
| Epochs | 3 | Full passes through the training data | Multiple passes help the model learn from a small dataset of 1,000 examples |
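In TRL, these hyper-parameters map onto an SFTConfig. A sketch using the table's values (anything not in the table, like the learning rate, is an assumption):

```python
from trl import SFTConfig

args = SFTConfig(
    output_dir="qwen3-0.6b-codeforces-sft",
    per_device_train_batch_size=2,   # batch size from the table
    num_train_epochs=3,              # epochs from the table
    learning_rate=2e-4,              # assumed: a common LoRA SFT value
    logging_steps=10,                # assumed logging cadence
    eval_strategy="steps",           # evaluate periodically on the eval split
    eval_steps=50,                   # assumed evaluation cadence
)
```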
Model Evaluation Parameters
How do we evaluate the fine-tuned model?
| Metric | Definition |
|---|---|
| Training loss | Measures how well the model predicts tokens on the training data |
| Validation loss | Measures how well the model predicts tokens on unseen evaluation data to check generalization |
| Token Accuracy on eval set | The model predicts next tokens, which we compare with the ground truth |
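Mean token accuracy is simple to compute from the model's logits; here is a minimal sketch of the idea (not TRL's internal implementation):

```python
import torch

def mean_token_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of next-token predictions matching the ground truth.

    logits: (batch, seq, vocab); labels: (batch, seq), with -100
    marking positions excluded from the loss (padding / prompt).
    """
    preds = logits[:, :-1].argmax(dim=-1)  # predicted next token at each step
    targets = labels[:, 1:]                # labels shifted by one position
    mask = targets != -100                 # ignore masked-out positions
    correct = (preds == targets) & mask
    return (correct.sum() / mask.sum()).item()
```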
Fine-tuning Implementation
- First run: a sample fine-tuning run on 500 examples for 1 epoch
- Second run: a fine-tuning run on 1,000 examples for 3 epochs
The first goal was seeing a training job complete successfully, so I started with just 500 examples for rapid iteration.
Early attempts had a lot of failures:
- ❌ Missing dependencies (`trl`, `peft`, `bitsandbytes`)
- ❌ Format detection errors
- ❌ Job timeouts from underestimating training time
- ❌ Errors with `KeyError: 'completion'`

Key learnings:
- Dataset format: The TRL library was getting confused by the `messages` and `prompt` columns in the dataset. SFTTrainer saw `prompt` in the first example and assumed prompt-completion format. Then it looked for `completion`, which didn't exist, and crashed. The fix is to drop the extra column (see the sketch below).
- Use the Python UV script format: declare dependencies inline in script comments, since Jobs can't access local files (also shown below).
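Both fixes look roughly like this (a sketch; the dependency list mirrors the failures above):

```python
# /// script
# dependencies = [
#     "trl",
#     "peft",
#     "bitsandbytes",
#     "datasets",
# ]
# ///
from datasets import load_dataset

ds = load_dataset("open-r1/codeforces-cots", split="train")

# Keep only the conversational column so SFTTrainer detects the
# `messages` format instead of assuming prompt-completion.
ds = ds.select_columns(["messages"])
```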
Finally, I ran training on 1,000 examples with a train-eval split.
Here is the code: https://gist.github.com/kn-neeraj/140c2184233d922be0c2cb92d21dfa00
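In outline, the run boils down to something like this condensed sketch (the split fraction is an assumption; the gist has the full script):

```python
from trl import SFTTrainer

split = ds.train_test_split(test_size=0.1, seed=42)  # assumed 90/10 split

trainer = SFTTrainer(
    model=model,                   # the LoRA-wrapped Qwen3-0.6B from above
    args=args,                     # the SFTConfig shown earlier
    train_dataset=split["train"],
    eval_dataset=split["test"],
)
trainer.train()
trainer.save_model()
```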
Results & Interpretation
| Metric | 500 Examples (1 epoch) | 1,000 Examples (3 epochs, train-eval) | Improvement |
|---|---|---|---|
| Final Loss | training loss: 1.1328 | training loss: 0.85, evaluation loss: 0.35 | 65% better |
| Token Accuracy | training: 75% | training: 78%, evaluation: 89% | +14% |
| Training Time | ~7 min | ~18 min | 2.6x longer |
Loss Function Graphs from the run

Observations:
- Both train and eval loss decreased together, a good sign for generalization
- Eval metrics were slightly worse than train (expected), but the gap was small
- No signs of severe overfitting even after 3 epochs; there is room to train longer
- GPU memory: Still only ~10-12% utilized
The model was genuinely learning to solve new problems, not just memorizing the training set.
Key Learnings & What Next!
Learning 1 -> Claude Code with the HF MCP and model-trainer skill works, but needs a lot of context. HF MCP gives Claude tools like the following to support the full fine-tuning cycle:
- `mcp__hf-skills__hf_jobs` - Submit training jobs to cloud GPUs
- `mcp__hf-skills__model_search` - Find models on the Hub
- `mcp__hf-skills__dataset_search` - Discover datasets
- `mcp__hf-skills__hub_repo_details` - Get model/dataset info
Claude Code helped write the training script and used HF MCP to submit the training jobs.
What’s Next -> A fine-tuning run on the full dataset (47,000 examples) and evaluation on a benchmark dataset using LLM-as-a-judge.
Visualising the Full Stack
┌─────────────────────────────────────────┐
│ Claude Code + HF MCP │
│ (Orchestration & Debugging) │
└───────────────┬─────────────────────────┘
│
┌───────────────▼─────────────────────────┐
│ Hugging Face Jobs │
│ (Cloud GPU Infrastructure) │
└───────────────┬─────────────────────────┘
│
┌───────────────▼─────────────────────────┐
│ TRL (Transformer RL Library) │
│ (SFTTrainer, LoRA, Training Loop) │
└───────────────┬─────────────────────────┘
│
┌───────────────▼─────────────────────────┐
│ Trackio │
│ (Metrics & Monitoring) │
└─────────────────────────────────────────┘
Key References
- Hugging Face TRL Docs
- Claude model-trainer skill, which guides Claude Code to do most of the heavy lifting: https://github.com/huggingface/skills/tree/main/hf-llm-trainer/skills/model-trainer