
Fine-tuning 101: Qwen3 0.6B on CodeForces Data




Introduction

Powerful AI models are heavily post-trained for reasoning, coding, and agentic capabilities like instruction-following and tool calling. In fact, one hypothesis for why Claude Code (the CLI) works so well is that the model is post-trained on the very tools the harness (Claude Code) uses, making the full stack (harness + model) better than either part alone. This capability is the result of deliberate post-training using techniques like Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO).

I wanted to understand this process hands-on. In this blog, I’ll walk you through my journey fine-tuning Qwen3-0.6B on the Hugging Face CodeForces dataset, using HF cloud infrastructure, the HF model-trainer Claude skill, and Claude Code.

We will try Supervised Fine-Tuning (SFT) today.

What We’re Doing

Problem:

You are given an array a of n integers, where n is odd. In one operation, you remove two adjacent elements from the array a and then concatenate the remaining parts. For example, given the array [4,7,4,2,9], we can obtain [4,2,9] by removing [4,7]. You repeatedly perform this operation until exactly one element remains in a. Find the maximum possible value of the remaining element.

Messages:

```json
{
  "messages": [
    {
      "role": "user",
      "content": "You will be given a competitive programming problem. Please reason step by step about the solution, then provide a complete implementation in C++17..."
    },
    {
      "role": "assistant",
      "content": "Let's think through this step by step...\n\n```cpp\n\n```"
    }
  ]
}
```
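To see what the trainer actually consumes, here is a small sketch (my own illustration, not code from the run) that renders this messages format through Qwen3’s chat template; the message contents are truncated placeholders:

```python
from transformers import AutoTokenizer

# Load the tokenizer for the base model we are fine-tuning.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

example = {
    "messages": [
        {"role": "user", "content": "You will be given a competitive programming problem..."},
        {"role": "assistant", "content": "Let's think through this step by step..."},
    ]
}

# Render the conversation into the single training string the model sees,
# with Qwen's special tokens (<|im_start|>, <|im_end|>) inserted.
text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
print(text)
```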

Claude Code with HF MCP helps write the fine-tuning script, monitor and push metrics, and run jobs on the HF GPU infrastructure.

Keywords

| Term | Definition |
| --- | --- |
| Supervised Fine-Tuning (SFT) | The pre-trained model (like Qwen3) has broad knowledge from internet text. SFT teaches it task-specific skills by showing it examples of correct behavior: understanding competitive programming problems, reasoning through solution approaches, generating correct C++ code, and explaining the solution clearly. |
| LoRA (Low-Rank Adaptation) | Qwen3-0.6B has 600 million parameters. LoRA doesn’t modify the original model weights; instead, it adds small “adapter” layers that learn task-specific transformations. We train only about 1% of the total parameters yet still achieve effective learning: the adapters capture the task-specific patterns while the frozen base model retains its general knowledge. |
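As a concrete illustration, here is roughly what a LoRA setup looks like with the peft library; the rank, alpha, and target modules below are typical defaults I chose for the sketch, not values from my run:

```python
from peft import LoraConfig

# Low-rank adapters are injected alongside the attention projections; the
# base weights stay frozen and only the small adapter matrices are trained.
lora_config = LoraConfig(
    r=16,           # rank of the adapter matrices (illustrative)
    lora_alpha=32,  # scaling factor for the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```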

Key Hyper-parameters

| Parameter | Value | What It Does | Why This Value? |
| --- | --- | --- | --- |
| Batch Size | 2 (per GPU) | Number of examples processed simultaneously | Memory-efficient for a 24GB GPU |
| Epochs | 3 | Number of full passes through the training data | Multiple passes help the model learn from a small 1,000-example dataset |
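In TRL these map onto SFTConfig fields; a minimal sketch using the two values from the table (the other arguments are illustrative):

```python
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="qwen3-codeforces-sft",  # hypothetical output path
    per_device_train_batch_size=2,      # batch size from the table
    num_train_epochs=3,                 # epochs from the table
    logging_steps=10,                   # illustrative logging cadence
)
```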

Model Evaluation Metrics

How do we evaluate the fine-tuned model?

| Metric | Definition |
| --- | --- |
| Training loss | Measures how well the model predicts tokens on the training data |
| Validation loss | Measures how well the model predicts tokens on unseen evaluation data, as a check on generalization |
| Token accuracy (eval set) | The model predicts next tokens, which we compare against the ground truth |
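Token accuracy is just next-token prediction checked against the ground truth; a minimal PyTorch sketch of the computation (my own illustration):

```python
import torch

def token_accuracy(logits: torch.Tensor, labels: torch.Tensor, ignore_index: int = -100) -> float:
    """Fraction of positions where the model's top-1 next-token guess matches the label."""
    # Shift so position i predicts token i+1, matching causal-LM training.
    preds = logits[:, :-1, :].argmax(dim=-1)
    targets = labels[:, 1:]
    mask = targets != ignore_index  # skip padding / masked positions
    return (preds[mask] == targets[mask]).float().mean().item()
```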

Fine-tuning Implementation

  1. First run: a sample fine-tuning run on 500 examples with 1 epoch
  2. Second run: a fine-tuning run with 1,000 examples and 3 epochs

The first goal was simply to see a training job complete successfully, so I started with just 500 examples for rapid iteration.

Early attempts had a lot of failures.


The key learnings from those failures shaped the final run: 1,000 examples with a train-eval split.

Here is the code: https://gist.github.com/kn-neeraj/140c2184233d922be0c2cb92d21dfa00
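For orientation, here is a condensed sketch of what such a script boils down to with TRL’s SFTTrainer. The dataset id, split sizes, and LoRA values are placeholders, so see the gist for the actual code:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Placeholder dataset id; the real run used a CodeForces SFT dataset from the Hub.
dataset = load_dataset("open-r1/codeforces-cots", split="train[:1000]")
split = dataset.train_test_split(test_size=0.1, seed=42)  # train-eval split

trainer = SFTTrainer(
    model="Qwen/Qwen3-0.6B",
    train_dataset=split["train"],
    eval_dataset=split["test"],
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    args=SFTConfig(
        output_dir="qwen3-codeforces-sft",
        per_device_train_batch_size=2,
        num_train_epochs=3,
        eval_strategy="steps",  # periodically compute validation loss
    ),
)
trainer.train()
```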

Results & Interpretation

| Metric | 500 Examples (1 epoch) | 1,000 Examples (3 epochs, train-eval split) | Improvement |
| --- | --- | --- | --- |
| Final Loss | training loss: 1.1328 | training loss: 0.85, evaluation loss: 0.35 | 65% better |
| Token Accuracy | training: 75% | training: 78%, evaluation: 89% | +14% |
| Training Time | ~7 min | ~18 min | 2.6x longer |

Loss Function Graphs from the run


Observations:

The model was genuinely learning to solve new problems, not just memorizing the training set.

Key Learnings & What’s Next!

Learning 1 -> Claude Code with HF MCP & the model-trainer skill works, but needs a lot of context. HF MCP gives Claude tools like the ones below to support the full fine-tuning cycle:

  1. mcp__hf-skills__hf_jobs - Submit training jobs to cloud GPUs
  2. mcp__hf-skills__model_search - Find models on the Hub
  3. mcp__hf-skills__dataset_search - Discover datasets
  4. mcp__hf-skills__hub_repo_details - Get model/dataset info

Claude Code helped write the training script and used HF MCP to submit the training jobs.

What’s Next -> A fine-tuning run on the full dataset (47,000 examples) and an evaluation on a benchmark dataset using LLM-as-a-judge.

Visualising the Full Stack

┌─────────────────────────────────────────┐
│  Claude Code + HF MCP                   │
│  (Orchestration & Debugging)            │
└───────────────┬─────────────────────────┘

┌───────────────▼─────────────────────────┐
│  Hugging Face Jobs                      │
│  (Cloud GPU Infrastructure)             │
└───────────────┬─────────────────────────┘

┌───────────────▼─────────────────────────┐
│ TRL (Transformer Reinforcement Learning)│
│  (SFTTrainer, LoRA, Training Loop)      │
└───────────────┬─────────────────────────┘

┌───────────────▼─────────────────────────┐
│  Trackio                                │
│  (Metrics & Monitoring)                 │
└─────────────────────────────────────────┘
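At the bottom of the stack, Trackio records the metrics. A hedged sketch of standalone logging, assuming Trackio’s wandb-style init/log/finish interface (check the Trackio docs for your version):

```python
import trackio

# Assumed wandb-style interface; project name is hypothetical.
trackio.init(project="qwen3-codeforces-sft")
trackio.log({"train/loss": 0.85, "eval/loss": 0.35})
trackio.finish()
```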

