Recently I stumbled upon a very simple problem that most LLMs could not seem to solve, likely because the specific goal of this otherwise common problem was not present (enough) in the training data. The problem is as follows: given a PyTorch convolutional encoder and decoder, change the decoder so that it returns a BxCx28x28 tensor instead of the BxCx32x32 tensor it currently produces. Simple, right? Anyone with knowledge of PyTorch and CNNs should be able to solve this within a couple of minutes.
```python
import torch
import torch.nn as nn


class CNNEncoder(nn.Module):
    ...


class CNNDecoder(nn.Module):
    def __init__(self, num_input_channels: int = 16, num_filters: int = 32, z_dim: int = 20):
        """
        num_input_channels - Number of channels of the image to reconstruct.
        num_filters - Number of filters we use in the last convolutional layers.
        z_dim - Dimensionality of latent representation z
        """
        super().__init__()
        self.num_input_channels = num_input_channels
        c_hid = num_filters
        self.linear = nn.Sequential(nn.Linear(z_dim, 2 * 16 * c_hid), nn.GELU())
        self.net = nn.Sequential(
            nn.ConvTranspose2d(2 * c_hid, 2 * c_hid, kernel_size=3, output_padding=1, padding=1, stride=2),  # 4x4 => 8x8
            nn.GELU(),
            nn.Conv2d(2 * c_hid, 2 * c_hid, kernel_size=3, padding=1),
            nn.GELU(),
            nn.ConvTranspose2d(2 * c_hid, c_hid, kernel_size=3, output_padding=1, padding=1, stride=2),  # 8x8 => 16x16
            nn.GELU(),
            nn.Conv2d(c_hid, c_hid, kernel_size=3, padding=1),
            nn.GELU(),
            nn.ConvTranspose2d(c_hid, num_input_channels, kernel_size=3, output_padding=1, padding=1, stride=2),  # 16x16 => 32x32
        )

    def forward(self, z):
        """
        z - Latent vector of shape [B,z_dim]
        """
        x = self.linear(z)
        x = x.reshape(x.shape[0], -1, 4, 4)
        x = self.net(x)
        return x
```
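For a transposed convolution, the output size is `(in - 1) * stride - 2 * padding + kernel_size + output_padding`, so each of the three stride-2 layers above exactly doubles the spatial size: 4 → 8 → 16 → 32. A quick sanity check of the current behaviour (the batch size of 8 is arbitrary):

```python
# Verify the current decoder's output shape (batch size 8 is arbitrary).
decoder = CNNDecoder(num_input_channels=16, num_filters=32, z_dim=20)
z = torch.randn(8, 20)
print(decoder(z).shape)  # torch.Size([8, 16, 32, 32])
```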
To my surprise, however, most of the LLMs I tried were unable to solve it. I kept my prompt simple:
> The following code contains an encoder and a decoder. Currently, the decoder outputs a BxCx32x32 tensor. Change the decoder so it returns a BxCx28x28 tensor. You are only allowed to change the `__init__` method of the CNNDecoder.
>
> `<code>`
Some models tried to be sneaky and changed the decoder's forward function as well, which is why I added the restriction that only the `__init__` method may be altered. I gave every model three tries, with no hints or redos after an answer was given. The results are shown in the table below; the code solutions produced by the various models can be found on my GitHub.
| Model | Output shape | Thinking time (s) | Notes |
|---|---|---|---|
| deepseek-r1-distill-qwen-32B | BxCx28x28 | 184 | - |
| deepseek-r1 | BxCx28x28 | 112 | - |
| deepseek-v3 | BxCx31x31 | - | - |
| gemini-1.5-pro-002 | BxCx25x25 | - | - |
| GPT-4o | BxCx31x31 | - | - |
| GPT-o3-mini | BxCx28x28 | 36 | Second try |
| llama-3.1-405B-instruct-turbo | BxCx28x28 | - | - |
| llama-3.3-70B-instruct | BxCx32x32 | - | - |
| mixtral-8x22B | BxCx31x31 | - | - |
| qwen-2.5-72B-instruct | BxCx31x31 | - | - |
| sonar-pro-reasoning-pro | Syntax error | 170 | Went badly off the rails somewhere in its thinking process (spammed emojis and suddenly switched to Japanese) |
| sonar-pro | BxCx29x29 | - | - |
As the table shows, only four of the 12 (roughly SOTA) models tested arrived at the correct solution. Notably, three of the four are reasoning models; the only non-reasoning model to succeed was Llama-3.1-405B. If you ask me, a problem like this should not be so difficult that only the most recent reasoning models and a 405B-parameter model can solve it.
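For completeness, here is one way to solve it; this is my own sketch, not taken from any particular model's answer. Using the output-size formula above, dropping the `output_padding` on the first transposed convolution maps 4x4 to 7x7 ((4 − 1) · 2 − 2 + 3 + 0 = 7), and the two remaining stride-2 layers then double that twice: 7 → 14 → 28. Only `self.net` inside `__init__` changes:

```python
# One possible fix (my own sketch): only the output_padding of the
# first transposed convolution changes, from 1 to 0.
self.net = nn.Sequential(
    nn.ConvTranspose2d(2 * c_hid, 2 * c_hid, kernel_size=3, output_padding=0, padding=1, stride=2),  # 4x4 => 7x7
    nn.GELU(),
    nn.Conv2d(2 * c_hid, 2 * c_hid, kernel_size=3, padding=1),
    nn.GELU(),
    nn.ConvTranspose2d(2 * c_hid, c_hid, kernel_size=3, output_padding=1, padding=1, stride=2),  # 7x7 => 14x14
    nn.GELU(),
    nn.Conv2d(c_hid, c_hid, kernel_size=3, padding=1),
    nn.GELU(),
    nn.ConvTranspose2d(c_hid, num_input_channels, kernel_size=3, output_padding=1, padding=1, stride=2),  # 14x14 => 28x28
)
```

The forward method is untouched: it still reshapes the latent to 4x4, and the modified net maps 4x4 → 7x7 → 14x14 → 28x28, so the decoder now returns a BxCx28x28 tensor.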