LLMs are dumb

Recently I stumbled upon a very simple problem that most LLMs could not seem to solve, likely because this particular twist on an otherwise common task is not present (enough) in the training data. The problem is as follows: given a PyTorch convolutional encoder and decoder, change the decoder so that it returns a BxCx28x28 tensor instead of the BxCx32x32 tensor it currently produces. Simple, right? Anyone with knowledge of PyTorch and CNNs should be able to solve this within a couple of minutes.

import torch
import torch.nn as nn


class CNNEncoder(nn.Module):
    ...


class CNNDecoder(nn.Module):
    def __init__(self, num_input_channels: int = 16, num_filters: int = 32, z_dim: int = 20):
        """
        num_input_channels - Number of channels of the image to reconstruct.
        num_filters - Number of filters we use in the last convolutional layers.
        z_dim - Dimensionality of latent representation z
        """
        super().__init__()

        self.num_input_channels = num_input_channels
        c_hid = num_filters
        self.linear = nn.Sequential(nn.Linear(z_dim, 2 * 16 * c_hid), nn.GELU())

        self.net = nn.Sequential(
            nn.ConvTranspose2d(2 * c_hid, 2 * c_hid, kernel_size=3, output_padding=1, padding=1, stride=2), # 4x4 => 8x8
            nn.GELU(),
            nn.Conv2d(2 * c_hid, 2 * c_hid, kernel_size=3, padding=1),
            nn.GELU(),
            nn.ConvTranspose2d(2 * c_hid, c_hid, kernel_size=3, output_padding=1, padding=1, stride=2), # 8x8 => 16x16
            nn.GELU(),
            nn.Conv2d(c_hid, c_hid, kernel_size=3, padding=1),
            nn.GELU(),
            nn.ConvTranspose2d(c_hid, num_input_channels, kernel_size=3, output_padding=1, padding=1, stride=2), # 16x16 => 32x32
        )

    def forward(self, z):
        """
        z - Latent vector of shape [B,z_dim]
        """
        x = self.linear(z)
        x = x.reshape(x.shape[0], -1, 4, 4)
        x = self.net(x)
        return x
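For context, the 32x32 output follows directly from the transposed-convolution output-size formula, H_out = (H_in - 1) * stride - 2 * padding + kernel_size + output_padding (with dilation 1). A quick sanity check with the default arguments:

# 4x4 => (4-1)*2 - 2*1 + 3 + 1 = 8 => 8x8 => 16x16 => 32x32
decoder = CNNDecoder()             # num_input_channels=16, num_filters=32, z_dim=20
z = torch.randn(8, 20)             # batch of 8 latent vectors
print(decoder(z).shape)            # torch.Size([8, 16, 32, 32])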

To my surprise, however, most of the LLMs I tried were not able to solve it. I kept my prompt simple:

The following code contains an encoder and a decoder. Currently, the decoder outputs a BxCx32x32 tensor. Change the decoder so it returns a BxCx28x28 tensor. You are only allowed to change the __init__ method of the CNNDecoder.

<code>

Some models tried to be sneaky and changed the decoder's forward method as well, so I had to add the restriction that only the __init__ method could be altered. Each model got three tries, with no hints or redos after answering. The results are in the table below; the code solutions produced by the various models can be found on my GitHub.
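Grading an attempt only requires checking the output shape of the modified decoder; something along these lines is enough (illustrative, not necessarily the exact check I ran):

out = CNNDecoder()(torch.randn(4, 20))   # CNNDecoder as returned by the model under test
print(out.shape)                         # a correct answer prints torch.Size([4, 16, 28, 28])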

| Model | Shape | Thinking time | Notes |
| --- | --- | --- | --- |
| deepseek-r1-distill-qwen-32B | BxCx28x28 | 184 seconds | - |
| deepseek-r1 | BxCx28x28 | 112 seconds | - |
| deepseek-v3 | BxCx31x31 | - | - |
| gemini-1.5-pro-002 | BxCx25x25 | - | - |
| GPT-4o | BxCx31x31 | - | - |
| GPT-o3-mini | BxCx28x28 | 36 seconds | Second try |
| llama-3.1-405B-instruct-turbo | BxCx28x28 | - | - |
| llama-3.3-70B-instruct | BxCx32x32 | - | - |
| mixtral-8x22B | BxCx31x31 | - | - |
| qwen-2.5-72B-instruct | BxCx31x31 | - | - |
| sonar-reasoning-pro | Syntax error | 170 seconds | Went completely off the rails during its reasoning (spammed emojis and suddenly switched to Japanese) |
| sonar-pro | BxCx29x29 | - | - |

As the table above shows, only four of the 12 (roughly) SOTA models I tested arrived at the correct solution. Notably, all of them are reasoning models, with the exception of Llama-3.1-405B. If you ask me, a problem like this should not be so difficult that only the most recent reasoning models and a 405B-parameter model can solve it.
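For reference, here is one possible fix that touches only __init__ (a sketch of one valid solution, not necessarily the one the models were expected to find): append a 5x5 convolution with no padding to the end of self.net, which crops the 32x32 output down to 28x28.

# In CNNDecoder.__init__, only self.net changes:
self.net = nn.Sequential(
    nn.ConvTranspose2d(2 * c_hid, 2 * c_hid, kernel_size=3, output_padding=1, padding=1, stride=2), # 4x4 => 8x8
    nn.GELU(),
    nn.Conv2d(2 * c_hid, 2 * c_hid, kernel_size=3, padding=1),
    nn.GELU(),
    nn.ConvTranspose2d(2 * c_hid, c_hid, kernel_size=3, output_padding=1, padding=1, stride=2), # 8x8 => 16x16
    nn.GELU(),
    nn.Conv2d(c_hid, c_hid, kernel_size=3, padding=1),
    nn.GELU(),
    nn.ConvTranspose2d(c_hid, num_input_channels, kernel_size=3, output_padding=1, padding=1, stride=2), # 16x16 => 32x32
    nn.Conv2d(num_input_channels, num_input_channels, kernel_size=5), # 32x32 => 28x28 (32 - 5 + 1 = 28)
)

Adjusting the padding of the final transposed convolution should work as well (with padding=3: (16 - 1)*2 - 2*3 + 3 + 1 = 28), without adding an extra layer.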