LLMs are dumb

Recently I stumbled upon a very simple problem that most LLMs could not seem to solve, likely because this particular twist on an otherwise common task is not present (enough) in the training data. The problem is as follows: given a PyTorch convolutional encoder and decoder, change the decoder so that it returns a BxCx28x28 tensor instead of the BxCx32x32 tensor it currently produces. Simple, right? Anyone with knowledge of PyTorch and CNNs should be able to solve this within a couple of minutes.

import torch
import torch.nn as nn


class CNNEncoder(nn.Module):
    ...


class CNNDecoder(nn.Module):
    def __init__(self, num_input_channels: int = 16, num_filters: int = 32, z_dim: int = 20):
        """
        num_input_channels - Number of channels of the image to reconstruct.
        num_filters - Number of filters we use in the last convolutional layers.
        z_dim - Dimensionality of latent representation z
        """
        super().__init__()

        self.num_input_channels = num_input_channels
        c_hid = num_filters
        self.linear = nn.Sequential(nn.Linear(z_dim, 2 * 16 * c_hid), nn.GELU())

        self.net = nn.Sequential(
            nn.ConvTranspose2d(2 * c_hid, 2 * c_hid, kernel_size=3, output_padding=1, padding=1, stride=2), # 4x4 => 8x8
            nn.GELU(),
            nn.Conv2d(2 * c_hid, 2 * c_hid, kernel_size=3, padding=1),
            nn.GELU(),
            nn.ConvTranspose2d(2 * c_hid, c_hid, kernel_size=3, output_padding=1, padding=1, stride=2), # 8x8 => 16x16
            nn.GELU(),
            nn.Conv2d(c_hid, c_hid, kernel_size=3, padding=1),
            nn.GELU(),
            nn.ConvTranspose2d(c_hid, num_input_channels, kernel_size=3, output_padding=1, padding=1, stride=2), # 16x16 => 32x32
        )

    def forward(self, z):
        """
        z - Latent vector of shape [B,z_dim]
        """
        x = self.linear(z)
        x = x.reshape(x.shape[0], -1, 4, 4)
        x = self.net(x)
        return x
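For context, the 32x32 output follows directly from the transposed-convolution output-size formula, H_out = (H_in - 1) * stride - 2 * padding + kernel_size + output_padding (with dilation 1). A quick sanity check with the default arguments:

# 4x4 => (4-1)*2 - 2*1 + 3 + 1 = 8 => 8x8 => 16x16 => 32x32
decoder = CNNDecoder()             # num_input_channels=16, num_filters=32, z_dim=20
z = torch.randn(8, 20)             # batch of 8 latent vectors
print(decoder(z).shape)            # torch.Size([8, 16, 32, 32])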

To my surprise, however, most of the LLMs I tried were not able to solve it. I kept my prompt simple:

The following code contains an encoder and a decoder. Currently, the decoder outputs a BxCx32x32 tensor. Change the decoder so it returns a BxCx28x28 tensor. You are only allowed to change the __init__ method of the CNNDecoder.

<code>

Some models tried to be sneaky and changed the decoder's forward method as well, so I had to add the restriction that only the __init__ method could be altered. Each model got three tries, with no hints or redos after answering. The results are in the table below; the code solutions produced by the various models can be found on my GitHub.
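Grading an attempt only requires checking the output shape of the modified decoder; something along these lines is enough (illustrative, not necessarily the exact check I ran):

out = CNNDecoder()(torch.randn(4, 20))   # CNNDecoder as returned by the model under test
print(out.shape)                         # a correct answer prints torch.Size([4, 16, 28, 28])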

| Model | Shape | Thinking time | Notes |
| --- | --- | --- | --- |
| deepseek-r1-distill-qwen-32B | BxCx28x28 | 184 seconds | - |
| deepseek-r1 | BxCx28x28 | 112 seconds | - |
| deepseek-v3 | BxCx31x31 | - | - |
| gemini-1.5-pro-002 | BxCx25x25 | - | - |
| GPT-4o | BxCx31x31 | - | - |
| GPT-o3-mini | BxCx28x28 | 36 seconds | Second try |
| llama-3.1-405B-instruct-turbo | BxCx28x28 | - | - |
| llama-3.3-70B-instruct | BxCx32x32 | - | - |
| mixtral-8x22B | BxCx31x31 | - | - |
| qwen-2.5-72B-instruct | BxCx31x31 | - | - |
| sonar-reasoning-pro | Syntax error | 170 seconds | Went completely off the rails during its reasoning (spammed emojis and suddenly switched to Japanese) |
| sonar-pro | BxCx29x29 | - | - |

As the table above shows, only four of the 12 (roughly) SOTA models I tested arrived at the correct solution. Notably, all of them are reasoning models, with the exception of Llama-3.1-405B. If you ask me, a problem like this should not be so difficult that only the most recent reasoning models and a 405B-parameter model can solve it.
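For reference, here is one possible fix that touches only __init__ (a sketch of one valid solution, not necessarily the one the models were expected to find): append a 5x5 convolution with no padding to the end of self.net, which crops the 32x32 output down to 28x28.

# In CNNDecoder.__init__, only self.net changes:
self.net = nn.Sequential(
    nn.ConvTranspose2d(2 * c_hid, 2 * c_hid, kernel_size=3, output_padding=1, padding=1, stride=2), # 4x4 => 8x8
    nn.GELU(),
    nn.Conv2d(2 * c_hid, 2 * c_hid, kernel_size=3, padding=1),
    nn.GELU(),
    nn.ConvTranspose2d(2 * c_hid, c_hid, kernel_size=3, output_padding=1, padding=1, stride=2), # 8x8 => 16x16
    nn.GELU(),
    nn.Conv2d(c_hid, c_hid, kernel_size=3, padding=1),
    nn.GELU(),
    nn.ConvTranspose2d(c_hid, num_input_channels, kernel_size=3, output_padding=1, padding=1, stride=2), # 16x16 => 32x32
    nn.Conv2d(num_input_channels, num_input_channels, kernel_size=5), # 32x32 => 28x28 (32 - 5 + 1 = 28)
)

Adjusting the padding of the final transposed convolution should work as well (with padding=3: (16 - 1)*2 - 2*3 + 3 + 1 = 28), without adding an extra layer.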