"No Grad Set" error while training in PyTorch

Question

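While training my model I hit a runtime error: RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn. It is raised in the training loop below. As far as I understand, a Sequential model should already have gradient tracking set up, yet the message says no gradient can be found.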

"""Training"""
Epochs = 100

for epoch in range(Epochs):
    model.train()

    train_logits = model(X_train)
    # Converts the logits straight to probabilities and class labels; should compute only the logits and then apply the loss function
    train_preds_probs = torch.softmax(train_logits, dim=1).argmax(dim=1).type(torch.float32)
    loss = loss_fn(train_preds_probs, y_train)  # goes wrong because the tensor passed in already holds predicted classes
    train_accu = accuracy(y_train, train_preds_probs)
    print(train_preds_probs)
    optimiser.zero_grad()

    loss.backward()  # tries to compute the gradient of the loss, but fails because of the input above

    optimiser.step()

    # Meant to be the evaluation step, but the loss below mistakenly uses 'y_train' as labels
    model.eval()
    with torch.inference_mode():
        test_logits = model(X_test)
        # Likewise, the predictions are converted to classes too early here
        test_preds = torch.softmax(test_logits.type(torch.float32), dim=1).argmax(dim=1)
        test_loss = loss_fn(test_preds, y_train)  # should use 'y_test'
        test_acc = accuracy(y_test, test_preds)

    if epoch % 10 == 0:
        print(f'Epoch:{epoch} | Train loss: {loss} | Training acc:{train_accu} | Test Loss: {test_loss} | Test accu: {test_acc}')

I tried searching online for this problem but could not find a solution.

Any help would be greatly appreciated!

Vlad from Moscow · Answer

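Sometimes this error appears because part of your code is wrapped in a with torch.no_grad(): block, so the tensors produced inside it are not tracked by autograd.

I would suggest going through the functions in your code to see whether that is the case.
You may have checked already, but it is a good place to start debugging!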
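A quick way to run that check is to look at requires_grad and grad_fn on the tensor you are about to call backward() on. The sketch below is only an illustration: the model and x names are hypothetical stand-ins, not the asker's actual model or data.

```python
import torch

# Hypothetical stand-in model and input, for illustration only.
model = torch.nn.Linear(4, 3)
x = torch.randn(8, 4)

with torch.no_grad():
    frozen_out = model(x)   # computed without recording an autograd graph

tracked_out = model(x)      # computed normally, graph is recorded

# A tensor that can be backpropagated through has requires_grad=True and a grad_fn.
print(frozen_out.requires_grad, frozen_out.grad_fn)
print(tracked_out.requires_grad, tracked_out.grad_fn is not None)

# Calling backward() on the untracked result reproduces the error from the question.
try:
    frozen_out.sum().backward()
except RuntimeError as err:
    no_grad_error = str(err)
    print(no_grad_error)
```

If the tensor feeding your loss prints requires_grad=False, work backwards from there to find the operation or context manager that detached it.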

Colin Hebert · Answer

The error comes from this line:

train_preds_probs = torch.softmax(train_logits, dim=1).argmax(dim=1).type(torch.float32)

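argmax is not differentiable: it returns integer class indices, so the gradient chain is cut at that point and backpropagation becomes impossible. Casting the result back to float32 does not restore it. During training you should pass the model's differentiable output to the loss function (the softmax output, or the raw logits if the loss function expects them) without applying argmax.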
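The effect can be seen in isolation with a few lines. This is a minimal sketch in which logits is a hypothetical stand-in for the model output, created with requires_grad=True so that it behaves like a tensor produced by a trainable model.

```python
import torch

# Hypothetical stand-in for the model output.
logits = torch.randn(5, 3, requires_grad=True)

probs = torch.softmax(logits, dim=1)        # still part of the autograd graph
classes = probs.argmax(dim=1)               # integer class indices, not differentiable
classes_f = classes.type(torch.float32)     # the cast does not reconnect the graph

print(probs.requires_grad)       # softmax output is tracked
print(classes.dtype)             # argmax yields an integer tensor
print(classes_f.requires_grad)   # no longer tracked
print(classes_f.grad_fn)         # nothing to backpropagate through
```

Any loss computed from classes_f inherits requires_grad=False, which is what backward() complains about.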

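The fix is:

Before computing the loss, keep the model output in its continuous form and do not convert it with argmax, because gradient computation needs continuous, differentiable operations. The loss function works directly on those values. Note that PyTorch's nn.CrossEntropyLoss expects raw logits and applies log-softmax internally, so if that is your loss_fn, pass train_logits to it directly rather than the softmax output. Use argmax only for computing accuracy.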
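Put together, a corrected loop might look like the sketch below. It rests on several assumptions the question does not confirm: that loss_fn is nn.CrossEntropyLoss, that the labels are integer class indices, and that accuracy compares labels with predicted classes. The data, model, optimiser settings, and the accuracy helper are all hypothetical stand-ins.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical data and model; the question does not show these.
X_train, y_train = torch.randn(64, 4), torch.randint(0, 3, (64,))
X_test, y_test = torch.randn(16, 4), torch.randint(0, 3, (16,))
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3))
loss_fn = nn.CrossEntropyLoss()   # assumed; takes raw logits and integer labels
optimiser = torch.optim.SGD(model.parameters(), lr=0.1)

def accuracy(y_true, y_pred):
    # Stand-in for the asker's helper: fraction of correct predictions.
    return (y_true == y_pred).float().mean().item()

Epochs = 100
for epoch in range(Epochs):
    model.train()
    train_logits = model(X_train)
    loss = loss_fn(train_logits, y_train)       # differentiable path: logits -> loss
    train_preds = train_logits.argmax(dim=1)    # argmax only for the metric
    train_accu = accuracy(y_train, train_preds)

    optimiser.zero_grad()
    loss.backward()
    optimiser.step()

    model.eval()
    with torch.inference_mode():
        test_logits = model(X_test)
        test_loss = loss_fn(test_logits, y_test)   # test labels, not y_train
        test_acc = accuracy(y_test, test_logits.argmax(dim=1))

    if epoch % 10 == 0:
        print(f"Epoch:{epoch} | Train loss: {loss.item():.4f} | Training acc:{train_accu:.2f} "
              f"| Test Loss: {test_loss.item():.4f} | Test accu: {test_acc:.2f}")
```

The softmax call is dropped entirely here because argmax over logits gives the same classes as argmax over softmax probabilities, and the assumed loss applies log-softmax itself.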