
Returning a tensor in training_step leads to a list of dictionaries in training_epoch_end #9968

Closed
BramVanroy opened this issue Oct 16, 2021 · 10 comments · Fixed by #16520
Labels
feature (Is an improvement or enhancement), loops (Related to the Loop API), working as intended (Working as intended)

Comments

@BramVanroy

BramVanroy commented Oct 16, 2021

When I run the following code in my model

    def training_step(self, batch, batch_idx):
        loss = self._step(batch, batch_idx)
        print(type(loss))
        self.log("train_loss", loss, sync_dist=True)
        return loss

    def training_epoch_end(self, outputs):
        print(outputs)
        self.log("train_ppl", self.calculate_ppl(outputs), sync_dist=True)

I see that the training step returns a tensor, so I would expect the aggregated values passed to training_epoch_end to be a list of tensors. But, as this code shows, they are not. Instead, outputs is a list of dictionaries that looks like this:

[{'loss': tensor(4.1520, device='cuda:0')}, {'loss': tensor(4.1750, device='cuda:0')}]

Annoyingly, the expected behavior does occur in the validation and testing loops. So I am not sure whether I am simply doing something wrong, whether training_step deliberately collects dictionaries for a specific reason, or whether this is actually a bug.

PL version: 1.4.9
Torch: 1.9.1+cu111

cc @Borda @carmocca @justusschock @ananthsub @ninginthecloud

@BramVanroy added the bug (Something isn't working) and help wanted (Open to be worked on) labels Oct 16, 2021
@Programmer-RD-AI
Contributor

Hi,

Could you send a reproducible script?

I will try to look into the issue.

With best regards,
Ranuga

@carmocca
Contributor

Hi @BramVanroy

This is working as expected. The returns are standardized to a dictionary because you can also return a dictionary with multiple keys:

def training_step(...):
    ...
    return {'loss': ..., 'jiraffe': "hello"}

And we do special handling of the loss keyword tensor for automatic optimization (calling backward etc).

What you are asking is sensible, but it would require breaking backwards compatibility or going through a complex deprecation path.
For now, you can quickly convert the list of dicts to a list of tensors with outputs = [d["loss"] for d in outputs].
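A minimal sketch of that conversion, usable in both training and eval epoch-end hooks (normalize_outputs is a hypothetical helper, not part of the Lightning API; the loss values are made up to match the shapes described in this issue):

```python
import torch

def normalize_outputs(outputs):
    # Hypothetical helper: accepts either the list of {"loss": tensor} dicts
    # that training_epoch_end receives, or the plain list of tensors that
    # validation_epoch_end / test_epoch_end receive.
    return [o["loss"] if isinstance(o, dict) else o for o in outputs]

# The shape Lightning passes to training_epoch_end, per this issue:
outputs = [{"loss": torch.tensor(4.1520)}, {"loss": torch.tensor(4.1750)}]
losses = normalize_outputs(outputs)
mean_loss = torch.stack(losses).mean()  # mean loss over the epoch
```

With this, a single epoch-end implementation can be shared between the training and evaluation hooks.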

cc @awaelchli in case you want to add or correct something

@carmocca added the feature (Is an improvement or enhancement) and working as intended labels and removed the bug (Something isn't working) and help wanted (Open to be worked on) labels Oct 17, 2021
@BramVanroy
Author

BramVanroy commented Oct 17, 2021

@carmocca That is indeed my current approach. However, it seems that this conversion to a dict is only done for the training step, not for the validation and test steps. As a user, I would expect the same behavior in all of them, because that would allow me to write a single, general *_epoch_end. Say I want to calculate perplexity at the end of every epoch in all scenarios; now I need to write something like this:

    def training_step(self, batch, batch_idx):
        loss = self._step(batch, batch_idx)
        self.log("train_loss", loss, sync_dist=True)
        return loss

    def training_epoch_end(self, outputs):
        self.log("train_ppl", self.calculate_ppl(outputs), sync_dist=True)

    def validation_step(self, batch, batch_idx, dataloader_idx=0):
        loss = self._step(batch, batch_idx)
        self.log("val_loss", loss, sync_dist=True)
        return loss

    def validation_epoch_end(self, outputs: List[torch.Tensor]):
        self.log("val_ppl", self.calculate_ppl(outputs), sync_dist=True)

    @staticmethod
    def calculate_ppl(losses) -> float:
        if isinstance(losses[0], torch.Tensor):
            losses = torch.tensor([t.item() for t in losses])
        else:
            losses = torch.tensor([list(d.values())[0].item() for d in losses])
        return torch.exp(losses.mean()).item()

This reads very counter-intuitively: my return value in both steps is the same, so why do I need to handle two possible shapes of the aggregated outputs?
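As a quick self-contained check of that dual-shape handling (the helper is reproduced from the snippet above; the loss values are made up), both output forms yield the same perplexity:

```python
import torch

def calculate_ppl(losses) -> float:
    # Reproduced from the snippet above: handles both a list of loss tensors
    # (validation/test) and a list of {"loss": tensor} dicts (training).
    if isinstance(losses[0], torch.Tensor):
        losses = torch.tensor([t.item() for t in losses])
    else:
        losses = torch.tensor([list(d.values())[0].item() for d in losses])
    return torch.exp(losses.mean()).item()

as_tensors = [torch.tensor(4.1520), torch.tensor(4.1750)]
as_dicts = [{"loss": t} for t in as_tensors]
ppl_a = calculate_ppl(as_tensors)  # validation/test shape
ppl_b = calculate_ppl(as_dicts)    # training shape
```

The branching makes the helper work today, but it is exactly the boilerplate that a consistent outputs shape would remove.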

@carmocca
Contributor

As I said, your suggestion does make sense to me. Let's wait for somebody else's opinion before we approve the change for implementation.

@carmocca
Contributor

If you are feeling adventurous, you could install master and do the following:

import os
from dataclasses import dataclass

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.loops.optimization.optimizer_loop import ClosureResult


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return loss

    def training_epoch_end(self, outputs):
        print(outputs)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


@dataclass
class MyClosureResult(ClosureResult):
    was_dict: bool = False

    @classmethod
    def from_training_step_output(cls, training_step_output, *args, **kwargs):
        super_obj = ClosureResult.from_training_step_output(training_step_output, *args, **kwargs)
        obj = cls(super_obj.closure_loss, super_obj.extra, was_dict=isinstance(training_step_output, dict))
        return obj

    def asdict(self):
        if not self.was_dict:
            return self.loss
        return super().asdict()


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        enable_model_summary=False,
        enable_progress_bar=False,
    )

    trainer.fit_loop.epoch_loop.batch_loop.optimizer_loop.output_result_cls = MyClosureResult

    trainer.fit(model, train_dataloaders=train_data)


if __name__ == "__main__":
    run()

@stale

stale bot commented Nov 19, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!

@stale stale bot added the won't fix (This will not be worked on) label Nov 19, 2021
@carmocca added the loops (Related to the Loop API) label and removed the won't fix label Nov 19, 2021
@carmocca carmocca added this to the 1.6 milestone Nov 19, 2021
@carmocca carmocca self-assigned this Nov 19, 2021
@carmocca
Contributor

carmocca commented Dec 1, 2021

Friendly ping for thoughts here @justusschock @awaelchli

I opened a PoC in #10878 with the required changes

This would be a breaking change for those returning Tensors in training_step or training_step_end who also implement training_epoch_end and expect a dictionary input.
To avoid this, we would need a deprecation process similar to the one proposed in #9737.

Is the UX improvement of respecting the user output type worth the deprecation process?

If yes, we should do both this and #9737 together

@awaelchli
Contributor

I align with @carmocca. I think the inconsistency between the training hooks and eval hooks should be fixed. At the same time, it is expected to be very common that users implement the step + epoch_end combo, and potentially many LightningModules could break if we don't do a deprecation. So I'm in favor of doing that, if it is possible.

@BramVanroy
Author

Looking forward to any updates on this.

@carmocca
Contributor

The plan is to add deprecations and an opt-in mechanism for this and #9737 before the 1.6 release

@carmocca carmocca moved this to Todo in Frameworks Planning Feb 14, 2022
@carmocca carmocca moved this from Todo to In Progress in Frameworks Planning Mar 2, 2022
@Borda Borda modified the milestones: 1.6, 1.6.x Mar 21, 2022
@Borda Borda modified the milestones: 1.6, 1.7 Mar 21, 2022
@carmocca carmocca modified the milestones: pl:1.7, pl:1.8 Jul 25, 2022
@carmocca carmocca moved this from In Progress to Todo in Frameworks Planning Sep 18, 2022
@carmocca carmocca modified the milestones: v1.8, future Oct 13, 2022
@carmocca carmocca removed this from the future milestone Feb 1, 2023
@github-project-automation github-project-automation bot moved this from Todo to Done in Frameworks Planning Feb 6, 2023