
Returning a tensor in training_step leads to a list of dictionaries in training_epoch_end #9968

Closed
BramVanroy opened this issue Oct 16, 2021 · 10 comments · Fixed by #16520
Labels
feature (Is an improvement or enhancement), loops (Related to the Loop API), working as intended (Working as intended)

Comments

@BramVanroy

BramVanroy commented Oct 16, 2021

When I run the following code in my model

    def training_step(self, batch, batch_idx):
        loss = self._step(batch, batch_idx)
        print(type(loss))
        self.log("train_loss", loss, sync_dist=True)
        return loss

    def training_epoch_end(self, outputs):
        print(outputs)
        self.log("train_ppl", self.calculate_ppl(outputs), sync_dist=True)

I see that the training step returns a tensor, so I would expect the aggregated values passed to training_epoch_end to be a list of tensors. But, as this code shows, they are not. Instead, outputs is a list of dictionaries that looks like this:

[{'loss': tensor(4.1520, device='cuda:0')}, {'loss': tensor(4.1750, device='cuda:0')}]

Annoyingly, the expected behavior does occur in the validation and testing loops. So I am not sure whether I am simply doing something wrong, whether training_step deliberately collects dictionaries for a specific reason, or whether this is actually a bug.

PL version: 1.4.9
Torch: 1.9.1+cu111

cc @Borda @carmocca @justusschock @ananthsub @ninginthecloud

@BramVanroy added the bug (Something isn't working) and help wanted (Open to be worked on) labels Oct 16, 2021
@Programmer-RD-AI
Contributor

Hi,

Could you send a reproducible script?

I will try to look into the issue.

With best regards,
Ranuga

@carmocca
Contributor

Hi @BramVanroy

This is working as expected. The returns are standardized to a dictionary because you can also return a dictionary with multiple keys:

def training_step(...):
    ...
    return {'loss': ..., 'jiraffe': "hello"}

And we do special handling of the loss keyword tensor for automatic optimization (calling backward etc).

What you are asking is sensible, but it would require breaking backwards compatibility or going through a complex deprecation path.
For now, you can quickly convert the list of dicts to a list of tensors with outputs = [d["loss"] for d in outputs].
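A minimal sketch of that conversion, usable in both training and eval epoch-end hooks (normalize_outputs is a hypothetical helper, not part of the Lightning API; the loss values are made up to match the shapes described in this issue):

```python
import torch

def normalize_outputs(outputs):
    # Hypothetical helper: accepts either the list of {"loss": tensor} dicts
    # that training_epoch_end receives, or the plain list of tensors that
    # validation_epoch_end / test_epoch_end receive.
    return [o["loss"] if isinstance(o, dict) else o for o in outputs]

# The shape Lightning passes to training_epoch_end, per this issue:
outputs = [{"loss": torch.tensor(4.1520)}, {"loss": torch.tensor(4.1750)}]
losses = normalize_outputs(outputs)
mean_loss = torch.stack(losses).mean()  # mean loss over the epoch
```

With this, a single epoch-end implementation can be shared between the training and evaluation hooks.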

cc @awaelchli in case you want to add or correct something

@carmocca added the feature (Is an improvement or enhancement) and working as intended labels and removed the bug (Something isn't working) and help wanted (Open to be worked on) labels Oct 17, 2021
@BramVanroy
Author

BramVanroy commented Oct 17, 2021

@carmocca That is indeed my current approach. However, it seems that this conversion to a dict is only done for the training step, not for the validation and test steps. As a user, I would expect the same behavior in all of them, because that would allow me to write a single, general *_epoch_end. Say I want to calculate perplexity at the end of every epoch in all scenarios; now I need to write something like this:

    def training_step(self, batch, batch_idx):
        loss = self._step(batch, batch_idx)
        self.log("train_loss", loss, sync_dist=True)
        return loss

    def training_epoch_end(self, outputs):
        self.log("train_ppl", self.calculate_ppl(outputs), sync_dist=True)

    def validation_step(self, batch, batch_idx, dataloader_idx=0):
        loss = self._step(batch, batch_idx)
        self.log("val_loss", loss, sync_dist=True)
        return loss

    def validation_epoch_end(self, outputs: List[torch.Tensor]):
        self.log("val_ppl", self.calculate_ppl(outputs), sync_dist=True)

    @staticmethod
    def calculate_ppl(losses) -> float:
        if isinstance(losses[0], torch.Tensor):
            losses = torch.tensor([t.item() for t in losses])
        else:
            losses = torch.tensor([list(d.values())[0].item() for d in losses])
        return torch.exp(losses.mean()).item()

This reads very counter-intuitively: my return value in both steps is the same, so why do I need to handle two possible shapes of the aggregated outputs?
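As a quick self-contained check of that dual-shape handling (the helper is reproduced from the snippet above; the loss values are made up), both output forms yield the same perplexity:

```python
import torch

def calculate_ppl(losses) -> float:
    # Reproduced from the snippet above: handles both a list of loss tensors
    # (validation/test) and a list of {"loss": tensor} dicts (training).
    if isinstance(losses[0], torch.Tensor):
        losses = torch.tensor([t.item() for t in losses])
    else:
        losses = torch.tensor([list(d.values())[0].item() for d in losses])
    return torch.exp(losses.mean()).item()

as_tensors = [torch.tensor(4.1520), torch.tensor(4.1750)]
as_dicts = [{"loss": t} for t in as_tensors]
ppl_a = calculate_ppl(as_tensors)  # validation/test shape
ppl_b = calculate_ppl(as_dicts)    # training shape
```

The branching makes the helper work today, but it is exactly the boilerplate that a consistent outputs shape would remove.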

@carmocca
Contributor

As I said, your suggestion does make sense to me. Let's wait for somebody else's opinion before we approve the change for implementation.

@carmocca
Contributor

If you are feeling adventurous, you could install master and do the following:

import os
from dataclasses import dataclass

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.loops.optimization.optimizer_loop import ClosureResult


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return loss

    def training_epoch_end(self, outputs):
        print(outputs)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


@dataclass
class MyClosureResult(ClosureResult):
    was_dict: bool = False

    @classmethod
    def from_training_step_output(cls, training_step_output, *args, **kwargs):
        super_obj = ClosureResult.from_training_step_output(training_step_output, *args, **kwargs)
        obj = cls(super_obj.closure_loss, super_obj.extra, was_dict=isinstance(training_step_output, dict))
        return obj

    def asdict(self):
        if not self.was_dict:
            return self.loss
        return super().asdict()


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        enable_model_summary=False,
        enable_progress_bar=False,
    )

    trainer.fit_loop.epoch_loop.batch_loop.optimizer_loop.output_result_cls = MyClosureResult

    trainer.fit(model, train_dataloaders=train_data)


if __name__ == "__main__":
    run()

@stale

stale bot commented Nov 19, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!

@stale stale bot added the won't fix (This will not be worked on) label Nov 19, 2021
@carmocca added the loops (Related to the Loop API) label and removed the won't fix label Nov 19, 2021
@carmocca carmocca added this to the 1.6 milestone Nov 19, 2021
@carmocca carmocca self-assigned this Nov 19, 2021
@carmocca
Contributor

carmocca commented Dec 1, 2021

Friendly ping for thoughts here @justusschock @awaelchli

I opened a PoC in #10878 with the required changes

This would be a breaking change for those returning Tensors in training_step or training_step_end who also implement training_epoch_end and expect a dictionary input.
To avoid this, we would need a deprecation process similar to the one proposed in #9737.

Is the UX improvement of respecting the user output type worth the deprecation process?

If yes, we should do both this and #9737 together

@awaelchli
Contributor

I align with @carmocca. I think the inconsistency between the training hooks and eval hooks should be fixed. At the same time, it is expected to be very common that users implement the step + epoch_end combo, and potentially many LightningModules could break if we don't do a deprecation. So I'm in favor of doing that, if it is possible.

@BramVanroy
Author

Looking forward to any updates on this.

@carmocca
Contributor

The plan is to add deprecations and an opt-in mechanism for this and #9737 before the 1.6 release

@carmocca carmocca moved this to Todo in Frameworks Planning Feb 14, 2022
@carmocca carmocca moved this from Todo to In Progress in Frameworks Planning Mar 2, 2022
@Borda Borda modified the milestones: 1.6, 1.6.x Mar 21, 2022
@Borda Borda modified the milestones: 1.6, 1.7 Mar 21, 2022
@carmocca carmocca modified the milestones: pl:1.7, pl:1.8 Jul 25, 2022
@carmocca carmocca moved this from In Progress to Todo in Frameworks Planning Sep 18, 2022
@carmocca carmocca modified the milestones: v1.8, future Oct 13, 2022
@carmocca carmocca removed this from the future milestone Feb 1, 2023
@github-project-automation github-project-automation bot moved this from Todo to Done in Frameworks Planning Feb 6, 2023