Return a tensor in training_step leads to a list of dictionaries in training_epoch_end #9968
Hi, can you send a reproducible script? I will try to check the issue. With best regards,
Hi @BramVanroy. This is working as expected. The returns are standardized to a dictionary because you can also return a dictionary with multiple keys:

```python
def training_step(...):
    ...
    return {'loss': ..., 'jiraffe': "hello"}
```

And we do special handling of the `loss` key.

What you are asking is sensible, but it would require breaking backwards compatibility or doing a complex deprecation path.

cc @awaelchli in case you want to add or correct something
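For context, given the behavior described above, the aggregated outputs can be consumed like this today (a minimal sketch assuming a single optimizer, where each element of `outputs` arrives as a dict with a `loss` key):

```python
import torch
from pytorch_lightning import LightningModule


class SketchModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()  # placeholder loss computation
        return loss  # a plain tensor is returned here ...

    def training_epoch_end(self, outputs):
        # ... but each element of `outputs` is a dict such as {"loss": tensor(...)}
        epoch_loss = torch.stack([out["loss"] for out in outputs]).mean()
        self.log("train_loss_epoch", epoch_loss)
```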
@carmocca That is indeed my current approach. However, it seems that this conversion to a dict is only done for the training step and not for the validation and test steps. As a user, I would expect the same behavior, because that would allow me to write a single, generic epoch-end aggregation:

```python
def training_step(self, batch, batch_idx):
    loss = self._step(batch, batch_idx)
    self.log("train_loss", loss, sync_dist=True)
    return loss

def training_epoch_end(self, outputs):
    self.log("train_ppl", self.calculate_ppl(outputs), sync_dist=True)

def validation_step(self, batch, batch_idx, dataloader_idx=0):
    loss = self._step(batch, batch_idx)
    self.log("val_loss", loss, sync_dist=True)
    return loss

def validation_epoch_end(self, outputs: List[torch.Tensor]):
    self.log("val_ppl", self.calculate_ppl(outputs), sync_dist=True)

@staticmethod
def calculate_ppl(losses) -> float:
    # validation/test epoch_end receives a list of tensors,
    # but training_epoch_end receives a list of dicts
    if isinstance(losses[0], torch.Tensor):
        losses = torch.tensor([t.item() for t in losses])
    else:
        losses = torch.tensor([list(d.values())[0].item() for d in losses])
    return torch.exp(losses.mean()).item()
```

This reads very counter-intuitively: my return value for both steps is the same, so why do I need to handle two different scenarios when dealing with the aggregated outputs?
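A possible workaround to keep a single code path in the meantime (a sketch, assuming the dict key is always `loss` as in the outputs above) is to normalize the outputs before aggregating:

```python
from typing import Dict, List, Union

import torch


def extract_losses(outputs: List[Union[torch.Tensor, Dict[str, torch.Tensor]]]) -> torch.Tensor:
    # training_epoch_end currently receives dicts, while the eval hooks receive plain tensors
    return torch.stack([out["loss"] if isinstance(out, dict) else out for out in outputs])


def calculate_ppl(outputs) -> float:
    return torch.exp(extract_losses(outputs).mean()).item()
```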
As I said, your suggestion does make sense to me. Let's wait for somebody else's opinion before we approve the change for implementation.
If you are feeling adventurous, you could install master and do the following:

```python
import os
from dataclasses import dataclass

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.loops.optimization.optimizer_loop import ClosureResult


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return loss

    def training_epoch_end(self, outputs):
        print(outputs)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


@dataclass
class MyClosureResult(ClosureResult):
    was_dict: bool = False

    @classmethod
    def from_training_step_output(cls, training_step_output, *args, **kwargs):
        super_obj = ClosureResult.from_training_step_output(training_step_output, *args, **kwargs)
        obj = cls(super_obj.closure_loss, super_obj.extra, was_dict=isinstance(training_step_output, dict))
        return obj

    def asdict(self):
        if not self.was_dict:
            return self.loss
        return super().asdict()


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        enable_model_summary=False,
        enable_progress_bar=False,
    )
    trainer.fit_loop.epoch_loop.batch_loop.optimizer_loop.output_result_cls = MyClosureResult
    trainer.fit(model, train_dataloaders=train_data)


if __name__ == "__main__":
    run()
```
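If this works on your installation of master, `training_epoch_end` should then print a list of plain tensors rather than dictionaries, because `MyClosureResult.asdict` returns the bare loss whenever the step output was not a dict. Note that `output_result_cls` and `ClosureResult` are internals, so this hook point may change between releases.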
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!
Friendly ping for thoughts here @justusschock @awaelchli.

I opened a PoC in #10878 with the required changes. This would be a breaking change for those returning a plain tensor and relying on the current dictionary outputs in `training_epoch_end`.

Is the UX improvement of respecting the user's output type worth the deprecation process? If yes, we should do both this and #9737 together.
I align with @carmocca. I think the inconsistency between the training hooks and eval hooks should be fixed. At the same time, it is expected to be very common that users implement the step + epoch_end combo, and potentially many LightningModules could break if we don't do a deprecation. So I'm in favor of doing that, if it is possible.
Looking forward to any updates on this.
The plan is to add deprecations and an opt-in mechanism for this and #9737 before the 1.6 release |
When I run the following code in my model, I see that the training step returns a tensor, so I would expect that the aggregated values passed to `training_epoch_end` are a list of tensors. But, as this code shows, they are not. Instead, it is a list of dictionaries that looks like this:

Annoyingly, the expected behavior does seem to occur in the validation and testing loops. So I am not sure whether I am simply doing something wrong, whether `training_step` forcefully collects dictionaries for a specific reason, or whether this is actually a bug.

PL version: 1.4.9
Torch: 1.9.1+cu111
cc @Borda @carmocca @justusschock @ananthsub @ninginthecloud
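For reference, a minimal module reproducing the reported behavior could look roughly like this (a hypothetical sketch, not the code originally attached to the issue):

```python
import os

import torch
from torch.utils.data import DataLoader
from pytorch_lightning import LightningModule, Trainer


class ReproModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        return self(batch).sum()  # returns a plain tensor

    def training_epoch_end(self, outputs):
        print(type(outputs[0]))  # per the report: a dict, not a torch.Tensor

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    trainer = Trainer(default_root_dir=os.getcwd(), max_epochs=1, limit_train_batches=2)
    trainer.fit(ReproModel(), DataLoader(torch.randn(8, 32), batch_size=2))
```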