No .ckpt file #1

Open
YanzeZhang97 opened this issue Aug 6, 2024 · 15 comments

@YanzeZhang97

Dear authors,

Hope this message finds you well!

I ran your code for the didactic training with wandb. However, after the training finished (i.e., training_didactic.py completed), I could see the log directory but no .ckpt file. The file tree is shown below. Could you please give me some guidance?

image
Thanks,
Max
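
A quick way to confirm whether any checkpoint was written anywhere under the log directory is to search it recursively. This is only a minimal sketch: the helper name `find_checkpoints` and the `logs` path are placeholders, not names taken from the repository, so substitute the directory that the run actually created.

```python
from pathlib import Path
from typing import List


def find_checkpoints(log_dir: str) -> List[Path]:
    """Return every *.ckpt file found under log_dir, sorted by path."""
    root = Path(log_dir)
    if not root.is_dir():
        # A missing directory simply means there is nothing to report.
        return []
    return sorted(p for p in root.rglob("*.ckpt") if p.is_file())


if __name__ == "__main__":
    # "logs" is a hypothetical location; replace it with the real log dir.
    for ckpt in find_checkpoints("logs"):
        print(ckpt)
```

An empty result means no checkpoint exists in any subfolder, which rules out the file simply having been written to an unexpected location.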

@hadar-hai

Dear authors,
I’ve also tried everything and still no .ckpt file is being created. We would really appreciate your help.

Thanks,
Hadar

@hadar-hai

Hi,
It could be connected to this change:
- def training_epoch_end(self, outputs: List[dict]) -> None:
+ def on_train_epoch_end(self) -> None:

Here they talk about it:
Lightning-AI/pytorch-lightning#16520

-    def training_epoch_end(self, outputs):
-        epoch_average = torch.stack([output["loss"] for output in outputs]).mean()
+    def on_train_epoch_end(self):
+        epoch_average = torch.stack(self.training_step_outputs).mean()
         self.log("training_epoch_average", epoch_average)
+        self.training_step_outputs.clear()  # free memory

Thanks,
Hadar
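
To make the pattern in the diff above concrete, here is a framework-free sketch of the newer style, in which the module collects its own per-step losses instead of receiving them as an `outputs` argument. The class `EpochAverageSketch` and its `logged` dict are hypothetical stand-ins for illustration only. A real module would subclass `LightningModule`, average tensors with `torch.stack(...).mean()`, and call `self.log`.

```python
from typing import Dict, List


class EpochAverageSketch:
    """Plain-Python stand-in for a LightningModule (hypothetical)."""

    def __init__(self) -> None:
        self.training_step_outputs: List[float] = []
        self.logged: Dict[str, float] = {}  # stands in for self.log(...)

    def training_step(self, loss: float) -> float:
        # The module itself must remember each step's loss.
        self.training_step_outputs.append(loss)
        return loss

    def on_train_epoch_end(self) -> None:
        # No `outputs` argument: read from the stored list instead.
        if not self.training_step_outputs:
            return
        total = sum(self.training_step_outputs)
        epoch_average = total / len(self.training_step_outputs)
        self.logged["training_epoch_average"] = epoch_average
        self.training_step_outputs.clear()  # free memory


module = EpochAverageSketch()
for step_loss in (1.0, 2.0, 3.0):
    module.training_step(step_loss)
module.on_train_epoch_end()
print(module.logged)
```

Clearing the list at the end of each epoch matters: without it, losses from earlier epochs would leak into later averages and memory use would grow over the run.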

@HarukiNishimura-TRI

Hi @YanzeZhang97 @hadar-hai, thank you for bringing this issue to our attention. Can you try downgrading lightning to v1.8.6 (released Dec 21, 2022) and see if the issue persists? We have not run the code ourselves for a while, and apparently there has been a major version change to lightning since we released the code, which might have caused this issue. Since the code is no longer under active development, we would greatly appreciate your contribution to either determine appropriate versions of the dependencies or update the code accordingly.
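
One way to catch this kind of mismatch early is to check the major version before training starts. The sketch below assumes, as this thread suggests, that 1.x releases work with the code and 2.x releases may not. The helper name `uses_legacy_epoch_hooks` is hypothetical and not part of the repository or of lightning.

```python
def uses_legacy_epoch_hooks(version: str) -> bool:
    """Return True if the version string has a major version below 2."""
    major = int(version.split(".")[0])
    return major < 2


for candidate in ("1.7.7", "1.8.6", "2.0.0"):
    print(candidate, uses_legacy_epoch_hooks(candidate))
```

A training script could call such a check at startup and stop with a clear message, rather than finishing without a checkpoint.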

@hadar-hai

@YanzeZhang97 Did Haruki's response help you?

@YanzeZhang97
Author

I used pytorch-lightning v1.7.7 and successfully got the .ckpt file, but the reason it fails with newer versions is still not clear.

@hadar-hai

@YanzeZhang97 Could you share your pip list? Also, did you clone the latest version of the code from the repo and run training_didactic.py as is, or did you make any changes to the code?
Thank you very much!
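
When comparing environments, a short report of only the relevant packages can be easier to read than a full pip list. A minimal sketch using the standard library's `importlib.metadata` (available from Python 3.8); the helper name `report_versions` and the choice of packages are assumptions made for illustration.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def report_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # The package list is a guess at what matters for this issue.
    relevant = ["pytorch-lightning", "lightning", "torch", "torchmetrics", "wandb"]
    for name, version in report_versions(relevant).items():
        print(f"{name}: {version or 'not installed'}")
```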

@YanzeZhang97
Author

absl-py 0.15.0
actionlib 1.14.0
addict 2.4.0
aiobotocore 2.13.1
aiofiles 23.2.1
aiohttp 3.9.5
aioitertools 0.11.0
aiosignal 1.2.0
altair 5.3.0
angles 1.9.13
annotated-types 0.6.0
anyio 4.3.0
argon2-cffi 23.1.0
argon2-cffi-bindings 21.2.0
arrow 1.3.0
astor 0.8.1
asttokens 2.2.1
astunparse 1.6.3
async-lru 2.0.4
async-timeout 4.0.3
attrs 23.2.0
Babel 2.15.0
backcall 0.2.0
base_local_planner 1.17.3
beautifulsoup4 4.12.3
bleach 6.1.0
blessed 1.20.0
blinker 1.6.2
bondpy 1.8.6
boto3 1.34.131
botocore 1.34.131
Brotli 1.0.9
cachetools 5.3.3
camera-calibration 1.17.0
camera-calibration-parsers 1.12.0
casadi 3.6.4
catkin 0.8.10
certifi 2021.5.30
cffi 1.16.0
charset-normalizer 3.3.2
clang 5.0
click 8.1.7
cloudpickle 1.6.0
cmake 3.26.4
comm 0.2.2
contourpy 1.1.1
controller-manager 0.20.0
controller-manager-msgs 0.20.0
croniter 1.3.15
cryptography 42.0.5
cv-bridge 1.16.2
cvxopt 1.3.2
cycler 0.12.1
debtcollector 2.5.0
debugpy 1.8.1
deepbots 1.0.0
deepdiff 7.0.1
diagnostic-analysis 1.11.0
diagnostic-common-diagnostics 1.11.0
diagnostic-updater 1.11.0
dnspython 2.6.1
do-mpc 4.6.4
docker-pycreds 0.4.0
dynamic-reconfigure 1.7.3
editor 1.6.6
einops 0.8.0
email_validator 2.1.1
exceptiongroup 1.2.1
executing 1.2.0
Farama-Notifications 0.0.4
fastapi 0.111.0
fastapi-cli 0.0.3
fastjsonschema 2.19.1
ffmpy 0.3.2
filelock 3.12.2
fire 0.6.0
flatbuffers 1.12
fonttools 4.40.0
frozenlist 1.4.0
fsspec 2024.3.1
gast 0.4.0
gazebo_plugins 2.9.2
gazebo_ros 2.9.2
gencpp 0.7.0
geneus 3.0.0
genlisp 0.4.18
genmsg 0.6.0
gennodejs 2.0.2
genpy 0.6.15
gitdb 4.0.11
GitPython 3.1.43
google-auth 2.29.0
google-auth-oauthlib 1.0.0
google-pasta 0.2.0
gradio 4.29.0
gradio_client 0.16.1
grpcio 1.62.2
gym 0.21.0
h11 0.14.0
h5py 3.1.0
highway-env 1.8.2
httpcore 1.0.5
httptools 0.6.1
httpx 0.27.0
huggingface-hub 0.23.0
idna 3.7
image-geometry 1.16.2
imageio 2.21.1
importlib-metadata 7.0.1
importlib_resources 6.4.0
inquirer 3.3.0
interactive-markers 1.12.0
ipykernel 6.29.4
ipython 8.12.0
itsdangerous 2.2.0
jedi 0.18.2
Jinja2 3.1.2
jmespath 1.0.1
joblib 1.2.0
joint-state-publisher 1.15.1
joint-state-publisher-gui 1.15.1
json5 0.9.25
jsonschema 4.22.0
jsonschema-specifications 2023.12.1
jupyter_client 8.6.2
jupyter_core 5.7.2
jupyter-events 0.10.0
jupyter-lsp 2.2.5
jupyter_server 2.14.0
jupyter_server_terminals 0.5.3
jupyterlab 4.2.1
jupyterlab_pygments 0.3.0
jupyterlab_server 2.27.2
keras 2.12.0
Keras-Preprocessing 1.1.2
kiwisolver 1.3.1
laser_geometry 1.6.7
lightning 1.8.6
lightning-cloud 0.5.70
lightning-lite 1.8.6
lightning-utilities 0.11.6
lit 16.0.6
Markdown 3.4.1
markdown-it-py 3.0.0
MarkupSafe 2.1.3
matplotlib 3.3.4
matplotlib-inline 0.1.6
mdurl 0.1.2
message-filters 1.16.0
mistune 3.0.2
mkl-fft 1.3.8
mkl-random 1.2.4
mkl-service 2.4.0
mmcv-full 1.7.2
mmengine 0.10.4
mpmath 1.3.0
multidict 6.0.4
nbclient 0.10.0
nbconvert 7.16.4
nbformat 5.10.4
nest-asyncio 1.6.0
netaddr 0.8.0
networkx 3.1
notebook 7.2.0
notebook_shim 0.2.4
numpy 1.19.5
nvidia-cublas-cu11 11.10.3.66
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu11 8.5.0.96
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu11 10.9.0.58
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu11 10.2.10.91
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu11 11.7.4.91
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu11 2.14.3
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.6.20
nvidia-nvtx-cu11 11.7.91
nvidia-nvtx-cu12 12.1.105
oauthlib 3.2.2
onnx 1.16.2
opencv-python 4.6.0.66
opt-einsum 3.3.0
ordered-set 4.1.0
orjson 3.10.3
oslo.config 9.0.0
oslo.i18n 5.1.0
osqp 0.6.5
overrides 7.7.0
packaging 22.0
pandas 1.3.0
pandocfilters 1.5.1
parso 0.8.3
pathlib 1.0.1
pbr 0.11.1
pickleshare 0.7.5
pillow 10.4.0
pip 24.0
pkgutil_resolve_name 1.3.10
platformdirs 4.2.1
plotly 5.23.0
pooch 1.7.0
prettytable 3.5.0
prometheus_client 0.20.0
prompt-toolkit 3.0.38
protobuf 3.20.3
psutil 6.0.0
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 17.0.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycparser 2.22
pydantic 1.10.2
pyDeprecate 0.3.2
pydub 0.25.1
pygame 2.5.2
pyglet 1.5.15
Pygments 2.15.1
PyJWT 2.8.0
pyOpenSSL 24.0.0
pyparsing 2.4.7
Pyro4 4.82
PySocks 1.7.1
python-dateutil 2.8.2
python-dotenv 1.0.1
python-json-logger 2.0.7
python-multipart 0.0.9
python-qt-binding 0.4.4
python-version 0.0.2
pytorch-lightning 1.7.7
pytz 2021.1
PyVirtualDisplay 3.0
PyWavelets 1.4.1
PyYAML 6.0.1
pyzabbix 1.2.1
pyzmq 24.0.1
qdldl 0.1.7.post0
qpsolvers 4.3.1
qt-dotgraph 0.4.2
qt-gui 0.4.2
qt-gui-cpp 0.4.2
qt-gui-py-common 0.4.2
readchar 4.1.0
referencing 0.35.1
requests 2.32.2
requests-oauthlib 2.0.0
resource_retriever 1.12.7
rfc3339-validator 0.1.4
rfc3986 2.0.0
rfc3986-validator 0.1.1
rich 13.7.1
rosbag 1.16.0
rosboost-cfg 1.15.8
rosclean 1.15.8
roscreate 1.15.8
rosgraph 1.16.0
roslaunch 1.16.0
roslib 1.15.8
roslint 0.12.0
roslz4 1.16.0
rosmake 1.15.8
rosmaster 1.16.0
rosmsg 1.16.0
rosnode 1.16.0
rosparam 1.16.0
rospy 1.16.0
rosserial_python 0.9.2
rosservice 1.16.0
rostest 1.16.0
rostopic 1.16.0
rosunit 1.15.8
roswtf 1.16.0
rpds-py 0.18.1
rqt_action 0.4.9
rqt_bag 0.5.1
rqt_bag_plugins 0.5.1
rqt-console 0.4.12
rqt_dep 0.4.12
rqt_graph 0.4.14
rqt_gui 0.5.3
rqt_gui_py 0.5.3
rqt-image-view 0.4.17
rqt_launch 0.4.9
rqt-logger-level 0.4.12
rqt-moveit 0.5.11
rqt_msg 0.4.10
rqt_nav_view 0.5.7
rqt_plot 0.4.13
rqt_pose_view 0.5.11
rqt_publisher 0.4.10
rqt_py_common 0.5.3
rqt_py_console 0.4.10
rqt-reconfigure 0.5.5
rqt-robot-dashboard 0.5.8
rqt-robot-monitor 0.5.15
rqt_robot_steering 0.5.12
rqt-runtime-monitor 0.5.10
rqt-rviz 0.7.0
rqt_service_caller 0.4.10
rqt_shell 0.4.11
rqt_srv 0.4.9
rqt-tf-tree 0.6.4
rqt_top 0.4.10
rqt_topic 0.4.13
rqt_web 0.4.10
rsa 4.7.2
rtabmap-python 0.21.3
ruff 0.4.4
runs 1.2.2
rviz 1.14.20
s3fs 2024.3.1
s3transfer 0.10.2
scikit-image 0.19.3
scikit-learn 1.1.2
scipy 1.7.0
semantic-version 2.10.0
Send2Trash 1.8.3
sensor-msgs 1.13.1
sentry-sdk 2.12.0
serpent 1.41
setproctitle 1.3.3
setuptools 69.5.1
shapely 2.0.3
shellingham 1.5.4
six 1.15.0
smach 2.5.2
smach-ros 2.5.2
smclib 1.8.6
smmap 5.0.1
sniffio 1.3.1
soupsieve 2.5
stack-data 0.6.2
starlette 0.37.2
starsessions 1.3.0
stevedore 4.1.1
sympy 1.12
tenacity 8.5.0
tensorboard 2.12.3
tensorboard-data-server 0.7.1
tensorboard-plugin-wit 1.8.1
tensorboardX 2.6.1
tensorflow 2.4.1
tensorflow-estimator 2.12.0
termcolor 1.1.0
terminado 0.18.1
tf 1.13.2
tf-conversions 1.13.2
tf2-geometry-msgs 0.7.7
tf2-kdl 0.7.7
tf2-py 0.7.7
tf2-ros 0.7.7
threadpoolctl 3.1.0
tifffile 2023.4.12
tinycss2 1.3.0
tomli 2.0.1
tomlkit 0.12.0
toolz 0.12.1
topic-tools 1.16.0
torch 2.4.0
torchaudio 2.4.0
torchmetrics 0.11.4
torchvision 0.19.0
tornado 6.4
tqdm 4.66.4
traitlets 5.9.0
triton 3.0.0
turtlebot3_example 1.2.5
turtlebot3_teleop 1.2.5
typer 0.12.3
types-python-dateutil 2.9.0.20240316
typing_extensions 4.12.2
ujson 5.9.0
urllib3 1.26.19
uvicorn 0.29.0
uvloop 0.19.0
wandb 0.17.5
watchfiles 0.23.0
waymo-open-dataset-tf-2-6-0 1.4.9
wcwidth 0.2.5
websocket-client 1.8.0
websockets 11.0.3
Werkzeug 3.0.3
wheel 0.43.0
wrapt 1.12.1
xacro 1.14.17
xmod 1.8.1
yapf 0.40.2
yarl 1.9.3
zipp 3.17.0
zm 1.0
zmq 0.0.0

I just copied all the packages included in my conda env. I guess the best way is to downgrade pytorch-lightning and make everything else compatible with that version. I only modified some basic configuration, such as paths. Note that my copy of the code is not the latest version.

@hadar-hai

@YanzeZhang97 thank you very much! What do you mean by "the code version is not the latest"? So which version?

@hadar-hai

@YanzeZhang97 Is there a way to contact you? I’m having trouble getting a working version that produces .ckpt files. If you could send me the version you have, it would be very helpful. Also, what's your Python version?
Many thanks!

@YanzeZhang97
Author

@hadar-hai Sorry for the late reply. It is quite busy at the beginning of the new semester. The Python version is 3.8. As for the code, @HarukiNishimura-TRI would you mind pushing the old version as a new branch so that people can access both versions of the code? Thanks!

@hadar-hai

@YanzeZhang97 Thank you! Is it this version: e5fe65f ?
image

@HarukiNishimura-TRI

HarukiNishimura-TRI commented Aug 21, 2024

@YanzeZhang97 Thank you for trying it out. It is great to hear that downgrading the lightning version resolved the issue for you. @jmercat has made a few commits lately, trying to resolve some of the issues by making changes to our code. I wonder which commit your local changes are based off of. Is it d363fde or e5fe65f?
(I am guessing it's the latter, because otherwise you would not be able to even import lightning, due to the name change of the package from pytorch_lightning to lightning.pytorch.)
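
The package rename mentioned above is easy to check for in a given environment. A minimal sketch that reports which of the two import paths is available; the helper name `find_lightning_module` is hypothetical, and `importlib.util.find_spec` is used so that nothing beyond the parent package is imported during the check.

```python
import importlib.util
from typing import Optional, Sequence


def find_lightning_module(
    candidates: Sequence[str] = ("lightning.pytorch", "pytorch_lightning"),
) -> Optional[str]:
    """Return the first importable module name from candidates, else None."""
    for name in candidates:
        try:
            if importlib.util.find_spec(name) is not None:
                return name
        except ImportError:
            # find_spec imports the parent package of a dotted name and
            # raises if that parent is missing or fails to import.
            continue
    return None


if __name__ == "__main__":
    print(find_lightning_module())
```

Code written for the old name could then fail with a clear message when only the new import path is present, instead of raising an ImportError deep inside the training script.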

@YanzeZhang97
Author

Hello @hadar-hai and @HarukiNishimura-TRI,
Yes, e5fe65f is the version I ran successfully. And yes, the package is pytorch-lightning.

@hadar-hai

hadar-hai commented Aug 24, 2024

Hello @YanzeZhang97, did you change "mmcv" to "mmengine.config"?
If you could kindly send me the working version by mail ([email protected]) it would be greatly appreciated.
I created the same conda environment as you, used Python 3.8 and e5fe65f as is, and there are still some problems.

@jmercat
Collaborator

jmercat commented Aug 25, 2024

Hello @hadar-hai and @YanzeZhang97.
I looked into how to install the correct versions of everything instead of trying to update the code to the new versions of pytorch-lightning (which was a mess, sorry about that). I will never use pytorch-lightning again.
For this to work, I pushed a rollback of my previous attempt and a new install.sh script that should install the correct packages and hopefully run correctly. It does not work on multiple GPUs, but I could run a simple training on one GPU in a fresh environment with the new installation script. I hope this helps.
Thanks for the interest in our work. I’m sorry I don’t have much time to maintain this code base. Your contributions are welcome.
