
Slow gc.collect() on close() #489

Closed
capyvara opened this issue Nov 4, 2022 · 4 comments · Fixed by #490
Labels
enhancement New feature or request

Comments

capyvara commented Nov 4, 2022

Especially when reading many small files, the time spent in gc.collect() is greater than the time spent decompressing the data.

For example, with 500 files of ~90 KB each (~600 KB uncompressed), LZMA-compressed, it can only read ~12 files per second:

  0%|▏                | 500/472075 [00:41<10:45:57, 12.17it/s]
         2249416 function calls (2217648 primitive calls) in 44.815 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    883/1    0.003    0.000   44.816   44.816 {built-in method builtins.exec}
...
      512    0.001    0.000   36.190    0.071 py7zr.py:408(__exit__)
      512    0.003    0.000   36.189    0.071 py7zr.py:1081(close)
      512    0.005    0.000   36.167    0.071 py7zr.py:816(_var_release)
      512   36.161    0.071   36.161    0.071 {built-in method gc.collect}
      512    0.001    0.000    3.510    0.007 py7zr.py:970(readall)
      512    0.017    0.000    3.508    0.007 py7zr.py:524(_extract)
      512    0.003    0.000    3.363    0.007 py7zr.py:1203(extract)
      512    0.002    0.000    3.283    0.006 py7zr.py:1271(extract_single)
      512    0.042    0.000    3.280    0.006 py7zr.py:1298(_extract_single)
      502    0.063    0.000    3.223    0.006 py7zr.py:1377(decompress)
     1292    0.037    0.000    3.071    0.002 compressor.py:681(decompress)
     1292    0.006    0.000    2.993    0.002 compressor.py:652(_decompress)
     1292    0.002    0.000    2.987    0.002 compressor.py:559(decompress)
     1292    2.985    0.002    2.985    0.002 {method 'decompress' of '_lzma.LZMADecompressor' objects}
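A stdlib-only sketch of the pattern behind these numbers (illustrative names, not py7zr's actual implementation): close() ends up in _var_release(), which drops references to the archive's internal state and then forces a full collection.

```python
import gc

# Illustrative stand-in for py7zr's close() path, as suggested by the
# profile above: releasing the references is cheap, the forced full
# collection is not.
class ArchiveLike:
    def __init__(self):
        self.header = object()
        self.buffers = [bytearray(90 * 1024) for _ in range(8)]

    def close(self):
        self._var_release()

    def _var_release(self):
        # Drop references so the buffers can be reclaimed...
        del self.buffers
        del self.header
        # ...then force a full GC. In the profile this single call
        # accounts for 36.161s of the 44.815s total across 512 closes.
        gc.collect()

arc = ArchiveLike()
arc.close()
```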

Commenting out the gc.collect() in _var_release() (ref) speeds it up to ~138 files per second, more than 10x faster:

  0%|▏                    | 500/472075 [00:03<56:33, 138.97it/s]
         1988045 function calls (1956277 primitive calls) in 7.341 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    883/1    0.003    0.000    7.341    7.341 {built-in method builtins.exec}
...
      512    0.001    0.000    2.473    0.005 py7zr.py:970(readall)
      512    0.011    0.000    2.472    0.005 py7zr.py:524(_extract)
      512    0.002    0.000    2.356    0.005 py7zr.py:1203(extract)
      5/1    0.034    0.007    2.329    2.329 {method 'to_pandas' of 'pyarrow.lib._PandasConvertible' objects}
        1    0.004    0.004    2.329    2.329 pandas_compat.py:797(table_to_blockmanager)
      512    0.001    0.000    2.300    0.004 py7zr.py:1271(extract_single)
      512    0.011    0.000    2.298    0.004 py7zr.py:1298(_extract_single)
      502    0.014    0.000    2.274    0.005 py7zr.py:1377(decompress)
        1    0.000    0.000    2.210    2.210 pandas_compat.py:1165(_table_to_blocks)
        1    2.209    2.209    2.209    2.209 {pyarrow.lib.table_to_blocks}
     1292    0.024    0.000    2.185    0.002 compressor.py:681(decompress)
     1292    0.004    0.000    2.134    0.002 compressor.py:652(_decompress)
     1292    0.001    0.000    2.130    0.002 compressor.py:559(decompress)

IMHO it should not be the library's responsibility to force a manual GC; the user should manage their own memory, and the library should only ensure it does not leak memory itself.

I see a commit from about a year ago that added this, but I'm not sure what case it was trying to solve.
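A minimal demonstration (assumed setup, not from the issue itself) of why the cost lands on the user: a full gc.collect() walks every tracked object, so its duration is set by the caller's live heap, not by anything the library allocated.

```python
import gc
import time

def time_collect():
    # Time one full collection across all generations.
    start = time.perf_counter()
    gc.collect()
    return time.perf_counter() - start

empty_heap = time_collect()

# Stand-in for a large application heap (e.g. a big pandas DataFrame);
# lists are always GC-tracked, so the collector must visit each one.
ballast = [[i] for i in range(1_000_000)]

big_heap = time_collect()
print(f"collect with small heap: {empty_heap:.4f}s, "
      f"with 1M live objects: {big_heap:.4f}s")
```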

Environment (please complete the following information):

  • OS: macOS 12.5
  • Python 3.9.13
  • py7zr version: 0.20.0
miurahr (Owner) commented Nov 5, 2022

Good catch!

miurahr added a commit that referenced this issue Nov 5, 2022
- Resolve #489

Signed-off-by: Hiroshi Miura <[email protected]>
miurahr (Owner) commented Nov 5, 2022

Strange... there is no big difference in the benchmark scores...
#297 (comment)

Is anything wrong?

miurahr (Owner) commented Nov 5, 2022

Here is the benchmark test code. @capyvara, could you improve it? I think its target data is smaller than in your scenario.

https://github.com/miurahr/py7zr/blob/master/tests/test_benchmark.py

miurahr added the enhancement (New feature or request) label Nov 5, 2022
capyvara (Author) commented Nov 7, 2022

@miurahr I'm not sure we can test this automatically, because gc.collect() performance depends on the environment. In my case I had a 500k-row pandas DataFrame loaded; maybe the benchmark should allocate a huge number of dummy objects?

Testing inside a Jupyter notebook was even worse: it read something like 1 item/s.
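A hypothetical tweak along those lines (illustrative names, not from tests/test_benchmark.py): pre-allocate dummy objects as ballast so that a per-close gc.collect() actually shows up in the measured time.

```python
import gc
import time

# Dummy live heap standing in for the user's loaded DataFrame; lists
# stay GC-tracked, so every full collection must traverse them.
ballast = [[i] for i in range(500_000)]

def simulate_reads(n, collect_per_close):
    # Simulate extracting n small members, optionally forcing a full
    # collection after each one (what close() did before the fix).
    start = time.perf_counter()
    for _ in range(n):
        member = bytearray(90 * 1024)  # stand-in for one decompressed file
        del member
        if collect_per_close:
            gc.collect()
    return time.perf_counter() - start

print("with per-close collect:   ", simulate_reads(50, True))
print("without per-close collect:", simulate_reads(50, False))
```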

miurahr added a commit that referenced this issue Nov 14, 2022
- Resolve #489

Signed-off-by: Hiroshi Miura <[email protected]>