
Slow gc.collect() on close() #489

Closed
capyvara opened this issue Nov 4, 2022 · 4 comments · Fixed by #490
Labels
enhancement New feature or request

Comments

capyvara commented Nov 4, 2022

Especially when reading many small files, the time spent in gc.collect() is greater than the time spent decompressing the data.

For example, with 500 files of ~90 KB each (~600 KB uncompressed), LZMA-compressed, it can only read ~12 files per second:

  0%|▏                | 500/472075 [00:41<10:45:57, 12.17it/s]
         2249416 function calls (2217648 primitive calls) in 44.815 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    883/1    0.003    0.000   44.816   44.816 {built-in method builtins.exec}
...
      512    0.001    0.000   36.190    0.071 py7zr.py:408(__exit__)
      512    0.003    0.000   36.189    0.071 py7zr.py:1081(close)
      512    0.005    0.000   36.167    0.071 py7zr.py:816(_var_release)
      512   36.161    0.071   36.161    0.071 {built-in method gc.collect}
      512    0.001    0.000    3.510    0.007 py7zr.py:970(readall)
      512    0.017    0.000    3.508    0.007 py7zr.py:524(_extract)
      512    0.003    0.000    3.363    0.007 py7zr.py:1203(extract)
      512    0.002    0.000    3.283    0.006 py7zr.py:1271(extract_single)
      512    0.042    0.000    3.280    0.006 py7zr.py:1298(_extract_single)
      502    0.063    0.000    3.223    0.006 py7zr.py:1377(decompress)
     1292    0.037    0.000    3.071    0.002 compressor.py:681(decompress)
     1292    0.006    0.000    2.993    0.002 compressor.py:652(_decompress)
     1292    0.002    0.000    2.987    0.002 compressor.py:559(decompress)
     1292    2.985    0.002    2.985    0.002 {method 'decompress' of '_lzma.LZMADecompressor' objects}
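A stdlib-only sketch of the pattern behind these numbers (illustrative names, not py7zr's actual implementation): close() ends up in _var_release(), which drops references to the archive's internal state and then forces a full collection.

```python
import gc

# Illustrative stand-in for py7zr's close() path, as suggested by the
# profile above: releasing the references is cheap, the forced full
# collection is not.
class ArchiveLike:
    def __init__(self):
        self.header = object()
        self.buffers = [bytearray(90 * 1024) for _ in range(8)]

    def close(self):
        self._var_release()

    def _var_release(self):
        # Drop references so the buffers can be reclaimed...
        del self.buffers
        del self.header
        # ...then force a full GC. In the profile this single call
        # accounts for 36.161s of the 44.815s total across 512 closes.
        gc.collect()

arc = ArchiveLike()
arc.close()
```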

Commenting out the gc.collect() in _var_release() (ref) speeds it up to ~138 files per second, more than 10x faster:

  0%|▏                    | 500/472075 [00:03<56:33, 138.97it/s]
         1988045 function calls (1956277 primitive calls) in 7.341 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    883/1    0.003    0.000    7.341    7.341 {built-in method builtins.exec}
...
      512    0.001    0.000    2.473    0.005 py7zr.py:970(readall)
      512    0.011    0.000    2.472    0.005 py7zr.py:524(_extract)
      512    0.002    0.000    2.356    0.005 py7zr.py:1203(extract)
      5/1    0.034    0.007    2.329    2.329 {method 'to_pandas' of 'pyarrow.lib._PandasConvertible' objects}
        1    0.004    0.004    2.329    2.329 pandas_compat.py:797(table_to_blockmanager)
      512    0.001    0.000    2.300    0.004 py7zr.py:1271(extract_single)
      512    0.011    0.000    2.298    0.004 py7zr.py:1298(_extract_single)
      502    0.014    0.000    2.274    0.005 py7zr.py:1377(decompress)
        1    0.000    0.000    2.210    2.210 pandas_compat.py:1165(_table_to_blocks)
        1    2.209    2.209    2.209    2.209 {pyarrow.lib.table_to_blocks}
     1292    0.024    0.000    2.185    0.002 compressor.py:681(decompress)
     1292    0.004    0.000    2.134    0.002 compressor.py:652(_decompress)
     1292    0.001    0.000    2.130    0.002 compressor.py:559(decompress)

IMHO it should not be the library's responsibility to force a manual GC; the user should manage their own memory, and the library should only ensure it does not leak memory itself.

I see a commit from about a year ago that added this, but I'm not sure what case it was trying to solve.
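A minimal demonstration (assumed setup, not from the issue itself) of why the cost lands on the user: a full gc.collect() walks every tracked object, so its duration is set by the caller's live heap, not by anything the library allocated.

```python
import gc
import time

def time_collect():
    # Time one full collection across all generations.
    start = time.perf_counter()
    gc.collect()
    return time.perf_counter() - start

empty_heap = time_collect()

# Stand-in for a large application heap (e.g. a big pandas DataFrame);
# lists are always GC-tracked, so the collector must visit each one.
ballast = [[i] for i in range(1_000_000)]

big_heap = time_collect()
print(f"collect with small heap: {empty_heap:.4f}s, "
      f"with 1M live objects: {big_heap:.4f}s")
```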

Environment (please complete the following information):

  • OS: macOS 12.5
  • Python 3.9.13
  • py7zr version: 0.20.0
miurahr (Owner) commented Nov 5, 2022

Good catch!

miurahr added a commit that referenced this issue Nov 5, 2022
- Resolve #489

Signed-off-by: Hiroshi Miura <[email protected]>
miurahr (Owner) commented Nov 5, 2022

Strange... there is no big difference in the benchmark scores...
#297 (comment)

Is anything wrong?

miurahr (Owner) commented Nov 5, 2022

Here is the benchmark test code. @capyvara, could you improve it? I think its target data is smaller than in your scenario.

https://github.com/miurahr/py7zr/blob/master/tests/test_benchmark.py

miurahr added the enhancement (New feature or request) label Nov 5, 2022
capyvara (Author) commented Nov 7, 2022

@miurahr I'm not sure we can test this automatically, because gc.collect() performance depends on the environment. In my case I had a 500k-row pandas DataFrame loaded; maybe the benchmark should allocate a huge number of dummy objects?

Testing inside a Jupyter notebook was even worse: it read something like 1 item/s.
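A hypothetical tweak along those lines (illustrative names, not from tests/test_benchmark.py): pre-allocate dummy objects as ballast so that a per-close gc.collect() actually shows up in the measured time.

```python
import gc
import time

# Dummy live heap standing in for the user's loaded DataFrame; lists
# stay GC-tracked, so every full collection must traverse them.
ballast = [[i] for i in range(500_000)]

def simulate_reads(n, collect_per_close):
    # Simulate extracting n small members, optionally forcing a full
    # collection after each one (what close() did before the fix).
    start = time.perf_counter()
    for _ in range(n):
        member = bytearray(90 * 1024)  # stand-in for one decompressed file
        del member
        if collect_per_close:
            gc.collect()
    return time.perf_counter() - start

print("with per-close collect:   ", simulate_reads(50, True))
print("without per-close collect:", simulate_reads(50, False))
```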

miurahr added a commit that referenced this issue Nov 14, 2022
- Resolve #489

Signed-off-by: Hiroshi Miura <[email protected]>