whisper : reduce ggml_context usage #2525
Conversation
The number of contexts is likely an issue in llama.cpp as well, especially on systems with multiple GPUs. The model loader, cache, LoRAs, and control vectors all use a different context for each buffer type. It may be better to remove the limit entirely and allocate contexts dynamically, perhaps with an option to reset a context if allocations during inference are absolutely unacceptable.
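For illustration, a minimal sketch of that idea (heap-allocated contexts plus an explicit reset), using hypothetical simplified types rather than ggml's actual internals:

```c
#include <stdlib.h>

// hypothetical, simplified stand-in for ggml's internal context struct
struct my_ctx {
    size_t mem_size;
    void * mem_buffer;
    size_t n_objects; // objects/tensors carved out of mem_buffer so far
};

// allocate the context on the heap instead of taking a slot from a
// static pool, so there is no fixed limit on the number of live contexts
struct my_ctx * my_ctx_init(size_t mem_size) {
    struct my_ctx * ctx = malloc(sizeof(struct my_ctx));
    if (ctx == NULL) {
        return NULL;
    }
    ctx->mem_buffer = malloc(mem_size);
    if (ctx->mem_buffer == NULL) {
        free(ctx);
        return NULL;
    }
    ctx->mem_size  = mem_size;
    ctx->n_objects = 0;
    return ctx;
}

// reset keeps the buffer but forgets all objects, so a caller that cannot
// afford allocations during inference can reuse the same context
void my_ctx_reset(struct my_ctx * ctx) {
    ctx->n_objects = 0;
}

void my_ctx_free(struct my_ctx * ctx) {
    if (ctx != NULL) {
        free(ctx->mem_buffer);
        free(ctx);
    }
}
```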
@slaren What do you think about this approach? I'm not sure that avoiding these allocations during inference is really that important, so if this looks overly complicated, I'm open to switching completely to dynamically allocated contexts and simplifying this logic.
I wouldn't be surprised if …
Force-pushed from 2d9c313 to 987f314
ggml/src/ggml.c (outdated)

    struct ggml_context * ctx = GGML_ALIGNED_MALLOC(sizeof(struct ggml_context));
`GGML_ALIGNED_MALLOC` is likely completely overkill here; I don't think there is a good reason to require more than the alignment of `malloc`. Additionally, `GGML_ALIGNED_MALLOC` now uses `vm_allocate` on macOS, and this may actually be much slower than `malloc`, since it requests pages from the OS.
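A hedged sketch of the suggested simplification, with a hypothetical stand-in struct since the real `struct ggml_context` is private to ggml.c:

```c
#include <stdbool.h>
#include <stdlib.h>

// hypothetical stand-in; the real struct lives inside ggml.c
struct ctx_sketch {
    size_t mem_size;
    void * mem_buffer;
    bool   no_alloc;
};

struct ctx_sketch * ctx_sketch_new(void) {
    // before: GGML_ALIGNED_MALLOC(sizeof(struct ggml_context)), which on
    // macOS goes through vm_allocate and pays for a kernel page request
    // after: plain malloc, whose result is already suitably aligned for
    // any standard object type, including a small struct like this one
    return malloc(sizeof(struct ctx_sketch));
}
```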
This may be tricky to do correctly when freeing, but it may also make sense to use `malloc` for the buffer of `no_alloc` contexts, since these should be small and only need to allocate tensors. It doesn't matter too much, however, since applications can provide their own buffer. At some point we will probably want to remove support for allocating tensors on a `ggml_context`, or just allocate the tensor data dynamically, and that would be a good moment to implement this.
On a side note related to this, I was planning to implement support for automatically growing the buffers of …
Yes, these sound like good improvements. Allocating tensor data in the context is not really useful, and applications that really need it can implement their own memory pool if necessary.
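On the "applications can provide their own buffer" point above, a usage-side sketch with the public ggml API (the 256-tensor capacity is an arbitrary choice for the example):

```c
#include "ggml.h"

#include <stdlib.h>

int main(void) {
    // a no_alloc context stores only tensor metadata, not tensor data,
    // so a small malloc'd buffer sized by the per-tensor overhead suffices
    const size_t buf_size = 256 * ggml_tensor_overhead();
    void * buf = malloc(buf_size);

    struct ggml_init_params params = {
        /*.mem_size   =*/ buf_size,
        /*.mem_buffer =*/ buf,  // caller-provided buffer, not allocated by ggml
        /*.no_alloc   =*/ true, // tensor data lives in backend buffers instead
    };

    struct ggml_context * ctx = ggml_init(params);

    // ... build tensors and graphs with ctx ...

    ggml_free(ctx); // does not free buf - the application owns it
    free(buf);

    return 0;
}
```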
* whisper : reduce ggml_context usage
* ggml : allocate contexts on the heap (v2)
* ggml : aligned malloc -> malloc
ref #2521

No need to retain the `ggml_context` instances for the KV caches. With this change, each `whisper_state` now uses 3 fewer contexts. It still uses 4 though - one for each of the 4 schedulers.

Also, `ggml_init` will now allocate the contexts on the heap instead of using a static pool. Added `ggml_reset(ctx)` for resetting existing contexts.
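A short sketch of how the new reset entry point might be used, assuming the signature `void ggml_reset(struct ggml_context * ctx)` implied by the description above:

```c
#include "ggml.h"

void run_many_times(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,  // let ggml allocate the buffer on the heap
        /*.no_alloc   =*/ false,
    };

    struct ggml_context * ctx = ggml_init(params);

    for (int i = 0; i < 10; ++i) {
        // ... build a graph in ctx and evaluate it ...

        // instead of ggml_free + ggml_init on every iteration, reuse the
        // same context (and its buffer) by resetting it
        ggml_reset(ctx);
    }

    ggml_free(ctx);
}
```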