whisper : reduce ggml_context usage #2525

Merged: 3 commits merged into master on Oct 31, 2024

Conversation

ggerganov (Owner) commented on Oct 30, 2024:

ref #2521

No need to retain the ggml_context instances for the KV caches. With this change, each whisper_state now uses 3 fewer contexts. It still uses 4, though: one for each of the 4 schedulers.

Also, ggml_init will now allocate the contexts on the heap instead of using a static pool. Added ggml_reset(ctx) for resetting existing contexts.
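
As a minimal sketch (not code from this PR), assuming the usual ggml API and the new ggml_reset() described above: a caller can build tensor metadata in a context, reset the context to reuse it, and free it when it is no longer needed, instead of keeping one context alive per cache.

```c
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 8*ggml_tensor_overhead(),
        /*.mem_buffer =*/ NULL,  // ggml allocates the buffer (now on the heap)
        /*.no_alloc   =*/ true,  // metadata only, no tensor data in the context
    };

    struct ggml_context * ctx = ggml_init(params);

    for (int i = 0; i < 4; ++i) {
        // create some tensor metadata; the data would live in a backend buffer
        struct ggml_tensor * k = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 512);
        struct ggml_tensor * v = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 512);
        (void) k; (void) v;

        ggml_reset(ctx);         // reuse the same context on the next iteration
    }

    ggml_free(ctx);
    return 0;
}
```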

slaren (Collaborator) commented on Oct 30, 2024:

The number of contexts is likely an issue in llama.cpp as well, especially on systems with multiple GPUs. The model loader, cache, loras, and control vectors all use a different context for each buffer type. It may be better to remove the limit entirely and allocate them dynamically, maybe with an option to reset the context if allocations during inference are absolutely unacceptable.

ggerganov (Owner, Author) commented:

@slaren What do you think about 2d9c313 (#2525)? ggml_init will now use the statically allocated contexts, and when they run out, it will start allocating on the heap.

> maybe with an option to reset the context if allocations during inference are absolutely unacceptable.

I'm not sure that avoiding these allocations during inference is really that important, so if this looks overly complicated, I'm open to switching completely to dynamically allocated contexts and simplifying this logic.
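
As a rough illustration of that approach (hypothetical names, not the code from 2d9c313): allocate from a fixed static pool while slots are available, and fall back to the heap once the pool is exhausted.

```c
#include <stdbool.h>
#include <stdlib.h>

#define POOL_SIZE 64

struct my_context { bool used; bool from_heap; /* ... */ };

static struct my_context g_pool[POOL_SIZE];

static struct my_context * context_alloc(void) {
    // try the static pool first
    for (int i = 0; i < POOL_SIZE; ++i) {
        if (!g_pool[i].used) {
            g_pool[i].used      = true;
            g_pool[i].from_heap = false;
            return &g_pool[i];
        }
    }
    // pool exhausted: fall back to the heap
    struct my_context * ctx = calloc(1, sizeof(*ctx));
    if (ctx) {
        ctx->used      = true;
        ctx->from_heap = true;
    }
    return ctx;
}

static void context_free(struct my_context * ctx) {
    if (ctx->from_heap) {
        free(ctx);
    } else {
        ctx->used = false; // return the slot to the pool
    }
}
```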

slaren (Collaborator) commented on Oct 31, 2024:

I wouldn't be surprised if malloc is faster in most cases than what ggml_init does, iterating over a list to find a free slot (here is a test). It's a very small allocation, so malloc will likely find a free block very quickly, without having to request more memory from the OS. So I think it would be better to remove the old logic entirely, but it's not very important either way.

ggerganov requested a review from slaren on October 31, 2024, 19:31

slaren (Collaborator) commented on ggml/src/ggml.c (Outdated):

    struct ggml_context * ctx = GGML_ALIGNED_MALLOC(sizeof(struct ggml_context));

GGML_ALIGNED_MALLOC is likely complete overkill here; I don't think there is a good reason to require more than the alignment of malloc. Additionally, GGML_ALIGNED_MALLOC now uses vm_allocate on macOS, and this may actually be much slower than malloc, since it requests pages from the OS.
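
For reference, the direction suggested here (and what the later "aligned malloc -> malloc" commit in this PR appears to adopt) would look roughly like the sketch below; the struct name is a stand-in for struct ggml_context, and the point is simply that malloc's alignment is sufficient and the matching release is a plain free().

```c
#include <stdlib.h>

struct ggml_context_like {   // stand-in for struct ggml_context
    size_t mem_size;
    void * mem_buffer;
    // ...
};

int main(void) {
    // plain malloc already returns memory suitably aligned for an ordinary
    // struct, and avoids page-granular allocators such as vm_allocate on macOS
    struct ggml_context_like * ctx = malloc(sizeof(*ctx));
    if (ctx == NULL) {
        return 1;
    }
    // ... initialize and use the context ...
    free(ctx);               // matching plain free, no aligned free needed
    return 0;
}
```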

slaren (Collaborator) left a review comment:

This may be tricky to do correctly when freeing, but it may also make sense to use malloc for the buffer of no_alloc contexts, since these should be small and only need to allocate tensors. It doesn't matter too much, however, since applications can provide their own buffer. At some point we will probably want to remove support for allocating tensors on a ggml_context, or just allocate the tensor data dynamically, and that would be a good moment to implement this.
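
A minimal sketch of the no_alloc pattern being discussed, assuming the CPU backend and the standard ggml-alloc/ggml-backend headers (sizes are illustrative): the context buffer only has to hold tensor metadata, while the tensor data is placed in a backend buffer.

```c
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

int main(void) {
    const int n_tensors = 2;

    struct ggml_init_params params = {
        /*.mem_size   =*/ n_tensors*ggml_tensor_overhead(),  // metadata only
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,                              // no data in the context
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1024);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1024);
    (void) a; (void) b;

    // tensor data lives in a backend buffer, not in the ggml_context
    ggml_backend_t        backend = ggml_backend_cpu_init();
    ggml_backend_buffer_t buf     = ggml_backend_alloc_ctx_tensors(ctx, backend);

    // ... use the tensors ...

    ggml_backend_buffer_free(buf);
    ggml_backend_free(backend);
    ggml_free(ctx);
    return 0;
}
```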

slaren (Collaborator) commented on Oct 31, 2024:

On a side note related to this, I was planning to implement support for automatically growing the buffers of a ggml_context, so that if there is not enough space to create a tensor, it allocates a new buffer instead of failing. I think that would improve the usability of ggml, especially for newcomers or when experimenting.
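
A hypothetical sketch of that growth strategy (this is not ggml code): keep a chain of buffers and, when the current one cannot satisfy an allocation, add a new buffer instead of failing.

```c
#include <stddef.h>
#include <stdlib.h>

struct arena_buffer {
    struct arena_buffer * next;
    size_t size;
    size_t used;
    unsigned char data[];     // allocation space follows the header
};

struct arena {
    struct arena_buffer * head;   // most recently added buffer
    size_t default_size;
};

static void * arena_alloc(struct arena * a, size_t n) {
    // real code would also round n up to a suitable alignment
    if (a->head == NULL || a->head->used + n > a->head->size) {
        // not enough space: grow by adding a new buffer
        size_t size = n > a->default_size ? n : a->default_size;
        struct arena_buffer * buf = malloc(sizeof(*buf) + size);
        if (buf == NULL) {
            return NULL;
        }
        buf->next = a->head;
        buf->size = size;
        buf->used = 0;
        a->head   = buf;
    }
    void * ptr = a->head->data + a->head->used;
    a->head->used += n;
    return ptr;
}

static void arena_free(struct arena * a) {
    struct arena_buffer * buf = a->head;
    while (buf) {
        struct arena_buffer * next = buf->next;
        free(buf);
        buf = next;
    }
    a->head = NULL;
}
```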

ggerganov (Owner, Author) commented:

Yes, these sound like good improvements. Allocating tensor data in the context is not really useful and applications that really need it can implement their own memory pool if necessary.

ggerganov merged commit aa037a6 into master on Oct 31, 2024
89 checks passed
lyapple2008 pushed a commit to lyapple2008/whisper.cpp.mars that referenced this pull request Nov 2, 2024
* whisper : reduce ggml_context usage

* ggml : allocate contexts on the heap (v2)

* ggml : aligned malloc -> malloc
adutilleul pushed a commit to adutilleul/whisper.cpp that referenced this pull request Nov 19, 2024
* whisper : reduce ggml_context usage

* ggml : allocate contexts on the heap (v2)

* ggml : aligned malloc -> malloc