whisper : reduce ggml_context usage #2525
Conversation
The number of contexts is likely an issue in llama.cpp as well, especially on systems with multiple GPUs. The model loader, cache, LoRAs, and control vectors all use a different context for each buffer type. It may be better to remove the limit entirely and allocate contexts dynamically, perhaps with an option to reset a context if allocations during inference are absolutely unacceptable.
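For illustration, a minimal sketch of that idea (heap-allocated contexts plus an explicit reset), using hypothetical simplified types rather than ggml's actual internals:

```c
#include <stdlib.h>

// hypothetical, simplified stand-in for ggml's internal context struct
struct my_ctx {
    size_t mem_size;
    void * mem_buffer;
    size_t n_objects; // objects/tensors carved out of mem_buffer so far
};

// allocate the context on the heap instead of taking a slot from a
// static pool, so there is no fixed limit on the number of live contexts
struct my_ctx * my_ctx_init(size_t mem_size) {
    struct my_ctx * ctx = malloc(sizeof(struct my_ctx));
    if (ctx == NULL) {
        return NULL;
    }
    ctx->mem_buffer = malloc(mem_size);
    if (ctx->mem_buffer == NULL) {
        free(ctx);
        return NULL;
    }
    ctx->mem_size  = mem_size;
    ctx->n_objects = 0;
    return ctx;
}

// reset keeps the buffer but forgets all objects, so a caller that cannot
// afford allocations during inference can reuse the same context
void my_ctx_reset(struct my_ctx * ctx) {
    ctx->n_objects = 0;
}

void my_ctx_free(struct my_ctx * ctx) {
    if (ctx != NULL) {
        free(ctx->mem_buffer);
        free(ctx);
    }
}
```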
@slaren What do you think about this approach? I'm not sure that avoiding these allocations during inference is really that important, so if this looks overly complicated, I'm open to switching completely to dynamically allocated contexts and simplifying this logic.
I wouldn't be surprised if …
Force-pushed from 2d9c313 to 987f314
ggml/src/ggml.c (outdated)

    struct ggml_context * ctx = GGML_ALIGNED_MALLOC(sizeof(struct ggml_context));
`GGML_ALIGNED_MALLOC` is likely completely overkill here; I don't think there is a good reason to require more than the alignment of `malloc`. Additionally, `GGML_ALIGNED_MALLOC` now uses `vm_allocate` on macOS, and this may actually be much slower than `malloc`, since it requests pages from the OS.
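A hedged sketch of the suggested simplification, with a hypothetical stand-in struct since the real `struct ggml_context` is private to ggml.c:

```c
#include <stdbool.h>
#include <stdlib.h>

// hypothetical stand-in; the real struct lives inside ggml.c
struct ctx_sketch {
    size_t mem_size;
    void * mem_buffer;
    bool   no_alloc;
};

struct ctx_sketch * ctx_sketch_new(void) {
    // before: GGML_ALIGNED_MALLOC(sizeof(struct ggml_context)), which on
    // macOS goes through vm_allocate and pays for a kernel page request
    // after: plain malloc, whose result is already suitably aligned for
    // any standard object type, including a small struct like this one
    return malloc(sizeof(struct ctx_sketch));
}
```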
This may be tricky to do correctly when freeing, but it may also make sense to use `malloc` for the buffer of `no_alloc` contexts, since these should be small and only need to allocate tensors. It doesn't matter too much, however, since applications can provide their own buffer. At some point we will probably want to remove support for allocating tensors on a `ggml_context`, or just allocate the tensor data dynamically, and that would be a good moment to implement this.
On a side note related to this, I was planning to implement support for automatically growing the buffers of …
Yes, these sound like good improvements. Allocating tensor data in the context is not really useful, and applications that really need it can implement their own memory pool if necessary.
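On the "applications can provide their own buffer" point above, a usage-side sketch with the public ggml API (the 256-tensor capacity is an arbitrary choice for the example):

```c
#include "ggml.h"

#include <stdlib.h>

int main(void) {
    // a no_alloc context stores only tensor metadata, not tensor data,
    // so a small malloc'd buffer sized by the per-tensor overhead suffices
    const size_t buf_size = 256 * ggml_tensor_overhead();
    void * buf = malloc(buf_size);

    struct ggml_init_params params = {
        /*.mem_size   =*/ buf_size,
        /*.mem_buffer =*/ buf,  // caller-provided buffer, not allocated by ggml
        /*.no_alloc   =*/ true, // tensor data lives in backend buffers instead
    };

    struct ggml_context * ctx = ggml_init(params);

    // ... build tensors and graphs with ctx ...

    ggml_free(ctx); // does not free buf - the application owns it
    free(buf);

    return 0;
}
```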
* whisper : reduce ggml_context usage
* ggml : allocate contexts on the heap (v2)
* ggml : aligned malloc -> malloc
ref #2521

No need to retain the `ggml_context` instances for the KV caches. With this change, each `whisper_state` now uses 3 fewer contexts. It still uses 4 though - one for each of the 4 schedulers.

Also, `ggml_init` will now allocate the contexts on the heap instead of using a static pool. Added `ggml_reset(ctx)` for resetting existing contexts.
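A short sketch of how the new reset entry point might be used, assuming the signature `void ggml_reset(struct ggml_context * ctx)` implied by the description above:

```c
#include "ggml.h"

void run_many_times(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,  // let ggml allocate the buffer on the heap
        /*.no_alloc   =*/ false,
    };

    struct ggml_context * ctx = ggml_init(params);

    for (int i = 0; i < 10; ++i) {
        // ... build a graph in ctx and evaluate it ...

        // instead of ggml_free + ggml_init on every iteration, reuse the
        // same context (and its buffer) by resetting it
        ggml_reset(ctx);
    }

    ggml_free(ctx);
}
```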