sync : ggml #2608

ggerganov · 2024-12-05T12:31:12Z

TODO:

deprecate Makefile
update instructions
update CI
squash tail commits

* implemented argmax kernel * tpig -> tgpig * change to strides * contiguous assertions * kernel working and tested * argmax simd parallel implementation * added 2 new tests for argmax in test-backend-ops * cosmit * added 3 tests cases for perf eval * add test_argmax in make_test_cases_perf * Update test-backend-ops.cpp Co-authored-by: Diego Devesa <[email protected]> --------- Co-authored-by: Diego Devesa <[email protected]>

* vulkan: Use pipeline_robustness to disable robustness in mul_mat_vec. Add some early returns for nonexistent rows in mul_mat_vec shaders. These can only be hit when dispatching a 2D grid of workgroups. Fix the logic for the 2D grid of workgroups to round up. Enable the pipeline robustness extension if it's available, and use it to disable robustness for these pipelines. The instructions to do the bounds checking contend for the same ALU resources as the bit twiddling dequant instructions.
* vulkan: Add GLSL structure aliases for quant types to allow larger loads. In Vulkan it's not possible to cast pointer types, so instead you have to declare an aliased binding for the memory with a different type. This commit adds aliases for the quant formats using 16b ints, and in a few places where the struct size is a multiple of 4 also using 32b ints. Currently only q4_k's aliases are used, but others will be used in subsequent commits.
* vulkan: use larger loads in q5_k and q6_k shaders. Similar to the optimization I did in q4_k recently, this vectorizes some loads and reduces the number of bit twiddling instructions.
* vulkan: use larger K step per iteration in mul_mat_vec. Add vec4 dequantization functions, and use them to do K=8 per iteration in mul_mat_vec. This uses 16b loads for the quant values and 128b loads for B which helps reduce the load on the memory system. The K_PER_ITER==2 logic is still there, just for F16/F32, and really only because they support unaligned sizes. Tweak the num_iters/unrolling logic to be simpler and catch a couple missed unrolling opportunities.

* cuda : optimize argmax * remove unused parameter ggml-ci * fixup : use full warps ggml-ci * Apply suggestions from code review Co-authored-by: Johannes Gäßler <[email protected]> * fix ub * ggml : check ne00 <= INT32_MAX in argmax and argsort --------- Co-authored-by: Johannes Gäßler <[email protected]>

vulkan-shaders-gen did not parse the --no-clean argument correctly: the previous code handled only arguments that take a value, and --no-clean takes none. This commit makes the parser handle arguments that have no value.
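
A minimal sketch of the kind of parsing described above, for illustration only. `parse_args` and the `value_less` set are hypothetical names, not the actual vulkan-shaders-gen code, and the flag names in the usage note other than `--no-clean` are placeholders. The idea is simply that a flag known to take no value must not consume the next token.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper (not the real vulkan-shaders-gen parser).
// Flags listed in `value_less` are stored with an empty value;
// every other "--name" token consumes the following token as its value.
std::map<std::string, std::string> parse_args(const std::vector<std::string> & argv,
                                              const std::set<std::string> & value_less) {
    std::map<std::string, std::string> out;
    for (std::size_t i = 0; i < argv.size(); ++i) {
        const std::string & arg = argv[i];
        if (arg.rfind("--", 0) != 0) {
            continue; // not a flag: ignore stray tokens
        }
        if (value_less.count(arg) > 0) {
            out[arg] = ""; // flag without a value, e.g. --no-clean
        } else if (i + 1 < argv.size()) {
            out[arg] = argv[++i]; // flag with a value: take the next token
        }
        // a value-taking flag at the very end has no value and is dropped
    }
    return out;
}
```

Called with the tokens `--output-dir out --no-clean --target-hpp a.hpp` and `value_less = {"--no-clean"}`, the sketch records `--no-clean` with an empty value and still pairs the two other flags with their values.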

…a/10454) * improve inferencing performance for ascend npu. Co-authored-by: Frank Mai <thxCode@[email protected]> * some modification after review * some modifications after review * restore some modifications * restore some modifications --------- Co-authored-by: shanshan shen <[email protected]> Co-authored-by: Frank Mai <thxCode@[email protected]>

* cmake : enable warnings in llama ggml-ci * cmake : add llama_get_flags and respect LLAMA_FATAL_WARNINGS * cmake : get_flags -> ggml_get_flags * speculative-simple : fix warnings * cmake : reuse ggml_get_flags ggml-ci * speculative-simple : fix compile warning ggml-ci

There have been reports of failure to compile on systems with <= 32KB of shared memory (e.g. #10037). This change makes the large tile size fall back to a smaller size if necessary, and makes mul_mat_id fall back to CPU if there's only 16KB of shared memory.

* subgroup 64 version with subgroup add. 15% faster scalable version tested for subgroup sizes 16-128 * check for subgroup multiple of 16 and greater than 16 * subgroup sizes are always a power of 2 (KhronosGroup/GLSL#45) * force 16 sequential threads per block * make 16 subgroup size a constant

* ggml_pad_reflect_1d defined in header * implemented on CPU * called the forward pass * impl Metal kernel * added Metal kernel * added OP_PAD_REFLECT_1D in test-backend-ops.cpp * add test-pad-reflect-1d test case * test case support multiple backend
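
For readers unfamiliar with the operation, here is a small CPU reference for reflect padding of a single row. It is a sketch under assumptions, not the ggml kernel: the function below is a standalone illustration that only borrows the op's name, and it assumes the common convention in which the edge element is not repeated and each pad is smaller than the row length.

```cpp
#include <cassert>
#include <vector>

// Illustrative reference only (not the ggml implementation).
// Assumption: reflection does not repeat the edge element, so padding
// [1, 2, 3] by 2 on the left yields [3, 2, 1, 2, 3].
// p0 = left pad, p1 = right pad; both must be smaller than the row length.
std::vector<float> pad_reflect_1d(const std::vector<float> & src, int p0, int p1) {
    const int n = (int) src.size();
    assert(n > 0 && p0 >= 0 && p1 >= 0 && p0 < n && p1 < n);

    std::vector<float> dst(p0 + n + p1);
    for (int i = 0; i < (int) dst.size(); ++i) {
        int j = i - p0;                  // position relative to the source row
        if (j < 0) {
            j = -j;                      // mirror around the first element
        } else if (j >= n) {
            j = 2 * (n - 1) - j;         // mirror around the last element
        }
        dst[i] = src[j];
    }
    return dst;
}
```

With `src = {1, 2, 3, 4}`, `p0 = 2` and `p1 = 1`, the sketch returns `{3, 2, 1, 2, 3, 4, 3}`.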

…IDIA backend (llama/10584) * [SYCL] Move to Compile Time backend selection on oneMKL Interface for NVIDIA backend Move to compile time selection to backend to avoid latency at run time. Add it to all mkl gemm calls and only for NVIDIA backend. Signed-off-by: nscipione <[email protected]> * Formatting * Address PR comments to increase readability --------- Signed-off-by: nscipione <[email protected]>

JohannesGaessler and others added 30 commits December 5, 2024 14:29

ggml-opt: fix data corruption (ggml/1022)

62fd128

Do not include arm_neon.h when compiling CUDA code (ggml/1028)

3445025

metal : add GGML_OP_CONV_TRANSPOSE_1D kernels (ggml/1026)

2a2ed50

* wip * wip implementation f32 * kernel conv transpose 1d f32 working * initial commit

CUDA: remove unnecessary warp reduce in FA (ggml/1032)

2404fca

* kqmax_new_j is the same in every thread within a warp after the operation at line 199, so this reduce can be omitted * same problem in vec32 --------- Co-authored-by: ZhaoXiaoYu <[email protected]>

add cmake rvv support (llama/10411)

b1c6a66

vulkan: copy iq4_nl LUT into shared memory (llama/10409)

b4fa978

vulkan: predicate max operation in soft_max shaders/soft_max (llama/1…

176d689

…0437) Fixes #10434

CANN: Support Ascend310P to accelerate F32 and F16 Model (llama/10216)

761a3e8

* CANN Support Ascend310P to accelerate F32 and F16 Model * Add compile option soc type macro ASCEND_310P to ggml-cann lib * Remove unused code * Remove the ascend soc_type hard code compile option in CMakelist.txt

ggml : do not use ARM features not included in the build (llama/10457)

cd3456d

metal : minor code formatting

5be17d6

ggml : add support for dynamic loading of backends (llama/10469)

920a48a

* ggml : add support for dynamic loading of backends --------- Co-authored-by: Georgi Gerganov <[email protected]>

llama : accept a list of devices to use to offload a model (llama/10497)

640732f

* llama : accept a list of devices to use to offload a model * accept `--dev none` to completely disable offloading * fix dev list with dl backends * rename env parameter to LLAMA_ARG_DEVICE for consistency

metal : enable mat-vec kernels for bs <= 4 (llama/10491)

9229dd6

CANN: RoPE and CANCAT operator optimization (llama/10488)

c211ba4

Co-authored-by: noemotiovon <[email protected]>

ggml-cpu: cmake add arm64 cpu feature check for macos (llama/10487)

b054b83

* ggml-cpu: cmake add arm64 cpu feature check for macos * use vmmlaq_s32 for compile option i8mm check

vulkan: fix group_norm (llama/10496)

1e5fe6a

Fix bad calculation of the end of the range. Add a backend test that covers the bad case (taken from stable diffusion). Fixes leejet/stable-diffusion.cpp#439.

mtgpu: Add MUSA_DOCKER_ARCH in Dockerfiles && update cmake and make (…

5ec2241

…llama/10516) Signed-off-by: Xiaodong Ye <[email protected]>

vulkan: optimize Q2_K and Q3_K mul_mat_vec (llama/10459)

6ebd263

vulkan: skip integer div/mod in get_offsets for batch_idx==0 (llama/1…

f69379f

…0506)

vulkan: further optimize q5_k mul_mat_vec (llama/10479)

475517a

vulkan: define all quant data structures in types.comp (llama/10440)

5aad67d

metal : fix group_norm support condition (llama/0)

98690a8

Add some minimal optimizations for CDNA (llama/10498)

d147926

* Add some minimal optimizations for CDNA * ggml_cuda: set launch bounds also for GCN as it helps there too

Alcpz and others added 24 commits December 5, 2024 14:29

sycl : offload of get_rows set to 0 (llama/10432)

5acee88

ggml-cpu: fix typo in gemv/gemm iq4_nl_4_4 (llama/10580)

544a4d4

ggml : fix I8MM Q4_1 scaling factor conversion (llama/10562)

8383be9

ggml-ci

ggml-cpu: replace AArch64 NEON assembly with intrinsics in ggml_gemv_…

06bf264

…q4_0_4x4_q8_0() (llama/10567) Signed-off-by: Adrien Gallouët <[email protected]>

SYCL: Fix and switch to GGML_LOG system instead of fprintf (llama/10579)

59853e7

* Switched to GGML_LOG * Fix missing semicolon

metal : small-batch mat-mul kernels (llama/10581)

b405683

* metal : small-batch mat-mul kernels ggml-ci * metal : add rest of types ggml-ci * metal : final adjustments ggml-ci * metal : add comments ggml-ci

ggml : move AMX to the CPU backend (llama/10570)

b383af9

ggml : automatic selection of best CPU backend (llama/10606)

common : fix compile warning

76199ee

ggml-ci

files : remove make artifacts

fe9b27d

ggml: add GGML_SET Metal kernel + i32 CPU kernel (ggml/1037)

e20efac

* implemented cpu kernel * add i32 test cases in test-backend-ops * typedef `ggml_metal_kargs_set` * implemented `kernel_set` * memcpy

vulkan: optimize and reenable split_k (llama/10637)

03331b1

Use vector loads when possible in mul_mat_split_k_reduce. Use split_k when there aren't enough workgroups to fill the shaders.

Avoid using __fp16 on ARM with old nvcc (llama/10616)

9623ba1

vulkan: Implement "fast divide" (mul+shift) for unary ops like copy (…

b311da3

…llama/10642)
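
The "fast divide" in the commit title refers to replacing an integer division by a fixed divisor with a multiply and a shift. The sketch below shows the generic multiply-shift technique on the CPU; it is not taken from the Vulkan shaders, and `fastdiv_consts`, `init_fastdiv` and `fastdiv` are names chosen here for illustration. The constants are computed once per divisor, after which each division costs one high multiply, one add and one shift.

```cpp
#include <cassert>
#include <cstdint>

// Precomputed constants for dividing by a fixed 32-bit divisor d.
struct fastdiv_consts {
    uint32_t mp; // "magic" multiplier
    uint32_t L;  // shift amount, the smallest L with 2^L >= d
};

// Compute the constants once per divisor (illustrative helper).
fastdiv_consts init_fastdiv(uint32_t d) {
    assert(d != 0);
    uint32_t L = 0;
    while (L < 32 && (uint64_t{1} << L) < d) {
        ++L;
    }
    // mp = floor(2^32 * (2^L - d) / d) + 1; the product fits in 64 bits
    // because 2^L - d < d < 2^32.
    const uint64_t num = (uint64_t{1} << 32) * ((uint64_t{1} << L) - d);
    const uint32_t mp  = (uint32_t) (num / d + 1);
    return {mp, L};
}

// n / d using the high half of n * mp, one add and one shift.
// The add is done in 64 bits so it cannot overflow for any 32-bit n.
uint32_t fastdiv(uint32_t n, fastdiv_consts c) {
    const uint64_t hi = ((uint64_t) n * c.mp) >> 32;
    return (uint32_t) ((hi + n) >> c.L);
}
```

A shader version would typically do the addition in 32 bits, which restricts the range of `n` for which the result is exact; the 64-bit add above sidesteps that restriction for the sake of a simple reference.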

ggml-cpu : fix HWCAP2_I8MM value (llama/10646)

61aff48

ggml : add predefined list of CPU backend variants to build (llama/10…

dfddca0

…626) * ggml : add predefined list of CPU backend variants to build * update CPU dockerfiles

sync : ggml

dfe6652

talk-llama : sync llama.cpp

1a1fcd3

make : shim cmake

729effe

ci : disable Obj-C build + fixes

a5cd03a

ci : disable CUDA and Android builds

762f63e

readme : update build instructions

668930a

ggerganov force-pushed the sync branch from 39ed859 to 668930a Compare December 8, 2024 13:48

ggerganov marked this pull request as ready for review December 8, 2024 13:48

ci : disable freeBSD builds [no ci]

280d273

ggerganov merged commit 0164427 into master Dec 8, 2024

ggerganov deleted the sync branch December 8, 2024 18:14

KitaitiMakoto mentioned this pull request Dec 9, 2024

ruby : Sync whisper.cpp and model download feature #2617

Merged
