This post was last edited by wiz123 on 2026-5-11 11:11
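Turns out you don't need a graphics card at all to run a 35B LLM locally. The trick is to use an MoE model: generation runs at about 20 t/s, which is fast enough for real-time chat and even for "raising lobsters" (養龍蝦). The most important thing is context length: with everything in DDR5 RAM you can keep talking as long as you like, and it won't forget what was said at the start.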
load_backend: loaded CPU backend from /app/libggml-cpu-alderlake.so
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen35moe 35B.A3B IQ4_XS - 4.25 bpw | 17.63 GiB | 34.66 B | CPU | 8 | pp512 | 107.09 ± 0.38 |
| qwen35moe 35B.A3B IQ4_XS - 4.25 bpw | 17.63 GiB | 34.66 B | CPU | 8 | tg128 | 20.38 ± 0.13 |
load_backend: loaded CPU backend from /app/libggml-cpu-alderlake.so
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q6_K | 26.55 GiB | 34.66 B | CPU | 8 | pp512 | 120.33 ± 4.32 |
| qwen35moe 35B.A3B Q6_K | 26.55 GiB | 34.66 B | CPU | 8 | tg128 | 17.45 ± 0.42 |
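The two CPU runs above fit the usual picture of token generation being limited by how fast weights can be streamed from RAM. Here is a rough back-of-the-envelope sketch in Python of the read rate each run implies. It assumes only about 3B of the 34.66B parameters are touched per token (my reading of the "A3B" in the model name) and that the quantized weights are spread evenly through the file, which is not quite true in practice. The function name is made up for illustration; llama-bench reports none of this directly.

```python
# Rough estimate of how much weight data is streamed from RAM per second
# during token generation with an MoE model.
# ASSUMPTIONS: ~3B active parameters per token (read off the "A3B" label),
# and bytes spread evenly over all parameters (ignores mixed-precision
# tensors and KV cache reads).

TOTAL_PARAMS_B = 34.66   # from the llama-bench "params" column
ACTIVE_PARAMS_B = 3.0    # assumption based on the model name


def implied_read_rate_gib_s(model_size_gib: float, tokens_per_s: float) -> float:
    """Effective GiB/s of weights read, given the measured tg128 speed."""
    gib_per_token = model_size_gib * ACTIVE_PARAMS_B / TOTAL_PARAMS_B
    return gib_per_token * tokens_per_s


# (model size in GiB, tg128 t/s) copied from the two CPU tables
runs = {
    "IQ4_XS": (17.63, 20.38),
    "Q6_K": (26.55, 17.45),
}

for name, (size_gib, tg) in runs.items():
    rate = implied_read_rate_gib_s(size_gib, tg)
    print(f"{name}: ~{rate:.0f} GiB/s effective weight reads")
```

Under these assumptions both runs come out in the tens of GiB/s. At a similar read rate, a dense model that had to read the full 17.63 GiB for every token would manage under 2 t/s, which is roughly why the MoE layout is what makes CPU-only use practical here.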
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 56237 MiB):
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32110 MiB
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
load_backend: loaded CUDA backend from /app/libggml-cuda.so
load_backend: loaded CPU backend from /app/libggml-cpu-alderlake.so
| model | size | params | backend | ngl | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| qwen35moe 35B.A3B IQ4_XS - 4.25 bpw | 17.63 GiB | 34.66 B | CUDA | 99 | 24 | pp512 | 6033.68 ± 67.15 |
| qwen35moe 35B.A3B IQ4_XS - 4.25 bpw | 17.63 GiB | 34.66 B | CUDA | 99 | 24 | tg128 | 203.47 ± 0.73 |
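For a quick sense of what the GPUs buy you, here is a small Python sketch that divides the CUDA numbers by the CPU-only numbers for the same IQ4_XS model. The figures are the mean t/s values copied from the tables above. It is not a strict apples-to-apples comparison, since the CUDA run also used 24 threads instead of 8.

```python
# Ratio between the CUDA run and the CPU-only run for the IQ4_XS model.
# Values are the mean t/s figures from the llama-bench tables above.

cpu = {"pp512": 107.09, "tg128": 20.38}
gpu = {"pp512": 6033.68, "tg128": 203.47}


def speedup(test: str) -> float:
    """How many times faster the CUDA run is for the given test."""
    return gpu[test] / cpu[test]


for test in ("pp512", "tg128"):
    print(f"{test}: CUDA is ~{speedup(test):.1f}x the CPU speed")
```

The gap is much wider for prompt processing (about 56x) than for generation (about 10x), so the CPU-only setup should feel fine while it is replying and noticeably slower when it has to read in a long prompt.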