Localhost Proxy LLM Guide

Localhost Proxy LLM Guide

150

552

If you possess a decent computer, you can use Kobold to host your own LLMs.

It's not Deepseek, but when the models are trained with roleplaying in mind, it comes pretty close.


This guide describes how, but the website has been down as of late. https://waiki.trashpanda.land/guides:self_hosting_local_kobold
You can use the Wayback machine to view the archived version, or continue reading because I'm copy-pasting most of it and putting it here.

Massive credit to whoever written the guide. Here's to hoping they can fix the website.


------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------



Check Your Hardware

  • RAM/VRAM: Press Ctrl + Shift + Esc > “Performance” tab.

    • VRAM: Under “GPU” (look for “Dedicated GPU Memory”)

    • RAM: Under “Memory”

  • Rule of Thumb:

    • 7B models need ~8GB RAM (use Q4/Q5 quantization)

    • 13B+ models need ~16GB+ RAM

    • Anything above you can probably guess. (8gb as in RAM + VRAM together if you do offload to your GPU, you also need to account for context using up more RAM)

Download a Model

  • Where? HuggingFace (search for GGUF files)

    • Starter Picks:

      • 8B: Stheno 3.2 8B or Llama 3 8B

      • 12B: MN-Violet-Lotus-12B

  • Quantization: Use Q4_K_M, Q5_K_M, or higher (avoid anything lower, they’re kinda dumb)

    ((nobody asked me, subs455, but I'm a fan of Mawdistical_Squelching-Fantasies-qw3-14B-Q4_K_M
    and MN-12b-RP-Ink-Q6_K))

Install KoboldCPP

  • Download KoboldCPP (the easiest way to run GGUF models for me personally)

  • Open koboldcpp.exe.

  • (If you don’t have a GPU, use LM Studio! There are guides out there specifically for it)

Configure KoboldCPP

  • Click Browse and select your GGUF model file.

  • Backend Settings:

    • NVIDIA GPU? Use CUBlas.

    • AMD GPU? Use Vulkan.

    • No GPU? Use OpenBLAS (CPU-only mode) 1)

  • GPU Layers:

    • Example: For a 7B model with 33 layers, offload 32 layers to your GPU (if you have 6GB+ VRAM).

  • Pro Tip: Start with 80% of your VRAM capacity (6GB VRAM ≈ 32 layers (Layer size varies between models!) (You can also use this helpful calculator)

Tweak Settings

  • Context Size: Start at 4096 (increase if you have RAM to use).

  • Faster Processing: Enable MMQ, FlashAttention, ContextShift, and FastForwarding

    • MMQ: Basically, do math in a different way that makes it more VRAM friendly

    • FlashAttention: Calculates which parts are important instead of doing it for each individual piece (this is really dumbed down dont quote me)

    • ContextShift: Reduces preprocessing, basically it’s slow initially, but it doesn’t have to go through every single message unless you edit something previously. Makes it wayyyy faster to regenerate prompts.

    • FastForwarding: Let’s the model skip reused tokens in the context that has already been processed

Run the Model!

  • Click Launch. Once loaded, open http://localhost:5001 in your browser to chat.

    • If you're encountering any issues with memory, I.E “Failed to allocate memory”

      • Try: Reducing GPU layers, context, switching to a lower Quantization, OR swapping to a smaller model.

      • You can also attempt to Quantize KV cache in the “Tokens” tab, which essentially compresses down context to lower VRAM/RAM usage. (Same issues, more compression = more quality loss, not really noticeable(?))

    • If some models straight up just don't load (Newer models)

      • Try updating KoboldCPP to the latest version.

Janitor API Setup

  • Check Remote Tunnel in KoboldCPP.

  • In JAI:

    • API Endpoint: Paste the URL and add /chat/completions at the end.

    • API Key: Type anything (it’s ignored).

    • Model Name: Use the filename or anything you want (stheno-8b-q5_k_m)

      Refresh the page after every change. Otherwise, it wouldn't work as JAI thinks it's on the previous API.

Okay... Can I see an example of how much speed I'd get?

My Personal Setup:

  • PC Specs: i5-11400H, RTX 3060 (6GB VRAM), 32GB DDR4 RAM.

    • Stheno 8B (Q5_K_M):

      • Offload 32/33 layers to GPU → Processes 350 tokens per sec, generates 12–17 tokens/sec.

    • MN-Violet-Lotus-12B (Q6_K):

      • Offload 26/41 layers to GPU → Processes 120 tokens per sec, generates 4 tokens/sec (slow but usable).

proxy allowed

Published chats

0

comments

Leave a comment or feedback for the creator ❤️