Instructions to use BeaverAI/Artemis-31B-v1f-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use BeaverAI/Artemis-31B-v1f-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="BeaverAI/Artemis-31B-v1f-GGUF", filename="Artemis-31B-v1f-Q2_K.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use BeaverAI/Artemis-31B-v1f-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf BeaverAI/Artemis-31B-v1f-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf BeaverAI/Artemis-31B-v1f-GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf BeaverAI/Artemis-31B-v1f-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf BeaverAI/Artemis-31B-v1f-GGUF:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf BeaverAI/Artemis-31B-v1f-GGUF:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf BeaverAI/Artemis-31B-v1f-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf BeaverAI/Artemis-31B-v1f-GGUF:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf BeaverAI/Artemis-31B-v1f-GGUF:Q4_K_M
Use Docker
docker model run hf.co/BeaverAI/Artemis-31B-v1f-GGUF:Q4_K_M
- LM Studio
- Jan
- Ollama
How to use BeaverAI/Artemis-31B-v1f-GGUF with Ollama:
ollama run hf.co/BeaverAI/Artemis-31B-v1f-GGUF:Q4_K_M
- Unsloth Studio
How to use BeaverAI/Artemis-31B-v1f-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for BeaverAI/Artemis-31B-v1f-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for BeaverAI/Artemis-31B-v1f-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for BeaverAI/Artemis-31B-v1f-GGUF to start chatting
- Pi
How to use BeaverAI/Artemis-31B-v1f-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf BeaverAI/Artemis-31B-v1f-GGUF:Q4_K_M
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "BeaverAI/Artemis-31B-v1f-GGUF:Q4_K_M" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use BeaverAI/Artemis-31B-v1f-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf BeaverAI/Artemis-31B-v1f-GGUF:Q4_K_M
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default BeaverAI/Artemis-31B-v1f-GGUF:Q4_K_M
Run Hermes
hermes
- Docker Model Runner
How to use BeaverAI/Artemis-31B-v1f-GGUF with Docker Model Runner:
docker model run hf.co/BeaverAI/Artemis-31B-v1f-GGUF:Q4_K_M
- Lemonade
How to use BeaverAI/Artemis-31B-v1f-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull BeaverAI/Artemis-31B-v1f-GGUF:Q4_K_M
Run and chat with the model
lemonade run user.Artemis-31B-v1f-GGUF-Q4_K_M
List all available models
lemonade list
General observation concerning Gemma 4 in RP (highly relevant to the case)
Ahem... So, I noticed Gemma 4 is inclined to write these counter-statements:
- not X, but Y
- character didn't do [thing]; instead, she did [other thing]
- something [not happened], something else [happened]
I thought I was getting crazy, all chat logs in SillyTavern are really FULL of this. Gemma 4 just doesn't write what merely IS. It always backtracks to whatever opposite states are associated with the ideas it's about to generate: "She didn't slow down; instead, she pressed onwards". Holy shit, it's so tiring to read!
I attempted to instruct it against such language, Gemma 4 kept a relatively high adherence to instructions initially, and then slowly defaulted back to that crap. Can fine-tuning even do anything about it? Should we expect any better from your "Artemis" model down the road? Thanks.
Oh you're not talking about thinking you're talking about the actual RP.
I.... still dont see that.
Ahem... So, I noticed Gemma 4 is inclined to write these counter-statements:
- not X, but Y
- character didn't do [thing]; instead, she did [other thing]
- something [not happened], something else [happened]
I thought I was getting crazy, all chat logs in SillyTavern are really FULL of this. Gemma 4 just doesn't write what merely IS. It always backtracks to whatever opposite states are associated with the ideas it's about to generate: "She didn't slow down; instead, she pressed onwards". Holy shit, it's so tiring to read!
I attempted to instruct it against such language, Gemma 4 kept a relatively high adherence to instructions initially, and then slowly defaulted back to that crap. Can fine-tuning even do anything about it? Should we expect any better from your "Artemis" model down the road? Thanks.
Yeah, can confirm. Maybe not to the extreme you mention, but Gemma 4 is really hit and miss in that regard (and the em-dashes, all dem em-dashes everywhere). Still better behavior than Qwen-27B which can barely write 2 messages without hallucinating something, or writing plain nonsense, but that's not exactly a high bar to pass.
It's very obedient to system messages or "(OOC: blah)" instructions, but only for a few messages as you noticed.
I got some interesting results by moving my writing style directives to the thinking block. What's great about Gemma's thinking block is that it's basically a list of statements. So as long as you follow the same format and don't go overboard, you can prepend a few lines to guide it, assuming your front-end allows you to do that. If it doesn't, I vaguely recall that SillyTavern has an option to leave a fixed-position system message / note somewhere, so you could set a stylistic note so it's always X messages from current one.
While i doubt you could remove the classic "Not X, but Y" entirely that way (it is quite an universal trait of models and finetunes, even larger ones), I used a tactic like that to counter balance its "assistant-speak as default" and its tendency toward increasingly long responses.
@FrenzyBiscuit
Yes, it's not about thinking - actual output is where it matters.
I.... still dont see that.
Just in case, I'm talking specifically about the baseline Gemma 4 Instruct model. If a fine-tune doesn't have this issue, then it's absolutely great to know!
I'd like to mention that the following examples were generated with a system prompt that doesn't have any commands to address this, and neither does it have any "but" or "instead" constructions of its own (as a precaution against cannibalisation). The model runs via LMStudio (chat completion in ST, properly updated jinja template), with an updated llamacpp CUDA 12 (v 2.13.0 runtime). Pardon me for such a small-sized text, it's better to open the images in new tab, zooming in to 100%:
"Laugh ripples through the air - not of mockery, but of..."
and immediately afterwards
"She doesn't pull away; instead, she..."
I can give dozens of such examples if you want, I can even record videos with the baseline Gemma 4 model generating such things. It's almost like it's is too wary of potentially giving a wrong / incomplete answer. Here's a more tense scenario, triggering the model's behavior in different ways:
"She didn't flinch; instead, she..."
Most interesting example would be this one. From a literary standpoint, such sentences are fine when you have to deliver some context, and I generally approve of it:
"A cold shiver of genuine dread raced down Yvette's spine, not out of fear for her life—she had looked death in the eye a hundred times—but out of a visceral loathing for the thought of being owned again."
With that being said, the model simply doesn't have the guts to write:
"A cold shiver of genuine dread raced down Yvette's spine, out of a visceral loathing for the thought of being owned again."
I'm pretty good at noticing patterns. More often than not, Gemma 4 delivers information that basically disrespects the user's intelligence. It generates "SHE DID NOT WEEP" and then writes about {{char}}'s dry eyes and trembling lips... Come on, just from that alone I'd feel the weight of the tears that never fell.
Yeah, can confirm. Maybe not to the extreme you mention, but Gemma 4 is really hit and miss in that regard (and the em-dashes, all dem em-dashes everywhere). Still better behavior than Qwen-27B which can barely write 2 messages without hallucinating something, or writing plain nonsense, but that's not exactly a high bar to pass.
Glad to hear at least someone noticed it too! As for em-dashes, they don't truly convey any meaning so it's pretty light on trashy cognitive load. I can definitely live with that, unlike the aforementioned counter-statements and a needless negation of what didn't happen.
Damn... Yeah, as I said, I encounter that structure a lot (on all models tbh). but in your example, it's crazy how much it does it within the same post.
I've been using G4 a lot since its release, and while I have my pet peeves with it, I don't remember it being that bad. Make sure you use a DRY sampler (or presence / repetition penalty as a last ditch effort if LMStudio doesn't know what DRY is).
I have a pet theory that G4, after a few posts, gets in love with a structure, not always the same depending on user input / samplers. I have a chatlog that ended with several em-dashes per sentence. I have another chatlog where I made the terrible mistake of using one emoji, 20 messages later, its responses had a TON of emojis (in a weirdly structured way too) everywhere. I have a feeling it's the same for your "Not X But Y" thing.
(Mistral had that thing too, except it was the number of paragraphs and sentences that became identical, plus the starting "Oh")
Just in case, I'm talking specifically about the baseline Gemma 4. If a fine-tune doesn't have this issue, then it's absolutely great to know!
FWIW, right now the main advantage of this particular fine-tune over baseline is that when you reroll, you don't always get the same response vaguely reformulated, which might help alleviate the problem. So maybe worth a try.
Honestly, it just seems to respect certain rules and patterns too well. I'll try the fine-tuned variant later on (not a fan of re-downloading many initial iterations).
@TheDrummer I'm begging you to fix the BOS thing before making the model official. I'm writing a middleware library (any backend, any model, same code) and i'm close to releasing a very fucking good chat program. Handling a fucking 4th meta layer regarding BOS after "is it the backend? is it the model? is it the frontend?", and now "is the model lying to me?", that's just too much, man. Ngl, the fault entirely lies with the backend devs assuming they're meant to "manage" the text they're being fed, they shouldn't have that toggle to begin with, but here we are. Or at least make it "NO BOS" in the gguf metadata, but i have a feeling llama.cpp overrides it for G4.
Ahem... So, I noticed Gemma 4 is inclined to write these counter-statements:
- not X, but Y
- character didn't do [thing]; instead, she did [other thing]
- something [not happened], something else [happened]
I thought I was getting crazy, all chat logs in SillyTavern are really FULL of this. Gemma 4 just doesn't write what merely IS. It always backtracks to whatever opposite states are associated with the ideas it's about to generate: "She didn't slow down; instead, she pressed onwards". Holy shit, it's so tiring to read!
I attempted to instruct it against such language, Gemma 4 kept a relatively high adherence to instructions initially, and then slowly defaulted back to that crap. Can fine-tuning even do anything about it? Should we expect any better from your "Artemis" model down the road? Thanks.
It's not just you and it's not just Gemma 4... https://www.theguardian.com/commentisfree/2026/apr/15/chatgpt-stylistic-quirk-its-not-x-its-y
It's not just you and it's not just Gemma 4... https://www.theguardian.com/commentisfree/2026/apr/15/chatgpt-stylistic-quirk-its-not-x-its-y
Does that justify Gemma 4's writing style somehow? I've seen quite a few LLMs never generating such an abundant quantity of such crap.
It's actually impressive how some people tolerate this, as if it's a totally normal thing to be expected in prose.
"As soon as the door shuts behind her, she doesn't go to the kitchen." -- peak writing right here.



