Qwen 3 came out about a month ago and is one of the most capable local LLMs available today.
The 30B-A3B variant is probably the most interesting because it fits well in a 24GB GPU when quantized, and with some tricks to reduce the key-value cache size we can also support a good amount of context with it.
A coding task like "write a Tetris game in an HTML page" works well with thinking turned on, and I was able to get a working Snake game out of it with thinking switched off.
This model also runs very quickly on Apple Silicon; I have read reports of between 60 and 100 tokens per second on various hardware. I think you will need an Apple Silicon machine with 32GB or more to run it, though 48GB is likely a lot more comfortable.
Over the past few weeks I have been tinkering with SGLang to run it as fast as possible on a 3090.
I'm running Linux (Docker under an ancient Ubuntu 20.04 installation) with a 250W power limit on the 3090, because that is near the sweet spot for power efficiency.
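Setting the cap is just an nvidia-smi call; note that it needs root and resets on reboot, so it belongs in a startup script or systemd unit:

# keep the driver loaded so the limit sticks while nothing is running
sudo nvidia-smi -pm 1
# cap board power at 250 W (resets at reboot)
sudo nvidia-smi -pl 250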
With LLMs and GPUs that have ample compute, you want to use batching to get more bang for your buck out of the available memory bandwidth. If you have a batch of sequences that each need their next token, you can do the matrix work for all of them in parallel in a single pass over the model's weights, so the cost is still dominated by streaming those weights through the GPU's main memory bandwidth.
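As a toy illustration of why the extra batch elements are nearly free (a rough sketch, assuming a CUDA-capable GPU with PyTorch installed, and nothing like a rigorous benchmark): multiplying a large half-precision weight matrix by 1 column versus 8 columns takes roughly the same wall time, because either way the dominant cost is streaming the same weights out of VRAM once.

# Time a single matrix-vector product vs. a batch of 8 columns.
# Both passes stream the same 8192x8192 fp16 weight matrix (128 MB),
# so on a memory-bound GPU the wall times come out very close.
import time
import torch

assert torch.cuda.is_available()
w = torch.randn(8192, 8192, dtype=torch.float16, device="cuda")

def bench(batch):
    x = torch.randn(8192, batch, dtype=torch.float16, device="cuda")
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(100):
        _ = w @ x
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / 100

print(f"batch 1: {bench(1)*1e6:.1f} us, batch 8: {bench(8)*1e6:.1f} us")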
In distributed and high-performance computing there is a concept called the Roofline Model, which can be very powerful for predicting how an algorithm's performance scales based on its arithmetic intensity: the ratio of how much computation the workload performs to how many bytes it moves through memory. For example, if your workload involves multiplying large matrices, it is easy to end up with algorithms that exhibit very high intensity values. I studied this in depth 5 years ago, and at the time typical chips, both CPUs and GPUs, were starting to converge on and exceed an arithmetic intensity of 20 flops per byte. Now I did this simple calculation on a modern GPU (RTX 5090):
104.8 TFLOPS / 1.79 TB/s ≈ 58.5 flops per byte
That's the theoretical compute divided by the theoretical memory bandwidth. Clearly we have blown well past an arithmetic intensity of 20 in 2025. It is also interesting to consider how much further tensor cores push this: FP8 with FP16 accumulate is theoretically 838 TFLOPS (an 8x multiplier over the FP32 base figure above), which works out to an arithmetic intensity of about 468 flops per byte. That number is close to meaningless on its own, but it does mean that under the most perfect of conditions the chip can perform over 400 flops of compute per byte of churned data. If an LLM workload averages, say, 20 flops per byte, and this kind of low-precision computation forms the backbone of the workload, then roughly 20 such instances could run in parallel before compute becomes the bottleneck; that is how much compute is now heaped on top of the comparatively modest gains in memory speed.
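To make that concrete, here is the same back-of-the-envelope arithmetic as a few lines of Python, using only the spec numbers quoted above:

# Back-of-the-envelope roofline numbers for an RTX 5090 (specs as quoted above).
peak_fp32_tflops = 104.8      # theoretical FP32 throughput
peak_fp8_tflops = 838.0       # FP8 with FP16 accumulate via tensor cores
mem_bw_tbytes_s = 1.79        # theoretical memory bandwidth

print(f"FP32 arithmetic intensity: {peak_fp32_tflops / mem_bw_tbytes_s:.1f} flops/byte")
print(f"FP8  arithmetic intensity: {peak_fp8_tflops / mem_bw_tbytes_s:.1f} flops/byte")

# If the workload itself only needs ~20 flops per byte moved, the ratio hints
# at how many such streams could share the memory bandwidth before compute runs out.
workload_intensity = 20.0
print(f"parallel headroom at FP8: ~{peak_fp8_tflops / mem_bw_tbytes_s / workload_intensity:.0f}x")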
I went on a large digression there. At the end of the day, the high arithmetic intensity that modern GPUs can absorb can be leveraged to get "free" work done. If my GPU is already running inference on one LLM prompt and I submit a second one, its power draw rises by only a few watts, and the token rate may drop from, say, 140 tok/s to 120 tok/s per stream (for a total of 240 tok/s). This is very powerful.
By loading up on batch size, we are able to saturate the GPU's compute and memory bandwidth. Qwen3 30B-A3B is a mixture-of-experts model with 30 billion parameters, and quantized it fits (tightly) within 24GB. Since only about 3 billion parameters are active per token, it runs as fast as or faster than a 7B dense model: around 150 tok/s out of the gate with a batch size of 1. That is very fast for an LLM, faster than most hosted chatbots you might be using. Not having to compromise on speed when self-hosting is very nice.
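A rough sanity check on where that batch-1 speed sits (assuming ~3 billion active parameters per token at about half a byte each under w4a16, and the 3090's roughly 936 GB/s of theoretical memory bandwidth; KV-cache reads and activation math are ignored):

# Rough upper bound on single-stream decode speed from weight traffic alone.
active_params = 3e9          # ~3B parameters active per token (MoE routing)
bytes_per_param = 0.5        # ~4-bit weights under the w4a16 quantization
mem_bw_bytes_s = 936e9       # RTX 3090 theoretical memory bandwidth

bytes_per_token = active_params * bytes_per_param
print(f"weight bytes per token: {bytes_per_token/1e9:.1f} GB")
print(f"memory-bound ceiling:   {mem_bw_bytes_s / bytes_per_token:.0f} tok/s")

The observed ~150 tok/s at batch size 1 sits well under that ceiling because of KV-cache traffic, dequantization, and kernel overheads, and batching lets every request in the batch share the same weight stream, which is why total throughput keeps climbing.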
With this setup, a batch size of 8 sustains around 600 tok/s. I do find that larger batch sizes run slower, so it seems like capping it at 8 concurrent inference jobs is a good idea.
SGLang provides a standard OpenAI-compatible web API, so I'll be able to point existing tools at the locally hosted AI.
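As a quick sketch (not battle tested), the openai Python package can talk to it directly. The port matches the server command in the Dockerfile below, the model name here is just the served model path (check /v1/models if the server reports something different), and the semaphore caps things at the 8 concurrent jobs mentioned above:

# Minimal sketch of a client for the locally hosted SGLang server,
# limited to 8 in-flight requests.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:30000/v1", api_key="none")
limit = asyncio.Semaphore(8)  # no more than 8 concurrent inference jobs

async def ask(prompt: str) -> str:
    async with limit:
        resp = await client.chat.completions.create(
            model="RedHatAI/Qwen3-30B-A3B-quantized.w4a16",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=512,
        )
        return resp.choices[0].message.content

async def main():
    answers = await asyncio.gather(*(ask(f"Write a haiku about GPU #{i}") for i in range(16)))
    print("\n---\n".join(answers))

asyncio.run(main())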
I'll share the current state of my Dockerfile below. It has some special tweaks. The quantized model it serves is:
RedHatAI/Qwen3-30B-A3B-quantized.w4a16
I'm obviously just experimenting here. But since the machine would be sitting idle most of the time, I had to fix the busy-loop problem in the software before I could start running it continually on my machine at home.
As such, the following is a very potent starting point for getting 600 tok/s (you may be able to push 700, and up to 160 tok/s for a single stream) out of Qwen3 30B-A3B on a 3090.
FROM nvcr.io/nvidia/tritonserver:24.04-py3-min
# keep the same CUDA as the base image
ARG CUDA_VERSION=12.4.1
ARG BUILD_TYPE=all
ENV DEBIAN_FRONTEND=noninteractive
# ---------- system packages ----------
RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections && \
echo 'tzdata tzdata/Zones/America select Los_Angeles'| debconf-set-selections && \
apt update -y && \
apt install -y software-properties-common \
python3 python3-pip \
curl git sudo \
libibverbs-dev rdma-core infiniband-diags \
openssh-server perftest && \
rm -rf /var/lib/apt/lists/* && apt clean
# optional utility for MiniCPM models
RUN pip3 install datamodel_code_generator
WORKDIR /sgl-workspace
# ---- Torch 2.6 (CU-12.4) ----
ARG TORCH_VER=2.6.0+cu124
ARG CUINDEX=124
# ---------- Torch, patched sglang ----------
RUN python3 -m pip install --upgrade pip setuptools wheel html5lib six && \
python3 -m pip install --no-cache-dir \
--index-url https://download.pytorch.org/whl/cu${CUINDEX} \
"torch==${TORCH_VER}" \
&& git clone https://github.com/nytopop/sglang.git \
&& cd sglang \
&& git checkout 60cef0fd623472219104258d17053a9733df0a4a \
&& python3 -m pip install -e "python[${BUILD_TYPE}]" \
--find-links https://flashinfer.ai/whl/cu${CUINDEX}/torch2.6/flashinfer-python
# ---------- vLLM + newer compressed-tensors ----------
RUN python3 -m pip install "vllm[cu${CUINDEX}]==0.8.5" && \
python3 -m pip install --no-deps --upgrade "compressed-tensors>=0.9.4"
# --- patch sglang scheduler to cut CPU spin only in idle loops: https://github.com/sgl-project/sglang/pull/6026 ---
RUN set -e; \
F=/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py; \
echo "==> Backing up $F -> ${F}.orig"; \
cp "$F" "${F}.orig"; \
python3 - "$F" <<'PY' && \
echo "==> Diff after patch:" && \
diff -u ${F}.orig ${F} || true
import sys, re, pathlib

path = pathlib.Path(sys.argv[1])
text = path.read_text().splitlines(keepends=True)
patched = []
for i, line in enumerate(text):
    patched.append(line)
    if (
        i > 0
        and text[i - 1].lstrip().startswith("self.check_memory()")
        and line.lstrip().startswith(
            "self.new_token_ratio = self.init_new_token_ratio"
        )
    ):
        indent = re.match(r"\s*", line).group(0)
        patched.append(f"{indent}time.sleep(0.001)\n")
# ensure `import time` exists
if not any(re.match(r"\s*import time", l) for l in patched):
    patched.insert(0, "import time\n")
path.write_text("".join(patched))
PY
ENV DEBIAN_FRONTEND=interactive
# ---------- default command ----------
CMD ["python3","-m","sglang.launch_server", \
"--model-path","RedHatAI/Qwen3-30B-A3B-quantized.w4a16", \
"--reasoning-parser","qwen3", \
"--dtype","float16", \
"--host","0.0.0.0", \
"--port","30000"]
# ---------- end Dockerfile ----------
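Building and running it is the usual Docker routine; something like the following should do (the image tag is arbitrary, --gpus all assumes the NVIDIA container toolkit is set up, and the Hugging Face cache mount just avoids re-downloading the weights on every restart):

docker build -t sglang-qwen3 .
docker run --rm --gpus all --shm-size 16g \
    -p 30000:30000 \
    -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
    sglang-qwen3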
I've been very impressed with how well bleeding-edge CUDA software runs on even really old distros. Ubuntu 20.04 came out in April 2020 and is over 5 years old now. As long as you have a suitable Nvidia driver installed, the rest of the software stack can be fully managed via Docker, which has been a huge game changer.
Actually, the only reason I'm currently motivated to upgrade this install is the really old ZFS version.