If you want to avoid wasting too much money on locally hosted LLMs, the important thing is to drive down your cost of electricity: run during off-peak hours and, if you can, acquire solar.
Get your electricity down to a reasonable rate, then base your cost calculations on that number together with your system's token throughput. From there you can work out how many tokens you get per dollar, survey the available non-local options, and see how much money you are giving up purely to get the privacy of local hosting, since privacy is the one thing an API cannot sell you.
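Here's a minimal sketch of that comparison in Python; the helper names are mine and the inputs are whatever you measure or look up for your own setup:

```python
# Rough helpers for the tokens-per-dollar comparison described above.

def tokens_per_dollar(tok_per_s: float, system_watts: float, usd_per_kwh: float) -> float:
    """Tokens generated per dollar of electricity at a sustained tok/s."""
    usd_per_hour = (system_watts / 1000.0) * usd_per_kwh
    return (tok_per_s * 3600.0) / usd_per_hour

def privacy_premium_per_mtok(local_tok_per_dollar: float, api_usd_per_mtok: float) -> float:
    """Extra dollars per million tokens you pay to keep inference local."""
    local_usd_per_mtok = 1_000_000 / local_tok_per_dollar
    return local_usd_per_mtok - api_usd_per_mtok
```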
Another factor I realized is important: increase the efficiency of your hardware setup. For me that meant consolidating boxes to cut the power wasted by idling computers. My workstation used to be both a NAS and a GPU box: a 5950X on X570 with 2x3090 and about 13 HDDs. That thing idled around 200W, so it wasn't exactly cheap, but I basically had to keep it running 24x7 because I didn't want my storage going offline at times (and I wanted to avoid the wear and tear of spinning all those disks down every day), so I chose to leave it on rather than save a few tens of bucks a year on electricity.

A few months back I rejiggered everything. I acquired a 3090 Ti and put all three cards (the 2x3090 plus the 3090 Ti) into a dedicated GPU box on the 5950X with a new HX1500i, and that box stays OFF any time I'm not running inference or tinkering with it. I repurposed my "testbench" 5600G B550 platform to run the NAS, which now draws less than 100W left running 24x7 (roughly; I haven't measured it yet). I also have solar now, so the utility credit I've built up can be spent as effectively free inference electricity, though going forward I need to keep an eye on my usage, because the effective rate shoots back up to utility-company rates if I consume more than my solar production. Note also that a 3090 Ti idles at around 9W while a 3090 usually idles north of 30W, so it may be worth spending a little more on a 3090 Ti just for the better idle power consumption. Since I have 3090s, my approach of only booting the machine when I need inference will save me a good little chunk of electricity.
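To put a rough number on what the consolidation is worth, assuming 24/7 uptime and the $0.15/kWh rate I use below (and note the ~100W NAS figure is still a guess, not a measurement):

```python
# Annual idle cost at a flat electricity rate.
HOURS_PER_YEAR = 24 * 365

def annual_idle_cost(watts: float, usd_per_kwh: float = 0.15) -> float:
    return (watts / 1000.0) * HOURS_PER_YEAR * usd_per_kwh

print(annual_idle_cost(200))  # old combined NAS+GPU box: ~$263/yr just idling
print(annual_idle_cost(100))  # new NAS-only box: ~$131/yr, GPU box stays off otherwise
```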
For example, let me run a rough version of this analysis on my own setup, assuming a fixed solar-subsidized electric rate of $0.15/kWh for the dollars-to-tokens math. As explained just above, I took steps to further subsidize my local LLM experimentation by installing solar with as much production capacity as I could convince the installer to give me, and realistically the 3090s are what I intend the overproduction for, so up to a point my usage will be effectively free.
So, 2x3090 + 1x3090 Ti running at 300/300/350W power limits is 950W for the GPUs; call it 1150W for the whole box running flat out. I still haven't gotten around to tinkering with LLMs on this machine in the last few months, so I have to pull a number out of the air: let's say I get 40 tok/s running OSS 120B. I'd probably want to be using qwen3 coder 80b tbh, but this is a good line in the sand. The calc goes: 40 tok/s * 3600 s = 144,000 tokens per hour, that hour costs 1.15 kWh * $0.15 = $0.1725, so 144,000 / $0.1725 = ~834,783 tok/$, which is $1.198 per 1M tokens.
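The same numbers as a tiny script (the 40 tok/s and the extra ~200W for the rest of the box are assumptions, not measurements):

```python
watts       = 1150        # 300 + 300 + 350W GPU limits plus ~200W for CPU/fans/etc.
usd_per_kwh = 0.15
tok_per_s   = 40          # guessed single-stream rate for the 120B model

usd_per_hour   = (watts / 1000.0) * usd_per_kwh    # $0.1725
tok_per_hour   = tok_per_s * 3600                  # 144,000
tok_per_dollar = tok_per_hour / usd_per_hour       # ~834,783
usd_per_mtok   = 1_000_000 / tok_per_dollar        # ~$1.198
print(f"{tok_per_dollar:,.0f} tok/$  ->  ${usd_per_mtok:.3f}/Mtok")
```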
Comparing to published API pricing, the main number to look at is the output token price. On Groq, the 120B is $0.60 per million output tokens, so even Groq, which runs this model on hardware giving over 10x my token rate (~500 tok/s), costs less than my local hosting. On openrouter the prices and speeds vary wildly, between $0.20 and $0.60/Mtok and 50 and 200 tok/s respectively. What this says is that my $1.20/Mtok is NOT COMPETITIVE against API usage for this model, which means I'm paying roughly $0.60 per megatoken for privacy in this case. It's ... okay; it doesn't rule out using this model at this performance, but I'd want to see whether batch parallelism can drive down my cost of privacy. If I can get about 3x throughput from batching, I roughly break even and end up paying $0 for privacy, which feels better.
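The break-even point for batching is just the ratio of the two prices; taking roughly the middle of the openrouter range:

```python
local_usd_per_mtok = 1.198   # my single-stream estimate from above
api_usd_per_mtok   = 0.40    # mid-range openrouter price for this model

speedup_needed = local_usd_per_mtok / api_usd_per_mtok
print(f"need ~{speedup_needed:.1f}x aggregate throughput from batching to break even")
# ~3.0x, which is where the "3x batch parallelism" figure comes from
```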
Another example: suppose I run something more efficient. In the past I got qwen3-30b-a3b running on a single 3090 with sglang at around 600 tok/s batched at x8 concurrency (140 tok/s for a single inference). Since it fits entirely on one GPU, with 3 GPUs I could be getting roughly 1800 tok/s aggregate. Running the same calc: 1800 * 3600 = 6,480,000 tokens per $0.1725 hour, which is about $0.027 per 1M tokens.
Now, I can't find this model on Groq, but that's okay; compare instead to the openrouter offering of 180 tok/s at $0.40/Mtok output: my local estimate comes in at well over 10x cheaper per token.
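Same arithmetic for the batched small-model case (scaling the single-GPU 600 tok/s figure by 3 is an assumption; in reality you'd want to measure the aggregate throughput):

```python
watts       = 1150
usd_per_kwh = 0.15
tok_per_s   = 3 * 600     # ~1800 tok/s aggregate across 3 GPUs at x8 concurrency

usd_per_hour = (watts / 1000.0) * usd_per_kwh            # $0.1725
usd_per_mtok = usd_per_hour / (tok_per_s * 3600 / 1e6)   # ~$0.027
print(f"~${usd_per_mtok:.3f}/Mtok locally vs $0.40/Mtok on openrouter")
```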
So at least for a smaller model that fits in an individual GPU's VRAM, the extra performance you can extract (especially with efficient batch parallelism) can very well beat pay-by-the-token pricing.
Anyway, the conclusion is that power efficiency is reasonable as long as you mostly run models that fit in a single GPU's memory. Running LLMs that have to be split across GPUs makes the power efficiency suffer, though not terribly.
If you are coding but don't benefit from the privacy side of the equation, pay-by-the-token rates are quite often (though not always) cheaper than the electricity it takes to run inference yourself. That is extremely likely to be the case for a model you're pushing your machine to its limit to host (e.g. splitting the model across GPUs or sharding across an Apple silicon cluster with exo).
What's more, paying by the token for frontier models for coding is already a very cost-ineffective way to build software. The pricing may be a somewhat fair reflection of the energy you're consuming, but the costs add up very quickly. A much more usable alternative is subscriptions like
... and many others, which let you blast through something like 10 million tokens (the limit varies) within a 5-hour usage window and then come back and do it again once the timer resets. As long as you use them enough, a couple of times a week is probably plenty, you come out far ahead economically compared to paying through the nose on a per-token basis.
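As a very rough illustration of why that math works out (the subscription price, token allowance, and per-token rate below are all made-up placeholders; plug in the real numbers for whatever plan and model you're looking at):

```python
subscription_usd_per_month = 20       # hypothetical plan price
mtok_per_window            = 10       # ~10M tokens per 5-hour window (varies by plan)
windows_per_week           = 2        # "a couple times a week"
api_usd_per_mtok           = 5.0      # hypothetical frontier-model output price

mtok_per_month     = mtok_per_window * windows_per_week * 4
equivalent_api_usd = mtok_per_month * api_usd_per_mtok
print(f"~{mtok_per_month} Mtok/month: ${equivalent_api_usd:.0f} at per-token rates vs ${subscription_usd_per_month} subscription")
```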