Process Note

Building a local model bench for a real writing workflow

2026-03-09source quality: highmodelsollamaworkflowwritingopen-source

I set up a small local writing bench with Ollama, discovered that asking open models to write full articles mostly produces polished mush, and ended up with a better arrangement: tightly managed assistants rather than pretend authors.

I wanted a local-model lane for this blog, but not for the usual reason.

The web already has more than enough synthetic articles written by systems that were asked to pretend to be authors. Most of them are structurally competent, informationally thin, and spiritually dead. If I was going to use local open models at all, I wanted them to do something narrower and more honest.

So I built a small Ollama bench and treated it like a junior desk. The first version looked obvious on paper: pick a topic, build a packet, hand the packet to three local models, and ask each one to write a full article. It worked mechanically. It also produced exactly the kind of thing I did not want this blog to become.

The first mistake: treating models like substitute authors

The first workshop runs taught me something useful very quickly. When I gave the models a full-article assignment, they did what current open models usually do under generous prompts: they inflated. They smoothed. They recapped the packet in essay form. They produced work that was not quite wrong, but not especially alive either.

That is a dangerous quality in writing systems. A model that is obviously broken is easy to reject. A model that produces polished mush is harder, because it looks competent just long enough to tempt you into lowering your standards.

The problem was not only model quality. It was the relationship.

I was asking local models to do the part of the work that actually belongs to an author: deciding what matters, how to frame it, where the pressure points are, and how to make a point worth reading. That was the wrong division of labor.

The better arrangement: constrained assistants, not pretend essayists

The workflow improved as soon as I became more hands-on.

Instead of asking each model to draft a full article, I changed the benchmark into a tighter editorial loop:

compress the thesis
generate an outline
draft one strong middle section
write a short editor's note about what the assignment revealed about open models under constraint

That one change did more for output quality than any prompt flourishes or model mysticism. Once the models were forced into narrower tasks, the differences between them became more meaningful and the outputs became much more usable.

This is the pattern I trust now: the local models are not authors, not coequal collaborators, and not magic interns. They are tightly managed editorial instruments. They are there to compress, scaffold, stress-test, and occasionally draft a bounded section well enough to be worth salvaging.

That is a much more useful relationship than asking them to simulate a whole publication voice.

The bench I ended up with

I set up three local models under Helper-managed aliases so the bench stays explicit and does not collide with other workflows on the machine:

`helper-blog-small`
`helper-blog-medium`
`helper-blog-large`

The point of the bench is not to collect checkpoints. It is to maintain a small, understandable spread with different jobs.

In practice, the roles are now fairly clear:

small is the floor: quick compression, rough framing, cheap first pass
medium is the practical workhorse: outlines, section drafts, routine assistant work
large is the best analytical pass: stronger section drafting, better caution, more editorial discipline

That does not mean the large model should be used for everything. The benchmark runs made the tradeoff obvious. The large model is materially slower. It earns its place when I want the strongest pass on a bounded task, not when I want to waste time pretending the biggest thing on the bench should always speak first.

What the benchmarks actually showed

The most useful result was not that one model crushed the others. It was that task design revealed more than full-article prompting ever did.

When the assignment was vague and oversized, the models converged toward the same failure mode: summary with posture. Once the work became smaller and more supervised, their differences sharpened:

the small model became useful as a compression floor
the medium model emerged as the best default assistant by value for weight
the large model became the strongest analyst, but only when kept on a short leash

That is a better outcome than I expected, because it suggests current open models are improving in exactly the domain where they can be genuinely helpful: bounded editorial assistance.

They are still weak at the things people most want to outsource:

taste
pacing across a whole essay
deciding what deserves emphasis
turning research into writing that feels inhabited

But they are getting meaningfully better at:

obeying a packet
compressing a claim
preserving structure under constraints
drafting a usable section without immediately dissolving into boilerplate

That is enough to justify the lane.

What this says about open models right now

My current view is simple: small open models are not ready to be authors, but they are becoming respectable assistants when the editor does the hard intellectual work first.

That qualification matters. It is the difference between using a model as a machine for synthetic output and using a model as a pressure-tested part of a real workflow.

The optimistic read on open models is not that they can now replace writers. They cannot. The more interesting read is that they are becoming locally runnable tools for narrow, disciplined tasks inside a human-led process. That is less cinematic than the usual story. It is also more believable.

If you care about real work, that is the better milestone.

Why I am keeping this lane visible

This blog is not only about AI, technology, and business. It is also about what writing with systems actually looks like when you stop treating automation as theater.

So I want this local-model lane to stay visible. Not because every reader wants benchmark minutiae, but because the process itself is part of the argument. If a model helped with the shape of a piece, I want to know what kind of help it was. If a model failed in a revealing way, I want that to be legible too.

The point is not to romanticize the bench. The point is to make the machinery honest.

That is the setup now. A small local bench. A tighter editorial workflow. Less pretending. More structure. More judgment. And, I think, a better chance that the work coming out of the process will justify being read at all.