I’ve been coding for years, and I’ve noticed something. Developers who treat LLMs like magic get mediocre results. Developers who understand what’s actually happening get dramatically better ones.
The difference isn’t prompt engineering tricks. It’s mental models.
When you understand that LLMs are basically story-continuation engines trained on internet text, everything clicks. The weird behaviors make sense. The failures become predictable. And you know how to work with the machine instead of against it.
Let me share the mental models that changed how I think about these tools.
The Story Continuation Engine
Here’s the thing that changes everything: LLMs don’t answer questions. They continue stories.
When you type a prompt, you’re not asking an oracle for wisdom. You’re starting a narrative, and the model predicts what comes next. The “answer” is just the most likely continuation of the story you began.
This matters more than you’d think.
If you start your prompt like someone asking a dumb question, the model continues that story—with a basic, possibly wrong response. That’s what typically follows dumb questions online.
If you start like an expert discussing a nuanced problem, the model continues that story—with sophisticated reasoning and careful qualifications. That’s what typically follows expert discussions.
The model isn’t smarter in one case than the other. It’s continuing different narratives.
I watched two developers prompt the same model for code review. One wrote: “hey can u check this code for bugs lol.” The other wrote: “As a senior engineer reviewing this authentication implementation, I’m looking for security vulnerabilities and edge cases.” Same code. Wildly different responses.
The second developer understood story continuation. The first got the story they started.
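To make the contrast concrete, here's roughly what those two framings look like as literal prompt text. The snippet under review and the exact wording are illustrative, not the actual prompts from that review:

```python
# The same review request, framed two ways. The second framing starts a
# story in which careful, security-focused analysis is the natural
# continuation. The code snippet is illustrative only.
code = '''
def login(user, password):
    if password == user.password:
        return issue_token(user)
'''

casual = f"hey can u check this code for bugs lol\n\n{code}"

expert = (
    "As a senior engineer reviewing this authentication implementation, "
    "I'm looking for security vulnerabilities and edge cases. Flag anything "
    "related to credential handling, input validation, or session lifecycle, "
    "and explain the impact of each issue.\n\n"
    f"{code}"
)
```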
A Skewed Mirror
LLMs are trained on internet text. This matters because internet text is a very specific slice of human knowledge.
It’s heavily English. Western cultural norms are baked in. Academic papers, Stack Overflow, Reddit, programming blogs—these dominate. Everyday spoken conversation doesn’t. Oral traditions don’t. Non-English perspectives are underrepresented.
When I prompt an LLM, I’m essentially asking: “What would the average internet response to this be?” And “internet” means a particular, biased subset of how humans express themselves.
This matters practically. For mainstream programming questions, the model has seen thousands of similar discussions. The “average” is pretty good. For niche domains or specialized industries? The training data is sparse. The “average” might be nonsense.
I’ve seen developers get frustrated when LLMs give poor answers about specialized topics. The model isn’t failing—it’s accurately reflecting that the internet doesn’t have much good content about that topic.
Knowing the training data helps you calibrate trust. Mainstream JavaScript question? Trust it. Obscure embedded systems protocol? Verify everything.
Garbage In, Garbage Out (For Prompts)
Every developer knows GIGO. Garbage in, garbage out. We apply it to data, to API inputs, to user forms.
We forget to apply it to prompts.
Research backs this up. Teams using structured, contextual prompts saw 30% faster turnaround and better quality. Prompt format significantly impacts results. And here’s the sneaky part: irrelevant context degrades output—especially if it’s semantically similar to the actual task, which confuses the model.
There’s also prompt bloat. More context isn’t always better. Performance starts dropping at around 3,000 tokens. Past that point, the model starts losing the thread. The irrelevant stuff doesn’t just get ignored—it actively interferes.
I’ve learned to be surgical about context. Include what matters. Cut what doesn’t. Treat prompt construction with the same care I’d give API design.
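Here's a minimal sketch of what "surgical" can look like: assemble the prompt from the task plus only the snippets that matter, and stop before it bloats. The ~3,000-token budget echoes the drop-off above, and the characters-per-token estimate is a crude assumption, not a real tokenizer:

```python
# Build a prompt from the task plus only the most relevant context,
# capped at a rough token budget. The budget and the 4-chars-per-token
# estimate are assumptions for illustration.
TOKEN_BUDGET = 3000

def approx_tokens(text: str) -> int:
    return len(text) // 4  # crude estimate; use a real tokenizer if it matters

def build_prompt(task: str, snippets: list[str]) -> str:
    parts = [task]
    used = approx_tokens(task)
    for snippet in snippets:  # assumed to be pre-sorted by relevance
        cost = approx_tokens(snippet)
        if used + cost > TOKEN_BUDGET:
            break  # cut what doesn't fit rather than dilute what does
        parts.append(snippet)
        used += cost
    return "\n\n".join(parts)

prompt = build_prompt(
    "Review the retry logic below for race conditions.",
    ["def fetch_with_retry(url, attempts=3): ...",
     "def test_retries_stop_after_three_failures(): ..."],
)
```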
Role Prompting (When It Works)
“Act as an expert…” prompts are everywhere. The research on whether they work is more nuanced than the hype.
Role prompting helps with open-ended tasks: creative writing, brainstorming, exploring possibilities. The persona shifts the story. “As a security researcher…” starts a different narrative than nothing at all.
But for accuracy? Simple personas don’t make answers more correct. “You are an expert in X” doesn’t magically give the model knowledge it lacks. It might change style without improving accuracy.
What does work: specific, detailed personas. Not “you are an expert” but a paragraph describing the expert’s background, approach, and priorities. Interestingly, LLM-generated personas often beat human-written ones—the model knows what framing works best for itself.
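For illustration, here's what the gap between a thin persona and a detailed one might look like in practice. Both personas are invented examples, not prompts from any particular study:

```python
# Neither persona adds knowledge the model lacks, but the detailed one
# frames priorities, approach, and caveats far more tightly. Both are
# invented for illustration.
thin_persona = "You are an expert in database performance."

detailed_persona = (
    "You are a database reliability engineer with ten years of experience "
    "running PostgreSQL under heavy write loads. You reason about query "
    "plans, lock contention, and index bloat before suggesting schema "
    "changes. You prefer incremental, reversible fixes, you ask what the "
    "p99 latency target is before recommending anything, and you flag any "
    "advice that rests on assumptions you can't verify from the details given."
)
```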
I use role prompting for exploration and style. I don’t rely on it for facts.
Embracing the Randomness
Here’s something counterintuitive: the randomness isn’t a bug. It’s a feature.
Every developer’s instinct is to want deterministic output. Same input, same output. Reproducible. Predictable. Safe.
LLMs don’t work that way. Same prompt, different response. Temperature modulates this, but even at low temps, there’s variance.
Google DeepMind’s AlphaEvolve system taught me to think differently. It treats LLM randomness as an advantage:
- Generate many diverse candidates
- Evaluate against objective criteria
- Keep the best, discard the rest
- Iterate
The randomness creates diversity. Most variations are useless. But occasionally, one is better than anything deterministic would produce. The randomness breaks you out of local optima.
AlphaEvolve used this to find the first improvement on a 4×4 matrix multiplication algorithm in 56 years. Roughly 0.7% of Google’s global compute reclaimed. 23-32% speedups on key AI training kernels. Not by fighting randomness. By embracing it.
I’ve started doing this in my own work:
- Generate 3-5 responses, not one
- Vary prompts slightly between generations
- Sometimes use different models for diversity
- Select the best, iterate on that
The first response isn’t special. It’s one sample from a distribution. Why would I stop at one?
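Here's a minimal sketch of that sample-and-select loop, in the same spirit as the AlphaEvolve pattern above. It assumes an OpenAI-style chat SDK, a placeholder model name, and a toy scoring function; in practice the scorer is the hard, problem-specific part (tests, benchmarks, linters):

```python
# Generate several candidates at a higher temperature, score them against
# objective criteria, keep the best, and iterate on the winner.
# Assumptions: OpenAI Python SDK, placeholder model name, toy scorer.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate(prompt: str, n: int = 5, temperature: float = 0.9) -> list[str]:
    """Sample n diverse completions of the same prompt."""
    response = client.chat.completions.create(
        model="gpt-4o",           # assumption: substitute whatever you use
        messages=[{"role": "user", "content": prompt}],
        n=n,                      # several completions in one request
        temperature=temperature,  # higher temperature -> more diversity
    )
    return [choice.message.content for choice in response.choices]

def score(candidate: str) -> float:
    """Objective criteria go here: run tests, benchmark, lint, measure..."""
    return -len(candidate)  # placeholder: prefer the most concise answer

def best_of_n(prompt: str, rounds: int = 2) -> str:
    best = None
    for _ in range(rounds):
        candidates = generate(prompt) + ([best] if best else [])
        best = max(candidates, key=score)
        prompt = f"{prompt}\n\nImprove on this draft:\n{best}"  # iterate on the winner
    return best
```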
Temperature as a Creative Dial
Temperature controls how “adventurous” the model gets. Low (0.0-0.3) means predictable, close to deterministic—good for facts, code, consistency.
High (0.8-1.2) means more exploration, more creativity, more weird ideas. Good for brainstorming, creative writing, breaking out of obvious solutions.
Think of it like exploration vs exploitation. Low temp exploits what the model knows. High temp explores possibilities it might not otherwise surface.
For code, I usually want low temp. For naming things, high. For debugging, low. For architecture brainstorming, high.
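Concretely, the dial is just a request parameter. A minimal sketch, assuming an OpenAI-style SDK; most chat APIs expose an equivalent knob:

```python
# Same helper, two temperature settings: low for a predictable bug fix,
# high for divergent naming ideas. The model name is an assumption.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str, temperature: float) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content

fix = ask("Rewrite this to handle None input safely: def f(x): return x.strip()",
          temperature=0.2)
names = ask("Suggest ten names for a library that schedules retries with jitter.",
            temperature=1.0)
```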
The dial exists. Use it.
The Systems Thinking Edge
Here’s what I keep coming back to: LLMs reward systems thinking.
You understand garbage in, garbage out—apply it to prompts. You understand that training data shapes output—consider what data the model saw and what biases came with it. You understand abstraction—think about what story you’re starting.
The developers struggling with LLMs often treat them as magic black boxes. Input goes in, output comes out, and when it’s bad they shrug and try again.
The developers getting great results understand the system. They know what they’re working with. They adjust inputs based on how the system processes them. They generate multiple samples and select. They calibrate trust based on training data coverage.
The Ironic Reversal
We spent decades designing programming languages to communicate precisely with computers. Exact syntax. Explicit types. No ambiguity.
Now computers speak our language—imprecisely, statistically, narratively. The roles reversed.
But the developer’s edge remains. You understand systems. You know outputs depend on inputs. You know tools have characteristics and limitations. You know how to debug when things go wrong.
Apply that thinking to LLMs. They’re not magic. They’re systems with predictable behaviors once you understand the mechanics.
The model continues stories. It reflects its training data. It responds to prompt quality. It generates from a distribution.
Work with that, not against it.