
Testing the Chaos: Why LLM Failures Move Our Work Upstream

A student in our AI with Python cohort recently sent me an email that struck a chord.

He was exhausted after a long day of coding and reviews, grappling with a fundamental question that every developer eventually hits when moving from playing with LLMs to building actual products with them.

"There are a couple of things I don't yet see how to test. For example, how do I see that a failed response from the LLM, deep inside a routine, throws an LLMParseError? Do I abstract that code into its own function, or do I do mocking? What is the normal way to proceed?"

This is a great question because it highlights exactly how our work is changing. As I have written before, AI does not necessarily make our jobs easier. It just moves the hard part upstream.

In this case, the hard part isn't writing the code that talks to the AI. The real work is designing the system that survives when the AI acts up.

The False Choice

When you are staring at a piece of code that feels untestable, your instinct might be to choose between a cleaner architecture and testing tools like mocks. The truth is that you need both, and for a very specific reason.

You aren't just cleaning up code when you move an LLM call into its own function. You are creating a boundary. You are deciding exactly where the unpredictable world of the LLM ends and where your reliable Python logic begins.

1. Define the Boundary

If you have an LLM call buried in the middle of a large function, testing a failure is a nightmare. By pulling that call out into a dedicated service or function, you give yourself a hook: a single, well-defined place that your tests can replace.

This is the architectural shift. You stop focusing on the process as a whole and start focusing on the interface between your code and the model.
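Here is a minimal sketch of such a boundary, assuming the real model client can be wrapped as any callable that takes a prompt string and returns raw text. The names `call_model`, `summarize`, `send`, and the `"summary"` key are hypothetical, chosen only for illustration; `LLMParseError` mirrors the exception from the student's question but is defined locally here.

```python
import json
from typing import Callable


class LLMParseError(Exception):
    """Raised when the model's output cannot be turned into the data we expect."""


def call_model(prompt: str, send: Callable[[str], str]) -> dict:
    """The boundary: everything unpredictable about the LLM stays in here.

    `send` is whatever actually talks to the model (an SDK call or an HTTP
    request). Passing it in keeps this function easy to replace in tests.
    """
    raw = send(prompt)
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise LLMParseError(f"Model returned invalid JSON: {raw!r}") from exc
    if not isinstance(data, dict) or "summary" not in data:
        raise LLMParseError("Model response is missing the 'summary' key")
    return data


def summarize(text: str, send: Callable[[str], str]) -> dict:
    """Reliable Python logic: it only ever sees a parsed dict or an LLMParseError."""
    try:
        data = call_model(f"Summarize as JSON: {text}", send)
    except LLMParseError:
        return {"status": "error_handled", "summary": None}
    return {"status": "ok", "summary": data["summary"]}
```

With this split, `summarize("...", lambda p: '{"summary": "short"}')` returns `{"status": "ok", "summary": "short"}`, while a lambda returning gibberish yields `{"status": "error_handled", "summary": None}`. Nothing outside `call_model` needs to know how the model formats its answers.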

2. Control the Failure

Once you have that hook, you use mocking. Testing against real API calls is slow, expensive, and frustratingly inconsistent. You should not wait for the LLM to actually fail to see if your code works. You should force it to fail.

Using a tool like unittest.mock or pytest-mock allows you to tell your code: "In this test, the LLM is going to return total gibberish. Now, let us see if you handle it or if you just crash."

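# Requires pytest and pytest-mock (which provides the `mocker` fixture).
# The module paths are illustrative; adjust them to your project layout.
from app.main import run_main_process
from app.services.llm_service import LLMParseError


def test_app_handles_parsing_error(mocker):
    # We simulate the specific failure we want to guard against.
    # Patch the name where it is looked up: this works if run_main_process
    # calls llm_service.call_model(...) through the module.
    mocker.patch(
        "app.services.llm_service.call_model",
        side_effect=LLMParseError("The LLM returned invalid JSON"),
    )

    # We check if our main logic knows what to do next
    result = run_main_process()
    assert result["status"] == "error_handled"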
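If you would rather stay in the standard library, the same idea works with `unittest.mock.patch.object`. This is a self-contained sketch: `LLMService`, `run_main_process`, and the `"error_handled"` status are hypothetical stand-ins for your own code, not a real API.

```python
import unittest
from unittest import mock


class LLMParseError(Exception):
    """Raised when the model's output cannot be parsed."""


class LLMService:
    # Hypothetical service class; in real code this method would call the model API.
    def call_model(self, prompt: str) -> dict:
        raise NotImplementedError("real network call; tests never reach this")


def run_main_process(service: LLMService) -> dict:
    try:
        data = service.call_model("Extract the invoice fields as JSON")
    except LLMParseError as exc:
        # Degrade gracefully instead of crashing.
        return {"status": "error_handled", "reason": str(exc)}
    return {"status": "ok", "data": data}


class TestMainProcess(unittest.TestCase):
    def test_handles_parsing_error(self):
        # Force the failure: every call to call_model raises LLMParseError.
        with mock.patch.object(
            LLMService,
            "call_model",
            side_effect=LLMParseError("The LLM returned invalid JSON"),
        ) as fake_call:
            result = run_main_process(LLMService())

        self.assertEqual(result["status"], "error_handled")
        fake_call.assert_called_once()
```

Saved as a `test_*.py` module, this runs under either `python -m unittest` or `pytest`, since pytest also collects `unittest.TestCase` classes.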

The New Cognitive Load

If you feel worn down, it is because the mental load has shifted. In the past, you spent your energy worrying about syntax and loops. Now, you have to spend it worrying about the "what-ifs."

What if the model hallucinates?

What if the JSON is missing a key?

What if the connection drops halfway through?

This is what it means to move work upstream. You are no longer just a coder in the traditional sense. You are becoming a governor of a complex system. You are building a safety net for a technology that is inherently unstable.
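Each of these what-ifs can become its own forced failure. The sketch below uses hypothetical names (`classify_ticket`, `parse_category`, the `ALLOWED` labels, the status strings) and a plain `unittest.mock.Mock` as the model client. It cannot detect hallucinations in general; an out-of-vocabulary label stands in for one here, which is only the kind of hallucination a schema check can catch.

```python
import json
from typing import Callable
from unittest.mock import Mock

ALLOWED = {"bug", "billing", "other"}  # hypothetical label set


class LLMParseError(Exception):
    """The model answered, but not with data we can trust."""


def parse_category(raw: str) -> str:
    try:
        category = json.loads(raw)["category"]
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        # Invalid JSON, a missing key, or a non-object payload.
        raise LLMParseError(f"unusable response: {raw!r}") from exc
    if not isinstance(category, str) or category not in ALLOWED:
        # Well-formed JSON, but an invented label: treat it as a failure too.
        raise LLMParseError(f"unknown category {category!r}")
    return category


def classify_ticket(text: str, send: Callable[[str], str]) -> dict:
    try:
        raw = send(f"Classify this ticket as JSON: {text}")
    except ConnectionError:
        # A transport problem, not a parsing one, so it gets its own outcome.
        return {"status": "retry_later"}
    try:
        return {"status": "ok", "category": parse_category(raw)}
    except LLMParseError as exc:
        return {"status": "error_handled", "reason": str(exc)}


# One fake client per what-if; no real API calls are made.
scenarios = {
    "hallucinated label": Mock(return_value='{"category": "refund_wizard"}'),
    "missing key": Mock(return_value='{"label": "bug"}'),
    "connection drop": Mock(side_effect=ConnectionError("reset by peer")),
}
results = {name: classify_ticket("App crashes on login", fake)
           for name, fake in scenarios.items()}
```

Keeping the connection failure separate from the parse failures matters: a dropped connection is usually worth retrying as-is, while a malformed answer may call for a different prompt.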

Why This Matters

Learning how to mock these failures is not just a technical hurdle. It is a mindset shift. When you write a test for a failed LLM response, you are forced to think about your error handling strategy. You are forced to decide whether the app should retry, log a warning, or ask the user for help.
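Writing the test is where that decision gets made. Below is a minimal sketch of one possible policy, assuming a hypothetical `fetch` callable that either returns parsed data or raises `LLMParseError`: retry, log a warning on each failure, and hand the problem back to the user when the attempts run out. The attempt count and status strings are illustrative, not a recommendation.

```python
import logging
from typing import Callable
from unittest.mock import Mock

logger = logging.getLogger("llm")


class LLMParseError(Exception):
    """The model's output could not be parsed."""


def answer_with_fallback(fetch: Callable[[], dict], max_attempts: int = 2) -> dict:
    for attempt in range(1, max_attempts + 1):
        try:
            return {"status": "ok", "data": fetch()}
        except LLMParseError as exc:
            # Log every failure so flaky prompts show up in monitoring.
            logger.warning("Attempt %d/%d failed: %s", attempt, max_attempts, exc)
    # Attempts exhausted: ask the user instead of guessing.
    return {"status": "needs_user", "message": "Could you rephrase your request?"}


# The first call fails and the retry succeeds.
flaky = Mock(side_effect=[LLMParseError("bad JSON"), {"answer": 42}])
recovered = answer_with_fallback(flaky)

# Every call fails, so the user gets asked.
broken = Mock(side_effect=LLMParseError("bad JSON"))
gave_up = answer_with_fallback(broken)
```

When `side_effect` is a list, the mock works through it one call at a time and raises any exception instance it meets, which makes "fail once, then succeed" easy to script. In pytest, the built-in `caplog` fixture can also confirm that the warning was logged.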

The time you spend figuring out these tests is the real work of a modern engineer. The AI can help you write the code for the happy path, but you have to be the one who plans for the unhappy path.

A Final Note

It is okay to feel worn down. You aren't just learning Python anymore. You are learning how to build resilient systems on top of shifting sands.

My advice is to do both. Abstract the code to create a clear boundary, and then use mocks to simulate the chaos. The goal isn't to write a program that never fails. The goal is to be the architect who already planned for the moment it does.

And if you want to learn how to build agentic systems hands-on, join the next cohort at pythonagenticai.com.


If this is relevant to your work

I build RAG systems and AI assistants for engineering organizations — designed to handle the unpredictability described in this post.
