
Back in the Trenches, Part 2: Make It Talk

Building the full pipeline — from Twilio call ingestion to LLM causal reasoning — to extract why a changepoint happened, not just when.

· 16 min read
engineering · AI/ML · TypeScript · Twilio

The Sentinel Caught Someone

In Part 1 we built The Karaul – a TypeScript changepoint detection library, from first principles, in a traffic jam, in the Easter rain. We gave the sentinel its post. We taught it to stand watch over a time series and scream when something permanently shifted.

It works. Index 28, year 1899, the Nile river knew before the history books confirmed it.

Now the sentinel has caught someone. Day 60, call volume up 40%, the trespasser is right there in the data. The detection is exact. The capture is mathematically guaranteed. What we do not have is a confession.

That is what Part 2 is about. Pulling the trespasser into the interrogation room and making it talk.

karaul hands you a timestamp and a magnitude. It cannot tell you why. The trespasser knows – somewhere in the transcripts, in the Jira tickets, in the patterns of which topics shifted when – but the trespasser will not volunteer the information. You have to extract it.

That extraction is the entire pipeline we are about to build. Reading transcripts, correlating deployments, reasoning over context, connecting a statistical signal to a human experience. It requires understanding that somewhere behind that 40% increase are doctors on hold, agents under pressure, and a support team absorbing the consequences of a decision made three floors up.

The sentinel caught the trespasser. Now it makes the trespasser talk.

We will build the full pipeline – from Twilio call ingestion to LLM causal reasoning – on top of a fictional medical device company called Glados Therapeutics. Their products help neurologists monitor patients. Their marketing department once launched a promotion requiring proof of dancing samba under a full moon while Virgo was in Sagittarius.

The slogan is “For the good of all of us, except the ones who are dead.” The support team did not write it.

The corpus is synthetic – generated by an LLM, planted with known changepoints so we can verify the pipeline finds the right reasons for the right shifts. We cannot use real patient data. We can use realistic fake data. The difference matters less than you might think when the goal is demonstrating a methodology rather than a specific dataset.

By the end of this article you will have a complete architecture for turning a karaul changepoint into an actionable hypothesis – what changed, why it probably changed, and what to do about it. The pipeline closes. The loop that started in the TRIZ and DT series, continued through Samogon, and crystallized in Part 1 finally completes.

Aperture Science: we do what we must because we can. – GLaDOS, Portal

Section 1 – The Architecture

The Pipeline

Before writing a single line of code, let’s look at the full picture. Every section of this article explains one box in this diagram.

The pipeline has five distinct layers, each with a clear job and a clear boundary.

Ingestion – Twilio delivers call recordings. An Azure Function transcribes and stores each conversation as a structured record: call ID, timestamp, duration, raw transcript. Nothing intelligent happens here. Dumb pipe, reliable storage.

Extraction – For each transcript, an LLM extracts topics, subtopics, and named entities. This is where the intelligence enters the pipeline – and where most of the engineering pain lives. More on that in Sections 3 and 4.

Detection – Daily call volumes flow into karaul. The sentinel watches for permanent shifts. When it finds one, it records the timestamp, the magnitude, and the direction. This is the only layer with a mathematical guarantee. Everything else is probabilistic.

Correlation – When karaul fires, the pipeline retrieves transcripts and Jira tickets from the window around the changepoint, then compares topic distributions before and after the shift – identifying which topics increased, which disappeared, and which stayed flat. The retrieval is exact. The interpretation of what matters is not.

Reasoning – An LLM reads the correlated context and produces a causal hypothesis. Not a decision – a hypothesis. The human validates, acts, and closes the loop. karaul then monitors whether the intervention worked.

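This separation of concerns is deliberate and important. Each layer does exactly one thing: ingestion stores, extraction labels, detection flags the shift, correlation retrieves and compares, reasoning proposes a hypothesis.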

Notice where LLMs appear – extraction and reasoning. Not detection. Not correlation. The algorithm handles what algorithms are good at. The LLM handles what LLMs are good at. No tool stretches outside its competence.
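One way to keep those boundaries honest is to write down what each layer hands to the next. The sketch below is illustrative only: the record shapes and field names (`CallRecord`, `Extraction`, `Changepoint`, `dailyVolumes`) are assumptions made for this example, not karaul's or Twilio's actual schema. It shows the one handoff that is pure arithmetic, turning stored call records into the daily series the detection layer watches.

```typescript
// Hypothetical record shapes; field names are illustrative, not karaul's or Twilio's schema.
interface CallRecord {
  callId: string;
  timestamp: string; // ISO 8601 in UTC, e.g. "2024-03-01T14:05:00Z"
  durationSeconds: number;
  transcript: string;
}

interface Extraction {
  callId: string;
  topics: string[]; // one to three canonical topics
  subtopics: string[];
  entities: string[];
}

interface Changepoint {
  dayIndex: number; // position in the daily series
  magnitude: number; // relative shift, 0.4 means +40%
  direction: "up" | "down";
}

// The handoff from ingestion to detection: one call count per calendar day.
function dailyVolumes(records: CallRecord[]): { day: string; count: number }[] {
  const counts: Record<string, number> = {};
  for (const record of records) {
    const day = record.timestamp.slice(0, 10); // "YYYY-MM-DD"
    counts[day] = (counts[day] ?? 0) + 1;
  }
  return Object.keys(counts)
    .sort()
    .map((day) => ({ day, count: counts[day] ?? 0 }));
}
```

Days with no calls produce no entry here. A production series would probably fill those gaps with zeros before handing the numbers to the sentinel.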

This is the point that got lost in the GA detour in Part 1 and is worth repeating here: the question is never whether a tool can do something. It is whether it is the right tool for that specific job. An LLM could probably detect changepoints if you asked it nicely. It would be wrong in ways you couldn’t predict, for reasons you couldn’t debug, at a cost you couldn’t justify. Given a cost function and a penalty, karaul finds the globally optimal segmentation deterministically every single time. Use the right tool.

Section 2 – The Test Data Problem

Building A World To Break

There is a particular kind of engineering honesty required when you cannot use real data. You have to build something realistic enough that the pipeline produces meaningful results, specific enough that the planted changepoints are findable, and detailed enough that the LLM reasoning produces non-trivial hypotheses rather than obvious ones.

Real customer data is off the table. Not because it doesn’t exist – it does, in abundance – but because publishing it would be wrong, and anonymizing it well enough to publish is harder than building a synthetic corpus from scratch. So we build from scratch.

Meet Glados Therapeutics. Neural monitoring devices for neurologists. Implantable, precise, expensive, and supported by a call center staffed by agents who did not sign up to explain marketing promotions to frustrated doctors.

“For the good of all of us, except the ones who are dead.”

The Glados Therapeutics ecosystem:

Six distinct entities across hardware, software, and finance. Enough complexity to generate realistic support topics. Enough overlap in naming to create genuine NER ambiguity when doctors describe them over the phone to an ASR system that has never heard of any of them.

Two Planted Trespassers

The synthetic corpus covers 180 days. Two known changepoints – planted deliberately so we can verify the pipeline finds the right reasons for the right shifts. These are the trespassers we want our sentinel to catch.

Day 60 – PortalPay billing system migration

A backend migration to a new billing engine. Seemed straightforward in the architecture review. Turned out that the new system calculated invoice cycles differently for doctors on annual contracts versus monthly contracts – a distinction that existed in the old system as a flag nobody had documented, and in the new system as a calculation nobody had tested. Doctors started receiving invoices for amounts that didn’t match their contracts. The phones started ringing.

Day 120 – Neurotoxin Pro promotional launch

The marketing team launched a promotion. The terms were as follows:

Purchase NeuralCore before the end of the quarter, under a full moon, when Virgo is in Sagittarius, while also dancing samba – video proof required, minimum 30 seconds, must include recognizable samba footwork as judged by Glados Therapeutics marketing staff – and provided it is raining at your registered practice address at time of purchase, receive a complimentary CompanionScan calibration kit and six months of PortalPay fees waived. Offer valid in territories where dancing is permitted by local ordinance. Void where prohibited by physics.

The support team found out about this promotion when the first doctor called to ask what samba footwork looked like. It was a Monday.

The Corpus Structure

The two noise Jira tickets – GLADOS-1201 and GLADOS-1215 – are there deliberately. The interrogation must correctly identify that an infrastructure ticket and a dependency update are not the reason doctors are calling about samba footwork. If the LLM gets confused by noise, the pipeline fails. The test data should be honest about what the pipeline is actually being asked to do.
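The numeric half of such a fixture is easy to sketch. The example below generates a 180-day call-volume series with level shifts planted at day 60 and day 120. It is a minimal sketch under stated assumptions: the function name, the baseline of 100 calls per day, the day-120 level, and the uniform noise model are all invented for illustration. Only the 180-day span, the two planted days, and the 40% jump at day 60 come from the text.

```typescript
interface PlantedShift {
  day: number; // first day at the new level
  level: number; // mean calls per day from that day on
}

// Deterministic series: same seed, same fixture. Shifts must be sorted by day.
function syntheticDailyVolumes(
  days: number,
  baseline: number,
  shifts: PlantedShift[],
  noise: number,
  seed: number
): number[] {
  let state = seed >>> 0;
  // Linear congruential generator, returns a value in [0, 1).
  const rand = (): number => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
  const series: number[] = [];
  for (let day = 0; day < days; day++) {
    let level = baseline;
    for (const shift of shifts) {
      if (day >= shift.day) level = shift.level;
    }
    const jitter = (rand() * 2 - 1) * noise; // uniform in [-noise, +noise)
    series.push(Math.round(level + jitter));
  }
  return series;
}

// Baseline 100, +40% at day 60 (from the text), a further assumed rise at day 120.
const volumes = syntheticDailyVolumes(
  180,
  100,
  [{ day: 60, level: 140 }, { day: 120, level: 170 }],
  8,
  42
);
```

Because the shifts are planted, the expected answer is known in advance: a detector that reports anything other than days 60 and 120 on this series is wrong, and so is a reasoning layer that blames the wrong ticket.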

Why Python For Generation

The corpus generator lives in karaul-demo – a companion Python repository. The generation script calls an LLM API, passes a detailed prompt describing Glados Therapeutics, the time window, the planted changepoint context, and the character roster, and collects the output as structured JSON.

Python is the right tool here – the LLM API calls, JSON generation, and fixture management are simpler in Python, and the data generation workflow is fundamentally scripting, not library code. karaul remains TypeScript. Two tools, two jobs, no confusion.

Section 3 – Making Topics Stable

The Problem With Creative Machines

LLMs are, by disposition, creative. You give them a customer support transcript and ask what it’s about and they will tell you – accurately, thoughtfully, differently every single time. One call about a PortalPay invoice discrepancy becomes “billing amount mismatch.” Another identical call becomes “invoice calculation error.” A third becomes “payment portal discrepancy.” All correct. All useless for trend analysis.

Trend analysis requires consistency. If you want to know that PortalPay billing complaints increased 40% after day 60, you need every billing complaint to be called the same thing. The LLM’s creativity – its greatest strength in reasoning tasks – is its greatest liability in classification tasks.

This tension shows up everywhere in production LLM systems. It took several prompt iterations to get right, and getting it right required accepting that you cannot solve a consistency problem with intelligence alone. You solve it with constraints.

Layer 1 – Canonical Topics

The first and most important constraint: a predefined list of topics the model must choose from.

Ten topics. Every call maps to one, two, or three of them – never more, never to something outside the list. The prompt is explicit:

“You MUST use topics from this list. Map conversations to the closest canonical topic. NEVER create new topics.”

This alone solves 80% of the consistency problem. But it introduces a new one – the model needs to know which natural-language phrasing maps to which canonical topic. A doctor saying “my invoice is wrong” should map to payment issue, not invoice request. A doctor saying “I can’t log in” should map to account access, not platform usability.

The solution is explicit mapping rules baked into the prompt:

These rules are only partially generated – they are written by hand with LLM assistance, refined over hundreds of calls, and represent accumulated domain knowledge that no amount of prompt engineering replaces. The model is intelligent. It does not know your business. You do.
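A constraint in the prompt is only half the job; the other half is refusing output that ignores it. The sketch below is a hedged illustration of both halves. The topic list is deliberately partial, containing only the four topics named in this section, and the rule shape, function names, and rendering format are assumptions for this example.

```typescript
// Only the four topics named in the text; the full list is longer.
const CANONICAL_TOPICS = [
  "payment issue",
  "invoice request",
  "account access",
  "platform usability",
] as const;
type CanonicalTopic = (typeof CANONICAL_TOPICS)[number];

// Hand-written mapping rules, taken from the two examples in the text.
const MAPPING_RULES: { phrase: string; topic: CanonicalTopic; not: CanonicalTopic }[] = [
  { phrase: "my invoice is wrong", topic: "payment issue", not: "invoice request" },
  { phrase: "I can't log in", topic: "account access", not: "platform usability" },
];

// Render the list and the rules as a prompt section.
function renderTopicRules(): string {
  const topics = CANONICAL_TOPICS.map((t) => `- ${t}`).join("\n");
  const rules = MAPPING_RULES.map(
    (r) => `- "${r.phrase}" -> ${r.topic} (NOT ${r.not})`
  ).join("\n");
  return (
    "You MUST use topics from this list. Map conversations to the closest " +
    "canonical topic. NEVER create new topics.\n\nTOPICS:\n" +
    topics + "\n\nMAPPING RULES:\n" + rules
  );
}

function isCanonicalTopic(value: string): value is CanonicalTopic {
  return (CANONICAL_TOPICS as readonly string[]).indexOf(value) !== -1;
}

// Reject model output that drifts outside the list or returns anything but 1-3 topics.
function validateTopics(raw: string[]): CanonicalTopic[] {
  const cleaned = raw.map((t) => t.trim().toLowerCase());
  const unknown = cleaned.filter((t) => !isCanonicalTopic(t));
  if (unknown.length > 0) {
    throw new Error(`Non-canonical topics: ${unknown.join(", ")}`);
  }
  if (cleaned.length < 1 || cleaned.length > 3) {
    throw new Error(`Expected 1-3 topics, got ${cleaned.length}`);
  }
  return cleaned.filter(isCanonicalTopic);
}
```

Throwing on a non-canonical topic is one possible policy. Retrying the call or routing it to a review queue would work too; the point is that creative output never reaches the trend analysis silently.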

Layer 2 – The Subtopic Problem

Canonical topics solve classification. They do not solve description.

A subtopic is a short phrase, two to four words, describing the specific issue within the topic. payment issue is the topic. “annual contract invoice recalculated” is the subtopic. Subtopics are where the useful signal lives – the difference between a routine payment question and a systematic billing error shows up in the subtopics, not the topics.

The problem: without constraints, the model produces a unique subtopic for every call. After processing a hundred calls you have a hundred subtopics, each appearing once, each slightly different from the others, none comparable across calls. Useless.

Early attempts at constraining subtopics with instructions like “keep subtopics generic” and “avoid call-specific details” produced marginal improvement. The model understood the instruction in principle and violated it in practice – because its training optimized for accuracy and specificity, not reusability.

The insight that actually worked came from reframing the constraint:

“Subtopics should be generic enough that 3-5 different calls could plausibly share the same subtopic. If a subtopic would only apply to this specific call, it is too specific – generalize it one level up.”

This shifts the model from describing this call to thinking about reusability across calls. Small change in framing, large change in output consistency.

Layer 3 – Cardinality And The Two-Pass Approach

Two additional constraints that matter in practice.

Cardinality: extract 1-3 topics maximum. If you identify more, merge the least distinct ones. This forces the model to prioritize rather than enumerate, which naturally produces cleaner output.

The two-pass approach: even with all constraints, subtopic drift accumulates over time. New edge cases produce new subtopics. The solution is a periodic canonicalization pass – feed a batch of 50-100 recent extractions to the model and ask it to merge near-duplicates and overly specific entries into a stable vocabulary. This runs weekly and feeds back into the system prompt as a soft known-subtopics list.
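The bookkeeping around that canonicalization pass can be sketched without the LLM call itself. In the example below, the function names, the `minCount` threshold, and the shape of the merge map are assumptions: the first helper finds rarely seen subtopics worth sending to the merge pass, and the second applies whatever variant-to-canonical map comes back.

```typescript
// Count how often each subtopic appears in a batch of recent extractions.
function subtopicCounts(subtopics: string[]): Record<string, number> {
  const counts: Record<string, number> = Object.create(null);
  for (const s of subtopics) counts[s] = (counts[s] ?? 0) + 1;
  return counts;
}

// Subtopics seen fewer than minCount times are the likeliest to be too specific.
function mergeCandidates(subtopics: string[], minCount: number): string[] {
  const counts = subtopicCounts(subtopics);
  return Object.keys(counts)
    .filter((s) => (counts[s] ?? 0) < minCount)
    .sort();
}

// Apply the variant -> canonical map that the canonicalization pass returns.
function applyMerges(subtopics: string[], merges: Record<string, string>): string[] {
  return subtopics.map((s) =>
    Object.prototype.hasOwnProperty.call(merges, s) ? (merges[s] as string) : s
  );
}
```

A threshold of three lines up with the "3-5 different calls" heuristic from Layer 2, though the right value depends on batch size.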

The prompt that emerged from this evolution – version 2.3 after several iterations – looks like this in structure:

Version 1.0 had none of this structure. It asked for topics and got creativity. Version 2.3 asks for classification within constraints and gets consistency. The distance between those two versions is the real engineering story of this section – not the final prompt, but the path to it.
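The three layers of constraint can be assembled into a single system prompt. What follows is a minimal sketch of one way to do that, not the version 2.3 prompt itself: the section headers, their order, the `ExtractionPromptParts` shape, and the JSON output contract are assumptions. The quoted rules are the ones given earlier in this section.

```typescript
interface ExtractionPromptParts {
  canonicalTopics: string[];
  mappingRules: string[];
  knownSubtopics: string[]; // soft list from the periodic canonicalization pass
}

function buildExtractionPrompt(parts: ExtractionPromptParts): string {
  const bullets = (items: string[]): string =>
    items.map((item) => `- ${item}`).join("\n");
  const sections = [
    "CANONICAL TOPICS\n" +
      "You MUST use topics from this list. Map conversations to the closest " +
      "canonical topic. NEVER create new topics.\n" +
      bullets(parts.canonicalTopics),
    "MAPPING RULES\n" + bullets(parts.mappingRules),
    "SUBTOPICS\n" +
      "Subtopics should be generic enough that 3-5 different calls could " +
      "plausibly share the same subtopic. If a subtopic would only apply to " +
      "this specific call, it is too specific – generalize it one level up.\n" +
      "Prefer these known subtopics when one fits:\n" +
      bullets(parts.knownSubtopics),
    "CARDINALITY\n" +
      "Extract 1-3 topics maximum. If you identify more, merge the least distinct ones.",
    // Assumed output contract, so the response can be parsed and validated.
    'OUTPUT\nReturn JSON only: {"topics": string[], "subtopics": string[], "entities": string[]}',
  ];
  return sections.join("\n\n");
}
```

Keeping the parts as data rather than one long string means the weekly canonicalization pass can update `knownSubtopics` without anyone editing prose.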

What This Gives You

With stable topics you can now compute, per day, the distribution of calls across all ten canonical topics. When karaul detects a changepoint at day 60, you can compare:

Payment issue went from 8% to 34%. Everything else stayed roughly flat. The signal is clean because the topics are stable. Without the prompt engineering work in this section, that comparison produces noise.
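The before-and-after comparison is plain counting once the labels are stable. A minimal sketch, assuming each labeled call carries a day index and its canonical topics (the `LabeledCall` shape and both function names are hypothetical):

```typescript
interface LabeledCall {
  day: number;
  topics: string[];
}

// Fraction of calls in [fromDay, toDay) tagged with each topic.
function topicShares(
  calls: LabeledCall[],
  fromDay: number,
  toDay: number
): Record<string, number> {
  const inWindow = calls.filter((c) => c.day >= fromDay && c.day < toDay);
  const counts: Record<string, number> = Object.create(null);
  for (const call of inWindow) {
    // Count each topic once per call, even if the model repeated it.
    const unique = call.topics.filter((t, i, all) => all.indexOf(t) === i);
    for (const topic of unique) counts[topic] = (counts[topic] ?? 0) + 1;
  }
  const shares: Record<string, number> = Object.create(null);
  for (const topic of Object.keys(counts)) {
    shares[topic] = (counts[topic] ?? 0) / inWindow.length;
  }
  return shares;
}

interface TopicDelta {
  topic: string;
  before: number;
  after: number;
  delta: number;
}

// Topics ordered by how far their share moved across the changepoint.
function shareDeltas(
  before: Record<string, number>,
  after: Record<string, number>
): TopicDelta[] {
  const topics = Object.keys(before)
    .concat(Object.keys(after))
    .filter((t, i, all) => all.indexOf(t) === i);
  return topics
    .map((topic) => {
      const b = before[topic] ?? 0;
      const a = after[topic] ?? 0;
      return { topic, before: b, after: a, delta: a - b };
    })
    .sort((x, y) => Math.abs(y.delta) - Math.abs(x.delta));
}
```

Since a call can carry up to three topics, the shares are "fraction of calls tagged with this topic" and may sum to more than one. Calling `shareDeltas(topicShares(calls, 30, 60), topicShares(calls, 60, 90))` puts the topic that moved most at the top of the list.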

The trespasser is starting to look familiar.

Section 4 – NER in The Wild

The Transcription Problem Nobody Warns You About

Automatic Speech Recognition is remarkably good at English. It is remarkably bad at proper nouns it has never encountered – brand names, product identifiers, portal names invented by a medical device company’s product team at some point between 2018 and now.

The result is a transcript that is grammatically coherent, semantically accurate, and factually wrong in exactly the places that matter most for your analytics.

A doctor calls about PortalPay. The ASR system, doing its best, produces “portal page.” Another doctor calls about the same thing and gets “portal pay.” A third gets “portal paint.” All three are the same entity. Your topic extraction sees three different things. Your trend analysis is now measuring three separate phenomena that do not exist.

This is not an edge case. It is the default behavior of ASR on domain-specific vocabulary. And it compounds – every product name, every portal, every identifier in your ecosystem becomes a small source of noise that, aggregated across thousands of calls, produces systematic distortion in your analytics.

The Glados Therapeutics Bestiary

Here is what the ASR system does to Glados Therapeutics entities in practice:

Some of these are almost right. “Companion cam” is close enough that a human reader corrects it automatically. “Portal paint” is not close at all but appears consistently enough to become its own phantom entity in a naive extraction system.

The subtler problem is semantic overlap. “The vault” could be DocVault or TreatmentVault depending on context. “The portal” could be DocVault or PortalPay. “The scanner” is almost always CompanionScan but occasionally refers to a third-party device the doctor owns independently. These ambiguities cannot be resolved by string matching alone – they require context.

The Solution – Canonical Entities In The Prompt

The fix is the same architectural instinct as canonical topics – bring the domain knowledge into the prompt explicitly, don’t ask the model to infer it.

The alias lists are not generated – they are discovered. You process a batch of real transcripts, find the variants, add them to the list. Then process another batch, find new variants, add them. The list grows over time as ASR finds new creative interpretations of your product names.

This is operational linguistics. Not glamorous. Completely necessary.
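An alias table like that is easiest to maintain as data and render into the prompt. The sketch below uses only the ASR variants quoted in this section; the table shape, the function names, and the rendered wording are assumptions. It also keeps a deterministic lookup for the unambiguous variants and leaves the context-dependent mentions to the model.

```typescript
// Variants quoted in this article; a real table grows with every transcript batch.
const ENTITY_ALIASES: Record<string, string[]> = {
  PortalPay: ["portal pay", "portal page", "portal paint"],
  CompanionScan: ["companion cam"],
};

// Mentions that depend on context and must not be resolved by string matching.
const AMBIGUOUS_MENTIONS: string[] = ["the vault", "the portal", "the scanner"];

// Render the alias table as a prompt section.
function renderEntitySection(): string {
  const lines = Object.keys(ENTITY_ALIASES).map(
    (name) => `- ${name} (also heard as: ${(ENTITY_ALIASES[name] ?? []).join(", ")})`
  );
  return (
    "CANONICAL ENTITIES\n" +
    lines.join("\n") +
    "\nAMBIGUOUS, resolve from context: " +
    AMBIGUOUS_MENTIONS.join(", ")
  );
}

// Deterministic lookup for unambiguous variants; undefined means "needs context".
function resolveAlias(mention: string): string | undefined {
  const key = mention.trim().toLowerCase();
  if (AMBIGUOUS_MENTIONS.indexOf(key) !== -1) return undefined;
  for (const name of Object.keys(ENTITY_ALIASES)) {
    const aliases = ENTITY_ALIASES[name] ?? [];
    if (name.toLowerCase() === key || aliases.indexOf(key) !== -1) return name;
  }
  return undefined;
}
```

Here `resolveAlias("Portal Paint")` returns `"PortalPay"`, while `resolveAlias("the vault")` returns `undefined` because only the surrounding conversation can say whether the doctor means DocVault or TreatmentVault.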

The DoctorID Problem Deserves Its Own Paragraph

DoctorID is conceptually the simplest entity in the system – a unique identifier assigned to each clinician. In practice it generates a disproportionate number of support calls and a disproportionate amount of extraction noise.

Doctors confuse it with their login credentials. Agents ask for it and get a practice license number instead. The ASR system, encountering a sequence of alphanumeric characters spoken aloud digit by digit, produces something that bears only a probabilistic relationship to the original. And because DoctorID sounds like it could be a portal or a system, the extraction model occasionally classifies it as one.

The fix required an explicit instruction in the prompt: “DoctorID is an identifier – NOT a system, NOT a portal, NOT a product. It is a number. Treat it accordingly.”

This level of specificity – writing prompt instructions for a single entity’s basic identity – is the reality of production NLP. The model is not stupid. It does not know your business. Every instruction like this represents a real call that went wrong before you wrote it.

What Good NER Gives You

With entities normalized, your topic extractions become comparable across calls. “PortalPay billing issue” and “portal page billing issue” collapse into the same record. The entity distribution per day becomes a clean signal:

Combined with the topic distribution from Section 3, you now have two independent signals pointing at the same place. The trespasser is wearing a name tag now.

One more layer to go – the Jira tickets, the causal reasoning, and the confession.

Section 5 – The Interrogation

What The Algorithm Gave You

karaul fired at day 60. Call volume shifted permanently upward, 40% above baseline, and hasn’t returned. The detection is exact – mathematically guaranteed, deterministic, reproducible. What it cannot tell you is why.

You now have two additional signals from the work in Sections 3 and 4. Topic distribution shifted – payment issue went from 8% to 34% of all calls. Entity mentions shifted – PortalPay appears in 41% of calls versus 12% before. Two independent signals, same direction, same timestamp.

Something happened to PortalPay around day 60 that generated a sustained wave of billing confusion. The algorithm caught the trespasser. Now we make it confess.

The Context Window

When karaul detects a changepoint on day T, the pipeline opens a context window – by default three days before and three days after the changepoint. From that window it retrieves two things.

Transcripts – a sample of calls from the window, weighted toward the post-changepoint period where the shift is most visible. The sample size is bounded by what the LLM can reason over coherently – too few and the signal is lost in individual variation, too many and the model loses focus. In practice, twelve to twenty calls per window produce stable hypotheses without context overflow. This is a parameter worth tuning to your specific LLM and use case.

Jira tickets – everything merged or deployed in the window. For Glados Therapeutics this means:

The noise ticket – GLADOS-1201 – is there deliberately. The interrogation must correctly identify that a load balancer configuration update is not the reason doctors are confused about their invoices. If it can’t do that, the pipeline isn’t useful.
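The retrieval step is the exact part of correlation, so it can be written down precisely. The following is a sketch under assumptions: the `Transcript` and `Ticket` shapes, the default sample size of 16, the 75% post-changepoint weighting, and the evenly spaced sampling are all choices made for this example. Only the three-day radius and the twelve-to-twenty range come from the text.

```typescript
interface Transcript { callId: string; day: number; text: string }
interface Ticket { id: string; deployedDay: number; summary: string }
interface ContextWindow { transcripts: Transcript[]; tickets: Ticket[] }

// Evenly spaced pick of up to n items; deterministic, so reruns see the same sample.
function pickEvenly<T>(items: T[], n: number): T[] {
  if (n <= 0) return [];
  if (items.length <= n) return items.slice();
  const step = items.length / n;
  const out: T[] = [];
  for (let i = 0; i < n; i++) out.push(items[Math.floor(i * step)] as T);
  return out;
}

function buildContextWindow(
  changepointDay: number,
  transcripts: Transcript[],
  tickets: Ticket[],
  radiusDays: number = 3, // from the text: three days either side
  sampleSize: number = 16, // assumed, inside the twelve-to-twenty range
  postShare: number = 0.75 // assumed weighting toward the post-changepoint side
): ContextWindow {
  const lo = changepointDay - radiusDays;
  const hi = changepointDay + radiusDays;
  const inWindow = transcripts.filter((t) => t.day >= lo && t.day <= hi);
  const before = inWindow.filter((t) => t.day < changepointDay);
  const after = inWindow.filter((t) => t.day >= changepointDay);
  const postCount = Math.min(after.length, Math.round(sampleSize * postShare));
  const preCount = Math.min(before.length, sampleSize - postCount);
  return {
    transcripts: pickEvenly(before, preCount).concat(pickEvenly(after, postCount)),
    tickets: tickets.filter((k) => k.deployedDay >= lo && k.deployedDay <= hi),
  };
}
```

With a changepoint at day 60 the window covers days 57 through 63, so a ticket deployed on day 58 is retrieved and anything deployed outside that range never reaches the reasoning layer. Widening `radiusDays` trades more candidate causes for more noise.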

The Prompt

The causal reasoning prompt is the most carefully constructed piece of the pipeline. It gives the LLM everything it needs and nothing it doesn’t:

The last instruction – “be specific about which evidence you are discounting and why” – is not decorative. It forces the model to reason about the noise tickets explicitly rather than silently ignoring them. An answer that says “the load balancer update is unlikely to cause billing confusion because it has no user-facing impact” is more trustworthy than one that simply doesn’t mention it.
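A prompt with those properties can be assembled from the correlated evidence. What follows is a hedged sketch rather than the prompt used in this pipeline: the `Evidence` shape, the section headers, and the instruction wording other than the quoted final sentence are assumptions.

```typescript
interface Shift { label: string; beforePct: number; afterPct: number }
interface Evidence {
  changepointDay: number;
  volumeChangePct: number; // signed, e.g. 40 for +40%
  topicShifts: Shift[];
  entityShifts: Shift[];
  tickets: { id: string; deployedDay: number; summary: string }[];
  transcripts: string[];
}

function buildReasoningPrompt(e: Evidence): string {
  const shiftLines = (items: Shift[]): string =>
    items.map((s) => `- ${s.label}: ${s.beforePct}% -> ${s.afterPct}%`).join("\n");
  const sign = e.volumeChangePct >= 0 ? "+" : "";
  const ticketLines = e.tickets
    .map((t) => `- ${t.id} (deployed day ${t.deployedDay}): ${t.summary}`)
    .join("\n");
  const transcriptLines = e.transcripts
    .map((t, i) => `[${i + 1}] ${t}`)
    .join("\n");
  return [
    `A permanent shift in daily call volume was detected on day ${e.changepointDay} ` +
      `(${sign}${e.volumeChangePct}% against baseline).`,
    "TOPIC SHIFTS\n" + shiftLines(e.topicShifts),
    "ENTITY SHIFTS\n" + shiftLines(e.entityShifts),
    "TICKETS DEPLOYED IN WINDOW\n" + ticketLines,
    "SAMPLE TRANSCRIPTS\n" + transcriptLines,
    "Produce a causal hypothesis, not a decision. Name the most likely cause, " +
      "cite the evidence for it, and recommend one next investigation step. " +
      "Be specific about which evidence you are discounting and why.",
  ].join("\n\n");
}
```

Everything in the prompt is retrieved or computed upstream; the model is asked to interpret, never to look anything up.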

The Confession

The LLM produces something like this:

“The most likely cause is the PortalPay billing engine migration deployed on day 58 (GLADOS-1142). The evidence is consistent: payment issue topic increased from 8% to 34%, PortalPay entity mentions tripled, and the shift is sustained rather than spike-like – consistent with a systematic calculation error rather than a one-time incident. Sample transcripts show doctors reporting invoice amounts that don’t match their contracts, specifically on annual billing cycles.

The infrastructure load balancer update (GLADOS-1201) is discounted – it deployed 13 days after the changepoint and has no user-facing billing impact.

Recommended investigation: review invoice calculation logic for annual contract cycles in the new billing engine. The migration notes indicate unified calculation logic was applied to both annual and monthly contracts – this may have introduced a regression for annual cycle customers.”

This is not a decision. It is a hypothesis with reasoning and a recommended next step. A human reads it, checks the billing engine code, confirms the annual cycle regression, files a bug, deploys a fix. karaul monitors whether call volume returns to baseline. The loop closes.

Why This Is Legitimate AI Use

This question deserves a direct answer because the answer is not obvious.

The pipeline uses an LLM in two places – topic extraction and causal reasoning. Both are legitimate. Neither is using AI because it is fashionable.

Topic extraction: an LLM reads unstructured text and maps it to structured categories. A human could do this – it takes three minutes per call. At ten thousand calls per month that is five hundred hours of human time. The LLM does it in seconds with comparable accuracy, given good constraints. That is genuine value.

Causal reasoning: an LLM reads twenty transcripts and ten Jira tickets and forms a hypothesis about what changed. A human analyst could do this – it takes two hours of careful reading. The LLM does it in ten seconds. The output is a starting point for human investigation, not a replacement for it. That is also genuine value.

What the LLM does not do: detect changepoints. Find the optimal segmentation. Guarantee mathematical correctness. Those jobs belong to karaul – because they have exact solutions and the LLM has opinions.

“Just use AI for it” is the incantation of the lazy engineer. There is a meaningful difference between using AI deliberately and using it reflexively. This pipeline does the first.

Section 6 – Closing The Loop

The Full Picture

Let’s step back and look at what we actually built.

It started in Part 1 with a deprecated Microsoft service and a traffic jam on Marginal Pinheiros. It ended with a TypeScript library called karaul that finds permanent shifts in time series with mathematical precision. Twenty-five lines of code. Decades of mathematics.

Part 2 took that timestamp and turned it into something actionable. Five layers, each doing exactly one job, each handing off cleanly to the next. The sentinel catches the trespasser. The pipeline makes it talk.

But the pipeline is not the point. The pipeline is the implementation of something older.

The Methodology

Look at the full loop:

This is not a technology stack. It is a methodology. And its components did not appear in this article for the first time.

The TRIZ and Design Thinking series established the innovation framework – how to identify contradictions and resolve them systematically. The Samogon article argued that engineering intuition is the thing AI cannot replace – the ability to understand a system deeply enough to know which tool belongs where. Part 1 built the detection layer from first principles, deliberately, with a full understanding of the mathematics. Part 2 connected the signal to the human experience behind it.

Four articles. One loop. Each piece was written deliberately, building toward this. They fit together because they were always asking the same question from different angles: how do you understand what is actually happening in a complex system, and what do you do about it?

The DT Connection – No Research Sprint Required

The classic Design Thinking empathy phase asks you to go talk to your customers. Schedule interviews. Synthesize findings. It takes weeks.

The Glados Therapeutics support operation talks to its customers every day. Every call is a conversation. Every transcript is an empathy interview that happened whether anyone planned it or not. The question was never whether the data existed – it always did. The question was whether anyone was listening systematically.

karaul provides the trigger – something changed, pay attention now. The topic analytics provide the focus – payment issue, PortalPay, these specific calls. The transcripts provide the empathy – not synthesized from interviews but captured directly from the moment a doctor picked up the phone and said “my invoice is wrong and I don’t understand why.”

The research sprint didn’t need to be scheduled. It had already happened.

The TRIZ Connection – karaul Itself

There is a contradiction at the heart of changepoint detection: you want maximum sensitivity to real shifts and maximum tolerance to noise. These requirements conflict. Crank sensitivity up and you detect every minor fluctuation as a changepoint. Crank tolerance up and you miss the shifts that actually matter.

Naive approaches compromise – they pick a middle ground that is too sensitive for some signals and too tolerant for others. The compromise is the symptom that no real solution has been found.

PELT resolves this contradiction without compromise. The penalty term is not a sensitivity dial – it is a self-justification mechanism. A changepoint is added if and only if it reduces total cost by more than the penalty cost of adding it. Real shifts justify themselves mathematically. Noise does not. The algorithm doesn’t compromise between sensitivity and tolerance – it asks each candidate shift to prove its own worth.

This is TRIZ in pure form. The contradiction was apparent: sensitivity vs. tolerance, two engineering requirements in direct conflict. The resolution was not compromise but a different framing of the question – let the data itself decide which shifts are worth detecting. The penalty mechanism is the embodied resource that resolves the contradiction by changing what is being asked.

Every robust algorithm has a moment like this – the moment where a contradiction was resolved structurally rather than dialed away. PELT has its penalty. BIC has its log(n). The right design always reveals an underlying contradiction that someone, at some point, refused to compromise on.
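The self-justification rule fits in a few lines. The sketch below is not PELT: it has no dynamic programming and no pruning, and it checks one candidate split rather than searching for the best segmentation. It only illustrates the acceptance test, assuming a squared-error cost around each segment's mean (a common choice for mean-shift detection, not confirmed as karaul's cost function) and an arbitrary penalty value.

```typescript
// Cost of a segment: sum of squared deviations from its own mean.
function segmentCost(values: number[]): number {
  if (values.length === 0) return 0;
  let sum = 0;
  for (const v of values) sum += v;
  const mean = sum / values.length;
  let cost = 0;
  for (const v of values) cost += (v - mean) * (v - mean);
  return cost;
}

// A split at index k is kept only if it lowers total cost by more than the penalty.
function splitIsJustified(series: number[], k: number, penalty: number): boolean {
  const whole = segmentCost(series);
  const split = segmentCost(series.slice(0, k)) + segmentCost(series.slice(k));
  return whole - split > penalty;
}

const stepSeries = [100, 101, 99, 100, 140, 141, 139, 140];
const flatSeries = [100, 101, 99, 100, 101, 99, 100, 100];
const realShift = splitIsJustified(stepSeries, 4, 50); // true: cost falls from 3204 to 4
const noiseOnly = splitIsJustified(flatSeries, 4, 50); // false: cost stays at 4
```

On the step series the split pays for itself 64 times over. On the flat series it buys nothing, so the penalty rejects it. The penalty value of 50 is illustrative and scales with the units of the data.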

The Validation Loop

karaul keeps watching. If the fix worked, a new changepoint will appear – call volume returning to baseline is itself a permanent shift in the opposite direction. If no return shift appears within a reasonable window, the intervention didn’t take, and the investigation continues. The methodology is not a one-time intervention but a continuous improvement cycle, with a mathematical signal at the entry and the exit.
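The exit check can be expressed against the detector's output. A minimal sketch, assuming the detector reports each shift with a day and a direction; the `DetectedShift` shape, the verdict names, and the `windowDays` parameter (the "reasonable window") are hypothetical:

```typescript
interface DetectedShift {
  day: number;
  direction: "up" | "down";
}
type Verdict = "recovered" | "pending" | "not-recovered";

// After a fix ships on fixDay, look for a downward shift within windowDays.
function assessIntervention(
  shifts: DetectedShift[],
  fixDay: number,
  today: number,
  windowDays: number
): Verdict {
  const deadline = fixDay + windowDays;
  const recovered = shifts.some(
    (s) => s.direction === "down" && s.day >= fixDay && s.day <= deadline
  );
  if (recovered) return "recovered";
  // No return shift yet: keep waiting until the window closes, then reopen the case.
  return today <= deadline ? "pending" : "not-recovered";
}
```

A stricter version would also compare the post-fix level against the original baseline, since a downward shift that stops halfway is a partial fix, not a recovery.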

What This Is Not

This is not an AI system. It is a system that uses AI in two specific places for specific reasons.

This is not a replacement for human judgment. Every hypothesis produced by the reasoning layer requires a human to validate, investigate, and act. The pipeline produces starting points, not conclusions.

This is not a solution to every problem. karaul detects mean shifts in univariate time series. It does not detect seasonal patterns, cyclic behavior, gradual trends, or multivariate correlations. The topic analytics are only as good as the canonical topic list, which requires ongoing maintenance. The causal reasoning is only as good as the context you provide – garbage in, confident hypothesis out.

Naming what a system does not do is as important as naming what it does. This is something the “just use AI for it” school of thought consistently skips.

The Broader Vision

karaul monitors call volume today. The architecture is indifferent to the signal.

Tomorrow it monitors treatment outcome rates – do patient complaints increase after a firmware update to NeuralCore? It monitors portal adoption – does DocVault engagement drop after a UI change? It monitors support handle times – does agent efficiency shift after a new training program? Any metric that can be represented as a time series can be watched by the same sentinel, with the same mathematical guarantee, feeding into the same reasoning pipeline.

The signal layer is generic. The domain knowledge – the canonical topics, the entity aliases, the Jira integration, the causal reasoning prompt – is specific. That separation is what makes the architecture extensible without making it generic to the point of uselessness.

Build the generic thing well. Build the specific thing carefully. Know which is which.

Closing

We started with a deprecated Microsoft service.

We ended with a complete operational methodology – detection, correlation, empathy, resolution, validation – built from first principles, documented honestly, open source, and running on a fictional medical device company whose marketing department requires samba proof of purchase under astronomically favorable conditions.

The sentinel is standing watch. When it catches the next trespasser, we will know exactly how to make it talk.

Aperture Science: we do what we must because we can. – GLaDOS, Portal