
Why Rules-Based NLP Fails for Medical Coding

Dr. Anica
February 27, 2026
18 min read


Rules-based natural language processing cannot reliably perform modern medical coding. These systems — built on pattern matching, regular expressions, keyword extraction, and dictionary lookups — were designed for a simpler era of code assignment. They plateau at 70–80% accuracy on complex HCC coding tasks, cannot reason about clinical context, and break whenever CMS updates its models. Agentic AI, which uses multiple specialized agents that reason about clinical documentation the way human coders do, is the architecture that actually solves the problem. Here is why the distinction matters and what organizations should understand before investing in coding automation.

What Is Rules-Based NLP?

Rules-based NLP refers to natural language processing systems that rely on manually authored rules to extract information from text. In the context of medical coding, these rules typically take several forms:

  • Keyword matching — The system scans clinical text for specific words or phrases (e.g., "diabetes," "heart failure," "COPD") and maps them directly to ICD-10-CM codes.
  • Regular expressions — Pattern-matching logic searches for structured strings like medication dosages, lab values, or diagnostic phrases that follow predictable formats.
  • Dictionary lookups — A curated terminology database (often based on SNOMED CT, UMLS, or proprietary vocabularies) maps clinical terms to standardized codes.
  • Decision trees — If/then logic chains attempt to classify conditions based on the presence or absence of specific keywords in a document.
  • Template matching — The system looks for documentation that conforms to expected structural patterns, such as assessment and plan sections in progress notes.
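
To make the architecture concrete, here is a minimal, hypothetical sketch of the first two layers — keyword matching and a regular expression. The code mappings, dictionary, and note text are illustrative, not any vendor's implementation:

```python
import re

# Illustrative keyword-to-code dictionary (an assumption for this sketch,
# not a real terminology database).
KEYWORD_TO_CODE = {
    "heart failure": "I50.9",   # unspecified heart failure
    "copd": "J44.9",            # COPD, unspecified
    "diabetes": "E11.9",        # type 2 diabetes without complications
}

# Regex for a dosage pattern such as "metformin 500 mg"
DOSAGE_RE = re.compile(r"\b([a-z]+)\s+(\d+)\s*mg\b", re.IGNORECASE)

def extract_codes(note: str) -> list[str]:
    """Return a code for every keyword hit -- with no context check."""
    text = note.lower()
    return [code for term, code in KEYWORD_TO_CODE.items() if term in text]

note = "Patient denies heart failure. Continue metformin 500 mg for diabetes."
print(extract_codes(note))      # ['I50.9', 'E11.9'] -- the negated CHF is still coded
print(DOSAGE_RE.findall(note))  # [('metformin', '500')]
```

Even this toy version exhibits the core failure the rest of this article documents: the negated "denies heart failure" produces a code anyway, because substring presence is all the system can see.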

These components are often layered together into what vendors market as "NLP engines" or "clinical NLP pipelines." The underlying architecture, however, remains fundamentally the same: the system does not understand what it reads. It matches patterns against a predefined rule set and produces outputs based on those matches.

This distinction — between pattern matching and genuine clinical reasoning — is the root cause of every failure mode described in this article.

Why Rules-Based NLP Worked Initially

It is worth acknowledging that rules-based NLP was not always inadequate. When these systems were first deployed in healthcare in the early 2010s, the coding landscape was considerably simpler:

  • ICD-9 had roughly 14,000 diagnosis codes. The mapping between clinical terms and codes was relatively direct, and many conditions could be captured with a single keyword match.
  • Risk adjustment models were less granular. Earlier CMS-HCC models did not require the severity tiering and specificity distinctions that V28 demands.
  • Clinical documentation was more structured. Many early NLP deployments targeted radiology reports, pathology results, and other highly formatted document types where template matching was effective.
  • Code sets were relatively stable. CMS updates were incremental, and the mapping tables between diagnoses and HCC categories did not change dramatically from year to year.

In this environment, a well-tuned rules-based system could achieve acceptable accuracy for straightforward code assignment tasks — identifying that a patient had diabetes, flagging a mention of COPD, or extracting a cancer diagnosis from a pathology report.

The problem is that medical coding did not stay simple. The transition from ICD-9 to ICD-10 expanded the code set to over 72,000 diagnosis codes. CMS-HCC models introduced severity hierarchies and narrowed the set of accepted codes. Documentation styles diversified as electronic health records proliferated. And the financial stakes of accurate risk adjustment coding escalated to the point where a 5% accuracy gap can translate to millions of dollars in revenue variance for a mid-sized Medicare Advantage plan.

Rules-based NLP was built for a world that no longer exists.

Where Rules-Based NLP Fails at HCC Coding

The failures of rules-based NLP in HCC coding are not edge cases. They are systematic limitations that affect the majority of complex coding scenarios. The following failure modes are well-documented across the industry.

Failure 1: Clinical Context and Negation Handling

One of the most fundamental limitations of keyword-based systems is their inability to distinguish between the presence and absence of a condition. Consider these three sentences from a clinical note:

  • "Patient has a history of congestive heart failure."
  • "Patient denies symptoms of congestive heart failure."
  • "Patient was evaluated for congestive heart failure, which was ruled out."

A rules-based system scanning for "congestive heart failure" will flag all three as positive mentions. Only the first represents an active, codable diagnosis. The second is a negation. The third is a rule-out — a condition that was considered but determined not to be present.

Sophisticated rules-based systems attempt to handle negation by looking for modifier words like "no," "denies," "ruled out," or "negative for" within a defined window around the target term. This approach, implemented in NegEx and its variants, catches simple negations but fails on complex sentence structures:

  • "There is no evidence that the patient does not have heart failure." (a double negation that a modifier-window rule misreads as a negated finding)
  • "Heart failure was considered in the differential but the echocardiogram findings are more consistent with valvular disease." (implicit negation through clinical reasoning)
  • "The patient's mother had congestive heart failure." (family history, not the patient's condition)

Each of these requires understanding sentence semantics, not just keyword proximity. Rules-based systems have no mechanism for this level of comprehension.
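
A stripped-down, assumption-level sketch of the window-based check described above (loosely modeled on NegEx; the trigger list and window size are illustrative) shows exactly where it breaks:

```python
# Hypothetical simplification of a NegEx-style rule: a finding counts as
# negated if a trigger phrase appears within a fixed word window before it.
NEGATION_TRIGGERS = ["no", "not", "denies", "ruled out", "negative for"]
WINDOW = 6  # words scanned before the target term (illustrative)

def is_negated(sentence: str, term: str) -> bool:
    words = sentence.lower().split()
    term_words = term.lower().split()
    for i in range(len(words) - len(term_words) + 1):
        if words[i:i + len(term_words)] == term_words:
            window = " ".join(words[max(0, i - WINDOW):i])
            return any(trig in window for trig in NEGATION_TRIGGERS)
    return False

# Simple negation: handled correctly.
print(is_negated("Patient denies symptoms of congestive heart failure",
                 "heart failure"))  # True

# Double negation: wrongly treated as negated.
print(is_negated("There is no evidence that the patient does not have heart failure",
                 "heart failure"))  # True -- but the finding is not actually negated

# Family history: wrongly treated as the patient's own positive finding.
print(is_negated("The patient's mother had congestive heart failure",
                 "heart failure"))  # False -- coded as a positive mention
```

The window check has no concept of who the subject is or how the negations compose; it only measures keyword proximity.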

Failure 2: Severity Tiering Under V28

The CMS-HCC V28 model introduced extensive severity tiering that makes keyword matching fundamentally insufficient. The most illustrative example is heart failure coding.

Under V24, heart failure mapped to a single HCC category regardless of type. Under V28, coders must distinguish between:

| Heart Failure Type | ICD-10-CM Codes | V28 HCC | Relative RAF Payment |
|---|---|---|---|
| HFrEF (reduced ejection fraction) | I50.2x | Higher-weighted HCC | Highest payment |
| HFpEF (preserved ejection fraction) | I50.3x | Mid-weighted HCC | Moderate payment |
| Unspecified heart failure | I50.9 | Lower-weighted HCC | Lowest payment |

A rules-based system that matches "heart failure" cannot determine whether the documentation supports HFrEF, HFpEF, or unspecified classification. That determination requires reading the echocardiogram results, understanding ejection fraction values, and connecting them to the diagnosis — a reasoning task, not a pattern-matching task.

This pattern repeats across V28 categories. Dementia must be classified as with or without behavioral disturbance. Chronic kidney disease must be staged. Diabetes must be specified with its complications. In each case, the code assignment depends on clinical details scattered throughout the documentation, not on the presence of a single keyword.

Failure 3: MEAT Criteria Evidence Extraction

Every HCC code submitted for risk adjustment must be supported by clinical evidence that the provider Monitored, Evaluated, Assessed/Addressed, or Treated (MEAT) the condition during the encounter. This is not optional — it is the standard enforced through CMS RADV audits, and unsupported codes are subject to recoupment.

MEAT evidence is typically embedded in unstructured clinical narrative. A provider might document monitoring through a lab result reference, evaluation through a physical exam finding, assessment through a clinical impression statement, or treatment through a medication adjustment. The evidence can appear anywhere in the note — in the history of present illness, the review of systems, the assessment and plan, or even in a separate addendum.

Rules-based systems cannot reliably extract MEAT evidence because:

  • Evidence is expressed in natural language, not structured data. A provider might write "we will continue current statin therapy" as evidence of treatment, or "A1c trending down from 8.2 to 7.4" as evidence of monitoring. These are linguistically diverse expressions that cannot be captured by a finite rule set.
  • Evidence must be linked to specific conditions. It is not enough to find a mention of metformin in the medication list. The system must connect that medication to a specific diagnosis (diabetes) and determine that it constitutes treatment evidence for that condition.
  • Evidence adequacy is contextual. A passing mention of a condition in the problem list without any supporting narrative does not meet MEAT criteria, even though the condition is documented. Rules-based systems cannot distinguish between a documented condition and a clinically addressed condition.

Organizations that rely on rules-based NLP for risk adjustment coding consistently find that their MEAT compliance rates lag behind manual coding, precisely because the system captures codes without verifying that evidence exists to support them.
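
One way to see the MEAT requirement is as a data-model constraint: a code is only submittable if evidence is linked to it. The sketch below is a hypothetical model; the class names, codes, and excerpts are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    category: str   # "Monitored" | "Evaluated" | "Assessed" | "Treated"
    excerpt: str    # verbatim text from the clinical note

@dataclass
class CodedCondition:
    icd10: str
    evidence: list[Evidence] = field(default_factory=list)

    def meets_meat(self) -> bool:
        # Audit-ready output requires at least one linked evidence item.
        return len(self.evidence) > 0

diabetes = CodedCondition("E11.22")  # diabetes with CKD (illustrative)
diabetes.evidence.append(Evidence("Monitored", "A1c trending down from 8.2 to 7.4"))
diabetes.evidence.append(Evidence("Treated", "metformin increased to 1000 mg BID"))

problem_list_only = CodedCondition("I50.9")  # mentioned, never addressed

print(diabetes.meets_meat())           # True
print(problem_list_only.meets_meat())  # False -- documented but not codable
```

The hard part is not the data model but populating it: linking a free-text excerpt to the right condition is exactly the reasoning task rules-based systems cannot perform.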

Failure 4: Hierarchy Reasoning

The "H" in HCC stands for "hierarchical," and the hierarchy logic is central to correct coding. When a patient has multiple related conditions at different severity levels, only the highest-severity HCC in each hierarchy should be submitted. Lower-severity HCCs within the same hierarchy are superseded and should not generate a separate RAF score.

For example, if a patient has both diabetes with chronic kidney disease (a higher-severity HCC) and diabetes without complications (a lower-severity HCC), only the higher-severity code should be submitted for risk adjustment. Submitting both would be incorrect and could trigger compliance flags during audit.

Rules-based systems handle hierarchies through lookup tables: after codes are assigned, a post-processing step checks each code against a hierarchy matrix and removes lower-ranked entries. This works for straightforward cases but fails when:

  • Multiple hierarchies interact. A single patient may have conditions that span several overlapping hierarchies, and the correct resolution depends on which combination of codes produces the most clinically accurate representation — not just the highest-paying one.
  • Hierarchy rules change between model versions. V28 restructured several hierarchies and introduced new ones. A rules-based system's hierarchy tables must be manually updated whenever CMS publishes model changes, and any delay or error in updating creates systematic miscoding.
  • Clinical nuance affects hierarchy placement. Some conditions may appear to fall within a hierarchy based on keyword matching but are actually clinically distinct. A rules-based system cannot make this determination.
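
The post-processing step described above can be sketched in a few lines; the hierarchy table here is illustrative (labels rather than actual V28 category numbers):

```python
# Hypothetical hierarchy table: each list is ordered highest severity first.
HIERARCHIES = {
    "diabetes": ["HCC_diabetes_with_complications", "HCC_diabetes_uncomplicated"],
}

def resolve_hierarchies(assigned: set[str]) -> set[str]:
    """Drop every HCC superseded by a higher-severity HCC in its hierarchy."""
    final = set(assigned)
    for ranked in HIERARCHIES.values():
        seen_top = False
        for hcc in ranked:
            if hcc in final:
                if seen_top:
                    final.discard(hcc)  # superseded by a higher-severity HCC
                seen_top = True
    return final

codes = {"HCC_diabetes_with_complications", "HCC_diabetes_uncomplicated"}
print(resolve_hierarchies(codes))  # only the higher-severity HCC remains
```

This works only as well as its table: every CMS model change means manually re-authoring `HIERARCHIES`, and the logic has no way to weigh the clinical nuance described above.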

Failure 5: Multi-Condition Interactions and Comorbidity Patterns

Real patients rarely have a single, isolated diagnosis. The average Medicare Advantage beneficiary has 4–6 chronic conditions, and many have significantly more. Accurate HCC coding requires understanding how these conditions interact:

  • Diabetes with complications — A patient with diabetes, peripheral neuropathy, and chronic kidney disease may have multiple interacting codes, and the correct assignment depends on whether the complications are documented as causally related to the diabetes.
  • Manifestation codes — Some ICD-10-CM codes require dual coding with an underlying etiology code. Rules-based systems frequently miss these pairings because they evaluate each condition independently.
  • Comorbidity patterns — Certain condition combinations are clinically expected (e.g., heart failure and atrial fibrillation), while others are unusual and may indicate a documentation error. Rules-based systems have no mechanism for evaluating clinical plausibility.

These multi-condition scenarios are the norm in risk adjustment populations, not the exception. A system that evaluates each keyword in isolation will systematically mishandle the interactions that determine correct code assignment.

Failure 6: Documentation Style Variation

Every provider writes differently. A cardiologist and a primary care physician will document the same heart failure diagnosis in fundamentally different ways. Hospitalists write terse, abbreviation-heavy notes. Academic physicians write lengthy, literature-referencing assessments. Nurse practitioners and physician assistants have their own documentation patterns.

Rules-based systems are brittle in the face of this variation because:

  • Abbreviations and shorthand are inconsistent. "CHF," "HF," "congestive heart failure," "systolic dysfunction," and "heart failure with reduced EF" can all refer to the same condition. A rules-based system must enumerate every variant — and any variant not included in the rule set will be missed.
  • Note structures vary across EHR systems. Epic, Cerner, Athenahealth, and other platforms generate notes with different section headers, formatting conventions, and data layouts. Template-matching rules built for one EHR often fail when applied to notes from another.
  • Copy-forward and cloned notes introduce historical information that may not reflect the current encounter. Rules-based systems cannot distinguish between freshly documented clinical findings and text carried forward from a prior visit.

The Maintenance Nightmare

Even if a rules-based system achieves acceptable accuracy at the time of deployment, maintaining that accuracy is an ongoing operational burden. The rules must be updated whenever:

  • CMS updates the HCC model. The V24-to-V28 transition required wholesale changes to code mapping tables, hierarchy logic, and severity classifications. Any rules-based system required months of manual rule rewriting to accommodate the new model.
  • ICD-10-CM codes are added or revised. CMS publishes annual ICD-10-CM updates effective October 1 each year, adding new codes, revising existing codes, and retiring obsolete ones. Each change must be reflected in the rule set.
  • Documentation standards evolve. As clinical documentation improvement programs mature, providers change how they document conditions. Rules tuned to historical documentation patterns degrade as provider behavior shifts.
  • Payer-specific requirements diverge. Different Medicare Advantage plans and payers may have varying documentation and coding requirements. Maintaining separate rule sets for different payer contexts multiplies the maintenance burden.

Industry reports consistently estimate that maintaining a rules-based NLP system for medical coding requires 2–4 full-time equivalent staff dedicated solely to rule authoring, testing, and updating. This ongoing cost is rarely included in vendor pricing discussions but represents a significant portion of the total cost of ownership.

The result is a system that is always playing catch-up — accurate only until the next regulatory change, and then degraded until the rules are manually updated.

The Accuracy Ceiling

The cumulative effect of these failure modes is a hard accuracy ceiling. Industry benchmarks and peer-reviewed literature on NLP-based medical coding consistently report that rules-based systems achieve:

  • 85–95% accuracy on simple, single-code extraction tasks (e.g., identifying a primary diagnosis from a structured radiology report)
  • 70–80% accuracy on complex HCC coding tasks involving severity tiering, MEAT evidence extraction, and multi-condition reasoning
  • 60–70% accuracy on full risk adjustment workflows that require hierarchy resolution, dual-model processing, and audit-ready evidence documentation

These accuracy ranges have remained largely unchanged over the past decade despite significant investment in rule refinement. The ceiling exists not because the rules are poorly written but because the underlying architecture is incapable of the reasoning required for higher accuracy.

For risk adjustment organizations, the gap between 75% and 93% accuracy is not a minor performance difference. Applied across a population of 50,000 Medicare Advantage members, a 15–18 percentage point accuracy gap can represent $10 million to $25 million in annual RAF score variance — revenue that is either missed through undercoding or exposed to recoupment through overcoding.
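
The arithmetic behind that range can be made explicit. The per-member dollar values below are illustrative assumptions chosen to reproduce the article's figures, not CMS rates:

```python
# Assumption: each percentage point of coding accuracy is worth roughly
# $15-$28 in annual risk-adjusted revenue per member (illustrative only).
members = 50_000
gap_low_points, gap_high_points = 15, 18        # accuracy gap in points
dollars_low, dollars_high = 15, 28              # $ per member per point

low = members * gap_low_points * dollars_low
high = members * gap_high_points * dollars_high
print(f"${low:,} to ${high:,}")  # $11,250,000 to $25,200,000
```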

What Agentic AI Does Differently

Agentic AI represents a fundamentally different architecture for medical coding. Instead of a single rules engine processing text against a static rule set, agentic systems deploy multiple specialized AI agents, each responsible for a distinct aspect of the coding workflow. These agents reason about clinical documentation, communicate with each other, and produce validated outputs with full evidence trails.

ANICA, Jivica's AI medical coding engine, implements this architecture with 9 specialized AI agents and 24 MCP (Model Context Protocol) tools. The key architectural differences from rules-based NLP include:

Contextual Clinical Reasoning

Instead of matching keywords, ANICA's agents read and interpret clinical narratives in context. The system understands that "no evidence of heart failure" is a negation, that "heart failure with EF of 35%" indicates HFrEF, and that "mother had CHF" is a family history reference. This contextual understanding is not programmed through rules — it emerges from the agents' ability to reason about language semantics.

Multi-Agent Specialization

Different agents handle different aspects of the coding task:

  • Clinical NLP agents parse and interpret documentation
  • Code assignment agents map clinical findings to ICD-10-CM, HCC, and E/M codes
  • Evidence extraction agents identify and link MEAT criteria evidence to each condition
  • Validation agents check hierarchy logic, code specificity, and clinical plausibility
  • Audit readiness agents score each code assignment for RADV defensibility

This specialization allows each agent to focus on what it does best, rather than forcing a single rules engine to handle every aspect of the coding workflow.
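
A hypothetical pipeline skeleton (not ANICA's actual implementation; every function name here is invented) illustrates the pattern of specialized stages plus a validation stage that reviews upstream output:

```python
from typing import Callable

# Stub "agents": each stage has one responsibility and passes a shared
# state dict downstream. Real agents would reason over the full note.
def nlp_agent(note: str) -> dict:
    return {"note": note, "findings": ["heart failure, EF 35%"]}

def coding_agent(state: dict) -> dict:
    state["codes"] = ["I50.2x"]  # HFrEF, per the documented EF
    return state

def validation_agent(state: dict) -> dict:
    # Verify the severity-tiered code is supported by the findings.
    supported = any("EF 35" in f for f in state["findings"])
    state["validated"] = supported and "I50.2x" in state["codes"]
    return state

PIPELINE: list[Callable[[dict], dict]] = [coding_agent, validation_agent]

def run(note: str) -> dict:
    state = nlp_agent(note)
    for agent in PIPELINE:
        state = agent(state)
    return state

result = run("Echo shows EF 35%. Assessment: heart failure ...")
print(result["validated"])  # True
```

The structural point is the separation of concerns: the stage that assigns a code is not the stage that decides whether the code is defensible, which is the self-checking property a single rules engine cannot provide.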

Self-Validation and Error Correction

Agentic systems include validation agents that review the work of other agents before producing final output. If a code assignment agent selects a severity-tiered HCC, the validation agent verifies that the documentation actually supports that severity level. If the evidence extraction agent identifies MEAT criteria for a condition, the validation agent confirms that the evidence is linked to the correct diagnosis.

This self-checking architecture is impossible in a rules-based system, where the same rules that produce the initial output are the only mechanism for validating it.

Adaptive Learning Without Rule Rewriting

When CMS updates models, payers change requirements, or documentation patterns shift, agentic AI systems adapt without requiring manual rule rewriting. The agents' reasoning capabilities allow them to process new code sets, updated hierarchies, and evolving documentation standards based on their understanding of clinical coding principles — not on static lookup tables.

Rules-Based NLP vs. Agentic AI: Comparison

| Capability | Rules-Based NLP | Agentic AI (ANICA) |
|---|---|---|
| Negation handling | Keyword proximity (NegEx) — misses complex negations | Contextual semantic understanding |
| Severity tiering (V28) | Cannot distinguish HFrEF vs. HFpEF without explicit rules per case | Reasons from clinical data (EF values, documentation context) |
| MEAT evidence extraction | Limited to predefined phrase matching | Extracts and links evidence from unstructured narrative |
| Hierarchy resolution | Static lookup tables — requires manual updates | Dynamic reasoning across interacting hierarchies |
| Multi-condition reasoning | Evaluates conditions independently | Understands comorbidity patterns and causal relationships |
| Documentation style tolerance | Brittle — fails on non-standard phrasing | Handles varied provider styles, abbreviations, and EHR formats |
| V24/V28 dual-model processing | Requires separate rule sets per model | Native dual-model support with unified processing |
| Maintenance burden | 2–4 FTE for ongoing rule updates | Model updates without manual rule rewriting |
| Accuracy on complex HCC coding | 70–80% ceiling | 92.6% across ICD-10, HCC, and E/M |
| Audit readiness | Codes without evidence trails | Full evidence trail for every code assignment |
| Time per chart | Varies — often requires manual review | 5–15 seconds with automated validation |

Frequently Asked Questions

Can rules-based NLP be improved to match agentic AI accuracy?

No — not without fundamentally changing the architecture, at which point it would no longer be rules-based. The accuracy ceiling of rules-based systems is a structural limitation, not a tuning problem. You can add more rules, refine existing ones, and expand dictionaries, but the system still cannot reason about clinical context, handle novel documentation patterns, or self-validate its outputs. The gap between pattern matching and clinical reasoning cannot be closed by writing more patterns.

Is agentic AI more expensive than rules-based NLP?

The licensing cost for an agentic AI platform may be higher than a basic rules-based NLP tool, but the total cost of ownership is typically lower. Rules-based systems require 2–4 FTE for ongoing maintenance, generate higher error rates that require manual rework, and miss revenue through undercoding. When these operational costs are factored in, agentic AI consistently delivers stronger ROI. Organizations processing more than 5,000 charts per month typically see positive ROI within the first quarter of deployment.

How does agentic AI handle CMS model changes like the V24-to-V28 transition?

Agentic AI systems process model changes through updated configuration rather than rule rewriting. Because the agents reason about clinical coding principles — not just static mappings — they can adapt to new code sets, updated hierarchies, and revised severity tiers without the months-long re-engineering cycle that rules-based systems require. ANICA supported dual V24/V28 processing from the start of the transition period, allowing organizations to model the financial impact of V28 before full implementation.

Do we need to replace our existing NLP system entirely?

Not necessarily. Some organizations deploy agentic AI alongside existing systems during a transition period, using the agentic platform to validate and enhance the outputs of their rules-based tools. However, most organizations that evaluate both approaches side-by-side conclude that running parallel systems adds complexity without proportional benefit, and they transition fully to the agentic platform within 6–12 months.

Conclusion

Rules-based NLP was a reasonable first attempt at automating medical coding, and it served the industry adequately when coding was simpler, models were less granular, and documentation was more structured. That era is over. The CMS-HCC V28 model, with its severity-tiered hierarchies, reduced code acceptance, and heightened specificity requirements, demands a coding automation architecture that can reason about clinical context — not just match patterns against a dictionary.

Organizations still relying on rules-based NLP for HCC coding face a compounding problem: accuracy gaps that translate to revenue loss, compliance risks that increase with every RADV audit cycle, and maintenance costs that grow with every CMS update. The longer the transition to a capable architecture is delayed, the wider these gaps become.

ANICA was purpose-built for this complexity — with 9 specialized AI agents, 24 MCP tools, dual V24/V28 model processing, and full evidence trails for every code assignment. It achieves 92.6% accuracy across ICD-10, HCC, and E/M coding categories, and it processes charts in 5–15 seconds with automated RADV audit readiness scoring.

Schedule a demo to see how agentic AI handles the coding scenarios where rules-based NLP fails.


References:

  • CMS 2024 Rate Announcement and Final Call Letter
  • CMS Risk Adjustment Data Validation (RADV)
  • Journal of AHIMA — NLP in Clinical Coding
  • AAPC Risk Adjustment Coding Resources
  • Savova, G.K., et al. "Mayo Clinical Text Analysis and Knowledge Extraction System (cTAKES)." Journal of the American Medical Informatics Association, 2010.
  • Chapman, W.W., et al. "A Simple Algorithm for Identifying Negated Findings and Diseases in Discharge Summaries." Journal of Biomedical Informatics, 2001.