The paper’s practical point is that enterprise AI should stop asking language models to be the whole system. Use models to translate messy inputs into structured forms, then let explicit knowledge stores, rules, workflows, and verifiers do the dependable work.

Source note: Kuldeep Singh, Anson Bastos, and Isaiah Onando Mulang’. “Position: Avoid Overstretching LLMs for every Enterprise Task.” arXiv:2605.09365, submitted May 10, 2026. https://arxiv.org/abs/2605.09365

Why This Paper Matters

A lot of enterprise AI is being built as if the model should sit at the center of every workflow.

The model reads the request. The model retrieves context. The model decides which tool to call. The model interprets the result. The model applies business rules. The model produces the answer. When the system is too expensive or slow, the next move is often to distill the large model into a smaller model and hope the behavior survives.

This paper argues that this default is structurally wrong for many enterprise tasks.

The reason is not that language models are useless in companies. The paper’s argument is sharper: enterprise work is often deterministic, rule-bound, knowledge-dependent, repetitive, auditable, and cost-sensitive. Those are exactly the properties that make it risky to put a probabilistic model in charge of the whole runtime.

The paper’s position is that enterprise AI should invert the control structure. The language model should become an interface layer. The durable parts of the work should live in systems that can be inspected, tested, governed, updated, and replayed.

That matters because the failure mode is familiar. Companies get impressive pilots, then struggle to turn them into reliable production systems. The paper says part of that gap comes from treating model capability as the main bottleneck when the real issue is system architecture.

The Idea in Plain English

The paper separates three jobs that are often collapsed into one model call.

The first job is language interface work: reading a document, extracting fields, classifying a request, mapping a user sentence to a schema, or routing a ticket to a known workflow.

The second job is knowledge storage: keeping policies, customer data, product rules, contract terms, compliance requirements, operational history, and domain facts current.

The third job is computation: applying rules, checking constraints, querying systems, running deterministic procedures, verifying outputs, and deciding whether a bounded action is allowed.

The paper argues that models are best used for the first job. They are much weaker as the long-term home for the second job and the dependable executor of the third.

In the proposed architecture, small specialized models handle narrow interface tasks such as extraction, routing, aggregation of tool results, state updates, and verification. A controller connects those models to external knowledge graphs, relational data, APIs, rules, and symbolic procedures. Frontier LLMs may still be useful, but mostly offline: they can help synthesize schemas, draft rules, create ontologies, or generate test cases before the runtime path is stabilized.

The online system is then cheaper and more governable. The model is still present, but it is no longer pretending to be the database, the workflow engine, the policy engine, and the auditor.

What the Researchers Tested

This is a position paper, not a benchmark paper. Its main contribution is a formal and architectural argument for modular enterprise AI.

The paper builds a theoretical case around enterprise tasks whose answers depend on information outside the user input. In a company, that external information might be a policy, a customer record, a pricing rule, a runbook, a product catalog, a graph of business entities, or a regulatory constraint.

The authors compare two broad designs.

One design is a monolithic model or a distilled smaller model that tries to internalize the task in its parameters. It may use retrieval and tools, but the model remains the primary policy and reasoning engine.

The other design is a hybrid system. A small language model maps unstructured input into a structured representation. External knowledge sources provide the relevant facts. A symbolic or deterministic procedure applies the rules and computation.
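
The hybrid design can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `CONTRACTS` store, the `RefundRequest` schema, and the regex-based `extract` stub (standing in for a small extraction model) are all assumptions made for the example.

```python
import re
from dataclasses import dataclass

# Hypothetical external knowledge store; in practice a database or knowledge graph.
CONTRACTS = {
    "C-100": {"refund_window_days": 60},
    "C-200": {"refund_window_days": 14},
}


@dataclass(frozen=True)
class RefundRequest:
    customer_id: str
    days_since_purchase: int


def extract(ticket_text: str) -> RefundRequest:
    """Stand-in for a small extraction model: free text in, schema out.

    A regex is used only so the sketch runs on its own.
    """
    customer = re.search(r"\bC-\d+\b", ticket_text)
    days = re.search(r"(\d+)\s+days", ticket_text)
    if customer is None or days is None:
        raise ValueError("required fields not found in ticket")
    return RefundRequest(customer.group(0), int(days.group(1)))


def refund_allowed(request: RefundRequest) -> bool:
    """Deterministic rule applied to facts read from the external store."""
    contract = CONTRACTS[request.customer_id]
    return request.days_since_purchase <= contract["refund_window_days"]


def handle(ticket_text: str) -> bool:
    return refund_allowed(extract(ticket_text))
```

The same ticket text yields different answers for `C-100` and `C-200` because the deciding fact lives in the store, not in the text. Changing a refund window means editing `CONTRACTS`, not retraining anything.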

The paper then uses information-theoretic and computational arguments to show why the second design can be strictly better for certain classes of enterprise work. If the answer depends on external knowledge that is not present in the input, a model that lacks that knowledge faces an irreducible error floor. If the task requires a repeatable algorithmic procedure, a symbolic reasoner can execute the procedure directly while the small model only has to extract the right structured inputs.
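
The error-floor idea can be illustrated with the standard Bayes-error bound. This is a textbook sketch rather than the paper's formal statement, and the symbols X (user input), K (external knowledge), Y (correct answer), g, and f are notation assumed here for illustration.

```latex
% X: user input, K: external knowledge, Y = g(X, K): correct answer.
% For any predictor f that sees only X:
\[
  \Pr\bigl[f(X) \neq Y\bigr] \;\ge\; 1 - \mathbb{E}_{X}\Bigl[\max_{y} \Pr\bigl[Y = y \mid X\bigr]\Bigr].
\]
% Toy case: Y = K, with K independent of X and Pr[K = 1] = 0.3. Then
\[
  \Pr\bigl[f(X) \neq Y\bigr] \;\ge\; 1 - \max(0.3,\, 0.7) = 0.3 .
\]
```

In the toy case, no amount of model capacity moves an input-only predictor below 30% error, because the deciding bit is simply absent from X. A system that reads K from a store and evaluates g directly has no such floor; its error is limited by extraction and lookup quality instead.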

The paper also spends substantial space on distillation. Its key claim is that response-only distillation loses information about the teacher model’s internal reasoning. A smaller student may learn typical outputs without inheriting the algorithmic structure needed for out-of-distribution or multi-step enterprise work.

What They Found

External knowledge changes the reliability problem

The paper’s first major point is that many enterprise answers are not functions of the prompt alone.

If a support ticket depends on the current contract, the current product configuration, the customer’s region, and a policy that changed last week, the correct answer is not sitting inside the text of the ticket. It depends on external knowledge.

A model-only system can memorize some facts, infer some patterns, and use retrieval as a patch. But the paper argues that enterprise memory should not live primarily in model parameters. It should live in explicit systems: knowledge bases, databases, policy repositories, document indices, and business object graphs.

That shift changes the model’s role. Instead of asking the model to “know the business,” the system asks it to identify which structured thing is being discussed and which workflow should handle it.

Deterministic work should not be hidden inside probabilities

The second major point is about computation.

Many enterprise tasks are not creative reasoning problems. They are procedures. Parse the invoice. Match it to a purchase order. Check policy constraints. Route the exception. Validate the form. Apply a pricing rule. Follow the incident runbook. Determine whether a proposed action is allowed.

The paper argues that these procedures should be externalized into formal methods, symbolic rules, deterministic algorithms, constrained solvers, verifiers, or workflow engines. That makes the behavior easier to test and govern.

This does not eliminate the need for models. It narrows the model’s job. The model can extract the invoice fields or classify the incident. The rules engine or workflow then decides what happens next.
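
A rough sketch of that split for the invoice case follows. The `Invoice` and `PurchaseOrder` fields, the 2% `TOLERANCE`, and the route names are hypothetical; the point is only that once a model has produced the structured invoice, the matching and routing are ordinary, testable code.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List


@dataclass(frozen=True)
class Invoice:  # fields a small extraction model would be asked to fill
    po_number: str
    vendor_id: str
    total: Decimal


@dataclass(frozen=True)
class PurchaseOrder:  # record read from the system of record
    po_number: str
    vendor_id: str
    approved_total: Decimal


TOLERANCE = Decimal("0.02")  # hypothetical policy: up to 2% over the approved total


def check_invoice(invoice: Invoice, po: PurchaseOrder) -> List[str]:
    """Return the list of violated rules; an empty list means a clean match."""
    violations = []
    if invoice.po_number != po.po_number:
        violations.append("po_number_mismatch")
    if invoice.vendor_id != po.vendor_id:
        violations.append("vendor_mismatch")
    if invoice.total > po.approved_total * (1 + TOLERANCE):
        violations.append("total_exceeds_tolerance")
    return violations


def route(violations: List[str]) -> str:
    """Deterministic routing: clean invoices pass, anything else is an exception."""
    return "auto_approve" if not violations else "exception_queue"
```

Because `check_invoice` returns named violations rather than a yes or no, the exception queue receives a reason that a reviewer or auditor can read.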

Distillation does not solve the architecture problem

The paper is skeptical of the idea that the main answer is to distill large model behavior into cheaper small models.

Distillation is useful when the task is pattern-like, low entropy, and bounded. The paper does not reject it outright. The objection is to using response-only distillation as a substitute for explicit knowledge and computation.

If a teacher model solves a task through intermediate reasoning, but the student only sees final answers, the student receives a compressed signal. It can match surface behavior while missing the underlying procedure. The paper calls this a projection bottleneck.

For enterprise workflows, that matters because brittle shortcuts can look fine on common cases and fail exactly where governance, edge cases, and auditability matter most.
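
A deliberately crude toy can make the failure pattern concrete. It is not the paper's analysis: the shipping-fee rule is invented, and the nearest-neighbor `student_fee` is only a stand-in for a student trained on final answers. Real students generalize better than this, but the direction of the failure is the same.

```python
def teacher_fee(weight_kg: int) -> int:
    """Explicit procedure: flat fee up to 10 kg, then a per-kg surcharge."""
    base = 5
    if weight_kg <= 10:
        return base
    return base + 2 * (weight_kg - 10)


# Response-only training data: inputs and final answers, never the rule.
# Typical orders in this toy weigh 1 to 10 kg.
train = {w: teacher_fee(w) for w in range(1, 11)}


def student_fee(weight_kg: int) -> int:
    """Stand-in student: answers with the label of the nearest seen input."""
    nearest = min(train, key=lambda w: abs(w - weight_kg))
    return train[nearest]


# Agreement on the common range, disagreement on an unusual order.
in_distribution_match = all(student_fee(w) == teacher_fee(w) for w in range(1, 11))
heavy_order = (teacher_fee(25), student_fee(25))  # (35, 5)
```

The student is perfect on every typical order and wrong by 30 on the heavy one. Keeping `teacher_fee` as an explicit rule avoids the problem, since there is nothing left to distill.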

Specialized small models fit decomposable workflows

The paper’s most practical design idea is specialization.

Instead of one large model acting as the universal controller, the architecture uses smaller models for bounded roles. One model extracts entities. Another routes requests. Another aggregates tool results. Another checks state. Another verifies whether an output matches a schema or policy.

Each model has a narrower action space and a clearer evaluation target. The rest of the workflow state lives outside those models, in tools and memory. The paper argues that this is more information-efficient than forcing a single shared model to internalize every stage.
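
One way to picture a narrower action space is a router whose output is clamped to a closed label set. The `Route` labels and the keyword-based `keyword_router` (standing in for a small routing model) are assumptions for illustration only.

```python
from enum import Enum
from typing import Callable


class Route(Enum):
    BILLING = "billing"
    OUTAGE = "outage"
    ACCOUNT = "account"
    UNKNOWN = "unknown"


def keyword_router(text: str) -> str:
    """Stand-in for a small routing model; returns a raw label string."""
    lowered = text.lower()
    if "invoice" in lowered or "charge" in lowered:
        return "billing"
    if "down" in lowered or "outage" in lowered:
        return "outage"
    return "something_else"  # a label outside the allowed set


def bounded_route(model: Callable[[str], str], text: str) -> Route:
    """Clamp whatever the model returns to the closed set of routes."""
    raw = model(text)
    try:
        return Route(raw)
    except ValueError:
        # Anything outside the label set becomes an explicit UNKNOWN.
        return Route.UNKNOWN
```

Downstream code only ever sees one of four values, so the router can be swapped, retrained, or evaluated without touching the workflow that consumes its output.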

For builders, this is the useful version of “small models can work.” Small models work better when the system reduces the job to something small enough to be evaluated.

Why It Happens

The paper’s underlying claim is that enterprise tasks are not shaped like open-ended chat.

Enterprise work is full of explicit objects: customers, contracts, SKUs, invoices, employees, incidents, tickets, permissions, assets, policies, service levels, controls, and workflows. Those objects change over time. Their relationships matter. The rules attached to them matter. The history of actions taken against them matters.

A language model can talk about those things. It should not be the only place those things live.

This is why the software analogy in the paper is useful. Modern systems separated concerns because monoliths became hard to maintain, scale, and govern. The authors argue that enterprise AI needs a similar separation. Language understanding, knowledge storage, computation, verification, and action control should not all be buried in the same parameter vector.

The same logic applies to agents. A frontier LLM can be an impressive planner, but if it remains the live orchestrator for every repeated enterprise task, the system pays repeatedly in cost, latency, context growth, and fuzzy failure boundaries. For stable work, the architecture should move authority into explicit components.

What This Means for Builders

Builders should start by classifying which parts of a workflow actually require model judgment.

If the task is extraction, classification, schema matching, normalization, or routing, a specialized model may be enough. If the task is policy enforcement, state transition, numerical calculation, permission checking, or workflow execution, it probably belongs in an explicit system.

A useful implementation pattern is:

  1. Use a strong model offline to help design the schema, rule set, ontology, workflow, or evaluator.
  2. Validate those artifacts with humans and tests.
  3. Deploy smaller models for bounded extraction and routing.
  4. Keep knowledge in external stores that can be updated without retraining.
  5. Run deterministic computation and verification outside the model.
  6. Escalate to humans or larger models at high-risk or ambiguous boundaries.
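
Steps 4 through 6 of the pattern above can be sketched as a small decision function. The `Extraction` type, the `policy_store` contents, and the `CONFIDENCE_FLOOR` value are hypothetical; a real deployment would tune the threshold per workflow and read policy from a governed store.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Extraction:
    amount: Optional[float]
    confidence: float  # hypothetical score reported by the small model


# Step 4: knowledge lives outside the model and can change without retraining.
policy_store = {"auto_approve_limit": 500.0}

CONFIDENCE_FLOOR = 0.8  # hypothetical threshold


def decide(extraction: Extraction) -> str:
    # Step 6: ambiguous extractions are escalated instead of guessed.
    if extraction.amount is None or extraction.confidence < CONFIDENCE_FLOOR:
        return "escalate_to_human"
    # Step 5: deterministic rule, evaluated outside the model.
    if extraction.amount <= policy_store["auto_approve_limit"]:
        return "auto_approve"
    return "manager_review"
```

Raising `policy_store["auto_approve_limit"]` changes the outcome for the same extraction immediately, which is the practical meaning of keeping knowledge updatable without retraining.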

This is less magical than an all-purpose agent, but it is closer to how production systems survive.

The engineering challenge is integration. Modular systems are not free. They require schemas, ownership, test data, API contracts, observability, permissions, versioning, and review paths. The paper’s point is that this cost is not accidental overhead. It is the price of turning AI from a pilot into a dependable component of day-to-day operations.

What This Means for Buyers and Operators

For buyers, the paper gives a better procurement question than “which model do you use?”

Ask where the model sits in the control path. Ask what knowledge lives outside the model. Ask whether policies and rules can be inspected. Ask how the system handles drift when business data changes. Ask whether a failed answer can be traced to extraction, retrieval, rule execution, or model judgment.

If the vendor’s answer is that the model handles everything, that may be fine for low-risk assistance. It is a weaker answer for repeatable operational work.

Operators should also look for component-level evaluation. A monolithic agent can fail in ways that are hard to diagnose. A modular system can test extraction quality, retrieval coverage, rule correctness, verifier behavior, and end-to-end outcomes separately.

This matters for governance. Enterprise AI does not only need better answers. It needs controllable failure surfaces. A system that can say “the extractor failed to identify the customer entity” or “the policy rule blocked the action” is easier to operate than one that returns a fluent but opaque answer.
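
A controllable failure surface can be as simple as recording which stage stopped the request. The sketch below is illustrative: the `Trace` structure, the `CUSTOMERS` data, and the `BLOCKED_REGIONS` policy are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Trace:
    steps: List[str] = field(default_factory=list)  # stages that completed
    failed_stage: Optional[str] = None
    reason: Optional[str] = None


CUSTOMERS = {"acme": {"region": "EU"}, "initech": {"region": "US"}}
BLOCKED_REGIONS = {"EU"}  # hypothetical policy for illustration


def run(extracted_customer: Optional[str]) -> Trace:
    """Run the stages in order and stop at the first one that fails."""
    trace = Trace()
    if extracted_customer is None:
        trace.failed_stage = "extraction"
        trace.reason = "customer entity not identified"
        return trace
    trace.steps.append("extraction")

    record = CUSTOMERS.get(extracted_customer)
    if record is None:
        trace.failed_stage = "retrieval"
        trace.reason = f"no record for {extracted_customer!r}"
        return trace
    trace.steps.append("retrieval")

    if record["region"] in BLOCKED_REGIONS:
        trace.failed_stage = "rule_execution"
        trace.reason = "policy rule blocked the action"
        return trace
    trace.steps.append("rule_execution")
    return trace
```

Each of the three failures yields a different `failed_stage`, so an operator can tell a weak extractor from a missing record from a policy block without reading model output.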

What to Watch Next

The first thing to watch is whether enterprise AI teams move from model-centered architectures to workflow-centered architectures.

The second thing to watch is the role of frontier models as offline design assistants. The paper’s architecture still uses large models, but it moves them away from the high-volume runtime path. They help create schemas, rules, and canonical examples, then smaller and more explicit systems run the day-to-day workflow.

The third thing to watch is evaluation. Specialized small language models (SLMs) are only useful if their narrow jobs can be measured. Extraction, routing, state updates, and verification need their own test sets and failure budgets.
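
Per-component test sets and failure budgets might look like the following sketch. The stub components, the test cases, and the budget numbers are all hypothetical; only the shape of the check is the point.

```python
from typing import Callable, Dict, List, Tuple

Cases = List[Tuple[str, str]]  # (input, expected output) pairs


def error_rate(component: Callable[[str], str], cases: Cases) -> float:
    """Fraction of test cases where the component's output is wrong."""
    failures = sum(1 for text, expected in cases if component(text) != expected)
    return failures / len(cases)


def over_budget(results: Dict[str, float], budgets: Dict[str, float]) -> List[str]:
    """Names of components whose measured error exceeds their budget."""
    return sorted(name for name, rate in results.items() if rate > budgets[name])


# Stub components standing in for small models.
def normalize_currency(text: str) -> str:
    return text.strip().upper()


def route_ticket(text: str) -> str:
    return "billing" if "invoice" in text else "general"


results = {
    "extraction": error_rate(
        normalize_currency,
        [(" usd ", "USD"), ("eur", "EUR"), ("Gbp", "GBP"), ("yen", "JPY")],
    ),
    "routing": error_rate(
        route_ticket,
        [("invoice overdue", "billing"), ("hello", "general")],
    ),
}
budgets = {"extraction": 0.05, "routing": 0.10}  # hypothetical failure budgets
flagged = over_budget(results, budgets)
```

Here the extractor misses one case in four and is flagged, while the router stays within budget. A release gate can block on `flagged` being non-empty without anyone reading an end-to-end transcript.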

Finally, watch the boundary cases. Some enterprise work really does require open-ended synthesis, negotiation, strategy, or judgment. The paper’s architecture is strongest where the workflow is repeatable and the rules are explicit. The field still needs good patterns for moving between deterministic workflow execution and judgment-heavy model use.

Limitations and Caveats

This is a position paper. It provides formal arguments and an architectural roadmap, but it does not present a broad empirical comparison across enterprise deployments.

The formal claims also depend on assumptions. The hybrid system needs a good extractor, reliable external knowledge, and a sound reasoner or workflow. If those components are weak, the modular architecture can still fail.

The paper also risks making the boundary sound cleaner than it is in practice. Many enterprise tasks mix extraction, judgment, policy, ambiguity, and exception handling. Builders still have to decide which parts deserve deterministic control and which parts deserve model flexibility.

There is also organizational cost. Externalizing knowledge and computation means someone has to maintain the knowledge base, rules, schemas, workflows, and evaluation harness. That work is often exactly what companies underinvest in.

The broader lesson still holds: do not solve a systems problem by hiding more of the system inside a model.

Source

Singh, Kuldeep, Bastos, Anson, and Mulang’, Isaiah Onando. (2026). Position: Avoid Overstretching LLMs for every Enterprise Task. arXiv preprint arXiv:2605.09365. Available at: https://arxiv.org/abs/2605.09365