Developers debate large language model coding in complex production codebases

Hacker News users shared detailed experiences using large language models inside messy, established codebases, from fully agent-driven workflows to strict bans on generated code. The discussion highlights productivity gains, testing strategies, and persistent limits around context, integration testing, and trust.

The original poster describes a startup that has deeply integrated large language models into everyday development across a monorepo that includes scheduled Python data workflows, two Next.js apps, Temporal workers and a Node worker. Each engineer receives Cursor Pro with Bugbot, Gemini Pro, OpenAI Pro, and optionally Claude Pro, and the poster estimates that large language models are worth about 1.5 excellent junior/mid-level engineers per engineer, which they argue easily justifies paying for multiple models. Heavy use of pre-commit hooks, type checkers, tests and auto-formatting lets models focus on producing types and tests, while coding standards and conventions are encoded in .cursor/rules and AGENT.md-style files to steer agents away from raw SQL and toward specific schema files.
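
The post does not name which formatter, type checker or test runner the team uses, so the following is only a sketch: a small Python gate script of the kind a pre-commit hook could invoke, with ruff, mypy and pytest as assumed stand-ins for the auto-formatting, type-checking and test steps described above.

```python
"""Local quality gate an agent-authored change must pass before commit.

Sketch only: the post mentions pre-commit hooks, type checkers, tests and
auto-formatting but not specific tools; ruff, mypy and pytest are assumed
stand-ins here.
"""
import subprocess
import sys

CHECKS = [
    ["ruff", "format", "--check", "."],  # auto-formatting
    ["ruff", "check", "."],              # lint rules / conventions
    ["mypy", "."],                       # type checking
    ["pytest", "-q"],                    # tests
]


def main() -> int:
    for cmd in CHECKS:
        print("→", " ".join(cmd))
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # Fail fast, exactly as a pre-commit hook would block the commit.
            return result.returncode
    return 0


if __name__ == "__main__":
    sys.exit(main())
```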

The team leans on GitHub Enterprise primarily for its Copilot issue assignment feature: their rule is that anyone who opens an issue must assign it to Copilot, which then opens a pull request. Roughly 25% of these “open issue → Copilot PR” results are mergeable as-is, rising to about 50% after a few review comments. Overall, for roughly ?k/month, the poster reiterates that they are getting the equivalent of 1.5 additional junior/mid-level engineers per engineer, with these “large language model engineers” consistently writing tests, following standards, producing good commit messages and working 24/7. However, they also report pain points: Copilot’s model choice cannot be controlled for issues or reviews, agents in worktrees are fragile, and verifying changes often requires spinning up Temporal, two Next.js apps, several Python workers, a Node worker, and a browser, which makes integration testing slow and difficult to automate.
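
The integration-testing complaint suggests one partial mitigation: an automated smoke check that confirms the whole local stack is up before any manual verification in a browser. The sketch below uses hypothetical ports and health endpoints; the post only names the services, not how each is exposed locally.

```python
"""Smoke check for the locally running stack described in the post.

Sketch only: URLs and ports are assumptions; the post lists Temporal,
two Next.js apps, Python workers, a Node worker and a browser, but not
how each service is exposed.
"""
import sys
import urllib.error
import urllib.request

SERVICES = {
    "next-app-a": "http://localhost:3000/api/health",  # hypothetical
    "next-app-b": "http://localhost:3001/api/health",  # hypothetical
    "node-worker": "http://localhost:4000/healthz",    # hypothetical
    "temporal-ui": "http://localhost:8233/",           # Temporal dev-server UI default
}


def is_up(url: str) -> bool:
    """Return True if the URL answers with a 2xx response."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return 200 <= resp.status < 300
    except OSError:  # covers URLError/HTTPError, timeouts, refused connections
        return False


if __name__ == "__main__":
    results = {name: is_up(url) for name, url in SERVICES.items()}
    for name, ok in results.items():
        print(f"{'OK  ' if ok else 'FAIL'} {name}: {SERVICES[name]}")
    sys.exit(0 if all(results.values()) else 1)
```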

Other commenters report a wide spectrum of experience and caution. Some developers find large language models highly effective for boilerplate, unit and integration test generation, one-off scripts, and refactoring in smaller or well-structured areas, treating tools such as Claude Code, Copilot or Cursor as a junior pair programmer and insisting on small, incremental changes with plans and tests first. Several teams describe elaborate guardrails: dockerized dev containers without production credentials, CONTRIBUTING.md or Claude.md files encoding rules, custom linting and test pipelines, feature or roadmap markdown files that act as persistent memory, and staged, stacked pull requests with multiple automated review agents. Others emphasize that context window limits, legacy code complexity and long-range architectural concerns still defeat current models, arguing that they cannot replace a human mental model of a large, messy codebase and that they tend to duplicate code, miss subtle concurrency bugs or fail on giant legacy files. At the far end, one open source maintainer states that their project has banned all large language model generated code after repeated experiments produced plausible but fundamentally wrong suggestions, reflecting ongoing skepticism about relying on these tools in critical, long-lived systems.
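
One guardrail mentioned above, dev containers that exclude production credentials, can be reinforced with a startup check. The commenters describe the practice but no implementation, so the following is a hypothetical sketch: it refuses to launch an agent sandbox if any environment variable name matches a pattern that looks production-related.

```python
"""Hypothetical guardrail: abort if the environment looks like it
contains production credentials.

The commenters only say their dockerized dev containers exclude
production credentials; the patterns below are illustrative guesses.
"""
import os
import re
import sys

# Variable-name patterns that suggest production secrets (assumed examples).
SUSPECT_PATTERNS = [
    re.compile(r"PROD", re.IGNORECASE),
    re.compile(r"^AWS_SECRET_ACCESS_KEY$"),
    re.compile(r"SERVICE_ACCOUNT|API_KEY|_TOKEN$"),
]


def suspicious_env_vars() -> list[str]:
    """Return environment variable names that match any suspect pattern."""
    return [
        name
        for name in os.environ
        if any(pattern.search(name) for pattern in SUSPECT_PATTERNS)
    ]


if __name__ == "__main__":
    found = suspicious_env_vars()
    if found:
        print("Refusing to start agent sandbox; suspicious variables:",
              ", ".join(sorted(found)))
        sys.exit(1)
    print("Environment looks clean; starting agent sandbox.")
```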

Impact Score: 55
