Evaluating 2,800 Newsletter Articles in 2½ Minutes for 55¢
In late 2025, the release of GPT-5, Claude 4.5, and Gemini 3 class models allowed AI coding agents (e.g., Cursor, Claude Code) to move beyond autocomplete and code generation to handling complex changes autonomously. I wanted to find articles by developers reporting on their own experiences with these tools. As a subscriber to the TLDR AI and TLDR Dev newsletters, I knew such stories had been appearing regularly in recent months. Instead of combing the archives for anecdotes, I built an automated tool to scrape the newsletters, evaluate the articles, and find the examples I was looking for.
The source code is available on GitHub.
Overview
The application evaluates the newsletter links in a two-stage process, and you can choose any AI models available through OpenRouter (a platform that provides unified access to many LLM providers). For my run, the first stage used Gemini 2.5 Flash Lite (a fast, cheap lightweight reasoning model) to perform a preliminary screen on just the newsletter summaries, keeping the cost of processing thousands of documents to a bare minimum. In the second stage, DeepSeek v3.2 evaluated the full text of the surviving articles against a strict set of criteria. This model works well because it offers reasoning capability comparable to GPT-5 class models at a fraction of the cost.
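Under the hood, both stages are ordinary OpenAI-style chat completions sent to OpenRouter's endpoint. A minimal sketch, assuming standard library HTTP only; the model slugs and helper names here are my own illustrations, not the repo's actual code (check openrouter.ai/models for exact slugs):

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

# Illustrative model IDs for the two stages (assumptions, not verified slugs).
SCREEN_MODEL = "google/gemini-2.5-flash-lite"
EVAL_MODEL = "deepseek/deepseek-v3.2"

def build_request(model: str, system_prompt: str, user_content: str) -> dict:
    """Assemble an OpenAI-compatible chat payload for OpenRouter."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
    }

def send(payload: dict, api_key: str) -> dict:
    """POST the payload to OpenRouter and return the parsed JSON response."""
    req = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Because OpenRouter mirrors the OpenAI chat API, swapping models between stages is just a change of the `model` string.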
Criteria
To find high-quality articles and filter out marketing fluff, I set the following four criteria:
- The article’s primary purpose is sharing real experience or insight, not marketing a product, announcing a launch, or promoting a company’s tool or platform. Genuine personal reflections are acceptable even if the author mentions their company, but the article should not read as promotional content.
- It is a first-hand account from an individual person or small team reflecting on their genuine experience, not a corporate blog showcasing an internal tool or platform capability.
- The article describes one or more instances of the author using AI coding agents for substantial implementation of their own work. Meta-commentary or industry analysis about AI/LLMs does not satisfy this.
- The author provides specific quantitative terms or reasoned estimates describing their own productivity gains from AI coding tools (e.g., time savings, cost savings, before and after comparisons, percentage of code written by AI, ratio of active involvement to total time, scope of work accomplished in a stated timeframe, etc.). General industry statistics, adoption figures, or market data do not count. Vague or impressionistic multipliers (e.g., 10x) also do not qualify.
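In code, criteria like these can live as plain data that both stages interpolate into their prompts, so there is a single source of truth. A hypothetical, abridged sketch (the prompts would carry the full wording above):

```python
# The four criteria as shared data; wording abridged for illustration.
CRITERIA = [
    "Primary purpose is sharing real experience, not marketing or promotion.",
    "First-hand account from an individual or small team, not a corporate blog.",
    "Describes the author using AI coding agents for substantial implementation.",
    "Gives specific quantitative productivity gains; vague multipliers do not count.",
]

def criteria_block() -> str:
    """Render the criteria as a numbered list for prompt interpolation."""
    return "\n".join(f"{i}. {c}" for i, c in enumerate(CRITERIA, start=1))
```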
Process
With the models selected and the criteria established, the application processes the articles in two separate stages.
Summary Screening
The first stage is a screening process that runs only on the article title, its URL, and the summary provided in the newsletter. To avoid rejecting relevant articles, the instructions define the model’s role as a “generous initial screener” whose “only job is to filter out articles that are clearly unrelated to the criteria.” The model is required to output a simple pass or fail decision along with a brief reason, subject to the following constraints:
- Consider the title, summary, and source together as a whole when judging relevance.
- Ask only: ‘Could this article plausibly relate to the criteria?’ Check broad relevance only. Do not judge whether the summary satisfies the criteria.
- Summaries are brief and may omit what the criteria ask for. Focus only on whether the article’s topic could relate to the criteria. Do not reject for missing details, evidence, or different emphasis. The full article may contain it.
- Pass the article through if it is likely to be relevant to the criteria. Reject only when it is clearly unrelated. When in doubt, answer true.
- Accept all claims in the summary at face value. Never fact-check details against your own knowledge; your training data may be outdated.
As an aside, I found the last constraint to be necessary after noticing a specific failure in early testing where an article about Wordiest was rejected for this spurious reason: “The article is about a developer using an AI tool to port a game, which could align with the criteria. However, the summary mentions GPT-5.2 which is not a real model, suggesting this might be a fictional or satirical piece rather than a genuine personal experience.” 🤔
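A minimal sketch of how the stage-one call might be assembled, with the verdict returned as JSON. The function names, schema, and fail-open parsing policy are my own illustration, though the fail-open choice mirrors the "when in doubt, answer true" constraint:

```python
import json

SCREENER_SYSTEM = (
    "You are a generous initial screener. Your only job is to filter out "
    "articles that are clearly unrelated to the criteria. When in doubt, "
    "answer true. Accept all claims at face value. "
    'Respond with JSON: {"pass": bool, "reason": str}.'
)

def screen_prompt(title: str, url: str, summary: str, criteria: list) -> str:
    """Build the stage-one user message from newsletter metadata only."""
    rules = "\n".join(f"- {c}" for c in criteria)
    return f"Criteria:\n{rules}\n\nTitle: {title}\nURL: {url}\nSummary: {summary}"

def parse_decision(reply: str):
    """Parse the model's JSON verdict; fail open on malformed output,
    consistent with the 'when in doubt, pass it through' policy."""
    try:
        data = json.loads(reply)
        return bool(data["pass"]), str(data.get("reason", ""))
    except (json.JSONDecodeError, KeyError, TypeError):
        return True, "unparseable reply; passed through to stage two"
```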
Full Article Evaluation
Articles that pass the initial screening move to the second stage of full article evaluation where the system fetches the complete text for processing. This stage is designed to minimize false positives (i.e., where documents are incorrectly identified as matching the criteria when they actually do not). The instructions define the model’s role as an “analytical article evaluator” tasked with strictly evaluating the document as follows:
- Each criterion must be evaluated independently. The document must satisfy all criteria to be considered a match. If even one criterion fails, the entire document fails.
- Evaluate the document strictly against each criterion. Base your judgment on what the text explicitly states. Do not assume, infer, or stretch definitions to make the document fit.
- Do not act as a defense attorney for the text. If you have to bend a rule or squint to make the text fit a criterion, it does not fit.
- Pay absolute attention to any explicit exclusions or negative constraints in the criteria. If a criterion specifies that something should not be included, or does not count, this is a hard boundary that cannot be overridden.
- Accept all factual claims at face value. Never question their veracity based on your own knowledge; your training data may be outdated. Evaluate only whether the text satisfies the criteria as written.
For the final output, the prompt requires the model to write out a step-by-step analysis evaluating the article against each criterion before giving its final pass or fail decision. Forcing this explicit chain-of-thought reasoning improves the model’s ability to follow the strict negative constraints.
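The reasoning-first output and the all-or-nothing rule can be sketched as follows. The field names and helper are hypothetical; the point is that placing the analysis field before the verdict nudges the model to reason before it commits:

```python
EVALUATOR_SYSTEM = (
    "You are an analytical article evaluator. Evaluate each criterion "
    "independently and strictly; explicit exclusions are hard boundaries. "
    "Accept factual claims at face value. Respond with JSON: "
    '{"analysis": str, "criteria": {"1": bool, "2": bool, ...}, "match": bool}.'
)
# "analysis" comes first in the schema so the step-by-step reasoning is
# written out before the per-criterion booleans and final verdict.

def final_verdict(per_criterion: dict) -> bool:
    """The document matches only if every criterion passes."""
    return all(per_criterion.values())
```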
Results
The system scraped and evaluated 2,843 articles from newsletter issues dated October 1, 2025, through February 26, 2026. The entire run took 2 minutes and 21 seconds, thanks in part to OpenRouter’s high rate limits allowing the processing of 200+ articles concurrently. Over 2,500 of these were rejected during the initial screening stage. Most of the remaining articles failed the full text evaluation, leaving only 40 examples that satisfied all criteria.
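The concurrent fan-out can be sketched with an asyncio semaphore; the cap and function names here are illustrative, not the repo's actual code:

```python
import asyncio

MAX_CONCURRENCY = 200  # in-flight request cap; OpenRouter's limits allow this

async def evaluate_all(articles, evaluate_one):
    """Fan out evaluations while capping in-flight requests.

    `evaluate_one` is any coroutine that evaluates a single article;
    results come back in the original article order.
    """
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def bounded(article):
        async with sem:
            return await evaluate_one(article)

    return await asyncio.gather(*(bounded(a) for a in articles))
```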
Below are some of the best representative anecdotes discussing significant time and cost savings from using AI coding agents:
Resurrecting Crimsonland: Banteg used AI agents to reverse-engineer and rewrite a video game from 2003. The project took just two weeks despite lacking access to the original source code. By contrast, when the original studio remastered the game in 2014, the process took their team a full year even with the source code.
Migrating 6000 React tests using AI Agents and ASTs: Elio Capella used an AI agent to update his company’s entire testing suite to a new version. The author migrated 970 files containing over 6,000 tests in a single week, which he estimated would have taken months to complete manually.
I ported JustHTML from Python to JavaScript with Codex CLI and GPT-5.2 in 4.5 hours: Simon Willison used an AI coding agent to translate a software library from Python to JavaScript, with the resulting code passing 9,200 automated tests. The entire process took about 4.5 hours with a $20/month ChatGPT subscription, as opposed to the original author spending a couple of months developing the Python version.
Coding Agents & Complexity Budgets: Lee Robinson decided to migrate Cursor’s entire website away from an expensive content management system and back to raw code. What he originally estimated would take weeks of work (and possibly require hiring an outside agency) was completed in three days and cost $260 in tokens.
Cost
Processing this volume of data required a very small budget. The screening stage with Gemini 2.5 Flash Lite used about 1.78M input tokens and 144k output tokens at $0.10 and $0.40 per million, respectively, resulting in a cost of $0.24. The full evaluation stage with DeepSeek v3.2 used about 1.03M input tokens and 71k output tokens at $0.25 and $0.40 per million, respectively, for another $0.29. In total, the models cost about $0.52, and after OpenRouter’s 5.5% markup the run came to $0.55.
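The arithmetic above, reproduced as a quick sketch using the token counts and rates stated in this post:

```python
def stage_cost(in_tokens: int, out_tokens: int,
               in_rate: float, out_rate: float) -> float:
    """Cost in dollars given token counts and per-million-token rates."""
    return (in_tokens * in_rate + out_tokens * out_rate) / 1_000_000

screening = stage_cost(1_780_000, 144_000, 0.10, 0.40)    # Gemini 2.5 Flash Lite
evaluation = stage_cost(1_030_000, 71_000, 0.25, 0.40)    # DeepSeek v3.2
total = (screening + evaluation) * 1.055                  # with 5.5% markup
```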
It’s worth noting that OpenRouter offers a collection of free models that can be used to process the data at no cost; however, the performance is typically worse than the paid models and they are subject to relatively low rate limits. That being said, I did find the recently released Arcee AI: Trinity Large Preview (free) open-weight model to be surprisingly good for this kind of task, albeit more prone to yielding false positives.
Final Thoughts
I built this tool specifically to find AI coding stories, but it can surface whatever information you need from any of TLDR’s tech-related newsletters simply by changing the target newsletters and the criteria. With some modifications to the web scraping code, the system could also be adapted to scan any other large public archive.
The system still has limits that require human oversight. Cheaper models can occasionally miss strict rules, and because AI outputs vary, the results are not always perfectly reproducible. The tool functions best as a sampling mechanism to narrow down massive datasets into a focused short list for personal review.