Critics in the Machine
How a peer-review system brings objectivity, consistency, and actionable feedback to evaluating design quality at scale.
In today’s business environment, visual communication is no longer a “nice-to-have”; it’s a competitive necessity. Whether it’s a digital ad, an app store screenshot, a streaming thumbnail, or a presentation deck, the quality of design directly influences customer engagement, conversion rates, and ultimately revenue.
But here’s the challenge: evaluating design quality has always been subjective. One marketer may say a banner “pops,” while another complains it looks cluttered. A creative director may insist that spacing is “off,” while a junior designer can’t quite put their finger on what’s wrong. Companies often default to either gut instinct or simple checklists: does the logo appear, are the brand colors present, is the font readable? While useful, these heuristics don’t capture the multi-dimensional nature of good design.
Good design isn’t just about checking boxes. It’s about balance: alignment, typography, spacing, color harmony, hierarchy of information, and visual storytelling all working together. A single misstep in one of these areas (say, poor spacing or a confusing hierarchy) can reduce clarity, erode brand trust, or tank performance metrics like click-through rate (CTR) or conversions.
Generative design tools have made this problem even more pressing. Platforms can now churn out dozens, even hundreds, of design variations in seconds. That’s a huge opportunity, but also a governance nightmare: how do you maintain quality, brand consistency, and clarity at scale when human review is slow, subjective, and expensive? The problem the researchers from Adobe address is clear: how to evaluate design quality objectively, multi-dimensionally, and at scale—while also producing clear, actionable feedback.
The recently published research proposes a surprisingly elegant solution: treat design review like a conference-style peer review process, but with AI agents as the reviewers. Instead of asking one all-purpose model to judge a design, the system recruits multiple specialized reviewers, each focused on a different aspect of design.
At the center is a meta-agent (think of it as the editor-in-chief of a journal or the engagement manager on a consulting team). The meta-agent decides which expert reviewers are needed, recruits them, and then consolidates their evaluations into a single, coherent review.
There are two types of reviewers:
- Static agents: Always called upon, these check fundamentals like alignment, spacing, and overlapping elements.
- Dynamic agents: Brought in selectively depending on the specific design, they tackle higher-level questions like style coherence, semantic grouping, or readability. (A simplified orchestration sketch follows this list.)
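To make this concrete, here is a minimal sketch of the orchestration pattern in Python. The names (MetaAgent, Review, recruit) and the recruitment heuristic are illustrative assumptions rather than the paper's actual implementation; in the real system the reviewers are themselves model-driven agents, not hard-coded rules.

```python
# Minimal sketch of the meta-agent orchestration pattern described above.
# Class and function names are illustrative, not taken from the paper.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Review:
    aspect: str          # e.g. "alignment", "style_coherence"
    score: float         # 0.0 (poor) to 1.0 (excellent)
    comments: List[str]  # actionable feedback sentences


# A reviewer is any callable that inspects a design payload and returns a Review.
Reviewer = Callable[[dict], Review]

# Static reviewers always run: they check the fundamentals.
STATIC_REVIEWERS: Dict[str, Reviewer] = {
    "alignment": lambda design: Review("alignment", 0.8, ["Left edges of text blocks are consistent."]),
    "spacing":   lambda design: Review("spacing", 0.5, ["Headline sits too close to the call-to-action."]),
    "overlap":   lambda design: Review("overlap", 1.0, ["No overlapping elements detected."]),
}

# Dynamic reviewers are recruited only when the design calls for them.
DYNAMIC_REVIEWERS: Dict[str, Reviewer] = {
    "style_coherence":   lambda design: Review("style_coherence", 0.7, ["Two typefaces compete for attention."]),
    "semantic_grouping": lambda design: Review("semantic_grouping", 0.9, ["Price and product name are grouped."]),
    "readability":       lambda design: Review("readability", 0.6, ["Body text is small against a busy background."]),
}


class MetaAgent:
    """Editor-in-chief: decides which reviewers to recruit and gathers their reviews."""

    def recruit(self, design: dict) -> List[Reviewer]:
        reviewers = list(STATIC_REVIEWERS.values())  # fundamentals are always checked
        # Toy recruitment heuristic for illustration: add dynamic reviewers whose
        # aspect is flagged as relevant in the design metadata.
        for aspect, reviewer in DYNAMIC_REVIEWERS.items():
            if aspect in design.get("relevant_aspects", []):
                reviewers.append(reviewer)
        return reviewers

    def review(self, design: dict) -> List[Review]:
        return [reviewer(design) for reviewer in self.recruit(design)]


if __name__ == "__main__":
    design = {"relevant_aspects": ["readability"]}
    for r in MetaAgent().review(design):
        print(f"{r.aspect}: {r.score:.1f} - {r.comments[0]}")
```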
But there’s a twist that makes these reviewers far more effective: the researchers found that simply giving a model an image isn’t enough. To truly evaluate designs, reviewers need design-aware context. This comes in two forms:
- GRAD (GRAph-based Design exemplar selection): The system translates a design into a graph, where nodes represent elements (like text or images) and edges capture relationships (like spatial distance or semantic similarity). By comparing this graph against a library of past examples using graph-similarity measures, the system retrieves similar reference designs. Reviewers then get to “see” how comparable, high-quality examples were structured, just as a consultant benchmarks against industry leaders.
- SDD (Structured Design Description): Alongside visual input, the reviewers are also provided with a compact, text-based description of the design’s layout. Think of this as an executive summary of the design—highlighting what elements exist and where they sit. By grounding the review in both structured text and visual exemplars, the system avoids hallucinations and ensures reviewers focus on the right details. (Both mechanisms are sketched in simplified form after this list.)
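The sketch below shows, in deliberately simplified form, what these two mechanisms compute. The graph signature and cosine similarity used here are crude stand-ins for the paper's actual graph representation and matching measures, and every function name is hypothetical; the point is the shape of the pipeline: design elements in, similar exemplars and a layout summary out.

```python
# Simplified sketch of GRAD-style exemplar retrieval and an SDD-style layout summary.
# The feature construction and similarity measure here are placeholders; the paper's
# actual graph representation and matching are more sophisticated.
import math
from typing import Dict, List, Tuple

Element = Dict[str, object]  # e.g. {"id": "headline", "type": "text", "box": (x, y, w, h)}


def build_graph(elements: List[Element]) -> Tuple[List[str], Dict[Tuple[str, str], float]]:
    """Nodes are element ids; edge weights are center-to-center spatial distances."""
    nodes = [str(e["id"]) for e in elements]
    centers = {str(e["id"]): (e["box"][0] + e["box"][2] / 2, e["box"][1] + e["box"][3] / 2)
               for e in elements}
    edges = {}
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            (ax, ay), (bx, by) = centers[a], centers[b]
            edges[(a, b)] = math.hypot(ax - bx, ay - by)
    return nodes, edges


def graph_signature(elements: List[Element]) -> List[float]:
    """Crude fixed-length descriptor: element count plus sorted pairwise distances."""
    _, edges = build_graph(elements)
    return [float(len(elements))] + sorted(edges.values())


def similarity(sig_a: List[float], sig_b: List[float]) -> float:
    """Cosine similarity over zero-padded signatures (a stand-in for a real graph metric)."""
    n = max(len(sig_a), len(sig_b))
    a = sig_a + [0.0] * (n - len(sig_a))
    b = sig_b + [0.0] * (n - len(sig_b))
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_exemplars(query: List[Element], library: Dict[str, List[Element]], k: int = 3) -> List[str]:
    """Return the k most structurally similar reference designs from the library."""
    q = graph_signature(query)
    ranked = sorted(library, key=lambda name: similarity(q, graph_signature(library[name])), reverse=True)
    return ranked[:k]


def structured_description(elements: List[Element]) -> str:
    """SDD-style summary: what elements exist and where they sit on the canvas."""
    lines = [f"- {e['type']} '{e['id']}' at x={e['box'][0]}, y={e['box'][1]}, "
             f"size {e['box'][2]}x{e['box'][3]}" for e in elements]
    return "Layout contains:\n" + "\n".join(lines)
```

A reviewer's prompt would then bundle the retrieved exemplars and the structured description alongside the rendered image, so the model critiques the design with both visual and textual grounding.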
Finally, the meta-agent consolidates the reviewers’ feedback into a unified report—offering both quantitative evaluations and qualitative, actionable insights. Instead of vague statements like “this doesn’t look good,” the system can pinpoint issues (“the spacing between the headline and call-to-action reduces readability”) and suggest fixes.
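As a rough illustration of what that consolidated output might look like, the sketch below merges per-aspect reviewer results into one report. The threshold and aggregation rules are placeholders invented for this example; in the actual system the meta-agent, itself an AI agent, performs the consolidation rather than a fixed formula.

```python
# Illustrative consolidation step: merging per-aspect reviewer outputs into one report
# with an overall score and issue-specific, actionable comments.
from statistics import mean
from typing import Dict, List


def consolidate(reviews: List[Dict[str, object]]) -> dict:
    """Each review is {'aspect': str, 'score': float, 'comments': [str, ...]}."""
    issues = [
        f"[{r['aspect']}] {c}"
        for r in reviews if r["score"] < 0.7   # surface only aspects that need attention
        for c in r["comments"]
    ]
    return {
        "overall_score": round(mean(r["score"] for r in reviews), 2),
        "per_aspect": {r["aspect"]: r["score"] for r in reviews},
        "action_items": issues or ["No significant issues found."],
    }


reviews = [
    {"aspect": "spacing", "score": 0.5,
     "comments": ["The spacing between the headline and call-to-action reduces readability."]},
    {"aspect": "alignment", "score": 0.9, "comments": ["Text blocks share a consistent left edge."]},
]
print(consolidate(reviews))
```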
A bold framework is only as good as its real-world performance. To test whether their agentic review system could meaningfully evaluate design quality, the researchers ran a series of structured experiments across multiple datasets. Each dataset represented a different type of design challenge—ranging from advertising layouts to infographics to structured design documents. In essence, they built a “test bed” that reflected the diversity of visual communication businesses rely on today.
What makes this evaluation compelling is the range of design flaws that were tested. Instead of focusing narrowly on one issue, like typography, the datasets covered a broad set of attributes: alignment, overlap, whitespace, grouping of elements, style coherence, and more. This diversity allowed the researchers to see whether the system could flex across multiple categories, much like a seasoned brand manager who can spot inconsistencies in both a billboard and a social post.
In each experiment, the agentic system’s output was compared against two benchmarks. First were heuristic methods, the rule-of-thumb checklists many companies already use. Second were single-model baselines, which attempted to do the evaluation in one shot without the multi-agent approach. These comparisons set a high bar: if the agentic system could outperform both traditional heuristics and strong individual models, it would prove that the multi-reviewer strategy wasn’t just clever; it was also effective.
The results pointed to clear advantages. Across the different datasets, the agentic system consistently made more accurate judgments—aligning more closely with human evaluations of design quality. Importantly, it also produced richer, more actionable feedback than the baselines. While single models often generated vague or inconsistent commentary, the agentic system could zero in on specific issues and phrase them in a way that resembled the kind of constructive feedback a human designer might receive in a review session.
The researchers didn’t stop at “the model looks right.” They built a structured evaluation framework that mirrors how a business might assess the ROI of a new process.
First, they measured classification accuracy for discrete design issues. For example, if a dataset indicated that a layout had a spacing problem, could the system flag that issue correctly? This is akin to a quality-control check: are the obvious errors being caught?
Second, they measured correlation with human ratings for continuous attributes like alignment quality or whitespace usage. Instead of asking only whether the system could detect “right vs. wrong,” this test asked: does the system’s scoring move in lockstep with human judgment? In business terms, does the system think like a human reviewer when grading shades of quality rather than black-and-white outcomes?
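In code, the first two layers are straightforward to express. The sketch below is a hedged illustration: the sample data is invented, and Spearman rank correlation is shown only as one way to check system scores against human ratings; the paper may report different correlation statistics.

```python
# Sketch of the first two evaluation layers: accuracy on discrete issue detection
# and correlation with human ratings on continuous attributes.
from scipy.stats import spearmanr


def issue_detection_accuracy(predicted: list, ground_truth: list) -> float:
    """Fraction of designs where the system's flag matches the labeled flaw."""
    hits = sum(p == g for p, g in zip(predicted, ground_truth))
    return hits / len(ground_truth)


# Discrete check: did the system flag the spacing problem where the dataset says one exists?
predicted_flags = [True, False, True, True, False]
labeled_flags   = [True, False, False, True, False]
print("accuracy:", issue_detection_accuracy(predicted_flags, labeled_flags))  # 0.8

# Continuous check: do the system's alignment scores rise and fall with human ratings?
system_scores = [0.62, 0.80, 0.45, 0.91, 0.70]
human_ratings = [3, 4, 2, 5, 4]  # e.g. a 1-5 scale from human annotators
rho, p_value = spearmanr(system_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```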
Third, and perhaps most importantly, they assessed the usefulness of feedback. A system that simply says “bad spacing” isn’t particularly helpful. What companies and designers need are insights that explain the issue and point toward a fix. To measure this, the researchers introduced an evaluation method that compared the system’s written feedback with ground-truth problem descriptions—using both large language models (LLMs) as judges and embedding-based similarity measures. This meant the quality of comments was evaluated not just for correctness, but also for clarity and actionability.
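The embedding-based half of that comparison can be sketched as follows. The specific model (all-MiniLM-L6-v2 via the sentence-transformers library) and the example sentences are assumptions made for illustration, not details from the paper; the LLM-as-a-judge half would add a separate prompt asking a model to grade each comment for correctness and actionability.

```python
# Sketch of embedding-based feedback evaluation: compare the system's written feedback
# against a ground-truth problem description. The embedding model is an assumed choice
# for illustration, not necessarily the one used in the paper.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

system_feedback = ("The spacing between the headline and the call-to-action is too tight, "
                   "which reduces readability; increase the vertical gap.")
ground_truth = ("Insufficient whitespace separates the headline from the CTA button, "
                "hurting legibility.")

# Higher cosine similarity means the generated comment describes the known issue.
emb = model.encode([system_feedback, ground_truth], convert_to_tensor=True)
similarity = util.cos_sim(emb[0], emb[1]).item()
print(f"embedding similarity: {similarity:.2f}")
```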
Taken together, these layers of evaluation ensured the solution was tested both quantitatively and qualitatively. Accuracy and correlation numbers showed whether the system was technically sound, while the feedback evaluation captured whether it was practically valuable. For companies, this is the difference between a dashboard that reports metrics and one that drives actual decisions.
The experiments demonstrated that when judged on these holistic criteria, the agentic peer-review system consistently outperformed the alternatives. It wasn’t simply catching more errors; it was producing feedback that could realistically guide design improvements in the messy, real-world workflows where quality, consistency, and speed all collide.
In assessing whether this agentic review system worked, the researchers went beyond traditional metrics. They asked a more fundamental question: does the system produce outputs that would actually help a designer or business leader make better decisions?
Part of this came from observing how the system’s reviews resembled real human critique sessions. Instead of stopping at labeling a design as “good” or “bad,” the agentic system generated explanations and suggestions that could plausibly fit into a design workshop or brand governance meeting. This human-likeness in style and substance became a critical success marker.
Equally important was consistency. In fast-moving business environments, inconsistent evaluations are often worse than inaccurate ones, because they erode trust in the process. The agentic system showed that its judgments were more stable and reproducible than one-off models—creating a level of reliability that organizations could build processes around.
Finally, success was judged by the ability to transfer across contexts. Many AI systems perform well in narrow scenarios but fail when the situation shifts. By testing the framework across multiple datasets representing very different design challenges, the researchers were effectively stress-testing its generalizability. The fact that the system performed well in these varied contexts suggested it could scale across industries and use cases.
That said, the research is not without its limitations. For one, the datasets used were relatively small compared to the vast design space businesses operate in. This makes sense for an academic proof-of-concept, but in practice, real-world design review may involve millions of assets across multiple geographies, languages, and platforms. Scaling the approach will require more diverse training and testing grounds.
Another limitation lies in the evaluation framework itself. Although the researchers used sophisticated methods to judge feedback quality, they relied in part on LLMs to act as “judges.” This introduces the risk of model bias influencing the assessment. While early comparisons with human ratings showed strong correlation, over-reliance on LLM-as-a-judge could become problematic in high-stakes brand or compliance environments.
Finally, the system’s current capabilities are focused on evaluation and guidance; it does not yet automatically fix the identified issues. For businesses, this means the tool functions like a highly skilled consultant pointing out problems and suggesting remedies, rather than an autonomous operator that executes changes.
Looking ahead, the researchers envision a natural next step: closing the loop by combining evaluation with automated revision. Imagine a system that not only identifies misaligned typography but also proposes (and tests) alternative layouts until it converges on a better outcome. This would transform the framework from a quality-control checkpoint into an active co-creator.
The broader business implications are significant. Strategically, this type of system offers a scalable quality assurance layer for design workflows, making it possible to maintain brand consistency across thousands of assets and dozens of markets without ballooning headcount. Tactically, it can act as a triage mechanism—flagging assets that need human intervention while letting compliant ones flow through automatically. Operationally, it can reduce costly rework, accelerate campaign launches, and improve the signal-to-noise ratio in creative performance testing.
In industries where visuals directly drive business outcomes (advertising, e-commerce, streaming, financial services), this kind of solution could shift design from being a perceived cost center to being a measurable driver of performance. By transforming subjective taste into actionable, data-driven feedback, the research offers a new model for governing creativity at scale.
In short, the work doesn’t just solve a technical problem. It points toward a future where design quality can be evaluated with the same rigor as financial reporting or supply chain performance. And for organizations navigating the complexity of global branding in the digital age, that’s not just a nice-to-have—it’s a competitive edge.
Further Reading
- Mallari, M. (2025, August 16). Creative differences resolved. AI-First Product Management by Michael Mallari. https://michaelmallari.bitbucket.io/case-study/creative-differences-resolved/
- Nag, S., Joseph, K. J., Goswami, K., Morariu, V. I., & Srinivasan, B. V. (2025, August 14). Agentic design review system. arXiv. https://arxiv.org/abs/2508.10745