AI’m With the Travel Planner
A closer look at how Vaiage uses multi-agent LLMs to build dynamic, human-like planning systems for complex real-world tasks.
Imagine trying to plan a vacation for a group of friends. One person wants to visit museums, another prefers wineries, someone else is watching the budget, and you’re all trying to avoid rainstorms and squeeze everything into a long weekend. Now multiply that complexity across dozens of decisions: transportation, accommodations, food, weather, local events, and personal preferences (all of which can change on the fly).
Today’s travel tools (for all their polish) struggle to manage this reality. Platforms like Tripadvisor or Expedia might give you static options (a few hotels here, a flight there), but they aren’t built to adapt. They can’t easily revise your itinerary when weather changes, you find out a site is closed, or someone in the group wants to swap out an activity. Most importantly, they don’t let you carry on an ongoing conversation where you can refine, rework, and reason about tradeoffs in real time.
This is the core problem that a new research paper, Vaiage: A Multi-Agent Solution to Personalized Travel Planning, is trying to solve. It’s not just about booking logistics; it’s also about creating an intelligent, adaptive planning system that can reason, coordinate, and converse like a competent human assistant, while juggling complex preferences, goals, and constraints across a dynamic landscape.
So how do you even begin to solve that kind of problem?
The UC Berkeley research team behind Vaiage starts by rethinking what an AI travel planner should be. Instead of one giant model making every decision, Vaiage is structured like a team of specialists: a multi-agent system in which each AI “agent” focuses on a specific task and then collaborates with the others to build a coherent travel plan.
Here’s the core idea: imagine your trip planner is actually a group of experts sitting around a virtual table. One is there to understand your preferences: “Are you a wine person?” or “Do you care more about budget or experience?”. Another focuses on making suggestions (destinations, restaurants, activities) based on what’s available and fits your style. A third agent handles the sequencing and logistics: figuring out the best routes, minimizing wasted time, and making sure your schedule isn’t physically impossible. Others monitor constraints like weather and budget and surface updates as new information becomes available. These agents then share information using a graph structure… essentially a live network of goals, constraints, and dependencies that they all reference and update.
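To make the “experts around a table” idea concrete, here is a minimal sketch of agents reading from and writing to a shared graph of preferences, activities, and dependencies. The class names, node/edge format, and matching logic are illustrative assumptions, not the paper’s actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class PlanGraph:
    """A shared structure of goals, constraints, and candidate activities."""
    nodes: dict = field(default_factory=dict)   # node_id -> attributes
    edges: list = field(default_factory=list)   # (src, dst, relation)

    def add_node(self, node_id, **attrs):
        self.nodes[node_id] = attrs

    def add_edge(self, src, dst, relation):
        self.edges.append((src, dst, relation))

class PreferenceAgent:
    """Writes the user's stated preferences into the shared graph."""
    def update(self, graph, preferences):
        for name, value in preferences.items():
            graph.add_node(f"pref:{name}", kind="preference", value=value)

class RecommendationAgent:
    """Reads preferences and proposes matching activities."""
    def update(self, graph, catalog):
        pref_values = {k: v["value"] for k, v in graph.nodes.items()
                       if v.get("kind") == "preference"}
        for activity in catalog:
            for pref_id, value in pref_values.items():
                if activity["tag"] == value:
                    node_id = f"act:{activity['name']}"
                    graph.add_node(node_id, kind="activity", **activity)
                    graph.add_edge(pref_id, node_id, "satisfies")

# Usage: each agent updates the same graph in turn.
graph = PlanGraph()
PreferenceAgent().update(graph, {"interest": "wine"})
RecommendationAgent().update(graph, [
    {"name": "Chianti tasting", "tag": "wine"},
    {"name": "Uffizi Gallery", "tag": "art"},
])
print([n for n, a in graph.nodes.items() if a.get("kind") == "activity"])
```

The key design point is that no agent talks to another directly; each reads the current state of the graph and contributes its piece, which is what lets new constraints (weather, budget) propagate to everyone.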
Tying it all together is a large language model (LLM), similar in spirit to ChatGPT. But here, the LLM isn’t flying solo. Instead, it acts as the brain that facilitates coordination, interprets natural-language goals, and enables back-and-forth conversations with the user.
What makes this system powerful isn’t just that it responds like a chatbot. It’s that it can plan like a strategist. It integrates symbolic reasoning (e.g., hard constraints like “stay under $500” or “avoid more than 3 hours of driving per day”) with the fuzzier human-like judgment that LLMs are good at (e.g., “Find a romantic but not too expensive winery near Florence that isn’t overly touristy”).
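The symbolic half of that split is easy to picture: hard constraints can be checked deterministically, while the fuzzier judgments get delegated to an LLM. Below is a minimal sketch of such a checker, assuming an illustrative per-day itinerary format and the two example thresholds from above (neither is prescribed by the paper):

```python
MAX_BUDGET = 500        # "stay under $500"
MAX_DRIVE_HOURS = 3     # "avoid more than 3 hours of driving per day"

def violates_hard_constraints(itinerary):
    """Return a list of violated hard constraints (empty list = feasible)."""
    violations = []
    total_cost = sum(day["cost"] for day in itinerary)
    if total_cost > MAX_BUDGET:
        violations.append(f"budget exceeded: ${total_cost} > ${MAX_BUDGET}")
    for i, day in enumerate(itinerary, start=1):
        if day["drive_hours"] > MAX_DRIVE_HOURS:
            violations.append(
                f"day {i}: {day['drive_hours']}h driving > {MAX_DRIVE_HOURS}h")
    return violations

# Usage: a 3-day plan that breaks both constraints.
itinerary = [
    {"cost": 180, "drive_hours": 2.0},   # day 1
    {"cost": 220, "drive_hours": 3.5},   # day 2
    {"cost": 150, "drive_hours": 1.0},   # day 3
]
print(violates_hard_constraints(itinerary))
```

Any plan the softer LLM-driven agents propose would have to pass a gate like this before it reaches the user, which is what keeps the fuzzy judgment grounded.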
This hybrid approach (structured agent collaboration plus language-based reasoning) enables Vaiage to do something today’s travel platforms cannot: dynamically generate, adjust, and explain personalized travel itineraries in real time, all through natural conversation.
In essence, Vaiage isn’t just automating bookings; it’s automating the thinking behind good planning.
Once the researchers behind Vaiage built this multi-agent travel planning framework, they faced a critical question: how well does it actually work in the real world?
To find out, they designed a series of structured experiments that put Vaiage head-to-head with simpler alternatives. These weren’t just internal simulations; they brought real users into the loop to interact with different versions of the system and assess the quality of the itineraries it produced.
The setup was straightforward but revealing. Participants were asked to imagine a travel scenario, say, planning a three-day trip to Tuscany with certain constraints like a budget limit, preferences for wine country, and an aversion to overly touristy spots. Then, they were shown itineraries generated by three different systems: the full Vaiage framework, a simplified version without strategic coordination between agents, and a third version that didn’t access any external real-time data like weather or activity availability.
From there, it became a qualitative exercise in judgment. The evaluators, some human and some aided by GPT-4-based rubric scoring, looked at each plan not just in terms of creativity or aesthetic appeal, but based on a few core dimensions: feasibility, clarity, coherence, and responsiveness to the user’s goals. Did the itinerary make sense logistically? Did it reflect what the user asked for? Was it grounded in real-world constraints, or was it just a plausible-sounding hallucination?
What stood out quickly was that the full Vaiage system consistently produced more refined and actionable itineraries. The difference wasn’t in flashy features; it was in the ability to integrate live information, coordinate across multiple planning agents, and respond to nuanced preferences. Without coordination or real-time inputs, the simplified systems often proposed plans that were either unrealistic (scheduling a hike during a forecasted thunderstorm) or misaligned with the user’s intent (too expensive, too rushed, or missing key interests).
The researchers used both structured scoring rubrics and free-form user commentary to evaluate these outcomes. Rubrics helped anchor the analysis around clear criteria—such as how well the plan stayed within budget, how smoothly the day flowed from one activity to the next, and how well it respected expressed preferences. Meanwhile, user feedback surfaced the kinds of insights that raw metrics can miss: whether the plan felt “trustworthy,” whether it was easy to understand why certain decisions were made, and whether users felt comfortable engaging in follow-up dialogue to tweak the plan.
In short, success wasn’t defined by perfection; it was defined by usability. Could a non-technical user get something out of the system that felt tailored, reasonable, and flexible enough to adjust when circumstances changed?
The evaluation design also revealed something deeper about how people interact with planning tools. People don’t just want answers—they want explanations. They want to know why the AI suggested a particular route or left out a specific attraction. Vaiage’s multi-agent architecture, by design, made it easier to trace back those decisions to specific constraints or agent contributions, which in turn increased user confidence.
Importantly, this wasn’t just about making plans; it was also about making better decisions with AI as a partner. And by testing Vaiage in these human-centered ways, the research showed that AI assistants, when properly structured, can do more than generate content; they can collaborate in ways that feel strategic, reasoned, and genuinely useful.
One of the most thoughtful aspects of the Vaiage research lies in how the team approached the question of what success actually looks like when designing intelligent planning systems. They didn’t just ask whether the tool could generate a trip itinerary. Instead, they evaluated the system through a broader, more human lens: was the plan grounded in reality, was it personalized, and (critically) did the user trust the process that created it?
This type of evaluation required going beyond traditional performance metrics. The researchers leaned into a mix of structured and subjective measures. On the structured side, they created detailed evaluation rubrics to score dimensions like itinerary feasibility (Could a person actually follow this plan without stress or confusion?), alignment with stated goals (Did the plan reflect what the user actually asked for?), and logical flow (Were activities scheduled in a way that made sense given geography, timing, and constraints?). These rubrics were often scored by advanced language models like GPT-4—offering consistency across evaluations.
On the subjective side, the team incorporated open-ended human feedback—surfacing whether the planning experience felt transparent, if the plan seemed trustworthy, and how easy it was to revise. This dual evaluation approach created a clearer picture: Vaiage wasn’t just building smarter outputs; it was building confidence in how those outputs came together.
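The structured side of that evaluation can be sketched as simple rubric aggregation. In the study, dimension scores were often assigned by a GPT-4 grader; here they are hard-coded stand-ins, and the 1–5 scale, the flagging floor, and the equal weighting are my assumptions, not the paper’s:

```python
RUBRIC = ["feasibility", "goal_alignment", "logical_flow"]

def aggregate(scores):
    """Average the rubric scores and flag any dimension below a floor of 3."""
    missing = [d for d in RUBRIC if d not in scores]
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    flags = [d for d in RUBRIC if scores[d] < 3]
    average = sum(scores[d] for d in RUBRIC) / len(RUBRIC)
    return average, flags

# Usage: a plan that reads well but sequences its days poorly.
score, flags = aggregate(
    {"feasibility": 4, "goal_alignment": 5, "logical_flow": 2})
print(round(score, 2), flags)
```

The point of the flags is exactly the complementarity the authors describe: a decent average can hide one badly failing dimension, which is where the open-ended human feedback takes over.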
But even with these promising outcomes, the system isn’t without limitations.
One of the biggest challenges is the dependency on real-time data through APIs. Vaiage’s power hinges on its ability to integrate dynamic external sources: weather forecasts, hotel availability, traffic conditions, and more. If these APIs are down, slow, or inconsistent, the quality of planning drops. This creates a point of vulnerability for anyone thinking about using this system in production, especially in high-stakes contexts like logistics or healthcare scheduling.
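One standard mitigation for that fragility (my illustration, not a design the paper prescribes) is to fail over to the last cached value when a live source is unavailable, so the planner degrades to slightly stale data instead of breaking:

```python
_cache = {}

def fetch_with_fallback(key, live_fetch):
    """Try the live API; on failure, serve the last cached value if any."""
    try:
        value = live_fetch()
        _cache[key] = value          # refresh the cache on every success
        return value, "live"
    except Exception:
        if key in _cache:
            return _cache[key], "cached"   # stale but usable
        raise                              # no fallback available

# Usage: the first call succeeds and seeds the cache; the second fails over.
value, source = fetch_with_fallback("weather:florence", lambda: "sunny")

def broken_api():
    raise TimeoutError("weather API down")

value2, source2 = fetch_with_fallback("weather:florence", broken_api)
print(value, source, value2, source2)
```

Surfacing the `"cached"` tag to the user matters as much as the fallback itself: a plan built on stale weather data should say so.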
Another limitation is common to many AI systems today: LLM hallucinations. While Vaiage mitigates some of this risk by segmenting responsibilities across multiple agents and validating external data when available, it’s still possible for the system to fabricate plausible-sounding but inaccurate information. This isn’t just a technical issue; it’s also a trust issue. For a system that’s supposed to help people make decisions, getting facts wrong even occasionally can quickly erode user confidence.
There’s also the question of efficiency and scalability. Multi-agent systems are inherently more complex, and orchestrating them through LLMs introduces latency and computational overhead. If you’re planning one trip, that might be fine. But if you’re managing thousands of simultaneous itineraries, or applying this approach to domains like fleet logistics or event coordination, the system’s ability to scale efficiently becomes critical.
Despite these challenges, the potential impact of Vaiage’s approach is significant. What the research gestures toward is a new class of intelligent assistants… not just tools that give you results, but also systems that can reason, coordinate, adapt, and explain… systems that don’t just spit out itineraries, schedules, or plans, but also help you make sense of them (and improve them) through collaboration.
Looking forward, the research has several promising directions: tighter feedback loops that allow users to correct or refine outputs in real time, more robust verification layers to prevent errors, and broader applications beyond travel—into any space where personalized, multi-step planning is required. Think of domains like recruiting, medical scheduling, supply chain orchestration, or even personal financial planning.
In short, Vaiage represents more than just a travel tool. It’s a demonstration of what’s possible when we stop trying to build AI that replaces human decision-making, and start building systems that partner with it.
Further Readings
- Liu, B., Ge, J., & Wang, J. (2025, May 16). Vaiage: a multi-agent solution to personalized travel planning. arXiv.org. https://arxiv.org/abs/2505.10922
- Mallari, M. (2025, May 18). Itineraire you ready for this? AI-First Product Management by Michael Mallari. https://michaelmallari.bitbucket.io/case-study/itineraire-you-ready-for-this/