Survey & taxonomy
PhD Dissertation, University of South Carolina
Generalized Planning Using Language Models
and Its Applications
This dissertation explores how language models can support automated planning without sacrificing correctness, generalization, or practical usefulness.
Valid plan generation
State-centric planner
Neuro-symbolic loop
Dialog & manufacturing
Overview
Dissertation overview
The dissertation is organized around five technical questions covering characterization, validity analysis, model design, neuro-symbolic integration, and evaluation on real-world domains.
Classical automated planning provides formal semantics, explicit state representations, and correctness criteria, but it depends on structured domain models and search procedures that are difficult to scale to open-ended inputs. Large language models provide broad priors and flexible sequence modeling, but unconstrained generation does not guarantee executability, validity, or goal satisfaction.
This dissertation studies how language models can be incorporated into planning systems without discarding the formal structure needed for reliable plan generation. The technical progression moves from systematic characterization of the literature, to empirical analysis of validity failures, to state-centric model design, to neuro-symbolic planning architectures, and finally to evaluation on real-world domains.
Research arc
Systematically review 126 papers and identify functional roles, category structure, and open technical limitations.
Benchmark pretrained and fine-tuned models on benchmark planning domains to measure valid plan generation and cross-domain transfer.
Replace raw action-sequence generation with state-centric prediction and search-based plan extraction.
Use symbolic validation and repair signals inside an iterative neuro-symbolic planning loop.
Study the resulting methods in dialog systems and adaptive manufacturing replanning settings.
Research Questions
Five research questions
Together, these questions trace the path from understanding the literature to building planning systems that are both expressive and dependable.
RQ1
Characterization
How are language models being used for automated planning?
This chapter constructs a structured characterization of the role of language models in automated planning. The contribution is twofold: a manual taxonomy built from the literature through November 2023, and a semi-automated framework for extending that taxonomy and analyzing how the field changes over time.
Language models are not used in planning in a single uniform way; they occupy multiple functional roles, and those roles are shifting as the community moves from end-to-end plan generation toward more structured and tool-supported uses.
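One plausible sketch of the semi-automated extension step is nearest-category assignment by abstract similarity. Everything below is illustrative: the seed abstracts are placeholders, and the bag-of-words cosine stands in for whatever representation the framework actually uses.

```python
from collections import Counter
import math

# Illustrative seed categories with placeholder abstracts; these are
# assumptions, not the taxonomy's actual entries.
SEED = {
    "plan generation": "llm generates action sequences end to end",
    "language translation": "natural language translated to pddl goals",
    "model construction": "llm builds symbolic domain models from text",
}

def bow(text):
    """Bag-of-words vector as a Counter."""
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a)
    denom = (math.sqrt(sum(v * v for v in a.values())) *
             math.sqrt(sum(v * v for v in b.values()))) or 1.0
    return num / denom

def categorize(abstract):
    """Assign a new paper to the most similar seed category."""
    return max(SEED, key=lambda c: cosine(bow(abstract), bow(SEED[c])))

print(categorize("translating user text into pddl goal specifications"))
# language translation
```

A human reviewer would still confirm low-similarity assignments, which is what keeps the framework semi-automated rather than fully automated.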
Category drift
Still dominant in D2, but its relative share decreases.
Declines as translation is treated as necessary but insufficient for planning.
Decreases as end-to-end interactive use remains difficult to scale reliably.
Grows and becomes the second-highest category in D2.
Maintains a stable presence across the two datasets.
Increases and reaches the third-highest share in D2.
Declines as work shifts toward concrete neuro-symbolic architectures.
Decreases as coordination reliability remains a major challenge.
Emerges in D2 with 4 papers focused on subgoal structuring.
Emerges in D2 with 1 paper centered on plan adaptation after failure.
RQ2
Specialization
How can language models be used for effective plan generation?
This chapter conducts a controlled evaluation of plan generation across language-model architectures, input representations, and adaptation strategies. The study compares pretrained and fine-tuned models on benchmark planning domains to identify where specialization helps and where generalization still fails.
Fine-tuning improves performance within the training distribution, but effective specialization still requires planning-oriented representations and architectures that generalize beyond memorized domain patterns.
Action-centric plan generation
Example (Blocksworld): initial state (ontable a) (ontable b) (clear a) (clear b) (handempty); goal (on a b).
Conditioned on the initial and goal state, the model emits plan tokens left to right and assigns a next-token probability at each decoding step:
P(token_t | s_init, s_goal, token_{<t})
Decoded tokens and probabilities: pickup (0.78), a (0.92), stack (0.73), a,b (0.89).
Extracted plan: 1. pickup(a)  2. stack(a,b)
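The decoding scheme above can be sketched as a toy chain-rule scorer. The lookup table is an illustrative stand-in for a fine-tuned model's conditional distribution, populated with the example probabilities from the figure.

```python
import math

# Toy next-token distributions standing in for a fine-tuned model's
# P(token_t | s_init, s_goal, token_{<t}). Values are the example above.
NEXT_TOKEN_PROBS = {
    (): {"pickup": 0.78, "stack": 0.12},
    ("pickup",): {"a": 0.92, "b": 0.05},
    ("pickup", "a"): {"stack": 0.73, "putdown": 0.15},
    ("pickup", "a", "stack"): {"a,b": 0.89, "b,a": 0.06},
}

def greedy_decode(max_steps=4):
    """Emit plan tokens left to right, accumulating the chain-rule log-prob."""
    tokens, logp = [], 0.0
    for _ in range(max_steps):
        dist = NEXT_TOKEN_PROBS.get(tuple(tokens))
        if not dist:
            break
        tok, p = max(dist.items(), key=lambda kv: kv[1])
        tokens.append(tok)
        logp += math.log(p)
    return tokens, logp

plan, logp = greedy_decode()
print(plan)                       # ['pickup', 'a', 'stack', 'a,b']
print(round(math.exp(logp), 4))   # 0.4662 = 0.78 * 0.92 * 0.73 * 0.89
```

The plan probability is the product of the per-token conditionals, which is exactly why one low-probability token anywhere in the sequence can sink an otherwise plausible plan.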
Pretrained models remain weak in zero-shot settings; the best vanilla result in the chapter is still only 43.52% under one-shot PDDL prompting.
Fine-tuned CodeT5 with compact input achieves 97.57% satisficing plans, of which 86.21% are optimal, on in-distribution problems.
No valid plans are produced on unseen domains such as childsnack, depots, and satellite.
RQ3
Modeling
How can compact foundation models be trained from scratch to support plan generation?
This chapter reformulates generalized planning as transition-model learning. Instead of predicting action tokens autoregressively, a compact model predicts successor-state embeddings from explicit symbolic states and goals, then recovers executable actions through symbolic successor verification.
Generalization comes from the representation as much as the model: explicit symbolic states, ILGs, and the right tokenizer reduce state drift and let compact transition models extrapolate beyond the training object range.
State-centric transition learning
Example (Blocksworld): initial state (ontable a) (ontable b) (clear a) (clear b) (handempty); goal (on a b).
The encoder φ embeds the current state and goal, and the model predicts the successor-state embedding v_t = φ(s_t) + f(φ(s_t), φ(g)).
The executable action is recovered by decoding to the nearest valid successor: argmin_{s ∈ Succ(s_t)} ||φ(s) − v_t||_2.
From s_0, the model predicts an embedding for the next state and the decoder recovers pickup(a) by matching against valid successors. At the next step, successor verification selects the state that supports stack(a,b), so the rollout remains executable throughout.
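A minimal sketch of the successor-verification step, with a toy hash-based embedding standing in for the learned encoder φ and a hand-written successor table standing in for the symbolic successor generator:

```python
# Toy stand-ins for the dissertation's learned encoder and symbolic
# successor generator; only the argmin-over-successors pattern is the point.

def phi(state):
    """Toy state embedding: facts hashed into a small count vector."""
    v = [0.0] * 8
    for fact in state:
        v[hash(fact) % 8] += 1.0
    return v

def dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def verify_step(succ_by_action, v_pred):
    """Pick the symbolic successor whose embedding is closest to the
    model's predicted next-state embedding v_pred."""
    return min(succ_by_action.items(), key=lambda kv: dist(phi(kv[1]), v_pred))

s0 = {"(ontable a)", "(ontable b)", "(clear a)", "(clear b)", "(handempty)"}
succ = {
    "pickup(a)": {"(holding a)", "(ontable b)", "(clear b)"},
    "pickup(b)": {"(holding b)", "(ontable a)", "(clear a)"},
}
# Suppose the model's prediction lands on the pickup(a) successor.
v_pred = phi(succ["pickup(a)"])
action, s1 = verify_step(succ, v_pred)
print(action)  # pickup(a)
```

Because the decoder only ever selects from Succ(s_t), every emitted action is executable by construction; the model's prediction only has to be closer to the right successor than to the wrong ones.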
Permutation- and size-invariant; best when local neighborhoods determine transition dynamics.
Captures reachability structure directly; strongest and most consistent on VisitAll.
Learns recurring ILG token patterns from DFS linearizations; best state-centric tokenizer on Gripper.
Fast and permutation-invariant, but lossy; retains some transport structure only with LSTM decoding.
RQ4
Integration
How can language models be combined with symbolic methods to achieve robust plan generation via neuro-symbolic architectures?
This chapter instantiates Plan-SOFAI, a Slow and Fast AI architecture for classical planning. Fast System-1 solvers propose candidate plans, a metacognitive controller decides whether they are trustworthy enough to use, and slow symbolic System-2 planners verify, repair, or replace them when needed.
Language models are most useful as fast proposers. Symbolic planners provide the correction and repair layer, and metacognition decides when to trust the proposal and when to escalate.
Plan-SOFAI: thinking fast and slow in planning
The Model of the World provides the PDDL environment, instance, and time budget.
Plansformer and case-based selectors return a quick plan proposal and confidence score.
Confidence, experience, risk aversion, and estimated cost decide whether to accept, repair, or replan.
LPG repairs partial plans; FD solves from scratch when the S1 proposal is not usable.
Return an accepted or repaired plan only when correctness constraints are satisfied.
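The controller logic above can be sketched as a small decision function. The thresholds and the Proposal fields are illustrative assumptions; in the actual system, Plansformer supplies the S1 proposal, LPG performs repair, and FD replans from scratch.

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    """Illustrative S1 output: a candidate plan, the proposer's confidence,
    and the result of symbolic plan validation."""
    plan: list
    confidence: float
    valid: bool

def metacontroller(proposal, accept_thresh=0.8, repair_thresh=0.4):
    """Decide whether to accept the fast proposal, repair it, or replan.
    Threshold values are assumptions, not the system's actual parameters."""
    if proposal.valid and proposal.confidence >= accept_thresh:
        return "accept"
    if proposal.confidence >= repair_thresh:
        return "repair"   # hand the partial plan to LPG-style repair
    return "replan"       # fall back to from-scratch symbolic search

print(metacontroller(Proposal(["pickup(a)", "stack(a,b)"], 0.9, True)))  # accept
print(metacontroller(Proposal(["pickup(a)"], 0.6, False)))               # repair
print(metacontroller(Proposal([], 0.1, False)))                          # replan
```

The key design point is that validity is checked symbolically before any acceptance, so the learned proposer can be fast and fallible without compromising correctness.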
Fast and strong for a learned proposer, but still not reliable enough as a standalone planner.
Solves all 500 problems and reduces average plan length by 14% relative to LPG.
The strongest overall balance between solve rate, plan quality, and runtime.
RQ5
Application
How do new generalized planners created with language models and symbolic approaches perform in applications?
Chapter 7 evaluates the generalized planning methods developed earlier in three deployment settings: dialog-based information retrieval, trustworthy conversational support for HIV/AIDS information, and adaptive replanning in a stochastic rocket-assembly factory. The emphasis is on whether these methods remain effective under ambiguity, safety constraints, and operational disruptions rather than only on benchmark instances.
Generalized planning remains effective in deployment when symbolic structure, metacognitive risk control, and adaptive replanning are matched to the application domain rather than relying on a standalone generator.
Application deployment
Planning and reinforcement learning are interleaved to resolve ambiguous user requests, select the appropriate hierarchy, and drive multi-turn retrieval over heterogeneous data sources.
Example query: "Code for cholera" → A00.0.
Accurate, but it requires manual data-source selection.
Fastest, but incomplete on 4 of 9 evaluated queries.
Shortest task length with stronger task completion across both datasets.
Query risk, model confidence, and prior experience determine whether the system answers with a retrieved LLM response or escalates to a verified rule-based assistant.
Llama-3-8B with retrieved QA context
SafeChat policies and verified answers
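The routing decision described above can be sketched as a simple risk gate. The threshold values below are assumptions for illustration, not the system's calibrated parameters.

```python
# Risk-aware answer routing: low-risk, high-confidence queries get the
# retrieved-LLM answer; anything risky or uncertain escalates to the
# verified rule-based assistant. Thresholds are illustrative assumptions.

def route(query_risk, model_confidence, risk_cap=0.7, conf_floor=0.6):
    if query_risk > risk_cap or model_confidence < conf_floor:
        return "verified_rule_based"
    return "retrieved_llm"

print(route(query_risk=0.2, model_confidence=0.9))  # retrieved_llm
print(route(query_risk=0.9, model_confidence=0.9))  # verified_rule_based
```

Escalation is deliberately asymmetric: a wrong answer about HIV/AIDS care is costlier than a slower verified one, so either a high-risk query or a low-confidence model is enough to trigger the safe path.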
p_anomaly = 0.05
A discrete-time MDP models four robots, a conveyor, and four anomaly types. Learned policies recover through waiting, retries, and intervention actions without halting the line.
All learned policies exceed 91% success; DQN, AVI, and ASNet pass 60/60 episodes, while the random baseline remains below 5%.
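The recover-through-retries pattern can be illustrated with a toy simulation. Only p_anomaly = 0.05 comes from the text; the episode length, retry cap, and seeding scheme are assumptions.

```python
import random

P_ANOMALY = 0.05  # per-step anomaly probability from the factory MDP

def run_episode(steps=100, max_retries=3, rng=None):
    """One episode: at each step, retry through anomalies rather than halt.
    The episode fails only if retries are exhausted at some step."""
    rng = rng or random.Random(0)
    for _ in range(steps):
        retries = 0
        while rng.random() < P_ANOMALY:  # anomaly persists on this attempt
            retries += 1
            if retries > max_retries:
                return False             # line halts: episode fails
    return True

successes = sum(run_episode(rng=random.Random(seed)) for seed in range(60))
print(successes)  # nearly all of the 60 episodes succeed
```

Even a modest retry budget drives the per-step halt probability to roughly P_ANOMALY**4, which is why recovery actions rather than halting dominate the learned policies' behavior.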
Committee and Collaborators
Committee
Dissertation advisors and committee members who supported and shaped this work.
Major Professor
Biplav Srivastava
University of South Carolina
Major Professor
Amit Sheth
University of South Carolina
Examination Chair
Ramtin Zand
University of South Carolina
Committee Member
Lior Horesh
IBM Research
Committee Member
Sarath Sreedharan
Colorado State University