PhD Dissertation, University of South Carolina

Generalized Planning Using Language Models

and Its Applications

This dissertation explores how language models can support automated planning without sacrificing correctness, generalization, or practical usefulness.

RQ1 Characterization

Survey & taxonomy

RQ2 Specialization

Valid plan generation

RQ3 Modeling

State-centric planner

RQ4 Integration

Neuro-symbolic loop

RQ5 Application

Dialog & manufacturing

Overview

Dissertation overview

The dissertation is organized around five technical questions covering characterization, validity analysis, model design, neuro-symbolic integration, and evaluation on real-world domains.

Classical automated planning provides formal semantics, explicit state representations, and correctness criteria, but it depends on structured domain models and search procedures that are difficult to scale to open-ended inputs. Large language models provide broad priors and flexible sequence modeling, but unconstrained generation does not guarantee executability, validity, or goal satisfaction.

This dissertation studies how language models can be incorporated into planning systems without discarding the formal structure needed for reliable plan generation. The technical progression moves from systematic characterization of the literature, to empirical analysis of validity failures, to state-centric model design, to neuro-symbolic planning architectures, and finally to evaluation on real-world domains.

Problem setting Unconstrained LM generation is expressive, but it does not reliably satisfy action semantics, plan validity, or out-of-domain generalization requirements.
Methodology Combine systematic review, benchmark-based evaluation, state-transition modeling, graph tokenization, and symbolic verification in hybrid planning pipelines.
Technical outcome Establish a taxonomy of LM use in planning, quantify validity limitations, develop state-centric planning models, and evaluate neuro-symbolic methods on real-world domains.

Research arc

1
Characterize LM-based planning

Systematically review 126 papers and identify functional roles, category structure, and open technical limitations.

2
Quantify validity and generalization

Evaluate pretrained and fine-tuned models on standard planning benchmarks to measure valid plan generation and cross-domain transfer.

3
Model state transitions directly

Replace raw action-sequence generation with state-centric prediction and search-based plan extraction.

4
Integrate neural and symbolic planning

Use symbolic validation and repair signals inside an iterative neuro-symbolic planning loop.

5
Evaluate on real-world domains

Study the resulting methods in dialog systems and adaptive manufacturing replanning settings.

Research Questions

Five research questions

Together, these questions trace the path from understanding the literature to building planning systems that are both expressive and dependable.

RQ1

Characterization

How are language models being used for automated planning?

This chapter constructs a structured characterization of the role of language models in automated planning. The contribution is twofold: a manual taxonomy built from the literature through November 2023, and a semi-automated framework for extending that taxonomy and analyzing how the field changes over time.

Question What functional roles are language models playing within planning pipelines?
Method Manually categorize 126 papers into eight core categories, then update the analysis with a semi-automated, human-augmented pipeline over 47 newer papers.
Finding Plan generation is the dominant category in the initial taxonomy, while later updates show category drift and introduce goal decomposition and replanning.
126-paper taxonomy 8 core categories 47-paper update
Main takeaway

Language models are not used in planning in a single uniform way; they occupy multiple functional roles, and those roles are shifting as the community moves from end-to-end plan generation toward more structured and tool-supported uses.

Category drift

[Chart: category shares by year, 2020–2024]
Down Plan generation

Still dominant in D2, but its relative share decreases.

Down Language translation

Declines as translation is treated as necessary but insufficient for planning.

Down Interactive planning

Decreases as end-to-end interactive use remains difficult to scale reliably.

Up Model construction

Grows and becomes the second-highest category in D2.

Stable Heuristics optimization

Maintains a stable presence across the two datasets.

Up Tool integration

Increases and reaches the third-highest share in D2.

Down Brain-inspired planning

Declines as work shifts toward concrete neuro-symbolic architectures.

Down Multi-agent planning

Decreases as coordination reliability remains a major challenge.

New Goal decomposition

Emerges in D2 with 4 papers focused on subgoal structuring.

New Replanning

Emerges in D2 with 1 paper centered on plan adaptation after failure.

RQ2

Specialization

How can language models be used for effective plan generation?

This chapter conducts a controlled evaluation of plan generation across language-model architectures, input representations, and adaptation strategies. The study compares pretrained and fine-tuned models on benchmark planning domains to identify where specialization helps and where generalization still fails.

Question How do pretrained and specialized language models behave on classical plan-generation tasks?
Method Evaluate decoder-only, encoder-decoder, and encoder-only models on six IPC domains using natural language, PDDL, and compact representations under zero-shot, one-shot, chain-of-thought, and fine-tuning settings.
Finding Pretrained models show limited valid plan generation; fine-tuned CodeT5 improves strongly in-distribution, but neither evaluated model produces valid plans on unseen domains.
6 IPC domains 3 input representations 4 adaptation settings
Main takeaway

Fine-tuning improves performance within the training distribution, but effective specialization still requires planning-oriented representations and architectures that generalize beyond memorized domain patterns.

Action-centric plan generation

Initial state

(ontable a) (ontable b) (clear a) (clear b) (handempty)

Goal state

(on a b)

Action-centric planning Autoregressive LLM

Conditioned on the initial and goal state, the model emits plan tokens left to right and assigns a next-token probability at each decoding step.

P(token_t | s_init, s_goal, token_{<t})
Decoded tokens and next-token probabilities: pickup (0.78), a (0.92), stack (0.73), a,b (0.89), giving the plan pickup(a) stack(a,b).

PDDL execution trace: 1. pickup(a)  2. stack(a,b)
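The left-to-right decoding above can be sketched in a few lines. The scripted next-token table is a hypothetical stand-in for a real model's distribution P(token_t | s_init, s_goal, token_{<t}); it is not output from Plansformer or CodeT5.

```python
# Sketch of action-centric autoregressive decoding. `toy_next_token` is a
# hypothetical scripted stand-in for a trained model conditioned on the
# initial and goal states shown above.

def toy_next_token(prefix):
    """Return (token, probability) for the greedy continuation of `prefix`."""
    script = {
        (): ("pickup", 0.78),
        ("pickup",): ("a", 0.92),
        ("pickup", "a"): ("stack", 0.73),
        ("pickup", "a", "stack"): ("a,b", 0.89),
        ("pickup", "a", "stack", "a,b"): ("<eos>", 0.99),
    }
    return script[tuple(prefix)]

def greedy_decode(max_steps=10):
    """Emit plan tokens left to right until the end-of-sequence token."""
    tokens, probs = [], []
    for _ in range(max_steps):
        tok, p = toy_next_token(tokens)
        if tok == "<eos>":
            break
        tokens.append(tok)
        probs.append(p)
    return tokens, probs

plan, probs = greedy_decode()
# Pair each action name with its argument token: pickup(a) stack(a,b)
actions = [f"{plan[i]}({plan[i + 1]})" for i in range(0, len(plan), 2)]
```

Nothing in this loop consults action preconditions, which is exactly why unconstrained generation can emit inexecutable plans.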
Zero-shot Limited

Pretrained models remain weak in zero-shot settings; the best vanilla result in the chapter is still only 43.52% under one-shot PDDL prompting.

Fine-tuned 97.57%

Fine-tuned CodeT5 with compact input achieves 97.57% satisficing plans, of which 86.21% are optimal, on in-distribution problems.

Generalization 0 valid

No valid plans are produced on unseen domains such as childsnack, depots, and satellite.
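The validity metric behind these numbers can be made concrete with a minimal checker: execute each action against an explicit state and confirm the goal at the end. The action models below follow standard Blocksworld semantics; this is an illustrative sketch, not the VAL-style validator used in the chapter.

```python
# Minimal plan-validity check for Blocksworld: an action applies only if its
# preconditions hold, and a plan is valid only if every action applies and
# the goal atoms hold in the final state.

def pickup(state, x):
    pre = {("ontable", x), ("clear", x), ("handempty",)}
    if not pre <= state:
        return None  # precondition failure: plan is invalid
    return (state - pre) | {("holding", x)}

def stack(state, x, y):
    pre = {("holding", x), ("clear", y)}
    if not pre <= state:
        return None
    return (state - pre) | {("on", x, y), ("clear", x), ("handempty",)}

def is_valid(state, goal, plan):
    """Execute the plan; valid iff every step applies and the goal holds."""
    for act, args in plan:
        state = act(state, *args)
        if state is None:
            return False
    return goal <= state

init = {("ontable", "a"), ("ontable", "b"), ("clear", "a"),
        ("clear", "b"), ("handempty",)}
goal = {("on", "a", "b")}
plan = [(pickup, ("a",)), (stack, ("a", "b"))]
```

Running `is_valid(init, goal, plan)` accepts the two-step plan, while a plan that tries to stack without holding a block is rejected at the first precondition check.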

RQ3

Modeling

How can compact foundation models be trained from scratch to support plan generation?

This chapter reformulates generalized planning as transition-model learning. Instead of predicting action tokens autoregressively, a compact model predicts successor-state embeddings from explicit symbolic states and goals, then recovers executable actions through symbolic successor verification.

Question Can generalized planning be learned as state-to-state transition prediction instead of autoregressive action generation?
Method Encode each state-goal pair as an Instance Learning Graph (ILG), tokenize it with WL, Shortest Path, GraphBPE, or SimHash, learn residual transitions with LSTM or XGBoost, and decode actions by nearest valid successor matching in Succ(s_t).
Finding Compact 1.2M-parameter state-centric models achieve strong size extrapolation when the tokenizer matches the domain structure: WL+XGBoost+ reaches 45.0% on Blocksworld and 87.2% on VisitAll.
4 IPC domains 4 graph tokenizers 1.2M parameters
Main takeaway

Generalization comes from the representation as much as the model: explicit symbolic states, ILGs, and the right tokenizer reduce state drift and let compact transition models extrapolate beyond the training object range.

State-centric transition learning

Current state s_t

(ontable a) (ontable b) (clear a) (clear b) (handempty)

Goal g

(on a b)

ILG encoder φ(s_t, g)
Residual transition model v_t = φ(s_t) + f(φ(s_t), φ(g))
Neuro-symbolic decoding argmin s in Succ(s_t) ||φ(s) - v_t||_2
Step 1 Predict a valid successor state

From s_0, the model predicts an embedding for the next state and the decoder recovers pickup(a) by matching against valid successors.

Step 2 Keep the symbolic state explicit

At the next step, successor verification selects the state that supports stack(a,b), so the rollout remains executable throughout.
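The decoding rule argmin over s in Succ(s_t) of ||φ(s) - v_t||_2 can be sketched directly. The bag-of-atoms `phi` below is a toy stand-in for the ILG encoder, and the predicted embedding `v_t` is a hand-made noisy vector, not model output.

```python
import math

# Sketch of the neuro-symbolic decoding step: the model predicts a successor
# embedding v_t, and the decoder snaps it to the nearest *valid* successor in
# Succ(s_t), so rollouts never leave the reachable state space.

ATOMS = ["ontable a", "ontable b", "clear a", "clear b",
         "handempty", "holding a", "holding b"]

def phi(state):
    """Toy stand-in for the ILG encoder: indicator vector over atoms."""
    return [1.0 if a in state else 0.0 for a in ATOMS]

def decode(v_t, successors):
    """argmin over s in Succ(s_t) of ||phi(s) - v_t||_2."""
    def dist(s):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(phi(s), v_t)))
    return min(successors, key=dist)

s0 = {"ontable a", "ontable b", "clear a", "clear b", "handempty"}
succ = [
    {"ontable b", "clear b", "holding a"},   # reached by pickup(a)
    {"ontable a", "clear a", "holding b"},   # reached by pickup(b)
]
# Predicted embedding: a noisy version of phi(state after pickup(a)).
v_t = [0.1, 0.9, 0.1, 0.8, 0.2, 0.9, 0.0]
s_next = decode(v_t, succ)
```

Even though `v_t` matches no successor exactly, the nearest-successor projection recovers the pickup(a) state, which is the drift-reduction effect the chapter describes.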

WL Local color refinement

Permutation- and size-invariant; best when local neighborhoods determine transition dynamics.

Blocksworld best ext. 45.0%
Shortest Path Global connectivity

Captures reachability structure directly; strongest and most consistent on VisitAll.

VisitAll best ext. 87.2%
GraphBPE Data-driven graph motifs

Learns recurring ILG token patterns from DFS linearizations; best state-centric tokenizer on Gripper.

Gripper best ext. 43.8%
SimHash Random projection hashing

Fast and permutation-invariant, but lossy; retains some transport structure only with LSTM decoding.

Gripper best ext. 31.3%
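Of the four tokenizers above, WL is the simplest to sketch: each node's color is repeatedly re-hashed from its own color and the multiset of its neighbors' colors, yielding a permutation- and size-invariant summary. This is a generic 1-WL refinement sketch, not the dissertation's exact ILG tokenization code.

```python
# One-dimensional Weisfeiler-Leman (WL) color refinement: after each round,
# two nodes share a color only if their local neighborhoods matched so far.

def wl_colors(adj, labels, rounds=3):
    """adj: node -> list of neighbors; labels: node -> initial label."""
    colors = dict(labels)
    for _ in range(rounds):
        new = {}
        for v in adj:
            # Signature = own color + sorted multiset of neighbor colors.
            signature = (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            new[v] = hash(signature)  # canonical relabeling via hashing
        colors = new
    return colors

# Two isomorphic path graphs with identical labels get identical
# color multisets, regardless of node naming.
adj1 = {0: [1], 1: [0, 2], 2: [1]}
adj2 = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
c1 = wl_colors(adj1, {0: "x", 1: "x", 2: "x"})
c2 = wl_colors(adj2, {"a": "x", "b": "x", "c": "x"})
```

The resulting color histogram is what makes WL features transfer across problem sizes: a larger instance reuses the same local colors rather than new global identifiers.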

RQ4

Integration

How can language models be used along with symbolic methods to achieve robust plan generation via neurosymbolic architectures?

This chapter instantiates Plan-SOFAI, a Slow and Fast AI architecture for classical planning. Fast System-1 solvers propose candidate plans, a metacognitive controller decides whether they are trustworthy enough to use, and slow symbolic System-2 planners verify, repair, or replace them when needed.

Question How should fast language-model proposers and slow symbolic planners be combined so that plan generation remains robust?
Method Instantiate S1 with Plansformer and case-based plan selectors, S2 with Fast Downward and LPG, and a metacognitive controller that routes by confidence, accumulated experience, risk aversion, and estimated time cost.
Finding PF is the strongest standalone S1 solver at 80.4% valid and 77.2% optimal plans, but neurosymbolic embodiments are substantially more robust: PS-LPG solves 100% of 500 benchmark problems and PS-MIX solves 98% with 89.0% optimal plans.
500 benchmark problems 5 planning domains S1 + MC + S2
Main takeaway

Language models are most useful as fast proposers. Symbolic planners provide the correction and repair layer, and metacognition decides when to trust the proposal and when to escalate.

Plan-SOFAI: thinking fast and slow in planning

Input Problem + domain

The Model of the World provides the PDDL environment, instance, and time budget.

System 1 Fast proposer

Plansformer and case-based selectors return a quick plan proposal and confidence score.

PF LEV/JAC/CB
Metacognition MC-1 / MC-2 routing

Confidence, experience, risk aversion, and estimated cost decide whether to accept, repair, or replan.

world self others
System 2 Symbolic repair / solve

LPG repairs partial plans; FD solves from scratch when the S1 proposal is not usable.

LPG FD
Output Verified plan

Return an accepted or repaired plan only when correctness constraints are satisfied.
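The accept/repair/replan control flow above can be sketched as a short routing function. The threshold and the toy solvers are illustrative placeholders, not the dissertation's MC-1/MC-2 policies or the real Plansformer, LPG, and Fast Downward interfaces.

```python
# Sketch of the Plan-SOFAI loop: System 1 proposes, metacognition decides,
# System 2 repairs or solves from scratch when the proposal is not usable.

def plan_sofai(problem, s1, s2_repair, s2_solve, verify, tau=0.7):
    plan, confidence = s1(problem)             # System 1: fast proposal
    if confidence >= tau and verify(problem, plan):
        return plan, "accepted-s1"
    repaired = s2_repair(problem, plan)        # System 2: LPG-style repair
    if repaired is not None and verify(problem, repaired):
        return repaired, "repaired-s2"
    return s2_solve(problem), "solved-s2"      # System 2: FD-style solve

# Toy instantiation: the "problem" is just the required plan length.
def s1(p):              return (["noop"] * (p - 1), 0.9)  # one step short
def verify(p, plan):    return plan is not None and len(plan) == p
def s2_repair(p, plan): return plan + ["noop"]            # patch the plan
def s2_solve(p):        return ["noop"] * p

plan, route = plan_sofai(3, s1, s2_repair, s2_solve, verify)
```

Here the confident but invalid S1 proposal fails verification and is repaired rather than discarded, which is the cheap middle path between trusting S1 outright and always invoking full symbolic search.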

Standalone S1 PF
Valid plans 80.4%
Optimal plans 77.2%

Fast and strong for a learned proposer, but still not reliable enough as a standalone planner.

Best coverage PS-LPG
Valid plans 100.0%
Optimal plans 86.8%

Solves all 500 problems and reduces average plan length by 14% relative to LPG.

Best tradeoff PS-MIX
Valid plans 98.0%
Optimal plans 89.0%

The strongest overall balance between solve rate, plan quality, and runtime.

RQ5

Application

How do new generalized planners created with language models and symbolic approaches perform in applications?

Chapter 7 evaluates the generalized planning methods developed earlier in three deployment settings: dialog-based information retrieval, trustworthy conversational support for HIV/AIDS information, and adaptive replanning in a stochastic rocket-assembly factory. The emphasis is on whether these methods remain effective under ambiguity, safety constraints, and operational disruptions rather than only on benchmark instances.

Question Do generalized planners built from planning, language models, and symbolic or learning-based control transfer to real application domains?
Method Study PRUDENT on UNSPSC and ICD-10, SafeGenChat on 47 UNAIDS HIV/AIDS FAQs with SOFAI-style metacognitive routing, and six reinforcement-learning architectures on a stochastic rocket-assembly MDP with p_anomaly = 0.05.
Finding P+RL provides the strongest overall dialog performance, SafeGenChat routes sensitive queries to verified responses, and all learned manufacturing policies exceed 91% success with DQN, AVI, and ASNet reaching 100%.
UNSPSC + ICD-10 47 UNAIDS QA pairs 6 RL architectures
Main takeaway

Generalized planning remains effective in deployment when symbolic structure, metacognitive risk control, and adaptive replanning are matched to the application domain rather than relying on a standalone generator.

Application deployment

Collaborative assistants PRUDENT
UNSPSC / ICD-10

Planning and reinforcement learning are interleaved to resolve ambiguous user requests, select the appropriate hierarchy, and drive multi-turn retrieval over heterogeneous data sources.

Example: the query "Code for cholera" is resolved to the ICD-10 hierarchy, and plan-driven retrieval returns code A00.0.
P-P 11.1s avg

Accurate, but it requires manual data-source selection.

P-RL 8.7s avg

Fastest, but incomplete on 4 of 9 evaluated queries.

P+RL Best overall

Shortest task length with stronger task completion across both datasets.

Trustworthy dialog SafeGenChat
SOFAI routing

Query risk, model confidence, and prior experience determine whether the system answers with a retrieved LLM response or escalates to a verified rule-based assistant.

User query
Risk + confidence
Low risk System-1

Llama-3-8B with retrieved QA context

High risk System-2

SafeChat policies and verified answers

47 UNAIDS FAQ pairs
43/47 non-harmful under both Granite Guardian conditions
T1-T4 risk and confidence thresholds drive routing
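The threshold-driven routing can be sketched as a small decision function. The scoring inputs and threshold values below are illustrative placeholders, not the chapter's calibrated T1-T4 settings, and the example queries are hypothetical.

```python
# Sketch of SOFAI-style routing in SafeGenChat: high-risk or low-confidence
# queries escalate to verified rule-based answers (System 2); only low-risk,
# confident queries are answered by the retrieval-backed LLM (System 1).

def route(risk, confidence, t_risk=0.5, t_conf=0.6):
    """Return which system answers: 'system-1' (LLM) or 'system-2' (verified)."""
    if risk >= t_risk:
        return "system-2"        # sensitive queries get verified answers
    if confidence < t_conf:
        return "system-2"        # uncertain answers also escalate
    return "system-1"            # low-risk and confident: LLM reply

queries = [
    ("What does CD4 count mean?",    0.2, 0.9),   # low risk, confident
    ("Should I stop my medication?", 0.9, 0.8),   # high risk
    ("Where can I get tested?",      0.1, 0.4),   # low confidence
]
decisions = [route(r, c) for _, r, c in queries]
```

The asymmetry is deliberate: a wrong answer on a medication question is costlier than a slower verified one, so risk overrides confidence in the escalation order.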
Adaptive replanning Stochastic rocket assembly
p_anomaly = 0.05

A discrete-time MDP models four robots, a conveyor, and four anomaly types. Learned policies recover through waiting, retries, and intervention actions without halting the line.

[Diagram: conveyor line with stations Start, Stop 1, Stop 2, and Inspect, a tray, and robots r1–r4; anomaly types: overheat, tray jam, grasp failure, missing part]
DQN 100.0%
AVI 100.0%
ASNet 100.0%
PPO-MLP 96.7%
GNN-RGAT 93.3%
GNN-WMPNN 91.7%

All learned policies exceed 91% success; DQN, AVI, and ASNet pass 60/60 episodes, while the random baseline remains below 5%.
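The episode-based success protocol can be sketched with a drastically simplified line model. The single `retry` recovery action and the 20-step episode length are assumptions for illustration; the real benchmark has four robots, four anomaly types, and richer intervention actions.

```python
import random

# Sketch of evaluating a recovery policy on a stochastic line: each step an
# anomaly occurs with probability p_anomaly = 0.05, and the policy must clear
# it before the line can advance. An unhandled anomaly fails the episode.

P_ANOMALY = 0.05

def run_episode(policy, steps=20, rng=random):
    anomaly = False
    for _ in range(steps):
        if anomaly:
            if policy("anomaly") != "retry":
                return False          # unhandled anomaly halts the line
            anomaly = False
        if rng.random() < P_ANOMALY:
            anomaly = True            # e.g. a tray jam or grasp failure
    return True

recover = lambda obs: "retry"         # always clears the anomaly
ignore  = lambda obs: "advance"       # never recovers

rng = random.Random(0)
successes = sum(run_episode(recover, rng=rng) for _ in range(60))
```

A policy that always recovers passes all 60 episodes, mirroring the 60/60 protocol above, while a policy that ignores anomalies fails as soon as one occurs mid-episode.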

Committee and Collaborators

Committee

Dissertation advisors and committee members who supported and shaped this work.

Major Professor

Biplav Srivastava

University of South Carolina

Major Professor

Amit Sheth

University of South Carolina

Examination Chair

Ramtin Zand

University of South Carolina

Committee Member

Lior Horesh

IBM Research

Committee Member

Sarath Sreedharan

Colorado State University