Five Research Questions

The detailed technical structure of the dissertation

Each research question answers a different part of the same bigger problem: how to use language models in planning systems without sacrificing validity, structure, or practical reliability.

Dissertation structure

RQ1. Characterization

Map the field and understand how language models are being used in planning research.

RQ2. Specialization

Test when pretrained and fine-tuned models can generate valid plans and where they fail.

RQ3. Modeling

Move to compact state-centric models that predict successor states instead of raw action tokens.

RQ4. Integration

Combine fast learned planners with symbolic verification, repair, and metacognitive control.

RQ5. Application

Evaluate the resulting planning ideas in dialog support, safe assistance, and manufacturing.

Together, these questions trace the path from understanding the literature to building planning systems that are both expressive and dependable.

RQ1

Characterization

How are language models being used for automated planning?

This chapter constructs a structured characterization of the role of language models in automated planning. The contribution is twofold: a manual taxonomy built from the literature through November 2023, and a semi-automated framework for extending that taxonomy and analyzing how the field changes over time.

Question: What functional roles are language models playing within planning pipelines?

Method: Manually categorize 126 papers into eight core categories, then update the analysis with a semi-automated, human-augmented pipeline over 47 newer papers.

Finding: Plan generation is the dominant category in the initial taxonomy, while later updates show category drift and introduce goal decomposition and replanning.

126-paper taxonomy, 8 core categories, 47-paper update

Main takeaway

Language models are not used in planning in a single uniform way; they occupy multiple functional roles, and those roles are shifting as the community moves from end-to-end plan generation toward more structured and tool-supported uses.

Category drift

Plan generation (down): Still dominant in D2, the 47-paper update set, but its relative share decreases.

Language translation (down): Declines as translation is treated as necessary but insufficient for planning.

Interactive planning (down): Decreases as end-to-end interactive use remains difficult to scale reliably.

Model construction (up): Grows and becomes the second-highest category in D2.

Heuristics optimization (stable): Maintains a stable presence across the two datasets.

Tool integration (up): Increases and reaches the third-highest share in D2.

Brain-inspired planning (down): Declines as work shifts toward concrete neuro-symbolic architectures.

Multi-agent planning (down): Decreases as coordination reliability remains a major challenge.

Goal decomposition (new): Emerges in D2 with 4 papers focused on subgoal structuring.

Replanning (new): Emerges in D2 with 1 paper centered on plan adaptation after failure.
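
The drift labels above compare each category's relative share in the two datasets. A minimal sketch of that comparison follows. The function names, the 0.02 tolerance, and the counts are all placeholders chosen for illustration; none of them are taken from the dissertation.

```python
def shares(counts):
    """Convert raw paper counts per category into fractions of the dataset."""
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

def drift(d1_counts, d2_counts, tolerance=0.02):
    """Label each category in the newer dataset as new, up, down, or stable."""
    s1, s2 = shares(d1_counts), shares(d2_counts)
    labels = {}
    for cat, share in s2.items():
        if cat not in s1:
            labels[cat] = "new"
        elif share - s1[cat] > tolerance:
            labels[cat] = "up"
        elif s1[cat] - share > tolerance:
            labels[cat] = "down"
        else:
            labels[cat] = "stable"
    return labels

# Placeholder counts for illustration only; NOT the dissertation's data.
d1 = {"cat_a": 50, "cat_b": 30, "cat_c": 20}
d2 = {"cat_a": 15, "cat_b": 20, "cat_c": 10, "cat_d": 5}
print(drift(d1, d2))
```

Comparing shares rather than raw counts matters here because the two datasets differ in size, so a category can lose share while remaining the largest.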

RQ2

Specialization

How can language models be used for effective plan generation?

This chapter conducts a controlled evaluation of plan generation across language-model architectures, input representations, and adaptation strategies. The study compares pretrained and fine-tuned models on benchmark planning domains to identify where specialization helps and where generalization still fails.

Question: How do pretrained and specialized language models behave on classical plan-generation tasks?

Method: Evaluate decoder-only, encoder-decoder, and encoder-only models on six IPC domains using natural language, PDDL, and compact representations under zero-shot, one-shot, chain-of-thought, and fine-tuning settings.

Finding: Pretrained models show limited valid plan generation; fine-tuned CodeT5 improves strongly in-distribution, but neither evaluated model produces valid plans on unseen domains.

6 IPC domains, 3 input representations, 4 adaptation settings

Main takeaway

Fine-tuning improves performance within the training distribution, but effective specialization still requires planning-oriented representations and architectures that generalize beyond memorized domain patterns.

Action-centric plan generation

Initial state

(ontable a) (ontable b) (clear a) (clear b) (handempty)

Goal state

(on a b)

Action-centric planning: autoregressive LLM

Conditioned on the initial and goal state, the model emits plan tokens left to right and assigns a next-token probability at each decoding step.

P(token_t | s_init, s_goal, token_{<t})

Decoded tokens with next-token probabilities: pickup (0.78); a (0.92); stack (0.73); a,b (0.89)

Decoded plan: pickup(a) stack(a,b)

PDDL execution trace: 1. pickup(a) → 2. stack(a,b)
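
Under the autoregressive factorization P(token_t | s_init, s_goal, token_{<t}), the probability of the whole decoded plan is the product of the per-step probabilities. The sketch below applies that chain rule to the four example values (0.78, 0.92, 0.73, 0.89). The function name is illustrative, and summing in log space is a common numerical convention rather than something the chapter specifies.

```python
import math

def sequence_log_prob(step_probs):
    """Chain rule in log space: sum the log-probability of each decoded token."""
    return sum(math.log(p) for p in step_probs)

# Per-token probabilities from the example: pickup, a, stack, "a,b".
step_probs = [0.78, 0.92, 0.73, 0.89]
plan_prob = math.exp(sequence_log_prob(step_probs))
print(round(plan_prob, 3))  # 0.466
```

Even with every token above 0.7, the joint probability of the two-action plan is below 0.5, and the score itself says nothing about whether the plan is executable.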

Zero-shot Limited

Pretrained models remain weak in zero-shot settings, and even the best vanilla (non-fine-tuned) result in the chapter reaches only 43.52%, under one-shot PDDL prompting.

Fine-tuned 97.57%

Fine-tuned CodeT5 with compact input achieves 97.57% satisficing plans, of which 86.21% are optimal, on in-distribution problems.

Generalization 0 valid

No valid plans are produced on unseen domains such as childsnack, depots, and satellite.
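
What "valid" means in these results can be made concrete with a small STRIPS-style checker. It applies each action's preconditions and effects in order and then tests the goal. This is a minimal sketch over the two-block example above, using the standard Blocksworld pickup and stack operators; the dictionary encoding and the function names are assumptions for illustration, not the validator used in the chapter.

```python
def pickup(x):
    """Standard Blocksworld pickup: take a clear block off the table."""
    pre = {("ontable", x), ("clear", x), ("handempty",)}
    return {"pre": pre, "add": {("holding", x)}, "del": pre}

def stack(x, y):
    """Standard Blocksworld stack: place the held block x on a clear block y."""
    pre = {("holding", x), ("clear", y)}
    return {"pre": pre, "add": {("on", x, y), ("clear", x), ("handempty",)}, "del": pre}

def validate(init, goal, plan):
    """True if each action is applicable in sequence and the goal holds at the end."""
    state = set(init)
    for action in plan:
        if not action["pre"] <= state:
            return False  # a precondition is missing, so the plan is invalid
        state = (state - action["del"]) | action["add"]
    return goal <= state

init = {("ontable", "a"), ("ontable", "b"), ("clear", "a"), ("clear", "b"), ("handempty",)}
goal = {("on", "a", "b")}
print(validate(init, goal, [pickup("a"), stack("a", "b")]))  # True
print(validate(init, goal, [stack("a", "b")]))               # False
```

The second call fails because stack(a,b) requires the block to be held first, which is the kind of error a fluent but unverified token sequence can contain.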

RQ3

Modeling

How can compact foundation models be trained from scratch to support plan generation?

This chapter reformulates generalized planning as transition-model learning. Instead of predicting action tokens autoregressively, a compact model predicts successor-state embeddings from explicit symbolic states and goals, then recovers executable actions through symbolic successor verification.

Question: Can generalized planning be learned as state-to-state transition prediction instead of autoregressive action generation?

Method: Encode each state-goal pair as an Instance Learning Graph (ILG), tokenize it with WL, Shortest Path, GraphBPE, or SimHash, learn residual transitions with LSTM or XGBoost, and decode actions by nearest valid successor matching in Succ(s_t).

Finding: Compact 1.2M-parameter state-centric models achieve strong size extrapolation when the tokenizer matches the domain structure: WL+XGBoost+ reaches 45.0% on Blocksworld and 87.2% on VisitAll.

4 IPC domains, 4 graph tokenizers, 1.2M parameters

Main takeaway

Generalization comes from the representation as much as the model: explicit symbolic states, ILGs, and the right tokenizer reduce state drift and let compact transition models extrapolate beyond the training object range.

State-centric transition learning

Current state s_t

(ontable a) (ontable b) (clear a) (clear b) (handempty)

Goal g

(on a b)

ILG encoder: φ(s_t, g)

Residual transition model: v_t = φ(s_t) + f(φ(s_t), φ(g))

Neuro-symbolic decoding: argmin_{s ∈ Succ(s_t)} ||φ(s) - v_t||_2

Step 1: Predict a valid successor state

From s_0, the model predicts an embedding for the next state and the decoder recovers pickup(a) by matching against valid successors.

Step 2: Keep the symbolic state explicit

At the next step, successor verification selects the state that supports stack(a,b), so the rollout remains executable throughout.
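
The two steps above can be sketched as nearest-valid-successor decoding on the two-block example. Everything below is a simplified stand-in: `phi` is a plain indicator vector rather than an ILG embedding, `successors` models only pickup and stack, and `simulated_prediction` fabricates an imprecise model output (a damped step toward an intended successor) in place of a learned transition model.

```python
import math

BLOCKS = ("a", "b")
VOCAB = ([("ontable", x) for x in BLOCKS] + [("clear", x) for x in BLOCKS]
         + [("holding", x) for x in BLOCKS]
         + [("on", x, y) for x in BLOCKS for y in BLOCKS if x != y]
         + [("handempty",)])

def phi(state):
    """Indicator embedding over VOCAB; a stand-in for the ILG encoder."""
    return [1.0 if atom in state else 0.0 for atom in VOCAB]

def successors(state):
    """Succ(s_t) as (action, next_state) pairs; only pickup and stack are modeled."""
    out = []
    for x in BLOCKS:
        pre = {("ontable", x), ("clear", x), ("handempty",)}
        if pre <= state:
            out.append((f"pickup({x})", frozenset((state - pre) | {("holding", x)})))
        for y in BLOCKS:
            pre = {("holding", x), ("clear", y)}
            if x != y and pre <= state:
                add = {("on", x, y), ("clear", x), ("handempty",)}
                out.append((f"stack({x},{y})", frozenset((state - pre) | add)))
    return out

def decode(state, v_t):
    """Pick the valid successor whose embedding is closest to v_t in L2 distance."""
    return min(successors(state), key=lambda pair: math.dist(phi(pair[1]), v_t))

def simulated_prediction(state, intended, damping=0.6):
    """Made-up model output: a damped step from phi(state) toward phi(intended)."""
    return [s + damping * (n - s) for s, n in zip(phi(state), phi(intended))]

s0 = frozenset({("ontable", "a"), ("ontable", "b"), ("clear", "a"),
                ("clear", "b"), ("handempty",)})
s1 = frozenset({("ontable", "b"), ("clear", "b"), ("holding", "a")})
s2 = frozenset({("ontable", "b"), ("on", "a", "b"), ("clear", "a"), ("handempty",)})

plan, state = [], s0
for intended in (s1, s2):
    action, state = decode(state, simulated_prediction(state, intended))
    plan.append(action)
print(plan)  # ['pickup(a)', 'stack(a,b)']
```

Because the decoder only ever returns members of Succ(s_t), an imprecise predicted embedding is snapped back to a reachable symbolic state at every step, which is the property that keeps the rollout executable.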

WL: local color refinement

Permutation- and size-invariant; best when local neighborhoods determine transition dynamics.

Best extrapolation on Blocksworld: 45.0%

Shortest Path: global connectivity

Captures reachability structure directly; strongest and most consistent on VisitAll.

Best extrapolation on VisitAll: 87.2%

GraphBPE: data-driven graph motifs

Learns recurring ILG token patterns from DFS linearizations; best state-centric tokenizer on Gripper.

Best extrapolation on Gripper: 43.8%

SimHash: random projection hashing

Fast and permutation-invariant, but lossy; retains some transport structure only with LSTM decoding.

Best extrapolation on Gripper: 31.3%
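
As an illustration of the first tokenizer, the sketch below runs Weisfeiler-Leman (WL) color refinement on a tiny labeled graph and returns a histogram of the colors seen. The graph is only loosely modeled on an object-and-atom encoding and is not the ILG definition. The SHA-1 hashing of signatures, the two refinement rounds, and all node names are assumptions made for this example.

```python
import hashlib
from collections import Counter

def wl_histogram(adjacency, labels, iterations=2):
    """WL color refinement; returns the multiset of colors over all rounds."""
    colors = dict(labels)
    histogram = Counter(colors.values())
    for _ in range(iterations):
        new_colors = {}
        for node, neighbors in adjacency.items():
            # A node's new color depends on its own color and the sorted
            # multiset of neighbor colors, never on node names or ordering.
            signature = (colors[node], tuple(sorted(colors[n] for n in neighbors)))
            new_colors[node] = hashlib.sha1(repr(signature).encode()).hexdigest()[:8]
        colors = new_colors
        histogram.update(colors.values())
    return histogram

# Two blocks on the table, both clear (hypothetical encoding).
g1_adj = {"a": ["t1", "c1"], "b": ["t2", "c2"],
          "t1": ["a"], "c1": ["a"], "t2": ["b"], "c2": ["b"]}
g1_lab = {"a": "object", "b": "object", "t1": "ontable",
          "c1": "clear", "t2": "ontable", "c2": "clear"}

# Same structure with renamed nodes and shuffled neighbor order.
g2_adj = {"y": ["q2", "p2"], "p1": ["x"], "x": ["p1", "q1"],
          "q1": ["x"], "p2": ["y"], "q2": ["y"]}
g2_lab = {"x": "object", "y": "object", "p1": "ontable",
          "q1": "clear", "p2": "ontable", "q2": "clear"}

print(wl_histogram(g1_adj, g1_lab) == wl_histogram(g2_adj, g2_lab))  # True
```

The equality check shows the permutation invariance mentioned above: renaming objects or reordering neighbors leaves the token histogram unchanged.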

RQ4

Integration

How can language models be used along with symbolic methods to achieve robust plan generation via neurosymbolic architectures?

This chapter instantiates Plan-SOFAI, a Slow and Fast AI architecture for classical planning. Fast System-1 solvers propose candidate plans, a metacognitive controller decides whether they are trustworthy enough to use, and slow symbolic System-2 planners verify, repair, or replace them when needed.

Question: How can language models be used along with symbolic methods to achieve robust plan generation via neurosymbolic architectures?

Method: Instantiate S1 with Plansformer (PF) and case-based plan selectors, S2 with Fast Downward (FD) and LPG, and a metacognitive controller (MC) that routes by confidence, accumulated experience, risk aversion, and estimated time cost.

Finding: PF is the strongest standalone S1 solver at 80.4% valid and 77.2% optimal plans, but neurosymbolic embodiments are substantially more robust: PS-LPG solves 100% of 500 benchmark problems and PS-MIX solves 98% with 89.0% optimal plans.

500 benchmark problems, 5 planning domains, S1 + MC + S2

Main takeaway

Language models are most useful as fast proposers. Symbolic planners provide the correction and repair layer, and metacognition decides when to trust the proposal and when to escalate.

Plan-SOFAI: thinking fast and slow in planning

Input: problem + domain

The Model of the World provides the PDDL environment, instance, and time budget.

System 1: fast proposer

Plansformer and case-based selectors return a quick plan proposal and confidence score.

S1 solvers: PF, LEV/JAC/CB

Metacognition: MC-1 / MC-2 routing

Confidence, experience, risk aversion, and estimated cost decide whether to accept, repair, or replan.

Models of the world, self, and others

System 2: symbolic repair / solve

LPG repairs partial plans; FD solves from scratch when the S1 proposal is not usable.

S2 planners: LPG, FD

Output: verified plan

Return an accepted or repaired plan only when correctness constraints are satisfied.
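
The accept, repair, or replan flow described above can be sketched as a small routing function. This is a deliberate simplification: it routes on confidence and validity only, whereas the controller in the chapter also weighs accumulated experience, risk aversion, and estimated time cost. The `Proposal` type, the 0.8 threshold, and the stub solvers are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Proposal:
    plan: List[str]
    confidence: float  # S1's self-reported confidence in [0, 1]

def route(proposal: Proposal,
          is_valid: Callable[[List[str]], bool],
          repair: Callable[[List[str]], Optional[List[str]]],
          solve: Callable[[], List[str]],
          threshold: float = 0.8) -> Tuple[str, List[str]]:
    """Accept a trusted, valid S1 plan; otherwise try repair; otherwise solve from scratch."""
    if proposal.confidence >= threshold and is_valid(proposal.plan):
        return "accept", proposal.plan
    repaired = repair(proposal.plan)
    if repaired is not None and is_valid(repaired):
        return "repair", repaired
    return "solve", solve()

# Stub components for illustration only.
GOOD = ["pickup(a)", "stack(a,b)"]
is_valid = lambda plan: plan == GOOD
repair = lambda plan: ["pickup(a)"] + plan if plan == ["stack(a,b)"] else None
solve = lambda: list(GOOD)

print(route(Proposal(GOOD, 0.9), is_valid, repair, solve))            # ('accept', ...)
print(route(Proposal(["stack(a,b)"], 0.9), is_valid, repair, solve))  # ('repair', ...)
print(route(Proposal(["putdown(b)"], 0.3), is_valid, repair, solve))  # ('solve', ...)
```

In this sketch every returned plan has either passed the validity check or come from the from-scratch solver, mirroring the rule that only plans satisfying the correctness constraints are returned.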

Standalone S1: PF

Valid plans: 80.4%

Optimal plans: 77.2%

Fast and strong for a learned proposer, but still not reliable enough as a standalone planner.

Best coverage: PS-LPG

Valid plans: 100.0%

Optimal plans: 86.8%

Solves all 500 problems and reduces average plan length by 14% relative to LPG.

Best tradeoff: PS-MIX

Valid plans: 98.0%

Optimal plans: 89.0%

The strongest overall balance between solve rate, plan quality, and runtime.

RQ5

Application

How do new generalized planners created with language models and symbolic approaches perform in applications?

Chapter 7 evaluates the generalized planning methods developed earlier in three deployment settings: dialog-based information retrieval, trustworthy conversational support for HIV/AIDS information, and adaptive replanning in a stochastic rocket-assembly factory. The emphasis is on whether these methods remain effective under ambiguity, safety constraints, and operational disruptions rather than only on benchmark instances.

Question: Do generalized planners built from planning, language models, and symbolic or learning-based control transfer to real application domains?

Method: Study PRUDENT on UNSPSC and ICD-10, SafeGenChat on 47 UNAIDS HIV/AIDS FAQs with SOFAI-style metacognitive routing, and six reinforcement-learning architectures on a stochastic rocket-assembly MDP with p_anomaly = 0.05.

Finding: P+RL provides the strongest overall dialog performance, SafeGenChat routes sensitive queries to verified responses, and all learned manufacturing policies exceed 91% success with DQN, AVI, and ASNet reaching 100%.

UNSPSC + ICD-10, 47 UNAIDS QA pairs, 6 RL architectures

Main takeaway

Generalized planning remains effective in deployment when symbolic structure, metacognitive risk control, and adaptive replanning are matched to the application domain rather than relying on a standalone generator.

Application deployment

Collaborative assistants PRUDENT

UNSPSC / ICD-10

Planning and reinforcement learning are interleaved to resolve ambiguous user requests, select the appropriate hierarchy, and drive multi-turn retrieval over heterogeneous data sources.

Query: Code for cholera

Intent → Select ICD-10 → Plan retrieval → A00.0

P-P: 11.1 s avg

Accurate, but it requires manual data-source selection.

P-RL: 8.7 s avg

Fastest, but incomplete on 4 of 9 evaluated queries.

P+RL: best overall

Shortest task length with stronger task completion across both datasets.

Trustworthy dialog SafeGenChat

SOFAI routing

Query risk, model confidence, and prior experience determine whether the system answers with a retrieved LLM response or escalates to a verified rule-based assistant.

User query → risk + confidence assessment

Low risk → System-1: Llama-3-8B with retrieved QA context

High risk → System-2: SafeChat policies and verified answers

47 UNAIDS FAQ pairs

43/47 non-harmful under both Granite Guardian conditions

T1-T4 risk and confidence thresholds drive routing

Adaptive replanning Stochastic rocket assembly

p_anomaly = 0.05

A discrete-time MDP models four robots, a conveyor, and four anomaly types. Learned policies recover through waiting, retries, and intervention actions without halting the line.

Diagram labels: stations Start, Stop 1, Stop 2, Inspect; tray; robots r1, r2, r3, r4; anomalies Overheat, Tray jam, Grasp fail, Missing part.

DQN: 100.0%

AVI: 100.0%

ASNet: 100.0%

PPO-MLP: 96.7%

GNN-RGAT: 93.3%

GNN-WMPNN: 91.7%

All learned policies exceed 91% success; DQN, AVI, and ASNet pass 60/60 episodes, while the random baseline remains below 5%.
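
To illustrate why recovery actions matter when anomalies occur with probability 0.05 per step, the toy simulation below moves one tray through four stations. It is not the dissertation's MDP: the single-tray state, the "advance" and "recover" actions, the 50-step budget, and both hand-written policies are assumptions made for this sketch. Only the anomaly names, the four-stage line, p_anomaly = 0.05, and the 60-episode evaluation size echo the text.

```python
import random

ANOMALIES = ("overheat", "tray_jam", "grasp_fail", "missing_part")

def run_episode(policy, p_anomaly=0.05, n_stations=4, max_steps=50, rng=random):
    """Simulate one tray moving down the line; True if it clears every station."""
    station, anomaly = 0, None
    for _ in range(max_steps):
        if station == n_stations:
            return True
        action = policy(anomaly)
        if anomaly is not None:
            if action == "recover":        # intervention clears the anomaly
                anomaly = None
        elif action == "advance":
            if rng.random() < p_anomaly:   # an anomaly blocks this step
                anomaly = rng.choice(ANOMALIES)
            else:
                station += 1
    return station == n_stations

def recovering_policy(anomaly):
    return "recover" if anomaly is not None else "advance"

def naive_policy(anomaly):
    return "advance"  # never intervenes, so any anomaly blocks the line for good

def success_rate(policy, episodes=60, seed=0, **kwargs):
    rng = random.Random(seed)
    return sum(run_episode(policy, rng=rng, **kwargs) for _ in range(episodes)) / episodes

print(success_rate(recovering_policy), success_rate(naive_policy))
```

In this toy setting a policy that never intervenes fails any episode in which at least one anomaly occurs, while the recovering policy loses only a step or two per anomaly and keeps the line moving.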