Research Trends — Agentic AI & VLA

Agentic Workflows × VLA × Robotics / UAV

On a single 4090/5090 you cannot out-scale Physical Intelligence (π0), NVIDIA (GR00T), or the OpenVLA team. The winning move is to take a frozen or LoRA-adapted 7B-class VLA and make the agentic wrapper the contribution. Most of the compute then goes to inference in simulation, and the work lands on the reliability question the field currently cares about: does the policy succeed repeatedly, not just once?

Recommended: Topic A — "Self-Healing VLA". Cheapest to validate, cleanest publishable metric, clearly disjoint from existing theses. If the student specifically wants drones, Topic C (Aerial VLA) is the best UAV option.

At a glance

#	Topic	Focus	Venue
★ A	Self-Healing VLA	Agentic failure detection & recovery around a frozen policy	CoRL, ICRA, RA-L
B	Adaptive Embodied Reasoning	Gate slow chain-of-thought on uncertainty	ICRA, CoRL/ICLR workshop
C	Aerial VLA	Language-conditioned closed-loop UAV control	ICRA/IROS, RA-L
D	VLA Robustness Benchmark	Distribution-shift degradation + abstention	NeurIPS/ICLR D&B
E	Hierarchical Verified VLA	LLM planner + VLA skills + postcondition checks	CoRL

Self-Healing VLA

★ Recommended

Agentic failure detection & recovery wrapped around a frozen policy.

Problem

VLA policies fail silently mid-task — they grasp the wrong object or stall, yet keep emitting actions as if nothing is wrong.

Gap

Modern 7B VLAs have no built-in sense of "I am failing"; recovery has been studied for classical policies, not as a model-agnostic layer over today's VLAs.

Contribution

Build a plug-in supervisor — a VLM progress-estimator + an LLM recovery planner — around a frozen OpenVLA/SmolVLA that detects stalls/drift and triggers retry, re-plan, or reset. No policy retraining.
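
A minimal sketch of the supervisor's decision rule, assuming the VLM progress estimator returns a score in [0, 1] at every control step. The class name `Supervisor`, the thresholds, and the retry budget are illustrative assumptions, not values from any paper. The frozen VLA and the LLM recovery planner are left out; only the stall/drift logic that chooses between retry, re-plan, and reset is shown.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Supervisor:
    window: int = 5          # steps of history to inspect (assumed value)
    min_gain: float = 0.02   # smaller gain over the window counts as a stall
    max_drop: float = 0.15   # drop from the best progress that counts as drift
    max_retries: int = 2     # interventions allowed before escalating to reset
    history: List[float] = field(default_factory=list)
    retries: int = 0

    def step(self, progress: float) -> str:
        """Record one progress estimate and return the action to take."""
        self.history.append(progress)
        if len(self.history) < self.window:
            return "continue"
        recent = self.history[-self.window:]
        drifted = max(self.history) - progress > self.max_drop
        stalled = recent[-1] - recent[0] < self.min_gain
        if not (drifted or stalled):
            return "continue"
        if self.retries >= self.max_retries:
            return "reset"
        self.retries += 1
        self.history.clear()  # give the policy a fresh window after intervening
        return "replan" if drifted else "retry"
```

Clearing the history after an intervention keeps the supervisor from firing again on every following step. Logging each returned action against ground-truth episode outcomes is what would yield the detection precision and recall.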

Research question. By how much does a monitor-and-recover layer raise task success over the bare VLA, and how accurately does it detect failures (precision/recall)?

Resources: LIBERO, SimplerEnv. Venues: CoRL, ICRA, RA-L. Feasibility: inference-only, fits a 4090.

References

OpenVLA: An Open-Source Vision-Language-Action Model — Kim et al., CoRL 2024
REFLECT: Summarizing Robot Experiences for Failure Explanation & Correction — Liu et al., CoRL 2023
AI Agents That Matter — Kapoor et al., 2024

Adaptive Embodied Reasoning

Reason before acting — but only when the policy is uncertain.

Problem

Embodied chain-of-thought makes VLAs smarter but several times slower, because it writes long reasoning traces at every single step — too slow for real-time control.

Gap

Reasoning is applied uniformly; no one has made it conditional on whether a given step actually needs it.

Contribution

Train a cheap uncertainty gate that fires ECoT only on hard steps and acts directly otherwise; chart the accuracy-vs-latency Pareto and show most of the gain at a fraction of the tokens.
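
One way to picture the gate, as a sketch rather than the trained gate the contribution calls for: use the entropy of the policy's next-action distribution as the uncertainty signal and invoke ECoT only above a threshold. The function names and the 1.0-nat threshold are assumptions for illustration.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_reason(probs: Sequence[float], threshold: float = 1.0) -> bool:
    """Take the slow reasoning path only when the action distribution is flat."""
    return entropy(probs) > threshold


def reasoning_rate(steps: Sequence[Sequence[float]], threshold: float = 1.0) -> float:
    """Fraction of steps that would pay for a reasoning trace."""
    return sum(should_reason(p, threshold) for p in steps) / len(steps)


confident = [0.97, 0.01, 0.01, 0.01]  # peaked distribution: act directly
unsure = [0.25, 0.25, 0.25, 0.25]     # flat distribution: reason first
```

Sweeping `threshold` traces out the accuracy-versus-latency curve: each value fixes a reasoning rate, and task success is then measured at that rate.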

Research question. What is the Pareto frontier between reasoning cost, latency, and success, and can gated reasoning keep most of the accuracy at a fraction of the compute?

Resources: LIBERO + ECoT. Venues: ICRA, CoRL/ICLR workshop. Feasibility: small-VLA fine-tune + inference.

References

Robotic Control via Embodied Chain-of-Thought Reasoning (ECoT) — Zawalski et al., CoRL 2024
OpenVLA — Kim et al., CoRL 2024
τ-bench: Tool-Agent-User Interaction (pass^k consistency) — Yao et al., 2024

Aerial VLA

★ Best UAV pick

A VLA that flies — language-conditioned closed-loop UAV control.

Problem

Drones would benefit from language-driven control, but VLAs are built almost exclusively for tabletop arms.

Gap

No open VLA maps image + instruction to low-level flight actions; existing aerial work targets waypoint navigation rather than learned closed-loop control.

Contribution

LoRA-adapt a VLA to emit flight commands in Flightmare / Isaac Aerial; evaluate goal-reaching, obstacle avoidance, and scene generalization. (Keep it control, not mission planning, to stay disjoint from Mazen's thesis.)
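
A sketch of the action interface such an adaptation would need, assuming the VLA emits one discrete bin index per action dimension (OpenVLA-style tokenization with 256 bins) and the simulator accepts velocity commands. The four-dimensional layout (vx, vy, vz, yaw rate) and the limits are hypothetical.

```python
from typing import List, Sequence, Tuple

N_BINS = 256  # bins per action dimension (assumed to match the base VLA)

# Hypothetical command limits: vx, vy, vz in m/s and yaw rate in rad/s.
LIMITS: List[Tuple[float, float]] = [(-2.0, 2.0), (-2.0, 2.0), (-1.0, 1.0), (-1.0, 1.0)]


def bins_to_command(bins: Sequence[int]) -> List[float]:
    """Map one bin index per dimension to the center value of that bin."""
    if len(bins) != len(LIMITS):
        raise ValueError("expected one bin index per action dimension")
    command = []
    for index, (low, high) in zip(bins, LIMITS):
        if not 0 <= index < N_BINS:
            raise ValueError(f"bin index {index} out of range")
        fraction = (index + 0.5) / N_BINS  # bin center, strictly inside (0, 1)
        command.append(low + fraction * (high - low))
    return command
```

Using bin centers keeps every decoded command strictly inside the limits, so the policy cannot request a velocity beyond what the flight controller is configured to accept.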

Research question. Can a LoRA-adapted VLA produce reliable low-level flight actions from image + instruction, and how well does it generalize to unseen scenes?

Resources: Isaac Aerial / Flightmare / AirSim. Venues: ICRA/IROS, RA-L. Feasibility: sim + LoRA fine-tune.

References

AerialVLN: Vision-and-Language Navigation for UAVs — Liu et al., ICCV 2023
CityNav: Language-Goal Aerial Navigation Dataset — Lee et al., 2024
AirSim: High-Fidelity Visual and Physical Simulation — Shah et al., FSR 2017

VLA Robustness Benchmark

How brittle are VLAs — and can they learn to say "I'm not sure"?

Problem

Reported VLA success rates assume clean conditions; small changes in lighting, texture, or camera pose can sharply reduce them, and the drop goes unnoticed.

Gap

There is no standard distribution-shift stress test for VLAs, and no abstention mechanism to fall back on.

Contribution

Release a reproducible perturbation suite over SimplerEnv plus an out-of-distribution detector that lets the policy abstain; report degradation curves and the safety-vs-coverage tradeoff.

Research question. How much does VLA success drop under controlled perturbations, and can an OOD detector trade a little coverage for a large gain in safety?
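
The safety-vs-coverage tradeoff can be sketched as follows, assuming each evaluation episode is logged with an OOD score and a success flag, and that the policy abstains whenever the score exceeds a threshold. Success among non-abstained episodes is used here as a stand-in for safety; the episode data are made up.

```python
from typing import List, Optional, Sequence, Tuple

Episode = Tuple[float, bool]  # (OOD score, task succeeded)


def coverage_and_success(
    episodes: Sequence[Episode], threshold: float
) -> Tuple[float, Optional[float]]:
    """Coverage = share of episodes acted on; success is measured on those only."""
    acted = [ok for score, ok in episodes if score <= threshold]
    coverage = len(acted) / len(episodes)
    success = sum(acted) / len(acted) if acted else None
    return coverage, success


def tradeoff_curve(
    episodes: Sequence[Episode], thresholds: Sequence[float]
) -> List[Tuple[float, float, Optional[float]]]:
    """One (threshold, coverage, success) point per abstention threshold."""
    return [(t, *coverage_and_success(episodes, t)) for t in thresholds]


# Made-up log: failures concentrate at high OOD scores.
episodes = [(0.1, True), (0.2, True), (0.3, True), (0.7, True), (0.8, False), (0.9, False)]
```

On this toy log, abstaining above 0.5 halves coverage but lifts success on the remaining episodes from 4/6 to 1.0, which is the shape of result the research question asks about.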

Resources: SimplerEnv + perturbations. Venues: NeurIPS / ICLR D&B. Feasibility: inference-heavy, cheap.

References

Evaluating Real-World Robot Manipulation Policies in Simulation (SimplerEnv) — Li et al., CoRL 2024
OpenVLA — Kim et al., CoRL 2024
Open X-Embodiment: Robotic Learning Datasets and RT-X Models — O'Neill et al., ICRA 2024

Hierarchical Verified VLA

An LLM plans, VLA skills execute, and a verifier checks every step.

Problem

Flat VLAs collapse on long-horizon tasks because one early error silently cascades through the rest of the episode.

Gap

LLM task-planners exist, but step-level verification of each VLA skill's outcome is rarely closed-loop.

Contribution

Combine an LLM planner, a VLA skill library, and a postcondition verifier that gates each step (did the drawer actually open?); measure long-horizon success and cascade reduction on RoboCasa / LIBERO-Long.
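
The gating idea can be sketched as a plan loop that advances only when a postcondition check passes. Everything here is a toy stand-in: `run_plan`, the skill names, and the dictionary world state are hypothetical, and in the proposed system `execute` would call a VLA skill while `verify` would query a perception-based checker.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, bool]


def run_plan(
    plan: List[str],
    execute: Callable[[str, State], State],
    verify: Callable[[str, State], bool],
    state: State,
    max_attempts: int = 2,
) -> Tuple[bool, List[Tuple[str, int]]]:
    """Run skills in order; move on only when the postcondition holds."""
    log: List[Tuple[str, int]] = []
    for skill in plan:
        for attempt in range(1, max_attempts + 1):
            state = execute(skill, state)
            if verify(skill, state):
                log.append((skill, attempt))
                break
        else:
            # Postcondition never held: stop so the error cannot cascade.
            return False, log
    return True, log


# Toy stand-ins: the first "open_drawer" attempt silently fails.
calls: Dict[str, int] = {}


def toy_execute(skill: str, state: State) -> State:
    calls[skill] = calls.get(skill, 0) + 1
    new_state = dict(state)
    if not (skill == "open_drawer" and calls[skill] == 1):
        new_state[skill] = True
    return new_state


def toy_verify(skill: str, state: State) -> bool:
    return state.get(skill, False)


ok, log = run_plan(["open_drawer", "pick_cup"], toy_execute, toy_verify, {})
```

The attempt counts in `log` give a direct handle on cascade reduction: every entry with more than one attempt is an error the verifier caught before the next skill started.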

Research question. Does verified step-wise decomposition beat a flat VLA on long-horizon tasks, and how much does the verifier reduce cascading failures?

Resources: RoboCasa, LIBERO-Long. Venues: CoRL. Feasibility: frozen skills + LLM planner.

References

Do As I Can, Not As I Say (SayCan) — Ahn et al., CoRL 2022
Code as Policies: Language Model Programs for Embodied Control — Liang et al., ICRA 2023
RoboCasa: Large-Scale Simulation of Everyday Tasks — Nasiriany et al., RSS 2024

Hardware reality check

Component	4090 (24 GB)	5090 (32 GB)	Notes
OpenVLA-7B inference	✅ (bf16 ~16 GB)	✅	Frozen executor
OpenVLA-7B LoRA fine-tune	⚠️ QLoRA/offload	✅ comfortable	5090 is the better buy
SmolVLA (~0.45B, 2025)	✅ trivially	✅	Built for consumer GPUs
GR00T-N1 / TinyVLA / Diffusion Policy	✅ LoRA	✅	All fit
Sim: LIBERO, ManiSkill3, SimplerEnv, Isaac Lab	✅ RTX-accel.	✅	Single-GPU
UAV sim: Isaac Aerial / Flightmare / AirSim+PX4	✅	✅	No real hardware needed

Agentic AI × VLA × Agriculture / Sustainability / Environment

A truly embodied VLA (a harvesting arm) is possible in simulation but bottlenecked by the scarcity of agricultural sim assets, which makes it high-risk for a 12-month project. The sweet spot is the agentic tool-calling framing: an agent that orchestrates crop models, geospatial APIs, sensors, and retrieval, where the contribution is reliability, evaluation, safety, or efficiency.

Recommended: E1 (Geospatial tool-calling LAM) as the primary; A1 (Agronomic decision agent) as the strongest agriculture option; S1 (Green agentic AI) as the most original sustainability angle.

Disjointness. Avoid anything based on drone/aerial crop imagery — that is Abdullah's lane. These directions orchestrate tools or use ground-level imagery instead.

At a glance

#	Topic	Focus	Venue
★ A1	Agronomic Decision Agent	Tool-calling agronomy + verification layer	CEA; AAAI AISI; NeurIPS CCAI
A2	Crop-Disease Agent	Ground-leaf VLM + treatment retrieval + abstention	CVPR ag workshop; CEA
A3	Embodied Ag-VLA	Sim harvesting / weeding manipulation	ICRA/IROS
★ E1	Geospatial Tool-Calling LAM	Earth-Engine function calling vs GPT-4o	IGARSS; NeurIPS CCAI/D&B
E2	Env-Agent Reliability Benchmark	Cost / holdout / repeated-trial controls	NeurIPS D&B
★ S1	Green Agentic AI	Energy cut via uncertainty-gated routing	SUSCOM; IEEE T-SUSC

Agriculture

Agronomic Decision Agent

★ Recommended (agri)

A tool-calling agent that turns crop models and field data into farm advice.

Problem

Farmers get generic LLM advice that can be confidently wrong about irrigation or fertilization — costly, and sometimes unsafe for the crop.

Gap

LLMs are not grounded in validated crop models, and agentic agronomy advice has no reliability or safety evaluation.

Contribution

A tool-calling agent that invokes DSSAT/APSIM + weather/soil APIs with a constraint verifier; benchmark recommendation reliability against expert ground truth and measure the reduction in harmful advice.
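
A minimal sketch of what the constraint verifier could look like, assuming the agent's final recommendation is reduced to a few numeric fields. The field names and the bounds are placeholders chosen for illustration, not agronomic guidance; real limits would come from the crop model or an agronomist.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Recommendation:
    irrigation_mm: float    # water per irrigation event
    nitrogen_kg_ha: float   # nitrogen applied per hectare


# Placeholder bounds (assumed); replace with validated, crop-specific limits.
LIMITS: Dict[str, Tuple[float, float]] = {
    "irrigation_mm": (0.0, 60.0),
    "nitrogen_kg_ha": (0.0, 200.0),
}


def violations(rec: Recommendation) -> List[str]:
    """Return one message per field that falls outside its allowed range."""
    problems = []
    for name, (low, high) in LIMITS.items():
        value = getattr(rec, name)
        if not low <= value <= high:
            problems.append(f"{name}={value} outside [{low}, {high}]")
    return problems
```

Advice is released only when `violations` returns an empty list. Counting how often it does not, with and without the verifier in the loop, gives the harmful-advice reduction the contribution proposes to measure.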

Research question. Can a fine-tuned 7B agent match expert/ground-truth recommendations, and does step-verification measurably cut harmful advice?

Resources: DSSAT/APSIM, AgML, weather/soil APIs. Venues: Computers & Electronics in Agriculture; AAAI AISI; NeurIPS CCAI. Feasibility: tool-calling LoRA, cheap.

References

xLAM: A Family of Large Action Models — Zhang et al., NAACL 2025
ReAct: Synergizing Reasoning and Acting in LLMs — Yao et al., ICLR 2023
The DSSAT Cropping System Model — Jones et al., Eur. J. Agronomy 2003

Retrieval-Grounded Crop-Disease Agent

Diagnose a leaf from a photo, retrieve the right treatment, abstain when unsure.

Problem

Crop-disease classifiers output a label but no actionable, safe treatment — and they hallucinate confidently on diseases they have not seen.

Gap

Few systems pair diagnosis with grounded treatment retrieval and calibrated abstention on out-of-distribution cases.

Contribution

A ground-image VLM agent that diagnoses, retrieves a treatment protocol, and abstains under low confidence; measure the safe-abstention rate and the reduction in hallucinated treatments.
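
The diagnose, retrieve, abstain flow can be sketched as below, assuming the VLM exposes class probabilities and treatments live in a vetted lookup table. The labels, protocol identifiers, and the 0.7 confidence floor are hypothetical.

```python
from typing import Dict

# Hypothetical knowledge base: disease label -> vetted protocol identifier.
PROTOCOLS: Dict[str, str] = {
    "tomato_late_blight": "protocol-001",
    "apple_scab": "protocol-002",
}


def advise(class_probs: Dict[str, float], min_confidence: float = 0.7) -> Dict[str, str]:
    """Recommend a stored protocol, or abstain when unsure or ungrounded."""
    label, confidence = max(class_probs.items(), key=lambda item: item[1])
    if confidence < min_confidence:
        return {"decision": "abstain", "reason": "low confidence"}
    if label not in PROTOCOLS:
        return {"decision": "abstain", "reason": "no vetted protocol"}
    return {"decision": "treat", "label": label, "protocol": PROTOCOLS[label]}
```

Because the treatment is always looked up and never generated, a hallucinated protocol cannot be returned; the remaining risk is a confident misdiagnosis, which is what the abstention threshold is meant to limit.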

Research question. What fraction of cases can the agent safely abstain on, and does retrieval grounding reduce hallucinated treatments?

Resources: PlantVillage, PlantDoc, PlantWild. Venues: CVPR ag workshop; CEA. Feasibility: VLM LoRA, fits a 4090.

References

An Open Access Repository of Images on Plant Health (PlantVillage) — Hughes & Salathé, 2015
PlantDoc: A Dataset for Visual Plant Disease Detection — Singh et al., 2020
Visual Instruction Tuning (LLaVA) — Liu et al., NeurIPS 2023

Embodied Ag-VLA

A VLA arm for harvesting and weeding in simulation.

Problem

Agricultural manipulation (picking, weeding) is highly variable; hand-coded policies do not transfer across crop layouts.

Gap

VLAs are untested on agricultural manipulation, and dedicated sim assets barely exist.

Contribution

Build a small ag-manipulation sim on Isaac Lab and LoRA-adapt a VLA; measure cross-layout generalization. (Higher risk: asset-building is the main cost.)

Research question. Can a LoRA-adapted VLA generalize manipulation across varied crop layouts in simulation?

Resources: Isaac Lab + custom ag assets. Venues: ICRA/IROS. Feasibility: sim-asset effort is the risk.

References

OpenVLA — Kim et al., CoRL 2024
Orbit / Isaac Lab: A Unified Simulation Framework — Mittal et al., IEEE RA-L 2023
RoboCasa (large-scale manipulation sim) — Nasiriany et al., RSS 2024

Environment / Geospatial

Geospatial Tool-Calling LAM

★ Recommended (primary)

Teach a 7B model to drive Earth-observation tools better than GPT-4o.

Problem

Environmental analysts must hand-write Earth Engine code, and general LLMs generate plausible-but-broken geospatial pipelines.

Gap

No open 7B model is fine-tuned for executable Earth-observation function-calling, and frontier models are not grounded in the API.

Contribution

Synthesize an EO function-calling dataset (APIGen-style), LoRA-tune xLAM-7B, and benchmark executable-task success vs GPT-4o on deforestation / flood / land-use queries.
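
Executable-task success presupposes that a generated call is well-formed before it is ever run. A minimal sketch of that check, assuming the model emits a JSON object with `name` and `arguments`; the `forest_loss` tool and its parameters are hypothetical and not part of the Earth Engine API.

```python
import json
from typing import Dict, Tuple

# Hypothetical tool registry: tool name -> {argument name: expected type}.
TOOLS: Dict[str, Dict[str, type]] = {
    "forest_loss": {"region": str, "start_year": int, "end_year": int},
}


def validate_call(raw: str) -> Tuple[bool, str]:
    """Check a model-emitted function call against the registry."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(call, dict):
        return False, "call must be a JSON object"
    name = call.get("name")
    if not isinstance(name, str) or name not in TOOLS:
        return False, "unknown tool"
    args = call.get("arguments")
    spec = TOOLS[name]
    if not isinstance(args, dict) or set(args) != set(spec):
        return False, "argument names do not match the schema"
    for key, expected in spec.items():
        if not isinstance(args[key], expected):
            return False, f"{key} must be {expected.__name__}"
    return True, "ok"
```

The same validator can serve twice: as a filter when synthesizing the training set, and as the first stage of the benchmark, before a call is executed and its output compared with a reference.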

Research question. Does domain function-calling fine-tuning beat frontier models on executable geospatial analysis tasks?

Resources: Google Earth Engine, Sentinel/Landsat, GEO-Bench. Venues: IGARSS; NeurIPS CCAI; NeurIPS D&B. Feasibility: compute runs cloud-side, very cheap.

References

APIGen: Automated Pipeline for Function-Calling Datasets — Liu et al., NeurIPS 2024
GEO-Bench: Toward Foundation Models for Earth Monitoring — Lacoste et al., NeurIPS 2023
Google Earth Engine: Planetary-scale Geospatial Analysis — Gorelick et al., RSE 2017

Environmental-Agent Reliability Benchmark

Which environmental-agent results survive honest, cost-controlled evaluation?

Problem

Environmental-agent papers report headline gains that may not hold once evaluation is fair.

Gap

No environmental-agent benchmark enforces proper holdouts, repeated-trial consistency, and cost control (the issues raised by "AI Agents That Matter").

Contribution

Build that benchmark and re-evaluate existing agents under joint accuracy–cost–holdout controls; report which reported gains actually survive.

Research question. Which published environmental-agent gains survive joint accuracy–cost–holdout controls?
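
One way to make "survives cost control" concrete is a cost-accuracy Pareto frontier: an agent's gain survives only if no other agent is at least as accurate at no greater cost. A sketch under that assumption; the agent names and numbers are invented.

```python
from typing import List, Sequence, Tuple

Agent = Tuple[str, float, float]  # (name, cost per task, accuracy)


def pareto_frontier(agents: Sequence[Agent]) -> List[str]:
    """Names of agents that no other agent dominates on cost and accuracy."""
    frontier = []
    for name, cost, accuracy in agents:
        dominated = any(
            other_cost <= cost
            and other_acc >= accuracy
            and (other_cost < cost or other_acc > accuracy)
            for _, other_cost, other_acc in agents
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Invented results for illustration only.
agents = [
    ("small", 1.0, 0.70),
    ("debate", 9.0, 0.72),
    ("tuned", 2.0, 0.80),
    ("big", 10.0, 0.80),
]
```

Here `debate` and `big` drop out: both are matched or beaten on accuracy by the cheaper `tuned`. Repeating the analysis on held-out tasks and over repeated trials covers the other two controls.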

Resources: GEO-Bench, BigEarthNet, SpaceNet. Venues: NeurIPS Datasets & Benchmarks. Feasibility: evaluation study, cheap.

References

AI Agents That Matter — Kapoor et al., 2024
AgentBench: Evaluating LLMs as Agents — Liu et al., ICLR 2024
BigEarthNet: Remote Sensing Benchmark Archive — Sumbul et al., IGARSS 2019

Sustainability / Green AI

Green Agentic AI

★ Most original

Cut the energy and carbon of agentic pipelines without losing accuracy.

Problem

Agentic pipelines call large models repeatedly, with a large — and usually unmeasured — energy and carbon cost.

Gap

Energy is rarely measured per agentic workflow, and model routing has been studied for accuracy/cost, not for energy.

Contribution

Instrument an agentic pipeline with CodeCarbon and add uncertainty-gated small→large routing; quantify the energy saved at a fixed task-success target. A single GPU is the cleanest scale for attributing energy.
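
A sketch of the routing logic with energy accounting, assuming the small model returns an answer together with a confidence score. The per-call energy constants are placeholders; in the proposed setup they would come from measurement (for example with CodeCarbon) rather than being fixed numbers. The toy models are stand-ins.

```python
from typing import Callable, List, Sequence, Tuple


def routed_run(
    queries: Sequence[str],
    small: Callable[[str], Tuple[str, float]],
    large: Callable[[str], str],
    min_confidence: float = 0.8,
    joules_small: float = 1.0,   # placeholder energy per small-model call
    joules_large: float = 10.0,  # placeholder energy per large-model call
) -> Tuple[List[str], float]:
    """Try the small model first; escalate only when it is unsure."""
    answers, energy = [], 0.0
    for query in queries:
        answer, confidence = small(query)
        energy += joules_small
        if confidence < min_confidence:
            answer = large(query)
            energy += joules_large
        answers.append(answer)
    return answers, energy


# Toy stand-ins: the small model is only confident on short queries.
def toy_small(query: str) -> Tuple[str, float]:
    return query.upper(), (0.9 if len(query) < 5 else 0.5)


def toy_large(query: str) -> str:
    return query.upper()


queries = ["ab", "abcdef"]
answers, routed_joules = routed_run(queries, toy_small, toy_large)
always_large_joules = 10.0 * len(queries)
saving = 1.0 - routed_joules / always_large_joules
```

Escalated queries pay for both models, so routing only saves energy when the small model is confident often enough; that break-even point is part of what the study would quantify.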

Research question. How much energy can be saved at fixed task success by uncertainty-gated model routing?

Resources: τ-bench / BFCL + CodeCarbon. Venues: SUSCOM; IEEE T-SUSC; NeurIPS efficiency workshop. Feasibility: single GPU is the right scale.

References

FrugalGPT: Using LLMs While Reducing Cost — Chen et al., 2023
RouteLLM: Learning to Route LLMs with Preference Data — Ong et al., 2024
Power Hungry Processing: Watts Driving the Cost of AI Deployment? — Luccioni et al., FAccT 2024

Hardware reality check

Component	4090 (24 GB)	5090 (32 GB)	Notes
Tool-calling fine-tune (xLAM-7B / Qwen2.5-7B, LoRA)	✅ QLoRA	✅	Function-calling SFT fits easily
Crop-disease VLM (Qwen2-VL-7B / PaliGemma, LoRA)	✅	✅	Ground-level leaf images, not drone
Geospatial agent (Earth Engine / Sentinel APIs)	✅	✅	Heavy compute runs cloud-side
Green-AI energy measurement (CodeCarbon)	✅	✅	Single GPU is the correct scale
Embodied ag-VLA in Isaac Lab	⚠️	⚠️	Custom ag assets are the real cost

Medical VLA × Agentic AI × Healthcare

"Medical VLA" has two GPU-feasible readings: (1) embodied, meaning surgical/interventional robotics that controls a surgical robot in simulation; and (2) agentic clinical AI, meaning tool-calling, GUI, and decision agents. In healthcare, safety is the headline contribution, and the field uniquely rewards on-device / private models because they keep patient data on the machine.

Recommended. Robotics-inclined → M1 (Surgical VLA + safety monitor); NLP/systems-inclined → C1 (EHR GUI agent); most original → S1 (medical-agent safety red-team + defense).

Disjointness. Keep the contribution on actions, control, GUI automation, and action-safety — never on RAG answer-quality or retrieval-faithfulness evaluation, which is Safa's lane (AegisRAG).

At a glance

#	Topic	Focus	Venue
★ M1	Surgical VLA + Safety Monitor	Subtask control + enforced safety constraints	ICRA/IROS, RA-L, T-MRB
M2	Assistive / Rehab VLA	Assistive manipulation generalization	ICRA/IROS, RA-L
★ C1	EHR GUI Agent	Zero-harmful-action EHR automation	ML4H, CHIL, IEEE JBHI
C2	Clinical Decision-Support Agent	Tool-use + step-verification + abstention	CMPB, AIME, IEEE JBHI
C3	Multi-Agent Clinical Reliability	pass^k consistency + cost of debate	ML4H, NeurIPS D&B
★ S1	Medical-Agent Safety Red-Team	Harmful-action benchmark + defense	NeurIPS D&B, ML4H
S2	On-Device Private Medical Agent	Distill 7B→≤3B; privacy + efficiency	IEEE JBHI, SUSCOM, EMNLP

Medical VLA / Surgical Robotics

Surgical VLA + Agentic Safety Monitor

★ Recommended (robotics)

A VLA that performs surgical subtasks, with a supervisor that never lets it cross a line.

Problem

A surgical VLA that occasionally moves outside safe bounds is unusable — a single unsafe action can be catastrophic.

Gap

Surgical VLAs optimize task success, not enforced safety, and agentic safety supervision is unexplored in surgical simulation.

Contribution

Wrap a surgical VLA with a constraint-enforcing safety monitor + recovery; measure constraint-violation reduction versus the cost to subtask success in Orbit-Surgical / SurRoL.
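
The simplest form of a constraint-enforcing monitor is a workspace clamp. A sketch, assuming the VLA outputs a Cartesian tool-tip displacement per step and the safe region is an axis-aligned box; the function name and the box dimensions in the usage note are hypothetical.

```python
from typing import List, Sequence, Tuple


def enforce_workspace(
    position: Sequence[float],
    delta: Sequence[float],
    low: Sequence[float],
    high: Sequence[float],
) -> Tuple[List[float], bool]:
    """Clip the commanded move so the next position stays inside the box.

    Returns the safe displacement and whether the raw command violated a bound.
    """
    safe_delta, violated = [], False
    for p, d, lo, hi in zip(position, delta, low, high):
        target = p + d
        clipped = min(max(target, lo), hi)
        violated = violated or clipped != target
        safe_delta.append(clipped - p)
    return safe_delta, violated
```

For example, with the tool tip at `(0.0, 0.0, 0.05)` and a box of `low=(-0.1, -0.1, 0.0)`, `high=(0.1, 0.1, 0.2)`, a commanded z-move of `-0.10` is shortened to `-0.05` and flagged. Counting flags with the monitor off measures the base VLA's violation rate; subtask success with the monitor on measures what the clamp costs.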

Research question. How much does the agentic safety layer reduce constraint violations, and at what cost to subtask success versus the base VLA?

Resources: Orbit-Surgical, SurRoL, JIGSAWS. Venues: ICRA, IROS, RA-L, IEEE T-MRB. Feasibility: Isaac surgical sim, fits RTX.

References

ORBIT-Surgical: Learning Surgical Augmented Dexterity — Yu et al., ICRA 2024
SurRoL: dVRK-Compatible Surgical RL Platform — Xu et al., IROS 2021
JIGSAWS: Surgical Activity Dataset — Gao et al., MICCAI-W 2014

Assistive / Rehab VLA

Language-conditioned assistive manipulation for patient-care robots.

Problem

Assistive robots must adapt to each patient's body and setup; scripted controllers do not generalize.

Gap

VLAs are untested on assistive caregiving tasks across patient variation.

Contribution

LoRA-adapt a VLA on assistive tasks (feeding, repositioning, fetching) in Assistive Gym; measure cross-patient generalization.

Research question. Can a LoRA-adapted VLA generalize assistive tasks across patient configurations?

Resources: Assistive Gym. Venues: ICRA/IROS, RA-L. Feasibility: sim + LoRA fine-tune.

References

Assistive Gym: A Physics Simulation Framework for Assistive Robotics — Erickson et al., ICRA 2020
OpenVLA — Kim et al., CoRL 2024

Clinical Agentic AI

EHR GUI Agent (Zero-Harmful-Action)

★ Recommended (NLP)

An agent that operates real EHR software — provably blocked from harmful actions.

Problem

Clinicians lose hours to EHR clicking; an LLM agent that automates it could also issue harmful orders.

Gap

No EHR GUI agent is trained under a provable zero-harmful-action constraint, and there is no clinical-GUI benchmark.

Contribution

Fine-tune OS-Atlas-7B on OpenEMR/OpenMRS with Synthea (synthetic, no PHI) under a constrained action space; build a clinical-GUI benchmark; beat GPT-4o frameworks while satisfying the safety constraint.
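
A constrained action space can be sketched as a guard that every proposed GUI action must pass before it reaches the EHR. The action format, the allowed action types, and the blocked element identifiers below are hypothetical; a real deployment would derive them from the OpenEMR/OpenMRS interface.

```python
from typing import Dict

ALLOWED_TYPES = {"click", "type", "scroll", "read"}   # assumed action vocabulary
BLOCKED_TARGETS = ("sign_order", "delete_record")     # hypothetical element ids


def guard(action: Dict[str, str]) -> Dict[str, str]:
    """Pass safe actions through; turn everything else into an escalation."""
    kind = action.get("type")
    target = action.get("target", "")
    if kind not in ALLOWED_TYPES:
        return {"type": "escalate", "reason": f"action type {kind!r} is not allowed"}
    if any(blocked in target for blocked in BLOCKED_TARGETS):
        return {"type": "escalate", "reason": f"target {target!r} requires a clinician"}
    return action
```

Since the agent can only act through `guard`, blocked targets are unreachable by construction, but the guarantee is only as strong as the blocklist is complete. Enumerating harmful actions therefore belongs to the benchmark design itself.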

Research question. Can a domain-adapted 7B model beat GPT-4o-based frameworks on EHR task success while satisfying a zero-harmful-action constraint?

Resources: OpenEMR/OpenMRS + Synthea (no PHI). Venues: ML4H, CHIL, IEEE JBHI. Feasibility: GUI LoRA, fits a 4090; no IRB.

References

OS-Atlas: A Foundation Action Model for GUI Agents — Wu et al., ICLR 2025
EHRAgent: Code Empowers LLM Agents for EHR Reasoning — Shi et al., EMNLP 2024
Synthea: Synthetic Patient Generator — Walonoski et al., JAMIA 2018

Clinical Decision-Support Agent

A tool-using clinical advisor that verifies each step and abstains when unsafe.

Problem

Clinical LLMs can issue unsafe recommendations without showing their work or knowing when to defer.

Gap

Tool-use clinical agents rarely include step-verification and abstention as an explicit safety mechanism.

Contribution

An agent that calls calculators, guidelines, and drug databases, verifies each step, and abstains when unsafe; measure the reduction in unsafe recommendations at a fixed task-success level.

Research question. Does step-verification reduce unsafe recommendations at a fixed task-success level?
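The verify-then-abstain pattern can be illustrated with a single calculator step. The sketch below uses body-mass index (weight in kilograms divided by height in metres squared) as the tool, checks inputs for plausibility before the call, and recomputes the result afterwards. The plausibility ranges and function names are assumptions chosen for illustration and are not drawn from MedCalc-Bench or any clinical guideline.

```python
def bmi_tool(weight_kg, height_m):
    """Stand-in for an external calculator tool."""
    return weight_kg / height_m ** 2


def inputs_plausible(weight_kg, height_m):
    # Illustrative ranges only; they mainly catch unit mix-ups such as height given in cm.
    return 1.0 <= weight_kg <= 500.0 and 0.3 <= height_m <= 2.6


def result_verified(weight_kg, height_m, value, tol=1e-6):
    """Independent recomputation of the tool's output."""
    return abs(value - weight_kg / (height_m * height_m)) <= tol


def advise(weight_kg, height_m):
    if not inputs_plausible(weight_kg, height_m):
        return {"status": "abstain", "reason": "implausible inputs"}
    value = bmi_tool(weight_kg, height_m)
    if not result_verified(weight_kg, height_m, value):
        return {"status": "abstain", "reason": "tool output failed verification"}
    return {"status": "answer", "bmi": round(value, 1)}


print(advise(70, 1.75))  # plausible inputs
print(advise(70, 175))   # height entered in centimetres
```

Logging the abstention reason per step makes it possible to report how many unsafe recommendations were prevented and how many answerable cases were lost, which is the trade-off the research question fixes.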

MedCalc-Bench, drug DBs, guidelines	CMPB, AIME, IEEE JBHI	Tool-calling; cheap

References

MedCalc-Bench: Medical Calculation in LLMs — Khandekar et al., NeurIPS 2024
Almanac: Retrieval-Augmented LLMs for Clinical Medicine — Zakka et al., NEJM AI 2024
ReAct — Yao et al., 2023

Multi-Agent Clinical Reliability

Is multi-agent medical debate actually more reliable — and worth the cost?

Problem

Multi-agent medical "debate" is reported to boost accuracy, but its reliability and cost are unclear.

Gap

The pass^k consistency and compute cost of medical multi-agent systems are unmeasured.

Contribution

Measure pass^1 vs pass^k (success on all k repeated attempts) and the token cost of MedAgents/MDAgents on AgentClinic / MedQA; show whether the debate machinery is worth its expense.

Research question. How much does pass^1 overstate deployable reliability, and is multi-agent debate worth its added cost?
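The gap between pass^1 and pass^k can be estimated from n repeated runs per task. The sketch below assumes pass^k means the probability that all k independent attempts succeed (the convention introduced with τ-bench, distinct from pass@k), and uses the estimator C(c, k) / C(n, k) for a task with c successes in n runs. The per-task counts are placeholders, not measured results.

```python
from math import comb


def pass_hat_k(n, c, k):
    """Estimate of P(all k attempts succeed) for one task with c successes in n runs."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


def benchmark_pass_hat_k(successes_per_task, n, k):
    """Average the per-task estimate over the benchmark."""
    return sum(pass_hat_k(n, c, k) for c in successes_per_task) / len(successes_per_task)


# Placeholder counts: four tasks, eight runs each.
successes = [8, 6, 8, 2]
p1 = benchmark_pass_hat_k(successes, n=8, k=1)
p4 = benchmark_pass_hat_k(successes, n=8, k=4)
print(p1, p4)
```

With k = 1 the estimator reduces to the ordinary success rate, so the difference between the two printed values is the amount by which single-run accuracy overstates repeated-run reliability on these placeholder counts. Dividing each value by the tokens spent per run gives the cost-normalized comparison.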

AgentClinic, MedAgentBench, MedQA	ML4H, NeurIPS D&B	Inference-heavy; cheap

References

MedAgents: LLMs as Collaborators for Zero-Shot Medical Reasoning — Tang et al., ACL Findings 2024
MDAgents: Adaptive Collaboration of LLMs for Medical Decision-Making — Kim et al., NeurIPS 2024
AgentClinic: A Multimodal Agent Benchmark in Clinical Environments — Schmidgall et al., 2024

Cross-Cutting Safety & Efficiency

Medical-Agent Safety Red-Team + Defense

★ Most original

Stress-test clinical agents for harmful actions, then build a guardrail that aims to halve them.

Problem

Open medical agents can be jailbroken into harmful actions and unsafe recommendations.

Gap

There is no action-level safety red-team benchmark for clinical agents paired with a defense.

Contribution

Build a harmful-action benchmark, measure how often open agents comply, design a guardrail/refusal-grounding defense, and test whether it halves unsafe actions without utility loss.

Research question. What fraction of unsafe clinical actions do open medical agents execute, and can a defense halve it without utility loss?
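The "halve it without utility loss" criterion can be written down as a small evaluation routine. The sketch below assumes each benchmark episode is labelled with whether the agent executed the unsafe action and each benign episode with whether the task succeeded. The function name, the tolerance, and all counts are hypothetical placeholders rather than results.

```python
def rate(flags):
    """Fraction of True/1 entries in a non-empty list."""
    return sum(flags) / len(flags)


def evaluate_defense(base_unsafe, defended_unsafe, base_success, defended_success,
                     utility_tolerance=0.0):
    """Compare unsafe-action rate and benign-task success before and after a defense."""
    before, after = rate(base_unsafe), rate(defended_unsafe)
    reduction = (before - after) / before if before > 0 else 0.0
    utility_delta = rate(defended_success) - rate(base_success)
    return {
        "unsafe_before": before,
        "unsafe_after": after,
        "relative_reduction": reduction,
        "utility_delta": utility_delta,
        "halved": reduction >= 0.5,
        "utility_preserved": utility_delta >= -utility_tolerance,
    }


# Placeholder outcomes over 100 harmful prompts and 100 benign tasks.
report = evaluate_defense(
    base_unsafe=[1] * 40 + [0] * 60,
    defended_unsafe=[1] * 15 + [0] * 85,
    base_success=[1] * 70 + [0] * 30,
    defended_success=[1] * 70 + [0] * 30,
)
print(report)
```

Keeping the harmful-prompt set and the benign-task set separate avoids a defense that scores well simply by refusing everything.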

MedSafetyBench + AgentClinic	NeurIPS D&B, ML4H, IEEE S&P workshop	Benchmark + defense; cheap

References

MedSafetyBench: Evaluating & Improving Medical Safety of LLMs — Han et al., NeurIPS 2024 D&B
AgentClinic — Schmidgall et al., 2024
MedQA (USMLE) — What Disease Does This Patient Have? — Jin et al., 2021

On-Device Private Medical Agent

Shrink a medical model to the edge so patient data never leaves the device.

Problem

Cloud medical LLMs send protected health information off-device — a privacy and compliance barrier to deployment.

Gap

The utility-vs-size tradeoff for distilled on-device medical agents is poorly characterized.

Contribution

Distill a 7B medical model into a 2–3B-parameter student (roughly 2.3–3.5× smaller); quantify the utility retained, the latency, and the privacy gain of fully local inference.

Research question. What utility is retained at roughly 3× compression (7B to 2–3B parameters), and what privacy and latency benefits result?
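For reference, the soft-target term used in knowledge distillation can be sketched in a few lines. The version below is the commonly used KL form: the divergence between temperature-softened teacher and student distributions, scaled by T², following the scaling suggested by Hinton et al. (2015). It is written in plain Python for readability; the logits are made-up values, and a real run would compute this per token inside a training framework.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """T^2 * KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


def compression_ratio(teacher_params_b, student_params_b):
    """Parameter-count ratio, e.g. 7B teacher to 3B student."""
    return teacher_params_b / student_params_b


teacher = [2.0, 0.5, -1.0]  # made-up logits for illustration
student = [1.5, 0.7, -0.5]
print(distillation_loss(student, teacher), compression_ratio(7, 3), compression_ratio(7, 2))
```

The two ratios printed at the end bracket the 2–3B target: a 3B student is about 2.3× smaller than the 7B teacher and a 2B student is 3.5× smaller.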

MedQA / VQA-RAD + CodeCarbon	IEEE JBHI, SUSCOM, EMNLP Findings	Distillation; fits a 4090

References

LLaVA-Med: A Vision-Language Assistant for Biomedicine — Li et al., NeurIPS 2023
BioMistral: Open-Source Medical LLMs — Labrak et al., ACL Findings 2024
Distilling the Knowledge in a Neural Network — Hinton et al., 2015

Hardware reality check

Component	4090	5090	Notes
Surgical sim (Orbit-Surgical / SurRoL) + VLA LoRA	✅	✅	Isaac RTX-accelerated
EHR GUI agent — fine-tune OS-Atlas-7B (LoRA)	✅ QLoRA	✅	OpenEMR/OpenMRS + Synthea
Clinical tool-calling agent (xLAM-7B / Qwen2.5-7B)	✅	✅	Guidelines, drug-interaction, calculators
Medical VLM (LLaVA-Med / MedGemma / Qwen2-VL-7B)	✅	✅	Report gen, VQA
Distill medical 7B → ≤2–3B for edge	✅	✅	Privacy + efficiency
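A back-of-envelope check of the table can be done from parameter count and precision alone. The sketch below assumes 24 GB of VRAM on an RTX 4090 and 32 GB on an RTX 5090, and counts model weights only; activations, LoRA adapter gradients, optimizer state, and KV cache all come on top, so the headroom figures are upper bounds rather than a guarantee that a given run fits.

```python
# Assumed VRAM capacities in GB.
GPU_VRAM_GB = {"RTX 4090": 24, "RTX 5090": 32}


def weight_memory_gb(params_billion, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


def headroom_gb(gpu, params_billion, bits_per_param):
    """VRAM left after loading the weights; everything else must fit in this."""
    return GPU_VRAM_GB[gpu] - weight_memory_gb(params_billion, bits_per_param)


for gpu in GPU_VRAM_GB:
    for label, bits in [("fp16", 16), ("4-bit", 4)]:
        print(f"{gpu} 7B {label}: {headroom_gb(gpu, 7, bits):.1f} GB headroom")
```

Under these assumptions a 7B model in fp16 leaves about 10 GB free on a 4090, which is one plausible reading of why the table marks the EHR fine-tune as QLoRA on that card: 4-bit weights take about 3.5 GB and leave far more room for training state.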

Medical VLA × Agentic AI × Healthcare

"Medical VLA" has two readings that are feasible on a single card: (1) embodied = surgical/interventional robotics, controlling a surgical robot in simulation; and (2) clinical agentic AI = tool-calling, GUI, and decision agents. In healthcare, safety is the decisive contribution, and the field specifically rewards on-device / private models, since patient data never leaves the device.

Recommended. Robotics-leaning → M1 (Surgical VLA + safety monitor); NLP/systems-leaning → C1 (EHR GUI agent); most original → S1 (medical-agent safety red-team + defense).

Disjointness. Keep the contribution on actions, control, GUI automation, and action safety, not on evaluating RAG answer quality or retrieval faithfulness, which is Safa's area (AegisRAG).

At a glance

#	Topic	Focus	Venue
★ M1	Surgical VLA + Safety Monitor	Subtask control + safety-constraint enforcement	ICRA/IROS, RA-L, T-MRB
M2	Assistive / Rehab VLA	Generalization of assistive manipulation	ICRA/IROS, RA-L
★ C1	EHR GUI Agent	EHR automation with zero harmful actions	ML4H, CHIL, IEEE JBHI
C2	Clinical Decision-Support Agent	Tool use + verification + abstention	CMPB, AIME, IEEE JBHI
C3	Multi-Agent Clinical Reliability	pass^k consistency + debate cost	ML4H, NeurIPS D&B
★ S1	Medical-Agent Safety Red-Team	Harmful-action benchmark + defense	NeurIPS D&B, ML4H
S2	On-Device Private Medical Agent	Distill 7B → ≤3B; privacy + efficiency	IEEE JBHI, SUSCOM, EMNLP

Medical VLA / Surgical Robotics

Surgical VLA + Agentic Safety Monitor

★ Recommended (Robotics)

A VLA executes surgical subtasks, and a supervisor never lets it cross the line.

Problem

A surgical VLA that occasionally moves outside safe bounds is unusable: a single unsafe action can be catastrophic.

Gap

Surgical VLAs optimize task success, not enforced safety, and agentic safety supervision is unexplored in surgical simulation.

Contribution

Wrap a surgical VLA with a safety monitor that enforces constraints and triggers recovery; measure the reduction in constraint violations against the cost to task success in Orbit-Surgical / SurRoL.

Research question. How much does the agentic safety layer reduce constraint violations, and at what cost to subtask success versus the base VLA?

Orbit-Surgical, SurRoL, JIGSAWS	ICRA, IROS, RA-L, IEEE T-MRB	Isaac surgical sim; fits RTX

References

ORBIT-Surgical: Learning Surgical Augmented Dexterity — Yu et al., ICRA 2024
SurRoL: dVRK-Compatible Surgical RL Platform — Xu et al., IROS 2021
JIGSAWS: Surgical Activity Dataset — Gao et al., 2014

