I’m a Ph.D. candidate in Computer Science at the University of Massachusetts Lowell, advised by Professor Hadi Amiri. I study data-efficient LLM training and training dynamics through linguistic complexity signals. In 2025, I was a Research Intern at Google DeepMind, working on multilingual factuality evaluation for Gemini. Previously, I received my BS in Computer Science from KAIST.
My work connects two themes: using linguistic complexity to make training more efficient and stable, and enabling fine-grained linguistic control over model outputs.
- Data-efficient LLM training: data ordering/scheduling, data curation and selection for large-scale pretraining.
- Training dynamics & interpretability: scaling laws, learning phases, difficulty signals, capability evolution analysis.
Current Directions
- Pretraining curricula at scale: data ordering strategies for single-epoch pretraining; analyzing training dynamics and compute efficiency across model scales.
- RL for controlled generation: reward shaping for constraint satisfaction; studying how data scheduling improves RL stability.
Selected Publications
Curriculum Learning for LLM Pretraining: An Analysis of Learning Dynamics
Mohamed Elgaar, Hadi Amiri
arXiv:2601.21698 [PDF]

LingGen: Linguistic Fine-grained Controlled Generation
Mohamed Elgaar, Hadi Amiri
EACL 2026 [PDF]

Linguistically-Controlled Paraphrase Generation
Mohamed Elgaar, Hadi Amiri
Findings of EMNLP 2025 [PDF]

Ling-CL: Multiview Curriculum Learning using Linguistic Complexity
Mohamed Elgaar, Hadi Amiri
EMNLP 2023 [PDF]

HuCurl: Human-induced Curriculum Discovery
Mohamed Elgaar, Hadi Amiri
ACL 2023 [PDF]
News
- [Dec 2025] Released a preprint on curriculum learning for LLM pretraining (learning dynamics analysis).
- [Nov 2025] Linguistically-Controlled Paraphrase Generation presented at EMNLP 2025.
- [Jul 2025] MedDecXtract presented at the ACL 2025 Demo Track.
- [May 2025] Joined Google DeepMind as a Research Intern (Gemini multilingual factuality).
- [Oct 2024] Released the P-Masking / LingGen preprint on multi-attribute controlled generation.
- [Dec 2023] Presented Ling-CL at EMNLP 2023.
