Doctoral student, LISTIC, University of Savoie Mont Blanc
PERSONAL INFORMATION
Email: wassim.kharrat@univ-smb.fr
Office: A103
Address: 5 chemin de Bellevue, Annecy-le-Vieux, CS 80439, 74944 ANNECY CEDEX
Research team: ReGaRD
THESIS INFORMATION
Subject:
« SÉMARIS: Semantics and Language Models for Risk Analysis in Technical and Financial Construction Site Documents »
Keywords: LLM, NLP, Risk Management, Machine Learning, Anomaly Detection, Knowledge Extraction
Supervisors: Sébastien MONNET, Khadija ARFAOUI
Doctoral School: Science, Engineering, Environment (SIE)
Start of the thesis: December 2025
Abstract:
As digital transformation reshapes all sectors, the supervision of critical infrastructure requires tools capable of efficiently exploiting large volumes of heterogeneous data. Underground construction projects, such as tunnels, represent a particularly sensitive domain: they demand continuous monitoring to ensure worker safety, structural stability, and smooth operations. Each day, construction companies generate a large number of technical reports describing progress, incidents, sensor readings, and site conditions. Manual analysis of these documents is becoming increasingly unsustainable, which motivates automated and intelligent approaches.
Large Language Models (LLMs) open up new possibilities for the automatic analysis of technical texts. They can extract relevant information, identify inconsistencies, and detect weak signals that may indicate potential risks. However, their use in the context of construction site supervision raises several challenges: detection reliability in a critical domain, heterogeneity of technical language, multiplicity of formats and data sources, and strong contextual variability. Moreover, LLMs require significant computational resources and have a high energy footprint, raising questions about their sustainability and integration into constrained industrial environments.
To address these challenges, this research proposes combining semantic analysis and machine learning approaches to automatically detect risks and anomalies in construction site monitoring reports. The goal is to develop more compact, energy-efficient language models that are suitable for resource-limited environments. The central research question can be formulated as follows: how can a language model be enriched with semantic approaches to detect risks in unstructured technical documents from complex industrial environments?
The project is structured around four main objectives:
1. Analysis and structuring of monitoring reports: understand the structure and terminology of daily reports in order to identify their key elements (progress, incidents, technical measurements) and the risk categories to be detected. This step will help define relevant features for anomaly detection and design a knowledge base tailored to the domain.
2. Optimization and specialization of LLMs: integrate LLMs with topic modeling techniques (such as BERTopic, LITA, or LDA) to automatically extract dominant themes and track the evolution of risks. The objective is to improve weak-signal detection while reducing computational cost through model adaptation techniques: distillation, quantization, or parameter-efficient fine-tuning (PEFT) with methods such as LoRA.
3. Energy efficiency and sustainability: evaluate the impact of compression and optimization methods on model performance and energy consumption. The challenge is to reconcile analytical accuracy with a reduced carbon footprint by exploring lightweight architectures capable of operating in constrained industrial settings.
4. Validation and experimentation: apply the developed methods to real-world use cases based on actual construction site reports. This phase will assess the relevance of the models, their robustness to data variability, and their ability to provide early warnings of risky situations.
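As a rough illustration of why the adaptation techniques named in objective 2 can reduce cost, the sketch below counts the trainable parameters of a LoRA update and applies a simple symmetric int8 quantization to a list of weights. It is a minimal example in plain Python, not the project's method: the function names, the 4096 x 4096 layer size, and the rank of 8 are assumptions chosen only for illustration.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters trained by a rank-`rank` LoRA update B @ A,
    where B is d_out x rank and A is rank x d_in."""
    return rank * (d_out + d_in)


def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127].
    Returns the quantized values and the scale needed to restore them."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 values back to approximate floating-point weights."""
    return [v * scale for v in quantized]


if __name__ == "__main__":
    # Hypothetical layer size and rank, for illustration only.
    d_out, d_in, rank = 4096, 4096, 8
    full = d_out * d_in
    lora = lora_trainable_params(d_out, d_in, rank)
    print(f"full fine-tuning: {full} parameters, LoRA: {lora} parameters")

    weights = [0.5, -1.0, 0.25, 0.0]
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"int8 values: {q}, worst reconstruction error: {worst:.6f}")
```

In this toy setting, the LoRA update trains rank * (d_out + d_in) parameters instead of d_out * d_in, and each quantized weight is stored as one signed byte, with a per-weight reconstruction error of at most half the scale. Measuring how such savings affect accuracy and energy use on real reports is the subject of objective 3.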