1st Early-Career Researchers’ Day of the DATA Programme I-SITE CAP 20-25
Amphi 2
Pôle Commun
(English Version, français ci-dessous)
The Early-Career Researchers’ Day of the DATA Programme I-SITE CAP 20-25 will take place on February 26th, 2026, on the Cézeaux campus in Clermont-Ferrand. This event marks the first edition of a dedicated scientific meeting bringing together PhD students and postdoctoral researchers involved in research activities aligned with the DATA programme.
The DATA programme of I-SITE CAP 20-25 aims to develop technological equipment and software tools for the detection, collection, transfer, integration, and exploitation of large-scale data.
This one-day event aims to:
- Present the current progress of doctoral and postdoctoral research within the DATA programme,
- Encourage scientific exchange and collaboration between young researchers from different disciplines,
- Highlight interdisciplinary data science projects developed on the Clermont-Ferrand site.
The scientific programme will include standard oral presentations as well as a poster session, which will take place during the lunch break. Contributions may be presented either in English or in French, fostering open discussion and interaction across disciplines and backgrounds.
(Français)
La Journée des Doctorants et Post-doctorants du programme DATA (I-SITE CAP 20-25) se tiendra le 26 février 2026 sur le campus des Cézeaux à Clermont-Ferrand. Cette journée constitue la première édition d’un événement scientifique dédié réunissant les doctorantes, doctorants et post-doctorants impliqués dans des travaux de recherche en lien avec les thématiques du programme DATA.
Le programme DATA de l’I-SITE CAP 20-25 a pour objectif de développer des équipements technologiques et des outils logiciels permettant la détection, la collecte, le transfert, l’intégration et l’exploitation d’ensembles massifs de données.
Le site clermontois dispose de compétences scientifiques et techniques reconnues, couvrant un large spectre de problématiques de recherche fondamentale et appliquée liées aux objets connectés et au Big Data.
Cette journée a pour objectifs de :
- Présenter l’avancement des travaux de recherche des doctorants et post-doctorants,
- Favoriser les échanges scientifiques et les interactions entre équipes et disciplines,
- Mettre en lumière les projets interdisciplinaires en science des données développés sur le site clermontois.
Le programme scientifique comprendra des présentations orales ainsi qu’une session posters, organisée durant la pause déjeuner. Les présentations pourront être réalisées en français ou en anglais, afin de favoriser une participation large et inclusive.
-
-
09:00
→
09:30
Welcome Coffee 30m Amphi 2
Amphi 2
Pôle Commun
Université Clermont Auvergne Campus des Cézeaux, 63170 Aubière -
09:30
→
10:30
Plenary Session: Pierre Noyel (Limagrain Europe): Artificial Intelligence for plant genomic prediction Amphi 2
Amphi 2
Pôle Commun
Université Clermont Auvergne Campus des Cézeaux, 63170 Aubière
Session chair: Tidenek Fekadu Kore
-
10:30
→
10:50
Coffee Break 20m Amphi 2
Amphi 2
Pôle Commun
Université Clermont Auvergne Campus des Cézeaux, 63170 Aubière -
10:50
→
11:10
Applied Machine Learning Amphi 2
Amphi 2
Pôle Commun
Université Clermont Auvergne Campus des Cézeaux, 63170 Aubière
Session chair: Durande Berluskoni KAMGA NGUIFO (Université Clermont Auvergne)
-
10:50
Jet Classification with Particle Transformers: A Multiclass Learning Approach 20m
In high-energy collisions, jets, which are collimated sprays of particles, can originate from various fundamental particles, including W and Z bosons, top quarks, and the Higgs boson. Accurately identifying these jets is crucial for studying Standard Model processes and investigating new physics beyond its framework. This study, conducted within the ATLAS collaboration at the Large Hadron Collider, focuses on multi-class jet tagging utilizing the Particle Transformer (ParT). ParT employs attention mechanisms to capture correlations among jet constituents, the particles that constitute a jet. By representing jets as unordered sets of particles, ParT achieves superior discriminative performance compared to other constituent-based architectures such as ParticleNet and PFN. Its performance is evaluated across multiple jet classes, demonstrating robustness under various Monte Carlo generators and against binary classifiers, thereby showcasing both high accuracy and stability. These findings underline the ability of attention-based transformers to efficiently process unordered data, unveil valuable insights into feature representation, and exhibit satisfactory performance when extended from binary to multi-class jet classification.
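The claim that attention handles jets as unordered sets can be illustrated with a toy example. The sketch below is not the Particle Transformer: it uses a single attention head with identity projections followed by mean pooling, and the four features per particle are hypothetical random numbers. It only shows that, without positional encodings, the pooled output does not depend on the order in which the constituents are listed.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(particles):
    """Toy single-head self-attention (identity projections) plus mean pooling.

    `particles` is a list of equal-length feature vectors, one per constituent.
    """
    d = len(particles[0])
    scale = math.sqrt(d)
    attended = []
    for query in particles:
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in particles]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[j] for w, value in zip(weights, particles)) for j in range(d)]
        )
    n = len(attended)
    return [sum(row[j] for row in attended) / n for j in range(d)]


# Hypothetical jet: 6 constituents with 4 made-up features each.
rng = random.Random(0)
jet = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]
shuffled = jet[:]
rng.shuffle(shuffled)

pooled = attention_pool(jet)
pooled_shuffled = attention_pool(shuffled)

# Largest difference between the two pooled vectors (floating-point noise only).
max_diff = max(abs(a - b) for a, b in zip(pooled, pooled_shuffled))
print(max_diff)
```

The difference is limited to floating-point rounding, because both the attention weights and the mean are sums over the set of constituents.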
-
11:10
→
11:50
Anomaly Detection Amphi 2
Amphi 2
Pôle Commun
Université Clermont Auvergne Campus des Cézeaux, 63170 Aubière
Session chair: Durande Berluskoni KAMGA NGUIFO (Université Clermont Auvergne)
-
11:10
OnlineBootKNN: An Unsupervised Framework for Detecting Anomalies in Spectral Data Streams 20m
Field: AI, Affiliation: UCA
Monitoring the elemental composition of materials in order to detect abnormal conditions in real time is essential for applications like manufacturing quality control, environmental monitoring, and space exploration. This is achieved using sensors that analyze the interaction of a material with electromagnetic radiation, producing spectral data streams, that is, sequences of instances in which each instance is an ordered set of wavelengths with associated intensities. While many unsupervised anomaly detection methods exist for tabular streaming data, their applicability to spectral streams remains underexplored. To address this gap, we consider our spectra in a multivariate stream setting and benchmark the performance of state-of-the-art tabular anomaly detection methods on this data. Furthermore, we introduce OnlineBootKNN, a novel unsupervised framework that combines k-nearest neighbors with online bootstrapping and a z-score test to detect anomalies in real time. We demonstrate the high performance and robustness of our method, as well as the efficacy of the autoencoder-based method, KitNet, on newly simulated real-world spectral datasets. In addition, we compare their efficiency against the other tested techniques. Finally, we highlight the inherent interpretability of OnlineBootKNN, which is crucial for identifying the specific wavelengths, and thus elements, responsible for a detected anomaly.
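To make the three ingredients named in the abstract more concrete (k-nearest neighbors, online bootstrapping, z-score test), here is a minimal sketch of how they could be combined on a stream. It is not the authors' OnlineBootKNN implementation. The class name, the Poisson(1) resampling scheme, the sliding windows, the one-sided test, and all default parameters are assumptions chosen for illustration, and the "spectra" are random 3-value vectors.

```python
import math
import random
from collections import deque


def poisson1(rng):
    """Draw from a Poisson(1) distribution (Knuth's multiplication method)."""
    limit = math.exp(-1.0)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


class StreamingKnnDetector:
    """Hypothetical detector: bootstrap ensemble of kNN scorers plus a z-score test."""

    def __init__(self, k=3, n_models=5, window=50, z_threshold=3.0, warmup=20, seed=0):
        self.k = k
        self.z_threshold = z_threshold
        self.warmup = warmup
        self.rng = random.Random(seed)
        # One sliding window of past instances per ensemble member.
        self.windows = [deque(maxlen=window) for _ in range(n_models)]
        # Running mean and variance of past scores (Welford's algorithm).
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def _knn_score(self, window, x):
        """Mean distance from x to its k nearest neighbors in one window."""
        if len(window) < self.k:
            return None
        nearest = sorted(math.dist(x, y) for y in window)[: self.k]
        return sum(nearest) / self.k

    def process(self, x):
        """Score one instance, test it, then learn from it if it looks normal."""
        raw = (self._knn_score(w, x) for w in self.windows)
        scores = [s for s in raw if s is not None]
        is_anomaly = False
        if scores:
            score = sum(scores) / len(scores)
            if self.n >= self.warmup:
                std = math.sqrt(self.m2 / (self.n - 1))
                z = (score - self.mean) / std if std > 0 else 0.0
                is_anomaly = z > self.z_threshold  # one-sided test (assumption)
            if not is_anomaly:
                self.n += 1
                delta = score - self.mean
                self.mean += delta / self.n
                self.m2 += delta * (score - self.mean)
        if not is_anomaly:
            # Online bootstrap: each member sees the instance Poisson(1) times.
            for w in self.windows:
                for _ in range(poisson1(self.rng)):
                    w.append(x)
        return is_anomaly


# Synthetic stream: 200 "normal" spectra, then one obvious outlier.
data_rng = random.Random(42)
detector = StreamingKnnDetector()
flags = [detector.process([data_rng.gauss(0, 1) for _ in range(3)]) for _ in range(200)]
outlier_flag = detector.process([10.0, 10.0, 10.0])
print(outlier_flag)
```

In this sketch, flagged instances are not added to the windows, so a burst of anomalies does not contaminate the reference data. Whether the actual method makes the same choice is not stated in the abstract.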
-
11:30
Characterization and anomaly detection in daily cow activities using wavelet-based features 20m
Anomaly detection in the day-to-day activity of dairy cows is challenging, as true abnormal behavior must be distinguished from individual variability and the animals’ endogenous rhythms. Current algorithms for anomaly detection in time series include various techniques, with neural network-based methods being the most prominent. However, these approaches lack interpretability, which is crucial in precision livestock farming. This work proposes extracting interpretable features using wavelet transforms, enabling better time-frequency analysis of the endogenous rhythm compared to abnormal rhythms. The results show that some wavelets have a positive impact on performance and align with expert knowledge.
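As an illustration of what interpretable wavelet-based features can look like, the sketch below computes the energy of the detail coefficients at each scale of a Haar decomposition. The abstract does not say which wavelets or features are used, so the Haar wavelet, the function names, and the 24-value activity series are assumptions made for this example.

```python
def haar_step(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    s = 2 ** 0.5
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def wavelet_energy_features(signal, levels):
    """Energy of the detail coefficients per level, plus the final approximation."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    features = {}
    approx = list(signal)
    for level in range(1, levels + 1):
        approx, detail = haar_step(approx)
        # Level 1 captures the fastest variations, higher levels slower ones.
        features[f"detail_{level}"] = sum(c * c for c in detail)
    features["approx"] = sum(c * c for c in approx)
    return features


# Hypothetical hourly activity level of one cow over 24 hours.
activity = [1, 1, 1, 1, 2, 4, 6, 7, 6, 5, 5, 6, 6, 5, 4, 5, 7, 8, 6, 4, 2, 1, 1, 1]
features = wavelet_energy_features(activity, levels=3)
total_energy = sum(v * v for v in activity)
print(features)
```

Because the transform is orthonormal, the feature values sum to the total energy of the signal, so each feature can be read as the share of activity variation explained by one time scale.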
-
11:50
→
12:20
Poster Flash Talks Amphi 2
Amphi 2
Pôle Commun
Université Clermont Auvergne Campus des Cézeaux, 63170 Aubière
Session chair: Emmanuel Gangler (LPCA)
-
12:20
→
14:00
Lunch Break - Buffet & Poster Session 1h 40m Salle de pause Pôle Commun (Campus des Cézeaux)
Salle de pause Pôle Commun
Campus des Cézeaux
-
14:00
→
14:20
Anomaly Detection Amphi 2
Amphi 2
Pôle Commun
Université Clermont Auvergne Campus des Cézeaux, 63170 Aubière
Session chair: Andrés Duque (PhD Student)
-
14:00
Development of Statistical and Predictive Models for Anomaly Detection and Quality Control Improvement Based on Heterogeneous Industrial Data in Medical Device Manufacturing 20m
This postdoctoral research is conducted within the Laboratory of Computer Science, Modeling and Optimization of Systems (LIMOS) and falls within one of its research themes, Data, Services and Intelligence (DSI), in close alignment with the activities of the Department of Mathematical and Industrial Engineering. It focuses on the optimization of quality control processes in medical textile manufacturing through the analysis of large-scale, heterogeneous industrial data generated by increasingly digitalized production systems. The project is carried out in collaboration with the company Thuasne and the École des Mines de Saint-Étienne, within the Institut Henri Fayol.
The main objective of the project is the identification of anomalies and their root causes in order to better understand performance deviations, defect propagation across production stages, and interactions that influence the final quality of medical devices. By addressing defects such as appearance flaws and dimensional nonconformities, this work seeks to reduce losses related to non-quality, including waste of finished products, chemicals, and water, while improving the efficiency of quality control procedures and reducing the environmental impact of manufacturing processes, in line with sustainable development principles.
The first results obtained in this work correspond to an initial phase of the project, following the collection and analysis of a subset of the available production data. This preliminary study is based on annual production data from circular knitting machines used to manufacture compression socks for the treatment of chronic venous insufficiency. From indicators such as defect rate, machine event rate, and defect rate per event, a quadrant-based analysis was developed to provide an initial classification of machine performance and risk levels. This intuitive approach offers a first, interpretable framework for anomaly detection and establishes the foundation for more comprehensive analyses as additional data are progressively integrated.
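A quadrant-based analysis of this kind can be sketched as follows. The abstract does not give the thresholds or the quadrant names, so the median cut-offs, the four labels, and the machine figures below are all hypothetical.

```python
from statistics import median


def quadrant_classification(machines):
    """Assign each machine to a quadrant of the (defect rate, event rate) plane.

    `machines` maps a machine name to a (defect_rate, event_rate) pair.
    Thresholds are the medians of each indicator (an assumption for this sketch).
    """
    defect_cut = median(defect for defect, _ in machines.values())
    event_cut = median(event for _, event in machines.values())
    labels = {}
    for name, (defect, event) in machines.items():
        high_defect = defect > defect_cut
        high_event = event > event_cut
        if high_defect and high_event:
            labels[name] = "high risk"
        elif high_defect:
            labels[name] = "quality issue"
        elif high_event:
            labels[name] = "frequent events"
        else:
            labels[name] = "stable"
    return labels


# Made-up indicators: (defect rate, machine events per 1000 items).
machines = {
    "M1": (0.01, 2.0),
    "M2": (0.08, 9.0),
    "M3": (0.07, 1.0),
    "M4": (0.02, 8.0),
}
labels = quadrant_classification(machines)
print(labels)
```

With these values the cut-offs are 0.045 for the defect rate and 5.0 for the event rate, so each machine lands in a different quadrant.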
To further enhance this initial analysis, multivariate data analysis techniques were applied. Principal Component Analysis (PCA) captured more than 96% of the total variance, revealing latent structures related to operational load and event severity. The combination of PCA with K-means clustering enabled a robust, data-driven classification of machine behavior, clearly isolating anomalous machines while identifying those with stable and near-optimal performance.
The machine classifications and performance indicators will be used to track defect propagation across production stages, to investigate relationships between different quality control processes, and to identify interactions that may contribute to the emergence of specific defects.
Overall, this work represents a first step toward advanced predictive quality control strategies and defect-specific monitoring. It highlights the potential of performance indicators, data-driven modeling, and artificial intelligence approaches to address concrete industrial challenges, while contributing to the digital and sustainable transition of the medical device manufacturing industry.
-
14:20
→
15:20
Foundations and Sustainability in Machine Learning Amphi 2
Amphi 2
Pôle Commun
Université Clermont Auvergne Campus des Cézeaux, 63170 Aubière
Session chair: Andrés Duque (PhD Student)
-
14:20
Contribution of Gradual Patterns to Explainable AI 20m
Ensemble models and deep neural networks achieve very good results on classification tasks. However, their "black box" nature prevents their widespread deployment in critical domains such as healthcare. Explainable AI aims to make these models more understandable. In the literature, classification results are mainly explained by attributing feature importance or by generating counterfactuals from neighborhoods that are generated randomly, by genetic algorithms, or from expert knowledge. In this paper, we show how gradual patterns make it possible to generate plausible neighborhoods without expertise, producing explanations that are better suited to individual instances. We extend our method with a complete analysis of experimental results comparing our approach with LIME, LORE, and SHAP.
-
14:40
Fluctuations and concentration in two-layer neural networks 20m
We study the learning dynamics of wide two-layer neural networks trained by stochastic gradient descent (SGD), aiming to understand quantitatively how network width shapes both the typical training trajectory and the variability of the final predictor.
We adopt an interacting particle viewpoint in which neurons evolve under SGD as a large coupled system. As the width grows, this collective dynamics is well approximated by a deterministic mean-field limit, which provides an analytically tractable description of how the parameter distribution (and hence predictions) evolves during training.
We then quantify finite-width effects through two complementary results. First, we characterize fluctuations around the mean-field limit: after the natural rescaling, we show that the deviations converge to a Gaussian limiting process, yielding an explicit description of the variability induced by training randomness. Second, we establish finite-width concentration inequalities, uniform over training time, which control with high probability how close a width-N network remains to its mean-field proxy.
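For readers unfamiliar with this setting, the objects mentioned above are usually written as follows in the mean-field literature. The notation ($a_i$, $w_i$, $\sigma$, $\mu_t$, $\eta_t$, the distance $d$, the bound $\delta$) and the continuous time index are assumptions for illustration; the abstract does not specify the exact parameterization, scaling, or form of the bounds.

```latex
% Two-layer network of width N in the mean-field scaling
\[
  f_N(x) = \frac{1}{N} \sum_{i=1}^{N} a_i \, \sigma(\langle w_i, x \rangle),
  \qquad \theta_i = (a_i, w_i).
\]
% Empirical measure of the neurons during training and its large-width limit
\[
  \mu^N_t = \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta_i(t)}
  \;\xrightarrow[N \to \infty]{}\; \mu_t .
\]
% Fluctuations under the central-limit rescaling, with a Gaussian limit
\[
  \eta^N_t = \sqrt{N} \, \bigl( \mu^N_t - \mu_t \bigr)
  \;\Longrightarrow\; \eta_t .
\]
% Schematic shape of a concentration bound that is uniform over training time
\[
  \mathbb{P}\Bigl( \sup_{t \le T} d\bigl(\mu^N_t, \mu_t\bigr) \ge \varepsilon \Bigr)
  \le \delta(N, \varepsilon, T),
  \qquad \delta(N, \varepsilon, T) \to 0 \ \text{as } N \to \infty .
\]
```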
-
15:00
Contributing to reproducible software measurements of energy consumption for machine learning 20m
This work is part of the AI domain of the DATA programme and concerns, more specifically, sensors for measuring energy consumption and machine learning algorithms. It is being carried out at the LIMOS laboratory.
High-performance computing requires careful management of power and energy budgets. Much of the work needed to meet these energy-consumption goals will have to come from hardware improvements. Responsibility for using the available hardware tools, however, lies with both software and applications. Software measurements must be available to evaluate the consumption of applications, particularly high-performance computing applications in the context of machine learning. We will use these software measurement tools to generate reference data that will serve as the basis for further research into how to reduce energy consumption.
-
15:20
→
15:40
Coffee Break 20m Amphi 2
Amphi 2
Pôle Commun
Université Clermont Auvergne Campus des Cézeaux, 63170 Aubière -
15:40
→
16:00
Foundations and Sustainability in Machine Learning Amphi 2
Amphi 2
Pôle Commun
Université Clermont Auvergne Campus des Cézeaux, 63170 Aubière
Session chair: Paul Stos (Université Clermont-Auvergne)
-
15:40
Towards Sustainable DBMS: A Framework for Real-Time Energy Estimation and Query Categorization 20m
Energy efficiency in database management systems (DBMS) is increasingly critical due to the rising computational demands of modern applications. Our work proposes a complete framework to analyze energy consumption. We developed a real-time monitoring framework that captures CPU and memory utilization during query execution and estimates energy consumption. We have implemented a query logging mechanism to track and analyze execution time. We propose an energy estimation model that computes power consumption using CPU utilization metrics and query categorization based on energy usage profiles. We studied the correlation between execution time and energy consumption using Pearson correlation. We propose a power-based classification of SQL query types, enabling more energy-aware optimization strategies. The results of our analysis highlight opportunities for power-aware query optimization, making DBMS operations greener and more efficient.
Keywords: Energy-Efficient Computing, Database Management Systems (DBMS), Query Optimization, Power Consumption Analysis, Energy Estimation Model, Green Computing, Workload Profiling.
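The pipeline described in this abstract (power estimated from CPU utilization, energy per query, categorization, Pearson correlation with execution time) can be sketched in a few lines. This is not the authors' model: the linear power model, the idle and peak wattages, the category thresholds, and the three query traces are assumptions made for illustration.

```python
import math


def estimate_energy_joules(cpu_utilization, interval_s, idle_w=50.0, max_w=150.0):
    """Integrate a linear power model P(u) = idle + (max - idle) * u over samples.

    `cpu_utilization` holds one utilization value in [0, 1] per sampling interval.
    The wattages are hypothetical placeholders, not measured values.
    """
    return sum((idle_w + (max_w - idle_w) * u) * interval_s for u in cpu_utilization)


def categorize(energy_j, low=100.0, high=500.0):
    """Bucket a query by estimated energy (thresholds are arbitrary)."""
    if energy_j < low:
        return "low"
    if energy_j < high:
        return "medium"
    return "high"


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Made-up traces: CPU utilization sampled once per second during each query.
interval_s = 1.0
traces = {
    "point_select": [0.2],
    "join": [0.6, 0.7, 0.6],
    "aggregate_scan": [0.9, 0.9, 0.9, 0.9, 0.9],
}
energies = {name: estimate_energy_joules(u, interval_s) for name, u in traces.items()}
categories = {name: categorize(e) for name, e in energies.items()}
durations = [len(u) * interval_s for u in traces.values()]
r = pearson(durations, list(energies.values()))
print(categories, r)
```

In this toy data, longer queries also run at higher utilization, so execution time and estimated energy are strongly correlated. Real workloads may of course behave differently, which is why measuring the correlation is of interest.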
-
16:00
→
16:40
Applied Machine Learning Amphi 2
Amphi 2
Pôle Commun
Université Clermont Auvergne Campus des Cézeaux, 63170 Aubière
Session chair: Paul Stos (Université Clermont-Auvergne)
-
16:00
Stylistic-STORM: Self-Supervised Spectral Disentanglement Using Adversarial Learning and JEPA for Weather Analysis 20m
Autonomous driving requires vehicles to perceive their environment (vehicles, pedestrians, traffic lights, etc.) and to remain reliable under changing conditions. During a trip, for example from Clermont-Ferrand to Paris, the weather can shift quickly from clear skies to rain, fog, or heavy snow. These degraded conditions do not only reduce visibility: they also alter road grip (wet surface, snow, black ice) and directly affect safety. This is precisely the goal of the ROADVIEW project (Robust Automated Driving in Extreme Weather), which funds our work: enabling robust perception in extreme weather by estimating not only objects but also the weather and its effects. However, self-supervised learning (SSL) methods such as MoCo or DINO learn robust features by making representations invariant to appearance variations. Their augmentations and invariance objectives thus push the model to attenuate effects such as illumination, reflectance, or the micro-textures associated with rain and snow. Yet for weather analysis, these appearance cues are often the discriminative signal (rain streaks, snow graininess, atmospheric scattering, reflections and halos): making representations invariant to them can therefore remove the relevant information. We introduce ST-STORM, a hybrid SSL framework that models weather as a style component to be disentangled from content. The architecture learns two latent streams regulated by gating mechanisms: (i) a Content branch, built on a JEPA architecture coupled with a contrastive objective to ensure semantic stability, and (ii) a Style branch, constrained to encode weather signatures (notably spectral ones) through JEPA-style feature prediction and a reconstruction guided by adversarial synthesis and frequency-domain (FFT) losses.
We evaluate ST-STORM on the fine-grained detection of weather attributes (type, intensity, visibility, ground condition). Pretrained on 250,000 images from Weather MultiTask Datasets, then frozen and fine-tuned in a multitask setting on 10% of the data (25,000 images), our model reaches 96% (overall score) by exploiting its stylistic features, compared with 87% (overall F1) for JEPA and ∼90% (overall F1) for MoCo-v3 under an identical protocol. These results show that appearance cues are central to weather analysis, and that spectral disentanglement preserves them while maintaining a robust content representation for the analysis of complex scenes.
-
16:20
A simple self-supervised model for modeling human visual learning 20m
Children quickly develop powerful visual representations that support visual recognition, such as object recognition, with minimal supervision. However, the principles underpinning this development are poorly understood. In particular, it is unclear how natural experience interacts with unsupervised learning mechanisms to shape semantic representations.
This study explores whether the daily experience of adult humans, combined with two prominent theories of unsupervised learning, is sufficient to support the emergence of strong visual representations. We simulate key aspects of human visuomotor experience with 300 hours of videos collected with head-mounted cameras, combined with eye and body movements of the camera wearer. As a computational model of human learning, we train a bio-inspired class of machine learning models that learns temporally consistent visual representations and aligns visual representations with co-occurring body movements.
We show that the resulting representations achieve strong performance on a range of downstream tasks, including object recognition, scene recognition, and out-of-distribution generalization, despite the absence of explicit semantic supervision. Our analysis shows that naturalistic eye and body movements favor the encoding of shape-based features over texture-based cues, mirroring biases observed in early visual development in toddlers. Moreover, we observe that a bio-inspired emphasis on central vision promotes $16\times$ more computationally efficient learning, with minimal impact on the quality of semantic representations.
Taken together, these findings suggest that sensorimotor learning may be a key principle underlying the development of robust and generalizable visual representations. This research belongs to the machine learning component of the DATA programme and is carried out at the Institut Pascal.
-
16:40
→
17:00
Final Discussion 20m Amphi 2
Amphi 2
Pôle Commun
Université Clermont Auvergne Campus des Cézeaux, 63170 Aubière