A selected list of my projects. Some were done as part of a university course and others as side projects.
This course was composed of theoretical lectures and a machine learning challenge.
I worked on an intent prediction problem for drug-related questions, proposed by the healthcare startup Posos. The goal was to classify real user questions into one of 52 classes representing the type of drug-related information they are looking for.
The goal of this project was to apply graph neural networks to the influence maximization (IM) problem on a graph under the independent cascade (IC) assumption (more details in the report). The project was done in collaboration with Hind Dadoun.
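To make the IM/IC setting concrete, here is a minimal sketch of an independent cascade simulation with a Monte Carlo greedy seed selection. This is the classic baseline, not the graph neural network approach used in the project. The graph representation (a dict of adjacency lists), the uniform activation probability `p`, and all function names are assumptions made for illustration.

```python
import random


def simulate_ic(graph, seeds, p=0.1, rng=None):
    """Run one independent cascade from `seeds` and return the activated nodes.

    `graph` maps each node to a list of its out-neighbors (assumed format).
    """
    rng = rng or random.Random()
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in graph.get(u, ()):
                # A newly activated node gets a single chance to activate
                # each of its still-inactive neighbors, with probability p.
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active


def expected_spread(graph, seeds, p=0.1, n_runs=1000, seed=0):
    """Estimate the expected number of activated nodes by Monte Carlo."""
    rng = random.Random(seed)
    total = sum(len(simulate_ic(graph, seeds, p, rng)) for _ in range(n_runs))
    return total / n_runs


def greedy_seeds(graph, k, p=0.1, n_runs=200, seed=0):
    """Pick k seeds greedily, each time adding the node with the best estimated spread."""
    nodes = sorted(set(graph) | {v for nbrs in graph.values() for v in nbrs})
    chosen = []
    for _ in range(k):
        best = max(
            (n for n in nodes if n not in chosen),
            key=lambda n: expected_spread(graph, chosen + [n], p, n_runs, seed),
        )
        chosen.append(best)
    return chosen
```

For example, `greedy_seeds({0: [1, 2], 1: [3], 2: [], 3: []}, k=1, p=1.0)` selects node 0, since every other node reaches fewer nodes when all activations succeed.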
This project was part of the course Mathematical Foundations Of Data Science taught by Gabriel Peyré in 2019.
The goal was to reproduce and extend results from  to other optimization algorithms, including SAGA. After comparing the regularized optimal transport convergence results for these algorithms, I studied an optimal school placement and allocation problem in several French regions.
Data Journalism Extractor
This project is an attempt to build a tool that helps journalists with limited programming knowledge extract and process data at scale from multiple heterogeneous data sources, while leveraging powerful database, information extraction, and NLP tools.
This software is based on Apache Flink, a stream processing framework written in Java and Scala and similar to Spark. It executes dataflow programs, is highly scalable, and integrates easily with other Big Data frameworks and tools such as Kafka, HDFS, YARN, Cassandra, or ElasticSearch.
Although you can write custom dataflow programs that suit your specific needs, you don't need to know programming, Flink, or Scala to use this tool and build complex dataflow programs that perform operations such as:
- Extract data from relational databases (Postgres, MySQL, Oracle), NoSQL databases (MongoDB), CSV files, HDFS, etc.
- Use complex processing tools such as soft string-matching functions, link extractions, etc.
- Store outputs in multiple different data sinks (CSV files, databases, HDFS, etc.)
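The three kinds of operations above (extract, process, store) can be sketched in a few lines of plain Python. This is not the tool's Flink implementation or its configuration format: the function names, the in-memory CSV text, and the similarity threshold are all assumptions chosen for illustration, and the soft string matching uses the standard library's `difflib` rather than whatever matcher the tool ships with.

```python
import csv
import io
from difflib import SequenceMatcher


def extract_csv(text):
    """Source: parse CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def soft_join(left, right, left_key, right_key, threshold=0.8):
    """Processing: join rows whose key strings are similar enough."""
    for left_row in left:
        for right_row in right:
            score = SequenceMatcher(
                None, left_row[left_key].lower(), right_row[right_key].lower()
            ).ratio()
            if score >= threshold:
                yield {**left_row, **right_row, "score": round(score, 2)}


def to_csv(rows):
    """Sink: serialize the joined rows back to CSV text."""
    rows = list(rows)
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical inputs: a list of companies and mentions found in articles.
companies = extract_csv("name,city\nAcme Corp,Paris\n")
mentions = extract_csv("mention,source\nacme corp.,article-1\nGlobex,article-2\n")
matches = list(soft_join(companies, mentions, "name", "mention"))
report = to_csv(matches)
```

Here "Acme Corp" and "acme corp." are joined despite the differences in case and punctuation, while "Globex" is left out because it falls below the threshold.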
Documentation about the project is available at this link.
I used a concept from , which uses the earth mover's distance (EMD) as a metric between documents represented as normalized bags of words. The underlying transport cost between two words is given by their distance in a pre-trained word vector space. The app trains word vectors on all arXiv abstracts and uses the EMD-based metric to compute similarities between papers.
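As an illustration of the metric described above, the following sketch computes the EMD between two tokenized documents by solving the transport problem as a linear program with SciPy. It is a minimal version under stated assumptions, not the app's code: the `vectors` dict stands in for the trained word vectors, the Euclidean distance is assumed as the word-to-word cost, and the function names are hypothetical. A dense linear program like this only scales to short documents.

```python
import numpy as np
from scipy.optimize import linprog


def nbow(tokens, vocab):
    """Normalized bag-of-words: word frequencies over `vocab`, summing to 1."""
    counts = np.array([tokens.count(w) for w in vocab], dtype=float)
    return counts / counts.sum()


def emd_distance(doc_a, doc_b, vectors):
    """Earth mover's distance between two token lists.

    `vectors` maps each word to its embedding (assumed to be given).
    """
    vocab = sorted(set(doc_a) | set(doc_b))
    a, b = nbow(doc_a, vocab), nbow(doc_b, vocab)

    # Transport cost between two words: Euclidean distance of their vectors.
    X = np.array([vectors[w] for w in vocab], dtype=float)
    cost = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # Flow matrix T is flattened row-major; rows must sum to a, columns to b.
    n = len(vocab)
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0  # mass leaving word i of doc_a
        A_eq[n + i, i::n] = 1.0           # mass arriving at word i of doc_b
    result = linprog(
        cost.ravel(),
        A_eq=A_eq,
        b_eq=np.concatenate([a, b]),
        bounds=(0, None),
        method="highs",
    )
    return result.fun
```

With toy vectors such as `{"a": [0, 0], "b": [3, 4]}`, the distance between the documents `["a"]` and `["b"]` is the distance between the two word vectors, since all the mass has to move from one word to the other.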