Thesis code: 20008

Thesis Type: M.Sc. thesis in Machine Learning, Data Science, Computer Science, Mathematics, or equivalent

Research Area: Data Science for Industrial and Societal Application

Requirements:

  • Knowledge of Python
  • Software development skills
  • Basic knowledge of data science: data analysis, data processing and machine learning
  • Basic knowledge of Natural Language Processing

Description:

One of the main applications of Natural Language Processing is the categorization of documents by topic. When a document can belong to several categories at once, the task is called multi-label classification: each instance is assigned to zero, one or more classes. It is one of the first steps towards knowledge representation, that is, the process of transforming a sequence of unstructured data into sets of linked and organized concepts.
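As a minimal sketch of what "zero, one or more classes" means in practice, the snippet below encodes each document's label set as a 0/1 vector and scores predictions with the Hamming loss, a common multi-label metric. The label names, the example label sets and the helper names (to_multi_hot, hamming_loss) are assumptions made up for illustration; they are not part of the thesis material.

```python
from typing import List, Set

# Hypothetical label set, invented for illustration only.
LABELS: List[str] = ["finance", "health", "sports"]


def to_multi_hot(doc_labels: Set[str], labels: List[str] = LABELS) -> List[int]:
    """Encode a set of labels as a 0/1 vector with one slot per known label."""
    return [1 if label in doc_labels else 0 for label in labels]


def hamming_loss(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Fraction of label slots predicted wrongly, over all documents."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


# Three hypothetical documents: one label, two labels, and no label at all.
gold = [to_multi_hot(s) for s in [{"finance"}, {"health", "sports"}, set()]]
# Hypothetical predictions from some classifier (assumed, not a real model).
pred = [to_multi_hot(s) for s in [{"finance", "health"}, {"sports"}, set()]]

print(gold)
print(hamming_loss(gold, pred))
```

Unlike single-label classification, a document with an empty label set is a valid instance here, which is why the encoding keeps an all-zero vector instead of discarding it.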

The objective of this thesis is the study and implementation of machine learning and/or deep learning algorithms for extracting and representing the knowledge contained in sentences and paragraphs. The work will proceed in order of increasing difficulty, starting from an initial categorization of the available documents through multi-label classification and ending with the representation of concepts and their interconnections by means of existing ontologies and services (such as WordNet, ConceptNet, FrameNet). Between these two ends, the candidate will explore the steps of the Natural Language Processing chain, such as tokenization, part-of-speech (POS) tagging, lemmatization, stop-word filtering, dependency parsing and named entity recognition. The candidate will be responsible both for collecting the data and for evaluating which algorithms are best suited to the case study.
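The first steps of such a processing chain can be sketched with the Python standard library alone. The example below covers only tokenization and stop-word filtering; the regular expression, the tiny stop-word list and the function names are illustrative assumptions. The later steps (POS tagging, lemmatization, dependency parsing, named entity recognition) normally rely on a trained model from a dedicated NLP library and are deliberately left out.

```python
import re
from typing import List

# Tiny illustrative stop-word list (an assumption); real lists are much larger.
STOP_WORDS = {"the", "a", "an", "of", "is", "in", "and", "to"}

# Naive token pattern: runs of letters/digits, with an optional apostrophe suffix.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?")


def tokenize(text: str) -> List[str]:
    """Split text into lowercase word tokens, dropping punctuation."""
    return [match.group(0).lower() for match in TOKEN_RE.finditer(text)]


def remove_stop_words(tokens: List[str]) -> List[str]:
    """Drop tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def preprocess(text: str) -> List[str]:
    """Chain the two steps: tokenization, then stop-word filtering."""
    return remove_stop_words(tokenize(text))


print(preprocess("The categorization of documents is a task in NLP."))
```

Keeping each step as a separate function makes it straightforward to swap a naive step for a library-backed one later without changing the rest of the chain.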

The work must be carried out with NLP algorithms, including deep learning algorithms, using a popular framework (TensorFlow, PyTorch, Keras, etc.).

Contact: send a resume with the list of exams attached to edoardo.arnaudo@linksfoundation.com, specifying the thesis code and title.