The MatExt tool lets any researcher extract material property information from a large scientific corpus of copyright-protected journal manuscripts. No coding skills or NLP knowledge are needed.


The MatExt (Material Extractor) web tool was developed to extract functional property information from a corpus of full-text scientific articles about crystalline materials. Currently, MatExt focuses on perovskite structures and operates on a corpus of 192,694 articles covering all perovskite materials, gathered from various publishers, including Springer, Elsevier, and arXiv. Note: The dataset is not available for download because it is not fully open access.


MatExt uses simple natural-language queries of the form "What is the numerical value of the property Y for material X?". The user only needs to supply the material (X) and the property (Y); the unit is optional. By default, MatExt relies on a BERT-based Question Answering (QA) model for information extraction (QA MatSciBERT), with new tools coming soon. The user obtains a statistical distribution of the values found in the literature, together with a list of source manuscript DOIs to facilitate retrieval.
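As a rough illustration of the query template and of what a "statistical distribution of values" can mean in practice, the following Python sketch fills in the template and computes basic descriptive statistics. It is not MatExt code and does not call MatExt: the function names `build_query` and `summarize`, the way the optional unit is appended, and the choice of statistics are all assumptions made for this example, and the numbers are arbitrary placeholders rather than extracted results.

```python
from statistics import mean, median, stdev
from typing import Dict, List, Optional


def build_query(material: str, prop: str, unit: Optional[str] = None) -> str:
    """Fill the question template with a material (X) and a property (Y)."""
    question = (
        f"What is the numerical value of the property {prop} "
        f"for material {material}"
    )
    # The unit is optional; appending it as "in <unit>" is an assumed convention.
    if unit:
        question += f" in {unit}"
    return question + "?"


def summarize(values: List[float]) -> Dict[str, float]:
    """Return simple descriptive statistics for a list of numerical answers."""
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        # The sample standard deviation needs at least two values.
        "stdev": stdev(values) if len(values) > 1 else 0.0,
    }


# Hypothetical usage: the material, property, and unit are example inputs.
query = build_query("CsPbI3", "band gap", unit="eV")
print(query)

# Arbitrary placeholder numbers, NOT values extracted from the literature.
placeholder_values = [1.0, 2.0, 3.0, 6.0]
print(summarize(placeholder_values))
```

In the web tool itself none of this is needed: the user only enters the material, the property, and optionally the unit.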


The MatExt methodology and the text corpus are described in detail in the following article:

  title = {Question Answering Models for Information Extraction from Perovskite Materials Science Literature},
  author = {M. Sipilä and F. Mehryary and S. Pyysalo and F. Ginter and M. Todorović},
  journal = {arXiv},
  year = {2024},
  eprint = {2405.15290},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  url = {}