matminer

matminer is an open-source Python library for data mining and analysis in materials science. It aims to make state-of-the-art statistical and machine learning algorithms applicable to materials science data with just a few lines of code. The library is still under active development, but the code is already usable.

Installing matminer

Install matminer by following our short installation tutorial.

Overview

The workflow below shows the tools and utilities available in matminer and how they can be combined with each other, and with external libraries, in your own materials data mining or analysis study.


_images/Flowchart.png


Take a tour of matminer’s features by scrolling down!

Data retrieval tools

Retrieve data from the biggest materials databases, such as the Materials Project and Citrine’s databases, as a Pandas dataframe.

The MPDataRetrieval and CitrineDataRetrieval classes retrieve data from the biggest open-source materials database collections, those of the Materials Project and Citrine Informatics, respectively, and return it as a Pandas dataframe. These databases hold a variety of material properties, obtained in-house or from other external databases, that are calculated, measured experimentally, or learned from trained algorithms. The get_dataframe method of each class performs the retrieval in three steps: it searches the respective database using user-specified filters (compound/material, property type, etc.), extracts the selected data in JSON/dictionary format through the API, and parses the result into a Pandas dataframe whose columns are the measured or calculated properties/features and whose rows are data points.

For example, to compare experimental and computed band gaps of Si, one can employ the following lines of code:

from matminer.data_retrieval.retrieve_Citrine import CitrineDataRetrieval
from matminer.data_retrieval.retrieve_MP import MPDataRetrieval

# Experimental band gaps of Si from Citrine
df_citrine = CitrineDataRetrieval().get_dataframe(formula='Si', property='band gap', data_type='EXPERIMENTAL')
# Computed band gaps of Si from the Materials Project
df_mp = MPDataRetrieval().get_dataframe(criteria='Si', properties=['band_gap'])

MongoDataRetrieval is another data retrieval tool. It parses any MongoDB collection (which follows a flexible JSON schema) into a Pandas dataframe with a format similar to the output of the data retrieval tools above. The arguments of its get_dataframe method let you use MongoDB’s rich query/aggregation syntax. More information on customizing queries can be found in the MongoDB documentation.
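
As a rough illustration of what turning JSON-like documents into a table involves, the sketch below flattens nested dictionaries into dotted column names using only the Python standard library. It is not MongoDataRetrieval's implementation: the flatten and to_table helpers, the field names, and the values are all hypothetical placeholders.

```python
def flatten(doc, parent_key="", sep="."):
    """Flatten a nested dict into a single-level dict with dotted keys."""
    flat = {}
    for key, value in doc.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into sub-documents and prefix their keys.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def to_table(docs):
    """Return (columns, rows); fields missing from a document become None."""
    flat_docs = [flatten(d) for d in docs]
    columns = sorted({key for d in flat_docs for key in d})
    rows = [[d.get(col) for col in columns] for d in flat_docs]
    return columns, rows


# Hypothetical documents: field names and values are placeholders, not real data.
docs = [
    {"formula": "A2B", "output": {"band_gap": 1.0, "energy": -3.5}},
    {"formula": "AB3", "output": {"band_gap": 0.0}},
]

columns, rows = to_table(docs)
print(columns)
for row in rows:
    print(row)
```

A document that lacks a field gets None in that column, similar in spirit to the missing values a dataframe holds when it is built from a schema-flexible collection.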

Data descriptor tools

Decorate the dataframe with composition, structural, and/or band structure descriptors/features.

We have developed utilities that describe a material from its composition or structure and represent that description numerically, so that it is readily usable as features.

The get_pymatgen_descriptor function encodes a material’s composition using the elemental properties tabulated in the pymatgen library. About 50 attributes are available in pymatgen for most elements in the periodic table, including electronegativity, atomic number, atomic mass, sound velocity, boiling point, etc. The function takes a material composition and the name of the desired property as input, and returns a list of floating point property values, one for each atom in that composition. This list can then be fed into a statistical function to obtain a single heuristic quantity representative of the entire composition. The following code block shows a few descriptors that can be obtained for LiFePO4:

from matminer.descriptors.composition_features import get_pymatgen_descriptor
import numpy as np

avg_mass = np.mean(get_pymatgen_descriptor('LiFePO4', 'atomic_mass'))    # Average atomic mass
std_num = np.std(get_pymatgen_descriptor('LiFePO4', 'Z'))    # Standard deviation of atomic numbers
elect = get_pymatgen_descriptor('LiFePO4', 'X')    # Electronegativity of each atom
range_elect = max(elect) - min(elect)    # Maximum difference in electronegativity
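
To make the idea of per-atom property lists concrete without installing matminer or pymatgen, here is a minimal standard-library sketch. The per_atom_values function and the ELEMENT_DATA table are hypothetical stand-ins, not matminer's code, and the sketch assumes that "one value for each atom" means an element appearing four times in the formula contributes four entries.

```python
import re
import statistics

# Small hand-written table for illustration only: atomic mass (u), atomic
# number Z, and Pauling electronegativity X. matminer reads such values
# from pymatgen instead.
ELEMENT_DATA = {
    "Li": {"atomic_mass": 6.941, "Z": 3, "X": 0.98},
    "Fe": {"atomic_mass": 55.845, "Z": 26, "X": 1.83},
    "P": {"atomic_mass": 30.973762, "Z": 15, "X": 2.19},
    "O": {"atomic_mass": 15.9994, "Z": 8, "X": 3.44},
}


def per_atom_values(formula, prop):
    """Return one property value per atom for a simple formula like 'LiFePO4'.

    Hypothetical helper: handles only element symbols followed by optional
    integer counts (no parentheses, no fractional amounts).
    """
    values = []
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        values.extend([float(ELEMENT_DATA[symbol][prop])] * int(count or 1))
    return values


masses = per_atom_values("LiFePO4", "atomic_mass")
numbers = per_atom_values("LiFePO4", "Z")
elect = per_atom_values("LiFePO4", "X")

avg_mass = statistics.mean(masses)      # Average atomic mass
std_num = statistics.pstdev(numbers)    # Population std. dev., like np.std's default
range_elect = max(elect) - min(elect)   # Maximum difference in electronegativity

print(len(masses), avg_mass, std_num, range_elect)
```

Each statistic reduces a list with one entry per atom to a single number, which is what makes the result usable as one feature column per material.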

The get_magpie_descriptor function operates in a similar way and obtains its data from the tables accumulated in the Magpie repository, some of which are sourced from elemental data compiled by Mathematica (more information can be found here). Properties it offers that are not in the pymatgen library include heat capacity, the enthalpy of fusion of elements at their melting points, pseudopotential radii, etc.

Other descriptors provided by matminer can be found in the GitHub repo.

Plotting tools

Plot data from either arrays or dataframes using either Plotly or matplotlib with figrecipes.

The figrecipes module of matminer provides utilities that make it easier and faster to plot common figures with Plotly and matplotlib. It lets you create plots from your data in just a few lines of code, drawing on the wide and flexible functionality of Plotly and matplotlib while shielding you from the complexities involved.

The Plotly module contains the PlotlyFig class that wraps around Plotly’s Python API and follows its JSON schema. The matplotlib module contains plotting wrapper classes for each kind of popular plot, including XY-scatter plots and heat maps.

A few examples demonstrating usage can be found in the notebook hosted on GitHub. Note: these examples may be out of date.

Examples

Check out some examples of how to use matminer!

  1. Use matminer and sklearn to train/predict bulk moduli.
_images/example_bulkmod_rf.png
  2. Get all experimentally measured band gaps of PbTe from Citrine’s database. (Jupyter Notebook)
  3. Compare and plot experimentally measured band gaps from Citrine with computed values from the Materials Project. (Jupyter Notebook)
  4. Use matminer and sklearn to train/predict band gaps. (Jupyter Notebook)
  5. Analyze uranium-oxygen bond lengths gathered from the MPDS database. (Jupyter Notebook)

Citing matminer

We are currently writing a paper on matminer and will update the citation information once it is submitted.

Changelog

Check out our full changelog here.

Contributions and Bug Reports

Want to see something added or changed? Here are a few ways you can help!

  • Help us improve the documentation. Tell us where you got ‘stuck’ and improve the install process for everyone.
  • Let us know about areas of the code that are difficult to understand or use.
  • Contribute code! Fork our GitHub repo and make a pull request.

Send all questions and other correspondence to the Google group.

A full list of contributors can be found here.