Welcome to humanmodels’s documentation!

This package provides human-designed, scikit-learn compatible models for classification and regression. Models are initialized with a sympy-compatible text string describing either an equation (e.g. “y = 4*x + 3*z**2 + p_0”) or a classification rule that must evaluate to True or False (e.g. “x > 2*y + 2”). If the string contains parameters that do not correspond to problem variables, their values are optimized on the training data when the .fit(X,y) method is called.

The objective of HumanModels is to provide a scikit-learn integrated way of comparing human-designed models to machine learning models.

Installing the package

On Linux, HumanModels can be installed through pip:

pip install humanmodels

You can also install the package by cloning or downloading this repository, changing into its directory, and executing:

python -m build
python -m pip install dist/humanmodels*whl

On Windows, HumanModels can be installed through the Anaconda Prompt:

pip install humanmodels



HumanRegressor is a regressor, initialized with a sympy-compatible text string describing an equation and a dictionary that maps each variable named in the equation to the index of the corresponding feature (column) in X. Let’s generate some data to test the algorithm:

import numpy as np
print("Creating data...")
X_train = np.zeros((100,3))
X_train[:,0] = np.linspace(0, 1, 100)
X_train[:,1] = np.random.rand(100)
X_train[:,2] = np.linspace(0, 1, 100)
y_train = np.array([0.5 + 1*x[0] + 1*x[2] + 2*x[0]**2 + 2*x[2]**2 for x in X_train])

An example of initialization:

from humanmodels import HumanRegressor
model_string = "y = 0.5 + a_1*x + a_2*z + a_3*x**2 + a_4*z**2"
variables_to_features = {"x": 0, "z": 2}
regressor = HumanRegressor(model_string, variables_to_features)

Printing the model as a string will return:

Model not initialized, call '.fit(X, y)'
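Any symbol in the model string that is not a key of variables_to_features is treated as a trainable parameter. The split can be sketched with the standard library alone. This is only an illustration: the helper split_symbols is hypothetical, and it parses the right-hand side with Python's ast module rather than sympy, which is what the package itself relies on.

```python
import ast

def split_symbols(expression, variables_to_features):
    """Split the names in an expression into variables and trainable parameters.

    Hypothetical helper, not part of humanmodels.
    """
    # Simple arithmetic such as "a_1*x + a_3*x**2" is valid Python syntax,
    # so the names it contains can be collected from the parsed tree.
    tree = ast.parse(expression, mode="eval")
    names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    variables = sorted(n for n in names if n in variables_to_features)
    parameters = sorted(n for n in names if n not in variables_to_features)
    return variables, parameters

model_string = "y = 0.5 + a_1*x + a_2*z + a_3*x**2 + a_4*z**2"
# Keep only the right-hand side of the equation and drop surrounding spaces.
right_hand_side = model_string.split("=", 1)[1].strip()
variables, parameters = split_symbols(right_hand_side, {"x": 0, "z": 2})
print(variables)   # ['x', 'z']
print(parameters)  # ['a_1', 'a_2', 'a_3', 'a_4']
```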

We can now fit the model to the data:

print("Fitting data...")
regressor.fit(X_train, y_train)
print(regressor)

The code will produce:

Fitting data...
Model: y = a_1*x + a_2*z + a_3*x**2 + a_4*z**2 + 0.5
Variables: ['x', 'z']
Parameters: {'a_1': 1.0000001886557832, 'a_2': 1.0000004533354703, 'a_3': 2.000000577731051, 'a_4': 2.0000005553527895}
Trained model: y = 2.0*x**2 + 1.0*x + 2.0*z**2 + 1.0*z + 0.5
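Fitting means choosing the parameter values that minimize the error between the equation's output and y_train. The sketch below illustrates that idea for this particular equation, under stated assumptions: because the model is linear in a_1 to a_4, it swaps in numpy's linear least squares instead of the scipy and cma optimizers listed among the package's dependencies, and it draws x and z independently (with a fixed seed) so that the four columns are not collinear.

```python
import numpy as np

# Illustrative data: x and z drawn independently (an assumption of this sketch).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 200)
z = rng.uniform(0.0, 1.0, 200)
y = 0.5 + 1*x + 1*z + 2*x**2 + 2*z**2

# One column per trainable parameter: a_1*x, a_2*z, a_3*x**2, a_4*z**2.
design = np.column_stack([x, z, x**2, z**2])

# The constant 0.5 is fixed in the model string, so it is moved to the target.
params, *_ = np.linalg.lstsq(design, y - 0.5, rcond=None)
print(dict(zip(["a_1", "a_2", "a_3", "a_4"], params)))

# Mean squared error of the fitted equation on the same data.
y_fit = 0.5 + design @ params
print("Mean squared error:", np.mean((y - y_fit) ** 2))
```

For equations that are not linear in their parameters, this closed-form shortcut no longer applies and an iterative optimizer is needed.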

As the only variables provided in the variables_to_features dictionary are named x and z, all other alphabetic symbols (a_1, a_2, a_3, a_4) are interpreted as trainable parameters. The printed model also shows the optimized values of its parameters. Let’s now check the performance on the training data:

y_pred = regressor.predict(X_train)
from sklearn.metrics import mean_squared_error
print("Mean squared error:", mean_squared_error(y_train, y_pred))
Mean squared error: 7.72490931190691e-13

The regressor can also be tested on unseen data. Since the equation used to generate the data has the same structure as the one given to the regressor, the model generalizes well, even to inputs outside the training range:

X_test = np.zeros((100,3))
X_test[:,0] = np.linspace(1, 2, 100)
X_test[:,1] = np.random.rand(100)
X_test[:,2] = np.linspace(1, 2, 100)
y_test = np.array([0.5 + 1*x[0] + 1*x[2] + 2*x[0]**2 + 2*x[2]**2 for x in X_test])
y_pred = regressor.predict(X_test)
print("Mean squared error on test:", mean_squared_error(y_test, y_pred))
Mean squared error on test: 1.2055817248044523e-11


HumanClassifier also takes as input a sympy-compatible string (or a dictionary of strings) defining a logic expression that evaluates to True or False. If only one string is provided during initialization, the problem is assumed to be binary classification, with True corresponding to Class 0 and False corresponding to Class 1. Let’s test it on the classic Iris benchmark provided in scikit-learn, transformed into a binary classification problem.

from sklearn import datasets
X, y = datasets.load_iris(return_X_y=True)
y[y != 0] = 1  # merge classes 1 and 2 into a single class 1

from humanmodels import HumanClassifier
rule = "(sl < 6.0) & (sw > 2.7)"
variables_to_features = {"sl": 0, "sw": 1}
classifier = HumanClassifier(rule, variables_to_features)
print(classifier)
Model not initialized, call '.fit(X, y)'

Even if there are no trainable parameters, the classifier must still be trained using .fit(X,y), for compatibility with the scikit-learn package:

classifier.fit(X, y)
print(classifier)
Classifier: Class 0: (sw > 2.7) & (sl < 6.0); variables: sl -> 0 sw -> 1; parameters: None
Default class (if all other expressions are False): 1

And now, let’s test the classifier:

y_pred = classifier.predict(X)
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y, y_pred)
print("Final accuracy for the classifier is %.4f" % accuracy)
Final accuracy for the classifier is 0.9067
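The mapping from a single rule to class labels can be sketched in plain Python. This is a minimal illustration of the convention described above (True gives Class 0, False gives Class 1), not the package's implementation; predict_binary and rule are hypothetical names, and the three rows are illustrative measurements in the Iris feature order.

```python
def predict_binary(rule, samples):
    """Return class 0 where the rule is True and class 1 where it is False."""
    return [0 if rule(sample) else 1 for sample in samples]

# Same rule as "(sl < 6.0) & (sw > 2.7)", with sl = feature 0 and sw = feature 1.
def rule(sample):
    return sample[0] < 6.0 and sample[1] > 2.7

# Columns: sepal length, sepal width, petal length, petal width.
samples = [
    [5.1, 3.5, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.3, 3.3, 6.0, 2.5],
]
labels = predict_binary(rule, samples)
print(labels)  # [0, 1, 1]
```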

For multi-class classification problems, HumanClassifier can accept a dictionary of logic expressions in the form {label0 : "expression0", label1 : "expression1", ...}. As with HumanRegressor, expressions can also have trainable parameters, which are optimized when .fit(X,y) is called. Let’s see another example with Iris, this time using all three classes:

X, y = datasets.load_iris(return_X_y=True)
rules = {0: "sw + p_0*sl > p_1",
         2: "pw > p_2",
         1: ""}  # a sample is assigned to class 1 if the expressions
                 # for classes 0 and 2 both return False
variables_to_features = {'sl': 0, 'sw': 1, 'pw': 3}
classifier = HumanClassifier(rules, variables_to_features)

classifier.fit(X, y)
print(classifier)
y_pred = classifier.predict(X)
accuracy = accuracy_score(y, y_pred)
print("Classification accuracy: %.4f" % accuracy)
Class 0: p_0*sl + sw > p_1; variables:sl -> 0 sw -> 1; parameters:p_0=-0.6491880968641275 p_1=-0.12490468490418744
Class 2: pw > p_2; variables:pw -> 3; parameters:p_2=1.7073348596674072
Default class (if all other expressions are False): 1
Classification accuracy: 0.9400
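How a dictionary of rules with a default class turns into predictions can be sketched as follows, using the parameter values printed above. Two things here are assumptions of the sketch rather than documented behavior: rules are checked in dictionary order, and the first rule that returns True decides the label. predict_multiclass is a hypothetical helper, and the three rows are illustrative measurements in the Iris feature order.

```python
def predict_multiclass(rules, default_class, sample):
    """Return the label of the first rule that is True, else the default class."""
    for label, rule in rules.items():
        if rule(sample):
            return label
    return default_class

# Parameter values copied from the fitted classifier printed above.
p_0, p_1, p_2 = -0.6491880968641275, -0.12490468490418744, 1.7073348596674072

# sl = feature 0, sw = feature 1, pw = feature 3.
rules = {
    0: lambda s: s[1] + p_0 * s[0] > p_1,  # "sw + p_0*sl > p_1"
    2: lambda s: s[3] > p_2,               # "pw > p_2"
}

# Columns: sepal length, sepal width, petal length, petal width.
samples = [
    [5.1, 3.5, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.3, 3.3, 6.0, 2.5],
]
predictions = [predict_multiclass(rules, 1, s) for s in samples]
print(predictions)  # [0, 1, 2]
```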

Depends on

numpy (for fast computations)

sympy (for symbolic mathematics)

scipy (for optimization)

cma (also for optimization of non-convex functions)

scikit-learn (for quality metrics, such as accuracy and mean squared error; HumanClassifier and HumanRegressor also aim to be compatible with scikit-learn estimators)
