PKBC#

class QuadratiK.spherical_clustering.PKBC(num_clust, max_iter=300, stopping_rule='loglik', init_method='sampledata', num_init=10, tol=1e-07, random_state=None, n_jobs=4)#

Poisson kernel-based clustering on the sphere. The class fits a mixture of Poisson kernel-based densities to data on the sphere and estimates the parameters of that mixture. The estimates are then used to assign each data point its final cluster membership.
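As a rough illustration of the model described above, the sketch below evaluates a Poisson kernel-based density and turns mixture parameters into posterior probabilities and hard memberships. It assumes the density form given in Golzy and Markatou (2020), f(x | mu, rho) = (1 - rho^2) / (omega_d * ||x - rho * mu||^d), with omega_d the surface area of the unit sphere in R^d. The helper names pkb_density and posterior_memberships are hypothetical and are not part of the QuadratiK API.

```python
import math
import numpy as np


def pkb_density(x, mu, rho):
    """Poisson kernel-based density on the unit sphere (assumed form).

    x   : (n, d) array of unit vectors
    mu  : (d,) unit vector, the cluster centroid
    rho : concentration parameter in [0, 1)
    """
    d = x.shape[1]
    # Surface area of the unit sphere in R^d.
    omega_d = 2.0 * math.pi ** (d / 2) / math.gamma(d / 2)
    dist = np.linalg.norm(x - rho * mu, axis=1)
    return (1.0 - rho**2) / (omega_d * dist**d)


def posterior_memberships(x, alpha, mu, rho):
    """Posterior cluster probabilities and hard labels for the mixture."""
    # Weighted component densities, shape (n_samples, n_clusters).
    weighted = np.column_stack(
        [a * pkb_density(x, m, r) for a, m, r in zip(alpha, mu, rho)]
    )
    post = weighted / weighted.sum(axis=1, keepdims=True)
    # Each point is assigned to the cluster with the largest posterior.
    return post, post.argmax(axis=1)


# Toy illustration on the circle (d = 2) with two opposite centroids.
theta = np.linspace(0.0, 2.0 * np.pi, 2000, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
mu = np.array([[1.0, 0.0], [-1.0, 0.0]])
alpha = np.array([0.5, 0.5])
rho = np.array([0.9, 0.9])
post, labels = posterior_memberships(circle, alpha, mu, rho)
```

Each row of post sums to one, and points near a centroid receive that centroid's label. The parameter values here are made up for illustration; in practice they would come from the fitted alpha_, mu_, and rho_ attributes.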

Parameters#

num_clust : int

Number of clusters.

max_iter : int, optional

Maximum number of iterations before a run is terminated. Defaults to 300.

stopping_rule : str, optional

String describing the stopping rule to be used within each run. Must be one of ‘max’, ‘membership’, or ‘loglik’. Defaults to ‘loglik’.

init_method : str, optional

String describing the initialization method to be used. Currently must be ‘sampledata’, which is also the default.

num_init : int, optional

Number of initializations. Defaults to 10.

tol : float, optional

Minimum change in the log-likelihood required to continue iterating, where the stopping rule uses it. Defaults to 1e-7.

random_state : int or None, optional

Seed for random number generation. Defaults to None.

n_jobs : int, optional

Used only for computing the WCSS efficiently. Specifies the maximum number of concurrently running workers. If 1 is given, no joblib parallelism is used at all, which is useful for debugging. For more information on n_jobs, see the joblib documentation: https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html. Defaults to 4.
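The interaction between stopping_rule, tol, and max_iter can be sketched as a single check made after each iteration. Only the ‘loglik’ behavior follows from the description of tol above; the meanings given to ‘membership’ and ‘max’ are assumptions inferred from their names, and should_stop is a hypothetical helper, not a QuadratiK function.

```python
import numpy as np


def should_stop(stopping_rule, tol, iteration, max_iter,
                loglik_old, loglik_new, labels_old, labels_new):
    """Return True when a run should terminate (illustrative only)."""
    if iteration >= max_iter:
        return True  # max_iter caps every run regardless of the rule
    if stopping_rule == "loglik":
        # Continue only while the log-likelihood changes by at least tol.
        return abs(loglik_new - loglik_old) < tol
    if stopping_rule == "membership":
        # Assumed meaning: stop once no observation changes cluster.
        return bool(np.array_equal(labels_old, labels_new))
    if stopping_rule == "max":
        # Assumed meaning: no early stopping; run until max_iter.
        return False
    raise ValueError(f"unknown stopping_rule: {stopping_rule!r}")
```

For example, should_stop("loglik", 1e-7, 5, 300, -10.0, -9.0, None, None) is False because the log-likelihood is still changing by more than tol.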

Attributes#

alpha_ : numpy.ndarray of shape (n_clusters,)

Estimated mixing proportions

labels_ : numpy.ndarray of shape (n_samples,)

Final cluster membership assigned by the algorithm to each observation

log_lik_vec : numpy.ndarray of shape (num_init,)

Array of log-likelihood values for each initialization

loglik_ : float

Maximum value of the log-likelihood function

mu_ : numpy.ndarray of shape (n_clusters, n_features)

Estimated centroids

num_iter_per_run : numpy.ndarray of shape (num_init,)

Number of E-M iterations per run

post_probs_ : numpy.ndarray of shape (n_samples, n_clusters)

Posterior probabilities of each observation belonging to each cluster

rho_ : numpy.ndarray of shape (n_clusters,)

Estimated concentration parameters rho

euclidean_wcss_ : float

Within-cluster sum of squares computed with Euclidean distance.

cosine_wcss_ : float

Within-cluster sum of squares computed with cosine similarity.
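To make the two WCSS attributes concrete, here is a minimal sketch of how such quantities could be computed from data, labels, and centroids. The exact definitions used by the library are not stated on this page, so both formulas below are assumptions, in particular the use of 1 minus cosine similarity as the cosine dissimilarity. The function wcss is hypothetical.

```python
import numpy as np


def wcss(x, labels, centroids):
    """Within-cluster sums of squares under two dissimilarities.

    Assumes the rows of x and centroids are unit vectors.
    """
    # Centroid of the cluster each observation was assigned to.
    assigned = centroids[labels]
    # Euclidean: sum of squared distances to the assigned centroid.
    euclidean = float(np.sum((x - assigned) ** 2))
    # Cosine (assumed definition): sum of 1 - cosine similarity
    # between each observation and its assigned centroid.
    cosine = float(np.sum(1.0 - np.sum(x * assigned, axis=1)))
    return euclidean, cosine


x = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
labels = np.array([0, 0, 1])
centroids = np.array([[1.0, 0.0], [-1.0, 0.0]])
euclidean_wcss, cosine_wcss = wcss(x, labels, centroids)
```

Under these assumed definitions, and for unit vectors, the Euclidean value is exactly twice the cosine value, since ||x - c||^2 = 2 (1 - x·c).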

References#

Golzy M. & Markatou M. (2020) Poisson Kernel-Based Clustering on the Sphere: Convergence Properties, Identifiability, and a Method of Sampling, Journal of Computational and Graphical Statistics, 29:4, 758-770, DOI: 10.1080/10618600.2020.1740713.

Examples#

>>> from QuadratiK.datasets import load_wireless_data
>>> from QuadratiK.spherical_clustering import PKBC
>>> from sklearn.preprocessing import LabelEncoder
>>> X, y = load_wireless_data(return_X_y=True)
>>> # Encode the true class labels as integers
>>> y = LabelEncoder().fit_transform(y)
>>> cluster_fit = PKBC(num_clust=4, random_state=42).fit(X)
>>> ari, macro_precision, macro_recall, avg_silhouette_score = cluster_fit.validation(y)
>>> print("Estimated mixing proportions :", cluster_fit.alpha_)
Estimated mixing proportions : [0.23590339 0.24977919 0.25777522 0.25654219]
>>> print("Estimated concentration parameters: ", cluster_fit.rho_)
Estimated concentration parameters:  [0.97773265 0.98348976 0.98226901 0.98572597]
>>> print("Adjusted Rand Index:", ari)
Adjusted Rand Index: 0.9403086353805835
>>> print("Macro Precision:", macro_precision)
Macro Precision: 0.9771870612442508
>>> print("Macro Recall:", macro_recall)
Macro Recall: 0.9769999999999999
>>> print("Average Silhouette Score:", avg_silhouette_score)
Average Silhouette Score: 0.3803089203572107

Methods

PKBC.fit(dat)

Performs Poisson Kernel-based Clustering.

PKBC.stats()

Function to generate descriptive statistics per variable (and per group if available).

PKBC.validation([y_true])

Computes validation metrics such as ARI, Macro Precision and Macro Recall when true labels are provided.


PKBC.fit(dat)#

Performs Poisson Kernel-based Clustering.

Parameters#

dat : numpy.ndarray or pandas.DataFrame

A numeric array of data values.

Returns#

self : object

Fitted estimator.
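Because the model is defined on the sphere, it can help to see what placing observations on the unit sphere looks like. The sketch below scales each row of a data array to unit Euclidean norm. Whether fit performs this step internally is not stated on this page, so this is shown only as an optional preprocessing illustration; to_unit_sphere is a hypothetical helper.

```python
import numpy as np


def to_unit_sphere(dat):
    """Scale each row of dat to unit Euclidean norm."""
    dat = np.asarray(dat, dtype=float)
    norms = np.linalg.norm(dat, axis=1, keepdims=True)
    if np.any(norms == 0):
        # A zero vector has no direction, so it cannot be projected.
        raise ValueError("zero rows cannot be projected onto the sphere")
    return dat / norms


sphere_dat = to_unit_sphere([[3.0, 4.0], [0.0, 2.0]])
```

Here the rows become (0.6, 0.8) and (0.0, 1.0), each with norm one.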

PKBC.stats()#

Function to generate descriptive statistics per variable (and per group if available).

Returns#

summary_stats_df : pandas.DataFrame

Dataframe of descriptive statistics

PKBC.validation(y_true=None)#

Computes validation metrics such as ARI, Macro Precision and Macro Recall when true labels are provided.

Parameters#

y_true : numpy.ndarray, optional

Array of true cluster memberships. Defaults to None.

Returns#

validation metrics : tuple

The tuple consists of the following:

  • Adjusted Rand Index : float (returned only when y_true is provided)

    Adjusted Rand Index computed between the true and predicted cluster memberships.

  • Macro Precision : float (returned only when y_true is provided)

    Macro Precision computed between the true and predicted cluster memberships.

  • Macro Recall : float (returned only when y_true is provided)

    Macro Recall computed between the true and predicted cluster memberships.

  • Average Silhouette Score : float

    Mean Silhouette Coefficient of all samples.
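For readers unfamiliar with macro averaging, the sketch below computes macro precision and macro recall directly from two label arrays: each metric is calculated per class and then averaged with equal weight. It assumes the predicted labels have already been mapped onto the true label set, and macro_precision_recall is a hypothetical helper rather than the routine used by validation.

```python
import numpy as np


def macro_precision_recall(y_true, y_pred):
    """Unweighted mean of per-class precision and recall."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    precisions, recalls = [], []
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))   # true positives
        predicted = np.sum(y_pred == c)              # predicted as class c
        actual = np.sum(y_true == c)                 # truly class c
        # A class that is never predicted contributes precision 0.
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual)
    return float(np.mean(precisions)), float(np.mean(recalls))


precision, recall = macro_precision_recall([0, 0, 1, 1], [0, 1, 1, 1])
```

In this toy case class 0 has precision 1 and recall 0.5, and class 1 has precision 2/3 and recall 1, giving a macro precision of 5/6 and a macro recall of 0.75.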

References#

Rousseeuw, P.J. (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.

Notes#

The predicted cluster labels are mapped to the true class labels (if provided) using a naive approach, which might not work when num_clust is large. In such cases, match the labels yourself and compute the metrics with sklearn.metrics.
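One common way to produce correctly matched labels is to solve an assignment problem on the contingency table between clusters and classes. The sketch below does this with scipy.optimize.linear_sum_assignment. It is offered as one possible approach, not as the mapping used inside validation, and match_labels is a hypothetical helper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_labels(y_true, y_pred):
    """Relabel y_pred so that clusters line up with the true classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # Contingency table: rows are clusters, columns are true classes.
    table = np.array(
        [[np.sum((y_pred == k) & (y_true == c)) for c in classes]
         for k in clusters]
    )
    # One-to-one assignment that maximizes the total overlap.
    rows, cols = linear_sum_assignment(table, maximize=True)
    mapping = {clusters[r]: classes[c] for r, c in zip(rows, cols)}
    # Clusters left unmatched (more clusters than classes) keep their
    # original label, which may need separate handling.
    return np.array([mapping.get(k, k) for k in y_pred])


matched = match_labels([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1])
```

The relabeled array can then be passed, together with the true labels, to functions in sklearn.metrics.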

See also#

sklearn.metrics : Scikit-learn's metrics module supports a wide range of metrics.