PKBC#
- class QuadratiK.spherical_clustering.PKBC(num_clust, max_iter=300, stopping_rule='loglik', init_method='sampledata', num_init=10, tol=1e-07, random_state=None, n_jobs=4)#
Poisson kernel-based clustering on the sphere. This class implements the Poisson kernel-based clustering algorithm, which estimates the parameters of a mixture of Poisson kernel-based densities on the sphere. The resulting estimates are used to assign each data point its final cluster membership.
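As a rough illustration of what the estimated parameters mean, the sketch below evaluates the Poisson kernel-based density from Golzy & Markatou (2020), f(x | mu, rho) = (1 - rho^2) / (omega_d * ||x - rho * mu||^d), where omega_d is the surface area of the unit sphere in R^d, and converts a mixture of such densities into posterior membership probabilities. The helper names `pkb_density` and `posterior_probs` are illustrative and not part of QuadratiK; the library's internal implementation may differ.

```python
import numpy as np
from math import gamma, pi


def pkb_density(x, mu, rho):
    """Poisson kernel-based density on the unit sphere in R^d.

    x   : (n, d) array of unit vectors
    mu  : (d,) unit vector (cluster centroid)
    rho : concentration parameter, 0 < rho < 1
    """
    x = np.atleast_2d(x)
    d = x.shape[1]
    # Surface area of the unit sphere S^{d-1}.
    omega_d = 2.0 * pi ** (d / 2.0) / gamma(d / 2.0)
    dist = np.linalg.norm(x - rho * mu, axis=1)
    return (1.0 - rho ** 2) / (omega_d * dist ** d)


def posterior_probs(x, alpha, mu, rho):
    """Posterior membership probabilities, shape (n_samples, n_clusters)."""
    weighted = np.column_stack(
        [a * pkb_density(x, m, r) for a, m, r in zip(alpha, mu, rho)]
    )
    return weighted / weighted.sum(axis=1, keepdims=True)


# Two well-separated clusters on the circle (d = 2); values are made up.
alpha = np.array([0.5, 0.5])
mu = np.array([[1.0, 0.0], [-1.0, 0.0]])
rho = np.array([0.9, 0.9])
points = np.array([[1.0, 0.0], [-1.0, 0.0]])

probs = posterior_probs(points, alpha, mu, rho)
labels = probs.argmax(axis=1)  # hard assignment: most probable cluster
```

Values of rho close to 1 concentrate the density tightly around mu, while values close to 0 approach the uniform distribution on the sphere.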
Parameters#
- num_clust : int
Number of clusters.
- max_iter : int
Maximum number of iterations before a run is terminated.
- stopping_rule : str, optional
String describing the stopping rule to be used within each run. Currently must be either ‘max’, ‘membership’, or ‘loglik’.
- init_method : str, optional
String describing the initialization method to be used. Currently must be ‘sampledata’.
- num_init : int, optional
Number of initializations.
- tol : float
Minimum change in the log-likelihood required to continue iterating, if applicable. Defaults to 1e-7.
- random_state : int, None, optional
Seed for random number generation. Defaults to None.
- n_jobs : int
Used only for computing the WCSS efficiently. n_jobs specifies the maximum number of concurrently running workers. If 1 is given, no joblib parallelism is used at all, which is useful for debugging. For more information on joblib n_jobs, refer to https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html. Defaults to 4.
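The interplay of `max_iter`, `stopping_rule`, and `tol` can be pictured with the following sketch. It is only one plausible reading of the parameter descriptions above, assuming that ‘loglik’ stops once the log-likelihood changes by less than `tol`, ‘membership’ stops once the hard assignments no longer change, and ‘max’ always runs to `max_iter`. The function `should_stop` is hypothetical and not part of QuadratiK; the library's exact criteria may differ.

```python
import numpy as np


def should_stop(rule, it, max_iter, loglik, prev_loglik,
                labels, prev_labels, tol=1e-7):
    """Hypothetical per-iteration stopping check for an EM-style loop."""
    if it >= max_iter:  # every rule is capped by max_iter
        return True
    if rule == "loglik":
        # Assumed: stop when the log-likelihood has stabilised.
        return abs(loglik - prev_loglik) < tol
    if rule == "membership":
        # Assumed: stop when no observation changed cluster.
        return np.array_equal(labels, prev_labels)
    if rule == "max":
        return False
    raise ValueError(f"unknown stopping rule: {rule!r}")


same = np.array([0, 1, 1])
stop_loglik = should_stop("loglik", 5, 300, -10.0, -10.0 - 1e-9, same, same)
stop_member = should_stop("membership", 5, 300, -10.0, -12.0, same, same)
stop_max = should_stop("max", 5, 300, -10.0, -10.0, same, same)
```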
Attributes#
- alpha_ : numpy.ndarray of shape (n_clusters,)
Estimated mixing proportions
- labels_ : numpy.ndarray of shape (n_samples,)
Final cluster membership assigned by the algorithm to each observation
- log_lik_vec : numpy.ndarray of shape (num_init,)
Array of log-likelihood values for each initialization
- loglik_ : float
Maximum value of the log-likelihood function
- mu_ : numpy.ndarray of shape (n_clusters, n_features)
Estimated centroids
- num_iter_per_run : numpy.ndarray of shape (num_init,)
Number of E-M iterations per run
- post_probs_ : numpy.ndarray of shape (n_samples, n_clusters)
Posterior probabilities of each observation belonging to each cluster
- rho_ : numpy.ndarray of shape (n_clusters,)
Estimated concentration parameters rho
- euclidean_wcss_ : float
Value of the within-cluster sum of squares computed with Euclidean distance.
- cosine_wcss_ : float
Value of the within-cluster sum of squares computed with cosine similarity.
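To make the two WCSS attributes more concrete, here is a minimal NumPy sketch using one common definition: the Euclidean version sums squared distances to the assigned centroid, and the cosine version sums one minus the cosine similarity to the assigned centroid. The exact formulas QuadratiK uses are not stated here, so treat the function `wcss` and its definitions as assumptions.

```python
import numpy as np


def wcss(X, labels, centroids):
    """Return (euclidean_wcss, cosine_wcss) for unit-norm rows of X."""
    euclid = 0.0
    cosine = 0.0
    for k, mu in enumerate(centroids):
        members = X[labels == k]
        # Squared Euclidean distance of each member to its centroid.
        euclid += np.sum(np.linalg.norm(members - mu, axis=1) ** 2)
        # Rows and centroids are unit vectors, so the dot product
        # equals the cosine similarity.
        cosine += np.sum(1.0 - members @ mu)
    return euclid, cosine


# Toy data on the unit circle; values are made up.
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
labels = np.array([0, 0, 1])
centroids = np.array([[1.0, 0.0], [-1.0, 0.0]])

euclid, cosine = wcss(X, labels, centroids)
```

For unit vectors the squared Euclidean distance equals twice one minus the cosine similarity, so under these definitions the two quantities differ only by a factor of two.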
References#
Golzy M. & Markatou M. (2020) Poisson Kernel-Based Clustering on the Sphere: Convergence Properties, Identifiability, and a Method of Sampling, Journal of Computational and Graphical Statistics, 29:4, 758-770, DOI: 10.1080/10618600.2020.1740713.
Examples#
>>> from QuadratiK.datasets import load_wireless_data
>>> from QuadratiK.spherical_clustering import PKBC
>>> from sklearn.preprocessing import LabelEncoder
>>> X, y = load_wireless_data(return_X_y=True)
>>> le = LabelEncoder()
>>> le.fit(y)
>>> y = le.transform(y)
>>> cluster_fit = PKBC(num_clust=4, random_state=42).fit(X)
>>> ari, macro_precision, macro_recall, avg_silhouette_Score = cluster_fit.validation(y)
>>> print("Estimated mixing proportions :", cluster_fit.alpha_)
>>> print("Estimated concentration parameters: ", cluster_fit.rho_)
>>> print("Adjusted Rand Index:", ari)
>>> print("Macro Precision:", macro_precision)
>>> print("Macro Recall:", macro_recall)
>>> print("Average Silhouette Score:", avg_silhouette_Score)
... Estimated mixing proportions : [0.23590339 0.24977919 0.25777522 0.25654219]
... Estimated concentration parameters: [0.97773265 0.98348976 0.98226901 0.98572597]
... Adjusted Rand Index: 0.9403086353805835
... Macro Precision: 0.9771870612442508
... Macro Recall: 0.9769999999999999
... Average Silhouette Score: 0.3803089203572107
Methods
| fit(dat) | Performs Poisson Kernel-based Clustering. |
| stats() | Function to generate descriptive statistics per variable (and per group if available). |
| validation(y_true=None) | Computes validation metrics such as ARI, Macro Precision and Macro Recall when true labels are provided. |
- PKBC.fit(dat)#
Performs Poisson Kernel-based Clustering.
Parameters#
- dat : numpy.ndarray, pandas.DataFrame
A numeric array of data values.
Returns#
- self : object
Fitted estimator
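Because the model is defined on the sphere, it can help to check what the input looks like once each observation is scaled to unit length. The sketch below does this with plain NumPy. Whether `fit` performs this normalization internally is not stated here, and `to_unit_sphere` is a hypothetical helper, not a QuadratiK function.

```python
import numpy as np


def to_unit_sphere(dat):
    """Scale each row of a numeric array to unit Euclidean norm."""
    arr = np.asarray(dat, dtype=float)
    norms = np.linalg.norm(arr, axis=1, keepdims=True)
    if np.any(norms == 0):
        raise ValueError("rows with zero norm cannot be placed on the sphere")
    return arr / norms


# Toy input; a pandas.DataFrame of numeric columns would work the same way.
X = np.array([[3.0, 4.0], [0.0, 2.0]])
X_unit = to_unit_sphere(X)
```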
- PKBC.stats()#
Function to generate descriptive statistics per variable (and per group if available).
Returns#
- summary_stats_df : pandas.DataFrame
Dataframe of descriptive statistics
- PKBC.validation(y_true=None)#
Computes validation metrics such as ARI, Macro Precision and Macro Recall when true labels are provided.
Parameters#
- y_true : numpy.ndarray, optional
Array of true cluster memberships. Defaults to None.
Returns#
- validation metrics : tuple
The tuple consists of the following:
- Adjusted Rand Index : float (returned only when y_true is provided)
Adjusted Rand Index computed between the true and predicted cluster memberships.
- Macro Precision : float (returned only when y_true is provided)
Macro Precision computed between the true and predicted cluster memberships.
- Macro Recall : float (returned only when y_true is provided)
Macro Recall computed between the true and predicted cluster memberships.
- Average Silhouette Score : float
Mean Silhouette Coefficient of all samples.
References#
Rousseeuw, P.J. (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.
Notes#
A naive approach is used to map the predicted cluster labels to the true class labels (if provided). This mapping may fail when num_clust is large. In such cases, compute the metrics with sklearn.metrics and supply correctly matched labels.
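One way to obtain matched labels before calling sklearn.metrics is to solve an assignment problem on the confusion matrix. The sketch below uses `scipy.optimize.linear_sum_assignment` for this. It is an alternative technique, not the mapping QuadratiK applies internally, and it assumes that the true classes and the predicted clusters are integer-encoded with the same label set. `match_labels` is a hypothetical helper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix, precision_score, recall_score


def match_labels(y_true, y_pred):
    """Relabel y_pred so that clusters line up with the true classes."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(np.concatenate([y_true, y_pred]))
    cm = confusion_matrix(y_true, y_pred, labels=classes)
    # Maximise total overlap between true classes (rows) and clusters
    # (columns) by minimising the negated counts.
    rows, cols = linear_sum_assignment(-cm)
    mapping = {classes[c]: classes[r] for r, c in zip(rows, cols)}
    return np.array([mapping[p] for p in y_pred])


y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([2, 2, 0, 0, 1, 1])  # same partition, permuted labels

y_matched = match_labels(y_true, y_pred)
macro_precision = precision_score(y_true, y_matched, average="macro")
macro_recall = recall_score(y_true, y_matched, average="macro")
```

The Adjusted Rand Index is invariant to label permutations, so only metrics such as macro precision and macro recall need this matching step.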
See also#
sklearn.metrics : Scikit-learn's metrics module supports a wide range of metrics.