{% extends "_base.html" %} {% block header_javascripts %} {% endblock %} {% block content %}
Interactive Clustering is a method intended to assist in the design of a training dataset. The main objective is to create a labeled dataset without a prior, subjective definition of the themes it represents. It is based on an active learning methodology in which the computer suggests a data partitioning and the expert iteratively corrects the computer's decisions.
This iterative process begins with an unlabeled dataset and repeats a sequence of two substeps:
Using this method avoids the following tedious tasks:
For more details, see the Frequently Asked Questions and the articles listed in the References section.
Clustering is an unsupervised algorithm that groups data by similarity. In NLP, it can rely on common linguistic patterns, lexical or syntactic similarities, word vector distances, and so on.
The main advantage of such algorithms is the ability to explore data in order to find topics. However, experts often consider the raw results to be of low value (ambiguous formulations are hard to distinguish, unbalanced topics are hard to handle, etc.). Manual corrections are therefore sometimes necessary to obtain semantically relevant results.
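As an illustration of grouping texts by lexical similarity, here is a minimal sketch. It is not the algorithm used by this application: the Jaccard token overlap, the single-link grouping, the `threshold` parameter, and the sample sentences are all assumptions chosen to keep the example short.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Lexical similarity: shared tokens divided by all distinct tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def cluster_by_similarity(texts, threshold=0.5):
    """Single-link grouping (hypothetical helper): two texts whose
    similarity reaches the threshold end up in the same cluster."""
    parent = list(range(len(texts)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(texts)), 2):
        if jaccard(texts[i], texts[j]) >= threshold:
            parent[find(j)] = find(i)

    # Relabel roots as consecutive cluster ids, in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(texts))]


texts = [
    "I lost my credit card",
    "I lost my card",
    "what is my account balance",
    "show my account balance",
]
labels = cluster_by_similarity(texts, threshold=0.5)
print(labels)  # [0, 0, 1, 1]
```

With a purely lexical measure like this one, two sentences that express the same need in different words would land in different clusters, which is the kind of raw result an expert may want to correct.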
A constraint is a piece of information given by the expert about the similarity of data. We deal with two types of constraints:
Constraints can be used in constrained clustering to guide the algorithm.
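To make the role of constraints concrete, the sketch below checks a clustering against expert annotations. The text above does not name the two constraint types, so this example assumes the pair commonly used in constrained clustering: must-link (two items belong together) and cannot-link (two items must be separated). The function name, the `MUST_LINK` and `CANNOT_LINK` labels, and the sample data are hypothetical.

```python
def violated_constraints(labels, must_link, cannot_link):
    """Return the constraints that a clustering does not respect.

    labels      -- cluster id of each item, indexed by item position
    must_link   -- pairs (a, b) the expert says belong together (assumed type)
    cannot_link -- pairs (a, b) the expert says must be separated (assumed type)
    """
    violations = []
    for a, b in must_link:
        if labels[a] != labels[b]:
            violations.append(("MUST_LINK", a, b))
    for a, b in cannot_link:
        if labels[a] == labels[b]:
            violations.append(("CANNOT_LINK", a, b))
    return violations


# Clustering proposed by the computer for four items.
labels = [0, 0, 1, 1]

# Expert annotations: items 0 and 2 are similar, items 0 and 1 are not.
violations = violated_constraints(labels, must_link=[(0, 2)], cannot_link=[(0, 1)])
print(violations)  # [('MUST_LINK', 0, 2), ('CANNOT_LINK', 0, 1)]
```

A constrained clustering algorithm would aim to produce a partition for which this list is empty.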
The sampling step selects the constraints to annotate so that the clustering is corrected as effectively as possible. Sampling strategies can be based on: