This module contains functions and classes for computing similarities across a corpus of documents in the Vector Space Model.
The documents are sparse vectors coming from the TF-IDF model, LSI model, LDA model, etc.
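The two main classes are:
1. Similarity: answers similarity queries by sequentially scanning the corpus. This is slower, but memory independent.
2. SparseMatrixSimilarity: stores the whole corpus in memory and computes similarity by in-memory matrix-vector multiplication. This is much faster than the general Similarity, so use it when dealing with smaller corpora that fit in RAM.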
Once the similarity object has been initialized, you can query for document similarity simply by
>>> similarities = similarity_object[query_vector]
or iterate over within-corpus similarities with
>>> for similarities in similarity_object:
...     ...
Compute cosine similarity against a corpus of documents. This is done by a full sequential scan of the corpus.
If your corpus is reasonably small (fits in RAM), consider using SparseMatrixSimilarity instead of Similarity, for (much) faster similarity searches.
If numBest is left unspecified, similarity queries return a full list (one float for every document in the corpus, including the query document).
If numBest is set, queries return the numBest most similar documents as a list sorted by decreasing similarity, e.g. [(docIndex1, 1.0), (docIndex2, 0.95), ..., (docIndexnumBest, 0.45)].
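The sequential scan and the two return shapes described above can be sketched in plain Python. This is only an illustration, not the module's implementation: the names cosine, scan_similarities and num_best are hypothetical, and documents are assumed to be bag-of-words lists of (term_id, weight) pairs.

```python
import math


def cosine(vec_a, vec_b):
    """Cosine similarity of two bag-of-words vectors [(term_id, weight), ...]."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def scan_similarities(corpus, query, num_best=None):
    """Score `query` against every document in one pass over `corpus`.

    `corpus` can be any iterable, so it may be streamed from disk
    (hypothetical helper; only the list of scores is held in memory).
    """
    sims = [cosine(query, doc) for doc in corpus]
    if num_best is None:
        return sims  # one float per document
    ranked = sorted(enumerate(sims), key=lambda item: -item[1])
    return ranked[:num_best]  # [(doc_index, similarity), ...]


corpus = [
    [(0, 1.0), (1, 1.0)],
    [(1, 2.0)],
    [(2, 3.0)],
]
query = [(1, 1.0)]

full = scan_similarities(corpus, query)              # full list of floats
top2 = scan_similarities(corpus, query, num_best=2)  # best two, sorted
```

Only one document is examined at a time, which is why this approach does not depend on the corpus fitting in RAM.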
Compute similarity against a corpus of documents by storing its sparse term-document (or concept-document) matrix in memory. The similarity measure used is cosine between two vectors.
This allows for faster similarity searches (simple sparse matrix-vector multiplication), but loses the memory-independence of an iterative corpus.
The matrix is internally stored as a scipy.sparse.csr_matrix.
If numBest is left unspecified, similarity queries return a full list (one float for every document in the corpus, including the query document).
If numBest is set, queries return numBest most similar documents, as a sorted list:
>>> sms = SparseMatrixSimilarity(corpus, numBest = 3)
>>> sms[vec12]
[(12, 1.0), (30, 0.95), (5, 0.45)]
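The in-memory approach can be illustrated with a small pure-Python sketch of a CSR-style (compressed sparse row) layout and the matrix-vector product that answers a query. It is a simplified stand-in for the scipy matrix, not the class's actual code: build_csr and csr_matvec are hypothetical names, each document is assumed to occupy one row, and all vectors are assumed to be of unit length already.

```python
def build_csr(corpus):
    """Pack bag-of-words documents into CSR-style arrays, one row per document."""
    data, indices, indptr = [], [], [0]
    for doc in corpus:
        for term_id, weight in doc:
            indices.append(term_id)  # column (term) of each stored value
            data.append(weight)      # the non-zero value itself
        indptr.append(len(data))     # marks where the next row starts
    return data, indices, indptr


def csr_matvec(csr, dense_query):
    """Multiply the CSR matrix by a dense query vector: one dot product per row."""
    data, indices, indptr = csr
    result = []
    for row in range(len(indptr) - 1):
        start, end = indptr[row], indptr[row + 1]
        result.append(
            sum(data[k] * dense_query[indices[k]] for k in range(start, end))
        )
    return result


# Documents are assumed to be unit length, so each dot product is a cosine.
corpus = [
    [(0, 0.6), (1, 0.8)],
    [(1, 1.0)],
    [(2, 1.0)],
]
csr = build_csr(corpus)
query = [0.0, 1.0, 0.0]  # dense, unit-length query over three terms
sims = csr_matvec(csr, query)
```

Only the non-zero entries are stored and visited, which is what makes the query fast, at the cost of keeping the whole matrix in memory.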
Return similarity of sparse vector doc to all documents in the corpus.
doc may be a bag-of-words iterable (standard corpus document), a numpy array, or a scipy.sparse matrix. It is assumed to be of unit length.
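The unit-length assumption matters because a plain dot product equals the cosine only when both vectors have norm 1. The sketch below shows one way a caller might normalize a bag-of-words document before querying. The helper names to_unit_dense and dot are hypothetical and not part of this module.

```python
import math


def to_unit_dense(doc, num_terms):
    """Convert a bag-of-words document to a dense list scaled to unit length."""
    dense = [0.0] * num_terms
    for term_id, weight in doc:
        dense[term_id] = weight
    norm = math.sqrt(sum(x * x for x in dense))
    if norm > 0.0:
        dense = [x / norm for x in dense]
    return dense  # an all-zero document is returned unchanged


def dot(u, v):
    """Plain dot product of two equally long dense vectors."""
    return sum(a * b for a, b in zip(u, v))


raw_query = [(0, 3.0), (2, 4.0)]         # norm is 5, not 1
query = to_unit_dense(raw_query, 3)      # scaled to unit length
stored = to_unit_dense([(0, 1.0)], 3)    # a unit-length corpus document

similarity = dot(query, stored)          # cosine, since both norms are 1
unnormalized = dot([3.0, 0.0, 4.0], stored)  # 3.0, outside the cosine range
```

Skipping the normalization step would still return numbers, but they would no longer be bounded by 1 or comparable across queries.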