Native SenseClusters Methodology

The native SenseClusters methodology supports both context discrimination and word (unigram) clustering. Context discrimination is performed using either first or second order representations, and word clustering can be viewed as a side-effect of the second order representation.

A first order representation creates a vector for each context that indicates which features (unigrams, bigrams, co-occurrences, or target co-occurrences) occur in that context. This results in a context by feature matrix that can optionally be reduced by SVD prior to clustering.
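
As a rough illustration of the context by feature matrix, the following Python sketch builds binary indicator vectors for two toy contexts. It is not SenseClusters code: the function name first_order_matrix, the toy contexts, and the restriction to unigram features are all assumptions made for the example, and the optional SVD step is omitted.

```python
def first_order_matrix(contexts, features):
    """Build a binary context-by-feature matrix (illustrative only).

    contexts: list of token lists, one per context.
    features: list of unigram features (an assumption; bigram or
              co-occurrence features would need pair matching instead).
    """
    index = {feature: col for col, feature in enumerate(features)}
    matrix = []
    for tokens in contexts:
        row = [0] * len(features)
        for token in tokens:
            col = index.get(token)
            if col is not None:
                row[col] = 1  # mark that the feature occurs in this context
        matrix.append(row)
    return matrix


# Hypothetical toy data: two contexts containing the target word "bank".
contexts = [
    ["the", "bank", "approved", "the", "loan"],
    ["the", "river", "bank", "flooded"],
]
features = ["loan", "river", "approved", "flooded"]

matrix = first_order_matrix(contexts, features)
for row in matrix:
    print(row)
```

Each row is one context and each column is one feature, so the two contexts above end up with non-overlapping vectors.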

A second order representation creates a vector for each context that indicates which words occur with the words in that context (i.e., the second order co-occurrences). A word by word matrix is created from a given set of bigrams or co-occurrences, where the rows correspond to the first word in the pair, and the columns to the second. This matrix can then optionally be reduced by SVD. Each word in the context to be discriminated is replaced by its corresponding vector (i.e., row) from this word by word matrix. These vectors are then averaged, so that the context is represented by the centroid of the word vectors that make it up.
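
The second order steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the SenseClusters implementation: the names word_by_word_rows and context_centroid and the bigram counts are hypothetical, cells hold raw counts, context words that have no row in the matrix are simply skipped, and the optional SVD reduction is left out.

```python
def word_by_word_rows(bigram_counts):
    """Map each first word to a {second word: count} row (illustrative)."""
    rows = {}
    for (first, second), count in bigram_counts.items():
        rows.setdefault(first, {})[second] = count
    return rows


def context_centroid(tokens, rows, columns):
    """Average the row vectors of the context words that have a row.

    Skipping words without a row is an assumption of this sketch.
    """
    vectors = [
        [rows[token].get(column, 0) for column in columns]
        for token in tokens
        if token in rows
    ]
    if not vectors:
        return [0.0] * len(columns)
    return [sum(values) / len(vectors) for values in zip(*vectors)]


# Hypothetical bigram counts: (first word, second word) -> frequency.
bigram_counts = {
    ("river", "water"): 4,
    ("river", "flooded"): 2,
    ("bank", "loan"): 3,
    ("bank", "water"): 1,
}
rows = word_by_word_rows(bigram_counts)
columns = sorted({second for (_, second) in bigram_counts})

centroid = context_centroid(["the", "river", "bank"], rows, columns)
print(columns)
print(centroid)
```

With columns ordered as flooded, loan, water, the rows for "river" and "bank" are [2, 0, 4] and [0, 3, 1], and their average is the context vector.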

Word clustering treats the word by word matrix created for the second order representation as input to the clustering process. Words are clustered based on the words with which they co-occur.
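
To make the idea of clustering words by their co-occurrences concrete, the sketch below compares rows of a small word by word matrix with cosine similarity. The choice of cosine similarity, the function names, and the toy vectors are assumptions for illustration; the text above does not specify the similarity measure or clustering algorithm applied to the rows.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_word(word, word_vectors):
    """Return the other word whose row is most similar to the given word's."""
    others = [w for w in word_vectors if w != word]
    return max(others, key=lambda w: cosine(word_vectors[word], word_vectors[w]))


# Hypothetical rows of a word by word matrix (columns: flooded, loan, water).
word_vectors = {
    "river": [2, 0, 4],
    "stream": [1, 0, 3],
    "bank": [0, 3, 1],
}

print(nearest_word("river", word_vectors))
```

Here "river" and "stream" share most of their co-occurring words, so a clustering step that groups similar rows would tend to place them together and keep "bank" apart.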

This contrasts with the feature clustering supported by Latent Semantic Analysis, which clusters features based on the contexts in which they occur.