Thoughts following the 2015 "Text By The Bay" Conference
1. word2vec and doc2vec appear to be pervasive
Mikolov et al.'s work on embedding words as real-valued vectors using a skip-gram, negative-sampling model (word2vec code) was mentioned in nearly every talk I attended. Companies are either using various word2vec implementations directly or building modifications on top of the basic framework. Trained on large corpora, the vector representations encode concepts in a high-dimensional space (usually 200-300 dimensions). Beyond the "king - man = queen - woman" analogy party trick, such embeddings are finding real-world applications throughout NLP.

For example, Mike Tamir ("Classifying Text without (many) Labels"; slide shown below) discussed how he uses the average of the word vectors over an entire document as features for text classification, outperforming other bag-of-words (BoW) techniques by a wide margin on heavily imbalanced classes; a minimal sketch of this averaging idea follows below. Marek Kolodziej ("Unsupervised NLP Tutorial using Apache Spark") gave a wonderful talk on the long history of concept embeddings, along with technical details from most of the salient papers. Chris Moody ("A Word is Worth a Thousand Vectors") showed how word2vec is being used in conjunction with topic modeling to improve recommendations over standard cohort analysis, and he ended his talk by showing how word2vec can be extended beyond NLP to machine translation and graph analysis.
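To make the document-averaging trick concrete, here is a minimal sketch of the kind of pipeline Tamir described: collapse each document to the mean of its word vectors and feed that fixed-length feature vector to an ordinary classifier. The embedding table, vocabulary, and toy documents are illustrative placeholders of my own (random vectors stand in for vectors trained with a skip-gram, negative-sampling model), not details from the talk.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in for a pre-trained word2vec lookup (word -> 300-dim vector).
# In practice these vectors would come from a model trained on a large
# corpus; random vectors keep the sketch self-contained.
rng = np.random.default_rng(0)
vocab = ["great", "terrible", "refund", "love", "broken", "fast"]
embeddings = {w: rng.normal(size=300) for w in vocab}

def doc_vector(doc, dim=300):
    """Average the vectors of all in-vocabulary tokens in a document."""
    vecs = [embeddings[t] for t in doc.lower().split() if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Toy labeled documents (1 = positive, 0 = negative); real data would be
# the heavily imbalanced classes discussed in the talk.
docs = ["love it fast and great", "terrible broken want refund",
        "great fast", "broken terrible"]
labels = [1, 0, 1, 0]

X = np.vstack([doc_vector(d) for d in docs])
clf = LogisticRegression().fit(X, labels)
print(clf.predict([doc_vector("fast great love")]))  # predicts [1] here
```

Because every document collapses to a single dense vector of the same length, the classifier sees one compact feature space regardless of document length or vocabulary size, which may help explain why the averaged representation can beat sparse BoW features when labeled examples are scarce or imbalanced.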