Comparison of FastText and Word2Vec
Facebook Research open sourced a great project yesterday - fastText, a fast (no surprise) and effective method to learn word representations and perform text classification. I was curious about comparing fastText embeddings to other commonly used embeddings, so word2vec seemed like the obvious choice, especially considering fastText embeddings are based upon word2vec.
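The key addition fastText makes over word2vec is subword information: a word's vector is built from the vectors of its character n-grams (plus the word itself). As a rough illustration (my own sketch, not code from either library), the n-grams fastText would consider for a word look like this:

# Illustrative sketch of fastText-style character n-gram extraction.
# This is a simplification, not the library's actual implementation.
def char_ngrams(word, n_min=3, n_max=6):
    padded = '<' + word + '>'  # fastText pads words with angle brackets
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

print(char_ngrams('night'))
# ['<ni', 'nig', 'igh', 'ght', 'ht>', '<nig', ..., '<night', 'night>']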
Download data
import nltk
nltk.download()  # only the brown corpus is needed in case you don't have it
# alternatively, you can simply download the pretrained models below if you wish to avoid downloading and training

# Generate brown corpus text file
with open('brown_corp.txt', 'w+') as f:
    for word in nltk.corpus.brown.words():
        f.write('{word} '.format(word=word))
# download the text8 corpus (a 100 MB sample of cleaned wikipedia text)
# alternatively, you can simply download the pretrained models below if you wish to avoid downloading and training
!wget http://mattmahoney.net/dc/text8.zip
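The training code below reads the uncompressed text8 file, so the archive needs to be extracted first (assuming unzip is available in your environment):

# extract the text8 file from the downloaded archive
!unzip text8.zip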
# download the file questions-words.txt to be used for comparing word embeddings
!wget https://raw.githubusercontent.com/arfon/word2vec/master/questions-words.txt
Train models
If you wish to avoid training, you can download pre-trained models instead in the next section. For training the fastText models yourself, you'll have to follow the setup instructions for fastText and run the training with -
!./fasttext skipgram -input brown_corp.txt -output brown_ft
!./fasttext skipgram -input text8 -output text8_ft
For training the gensim models -
from nltk.corpus import brown
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
logging.root.setLevel(level=logging.INFO)

MODELS_DIR = 'models/'

brown_gs = Word2Vec(brown.sents())
brown_gs.save_word2vec_format(MODELS_DIR + 'brown_gs.vec')

text8_gs = Word2Vec(Text8Corpus('text8'))
text8_gs.save_word2vec_format(MODELS_DIR + 'text8_gs.vec')
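As a quick sanity check (my own addition, not part of the comparison itself), you can query one of the freshly trained models for nearest neighbours:

# illustrative check that the trained embeddings look sensible
print(brown_gs.most_similar('money', topn=5))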
Download models
In case you wish to avoid downloading the corpus and training the models, you can download pretrained models with -
# download the fastText and gensim models trained on the brown corpus and text8 corpus
!wget https://www.dropbox.com/s/4kray3epy439gca/models.tar.gz?dl=1 -O models.tar.gz
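The models then need to be unpacked; assuming a gzipped tarball that extracts into the models/ directory, as the file name suggests:

# unpack the pretrained models
!tar -xzvf models.tar.gz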
Once you have downloaded or trained the models (make sure they're in the models/ directory, or that you've appropriately changed MODELS_DIR) and downloaded questions-words.txt, you're ready to run the comparison.
Comparisons
from gensim.models import Word2Vec

def print_accuracy(model, questions_file):
    print('Evaluating...\n')
    acc = model.accuracy(questions_file)
    # per-section accuracy
    for section in acc:
        correct = len(section['correct'])
        total = len(section['correct']) + len(section['incorrect'])
        total = total if total else 1
        accuracy = 100 * float(correct) / total
        print('{:d}/{:d}, {:.2f}%, Section: {:s}'.format(correct, total, accuracy, section['section']))
    # the first 5 sections of questions-words.txt are semantic analogies,
    # the rest (excluding the final 'total' section) are syntactic
    sem_correct = sum(len(acc[i]['correct']) for i in range(5))
    sem_total = sum(len(acc[i]['correct']) + len(acc[i]['incorrect']) for i in range(5))
    print('\nSemantic: {:d}/{:d}, Accuracy: {:.2f}%'.format(sem_correct, sem_total, 100 * float(sem_correct) / sem_total))
    syn_correct = sum(len(acc[i]['correct']) for i in range(5, len(acc) - 1))
    syn_total = sum(len(acc[i]['correct']) + len(acc[i]['incorrect']) for i in range(5, len(acc) - 1))
    print('Syntactic: {:d}/{:d}, Accuracy: {:.2f}%\n'.format(syn_correct, syn_total, 100 * float(syn_correct) / syn_total))

MODELS_DIR = 'models/'
word_analogies_file = 'questions-words.txt'
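The snippet above cuts off just as the models are loaded; a plausible continuation, assuming the .vec files were saved under the names used in the training steps above, loads each model with gensim and passes it to print_accuracy:

# Load each set of saved vectors and evaluate it on the analogy task.
# File names assume the training/saving steps above; adjust if yours differ.
print('Loading Gensim embeddings')
brown_gs = Word2Vec.load_word2vec_format(MODELS_DIR + 'brown_gs.vec')
print('Accuracy for Word2Vec:')
print_accuracy(brown_gs, word_analogies_file)

print('Loading FastText embeddings')
brown_ft = Word2Vec.load_word2vec_format(MODELS_DIR + 'brown_ft.vec')
print('Accuracy for FastText (with n-grams):')
print_accuracy(brown_ft, word_analogies_file)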