Comparison of FastText and Word2Vec
Published: 2019-06-23


 

Facebook Research open sourced a great project yesterday - fastText, a fast (no surprise) and effective method to learn word representations and perform text classification. I was curious how these embeddings compare to other commonly used embeddings, so word2vec seemed like the obvious choice, especially since fastText embeddings are based on word2vec.

 

Download data

In [ ]:
import nltk
nltk.download()  # Only the brown corpus is needed in case you don't have it.
# alternately, you can simply download the pretrained models below if you wish to avoid downloading and training

# Generate brown corpus text file
with open('brown_corp.txt', 'w+') as f:
    for word in nltk.corpus.brown.words():
        f.write('{word} '.format(word=word))
In [ ]:
# download the text8 corpus (a 100 MB sample of cleaned wikipedia text)
# alternately, you can simply download the pretrained models below if you wish to avoid downloading and training
!wget http://mattmahoney.net/dc/text8.zip
In [ ]:
# download the file questions-words.txt to be used for comparing word embeddings
!wget https://raw.githubusercontent.com/arfon/word2vec/master/questions-words.txt
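For orientation, questions-words.txt groups its analogy questions into sections: a line starting with `:` names a section, and each following line holds four words (a is to b as c is to d). The sketch below parses that layout using only the standard library. The function name `parse_analogy_questions` and the inline sample are illustrative assumptions, not part of gensim or of the original notebook.

```python
from collections import OrderedDict


def parse_analogy_questions(lines):
    """Group four-word analogy lines under their ': section' headers."""
    sections = OrderedDict()
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(':'):
            # A header such as ': capital-common-countries' opens a new section.
            current = line[1:].strip()
            sections[current] = []
        else:
            words = line.split()
            if current is None or len(words) != 4:
                continue  # skip malformed lines
            sections[current].append(tuple(words))
    return sections


# Tiny inline sample in the same layout as the real file (hypothetical excerpt).
sample = """: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
: gram1-adjective-to-adverb
amazing amazingly apparent apparently
"""

parsed = parse_analogy_questions(sample.splitlines())
for name, questions in parsed.items():
    print(name, len(questions))
```

To inspect the downloaded file instead of the sample, pass an open file handle, for example `parse_analogy_questions(open('questions-words.txt'))`.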
 

Train models

 

If you wish to avoid training, you can download pre-trained models instead in the next section. For training the fastText models yourself, you'll have to follow the setup instructions for fastText and run the training with -

In [ ]:
!./fasttext skipgram -input brown_corp.txt -output brown_ft
!./fasttext skipgram -input text8.txt -output text8_ft
 

For training the gensim models -

In [ ]:
from nltk.corpus import brown
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
logging.root.setLevel(level=logging.INFO)

MODELS_DIR = 'models/'

# Note: recent gensim releases expose save_word2vec_format on the model's `wv` attribute
# (e.g. brown_gs.wv.save_word2vec_format); the calls below match the gensim version used here.
brown_gs = Word2Vec(brown.sents())
brown_gs.save_word2vec_format(MODELS_DIR + 'brown_gs.vec')

text8_gs = Word2Vec(Text8Corpus('text8'))
text8_gs.save_word2vec_format(MODELS_DIR + 'text8_gs.vec')
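The `.vec` files written above use the plain-text word2vec format: a header line with the vocabulary size and vector dimensionality, then one line per word with the word followed by its components. fastText writes its `.vec` output in the same layout, which is what makes the two sets of embeddings directly comparable. As a rough sketch of that layout, here is a standard-library reader; `read_word2vec_text` and the three-dimensional sample vectors are made up for illustration and are not a gensim API.

```python
import io


def read_word2vec_text(handle):
    """Read vectors from the plain-text word2vec format into a dict."""
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        # rstrip() also drops a trailing space before the newline, if present.
        parts = line.rstrip().split(' ')
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError('header declares {} words, found {}'.format(vocab_size, len(vectors)))
    return vectors


# Hypothetical two-word, three-dimensional file held in memory.
sample = io.StringIO("2 3\nthe 0.1 0.2 0.3\nof -0.4 0.5 0.6\n")
vectors = read_word2vec_text(sample)
print(sorted(vectors))
```

For real models it is simpler to let gensim load the file; this reader is only meant to show what the file contains.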
 

Download models

In case you wish to avoid downloading the corpus and training the models, you can download pretrained models with -

In [ ]:
# download the fastText and gensim models trained on the brown corpus and text8 corpus
!wget https://www.dropbox.com/s/4kray3epy439gca/models.tar.gz?dl=1 -O models.tar.gz
 

Once you have downloaded or trained the models (make sure they're in the models/ directory, or that you've appropriately changed MODELS_DIR) and downloaded questions-words.txt, you're ready to run the comparison.

 

Comparisons

In [1]:
from gensim.models import Word2Vec

def print_accuracy(model, questions_file):
    print('Evaluating...\n')
    acc = model.accuracy(questions_file)
    for section in acc:
        correct = len(section['correct'])
        total = len(section['correct']) + len(section['incorrect'])
        total = total if total else 1
        accuracy = 100*float(correct)/total
        print('{:d}/{:d}, {:.2f}%, Section: {:s}'.format(correct, total, accuracy, section['section']))
    sem_correct = sum((len(acc[i]['correct']) for i in range(5)))
    sem_total = sum((len(acc[i]['correct']) + len(acc[i]['incorrect'])) for i in range(5))
    print('\nSemantic: {:d}/{:d}, Accuracy: {:.2f}%'.format(sem_correct, sem_total, 100*float(sem_correct)/sem_total))
    syn_correct = sum((len(acc[i]['correct']) for i in range(5, len(acc)-1)))
    syn_total = sum((len(acc[i]['correct']) + len(acc[i]['incorrect'])) for i in range(5,len(acc)-1))
    print('Syntactic: {:d}/{:d}, Accuracy: {:.2f}%\n'.format(syn_correct, syn_total, 100*float(syn_correct)/syn_total))

MODELS_DIR = 'models/'
word_analogies_file
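The aggregation inside `print_accuracy` treats the first five sections as semantic and the remaining ones, except the final entry, as syntactic; as far as I can tell the final entry is gensim's overall total, which is why it is skipped. The sketch below isolates that arithmetic so it can be checked without a trained model. `summarize_sections` and the mock results are hypothetical; only the list-of-dicts shape (`section`, `correct`, `incorrect`) is taken from the code above.

```python
def summarize_sections(acc, n_semantic=5):
    """Split per-section results into semantic and syntactic totals."""
    def tally(sections):
        correct = sum(len(s['correct']) for s in sections)
        total = correct + sum(len(s['incorrect']) for s in sections)
        percent = 100.0 * correct / total if total else 0.0
        return correct, total, percent

    return {
        'semantic': tally(acc[:n_semantic]),
        # Skip the last entry, mirroring range(5, len(acc)-1) in print_accuracy.
        'syntactic': tally(acc[n_semantic:-1]),
    }


# Mock results: five semantic sections, two syntactic sections, one trailing total.
q = ('a', 'b', 'c', 'd')
mock_acc = [{'section': 'sem-%d' % i, 'correct': [q], 'incorrect': [q]} for i in range(5)]
mock_acc += [{'section': 'syn-%d' % i, 'correct': [q, q, q], 'incorrect': [q]} for i in range(2)]
mock_acc.append({'section': 'total', 'correct': [q] * 11, 'incorrect': [q] * 7})

summary = summarize_sections(mock_acc)
for kind, (correct, total, percent) in summary.items():
    print('{}: {:d}/{:d}, {:.2f}%'.format(kind, correct, total, percent))
```

Unlike the original helper, this version returns 0.0 rather than dividing by zero when a group has no questions in the model's vocabulary.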

Reposted from: http://yadra.baihongyu.com/
