BERT (language model)

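Bidirectional Encoder Representations from Transformers (BERT) is a Transformer-based machine learning technique for pre-training in natural language processing[1]. BERT was created and published in 2018 by Jacob Devlin and his colleagues at Google[2][3]. As of 2019, Google has been using BERT to better understand user searches[4].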

Background

Directionality constraint

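Many language models before BERT used unidirectional tasks for pre-training[5], so the representations they learned took only one-directional context into account. This constraint can heavily penalize performance on tasks that require context-level representations.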

Method

Against this background, BERT proposed a language representation model that combines the MLM pre-training task with a bidirectional Transformer encoder.

Bidirectional task / MLM

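To build a bidirectional language model that goes beyond the unidirectionality constraint, BERT adopted the masked language model (MLM) as its pre-training task and loss function[6]. In MLM, a partially masked sequence is given as input and the model predicts the unmasked sequence; the loss is computed by comparing the outputs at the masked positions with the original tokens[7]. The model therefore solves a pre-training task in which it must predict the masked positions using only the unmasked information (the surrounding context)[8].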
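The masking and loss computation described above can be sketched as follows. This is a minimal illustration, not the original implementation: the helper names `mask_tokens` and `mlm_loss` are hypothetical, every selected token is simply replaced with `[MASK]`, and the model's predictions are stood in for by plain dictionaries of probabilities.

```python
import math
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels[i] holds the original token at masked positions and None
    elsewhere, so the loss can be restricted to the masked positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the model
            labels.append(tok)    # remember what should be predicted
        else:
            masked.append(tok)    # visible context
            labels.append(None)   # no loss at this position
    return masked, labels


def mlm_loss(predicted_probs, labels):
    """Average negative log-likelihood over masked positions only.

    predicted_probs[i] is a dict mapping token -> probability for
    position i (a stand-in for a real model's output distribution).
    """
    losses = [
        -math.log(predicted_probs[i][label])
        for i, label in enumerate(labels)
        if label is not None
    ]
    return sum(losses) / len(losses) if losses else 0.0


tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens, mask_prob=0.5, seed=1)
print(masked)
print(labels)
```

Because unmasked positions carry a `None` label, they contribute nothing to the loss, which mirrors the statement that agreement is measured only at the masked positions.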

Bidirectional network / Bidirectional Transformer

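Because MLM makes it possible to use a model that depends on both directions, BERT adopted a bidirectional Transformer architecture (the bidirectional encoder[9] of the Transformer) as its network[10]. That is, it uses a network that repeatedly applies self-attention, which draws in context from both the preceding and the following positions, followed by a position-wise fully connected transformation.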
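The two repeated steps, unmasked self-attention and a position-wise fully connected transformation, can be sketched in plain Python. This is a deliberately reduced illustration under stated assumptions: a single attention head with identity query/key/value projections, ReLU as the activation, and no residual connections or layer normalization. The function names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(X):
    """Unmasked single-head self-attention with identity projections.

    X is a list of d-dimensional vectors, one per position. Every
    position attends to all positions, to its left and right alike.
    """
    d = len(X[0])
    out = []
    for q in X:
        # No causal mask: scores are computed against every position.
        weights = softmax([dot(q, k) / math.sqrt(d) for k in X])
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out


def position_wise_ffn(X, W1, b1, W2, b2):
    """Apply the same two-layer network to each position independently."""
    def linear(x, W, b):
        return [dot(row, x) + bias for row, bias in zip(W, b)]

    out = []
    for x in X:
        hidden = [max(0.0, v) for v in linear(x, W1, b1)]  # ReLU (assumption)
        out.append(linear(hidden, W2, b2))
    return out


def encoder_block(X, W1, b1, W2, b2):
    """One simplified encoder step: attention, then position-wise FFN."""
    return position_wise_ffn(self_attention(X), W1, b1, W2, b2)


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
print(encoder_block(X, identity, zeros, identity, zeros))
```

Changing only the last vector in `X` changes the output at position 0, which is the bidirectional dependence the text describes; a left-to-right model would mask those later positions out.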

Performance

The original English BERT is available as pre-trained models based on the following two architectures[11]:

BERT_BASE model – 12 layers, hidden size 768, 12 attention heads, 110 million parameters
BERT_LARGE model – 24 layers, hidden size 1024, 16 attention heads, 340 million parameters

Both were trained on the 800 million words of BooksCorpus[12] and the 2.5 billion words of English Wikipedia.
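The parameter counts listed for the two architectures can be roughly reproduced from the layer count and hidden size. The sketch below is a back-of-the-envelope calculation, not an official formula: the function name is hypothetical, and the vocabulary size (30,522), maximum sequence length (512), two segment types, feed-forward width of four times the hidden size, and final pooling layer are assumptions taken from the commonly distributed uncased English checkpoints, not from the text above.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522,
                     max_positions=512, type_vocab=2, with_pooler=True):
    """Approximate trainable-parameter count for a BERT-style encoder."""
    ffn = 4 * hidden          # assumed feed-forward width
    layer_norm = 2 * hidden   # scale and bias vectors

    # Token, position and segment embeddings, followed by one layer norm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Query, key, value and output projections (weights plus biases).
    attention = 4 * (hidden * hidden + hidden)
    # Two linear layers of the position-wise feed-forward network.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm

    pooler = hidden * hidden + hidden if with_pooler else 0
    return embeddings + num_layers * per_layer + pooler


base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"BASE:  {base:,}")
print(f"LARGE: {large:,}")
```

Under these assumptions the totals come out at about 109 million and 335 million, in line with the rounded figures of 110 million and 340 million. The number of attention heads does not appear in the formula because the heads split the hidden dimension among themselves and add no parameters of their own.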

When BERT was published, it achieved state-of-the-art performance on a number of natural language understanding tasks[2]:

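GLUE (General Language Understanding Evaluation) task set (consisting of 9 tasks)
SQuAD (Stanford Question Answering Dataset) v1.1 and v2.0
SWAG (Situations With Adversarial Generations)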

Analysis

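The reasons for BERT's state-of-the-art performance on these natural language understanding tasks are not yet well understood[13][14]. Current research focuses on investigating the relationships behind BERT's output as a result of carefully chosen input sequences[15][16], analyzing internal vector representations with probing classifiers[17][18], and examining the relationships represented by attention weights.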

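BERT has its origins in pre-trained contextual representations, including semi-supervised sequence learning[19], generative pre-training, ELMo[20], and ULMFit[21]. Unlike previous models, BERT is a bidirectional, unsupervised language representation pre-trained using only a plain text corpus. Context-free models such as word2vec and GloVe generate a single word embedding for each word in the vocabulary, whereas BERT takes the context of each occurrence of a word into account. For example, word2vec assigns the same vector to "running" in "He is running a company" (to manage) and in "He is running a marathon" (to run), while BERT produces a contextualized embedding that differs from sentence to sentence.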
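The difference between a context-free lookup and a contextualized embedding can be illustrated with a toy example. Everything here is hypothetical: the two-dimensional vectors are made up, and `toy_contextual_embed` simply mixes each word vector with the sentence mean as a stand-in for what a model like BERT does with many layers of self-attention.

```python
# Made-up static vectors, one per vocabulary entry (illustrative only).
STATIC = {
    "he": [0.1, 0.0],
    "is": [0.0, 0.1],
    "running": [0.5, 0.5],
    "a": [0.0, 0.0],
    "company": [1.0, 0.0],
    "marathon": [0.0, 1.0],
}


def static_embed(tokens):
    """Context-free lookup: one fixed vector per word type."""
    return [STATIC[t] for t in tokens]


def toy_contextual_embed(tokens):
    """Toy contextualization: blend each vector with the sentence mean."""
    vecs = static_embed(tokens)
    dim = len(vecs[0])
    mean = [sum(v[j] for v in vecs) / len(vecs) for j in range(dim)]
    return [[0.5 * v[j] + 0.5 * mean[j] for j in range(dim)] for v in vecs]


s1 = "he is running a company".split()
s2 = "he is running a marathon".split()
i = s1.index("running")

# Same word, same vector, regardless of the sentence.
print(static_embed(s1)[i] == static_embed(s2)[i])
# Same word, different vectors once the surrounding words are mixed in.
print(toy_contextual_embed(s1)[i] == toy_contextual_embed(s2)[i])
```

The static lookup cannot separate the two senses of "running", while even this crude mixing yields different vectors for the two sentences.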

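On October 25, 2019, Google Search announced that it had begun applying BERT models to English-language search queries within the United States[22]. On December 9, 2019, it was reported that BERT had been adopted by Google Search for more than 70 languages[23]. By October 2020, almost every English-language query was processed by BERT[24].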

Recognition

BERT won the Best Long Paper Award at the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)[25].

Notes

References

"We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers." Devlin (2018)
Devlin, Jacob; Chang, Ming-Wei (11 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805v2 [cs.CL].
“Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing”. Google AI Blog. Retrieved 27 November 2019.
“Understanding searches better than ever before”. Google (25 October 2019). Retrieved 27 November 2019.
"objective function during pre-training, where they use unidirectional language models to learn general language representations" Devlin (2018)
"BERT alleviates the previously mentioned unidirectionality constraint by using a 'masked language model' (MLM) pre-training objective" Devlin (2018)
"The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word" Devlin (2018)
"predict the original vocabulary id of the masked word based only on its context." Devlin (2018)
"Critically ... the BERT Transformer uses bidirectional self-attention ... We note that in the literature the bidirectional Transformer is often referred to as a 'Transformer encoder' while the left-context-only version is referred to as a 'Transformer decoder' since it can be used for text generation."
"the MLM objective enables the representation to fuse the left and the right context, which allows us to pretrain a deep bidirectional Transformer." Devlin (2018)
Devlin, Jacob; Chang, Ming-Wei (11 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805v2 [cs.CL].
Zhu, Yukun; Kiros, Ryan (2015). "Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books". arXiv:1506.06724 [cs.CV].
Kovaleva, Olga; Romanov, Alexey; Rogers, Anna; Rumshisky, Anna (November 2019). “Revealing the Dark Secrets of BERT”. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 4364–4373. doi:10.18653/v1/D19-1445.
Clark, Kevin; Khandelwal, Urvashi; Levy, Omer; Manning, Christopher D. (2019). “What Does BERT Look at? An Analysis of BERT's Attention”. Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP (Stroudsburg, PA, USA: Association for Computational Linguistics): 276–286. doi:10.18653/v1/w19-4828.
Khandelwal, Urvashi; He, He; Qi, Peng; Jurafsky, Dan (2018). “Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context”. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Stroudsburg, PA, USA: Association for Computational Linguistics): 284–294. arXiv:1805.04623. Bibcode: 2018arXiv180504623K. doi:10.18653/v1/p18-1027.
Gulordava, Kristina; Bojanowski, Piotr; Grave, Edouard; Linzen, Tal; Baroni, Marco (2018). “Colorless Green Recurrent Networks Dream Hierarchically”. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (Stroudsburg, PA, USA: Association for Computational Linguistics): 1195–1205. arXiv:1803.11138. Bibcode: 2018arXiv180311138G. doi:10.18653/v1/n18-1108.
Giulianelli, Mario; Harding, Jack; Mohnert, Florian; Hupkes, Dieuwke; Zuidema, Willem (2018). “Under the Hood: Using Diagnostic Classifiers to Investigate and Improve how Language Models Track Agreement Information”. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP (Stroudsburg, PA, USA: Association for Computational Linguistics): 240–248. arXiv:1808.08079. Bibcode: 2018arXiv180808079G. doi:10.18653/v1/w18-5426.
Zhang, Kelly; Bowman, Samuel (2018). “Language Modeling Teaches You More than Translation Does: Lessons Learned Through Auxiliary Syntactic Task Analysis”. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP (Stroudsburg, PA, USA: Association for Computational Linguistics): 359–361. doi:10.18653/v1/w18-5448.
Dai, Andrew; Le, Quoc (4 November 2015). "Semi-supervised Sequence Learning". arXiv:1511.01432 [cs.LG].
Peters, Matthew; Neumann, Mark (15 February 2018). "Deep contextualized word representations". arXiv:1802.05365v2 [cs.CL].
Howard, Jeremy; Ruder, Sebastian (18 January 2018). "Universal Language Model Fine-tuning for Text Classification". arXiv:1801.06146v5 [cs.CL].
Nayak (25 October 2019). “Understanding searches better than ever before”. Google Blog. Retrieved 10 December 2019.
Montti (10 December 2019). “Google's BERT Rolls Out Worldwide”. Search Engine Journal. Retrieved 10 December 2019.
“Google: BERT now used on almost every English query”. Search Engine Land (15 October 2020). Retrieved 24 November 2020.
“Best Paper Awards”. NAACL (2019). Retrieved 28 March 2020.
