Note: GPT-2 predicts very well and was trained on 40GB of text, but below I show what else makes it tick besides dataset frequencies.

In this video I run tests on GPT-2 and show that the word it predicts is heavily influenced by story words no matter where they appear in the story. Watch as I force GPT-2 to change the probabilities by adding words in various places. Notice also that if it has already seen a word follow another word in the story, that makes the word even more likely, even though the pair was never seen in the dataset. And when you talk about trees and so on, it boosts not only tree candidates but also leaves, grass, etc. It uses something like Word2Vec not only to recognize the sentence (it also focuses on the important words, so as to summarize it), but also to vote on the prediction candidates, see!
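Here is a minimal sketch of what I mean by the word2vec-style "voting": context words lift candidates that are similar to them, so talking about trees lifts "leaf" and "grass" too. This is not GPT-2's actual code; the tiny embedding vectors and the boost formula are made up just for illustration.

```python
import numpy as np

# Made-up toy embeddings (real models learn these from data).
embeddings = {
    "tree":  np.array([0.9, 0.1, 0.0]),
    "leaf":  np.array([0.8, 0.2, 0.1]),
    "grass": np.array([0.7, 0.3, 0.0]),
    "car":   np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def boost_candidates(context_words, base_probs):
    """Boost each candidate's base probability by its average similarity to the context."""
    scores = {}
    for cand, p in base_probs.items():
        sim = np.mean([cosine(embeddings[cand], embeddings[w]) for w in context_words])
        scores[cand] = p * (1.0 + sim)          # similar words get a bigger vote
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()}  # renormalize to probabilities

# Mentioning "tree" lifts "leaf" and "grass", not just exact repeats of "tree".
print(boost_candidates(["tree"], {"leaf": 0.2, "grass": 0.2, "car": 0.6}))
```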

https://www.youtube.com/watch?v=RWd-...ature=youtu.be

What do you think about this? It lets GPT-2 consider/recognize very, very long unseen context by using something like word2vec to find context matches that aren't exact, and by focusing on the important words, which are rarer. And to stretch even further back and become more accurate, it also boosts candidate predictions using the remaining energy from prior activations. The most recent words boost the predictions the most, since they have lost the least energy.
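To make the "remaining energy" idea concrete, here is a toy sketch of my reading of it (again, not GPT-2 internals): every story word votes on the candidates by similarity, but the vote is decayed by how far back the word appears, so recent words have more energy left. The embeddings and the decay rate are assumptions for illustration only.

```python
import numpy as np

# Made-up toy embeddings, just to show the mechanism.
embeddings = {
    "forest": np.array([0.9, 0.1]),
    "tree":   np.array([0.8, 0.2]),
    "engine": np.array([0.1, 0.9]),
    "leaf":   np.array([0.85, 0.15]),
    "wheel":  np.array([0.15, 0.85]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def recency_weighted_scores(story_words, candidates, decay=0.8):
    """Score candidates by similarity to every story word, weighted by recency."""
    scores = {c: 0.0 for c in candidates}
    n = len(story_words)
    for i, w in enumerate(story_words):
        energy = decay ** (n - 1 - i)            # most recent word keeps full energy
        for c in candidates:
            scores[c] += energy * cosine(embeddings[w], embeddings[c])
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# "engine" is the most recent word, so it pulls the prediction toward "wheel"
# even though the earlier story was about forests and trees.
print(recency_weighted_scores(["forest", "tree", "engine"], ["leaf", "wheel"]))
```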