CS224 lesson2—Main Line

2017-08-21 CS224学习阅读量

The first and arguably most important common denominator across all NLP tasks is how we represent words as input to any of our models.

本章节主要通过 ”How to represent words?“ 这一话题展开，借鉴了Note 1的编写思路，但是也掺杂了自己的一些小想法，重在理解:happy:

1. 思路

1.1 One-Hot Vector

There are an estimated 13 million(1.3亿）tokens for the English language.（1）

　　上面的 fact ，我们可以对word形成最简单的理解，若给每个word标记一个index，单词的表示方式为
$$
w_i = (0,0,…,0,1,0,…,0,0)^T ( 0<= i < |V|)
$$
　　|v|表示单词表中的单词总量，index = i 位置为 1 表示：wi 为单词表中index = i处的单词。

　　这种向量我们称为 one-hot vector

　　但是，我相信任何人都不希望用这种向量来represent a word，原因是：

All of the words are not completely unrelated.

If we encode all semantics of our language, perhaps there actually exists some N-dimensional space and N << 13 million.

　　然而，one-hot vector 引起的问题是：

One-hot vector represent each word as a completely independent entity. For instance,

$$
(w^{hotel})^Tw^{motel} = (w^{hotel})^Tw^{cat} = 0
$$
　　This word representation does not give us directly any idea of the relationship of two different words.

The dimensions of one-hot vector is really big. They are beyond the limits of computer ability.

　　此时恍然顿悟到我们需要的 representation of the word 有以下几个特点

The dimension should be as small as possible and the computer can easily handle these vectors.

This kind of word representation can give us any directly notion of the word similariy

1.2 Word Representation Based on SVD

　　由上面的叙述可知，我们更想找到的是一种 word embeddings 方法，也就是说我们更想知道的是能否存在某种向量可以 find a subspace that can encode the relationships between words，这种关系可以更形象的称之为“语义”（Semantic）。而SVD（Singular Value Decomposition)方法可以帮助我们。

　　为了形成解决思路，我们首先看一下SVD的表达式
$$
A = U\sum V^T
$$
　　通过上面这个公式，我们可以直观的看到SVD算法似乎是将一个矩阵分解成了几个向量的乘积的形式。而分解的结果我们可以通过下图看到，

　　如果我们将A矩阵看做是一个Corpus的所有word vector组成的矩阵，那么我们可以通过SVD来对这个庞大的A矩阵进行分解，采用U矩阵的行向量来作为word vector，从而达到降维的效果
$$
(0,0,…,0,1,0,…,0,0)^T —> (u_1,u_2,…,u_r)
$$
　　将一个维度为|V|的向量变换成维度为r的向量，其中 r << |V| 。在这里重申一遍降维的含义：降维意味着对语义进行编码，但是我们并不关心这种语义具体是什么，毕竟语义都是我们人为规定的。

SVD的原理可以查看我的下一篇博文 CS224 lesson2—Reference 1 : SVD (Singular Value Decomposition)

1.2.1 矩阵A的表示

现在思路已经清晰，我们现在的首要问题矩阵 A 的表示：

（1）Word-Document Matrix

A bold hypothesis is that related words will often appear in the same documents.

例如，

“banks”,”stocks”,”money” 这几个单词经常出现在一个document中。

“banks”,”bananas”,”rocky” 这几个单词不太可能出现在一个document中。

假设，document~i~ 我们用一个向量 a~i~来表示
$$
ai = (a{i1},a{i2},….,a{in})^T
$$
其中n = |V|，a~ij~ 表示document~i~ 含有的word~j~ 的个数。如果我们有 M 个documents，
$$
A= (a_1,a_2,…,a_M)
$$
此时，我们可能会担心是否又到了问题的原点，因为也许这里有more than billions of documents 等待着我们去 loop。看来我们需要另辟蹊径。

（2）Window based Co-occurence Matrix

我们仍然认为 co-occurences of words 有一定的关系，而另矩阵 A 来存储这种关系。我们的方法是：

Counting the number of times each word appears inside a window of a particular size around the word of interest.

也就是，计算我们感兴趣的词（称之为center word）的某个长度的窗口中所有词的出现个数。假如，我感兴趣的单词是“like”，设定的窗口长度为2。

eg1

可知当中心词为like时， “I” 出现2次，其他单词出现1次。

借用 Note1 中的例子：

eg2

这里的 X 矩阵就是我们要求的A矩阵，所以矩阵 A 的维度为 |V|*|V|。