Notes on Text Mining and Analytics - 1

Word Association Mining and Analysis

1. Basic word relations

  • Paradigmatic: A and B can be substituted for each other without changing the meaning of the sentence, i.e., A and B belong to the same class. Intuitively, A and B tend to appear in similar locations within a sentence, so they have context similarity. e.g. cat, dog.
  • Syntagmatic: A and B can be combined to convey a complete meaning, i.e., A and B are semantically related. Intuitively, A and B tend to co-occur in the same context. e.g. dog, sit.

A and B need not be words; they can also be phrases.

2. Applications of mining word associations

  • Text retrieval
  • Automatic construction of topic map
  • Compare and summarize opinions (e.g., syntagmatic relations can be used to obtain more detailed descriptions)

3. Mining Word Associations: General Ideas

  • Paradigmatic:
    - Represent each word by its context
    - Compute context similarity
    - Words with high context similarity likely have a paradigmatic relation
  • Syntagmatic (see the counting sketch after this list):
    - Count how many times two words occur together in a context (e.g. sentence or paragraph)
    - Compare their co-occurrences with their individual occurrences
    - Words with high co-occurrences but relatively low individual occurrences likely have a syntagmatic relation
  • Paradigmatically related words tend to have a syntagmatic relation with the same word -> joint discovery of the two relations
  • These ideas can be implemented in many different ways
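
As a concrete illustration of the syntagmatic counting idea above, here is a minimal Python sketch. The toy corpus and the normalization ratio are illustrative assumptions, not the exact measure developed later in the course:

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus; each sentence is one context.
sentences = [
    "the dog sits on the mat",
    "a cat sits on the sofa",
    "the dog chases the cat",
]

word_counts = Counter()  # in how many contexts each word occurs
pair_counts = Counter()  # in how many contexts each pair co-occurs

for s in sentences:
    words = set(s.split())  # presence/absence per context
    word_counts.update(words)
    pair_counts.update(combinations(sorted(words), 2))

# High co-occurrence relative to individual occurrences hints at a
# syntagmatic relation (this ratio is just one simple possibility).
for (w1, w2), c in pair_counts.items():
    score = c / (word_counts[w1] * word_counts[w2])
    print(w1, w2, round(score, 3))
```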

Paradigmatic Relation Discovery

  • Context = pseudo document = “bag of words”
  • Context may contain adjacent or non-adjacent words

1. Measuring Context Similarity

  1. Convert each bag of words into a Vector Space Model (VSM) representation, treating each word as one dimension of a high-dimensional space. With N words in the vocabulary, we have N dimensions.
  2. Define a frequency vector to represent the context, whose components are the counts of each word in that context.
  3. The frequency vectors place the contexts into the VSM, so paradigmatic relation discovery becomes the problem of computing vectors and their similarity (a minimal sketch follows this list).
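
Here is a minimal sketch of steps 1-3, assuming two hypothetical contexts (the target words and their surrounding words are made up for illustration):

```python
from collections import Counter

# Hypothetical bags of words collected around two target words.
context_cat = "my cat eats fish on saturday the cat sat".split()
context_dog = "my dog eats meat on sunday the dog sat".split()

# Step 1: the vocabulary defines the N dimensions of the VSM.
vocab = sorted(set(context_cat) | set(context_dog))

# Step 2: a frequency vector counts each vocabulary word in a context.
def freq_vector(context):
    counts = Counter(context)
    return [counts[w] for w in vocab]

# Step 3: each context is now a point in the N-dimensional space,
# so similarity between contexts is similarity between vectors.
d1 = freq_vector(context_cat)
d2 = freq_vector(context_dog)
print(list(zip(vocab, d1, d2)))
```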

2. How do we compute similarity? Expected Overlap of Words in Context (EOWC)

\[Sim(d_{1}, d_{2}) = d_{1}\cdot d_{2} = x_{1}y_{1} + \ldots + x_{N}y_{N} = \sum_{i=1}^{N}x_{i}y_{i}\]

where \(x_{i}\) and \(y_{i}\) are the frequencies of word \(w_{i}\) in the context documents \(d_{1}\) and \(d_{2}\).
  • Intuitively, it makes sense: The more overlap the two context documents have, the higher the similarity would be.
  • Disadvantages (see the sketch after this list):
    1. It favors matching one frequent term very well over matching more distinct terms: a single high-frequency shared word can contribute a very large product term that dominates the sum, so it can outscore a vector pair that shares many distinct but lower-frequency words.
    2. It treats every word equally, but overlap on "the" isn't as meaningful as overlap on "eats".
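
The toy vectors below illustrate disadvantage 1 (the counts are hypothetical). Here the frequency vectors are normalized to probabilities, which gives the "expected overlap" reading of the dot product; the frequent shared word still dominates:

```python
# Frequency vectors over the vocabulary ["the", "eats", "cat", "sofa"]
# (hypothetical counts, for illustration only).
d1 = [9, 1, 1, 0]
d2 = [9, 0, 0, 1]
d3 = [2, 2, 2, 1]

def eowc(a, b):
    """Dot product of the frequency vectors normalized to probability
    distributions: the chance that a word randomly drawn from one
    context equals a word randomly drawn from the other."""
    na, nb = sum(a), sum(b)
    return sum((x / na) * (y / nb) for x, y in zip(a, b))

# d1 and d2 share only the frequent word "the", yet they score higher
# than d1 and d3, which overlap on three distinct words.
print(eowc(d1, d2))  # ~0.74
print(eowc(d1, d3))  # ~0.29
```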

3. Improving EOWC with Retrieval Heuristics

To fix problem 1, apply a sublinear transformation to the raw Term Frequency (TF). To fix problem 2, reward matching a rare word with Inverse Document Frequency (IDF) term weighting, as sketched below.
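
A minimal sketch of both heuristics, assuming a BM25-style TF transformation and a standard IDF formula (the parameter k and the document statistics are illustrative assumptions):

```python
import math

def sublinear_tf(count, k=1.2):
    """BM25-style sublinear TF transform: grows with the raw count but
    saturates at k + 1, so one very frequent term cannot dominate."""
    return (k + 1) * count / (count + k)

def idf(doc_freq, num_docs):
    """IDF weighting: the rarer a word is across documents, the higher
    its weight, which rewards matching distinctive words."""
    return math.log((num_docs + 1) / doc_freq)

num_docs = 1000
# "the" occurs 10 times in the context but appears in every document;
# "eats" occurs only twice but appears in just 50 documents.
print(sublinear_tf(10) * idf(1000, num_docs))  # near zero
print(sublinear_tf(2) * idf(50, num_docs))     # much larger weight
```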