  3.1 What is WordNet









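The backbone of the noun network is the hierarchy of subsumption relations (hypernymy/hyponymy), which accounts for nearly 80% of all relations. At the top of the hierarchy are 11 abstract concepts called unique beginners, for example entity ("a concrete thing, living or non-living") and psychological feature ("a mental feature of a living organism"). The deepest path in the noun hierarchy is 16 nodes long.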


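Put simply, WordNet is a well-structured knowledge base: besides the functions of an ordinary dictionary, it also carries classification information about words. WordNet-based similarity methods are by now fairly mature, for example the path-based method (lch) and the information-theoretic method (res). (See the reference at the end for details.)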
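To make the two measure names concrete, here is a minimal Java sketch of the formulas as they are commonly stated: Leacock-Chodorow (lch) as -ln(len / (2 * D)), with len the shortest path counted in nodes and D the maximum taxonomy depth, and Resnik (res) as the information content -ln(p) of the lowest common subsumer. The class and method names are hypothetical and the inputs are passed in directly rather than read from WordNet.

```java
/** Hypothetical sketch of the lch and res formulas; not taken from any library. */
final class PathAndIcSketch {

    /**
     * Leacock-Chodorow similarity: -ln(pathLength / (2 * maxDepth)).
     * pathLengthInNodes counts nodes, so two identical concepts have length 1.
     */
    static double lch(int pathLengthInNodes, int maxDepth) {
        if (pathLengthInNodes < 1 || maxDepth < 1) {
            throw new IllegalArgumentException("path length and depth must be >= 1");
        }
        return -Math.log(pathLengthInNodes / (2.0 * maxDepth));
    }

    /**
     * Resnik similarity: the information content of the lowest common
     * subsumer, IC = -ln(p), where p is that concept's corpus probability.
     */
    static double res(double probabilityOfLcs) {
        if (probabilityOfLcs <= 0 || probabilityOfLcs > 1) {
            throw new IllegalArgumentException("probability must be in (0, 1]");
        }
        return -Math.log(probabilityOfLcs);
    }

    public static void main(String[] args) {
        // 16 is the maximum noun depth quoted in the text above
        System.out.println("lch, identical concepts: " + lch(1, 16));
        System.out.println("lch, path of 5 nodes:    " + lch(5, 16));
        System.out.println("res, p(LCS) = 0.001:     " + res(0.001));
    }
}
```

Shorter paths give a larger lch score, and a rarer (more specific) common subsumer gives a larger res score.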


    3.2 Installing and configuring WordNet

Having WordNet amounts to having the word database we need. So, setting the similarity computation aside for the moment, let's download WordNet first.




    The bin directory contains the executable WordNet 2.1.exe:



As you can see, WordNet classifies the English words it covers and organizes them into a semantic tree. In this example: entity -> abstract entity -> abstraction -> attribute -> state -> feeling -> emotion -> love.
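The chain above can be modelled as a child-to-hypernym map. The following is a small sketch, not WordNet's API: the class name and the map are hypothetical, and the only data in it is the single chain quoted in the text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical toy taxonomy holding only the chain quoted in the text. */
final class HypernymChainSketch {

    /** Maps each concept to its direct hypernym (parent); the root has no entry. */
    private static final Map<String, String> HYPERNYM = new HashMap<>();

    static {
        String[] chain = {"entity", "abstract entity", "abstraction", "attribute",
                          "state", "feeling", "emotion", "love"};
        for (int i = 1; i < chain.length; i++) {
            HYPERNYM.put(chain[i], chain[i - 1]);
        }
    }

    /**
     * Walks upward from the concept until no hypernym is left, then reverses
     * the result so it reads root first. An unknown concept yields a
     * one-element path containing just itself.
     */
    static List<String> pathFromRoot(String concept) {
        List<String> path = new ArrayList<>();
        for (String node = concept; node != null; node = HYPERNYM.get(node)) {
            path.add(node);
        }
        Collections.reverse(path);
        return path;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" -> ", pathFromRoot("love")));
    }
}
```

The length of the returned path is the depth of the concept counted in nodes, which is the quantity path-based measures work with.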











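We first take WordNet's lexical-semantic classification as the basis and extract the synonyms from it, then compute similarity with a vector-space method. The workflow is as follows: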




         5.1   Feature extraction


         5.2   Computing sense similarity and word similarity
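As a rough illustration of the vector-space idea, the sketch below represents each sense as a bag of feature words, scores two senses by cosine similarity, and takes the best-scoring sense pair as the word similarity. This is an assumption about the general shape of the method, not the exact procedure of the cited paper; the class name, the binary feature weights, and the example features are all hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: cosine similarity over sparse feature vectors. */
final class VectorSpaceSketch {

    /** Builds a binary feature vector (weight 1.0 for every listed feature). */
    static Map<String, Double> vec(String... features) {
        Map<String, Double> v = new HashMap<>();
        for (String f : features) {
            v.put(f, 1.0);
        }
        return v;
    }

    /** Cosine of the angle between two sparse vectors; 0 if either is empty. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Word similarity as the maximum sense similarity over all sense pairs. */
    static double wordSimilarity(List<Map<String, Double>> senses1,
                                 List<Map<String, Double>> senses2) {
        double best = 0;
        for (Map<String, Double> s1 : senses1) {
            for (Map<String, Double> s2 : senses2) {
                best = Math.max(best, cosine(s1, s2));
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // made-up features purely for demonstration
        Map<String, Double> s1 = vec("commerce", "exchange", "goods");
        Map<String, Double> s2 = vec("exchange", "goods", "agreement");
        System.out.println(cosine(s1, s2));
    }
}
```

Taking the maximum over sense pairs mirrors what the getSimilarity(String, String) method in the library code further down does with synsets.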











First, the first pair: trade vs trade




Next, the second pair: trade#n#5   vs deal#n#1






trade#n#7   vs deal#n#2
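Labels such as trade#n#5 use the word#POS#sense encoding that the library's getSimilarity(String, String) accepts (its Javadoc gives cat#n#1 as the first sense of the noun cat). The following sketch shows how such a label splits into its parts; the class and field names are hypothetical and it does not touch WordNet.

```java
/** Hypothetical parser for labels of the form word, word#pos, or word#pos#sense. */
final class SenseLabelSketch {
    final String lemma;  // the word itself, e.g. "trade"
    final String pos;    // POS key such as "n" or "v"; null if not given
    final int sense;     // 1-based sense index; 0 if not given

    private SenseLabelSketch(String lemma, String pos, int sense) {
        this.lemma = lemma;
        this.pos = pos;
        this.sense = sense;
    }

    /** Splits the label on '#' into up to three components. */
    static SenseLabelSketch parse(String label) {
        String[] data = label.split("#");
        String pos = data.length > 1 ? data[1] : null;
        int sense = data.length > 2 ? Integer.parseInt(data[2]) : 0;
        return new SenseLabelSketch(data[0], pos, sense);
    }

    public static void main(String[] args) {
        SenseLabelSketch s = parse("trade#n#5");
        System.out.println(s.lemma + " / " + s.pos + " / " + s.sense);
    }
}
```

A bare word leaves both the POS and the sense open, in which case the library compares every sense of every part of speech and reports the best pair.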













package JWordNetSim.test;

import java.io.FileInputStream;
import java.util.HashMap;
import java.util.Map;

import net.didion.jwnl.JWNL;
import net.didion.jwnl.data.IndexWord;
import net.didion.jwnl.data.POS;
import net.didion.jwnl.dictionary.Dictionary;
import shef.nlp.wordnet.similarity.SimilarityMeasure;

/**
 * A simple test of this WordNet similarity library.
 * @author Mark A. Greenwood
 */
public class Test
{
    public static void main(String[] args) throws Exception
    {
        //before running this code WordNet 2.0 must be installed locally;
        //only 2.0 works, installing 2.1 causes errors
        JWNL.initialize(new FileInputStream("D:\\JAVAProjectWorkSpace\\jwnl\\JWordNetSim\\test\\wordnet.xml"));

        //create a map to hold the configuration parameters
        Map<String,String> params = new HashMap<String,String>();

        //the simType parameter is the class name of the measure to use
        params.put("simType","shef.nlp.wordnet.similarity.JCn");

        //this param should be the URL to an infocontent file (if required
        //by the similarity measure being loaded)
        params.put("infocontent","file:D:\\JAVAProjectWorkSpace\\jwnl\\JWordNetSim\\test\\ic-bnc-resnik-add1.dat");

        //this param should be the URL to a mapping file if the
        //user needs to make synset mappings
        params.put("mapping","file:D:\\JAVAProjectWorkSpace\\jwnl\\JWordNetSim\\test\\domain_independent.txt");

        //create the similarity measure
        SimilarityMeasure sim = SimilarityMeasure.newInstance(params);

        //look up the words
        // Dictionary dict = Dictionary.getInstance();
        // IndexWord word1 = dict.getIndexWord(POS.NOUN, "trade"); //here trade and dog are treated strictly as nouns
        // IndexWord word2 = dict.getIndexWord(POS.NOUN,"dog"); //
        //
        // //and get the similarity between the first senses of each word
        // System.out.println(word1.getLemma()+"#"+word1.getPOS().getKey()+"#1 " + word2.getLemma()+"#"+word2.getPOS().getKey()+"#1 " + sim.getSimilarity(word1.getSense(1), word2.getSense(1)));

        // //get similarity using the string methods (note this also makes use
        // //of the fake root node)
        // System.out.println(sim.getSimilarity("trade#n","deal#n"));

        //get a similarity that involves a mapping
        System.out.println(sim.getSimilarity("trade", "trade"));
        System.out.println(sim.getSimilarity("trade", "deal"));
        System.out.println(sim.getSimilarity("trade", "commerce"));
        System.out.println(sim.getSimilarity("trade", "transaction"));
        System.out.println(sim.getSimilarity("trade", "finance"));
        System.out.println(sim.getSimilarity("trade", "financial"));
        System.out.println(sim.getSimilarity("trade", "business"));
        System.out.println(sim.getSimilarity("trade", "economy"));
        System.out.println(sim.getSimilarity("trade", "school"));
        System.out.println(sim.getSimilarity("trade", "dog"));
        System.out.println(sim.getSimilarity("trade", "cat"));
        System.out.println(sim.getSimilarity("trade", "book"));
        System.out.println(sim.getSimilarity("trade", "sunshine"));
        System.out.println(sim.getSimilarity("trade", "smile"));
        System.out.println(sim.getSimilarity("trade", "nice"));
        System.out.println(sim.getSimilarity("trade", "hardly"));
        System.out.println(sim.getSimilarity("trade", "beautiful"));
    }
}


package shef.nlp.wordnet.similarity;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

import net.didion.jwnl.JWNLException;
import net.didion.jwnl.data.IndexWord;
import net.didion.jwnl.data.POS;
import net.didion.jwnl.data.Synset;
import net.didion.jwnl.dictionary.Dictionary;

/**
 * An abstract notion of a similarity measure that all provided
 * implementations extend.
 * @author Mark A. Greenwood
 */
public abstract class SimilarityMeasure
{
    /**
     * A mapping of terms to specific synsets. Usually used to map domain
     * terms to a restricted set of synsets but can also be used to map
     * named entity tags to appropriate synsets.
     */
    private Map<String,Set<Synset>> domainMappings = new HashMap<String,Set<Synset>>();

    /**
     * The maximum size the cache can grow to
     */
    private int cacheSize = 5000;

    /**
     * To speed up computation of the similarity between two synsets
     * we cache each similarity that is computed so we only have to
     * do each one once.
     */
    private Map<String,Double> cache = new LinkedHashMap<String,Double>(16,0.75f,true)
    {
        public boolean removeEldestEntry(Map.Entry<String,Double> eldest)
        {
            //if the size is less than zero then the user is asking us
            //not to limit the size of the cache so return false
            if (cacheSize < 0) return false;

            //if the cache has grown bigger than its max size return true
            return size() > cacheSize;
        }
    };

    /**
     * Get a previously computed similarity between two synsets from the cache.
     * @param s1 the first synset between which we are looking for the similarity.
     * @param s2 the other synset between which we are looking for the similarity.
     * @return The similarity between the two sets or null
     *         if it is not in the cache.
     */
    protected final Double getFromCache(Synset s1, Synset s2)
    {
        return cache.get(s1.getKey()+"-"+s2.getKey());
    }

    /**
     * Add a computed similarity between two synsets to the cache so that
     * we don't have to compute it if it is needed in the future.
     * @param s1 one of the synsets between which we are storing a similarity.
     * @param s2 the other synset between which we are storing a similarity.
     * @param sim the similarity between the two supplied synsets.
     * @return the similarity score just added to the cache.
     */
    protected final double addToCache(Synset s1, Synset s2, double sim)
    {
        cache.put(s1.getKey()+"-"+s2.getKey(),sim);
        return sim;
    }

    /**
     * Configures the similarity measure using the supplied parameters.
     * @param params a set of key-value pairs that are used to configure
     *        the similarity measure. See concrete implementations for details
     *        of expected/possible parameters.
     * @throws Exception if an error occurs while configuring the similarity measure.
     */
    protected abstract void config(Map<String,String> params) throws Exception;

    /**
     * Create a new instance of a similarity measure.
     * @param confURL the URL of a configuration file. Parameters are specified
     *        one per line as key:value pairs.
     * @return a new instance of a similarity measure as defined by the
     *         supplied configuration URL.
     * @throws Exception if an error occurs while creating the similarity measure.
     */
    public static SimilarityMeasure newInstance(URL confURL) throws Exception
    {
        //create map to hold the key-value pairs we are going to read from
        //the configuration file
        Map<String,String> params = new HashMap<String,String>();

        //create a reader for the config file
        BufferedReader in = null;
        try
        {
            //open the config file
            in = new BufferedReader(new InputStreamReader(confURL.openStream()));

            String line = in.readLine();
            while (line != null)
            {
                line = line.trim();
                if (!line.equals(""))
                {
                    //if the line contains something then
                    //split the data so we get the key and value
                    String[] data = line.split("\\s*:\\s*",2);
                    if (data.length == 2)
                    {
                        //if the line is valid add the two parts to the map
                        params.put(data[0], data[1]);
                    }
                    else
                    {
                        //if the line isn't valid tell the user but continue on
                        //with the rest of the file
                        System.out.println("Config Line is Malformed: " + line);
                    }
                }

                //get the next line ready to process
                line = in.readLine();
            }
        }
        finally
        {
            //close the config file if it got opened
            if (in != null) in.close();
        }

        //create and return a new instance of the similarity measure specified
        //by the config file
        return newInstance(params);
    }

    /**
     * Creates a new instance of a similarity measure using the supplied parameters.
     * @param params a set of key-value pairs which define the similarity measure.
     * @return the newly created similarity measure.
     * @throws Exception if an error occurs while creating the similarity measure.
     */
    public static SimilarityMeasure newInstance(Map<String,String> params) throws Exception
    {
        //get the class name of the implementation we need to load
        String name = params.remove("simType");

        //if the name hasn't been specified then throw an exception
        if (name == null) throw new Exception("Must specify the similarity measure to use");

        //Get hold of the class we need to load
        @SuppressWarnings("unchecked") Class<SimilarityMeasure> c = (Class<SimilarityMeasure>)Class.forName(name);

        //create a new instance of the similarity measure
        SimilarityMeasure sim = c.newInstance();

        //get the cache parameter from the config params
        String cSize = params.remove("cache");

        //if a cache size was specified then set it
        if (cSize != null) sim.cacheSize = Integer.parseInt(cSize);

        //get the url of the domain mapping file
        String mapURL = params.remove("mapping");
        if (mapURL != null)
        {
            //if a mapping file has been provided then
            //open a reader over the file
            BufferedReader in = new BufferedReader(new InputStreamReader((new URL(mapURL)).openStream()));

            //get the first line ready for processing
            String line = in.readLine();
            while (line != null)
            {
                if (!line.startsWith("#"))
                {
                    //if the line isn't a comment (i.e. it doesn't start with #) then...
                    //split the line at the white space
                    String[] data = line.trim().split("\\s+");

                    //create a new set to hold the mapped synsets
                    Set<Synset> mappedTo = new HashSet<Synset>();
                    for (int i = 1 ; i < data.length ; ++i)
                    {
                        //for each synset mapped to get the actual Synsets
                        //and store them in the set
                        mappedTo.addAll(sim.getSynsets(data[i]));
                    }

                    //if we have found some actual synsets then
                    //store them in the domain mappings
                    if (mappedTo.size() > 0) sim.domainMappings.put(data[0], mappedTo);
                }

                //get the next line from the file
                line = in.readLine();
            }

            //we have finished with the mappings file so close it
            in.close();
        }

        //make sure it is configured properly
        sim.config(params);

        //then return it
        return sim;
    }

    /**
     * This is the method responsible for computing the similarity between two
     * specific synsets. The method is implemented differently for each
     * similarity measure so see the subclasses for detailed information.
     * @param s1 one of the synsets between which we want to know the similarity.
     * @param s2 the other synset between which we want to know the similarity.
     * @return the similarity between the two synsets.
     * @throws JWNLException if an error occurs accessing WordNet.
     */
    public abstract double getSimilarity(Synset s1, Synset s2) throws JWNLException;

    /**
     * Get the similarity between two words. The words can be specified either
     * as just the word or in an encoded form including the POS tag and possibly
     * the sense number, i.e. cat#n#1 would specify the 1st sense of the noun cat.
     * @param w1 one of the words to compute similarity between.
     * @param w2 the other word to compute similarity between.
     * @return a SimilarityInfo instance detailing the similarity between the
     *         two words specified.
     * @throws JWNLException if an error occurs accessing WordNet.
     */
    public final SimilarityInfo getSimilarity(String w1, String w2) throws JWNLException
    {
        //Get the (possibly) multiple synsets associated with each word
        Set<Synset> ss1 = getSynsets(w1);
        Set<Synset> ss2 = getSynsets(w2);

        //no pair has been scored yet; this stays null if either word
        //has no synsets
        SimilarityInfo sim = null;
        for (Synset s1 : ss1)
        {
            for (Synset s2 : ss2)
            {
                //for each pair of synsets get the similarity
                double score = getSimilarity(s1, s2);
                if (sim == null || score > sim.getSimilarity())
                {
                    //if the similarity is better than we have seen before
                    //then create and store an info object describing the
                    //similarity between the two synsets
                    sim = new SimilarityInfo(w1, s1, w2, s2, score);
                }
            }
        }

        //return the maximum similarity we have found
        return sim;
    }

    /**
     * Finds all the synsets associated with a specific word.
     * @param word the word we are interested in. Note that this may be encoded
     *        to include information on POS tag and sense index.
     * @return a set of synsets that are associated with the supplied word
     * @throws JWNLException if an error occurs accessing WordNet
     */
    private final Set<Synset> getSynsets(String word) throws JWNLException
    {
        //get a handle on the WordNet dictionary
        Dictionary dict = Dictionary.getInstance();

        //create an empty set to hold any synsets we find
        Set<Synset> synsets = new HashSet<Synset>();

        //split the word on the # characters so we can get at the
        //up to three components that could be present: word, POS tag, sense index
        String[] data = word.split("#");

        //if the word is in the domainMappings then simply return the mappings
        if (domainMappings.containsKey(data[0])) return domainMappings.get(data[0]);

        if (data.length == 1)
        {
            //if there is just the word
            for (IndexWord iw : dict.lookupAllIndexWords(data[0]).getIndexWordArray())
            {
                //for each matching word in WordNet add all its senses to
                //the set we are building up
                synsets.addAll(Arrays.asList(iw.getSenses()));
            }

            //we have finished so return the synsets we found
            return synsets;
        }

        //the calling method specified a POS tag as well so get that
        POS pos = POS.getPOSForKey(data[1]);

        //if the POS tag isn't valid throw an exception
        if (pos == null) throw new JWNLException("Invalid POS Tag: " + data[1]);

        //get the word with the specified POS tag from WordNet
        IndexWord iw = dict.getIndexWord(pos, data[0]);

        //guard against the word not being found with this POS tag, in
        //which case there are no synsets to return
        if (iw == null) return synsets;

        if (data.length > 2)
        {
            //if the calling method specified a sense index then
            //add just that synset to the set we are creating
            synsets.add(iw.getSense(Integer.parseInt(data[2])));
        }
        else
        {
            //no sense index was specified so add all the senses of
            //the word to the set we are creating
            synsets.addAll(Arrays.asList(iw.getSenses()));
        }

        //return the set of synsets we found for the specified word
        return synsets;
    }
}
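The cache in SimilarityMeasure relies on a standard Java idiom: a LinkedHashMap constructed in access order whose removeEldestEntry hook evicts the least recently used entry once a size limit is exceeded. Here is a standalone sketch of that idiom; the class name and the pairKey helper are hypothetical and not part of the library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical standalone LRU cache built on LinkedHashMap's access order. */
final class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    /** Maximum number of entries; a negative value means "unbounded". */
    private final int maxSize;

    LruCacheSketch(int maxSize) {
        // third argument true = iterate in access order, not insertion order
        super(16, 0.75f, true);
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // called after every put; returning true drops the eldest entry
        if (maxSize < 0) {
            return false;
        }
        return size() > maxSize;
    }

    /**
     * Hypothetical order-independent key for a pair of numeric ids, so that
     * (a, b) and (b, a) share one cache slot.
     */
    static String pairKey(long a, long b) {
        return Math.min(a, b) + "-" + Math.max(a, b);
    }

    public static void main(String[] args) {
        LruCacheSketch<String, Double> cache = new LruCacheSketch<>(2);
        cache.put(pairKey(1, 2), 0.5);
        cache.put(pairKey(3, 4), 0.1);
        cache.get(pairKey(2, 1));        // touch "1-2" so it becomes most recent
        cache.put(pairKey(5, 6), 0.9);   // evicts "3-4", the least recently used
        System.out.println(cache.keySet());
    }
}
```

One design point worth noting: the library builds its key as s1.getKey()+"-"+s2.getKey(), so the pairs (s1, s2) and (s2, s1) are cached separately. For symmetric measures, an order-independent key like the one sketched here would roughly halve the number of entries.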





/************************************************************************
 *         Copyright (C) 2006-2007 The University of Sheffield          *
 *      Developed by Mark A. Greenwood <m.greenwood@dcs.shef.ac.uk>     *
 *                                                                      *
 * This program is free software; you can redistribute it and/or modify *
 * it under the terms of the GNU General Public License as published by *
 * the Free Software Foundation; either version 2 of the License, or    *
 * (at your option) any later version.                                  *
 *                                                                      *
 * This program is distributed in the hope that it will be useful,      *
 * but WITHOUT ANY WARRANTY; without even the implied warranty of       *
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the        *
 * GNU General Public License for more details.                         *
 *                                                                      *
 * You should have received a copy of the GNU General Public License    *
 * along with this program; if not, write to the Free Software          *
 * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.            *
 ************************************************************************/

package shef.nlp.wordnet.similarity;

import net.didion.jwnl.JWNLException;
import net.didion.jwnl.data.Synset;

/**
 * An implementation of the WordNet similarity measure developed by Jiang and
 * Conrath. For full details of the measure see:
 * <blockquote>Jiang J. and Conrath D. 1997. Semantic similarity based on corpus
 * statistics and lexical taxonomy. In Proceedings of International
 * Conference on Research in Computational Linguistics, Taiwan.</blockquote>
 * @author Mark A. Greenwood
 */
public class JCn extends ICMeasure
{
    /**
     * Instances of this similarity measure should be generated using the
     * factory methods of {@link SimilarityMeasure}.
     */
    protected JCn()
    {
        //A protected constructor to force the use of the newInstance method
    }

    @Override public double getSimilarity(Synset s1, Synset s2) throws JWNLException
    {
        //if the POS tags are not the same then return 0 as this measure
        //only works with 2 nouns or 2 verbs.
        if (!s1.getPOS().equals(s2.getPOS())) return 0;

        //see if the similarity is already cached and...
        Double cached = getFromCache(s1, s2);

        //if it is then simply return it
        if (cached != null) return cached.doubleValue();

        //Get the Information Content (IC) values for the two supplied synsets
        double ic1 = getIC(s1);
        double ic2 = getIC(s2);

        //if either IC value is zero then cache and return a sim of 0
        if (ic1 == 0 || ic2 == 0) return addToCache(s1,s2,0);

        //Get the Lowest Common Subsumer (LCS) of the two synsets
        Synset lcs = getLCSbyIC(s1,s2);

        //if there isn't an LCS then cache and return a sim of 0
        if (lcs == null) return addToCache(s1,s2,0);

        //get the IC value of the LCS
        double icLCS = getIC(lcs);

        //compute the distance between the two synsets
        //NOTE: This is the original JCN measure
        double distance = ic1 + ic2 - (2 * icLCS);

        //assume the similarity between the synsets is 0
        double sim = 0;

        if (distance == 0)
        {
            //if the distance is 0 (i.e. ic1 + ic2 = 2 * icLCS) then...

            //get the root frequency for this POS tag
            double rootFreq = getFrequency(s1.getPOS());

            if (rootFreq > 0.01)
            {
                //if the root frequency has a value then use it to generate a
                //very large sim value
                sim = 1/-Math.log((rootFreq - 0.01) / rootFreq);
            }
        }
        else
        {
            //this is the normal case so just convert the distance
            //to a similarity by taking the multiplicative inverse
            sim = 1/distance;
        }

        //cache and return the calculated similarity
        return addToCache(s1,s2,sim);
    }
}
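To see the arithmetic of the JCn class in isolation, the sketch below reproduces the same computation on plain numbers: distance = IC(s1) + IC(s2) - 2 * IC(LCS), similarity = 1 / distance, with the zero-distance special case handled as in the listing. The class name and the sample values are hypothetical; in the library the IC values and the root frequency come from ICMeasure and the infocontent file.

```java
/** Hypothetical numeric sketch of the Jiang-Conrath computation shown above. */
final class JcnSketch {

    /**
     * @param ic1      information content of the first synset
     * @param ic2      information content of the second synset
     * @param icLcs    information content of their lowest common subsumer
     * @param rootFreq frequency of the taxonomy root, used only when the
     *                 distance is exactly zero
     */
    static double jcn(double ic1, double ic2, double icLcs, double rootFreq) {
        // a synset with zero IC carries no information, so similarity is 0
        if (ic1 == 0 || ic2 == 0) {
            return 0;
        }
        double distance = ic1 + ic2 - 2 * icLcs;
        if (distance == 0) {
            // avoid dividing by zero: substitute the smallest distance the
            // corpus could express, which yields a very large similarity
            if (rootFreq > 0.01) {
                return 1 / -Math.log((rootFreq - 0.01) / rootFreq);
            }
            return 0;
        }
        return 1 / distance;
    }

    public static void main(String[] args) {
        // made-up IC values: distance = 5 + 7 - 2 * 4 = 4
        System.out.println(jcn(5, 7, 4, 1000));
        // identical synsets: distance 0, so the large fallback value is used
        System.out.println(jcn(5, 5, 5, 1000));
    }
}
```

Because the score is an inverse distance, it is unbounded above, so JCn values are best compared with each other rather than read as a 0 to 1 scale.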


package shef.nlp.wordnet.similarity;

import net.didion.jwnl.JWNLException;
import net.didion.jwnl.data.Synset;

/**
 * An implementation of the WordNet similarity measure developed by Lin. For
 * full details of the measure see:
 * <blockquote>Lin D. 1998. An information-theoretic definition of similarity. In
 * Proceedings of the 15th International Conference on Machine
 * Learning, Madison, WI.</blockquote>
 * @author Mark A. Greenwood
 */
public class Lin extends ICMeasure
{
    /**
     * Instances of this similarity measure should be generated using the
     * factory methods of {@link SimilarityMeasure}.
     */
    protected Lin()
    {
        //A protected constructor to force the use of the newInstance method
    }

    @Override public double getSimilarity(Synset s1, Synset s2) throws JWNLException
    {
        //if the POS tags are not the same then return 0 as this measure
        //only works with 2 nouns or 2 verbs.
        if (!s1.getPOS().equals(s2.getPOS())) return 0;

        //see if the similarity is already cached and...
        Double cached = getFromCache(s1, s2);

        //if it is then simply return it
        if (cached != null) return cached.doubleValue();

        //Get the Information Content (IC) values for the two supplied synsets
        double ic1 = getIC(s1);
        double ic2 = getIC(s2);

        //if either IC value is zero then cache and return a sim of 0
        if (ic1 == 0 || ic2 == 0) return addToCache(s1,s2,0);

        //Get the Lowest Common Subsumer (LCS) of the two synsets
        Synset lcs = getLCSbyIC(s1,s2);

        //if there isn't an LCS then cache and return a sim of 0
        if (lcs == null) return addToCache(s1,s2,0);

        //get the IC value of the LCS
        double icLCS = getIC(lcs);

        //calculate the similarity score
        double sim = (2*icLCS)/(ic1+ic2);

        //cache and return the calculated similarity
        return addToCache(s1,s2,sim);
    }
}
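The Lin score from the listing above, 2 * IC(LCS) / (IC(s1) + IC(s2)), can also be tried on plain numbers. The sketch below additionally assumes the usual definition IC = -ln(freq / rootFreq); ICMeasure is not shown in this post, so that part is an assumption, and the class name and sample values are hypothetical.

```java
/** Hypothetical numeric sketch of the Lin computation shown above. */
final class LinSketch {

    /**
     * Information content from corpus counts, assuming the common
     * definition IC = -ln(freq / rootFreq).
     */
    static double ic(double freq, double rootFreq) {
        if (freq <= 0 || rootFreq <= 0 || freq > rootFreq) {
            throw new IllegalArgumentException("need 0 < freq <= rootFreq");
        }
        return -Math.log(freq / rootFreq);
    }

    /** Lin similarity: twice the shared information over the total information. */
    static double lin(double ic1, double ic2, double icLcs) {
        // mirrors the listing: zero IC on either side means similarity 0
        if (ic1 == 0 || ic2 == 0) {
            return 0;
        }
        return (2 * icLcs) / (ic1 + ic2);
    }

    public static void main(String[] args) {
        // made-up counts: two specific concepts under a more general subsumer
        double root = 1_000_000;
        double ic1 = ic(50, root);
        double ic2 = ic(200, root);
        double icLcs = ic(5_000, root);
        System.out.println(lin(ic1, ic2, icLcs));
    }
}
```

Unlike JCn, the Lin score is a ratio: it equals 1 when both synsets coincide with their subsumer and falls toward 0 as the subsumer becomes more general.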



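Reference: Yan Wei and Xun Endong, "English Word Similarity Computation Based on WordNet" (《基于WordNet的英语词语相似度计算》), Institute of Language Information Processing, Beijing Language and Culture University.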


