
WordNet-Based Similarity Evaluation of English Synonyms and Near-Synonyms, with Code


Source code: https://github.com/XBWer/WordSimilarity

 

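1. The problem and why it matters

When classifying code snippets, we can exploit the fact that programmers tend to follow conventions when naming variables. Within the code for one piece of business logic, several variable names may be related or similar. For example, a "trade" class may also contain "business", "transaction", and "deal": different words that, in some contexts, express the same meaning. To classify snippets more reliably we have to recognize these synonyms, so we need a method that keeps synonyms from causing misclassification. Put simply, we have to measure the similarity between words (to decide whether they are near-synonyms) and find the group of meanings that occurs most often in a snippet.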

 

2. Desired result

Given a code snippet, identify which words are synonyms of one another and classify the snippet accordingly.

E.g.

    public static void function() {
        String trade = "money";
        int deal = 5;
        long business = 0xfffffff;
        boolean transaction = true;
        ……
    }

Output: the synonyms are trade, deal, business, transaction

This code is probably related to trade
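As a rough illustration of this target behaviour, the sketch below groups identifiers whose similarity to a group's first member reaches a threshold, then reports the largest group. Everything in it is made up for illustration: the class name `SynonymGroups`, the hard-coded score table, and the 0.5 threshold. In the real system the scores would come from a WordNet-based measure instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SynonymGroups {
    // Hypothetical pairwise scores; a real system would query WordNet instead.
    static final Map<String, Double> SCORES = new HashMap<>();
    static {
        put("trade", "deal", 0.9);
        put("trade", "business", 0.8);
        put("trade", "transaction", 0.7);
        put("index", "counter", 0.6);
    }

    static void put(String a, String b, double score) {
        SCORES.put(key(a, b), score);
    }

    // Order-independent key so that (a, b) and (b, a) share one entry.
    static String key(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
    }

    static double similarity(String a, String b) {
        if (a.equals(b)) return 1.0;
        return SCORES.getOrDefault(key(a, b), 0.0);
    }

    // Greedy grouping: a word joins the first group whose first member
    // is at least `threshold` similar to it, otherwise it starts a new group.
    static List<List<String>> group(List<String> words, double threshold) {
        List<List<String>> groups = new ArrayList<>();
        for (String w : words) {
            List<String> home = null;
            for (List<String> g : groups) {
                if (similarity(g.get(0), w) >= threshold) {
                    home = g;
                    break;
                }
            }
            if (home == null) {
                home = new ArrayList<>();
                groups.add(home);
            }
            home.add(w);
        }
        return groups;
    }

    // Returns the group with the most members (the first one on ties).
    static List<String> largest(List<List<String>> groups) {
        List<String> best = new ArrayList<>();
        for (List<String> g : groups) {
            if (g.size() > best.size()) best = g;
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "trade", "index", "deal", "business", "counter", "transaction");
        List<String> top = largest(group(names, 0.5));
        System.out.println("Synonyms: " + top);
        System.out.println("This code is probably related to " + top.get(0));
    }
}
```

Greedy grouping against the first member is only the simplest possible choice; it depends on the order of the input, so a real classifier would likely need proper clustering.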

 

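3. A first look at WordNet

With the problem defined, a web search turned up two relevant terms: WordNet and word2vec. (In hindsight, that search was itself an exercise in finding near-synonyms.)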

 

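3.1 What is WordNet

Let's start with WordNet. Here is an introduction found online:

WordNet is an English dictionary built and maintained by the Cognitive Science Laboratory at Princeton University under the direction of psychology professor George A. Miller. Development began in 1985, and the project has since received more than US$3 million in funding (mainly from government agencies interested in machine translation).

Because it contains semantic information, it differs from a dictionary in the usual sense. WordNet groups entries by meaning; each group of entries that share a meaning is called a synset (synonym set). WordNet provides a short, general definition for each synset and records the semantic relations between synsets.

WordNet was developed with two goals:

To be both a dictionary and a thesaurus, and easier to use than either one alone.

To support automatic text analysis and artificial intelligence applications.

Internal structure of WordNet

In WordNet, nouns, verbs, adjectives, and adverbs are each organized into a network of synonyms. Each synset represents one basic semantic concept, and the synsets are linked by various relations. (A polysemous word appears in the synset of each of its meanings.) In the first version of WordNet (labeled 1.x), the networks for the four parts of speech were not connected to one another. The noun network was the first to be developed.

The backbone of the noun network is the subsumption hierarchy (hypernym/hyponym relations), which accounts for nearly 80% of the relations. At the top of the hierarchy are 11 abstract concepts called unique beginners, for example entity ("a concrete existence, living or nonliving") and psychological feature ("a mental feature of a living organism"). The noun hierarchy is at most 16 nodes deep.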

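(Wikipedia)

In plain terms, WordNet is a well-structured knowledge base: besides ordinary dictionary functions, it also carries classification information for words. WordNet-based methods are by now fairly mature, for example the path-based method (lch) and the information-theoretic method (res). (See the references for details.)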

 

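3.2 Installing and configuring WordNet

With WordNet we have the word database we need. So, setting the similarity computation aside for the moment, let's download WordNet first.

See https://wordnet.princeton.edu/download. Downloading, installing, and running the demo all went smoothly.

Next, let's look at WordNet's file structure: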

 

The bin directory contains the executable WordNet 2.1.exe:

 

  

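As you can see, WordNet classifies English words and arranges them in a semantic tree. In this example the chain is entity -> abstract entity -> abstraction -> attribute -> state -> feeling -> emotion -> love;

it is traversed from the leaf node to the root node.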
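To make the leaf-to-root walk concrete, here is a minimal sketch that stores the chain above as a child-to-parent map and follows it upwards. The class name `HypernymPath` is hypothetical, and the extra "hate" entry is an illustrative addition (not taken from the text) so that a lowest common subsumer can be shown. Real WordNet synsets can have several hypernyms, which this toy tree ignores.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HypernymPath {
    // child -> parent (hypernym). The "love" chain comes from the example above;
    // the "hate" entry is an illustrative addition.
    static final Map<String, String> PARENT = new HashMap<>();
    static {
        PARENT.put("love", "emotion");
        PARENT.put("hate", "emotion");
        PARENT.put("emotion", "feeling");
        PARENT.put("feeling", "state");
        PARENT.put("state", "attribute");
        PARENT.put("attribute", "abstraction");
        PARENT.put("abstraction", "abstract entity");
        PARENT.put("abstract entity", "entity");
    }

    // Walks from a node up to the root and returns every node visited.
    static List<String> pathToRoot(String node) {
        List<String> path = new ArrayList<>();
        for (String cur = node; cur != null; cur = PARENT.get(cur)) {
            path.add(cur);
        }
        return path;
    }

    // First ancestor of a (a itself included) that also lies on b's path to the root.
    static String lowestCommonSubsumer(String a, String b) {
        List<String> pathB = pathToRoot(b);
        for (String cur : pathToRoot(a)) {
            if (pathB.contains(cur)) return cur;
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" -> ", pathToRoot("love")));
        System.out.println(lowestCommonSubsumer("love", "hate"));
    }
}
```

The lowest common subsumer computed here is the same notion that the information-content measures later in this post (JCn and Lin) rely on.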

The 25 basic classes in WordNet's noun classification:

 

The dict directory holds the database itself; as you can see, it is organized by adjective, adverb, noun, and verb:

 

doc contains the WordNet user manual

lib contains the libraries WordNet uses to access Windows resources

src contains the source code

 

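4. General approach

We start from WordNet's lexical-semantic classification, extract the synonyms from it, and then compute similarity with a vector-space method. The workflow is as follows: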

 

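5. WordNet-based similarity computation

The following is excerpted from the paper "English Word Similarity Computation Based on WordNet" (《基于WordNet的英语词语相似度计算》):

5.1 Feature extraction

 

5.2 Computing sense similarity and word similarity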
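As a rough illustration of a vector-space similarity between the feature vectors of two senses, the sketch below uses cosine similarity. This is an assumption: the paper's exact formula is not quoted here, and the feature vectors and the class name `CosineSketch` are made up for the example.

```java
final class CosineSketch {
    // Cosine of the angle between two equal-length feature vectors.
    // Returns 0 when either vector is all zeros, to avoid dividing by zero.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hypothetical feature weights for two senses over the same three features.
        double[] sense1 = {1.0, 0.5, 0.0};
        double[] sense2 = {0.8, 0.4, 0.1};
        System.out.println(cosine(sense1, sense2));
    }
}
```

Whatever the sense-level formula is, the word-level score in the code later in this post is the maximum over all pairs of senses.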

  

 

  

 

 

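6. Results

   

Similarity to "trade":

Analysis:

 

First pair: trade vs trade

 

A word compared with itself is of course 100% similar.

 

Second pair: trade#n#5 vs deal#n#1

 

Surprisingly, the similarity is the same as for the first pair! According to the result, the 5th sense of trade as a noun and the 1st sense of deal as a noun are exactly alike. Let's check the database: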

  trade#n#5:

deal#n#1:

Now a pair that is harder to understand:

trade#n#7 vs sunshine#n#2

 

Their similarity exceeds 0.14, which is fairly high. Why?

 

  trade#n#7:

sunshine#n#2:

I'm sure you can now see why.

 

Similarity to "cat":

7. Code walkthrough

Project structure:

                

Test.java

    package JWordNetSim.test;

    import java.io.FileInputStream;
    import java.util.HashMap;
    import java.util.Map;

    import net.didion.jwnl.JWNL;
    import net.didion.jwnl.data.IndexWord;
    import net.didion.jwnl.data.POS;
    import net.didion.jwnl.dictionary.Dictionary;
    import shef.nlp.wordnet.similarity.SimilarityMeasure;

    /**
     * A simple test of this WordNet similarity library.
     * @author Mark A. Greenwood
     */
    public class Test
    {
        public static void main(String[] args) throws Exception
        {
            //WordNet 2.0 must be installed locally before running this code;
            //only 2.0 works, installing 2.1 causes errors
            JWNL.initialize(new FileInputStream("D:\\JAVAProjectWorkSpace\\jwnl\\JWordNetSim\\test\\wordnet.xml"));

            //create a map to hold the configuration parameters
            Map<String,String> params = new HashMap<String,String>();

            //the simType parameter is the class name of the measure to use
            params.put("simType","shef.nlp.wordnet.similarity.JCn");

            //this param should be the URL to an infocontent file (if required
            //by the similarity measure being loaded)
            params.put("infocontent","file:D:\\JAVAProjectWorkSpace\\jwnl\\JWordNetSim\\test\\ic-bnc-resnik-add1.dat");

            //this param should be the URL to a mapping file if the
            //user needs to make synset mappings
            params.put("mapping","file:D:\\JAVAProjectWorkSpace\\jwnl\\JWordNetSim\\test\\domain_independent.txt");

            //create the similarity measure
            SimilarityMeasure sim = SimilarityMeasure.newInstance(params);

            //look up the words
            // Dictionary dict = Dictionary.getInstance();
            // IndexWord word1 = dict.getIndexWord(POS.NOUN, "trade"); //trade and dog are treated strictly as nouns here
            // IndexWord word2 = dict.getIndexWord(POS.NOUN,"dog"); //
            //
            // //and get the similarity between the first senses of each word
            // System.out.println(word1.getLemma()+"#"+word1.getPOS().getKey()+"#1 " + word2.getLemma()+"#"+word2.getPOS().getKey()+"#1 " + sim.getSimilarity(word1.getSense(1), word2.getSense(1)));

            // //get similarity using the string methods (note this also makes use
            // //of the fake root node)
            // System.out.println(sim.getSimilarity("trade#n","deal#n"));

            //get a similarity that involves a mapping
            System.out.println(sim.getSimilarity("trade", "trade"));
            System.out.println(sim.getSimilarity("trade", "deal"));
            System.out.println(sim.getSimilarity("trade", "commerce"));
            System.out.println(sim.getSimilarity("trade", "transaction"));
            System.out.println(sim.getSimilarity("trade", "finance"));
            System.out.println(sim.getSimilarity("trade", "financial"));
            System.out.println(sim.getSimilarity("trade", "business"));
            System.out.println(sim.getSimilarity("trade", "economy"));
            System.out.println(sim.getSimilarity("trade", "school"));
            System.out.println(sim.getSimilarity("trade", "dog"));
            System.out.println(sim.getSimilarity("trade", "cat"));
            System.out.println(sim.getSimilarity("trade", "book"));
            System.out.println(sim.getSimilarity("trade", "sunshine"));
            System.out.println(sim.getSimilarity("trade", "smile"));
            System.out.println(sim.getSimilarity("trade", "nice"));
            System.out.println(sim.getSimilarity("trade", "hardly"));
            System.out.println(sim.getSimilarity("trade", "beautiful"));
        }
    }

SimilarityMeasure.java

  1. package shef.nlp.wordnet.similarity;
  2. import java.io.BufferedReader;
  3. import java.io.InputStreamReader;
  4. import java.net.URL;
  5. import java.util.Arrays;
  6. import java.util.HashMap;
  7. import java.util.HashSet;
  8. import java.util.LinkedHashMap;
  9. import java.util.Map;
  10. import java.util.Set;
  11. import net.didion.jwnl.JWNLException;
  12. import net.didion.jwnl.data.IndexWord;
  13. import net.didion.jwnl.data.POS;
  14. import net.didion.jwnl.data.Synset;
  15. import net.didion.jwnl.dictionary.Dictionary;
  16. /**
  17. * An abstract notion of a similarity measure that all provided
  18. * implementations extend.
  19. * @author Mark A. Greenwood
  20. */
  21. public abstract class SimilarityMeasure
  22. {
  23. /**
  24. * A mapping of terms to specific synsets. Usually used to map domain
  25. * terms to a restricted set of synsets but can also be used to map
  26. * named entity tags to appropriate synsets.
  27. */
  28. private Map<String,Set<Synset>> domainMappings = new HashMap<String,Set<Synset>>();
  29. /**
  30. * The maximum size the cache can grow to
  31. */
  32. private int cacheSize = 5000;
  33. /**
  34. * To speed up computation of the similarity between two synsets
  35. * we cache each similarity that is computed so we only have to
  36. * do each one once.
  37. */
  38. private Map<String,Double> cache = new LinkedHashMap<String,Double>(16,0.75f,true)
  39. {
  40. public boolean removeEldestEntry(Map.Entry<String,Double> eldest)
  41. {
  42. //if the size is less than zero then the user is asking us
  43. //not to limit the size of the cache so return false
  44. if (cacheSize < 0) return false;
  45. //if the cache has grown bigger than its max size return true
  46. return size() > cacheSize;
  47. }
  48. };
  49. /**
  50. * Get a previously computed similarity between two synsets from the cache.
  51. * @param s1 the first synset between which we are looking for the similarity.
  52. * @param s2 the other synset between which we are looking for the similarity.
  53. * @return The similarity between the two sets or null
  54. * if it is not in the cache.
  55. */
  56. protected final Double getFromCache(Synset s1, Synset s2)
  57. {
  58. return cache.get(s1.getKey()+"-"+s2.getKey());
  59. }
  60. /**
  61. * Add a computed similarity between two synsets to the cache so that
  62. * we don't have to compute it if it is needed in the future.
  63. * @param s1 one of the synsets between which we are storing a similarity.
  64. * @param s2 the other synset between which we are storing a similarity.
  65. * @param sim the similarity between the two supplied synsets.
  66. * @return the similarity score just added to the cache.
  67. */
  68. protected final double addToCache(Synset s1, Synset s2, double sim)
  69. {
  70. cache.put(s1.getKey()+"-"+s2.getKey(),sim);
  71. return sim;
  72. }
  73. /**
  74. * Configures the similarity measure using the supplied parameters.
  75. * @param params a set of key-value pairs that are used to configure
  76. * the similarity measure. See concrete implementations for details
  77. * of expected/possible parameters.
  78. * @throws Exception if an error occurs while configuring the similarity measure.
  79. */
  80. protected abstract void config(Map<String,String> params) throws Exception;
  81. /**
  82. * Create a new instance of a similarity measure.
  83. * @param confURL the URL of a configuration file. Parameters are specified
  84. * one per line as key:value pairs.
  85. * @return a new instance of a similarity measure as defined by the
  86. * supplied configuration URL.
  87. * @throws Exception if an error occurs while creating the similarity measure.
  88. */
  89. public static SimilarityMeasure newInstance(URL confURL) throws Exception
  90. {
  91. //create map to hold the key-value pairs we are going to read from
  92. //the configuration file
  93. Map<String,String> params = new HashMap<String,String>();
  94. //create a reader for the config file
  95. BufferedReader in = null;
  96. try
  97. {
  98. //open the config file
  99. in = new BufferedReader(new InputStreamReader(confURL.openStream()));
  100. String line = in.readLine();
  101. while (line != null)
  102. {
  103. line = line.trim();
  104. if (!line.equals(""))
  105. {
  106. //if the line contains something then
  107. //split the data so we get the key and value
  108. String[] data = line.split("\\s*:\\s*",2);
  109. if (data.length == 2)
  110. {
  111. //if the line is valid add the two parts to the map
  112. params.put(data[0], data[1]);
  113. }
  114. else
  115. {
  116. //if the line isn't valid tell the user but continue on
  117. //with the rest of the file
  118. System.out.println("Config Line is Malformed: " + line);
  119. }
  120. }
  121. //get the next line ready to process
  122. line = in.readLine();
  123. }
  124. }
  125. finally
  126. {
  127. //close the config file if it got opened
  128. if (in != null) in.close();
  129. }
  130. //create and return a new instance of the similarity measure specified
  131. //by the config file
  132. return newInstance(params);
  133. }
  134. /**
  135. * Creates a new instance of a similarity measure using the supplied parameters.
  136. * @param params a set of key-value pairs which define the similarity measure.
  137. * @return the newly created similarity measure.
  138. * @throws Exception if an error occurs while creating the similarity measure.
  139. */
  140. public static SimilarityMeasure newInstance(Map<String,String> params) throws Exception
  141. {
  142. //get the class name of the implementation we need to load
  143. String name = params.remove("simType");
  144. //if the name hasn't been specified then throw an exception
  145. if (name == null) throw new Exception("Must specifiy the similarity measure to use");
  146. //Get hold of the class we need to load
  147. @SuppressWarnings("unchecked") Class<SimilarityMeasure> c = (Class<SimilarityMeasure>)Class.forName(name);
  148. //create a new instance of the similarity measure
  149. SimilarityMeasure sim = c.newInstance();
  150. //get the cache parameter from the config params
  151. String cSize = params.remove("cache");
  152. //if a cache size was specified then set it
  153. if (cSize != null) sim.cacheSize = Integer.parseInt(cSize);
  154. //get the url of the domain mapping file
  155. String mapURL = params.remove("mapping");
  156. if (mapURL != null)
  157. {
  158. //if a mapping file has been provided then
  159. //open a reader over the file
  160. BufferedReader in = new BufferedReader(new InputStreamReader((new URL(mapURL)).openStream()));
  161. //get the first line ready for processing
  162. String line = in.readLine();
  163. while (line != null)
  164. {
  165. if (!line.startsWith("#"))
  166. {
  167. //if the line isn't a comment (i.e. it doesn't start with #) then...
  168. //split the line at the white space
  169. String[] data = line.trim().split("\\s+");
  170. //create a new set to hold the mapped synsets
  171. Set<Synset> mappedTo = new HashSet<Synset>();
  172. for (int i = 1 ; i < data.length ; ++i)
  173. {
  174. //for each synset mapped to get the actual Synsets
  175. //and store them in the set
  176. mappedTo.addAll(sim.getSynsets(data[i]));
  177. }
  178. //if we have found some actual synsets then
  179. //store them in the domain mappings
  180. if (mappedTo.size() > 0) sim.domainMappings.put(data[0], mappedTo);
  181. }
  182. //get the next line from the file
  183. line = in.readLine();
  184. }
  185. //we have finished with the mappings file so close it
  186. in.close();
  187. }
  188. //make sure it is configured properly
  189. sim.config(params);
  190. //then return it
  191. return sim;
  192. }
  193. /**
  194. * This is the method responsible for computing the similarity between two
  195. * specific synsets. The method is implemented differently for each
  196. * similarity measure so see the subclasses for detailed information.
  197. * @param s1 one of the synsets between which we want to know the similarity.
  198. * @param s2 the other synset between which we want to know the similarity.
  199. * @return the similarity between the two synsets.
  200. * @throws JWNLException if an error occurs accessing WordNet.
  201. */
  202. public abstract double getSimilarity(Synset s1, Synset s2) throws JWNLException;
  203. /**
  204. * Get the similarity between two words. The words can be specified either
  205. * as just the word or in an encoded form including the POS tag and possibly
  206. * the sense number, i.e. cat#n#1 would specify the 1st sense of the noun cat.
  207. * @param w1 one of the words to compute similarity between.
  208. * @param w2 the other word to compute similarity between.
  209. * @return a SimilarityInfo instance detailing the similarity between the
  210. * two words specified.
  211. * @throws JWNLException if an error occurs accessing WordNet.
  212. */
  213. public final SimilarityInfo getSimilarity(String w1, String w2) throws JWNLException
  214. {
  215. //Get the (possibly) multiple synsets associated with each word
  216. Set<Synset> ss1 = getSynsets(w1);
  217. Set<Synset> ss2 = getSynsets(w2);
  218. //assume the words are not at all similar
  219. SimilarityInfo sim = null;
  220. for (Synset s1 : ss1)
  221. {
  222. for (Synset s2 : ss2)
  223. {
  224. //for each pair of synsets get the similarity
  225. double score = getSimilarity(s1, s2);
  226. if (sim == null || score > sim.getSimilarity())
  227. {
  228. //if the similarity is better than we have seen before
  229. //then create and store an info object describing the
  230. //similarity between the two synsets
  231. sim = new SimilarityInfo(w1, s1, w2, s2, score);
  232. }
  233. }
  234. }
  235. //return the maximum similarity we have found
  236. return sim;
  237. }
  238. /**
  239. * Finds all the synsets associated with a specific word.
  240. * @param word the word we are interested. Note that this may be encoded
  241. * to include information on POS tag and sense index.
  242. * @return a set of synsets that are associated with the supplied word
  243. * @throws JWNLException if an error occurs accessing WordNet
  244. */
  245. private final Set<Synset> getSynsets(String word) throws JWNLException
  246. {
  247. //get a handle on the WordNet dictionary
  248. Dictionary dict = Dictionary.getInstance();
  249. //create an empty set to hold any synsets we find
  250. Set<Synset> synsets = new HashSet<Synset>();
  251. //split the word on the # characters so we can get at the
  252. //up to three components that could be present: word, POS tag, sense index
  253. String[] data = word.split("#");
  254. //if the word is in the domainMappings then simply return the mappings
  255. if (domainMappings.containsKey(data[0])) return domainMappings.get(data[0]);
  256. if (data.length == 1)
  257. {
  258. //if there is just the word
  259. for (IndexWord iw : dict.lookupAllIndexWords(data[0]).getIndexWordArray())
  260. {
  261. //for each matching word in WordNet add all its senses to
  262. //the set we are building up
  263. synsets.addAll(Arrays.asList(iw.getSenses()));
  264. }
  265. //we have finished so return the synsets we found
  266. return synsets;
  267. }
  268. //the calling method specified a POS tag as well so get that
  269. POS pos = POS.getPOSForKey(data[1]);
  270. //if the POS tag isn't valid throw an exception
  271. if (pos == null) throw new JWNLException("Invalid POS Tag: " + data[1]);
  272. //get the word with the specified POS tag from WordNet
  273. IndexWord iw = dict.getIndexWord(pos, data[0]);
  274. if (data.length > 2)
  275. {
  276. //if the calling method specified a sense index then
  277. //add just that synset to the set we are creating
  278. synsets.add(iw.getSense(Integer.parseInt(data[2])));
  279. }
  280. else
  281. {
  282. //no sense index was specified so add all the senses of
  283. //the word to the set we are creating
  284. synsets.addAll(Arrays.asList(iw.getSenses()));
  285. }
  286. //return the set of synsets we found for the specified word
  287. return synsets;
  288. }
  289. }

Every method is documented in detail, so the code should be easy to follow.
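One detail worth isolating is the `cache` field: a `LinkedHashMap` created in access order with `removeEldestEntry` overridden behaves as a small LRU cache. The standalone sketch below shows the same idea; the class name `LruCache` and the sample keys are hypothetical and not part of the library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxSize;

    // A negative maxSize means "never evict", mirroring the check in the listing above.
    LruCache(int maxSize) {
        // The third argument (true) selects access order, so get() refreshes an entry.
        super(16, 0.75f, true);
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        if (maxSize < 0) return false;
        // Called after every put: evict the least recently used entry once over the limit.
        return size() > maxSize;
    }

    public static void main(String[] args) {
        LruCache<String, Double> cache = new LruCache<>(2);
        cache.put("s1-s2", 0.10);
        cache.put("s1-s3", 0.20);
        cache.get("s1-s2");        // touch s1-s2 so that s1-s3 becomes the eldest entry
        cache.put("s1-s4", 0.30);  // exceeds the limit, so s1-s3 is evicted
        System.out.println(cache.keySet());
    }
}
```

Note that the library builds its cache key as `s1.getKey()+"-"+s2.getKey()`, so the pair (s1, s2) and the pair (s2, s1) are cached separately.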

The loop at lines 262~277 works as follows:
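One loop in `SimilarityMeasure` that is easier to follow in isolation is the nested iteration in `getSimilarity(String, String)`, which tries every pair of senses and keeps the best-scoring one. Below is a minimal sketch of that pattern, with plain integers standing in for synsets and a toy scoring function. The names `BestSensePair` and `Result` are hypothetical; the real method returns a `SimilarityInfo`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

final class BestSensePair {
    // Holds the best pair found so far together with its score.
    static final class Result<T> {
        final T first;
        final T second;
        final double score;

        Result(T first, T second, double score) {
            this.first = first;
            this.second = second;
            this.score = score;
        }
    }

    // Scores every pair (a, b) and returns the highest-scoring one,
    // or null when either list is empty (no pair to compare).
    static <T> Result<T> best(List<T> senses1, List<T> senses2,
                              ToDoubleBiFunction<T, T> measure) {
        Result<T> best = null;
        for (T a : senses1) {
            for (T b : senses2) {
                double score = measure.applyAsDouble(a, b);
                if (best == null || score > best.score) {
                    best = new Result<>(a, b, score);
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy data: integers stand in for synsets, closeness stands in for similarity.
        List<Integer> senses1 = Arrays.asList(1, 5, 9);
        List<Integer> senses2 = Arrays.asList(4, 20);
        Result<Integer> r = best(senses1, senses2, (a, b) -> 1.0 / (1 + Math.abs(a - b)));
        System.out.println(r.first + " vs " + r.second + " = " + r.score);
    }
}
```

This is why a result such as trade#n#5 vs deal#n#1 names specific sense numbers: the word-level score is simply that of the best sense pair.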

    

JCn.java

    /*
     * Copyright (C) 2006-2007 The University of Sheffield
     * Developed by Mark A. Greenwood <m.greenwood@dcs.shef.ac.uk>
     *
     * This program is free software; you can redistribute it and/or modify
     * it under the terms of the GNU General Public License as published by
     * the Free Software Foundation; either version 2 of the License, or
     * (at your option) any later version.
     *
     * This program is distributed in the hope that it will be useful,
     * but WITHOUT ANY WARRANTY; without even the implied warranty of
     * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
     * GNU General Public License for more details.
     *
     * You should have received a copy of the GNU General Public License
     * along with this program; if not, write to the Free Software
     * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
     */
    package shef.nlp.wordnet.similarity;

    import net.didion.jwnl.JWNLException;
    import net.didion.jwnl.data.Synset;

    /**
     * An implementation of the WordNet similarity measure developed by Jiang and
     * Conrath. For full details of the measure see:
     * <blockquote>Jiang J. and Conrath D. 1997. Semantic similarity based on corpus
     * statistics and lexical taxonomy. In Proceedings of International
     * Conference on Research in Computational Linguistics, Taiwan.</blockquote>
     * @author Mark A. Greenwood
     */
    public class JCn extends ICMeasure
    {
        /**
         * Instances of this similarity measure should be generated using the
         * factory methods of {@link SimilarityMeasure}.
         */
        protected JCn()
        {
            //A protected constructor to force the use of the newInstance method
        }

        @Override public double getSimilarity(Synset s1, Synset s2) throws JWNLException
        {
            //if the POS tags are not the same then return 0 as this measure
            //only works with 2 nouns or 2 verbs.
            if (!s1.getPOS().equals(s2.getPOS())) return 0;

            //see if the similarity is already cached and...
            Double cached = getFromCache(s1, s2);

            //if it is then simply return it
            if (cached != null) return cached.doubleValue();

            //Get the Information Content (IC) values for the two supplied synsets
            double ic1 = getIC(s1);
            double ic2 = getIC(s2);

            //if either IC value is zero then cache and return a sim of 0
            if (ic1 == 0 || ic2 == 0) return addToCache(s1,s2,0);

            //Get the Lowest Common Subsumer (LCS) of the two synsets
            Synset lcs = getLCSbyIC(s1,s2);

            //if there isn't an LCS then cache and return a sim of 0
            if (lcs == null) return addToCache(s1,s2,0);

            //get the IC value of the LCS
            double icLCS = getIC(lcs);

            //compute the distance between the two synsets
            //NOTE: This is the original JCN measure
            double distance = ic1 + ic2 - (2 * icLCS);

            //assume the similarity between the synsets is 0
            double sim = 0;

            if (distance == 0)
            {
                //if the distance is 0 (i.e. ic1 + ic2 = 2 * icLCS) then...
                //get the root frequency for this POS tag
                double rootFreq = getFrequency(s1.getPOS());

                if (rootFreq > 0.01)
                {
                    //if the root frequency has a value then use it to generate a
                    //very large sim value
                    sim = 1/-Math.log((rootFreq - 0.01) / rootFreq);
                }
            }
            else
            {
                //this is the normal case so just convert the distance
                //to a similarity by taking the multiplicative inverse
                sim = 1/distance;
            }

            //cache and return the calculated similarity
            return addToCache(s1,s2,sim);
        }
    }

Lin.java

    package shef.nlp.wordnet.similarity;

    import net.didion.jwnl.JWNLException;
    import net.didion.jwnl.data.Synset;

    /**
     * An implementation of the WordNet similarity measure developed by Lin. For
     * full details of the measure see:
     * <blockquote>Lin D. 1998. An information-theoretic definition of similarity. In
     * Proceedings of the 15th International Conference on Machine
     * Learning, Madison, WI.</blockquote>
     * @author Mark A. Greenwood
     */
    public class Lin extends ICMeasure
    {
        /**
         * Instances of this similarity measure should be generated using the
         * factory methods of {@link SimilarityMeasure}.
         */
        protected Lin()
        {
            //A protected constructor to force the use of the newInstance method
        }

        @Override public double getSimilarity(Synset s1, Synset s2) throws JWNLException
        {
            //if the POS tags are not the same then return 0 as this measure
            //only works with 2 nouns or 2 verbs.
            if (!s1.getPOS().equals(s2.getPOS())) return 0;

            //see if the similarity is already cached and...
            Double cached = getFromCache(s1, s2);

            //if it is then simply return it
            if (cached != null) return cached.doubleValue();

            //Get the Information Content (IC) values for the two supplied synsets
            double ic1 = getIC(s1);
            double ic2 = getIC(s2);

            //if either IC value is zero then cache and return a sim of 0
            if (ic1 == 0 || ic2 == 0) return addToCache(s1,s2,0);

            //Get the Lowest Common Subsumer (LCS) of the two synsets
            Synset lcs = getLCSbyIC(s1,s2);

            //if there isn't an LCS then cache and return a sim of 0
            if (lcs == null) return addToCache(s1,s2,0);

            //get the IC value of the LCS
            double icLCS = getIC(lcs);

            //calculate the similarity score
            double sim = (2*icLCS)/(ic1+ic2);

            //cache and return the calculated similarity
            return addToCache(s1,s2,sim);
        }
    }
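For comparison with JCn, the Lin score is a ratio instead of an inverse distance. A minimal sketch on plain IC values follows; the class name `LinSketch` and the numbers are hypothetical.

```java
final class LinSketch {
    // Same arithmetic as the listing above: twice the IC of the lowest common
    // subsumer, divided by the sum of the ICs of the two synsets.
    static double lin(double ic1, double ic2, double icLcs) {
        if (ic1 == 0 || ic2 == 0) return 0;
        return (2 * icLcs) / (ic1 + ic2);
    }

    public static void main(String[] args) {
        System.out.println(lin(3.0, 5.0, 2.0)); // partly shared information
        System.out.println(lin(4.0, 4.0, 4.0)); // identical concepts
    }
}
```

As long as the subsumer's IC does not exceed that of either synset, the Lin score stays between 0 and 1, which makes it easier to threshold than the unbounded JCn value.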

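References:

Sheng Zhichao, Tao Xiaopeng (School of Computer Science, Fudan University). "Semantic Similarity Computation Based on Wikipedia" (《基于维基百科的语义相似度计算》).

Yan Wei, Xun Endong (Institute of Language Information Processing, Beijing Language and Culture University). "English Word Similarity Computation Based on WordNet" (《基于WordNet的英语词语相似度计算》).

Nouns in WordNet: http://ccl.pku.edu.cn/doubtfire/semantics/wordnet/c-wordnet/nouns-in-wordnet.htm

Comparison of MIT's JWI (Java WordNet Interface) and JWNL (Java WordNet Library):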

http://jxr19830617.blog.163.com/blog/static/163573067201301985219857/

http://jxr19830617.blog.163.com/blog/static/1635730672013019105255295/
