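This example shows how to store text from a file as a string array, sort the words by frequency, plot the result, and collect basic statistics for the words found in the file.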
Read the text of Shakespeare's Sonnets with the fileread function. fileread returns the text as a 1-by-100266 character vector.
- sonnets = fileread('sonnets.txt');
- sonnets(1:35)
-
- ans =
- 'THE SONNETS
-
- by William Shakespeare'
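Convert the text to a string using the string function. Then split it at newline characters using the splitlines function. sonnets becomes a 2625-by-1 string array, where each string contains one line of the poems. Display the first five lines of sonnets.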
- sonnets = string(sonnets);
- sonnets = splitlines(sonnets);
- sonnets(1:5)
-
- ans = 5x1 string array
- "THE SONNETS"
- ""
- "by William Shakespeare"
- ""
- ""
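To calculate the frequency of the words in sonnets, first clean it by removing empty strings and punctuation marks. Then reshape it into a string array whose elements are individual words.
Remove the strings with zero characters ("") from the string array. Compare each element of sonnets to "", the empty string. Starting in R2017a, you can create strings, including empty strings, using double quotes. TF is a logical vector that contains a true value at every position where sonnets contains a string with zero characters. Index into sonnets with TF and delete all strings with zero characters.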
- TF = (sonnets == "");
- sonnets(TF) = [];
- sonnets(1:10)
-
- ans = 10x1 string array
- "THE SONNETS"
- "by William Shakespeare"
- " I"
- " From fairest creatures we desire increase,"
- " That thereby beauty's rose might never die,"
- " But as the riper should by time decease,"
- " His tender heir might bear his memory:"
- " But thou, contracted to thine own bright eyes,"
- " Feed'st thy light's flame with self-substantial fuel,"
- " Making a famine where abundance lies,"
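The same logical-indexing step can be sketched on a small array. This is an illustration only: sampleLines is a made-up stand-in for sonnets, and strlength is used here as an equivalent way to test for zero-length strings, not as the method the example itself uses.

```matlab
% Hypothetical stand-in for the sonnets array (illustration only).
sampleLines = ["THE SONNETS"; ""; "by William Shakespeare"; ""];

% strlength returns the number of characters in each element,
% so a length of 0 marks the strings with no characters.
TF = (strlength(sampleLines) == 0);

% Assigning [] through a logical index deletes the marked elements.
sampleLines(TF) = [];
```

After these lines, sampleLines holds only the two non-empty strings, in their original order.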
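Replace some punctuation marks with space characters. Here, p lists the period, question mark, exclamation mark, comma, semicolon, and colon. Keep apostrophes, because they can be part of some words in the Sonnets, such as light's.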
- p = [".","?","!",",",";",":"];
- sonnets = replace(sonnets,p," ");
- sonnets(1:10)
-
- ans = 10x1 string array
- "THE SONNETS"
- "by William Shakespeare"
- " I"
- " From fairest creatures we desire increase "
- " That thereby beauty's rose might never die "
- " But as the riper should by time decease "
- " His tender heir might bear his memory "
- " But thou contracted to thine own bright eyes "
- " Feed'st thy light's flame with self-substantial fuel "
- " Making a famine where abundance lies "
Strip leading and trailing space characters from each element of sonnets.
- sonnets = strip(sonnets);
- sonnets(1:10)
-
- ans = 10x1 string array
- "THE SONNETS"
- "by William Shakespeare"
- "I"
- "From fairest creatures we desire increase"
- "That thereby beauty's rose might never die"
- "But as the riper should by time decease"
- "His tender heir might bear his memory"
- "But thou contracted to thine own bright eyes"
- "Feed'st thy light's flame with self-substantial fuel"
- "Making a famine where abundance lies"
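Split sonnets into a string array whose elements are individual words. You can use the split function to split the elements of a string array at whitespace characters or at delimiters that you specify. However, split requires that every element of the string array be divisible into the same number of new strings. The elements of sonnets contain different numbers of spaces, so they cannot be divided into equal numbers of strings. To use split on sonnets, write a for-loop that calls split on one element at a time.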
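Create the empty string array sonnetWords using the strings function. Write a for-loop that splits each element of sonnets using the split function. Concatenate the output of split onto sonnetWords. Each element of sonnetWords is an individual word from sonnets.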
- sonnetWords = strings(0);
- for i = 1:length(sonnets)
-     sonnetWords = [sonnetWords ; split(sonnets(i))];
- end
- sonnetWords(1:10)
-
- ans = 10x1 string array
- "THE"
- "SONNETS"
- "by"
- "William"
- "Shakespeare"
- "I"
- "From"
- "fairest"
- "creatures"
- "we"
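As an alternative sketch (not part of the original example), the loop can be avoided by first joining all lines into one string with join and then calling split once. The array sampleLines below is a made-up stand-in for the cleaned sonnets array.

```matlab
% Hypothetical stand-in for the cleaned sonnets array (illustration only).
sampleLines = ["THE SONNETS"; "by William Shakespeare"; "I"];

% join combines the elements of the column vector into one string,
% placing a single space between consecutive elements.
allText = join(sampleLines);

% A scalar string can always be split, so no loop is needed.
% For a string scalar, split returns a column vector of words.
wordsNoLoop = split(allText);
```

Growing an array inside a loop reallocates it on every iteration, so a single split call on the joined text may be preferable for large inputs.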
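Find the unique words in sonnetWords. Count them and sort them by frequency.
To count words that differ only by case as the same word, convert sonnetWords to lowercase. For example, The and the count as the same word. Find the unique words using the unique function. Then count the number of times each unique word occurs using the histcounts function.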
- sonnetWords = lower(sonnetWords);
- [words,~,idx] = unique(sonnetWords);
- numOccurrences = histcounts(idx,numel(words));
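How the third output of unique turns into per-word counts can be seen on a small sample. The sketch below uses accumarray instead of histcounts as one alternative way to tally the indices. The names sampleWords, uniqueWords, and counts are made up for illustration.

```matlab
% Hypothetical sample of lowercase words (illustration only).
sampleWords = ["the"; "rose"; "the"; "thy"; "rose"; "the"];

% uniqueWords is sorted; idx maps each element of sampleWords
% to its position in uniqueWords.
[uniqueWords,~,idx] = unique(sampleWords);

% accumarray adds 1 into slot idx(k) for every k,
% which yields one count per unique word.
counts = accumarray(idx,1);
```

Here uniqueWords is ["rose"; "the"; "thy"] and counts is [2; 3; 1]. Note that accumarray returns a column vector, whereas histcounts returns a row vector.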
Sort the words in sonnetWords by number of occurrences, from most to least common.
- [rankOfOccurrences,rankIndex] = sort(numOccurrences,'descend');
- wordsByFrequency = words(rankIndex);
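Plot the occurrences of words in the Sonnets, from the most common to the least common word. Zipf's law states that the distribution of word occurrences in a large body of text follows a power law.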
- loglog(rankOfOccurrences);
- xlabel('Rank of word (most to least common)');
- ylabel('Number of Occurrences');
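A power law appears as a straight line on log-log axes, so its exponent can be roughly estimated with a linear fit to the logarithms. The sketch below is an assumption-laden illustration, not part of the original example: it uses synthetic counts that follow an exact power law, and the names ranks, syntheticCounts, exponent, and scale are made up.

```matlab
% Synthetic counts that follow an exact power law, count = 1000 / rank
% (made-up data for illustration, not the Sonnets counts).
ranks = (1:50)';
syntheticCounts = 1000 ./ ranks;

% A power law count = C * rank^(-s) is a straight line in log-log space:
%   log(count) = log(C) - s * log(rank)
% Fit that line with a first-degree polynomial.
coeffs = polyfit(log(ranks),log(syntheticCounts),1);
exponent = -coeffs(1);    % estimated s
scale = exp(coeffs(2));   % estimated C
```

To try this on the example data, one could call polyfit(log(1:numel(rankOfOccurrences)),log(rankOfOccurrences),1). A least-squares fit in log-log space gives only a rough estimate and does not by itself show that the data follow a power law.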
Display the ten most common words in the Sonnets.
- wordsByFrequency(1:10)
-
- ans = 10x1 string array
- "and"
- "the"
- "to"
- "my"
- "of"
- "i"
- "in"
- "that"
- "thy"
- "thou"
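Calculate the total number of occurrences of each word in sonnetWords. Calculate the number of occurrences as a percentage of the total number of words, and calculate the cumulative percentage from most to least common. Write the words and their basic statistics to a table.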
- numOccurrences = numOccurrences(rankIndex);
- numOccurrences = numOccurrences';
- numWords = length(sonnetWords);
- T = table;
- T.Words = wordsByFrequency;
- T.NumOccurrences = numOccurrences;
- T.PercentOfText = numOccurrences / numWords * 100.0;
- T.CumulativePercentOfText = cumsum(numOccurrences) / numWords * 100.0;
Display the statistics for the ten most common words.
- T(1:10,:)
-
- ans=10×4 table
-     Words     NumOccurrences    PercentOfText    CumulativePercentOfText
-     ______    ______________    _____________    _______________________
-
-     "and"          490             2.7666                2.7666
-     "the"          436             2.4617                5.2284
-     "to"           409             2.3093                7.5377
-     "my"           371             2.0947                9.6324
-     "of"           370             2.0891                11.722
-     "i"            341             1.9254                13.647
-     "in"           321             1.8124                15.459
-     "that"         320             1.8068                17.266
-     "thy"          280             1.5809                18.847
-     "thou"         233             1.3156                20.163
The most common word in the Sonnets, and, occurs 490 times. Together, the ten most common words account for 20.163% of the text.
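The percentage and cumulative-percentage columns can be checked by hand on a tiny example. The counts below are made up for illustration and are assumed to be sorted from most to least common already.

```matlab
% Hypothetical occurrence counts, sorted in descending order (illustration only).
sampleCounts = [3; 2; 1];
total = sum(sampleCounts);

% Share of the text taken by each word, in percent.
pct = sampleCounts / total * 100;

% Running share of the text covered by the k most common words.
cumPct = cumsum(sampleCounts) / total * 100;
```

The first word accounts for 50% of this sample, and the last entry of cumPct is 100 because the counts cover every word in the sample.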