当前位置:   article > 正文

找出一个文件中出现次数最多的10个单词_找出文本内重复最多的单词

找出文本内重复最多的单词

记得当初面试的时候,一哥们给我出的上机题。

咋一下感觉很简单,可做着觉得有了些问题,反正当时做的挺糟糕的。

这几天又时间,又把这个问题总结了下。

总体来说,它涉及到两点主要的算法问题: 查找(相同单词,统计出现次数),排序(按 次数排序)

这个题目有几点很容易忽略:

1)单靠出现次数能唯一确定前10个单词么?如果有个数重复的怎么办? 随机抽取? 那么程序的输出岂不是也随机? 但算法的一个很重要的概念就是 对于特定的输入,应该有固定的输出。否则谁敢用你的算法。

2)既然一个纬度不足够唯一确定结果,则需要再添加一个纬度。比如,当出现次数一致的情况下,按字母顺序排序。这样就可以了。

3)所以,一定要排序。我当时的想法时,首先统计出所有出现的单词以及他们出现的次数。(这个可以有很多种方法,但用hashtable是最方便的应该)。 然后呢? 选定前10个作为基准值,按顺序排好,从第11个开始,逐个同10个当中按顺序比较,如果遇见比它小的,则从当前位置把元素往后移动(最后一个被挤出去),然后把它加在这个空出的位置上。虽然实现了,但是结果是不确定的。取决与他们出现在hashtable中的次序。而且,反复移动,效率也不高

4)现在觉得,算法一定得排序,然后才能得到唯一确定的输出结果。

 

下面是用C#实现的算法,基本需求的是根据输入文件,把统计后的单词按出现次数从小到大写到输入文件中

    class Program
    {
        static void Main(string[] args)
        {
            QuickSortUtil sortUitl = new QuickSortUtil();
            sortUitl.CountWord(@"../data/input.txt", @"../data/output.txt");
            Console.WriteLine("Done!");
            Console.Read();
        }
    }

 

 class QuickSortUtil
    {
        public void CountWord(string inputFilePath, string outputFilePath)
        {
            Hashtable allWords = getAllWords(inputFilePath);
            List<WordInfo> allWordInfos = new List<WordInfo>();
            foreach (string key in allWords.Keys)
            {
                allWordInfos.Add(new WordInfo(key, (int)allWords[key]));
            }
            qucikSort(allWordInfos, 0, allWordInfos.Count - 1);
            writeToFile(allWordInfos, outputFilePath);
        }

        private Hashtable getAllWords(string filePath)
        {
            //since if the hashtable need to increase it's capacity frequently,
            //it will affect the efficence,so give it a large enough basic capacity
            Hashtable allWords = new Hashtable(10240);
            using (StreamReader sr = new StreamReader(filePath, Encoding.Default))
            {
                string line = null;
                char[] seperators = new char[] { ' ', ',', ';', '.','!','"' };
                string[] words = null;
                while ((line = sr.ReadLine()) != null)
                {
                    //you can decide whether to add the ToLower() function
                    line = line.ToLower();
                    //do use the StringSplitOptions.RemoveEmptyEntries when use Split
                    words = line.Split(seperators, StringSplitOptions.RemoveEmptyEntries);
                    if (words != null && words.Length > 0)
                    {
                        for (int i = 0; i < words.Length; i++)
                        {
                            if (allWords.ContainsKey(words[i]))
                            {
                                allWords[words[i]] = (int)allWords[words[i]] + 1;
                            }
                            else
                            {
                                allWords.Add(words[i], 1);
                            }
                        }
                    }
                }
            }

            return allWords;
        }

        private void qucikSort(List<WordInfo> allWordInfos, int low, int high)
        {
            if (low >= high)
            {
                return;
            }

            //use the local varaiable for index.
            //do not change the original low and high because we will use them later
            int pLow = low;
            int pHigh = high;
            //record the stand value
            WordInfo value = allWordInfos[low];

            while (pLow < pHigh)
            {
                //note: if two value equals, do not move.
                //we only swap the value if it's bigger than the stand value
                while ((WordInfo.Compare(allWordInfos[pHigh],value) >=0) && pHigh > pLow)
                {
                    pHigh--;
                }
                //so here, if pHigh == pLow, the "<" check will be false
                if (WordInfo.Compare(allWordInfos[pHigh], value) < 0)
                {
                    allWordInfos[pLow] = allWordInfos[pHigh];
                    allWordInfos[pHigh] = value;
                    pLow++;
                }

                while ((WordInfo.Compare(allWordInfos[pLow],value) <= 0) && pHigh > pLow)
                {
                    pLow++;
                }
                if (WordInfo.Compare(allWordInfos[pLow], value) > 0)
                {
                    allWordInfos[pHigh] = allWordInfos[pLow];
                    allWordInfos[pLow] = value;
                    pHigh--;
                }
            }

            //pLow and pHigh is expected to be equal here
            System.Diagnostics.Trace.Assert(pLow == pHigh);

            qucikSort(allWordInfos, low, pLow - 1);
            qucikSort(allWordInfos, pLow + 1, high);
        }

        private void writeToFile(List<WordInfo> allWordInfos, string outputFilePath)
        {
            using (StreamWriter sw = new StreamWriter(outputFilePath, false, Encoding.Default))
            {
                foreach (WordInfo wi in allWordInfos)
                {
                    sw.WriteLine("{0}:{1}", wi.Word, wi.Count);
                }
            }
        }
    }

    public class WordInfo
    {
        private string _word = string.Empty;

        public string Word
        {
            get { return _word; }
            set { _word = value; }
        }
        private int _count = 0;

        public int Count
        {
            get { return _count; }
            set { _count = value; }
        }

        public WordInfo()
        {
        }

        public WordInfo(string word, int count)
        {
            this._word = word;
            this._count = count;
        }

        public WordInfo(WordInfo info)
        {
            this._word = info.Word;
            this._count = info.Count;
        }

        public static int Compare(WordInfo info1, WordInfo info2)
        {

            if (info1 == null || info2 == null)
            {
                if (info1 == null)
                {
                    if (info2 == null)
                    {
                        return 0;
                    }
                    else
                    {
                        return -1;
                    }
                }

                return 1;
            }

            if (info1.Count > info2.Count)
            {
                return 1;
            }

            if (info1.Count < info2.Count)
            {
                return -1;
            }

            return string.Compare(info1.Word, info2.Word);
        }
    }

 

算法中几点注意的:

1)使用hashtable时给它设置个相对合理的初始容量。否则hashtable扩容时需要重新对以前的元素计算位置,很影响效率

2)排序的时候用快速排序

3)Split字符串的时候,自动的去掉空内容StringSplitOptions.RemoveEmptyEntries

4)是否考虑字符串的大小写。 即大小写不一样的单词是否视为同一单词

 

声明:本文内容由网友自发贡献,不代表【wpsshop博客】立场,版权归原作者所有,本站不承担相应法律责任。如您发现有侵权的内容,请联系我们。转载请注明出处:https://www.wpsshop.cn/w/在线问答5/article/detail/864118
推荐阅读
相关标签
  

闽ICP备14008679号