用程序实现在大文件中出现次数为Top N的数字_一个大文件统计整数top

作者：码创造者 | 2024-07-22 08:04:09

踩

一个大文件统计整数top

有一个问题：有一个很大的文件（如20GB），内存装不下，其中存了很多个数字（也可能是URL之类的），找出出现次数最多的3个数字。

解题思路有这么3个点：

1. Top N的问题自然是用最小堆来解。不过如果只是找Top 3而已，也不用构造堆那么麻烦，直接几行比较代码应该就可以了。
2. 文件很大内存装不下，也就意味着不可能一次把文件整个读入内存再做处理。其实各个编程语言都有读大文件的方法，就是一行一行或一块一块地读，如C++有getline()函数，Python、Perl等都有类似的方法。

3. 形成一个Hash表，key是数字，value是该数字出现的次数。然后遍历整个Hash表，找到Top 3的数字。但是注意，这个Hash表的大小也可能会超出内存的限制。

假设数字的取值范围为0--2^31-1，那么可能的数字个数为21亿多（即2G），而每个数字占4个字节；假设所有的数字都出现了，并且每个数字只出现1次，然后再算上value的存储，那么占用的内存为: 2G * 4B * 2 =16GB，所以这个Hash表是有可能超出普通计算机的可用内存限制的。当然，如果数字个数不变而数字重复率高，自然Hash表就会小很多了。可是谁知道重复率怎样呢？如果是用生成随机数的库函数来生成这个大文件的，那么重复率还是比较低的。

在这种情况下，就要用到分而治之的思想了。如果把原来的大文件分成若干小文件，且每个小文件中出现的数字不会出现在任何其他的小文件中，那么只要统计每个小文件中出现次数的前3名，然后比较所有小文件的前3名，就能找到整个大文件中出现次数最高的前3名了。那么如何制作这样的小文件呢？对数字取模就可以了。

下面用程序来实现大文件的构造和问题的解答。

1. 我们需要创建一个很大的文件，它拥有1亿行，每行20个数字，即总共20亿的数字，即近2G个数字。

这里说一个与本题无关的数学问题。这个文件会有多大？

考虑到每个数字作为十进制保存在文件里的时候，占用的位数不一定是4位，即并不是4个字节，那么，平均每个数字在磁盘文件中占用几个字节呢？

首先，我们采用C语言的rand()来生成随机数，那么随机数的取值范围是0-RAND_MAX，这个RAND_MAX是在cstdlib中定义的一个值，就是2^31-1 = 2147483647，然后采用加权平均的算法来计算：


(1*10 + 2*90 + 3*900 + 4*9000 + 5*9e4 + 
 6*9e5 + 7*9e6 + 8*9e7 + 9*9e8 + 
 10*(2147483647-9e8))/2147483647 
= 9.95

这就是说，平均每个数字在磁盘文件中占据了差不多10个字节！那么近2G个数字，就会占据磁盘空间近20GB. 这个理论推测和实际观察也是一致的。下面就是生成这个大文件的C++程序：


// gen_rand.cpp
// g++ gen_rand.cpp -std=c++11 -o gen
 
#include <iostream>
#include <string>
#include <sstream>
#include <cstdlib> 
#include <ctime>
#include <vector>
#include <fstream>
using namespace std;
 
// For convenience, make all configurable variables Global 
const int total_rows = 2e8;  // 1e8 is close to 100M 
const int max_rows_in_buf = 1e4; // 1e4 rows are about 1M bytes
const int max_cols_in_buf = 20;  // 20 * 1e8 = 2e9 numbers
const string file = "./tmp.txt";
 
 
void gen_rand_int(ofstream & fout)
{
    srand((unsigned)time(NULL));
    int row_count = 0;
    string line; 
    stringstream ss(line);
    
    while (row_count < total_rows) {
        for (int rows=0; rows < max_rows_in_buf; ++rows) {
            for(int cols = 0; cols < max_cols_in_buf; ++cols) {
                ss << rand() << " ";
            }
            ss << "\n";
        }
        // using stringstream as buffer is 14% faster than using fout directly
        fout << ss.str();  
        ss.str("");  // clear string stream 
        
        row_count += max_rows_in_buf;
        // cout << row_count << " rows has been generated" << endl;
    }
}
 
 
int main() 
{  
    ofstream fout(file.c_str(), ios::out);
    if (fout.good()) {
        gen_rand_int(fout);
    }
    fout.close();
    
    return 0; 
}

总结一下：
a. 本人又写了一段Python代码，做同样的事情，只是为了对比一下效率。结果是，C++比Python快70倍左右。
b. 以上代码中使用了stringstream做了缓存，缓存为10000行，即每20万个数字写入文件一次，而不是每生成一行数字就写入文件。使用了缓存之后，速度快了约14%.
c. 代码运行总时间为 166秒左右。
d. 如果要做并行化提高速度，可以多个线程写入多个文件，完了用 cat file1 file2 file3 > big_file 的方式合并。

2. 假设我们想把大文件分成40个小文件。这里需要保证的是，对于每个小文件所形成的Hash表，不会超过内存的限制。总共是2G个数字，如果分布比较平均的话，40个小文件，每个小文件包含50M个数字，那么占用内存大约为： 50M*4B*2 = 400MB ,但是假设分布不够平均的话，那么即使再double一下，也不过占用800MB，还是承受的住的。其实这里分成40个小文件还是有点武断了，也许应该做更多更深入的思考。只是鉴于我们在以上第一部分中，采用计算机伪随机算法来生成的随机数，还算是比较平均分布的，因此不会有什么问题。

因此，将大文件分成40个小文件的代码如下：


// read_rand.cpp
// g++ read_rand.cpp -std=c++11 -o rr
 
#include <vector>
#include <utility>
#include <fstream>
#include <sstream>
#include <string>
#include <iostream>
using namespace std;
 
 
// For convenience, make all configurable variables Global
const int file_num = 10;
const string infile_name = "./tmp.txt";
const string out_file_prefix = "out_file_";
// The buffer is critical to performance
const int max_num_in_bufline = 1000;
 
 
int main()
{
    // Preapre OUT files 
    fstream outfiles[file_num];
    vector<string> vec_outfile_names;
    vec_outfile_names.reserve(file_num);
    
    for (int i=0; i<file_num; ++i) {
        string tmp_str;
        stringstream tmp_ss(tmp_str);
        tmp_ss << out_file_prefix << i << ".txt";
        string file_name = tmp_ss.str();
        vec_outfile_names.push_back(file_name);
        
        outfiles[i].open(file_name.c_str(), ios::app);
        if (!outfiles[i].good()) {
            cout << "Cannot open " << i << " file\n";
            exit(-1);
        }
    }
    
    stringstream ss_array[file_num];
    // each line takes how many numbers (less than line_num_in_buf), will be initialized as 0
    vector<int> buf_line_count(file_num);  
    
    
    // Open IN file 
    fstream infile(infile_name, ios::in);
    if (!infile.good()) {
        cout << "Failed to open " << infile_name << endl;
        exit(-1);
    }
    
    // Reading and then Writing
    string line;
    int count = 0;
    while (!infile.eof()) {
        count ++;
        getline(infile, line);
        stringstream ss(line);
        
        int tmp_int; 
        while (ss >> tmp_int) {
            int id = tmp_int % file_num;
            buf_line_count[id] ++;
            ss_array[id] << tmp_int << " ";
            
            if (buf_line_count[id] == max_num_in_bufline) {
                outfiles[id] << ss_array[id].str() << endl;
                // clear
                buf_line_count[id] = 0;
                ss_array[id].str("");
            }
        }
    }
    infile.close();
    
    // Write the last line (the number of figures in this line could be less than max_num_in_bufline)
    for (int i=0; i<file_num; ++i) {
        if (buf_line_count[i] > 0) {
            outfiles[i] << ss_array[i].str() << endl;            
        }
    }
    
    for (int i=0; i<file_num; ++i) {
        outfiles[i].close();
    }
 
    return 0;
}

本步骤运行时间为 520秒左右。因为涉及单个文件的读写，这一步不太好做并行化，但也不是完全不能做。这里省略了。

3. 最后一步，读取每个小文件里的数字，对每个小文件都分别形成Hash表，然后找出每个小文件中出现次数排前3的数字；最后找出整个大文件的出现次数排名前3的数字。代码如下：


// get_numbers.cpp
// g++ get_numbers.cpp -std=c++11 -o gn
 
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;
 
 
template <typename T>
vector<pair<int, int> > get_max_3_numbers(const T& pairs)
{
    vector<pair<int, int> > array; 
    for (int i=0; i<3; ++i) {
        array.push_back({0,0});
    }
    
    for (auto item : pairs) {
        int index = -1;
        if (array[0].second <= array[1].second) {
            if (array[0].second <= array[2].second) {
                index = 0;
            } else {
                index = 2;
            }
        } else {
            if (array[1].second <= array[2].second) {
                index = 1;
            } else {
                index = 2;
            }
        }
        if (array[index].second < item.second) {
            array[index] = item;
        }
    }
    
    return array;
}
 
vector<pair<int, int> > get_numbers_from_one_file(const string& infile_name) {
    unordered_map<int, int> db; 
    fstream infile(infile_name.c_str(), ios::in);
    if (!infile.good()) {
        cout << "Failed to read " << infile_name << endl; 
        exit(-1);
    }
    
    string line;
    while(!infile.eof()) {
        getline(infile, line);
        stringstream ssline(line);
        int tmp_int;
        while (ssline >> tmp_int) {
            if (db.find(tmp_int) == db.end()) {
                db[tmp_int] = 1;
            }
            else {
                db[tmp_int] += 1;
            }
        }
    }
    infile.close();
    
    return get_max_3_numbers<unordered_map<int, int> >(db);
}
 
int main()
{
    const int file_number = 40;
    vector<string> infilenames;
    for (int i=0; i<file_number; ++i) {
        stringstream ss(string(""));
        ss << "./out_file_" << i << ".txt";
        infilenames.push_back(ss.str());
    }
    
    vector<pair<int, int> > multi_file_result;
    
    for (auto infile_name : infilenames) {
        vector<pair<int, int> > result = get_numbers_from_one_file(infile_name);
        for (auto item : result) {
            multi_file_result.push_back(item);
        }
        cout << infile_name << " is done..." << endl;
    }
    
    vector<pair<int, int> > final_result = get_max_3_numbers<vector<pair<int, int> >>(multi_file_result);
    for (auto item : final_result) {
        cout << item.first << ": " << item.second << endl;
    }
    return 0;
}

最终的结果如下：


1201971198: 12
542861170: 11
49242340: 11
 
real	61m1.689s
user	60m11.015s
sys	0m48.395s

总结一下：
a. 使用了模板，确实需要，一次填的值是vector<pair<int, int> >类型，另一次填的是 unordered_map<int, int> 类型；
b. 这一步的执行时间很久，约1个小时，所以应该使用并行处理去做，后续会去实现。

用真正的程序来实现了一个以前只会纸上写写画画的面试题，大约算是做到了“Show me the code”了吧。不过，并没有结束，除了对优化的思考之外，至少还应该去实现并行化的处理。这留作不久将来的实现吧。

（未完待续）

声明：本文内容由网友自发贡献，不代表【wpsshop博客】立场，版权归原作者所有，本站不承担相应法律责任。如您发现有侵权的内容，请联系我们。转载请注明出处：https://www.wpsshop.cn/w/码创造者/article/detail/864113