Most of us use a search engine such as Google every day to look things up, even simple things. But have you ever wondered how a search engine retrieves the documents that match what we are searching for (the query)?
In this article, I will show you how to build a simple search engine from scratch using Python and its supporting libraries. After reading it, I hope you will understand how to build your own search engine based on what you need. Without further ado, let's go!
Side note: I've also created a notebook of the code, so if you want to follow along you can click on this link here. Also, the documents that I will use are in Indonesian. But don't worry, you can use any documents regardless of the language.
Outline
Before we get our hands dirty, let me give you the steps for implementing this; in each section, I will explain how to build it. They are:
- Prepare the documents
- Create a term-document matrix with TF-IDF weighting
- Calculate the similarities between the query and the documents using cosine similarity
- Retrieve the articles that have the highest similarity to the query
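To make these steps concrete before going through them one by one, here is a minimal sketch of the whole pipeline in plain Python. It is only an illustration under a few assumptions: the three English sample documents, the whitespace tokenizer, the function names (`tokenize`, `build_tfidf`, `cosine`, `search`), and the smoothed IDF formula are all choices made for this example, not necessarily what the notebook uses.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase + whitespace split is enough for this toy example.
    return text.lower().split()

def build_tfidf(docs):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (one common variant; other formulas exist).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(tokens).items()}
               for tokens in tokenized]
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(query, docs, top_k=3):
    """Rank documents against the query; return (doc_index, score) pairs."""
    vectors, idf = build_tfidf(docs)
    # Query terms that never occur in the corpus carry no information here.
    q_counts = Counter(t for t in tokenize(query) if t in idf)
    q_vec = {t: tf * idf[t] for t, tf in q_counts.items()}
    scored = [(i, cosine(q_vec, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return [(i, s) for i, s in scored[:top_k] if s > 0]

# Hypothetical sample documents, used only for illustration.
docs = [
    "the team won the football match",
    "the player scored a goal in the football final",
    "badminton champion wins the open title",
]
results = search("football goal", docs)
for index, score in results:
    print(f"{score:.3f}  {docs[index]}")
```

The second document ranks first because it contains both query terms, and the third is dropped because it shares no terms with the query. A real search engine would typically rely on a library for the weighting and on a proper tokenizer for the target language.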
The Process
Retrieve the documents
The first thing we have to do is retrieve the documents from the Internet. In this case, we can use web scraping to extract documents from a website. I will scrape documents from kompas.com in the sport category, specifically the popular articles. Because the documents are in HTML format, we initialize a BeautifulSoup object to parse the HTML, which makes it much easier to extract each element that we want.
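As a small illustration of the parsing step, the sketch below builds a BeautifulSoup object from an HTML string and pulls out link text and URLs. The HTML snippet, the `popular` and `article-link` class names, and the example.com URLs are made up for this example; the real kompas.com markup will differ, and in practice the HTML would come from an HTTP request rather than a string literal.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page (not real kompas.com markup).
html = """
<html><body>
  <div class="popular">
    <a class="article-link" href="https://example.com/sport/a">First headline</a>
    <a class="article-link" href="https://example.com/sport/b">Second headline</a>
  </div>
</body></html>
"""

# Parse the HTML with Python's built-in parser.
soup = BeautifulSoup(html, "html.parser")

# Collect (title, url) pairs from every anchor tag with the assumed class name.
links = [
    (a.get_text(strip=True), a["href"])
    for a in soup.find_all("a", class_="article-link")
]

for title, url in links:
    print(title, url)
```

Once the page is parsed, selecting elements by tag and class like this is usually all that is needed to collect the article links before downloading each article.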