Enron Dataset Analysis and Processing
Introduction
Natural Language Processing (NLP) has been gaining traction in recent years, allowing us to understand unstructured text data in ways that were not possible before. One of the promises of NLP is to use these techniques to detect corporate fraud and surface potential violations at an early stage.
About the dataset
I’ve only managed to find two earnings call transcripts online, and only one of them is readable when converted from PDF to text. You can find the original document here.
The earnings call transcript used in this article is from Enron’s conference call held on November 14, 2001. Enron filed for bankruptcy on December 2, 2001.
Pre-processing the dataset
As you can see from the original earnings call PDF document, the document is not digital and contains numbers interspersed between the lines of conversation.
To feed the spoken sentences into R for analysis, I used Robotic Process Automation (RPA) to massage the text data into a more structured format. Below is a snapshot of the organized text data in CSV format.
I then tokenized the text and removed common stop words from the dataset. To make the results more insightful, I also dropped all the numbers and a few filler words such as “um” and “uh.” After cleaning the dataset, I was left with around 1942 words to work with.
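The cleaning step described above can be sketched roughly as follows. The original analysis was done in R; this is a Python illustration only, and the stop-word list, the filler list, the `clean_tokens` helper, and the sample sentence are all assumptions made for the example, not the lists or data actually used.

```python
import re

# Illustrative lists only (assumptions); a real run would use a full stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "we",
              "that", "our", "i", "it", "for", "on", "are", "have"}
FILLER_WORDS = {"um", "uh"}

def clean_tokens(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words and fillers."""
    # Matching letters only means numbers such as "1997" never become tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    tokens = [t.strip("'") for t in tokens]
    return [t for t in tokens
            if t and t not in STOP_WORDS and t not in FILLER_WORDS]

# Hypothetical sentence, not a quote from the transcript.
sample = "Um, we have restated our financial results for 1997 to 2000."
print(clean_tokens(sample))
```

Applying a function like this to every row of the CSV and concatenating the results would give the word list that the rest of the analysis works from.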
Goal
In this article, we will look at Enron’s accounting scandal from a data science perspective. I wanted to answer three questions from the dataset:
- What are the sentiments of the company officials when the company is in hot water?
- Which words did the officials use most in the conference call?
- Is it possible to find out what the ‘trouble’ is from the words of the officials?
Now, let’s get started!
Sentiments of the company officials
First, let’s look at the overall polarity and sentiment of the conference call. Based on the plot below, we can see that positive opinions (those in the positive polarity range) and negative opinions are almost equally distributed.
Now, let’s look at the polarity of each speaker in the meeting.
The statements made by Enron officials are almost equally distributed between the positive and negative ranges. When I compared the result with other companies (the Netflix and Luckin Coffee earnings call analysis articles I wrote), Enron’s representatives appeared to give more negative opinions, which makes sense as the company was in trouble at the time. Hence, we can tentatively assume that if company officials express as many negative opinions as positive ones, or more, in an earnings call, it is worth doing more research to see whether the company is going through difficulties. However, more NLP projects need to be done to support this claim.
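To make the per-speaker comparison concrete, here is a minimal Python sketch of lexicon-based polarity. It is a simplification and not the scoring used for the plots: the `POSITIVE` and `NEGATIVE` word sets, the speaker names, and the example statements are hypothetical, and the score is simply (positive hits minus negative hits) divided by the word count.

```python
from collections import defaultdict

# Tiny hypothetical lexicon; a real analysis would use a published sentiment lexicon.
POSITIVE = {"confidence", "strong", "improve", "growth"}
NEGATIVE = {"debt", "loss", "investigation", "restated"}

def polarity(statement):
    """Return (positive hits - negative hits) / word count for one statement."""
    words = [w.strip('.,!?"').lower() for w in statement.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return score / len(words)

def polarity_counts_by_speaker(rows):
    """Count positive, negative, and neutral statements for each speaker."""
    counts = defaultdict(lambda: {"positive": 0, "negative": 0, "neutral": 0})
    for speaker, statement in rows:
        p = polarity(statement)
        label = "positive" if p > 0 else "negative" if p < 0 else "neutral"
        counts[speaker][label] += 1
    return {speaker: dict(c) for speaker, c in counts.items()}

# Hypothetical rows in (speaker, statement) form, mirroring the CSV layout.
rows = [
    ("Speaker A", "We have strong confidence in the core business."),
    ("Speaker A", "The debt and the investigation are serious."),
    ("Speaker B", "Thank you for joining the call."),
]
print(polarity_counts_by_speaker(rows))
```

Comparing the positive and negative counts per speaker is what allows the "equal or more negative" rule of thumb to be checked across different calls.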
Most used words in the earnings call
The charts below show the words used most frequently in the conference call.
A few observations based on the charts above:
- “financial” was mentioned 21 times
- “Special Committee” was mentioned 8 times
- “question” and “questions” together were mentioned 28 times throughout the transcript
- Although split into two terms, “related party” and “party transactions” can be combined into “related party transactions”
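Counts like the ones above come from tallying single words and adjacent word pairs. The sketch below shows one way to do that in Python; the `top_terms` helper and the short token list are assumptions for illustration, not the author's R code or the real transcript.

```python
from collections import Counter

def top_terms(tokens, n=1, k=3):
    """Return the k most common n-grams (joined with spaces) in a token list."""
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams).most_common(k)

# Hypothetical cleaned tokens, standing in for the transcript's word list.
tokens = ["related", "party", "transactions", "financial",
          "related", "party", "transactions", "financial", "financial"]

print(top_terms(tokens, n=1, k=1))  # most frequent single word
print(top_terms(tokens, n=2, k=3))  # most frequent adjacent word pairs
```

Overlapping pairs such as "related party" and "party transactions" appearing with the same count is a hint that they belong to one longer phrase.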
Most used sentiment words
In this section, you will find the word distribution of Enron officials in the earnings call. The words are categorized as either emotion or sentiment. According to this article, emotion is the psychological state experienced by a person, while sentiment is the mental attitude created based on those emotions.
In our case, joy, sadness, anger, fear, surprise, and disgust are emotion categories, whereas anticipation and trust fall under the sentiment categories.
For the sake of simplicity, we will assume that the emotion category consists of the participants’ behavioral responses to a particular item or matter. Sentiments, on the other hand, are thoughts or actions influenced by those emotions.
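The emotion versus sentiment split described above can be expressed as a small tally. In this Python sketch the category names follow the article, but the word-to-category `LEXICON`, the `tally` helper, and the token list are hypothetical and do not reproduce any published lexicon.

```python
from collections import Counter

EMOTIONS = {"joy", "sadness", "anger", "fear", "surprise", "disgust"}
SENTIMENTS = {"anticipation", "trust"}

# Hypothetical word-to-category mapping, for illustration only.
LEXICON = {
    "confidence": ["trust"],
    "plan": ["anticipation"],
    "debt": ["sadness"],
    "investigation": ["fear"],
}

def tally(tokens):
    """Count category hits, grouped into 'emotion' and 'sentiment'."""
    groups = {"emotion": Counter(), "sentiment": Counter()}
    for token in tokens:
        for category in LEXICON.get(token, []):
            if category in EMOTIONS:
                groups["emotion"][category] += 1
            elif category in SENTIMENTS:
                groups["sentiment"][category] += 1
    return groups

# Hypothetical cleaned tokens; "cash" is not in the lexicon and is ignored.
tokens = ["confidence", "debt", "investigation", "confidence", "cash"]
result = tally(tokens)
print(dict(result["emotion"]))
print(dict(result["sentiment"]))
```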
Based on the chart above, we can make a few high-level assumptions:
- Financials: “cash,” “debt,” and “assets” were heavily mentioned. These words may be related.
- Fraud: the words “investigation” and “regulatory” were mentioned
- Sentiment: The word “confidence” was mentioned a couple of times during the conference. Note that this earnings call was held about half a month before Enron declared bankruptcy on December 2, 2001.
- Positive sentiment bar chart: Notice that there are not many future-oriented keywords.
Typically, network graphs tell us a lot about a company’s plans moving forward. Perhaps because of the small dataset used in this article, the network graph created from Enron’s earnings call did not give much insight. Still, it is interesting that few, if any, future-oriented words can be found in the network graphs.
- Only the terms that make up the positions (titles) of Enron’s officials are strongly connected, with darker lines linking the words together.
- Notice that the word “debt” is related to other asset- and equity-related words.
- No future-oriented or solution-related keywords are found in the graphs
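A word network of this kind is usually built from counts of adjacent word pairs, with heavier (darker) edges for pairs that occur more often. The Python sketch below illustrates the idea under that assumption; the `bigram_edges` helper, the `FUTURE_WORDS` set, and the token lists are hypothetical and are not taken from the transcript or from the original R code.

```python
from collections import Counter

# Hypothetical list of future-oriented keywords, for illustration only.
FUTURE_WORDS = {"plan", "expect", "outlook", "forecast"}

def bigram_edges(statements, min_count=2):
    """Count adjacent word pairs within each statement and keep the frequent ones."""
    edges = Counter()
    for tokens in statements:
        edges.update(zip(tokens, tokens[1:]))
    # The count acts as the edge weight, i.e. how dark the line would be drawn.
    return {pair: n for pair, n in edges.items() if n >= min_count}

# Hypothetical cleaned statements (one token list per statement).
statements = [
    ["chief", "executive", "officer"],
    ["chief", "financial", "officer"],
    ["chief", "executive", "officer", "debt", "equity"],
]

edges = bigram_edges(statements)
nodes = {word for pair in edges for word in pair}
print(edges)
print(nodes & FUTURE_WORDS)  # future-oriented words present in the graph, if any
```

With data like this, only the job-title words survive the `min_count` filter, which mirrors the observation that titles form the strongest links while no future-oriented keywords appear.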
Conclusion
- Relying on sentiment derived from company officials’ words may be challenging, as officials typically spend more time boosting investors’ confidence than explaining the issues or troubles the company is facing
- The word ‘financial’ was mentioned many times in Enron’s earnings call. This observation suggests that word frequency can help identify the company’s focus, which can be a future action or an issue being tackled
- An earnings call with few or no future-oriented keywords may suggest that the company is going through trouble