Analyzing Enron’s Accounting Scandal with Natural Language Processing

Intro

Natural Language Processing (NLP) has been gaining traction in recent years, allowing us to understand unstructured text data in a way that was never possible before. One of the promises of NLP is that its techniques can help detect fraud in companies and shed light on potential violations at an early stage.

About the dataset

I’ve only managed to find two earnings call transcripts online, and only one of them is readable when converted from PDF to text. You can find the original document here.

The earnings call transcript used in this article is from Enron’s conference call held on November 14, 2001. Enron filed for bankruptcy on December 2, 2001.

Pre-processing the dataset

As you can see from the original earnings call PDF document, the document is not digital and contains numbers in between the conversations.

[Figure: A snapshot of Enron’s earnings call in PDF format.]

To load the spoken sentences into R for analysis, I used Robotic Process Automation (RPA) to massage the text data into a more structured format. Below is a snapshot of the organized text data in CSV format.

[Figure: Enron’s earnings call in CSV format.]
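
The article does this step with an RPA tool and does not show the raw text layout, so the following is only a rough Python sketch of the same idea. It assumes each turn starts with an upper-case "SPEAKER:" prefix and that the stray numbers sit on lines of their own. The sample text, the speaker names, and the column names `speaker` and `statement` are all made up for illustration.

```python
import csv
import io
import re

# Hypothetical raw text: the real PDF-to-text layout is not shown in the
# article, so the "SPEAKER: words" format and number-only lines are assumed.
RAW = """\
1
SPEAKER A: Good morning and thank you for joining us.
2
We want to discuss the restated results.
3
SPEAKER B: Can you explain the transactions?
"""

# An upper-case name followed by a colon marks the start of a new turn.
SPEAKER_LINE = re.compile(r"^([A-Z][A-Z .'-]+):\s*(.*)$")


def to_rows(raw_text):
    """Group transcript lines into [speaker, statement] rows."""
    rows = []
    for line in raw_text.splitlines():
        line = line.strip()
        if not line or line.isdigit():
            continue  # skip blank lines and the stray numbers
        match = SPEAKER_LINE.match(line)
        if match:
            rows.append([match.group(1).title(), match.group(2)])
        elif rows:
            rows[-1][1] += " " + line  # continuation of the current turn
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["speaker", "statement"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = to_rows(RAW)
print(to_csv(rows))
```

Keeping one row per speaker turn makes the later per-speaker comparisons a simple group-by on the `speaker` column.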

I then tokenized the text and removed common stop words from the dataset. To make the results more insightful, I also dropped all the numbers and a few filler words such as “um,” “uh,” etc. After cleaning the dataset, I was left with around 1942 words to work with.

[Figure: Number of words left after pre-processing the dataset.]
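
The cleaning steps described above (tokenize, drop stop words, drop numbers and filler words) can be sketched as follows. The original analysis was done in R, so this Python version is only an illustration of the steps, not the author's code. The stop-word list is a tiny hypothetical one and the sample sentence is made up.

```python
import re

# Assumption: a tiny illustrative stop-word list; a real analysis would use
# the fuller list that ships with an NLP library.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is",
              "we", "our", "that", "for", "on", "it"}
FILLER_WORDS = {"um", "uh"}


def preprocess(text):
    """Lowercase, tokenize, then drop stop words, fillers, and numbers."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [
        tok for tok in tokens
        if tok not in STOP_WORDS
        and tok not in FILLER_WORDS
        and not any(ch.isdigit() for ch in tok)
    ]


# Made-up sentence, not a quote from the transcript.
sample = "Um, we reviewed the financial results for 3 quarters, uh, in our 2001 filing."
tokens = preprocess(sample)
print(tokens)       # ['reviewed', 'financial', 'results', 'quarters', 'filing']
print(len(tokens))  # 5
```

Running `len()` on the cleaned token list of the full transcript is how a figure like the roughly 1942 remaining words would be obtained.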

Goal

In this article, we will look at Enron’s accounting scandal from a data science perspective. I want to answer three questions from the dataset:

  1. What are the sentiments of the company officials when the company is in hot water?
  2. What are the words the officials used the most in the conference call?
  3. Is it possible to find out what the ‘trouble’ is from the words of the officials?

Now, let’s get started!

Sentiments of the company officials

First, let’s look at the overall polarity and sentiment of the conference call. Based on the chart below, we can see that opinions in the positive polarity range and opinions in the negative polarity range are almost equally distributed.

[Figure: The overall polarity of Enron’s earnings call transcript.]

Now, let’s look at the polarity for each speaker in the meeting.

[Figure: The overall polarity distributed by speaker in Enron’s earnings call.]

The statements by Enron officials are almost equally distributed between the positive and negative ranges. When I compared the result with other companies (from the Netflix and Luckin Coffee earnings call analysis articles I wrote), it seems that Enron’s representatives gave more negative opinions, which makes sense as the company was in trouble at the time. Hence, we can assume that if company officials express as many negative opinions as positive ones, or more, in an earnings call, it is worth doing more research to see if the company is going through difficulties. However, more NLP projects are needed to support this claim.
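
The article does not say which lexicon or package produced the polarity charts. As a minimal sketch of how a per-speaker polarity score can be computed, the Python below assumes a simple lexicon-based score: positive matches minus negative matches, divided by the total number of matches. The word lists, speaker names, and statements are all made up.

```python
from collections import defaultdict

# Assumption: a toy sentiment lexicon; the article does not name the lexicon
# or the scoring method behind its polarity charts.
POSITIVE = {"confidence", "strong", "growth", "improve"}
NEGATIVE = {"debt", "loss", "investigation", "restate"}


def polarity(tokens):
    """Return a score in [-1, 1]: (positive - negative) / matched words."""
    pos = sum(tok in POSITIVE for tok in tokens)
    neg = sum(tok in NEGATIVE for tok in tokens)
    matched = pos + neg
    return 0.0 if matched == 0 else (pos - neg) / matched


def polarity_by_speaker(statements):
    """Average the statement-level polarity for each speaker."""
    scores = defaultdict(list)
    for speaker, text in statements:
        scores[speaker].append(polarity(text.lower().split()))
    return {speaker: sum(vals) / len(vals) for speaker, vals in scores.items()}


# Made-up statements with placeholder speaker names.
statements = [
    ("Speaker A", "we have strong confidence in the core business"),
    ("Speaker A", "the investigation and the debt are a concern"),
    ("Speaker B", "we expect growth"),
]
result = polarity_by_speaker(statements)
print(result)  # {'Speaker A': 0.0, 'Speaker B': 1.0}
```

A speaker whose positive and negative statements cancel out lands near zero, which is the "almost equally distributed" pattern described for Enron's officials.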

Most used words in the earnings call

The charts below show the most frequent words used in the conference call.

[Figure: Bar charts of the most used unigrams and bigrams.]

A few observations based on the charts above:

  1. “financial” was mentioned 21 times
  2. “Special Committee” was mentioned 8 times
  3. “question” and “questions” together were mentioned 28 times throughout the transcript
  4. Even though they are split into two terms, “related party” and “party transactions” can be combined into “related party transactions”
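
Counts like the ones above come from tallying unigrams and bigrams over the cleaned tokens. Here is a minimal Python sketch of that tally (the article's own charts were produced in R). The token list is made up and only stands in for the cleaned transcript. The helper also shows why observation 4 holds: two overlapping bigrams show up together as a single trigram.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count n-grams formed from n consecutive tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Made-up token list standing in for the cleaned transcript.
tokens = ["related", "party", "transactions", "special", "committee",
          "related", "party", "transactions", "financial"]

unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

print(unigrams.most_common(3))
print(bigrams["related party"], bigrams["party transactions"])  # 2 2
print(trigrams["related party transactions"])                   # 2
```

Note that if stop words are removed before counting, a bigram may join two words that were not adjacent in the original sentence.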

Most used sentiment words

In this section, you will find the word distribution of Enron officials in the earnings call. The words are categorized as either emotion or sentiment. According to this article, emotion is the psychological state experienced by a person, while sentiment is the mental attitude created based on the emotions.

In our case, joy, sadness, anger, fear, surprise, and disgust are emotion categories, whereas anticipation and trust fall under the sentiment categories.

For the sake of simplicity, we will assume that the emotion category consists of the participants’ behavioral responses towards a particular item or matter. Sentiments, on the other hand, are thoughts or actions influenced by the emotions.

[Figure: Sentiment word frequency distribution of Enron officials in the earnings call on November 14, 2001.]

Based on the chart above, we can make a few high-level assumptions:

  • Financials: “cash,” “debt,” and “assets” were heavily mentioned. These words may be related.
  • Fraud: the words “investigation” and “regulatory” were mentioned.
  • Sentiment: The word “confidence” was mentioned a couple of times at the conference. Note that this earnings call was held about half a month before Enron declared bankruptcy on December 2, 2001.
  • Positive sentiment bar chart: Notice how there are not many future-oriented keywords.

[Figure: Sentiment word distribution by speaker in Enron’s earnings call.]
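
Charts like these boil down to looking up each word in a lexicon that maps words to categories such as joy, fear, trust, or anticipation, then counting the hits per speaker. The article does not name the lexicon it used, so the Python sketch below uses a made-up mapping. The word-to-category assignments, speaker names, and statements are hypothetical and are not taken from any published lexicon.

```python
from collections import Counter, defaultdict

# Assumption: a toy word-to-category mapping invented for illustration.
LEXICON = {
    "confidence": ["trust", "joy"],
    "debt": ["sadness"],
    "investigation": ["anticipation"],
    "cash": ["trust"],
}


def category_counts(statements):
    """Count lexicon category hits for each speaker."""
    counts = defaultdict(Counter)
    for speaker, text in statements:
        for token in text.lower().split():
            for category in LEXICON.get(token, []):
                counts[speaker][category] += 1
    return counts


# Made-up statements with placeholder speaker names.
statements = [
    ("Speaker A", "cash and debt remain the focus"),
    ("Speaker A", "we want to restore confidence"),
    ("Speaker B", "the investigation continues"),
]
counts = category_counts(statements)
for speaker, counter in counts.items():
    print(speaker, dict(counter))
```

One word can belong to several categories, so the category totals can exceed the number of matched words.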

Stitching words into a network

[Figure: Word network graphs created based on Enron’s earnings call back in 2001.]

Typically, network graphs tell us much about the company’s plan moving forward. It may be due to the small dataset used in this article, but the network graphs created from Enron’s earnings call did not give much insight. However, it is interesting that few, if any, future-oriented words can be found in the network graphs.

  1. Only the terms that form the positions (job titles) of Enron’s officials are strongly connected, with darker lines linking the words together.
  2. Notice that the word “debt” is related to other asset and equity-related words.
  3. No future-oriented or solution-related keywords are found in the graphs.
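
A word network of this kind is usually built from bigram counts: each adjacent word pair becomes an edge, and the pair's count becomes the edge weight, which plays the role of the darker lines in the graphs. The article does not show how its graphs were built, so the Python below is only a minimal sketch under that assumption. The token list is made up, and `min_count` is a hypothetical threshold parameter.

```python
from collections import Counter


def build_edges(tokens, min_count=1):
    """Return {(word_a, word_b): weight} for adjacent word pairs."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return {pair: n for pair, n in pairs.items() if n >= min_count}


def neighbors(edges, word):
    """Words linked to `word` in either direction, with summed weights."""
    linked = Counter()
    for (a, b), weight in edges.items():
        if a == word:
            linked[b] += weight
        elif b == word:
            linked[a] += weight
    return linked


# Made-up token list standing in for the cleaned transcript.
tokens = ["debt", "equity", "ratio", "debt", "equity", "assets", "debt", "rating"]

edges = build_edges(tokens)
strong = build_edges(tokens, min_count=2)  # keep only the "darker" edges
print(strong)                    # {('debt', 'equity'): 2}
print(neighbors(edges, "debt"))
```

Raising `min_count` prunes weak edges, which matters on a small dataset where most pairs occur only once.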

Conclusion

  1. Using sentiment derived from company officials’ words may be challenging, as officials typically spend more time boosting investors’ confidence than explaining the issues or troubles the company is facing.
  2. The word ‘financial’ was mentioned many times in Enron’s earnings call. This observation suggests that it is possible to identify the company’s focus, which can be a future action or an issue being tackled.
  3. An earnings call with few or no future-oriented keywords may suggest that the company is going through trouble.

Translated from: https://medium.com/swlh/analyze-enrons-accounting-scandal-with-natural-language-processing-3216ecfd7b00
