Making Sense of Big Data
This is part 2 of a series on data engineering in a big data environment. It reflects my personal journey of lessons learned and culminates in Flowman, the open source tool I created to take away the burden of reimplementing all the boilerplate code over and over again in a couple of projects.
- Part 2: Big Data Engineering — Apache Spark
What to expect
This series is about building data pipelines with Apache Spark for batch processing. But some aspects are also valid for other frameworks or for stream processing. Eventually I will introduce Flowman, an Apache Spark based application that simplifies the implementation of data pipelines for batch processing.
Introduction
This second part highlights the reasons why Apache Spark is so well suited as a framework for implementing data processing pipelines. There are many other alternatives, especially in the domain of stream processing. But from my point of view, when working in a batch world (and there are good reasons to do that, especially when many non-trivial transformations are involved that require a larger amount of history, like grouped aggregations and huge joins), Apache Spark is an almost unrivaled framework that excels specifically in the domain of batch processing.
This article tries to shed some light on the capabilities Spark offers that provide a solid foundation for batch processing.
Technical requirements for data engineering
I already commented in the first part on the typical steps of a data processing pipeline. Let's just repeat those steps:
Extraction. Read data from some source system, be it a shared filesystem like HDFS, an object store like S3, or a database like MySQL or MongoDB.
Transformation. Apply some transformations like data extraction, filtering, joining or even aggregation.
Loading. Store the results back into some target system. Again, this can be a shared filesystem, an object store or a database.
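To make these three steps a bit more concrete, here is a minimal sketch of such a pipeline using the Spark DataFrame API in Scala. All paths, column names and the application name are invented for illustration only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SimplePipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("simple-pipeline")
      .getOrCreate()

    // Extraction: read raw CSV files from a shared filesystem or object store
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/raw/transactions")            // invented location

    // Transformation: filter, derive a column and aggregate
    val daily = raw
      .filter(col("amount") > 0)
      .withColumn("day", to_date(col("timestamp")))
      .groupBy(col("day"))
      .agg(sum(col("amount")).as("total_amount"))

    // Loading: store the result back into some target system
    daily.write
      .mode("overwrite")
      .parquet("s3a://some-bucket/reports/daily_totals") // invented location

    spark.stop()
  }
}
```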
We can now deduce some requirements of the framework or tool to be used for data engineering by mapping each of these steps to a desired capability — with some additional requirements added to the end.
Broad range of connectors. We need a framework that is able to read in from a broad range of data sources like files in a distributed file system, records from a relational database or a column store or even a key value store.
Broad and extensible range of transformations. In order to “apply transformations” the framework should clearly support and implement transformations. Typical transformations are simple column-wise transformations like string operations, filtering, joins, grouped aggregations — all the stuff that is offered by traditional SQL. On top of that the framework should offer a clean and simple API to extend the set of transformations, specifically the column-wise transformations. This is important for implementing custom logic that cannot be implemented with the core functionality.
Broad range of connectors. Again we need a broad range of connectors for writing the results back into the desired target storage system.
Extensibility. I already mentioned this in the second requirement above, but I feel this aspect is important enough for an explicit point. Extensibility should not be limited to the kinds of transformations; it should also include extension points for new input/output formats and new connectors.
Scalability. Whatever solution is chosen, it should be able to handle an ever growing amount of data. First, in many scenarios you should be prepared to handle more data than would fit into RAM. This helps to avoid getting completely stuck by the amount of data. Second, you might want to be able to distribute the workload onto multiple machines if the amount of data slows down processing too much.
What is Apache Spark
Apache Spark provides good solutions to all of these requirements. Apache Spark itself is a collection of libraries, a framework for developing custom data processing pipelines. This means that Apache Spark itself is not a full-blown application, but requires you to write programs which contain the transformation logic, while Spark takes care of executing that logic efficiently, distributed across multiple machines in a cluster.
Spark was initially started at UC Berkeley's AMPLab in 2009, and open sourced in 2010. Eventually, in 2013, the project was donated to the Apache Software Foundation. The project soon gained traction, especially among people who had worked with Hadoop MapReduce before. Initially Spark offered its core API around so called RDDs (Resilient Distributed Datasets), which provide a much higher level of abstraction in comparison to Hadoop and thereby helped developers to work much more efficiently.
Later on, the newer and now preferred DataFrame API was added, which implements a relational algebra with an expressiveness comparable to SQL. This API provides concepts very similar to tables in a database, with named and strongly typed columns.
While Apache Spark itself is developed in Scala (a mixed functional and object oriented programming language running on the JVM), it provides APIs to write applications using Scala, Java, Python or R. When looking at the official examples, you quickly realize that the API is really expressive and simple.
Connectors. Since Apache Spark is only a processing framework with no built-in persistence layer, it has always relied on connectivity to storage systems like HDFS, S3 or relational databases via JDBC. This implies that a clean connectivity design was built in from the beginning, specifically with the advent of DataFrames. Nowadays almost every storage or database technology simply needs to provide an adaptor for Apache Spark to be considered a possible choice in many environments.
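As a small illustration of how such connectors look from the user's perspective, the following sketch reads a table from a relational database via the built-in JDBC data source and writes it out as Parquet files. The connection details, table name and target path are placeholders, not a real setup.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-ingest").getOrCreate()

// Read a table from a relational database via the generic JDBC data source
val customers = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://db-host:3306/shop")  // placeholder connection
  .option("dbtable", "customers")
  .option("user", "reader")
  .option("password", "secret")
  .load()

// Write the same data into a distributed filesystem as Parquet files
customers.write
  .mode("overwrite")
  .parquet("hdfs:///data/landing/customers")        // placeholder location
```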
Transformations. The original core library provides the RDD abstraction with many common transformations like filtering, joining and grouped aggregations. But nowadays the newer DataFrame API is to be preferred, and it provides a huge set of transformations mimicking SQL. This should be enough for most needs.
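For example, a join followed by a grouped aggregation, the kind of logic you would otherwise express in SQL, looks roughly like this with the DataFrame API. The DataFrames `orders` and `customers` and their columns are assumed to have been read in an earlier step; they only serve as an illustration.

```scala
import org.apache.spark.sql.functions._

// Join two DataFrames and compute per-country revenue, similar to a
// SQL SELECT ... JOIN ... GROUP BY query
val revenuePerCountry = orders
  .join(customers, Seq("customer_id"))
  .filter(col("status") === "COMPLETED")
  .groupBy(col("country"))
  .agg(
    sum(col("amount")).as("revenue"),
    countDistinct(col("customer_id")).as("paying_customers")
  )
```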
Extensibility. New transformations can be easily implemented with so called user defined functions (UDFs), where you only need to provide a small snippet of code working on an individual record or column, and Spark wraps it up such that the function can be executed in parallel and distributed across a cluster of computers.
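A minimal sketch of such a UDF is shown below; it normalizes phone numbers in a hypothetical `contacts` DataFrame. Only the small function body is custom code, Spark handles the distribution.

```scala
import org.apache.spark.sql.functions.{col, udf}

// Custom column-wise logic: strip everything except digits and '+'
val normalizePhone = udf { (phone: String) =>
  if (phone == null) null
  else phone.replaceAll("[^0-9+]", "")
}

// Apply the UDF like any built-in function (contacts is assumed to exist)
val cleaned = contacts.withColumn("phone", normalizePhone(col("phone")))
```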
Since Spark has a very high code quality, you can even go down one or two layers and implement new functionality using the internal developer APIs. This might be a little bit more difficult, but it can be very rewarding for those rare cases which cannot be implemented using UDFs.
Scalability. Spark was designed to be a Big Data tool from the very beginning, and as such it can scale to many hundreds of nodes within different types of clusters (Hadoop YARN, Mesos and lately, of course, Kubernetes). It can process data much bigger than what would fit into RAM. One very nice aspect is that Spark applications can also run very efficiently on a single node without any cluster infrastructure, which is nice from a developer's point of view for testing, but which also makes it possible to use Spark for not-so-huge amounts of data while still benefiting from Spark's features and flexibility.
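For instance, the same application code can run locally simply by choosing a local master, which is handy for unit tests and for smaller data sets; the application name here is just an example.

```scala
import org.apache.spark.sql.SparkSession

// A local master runs Spark on all cores of a single machine,
// without any cluster infrastructure
val spark = SparkSession.builder()
  .appName("local-development-run")
  .master("local[*]")
  .getOrCreate()
```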
With these four aspects, Apache Spark is very well suited to typical data transformation tasks formerly done with dedicated and expensive ETL software from vendors like Talend or Informatica. By using Spark instead, you get all the benefits of a vibrant open source community and the freedom of tailoring applications precisely to your needs.
Although Spark was created with huge amounts of data in mind, I would always consider it even for smaller amounts of data simply because of its flexibility and the option to seamlessly grow with the amount of data.
Alternatives
Of course Apache Spark isn't the only option for implementing data processing pipelines. Software vendors like Informatica and Talend also provide very solid products for people who prefer to buy into complete ecosystems (with all the pros and cons).
But even in the Big Data open source world, there are some projects which could seem to be alternatives at first glance.
First, we still have Hadoop around. But Hadoop actually consists of three components, which have been split up cleanly: First we have the distributed file system HDFS, which is capable of storing really huge amounts of data (petabytes, say). Next we have the cluster scheduler YARN for running distributed applications. Finally we have the MapReduce framework for developing a very specific type of distributed data processing application. While the first two components, HDFS and YARN, are still widely used and deployed (although they feel the pressure from cloud storage and Kubernetes as possible replacements), the MapReduce framework nowadays simply shouldn't be used by any project any more. Its programming model is much too complicated, and writing non-trivial transformations can become really hard. So, yes, HDFS and YARN are fine as infrastructure services (storage and compute), and Spark is well integrated with both.
Other alternatives could be SQL execution engines (without an integrated persistence layer) like Hive, Presto, Impala, etc. While these tools often also provide broad connectivity to different data sources, they are all limited to SQL. For one, SQL queries themselves can become quite tricky for long chains of transformations with many common table expressions (CTEs). Second, it is often more difficult to extend SQL with new features. I wouldn't say that Spark is better than these tools in general, but I do say that Spark is better for data processing pipelines. These tools really shine at querying existing data. But I would not want to use these tools for creating data; that was never their primary scope. On the other hand, while you can use Spark via the Spark Thrift Server to execute SQL for serving data, it wasn't really created for that scenario.
Development
One question I often hear is which programming language should be used for accessing the power of Spark. As I wrote above, Spark out of the box provides bindings for Scala, Java, Python and R, so the question really makes sense.
My advice is to use either Scala or Python (maybe R, but I don't have experience with that), depending on the task. Never use Java (it really feels much more complicated than the clean Scala API); invest some time in learning some basic Scala instead.
Now that leaves us with the question “Python or Scala”.
- If you are doing data engineering (read, transform, store), then I strongly advise using Scala. First, since Scala is a statically typed language, it is actually simpler to write correct programs than with Python. Second, whenever you need to implement new functionality not found in Spark, you are better off with the native language of Spark. Although Spark supports UDFs in Python quite well, you will pay a performance penalty, and you cannot dive any deeper. Implementing new connectors or file formats with Python will be very difficult, maybe even impossible.
- If you are doing Data Science (which is not the scope of this article series), then Python is the much better option, with all those Python packages like Pandas, SciPy, SciKit Learn, Tensorflow etc.
Apart from the different libraries used in those two scenarios, the typical development workflow is also very different: The applications developed by data engineers often run in production every day or even every hour. Data scientists, on the other hand, often work interactively with data, and some insight is the final deliverable. So production readiness is much more critical for data engineers than for data scientists. And even though many people will disagree, “production readiness” is much harder to achieve with Python or any other dynamically typed language.
Shortcomings of a Framework
Now, since Apache Spark is such a nice framework for complex data transformations, we can simply start implementing our pipelines. Within a few lines of code, we can instruct Spark to perform all the magic to process our multi-terabyte data set into something more accessible.
Wait, not so fast! I did that multiple times in the past for different companies, and after some time I found out that many aspects have to be implemented over and over again. While Spark excels at data processing itself, I argued in the first part of this series that robust data engineering is about more than just the processing itself. Logging, monitoring, scheduling and schema management all come to my mind, and all these aspects need to be addressed for every serious project.
Those non-functional aspects often require non-trivial code to be written, some of which can be very low level and technical. Therefore Spark alone is not enough to implement a production quality data pipeline. Since those issues arise independently of the specific project and company, I propose to split up the application into two layers: a top layer containing the business logic, encoded as data transformations together with the specifications of the data sources and data targets, and a lower layer that takes care of executing the whole data flow, providing relevant logging and monitoring metrics, and handling schema management.
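To make the idea a bit more tangible, here is a rough sketch of what such a split could look like in Scala. All names and interfaces are purely illustrative assumptions; they are not an existing API and not how Flowman actually implements this.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Top layer: the business logic only declares sources, a transformation and a target
trait Job {
  def sources: Map[String, String]                        // logical name -> location
  def transform(inputs: Map[String, DataFrame]): DataFrame
  def target: String
}

// Lower layer: a generic runner owns the technical concerns
// (reading, writing, logging; metrics and schema management would live here too)
object Runner {
  def run(job: Job)(implicit spark: SparkSession): Unit = {
    val inputs = job.sources.map { case (name, path) =>
      name -> spark.read.parquet(path)
    }
    val result = job.transform(inputs)
    println(s"Writing result with ${result.columns.length} columns to ${job.target}")
    result.write.mode("overwrite").parquet(job.target)
  }
}
```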
Final Words
This was the second part of a series about building robust data pipelines with Apache Spark. We had a strong focus on why Apache Spark is very well suited for replacing traditional ETL tools. Next time I will discuss why another layer of abstraction will help you to focus on business logic instead of technical details.
Translated from: https://medium.com/@kupferk/big-data-engineering-apache-spark-d67be2d9b76f