This series covers:
1. DataX 3.0 deployment and verification
2. MySQL-related synchronization: MySQL to MySQL, and MySQL to/from HDFS
3. Oracle-related synchronization: Oracle to HDFS
4. Sybase-related synchronization: Sybase to HDFS
5. A comparison of ETL tools (DataPipeline, Kettle, Talend, Informatica, DataX, Oracle GoldenGate)
This article introduces the features and deployment of DataX 3.0, laying the groundwork for the rest of the series.
It is organized into four parts: introduction, deployment, verification, and a running example.
DataX is the open-source version of Alibaba Cloud DataWorks Data Integration, an offline data synchronization tool/platform widely used inside Alibaba Group. DataX implements efficient data synchronization between a wide range of heterogeneous data sources, including MySQL, Oracle, OceanBase, SQL Server, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), Hologres, DRDS, and Databend.
As a data synchronization framework, DataX abstracts the synchronization of different data sources into Reader plugins, which read data from a source, and Writer plugins, which write data to a target; in principle the framework can therefore synchronize between arbitrary data source types. The plugin system also forms an ecosystem: each newly added data source immediately becomes interoperable with every existing one.
DataX already has a fairly complete plugin ecosystem: mainstream RDBMS databases, NoSQL stores, and big data computing systems are all supported. For the current list of supported data sources, see the DataX data source reference guide.
As an offline data synchronization framework, DataX is built on a Framework + plugin architecture: reading from and writing to data sources are abstracted as Reader/Writer plugins that plug into the synchronization framework.
For example, suppose a user submits a DataX job configured with 20 concurrent channels, with the goal of synchronizing a MySQL database sharded into 100 tables to ODPS. DataX's scheduling decision process is:
1. The DataX job is split into 100 Tasks, one per sharded table.
2. With 20 channels and the default of 5 channels per TaskGroup, DataX allocates 4 TaskGroups.
3. The 4 TaskGroups evenly divide the 100 Tasks, so each TaskGroup runs 25 Tasks with a concurrency of 5.
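In job JSON terms, the "20 concurrent channels" of this example is simply the channel value under job.setting; a minimal sketch (reader and writer configuration omitted):
"setting": {
    "speed": {
        "channel": 20
    }
}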
As an ETL tool serving big data, DataX offers more than snapshot migration: it also provides rich data transformation features, so data can easily be masked, completed, filtered, and otherwise transformed in transit, and it supports user-defined transformation functions written in Groovy. For details, see the DataX 3.0 transformer documentation.
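For example, transformers are configured inside a content element, alongside the reader and writer. The sketch below uses the built-in dx_replace transformer described in the transformer documentation; the column index and parameters here are illustrative assumptions (mask a 4-character span of column 1 starting at position 3 with ****), so check the transformer doc for the exact semantics:
"transformer": [
    {
        "name": "dx_replace",
        "parameter": {
            "columnIndex": 1,
            "paras": ["3", "4", "****"]
        }
    }
]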
Worried about the load synchronization puts on your online storage? DataX 3.0 provides three flow-control modes, by channel (concurrency), by record stream, and by byte stream, so you can throttle a job to the best speed your database can bear. For example, the following block under job.setting caps a job at 5 channels, a 1 MiB/s byte limit, and a 10,000 records/s record limit:
"speed": {
"channel": 5,
"byte": 1048576,
"record": 10000
}
Every DataX 3.0 reader plugin has one or more split strategies, so a job can be reasonably split into multiple Tasks that execute in parallel, and the single-machine, multi-threaded execution model lets DataX throughput grow roughly linearly with concurrency. When both the source and the target have sufficient performance, a single job can saturate the network card. The DataX team has also heavily optimized and fully performance-tested every supported plugin; for details, see the per-data-source sections of the DataX data source guide.
DataX jobs are easily disrupted by external factors: a network blip or an unstable data source can abort a half-finished synchronization. Stability is therefore a baseline requirement, and the design of DataX 3.0 focuses on hardening both the framework and the plugins. DataX 3.0 currently supports retries at the thread level, the process level (not yet exposed), and the job level, combining local and global retries to keep user jobs running reliably.
A JDK is required; install at least the minimum supported version (JDK 1.8 is recommended) and check it:
java -version
DataX is launched through a Python script (datax.py), so a Python runtime is also required; check it:
python -V
If the prebuilt package does not provide what you need, you can download the source and compile it yourself, which requires Maven (the build steps follow below); check it:
mvn -v
## Download the DataX source
git clone git@github.com:alibaba/DataX.git
## Package with Maven
cd {DataX_source_code_home}
mvn -U clean package assembly:assembly -Dmaven.test.skip=true
## On success, the build log ends like this:
[INFO] BUILD SUCCESS
[INFO] -----------------------------------------------------------------
[INFO] Total time: 08:12 min
[INFO] Finished at: 2015-12-13T16:26:48+08:00
[INFO] Final Memory: 133M/960M
[INFO] -----------------------------------------------------------------
## The packaged DataX is located at {DataX_source_code_home}/target/datax/datax/ , with the following structure:
cd {DataX_source_code_home}
ls ./target/datax/datax/
bin conf job lib log log_perf plugin script tmp
Alternatively, install from the prebuilt package. Download address: https://github.com/alibaba/DataX
cd /usr/local
tar -zxvf datax.tar.gz -C /usr/local
Verify the installation by running a job. Syntax:
cd {YOUR_DATAX_HOME}/bin
python datax.py {YOUR_JOB.json}
Self-check job:
python {YOUR_DATAX_HOME}/bin/datax.py {YOUR_DATAX_HOME}/job/job.json
python datax.py ../job/job.json
[root@server2 bin]# python datax.py ../job/job.json
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
2023-04-03 02:20:52.850 [main] INFO VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2023-04-03 02:20:52.858 [main] INFO Engine - the machine info =>
osInfo: Oracle Corporation 1.8 25.144-b01
jvmInfo: Linux amd64 2.6.32-754.35.1.el6.x86_64
cpu num: 8
totalPhysicalMemory: -0.00G
freePhysicalMemory: -0.00G
maxFileDescriptorCount: -1
currentOpenFileDescriptorCount: -1
GC Names [PS MarkSweep, PS Scavenge]
MEMORY_NAME | allocation_size | init_size
PS Eden Space | 256.00MB | 256.00MB
Code Cache | 240.00MB | 2.44MB
Compressed Class Space | 1,024.00MB | 0.00MB
PS Survivor Space | 42.50MB | 42.50MB
PS Old Gen | 683.00MB | 683.00MB
Metaspace | -0.00MB | 0.00MB
2023-04-03 02:20:52.877 [main] INFO Engine -
{
"content":[
{
"reader":{
"name":"streamreader",
"parameter":{
"column":[
{
"type":"string",
"value":"DataX"
},
{
"type":"long",
"value":19800804
},
{
"type":"date",
"value":"1980-08-04 00:00:00"
},
{
"type":"bool",
"value":true
},
{
"type":"bytes",
"value":"test"
}
],
"sliceRecordCount":100000
}
},
"writer":{
"name":"streamwriter",
"parameter":{
"encoding":"UTF-8",
"print":false
}
}
}
],
"setting":{
"errorLimit":{
"percentage":0.02,
"record":0
},
"speed":{
"byte":10485760
}
}
}
2023-04-03 02:20:52.896 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null
2023-04-03 02:20:52.898 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2023-04-03 02:20:52.898 [main] INFO JobContainer - DataX jobContainer starts job.
2023-04-03 02:20:52.900 [main] INFO JobContainer - Set jobId = 0
2023-04-03 02:20:52.916 [job-0] INFO JobContainer - jobContainer starts to do prepare ...
2023-04-03 02:20:52.917 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] do prepare work .
2023-04-03 02:20:52.917 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do prepare work .
2023-04-03 02:20:52.917 [job-0] INFO JobContainer - jobContainer starts to do split ...
2023-04-03 02:20:52.918 [job-0] INFO JobContainer - Job set Max-Byte-Speed to 10485760 bytes.
2023-04-03 02:20:52.919 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] splits to [1] tasks.
2023-04-03 02:20:52.920 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] splits to [1] tasks.
2023-04-03 02:20:52.939 [job-0] INFO JobContainer - jobContainer starts to do schedule ...
2023-04-03 02:20:52.944 [job-0] INFO JobContainer - Scheduler starts [1] taskGroups.
2023-04-03 02:20:52.946 [job-0] INFO JobContainer - Running by standalone Mode.
2023-04-03 02:20:52.954 [taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks.
2023-04-03 02:20:52.957 [taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated.
2023-04-03 02:20:52.958 [taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated.
2023-04-03 02:20:52.968 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2023-04-03 02:20:53.269 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[302]ms
2023-04-03 02:20:53.269 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] completed it's tasks.
2023-04-03 02:21:02.962 [job-0] INFO StandAloneJobContainerCommunicator - Total 100000 records, 2600000 bytes | Speed 253.91KB/s, 10000 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.018s | All Task WaitReaderTime 0.035s | Percentage 100.00%
2023-04-03 02:21:02.962 [job-0] INFO AbstractScheduler - Scheduler accomplished all tasks.
2023-04-03 02:21:02.963 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do post work.
2023-04-03 02:21:02.963 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] do post work.
2023-04-03 02:21:02.963 [job-0] INFO JobContainer - DataX jobId [0] completed successfully.
2023-04-03 02:21:02.964 [job-0] INFO HookInvoker - No hook invoked, because base dir not exists or is a file: /usr/local/datax/hook
2023-04-03 02:21:02.966 [job-0] INFO JobContainer -
[total cpu info] =>
averageCpu | maxDeltaCpu | minDeltaCpu
-1.00% | -1.00% | -1.00%
[total gc info] =>
NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
PS MarkSweep | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
PS Scavenge | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
2023-04-03 02:21:02.966 [job-0] INFO JobContainer - PerfTrace not enable!
2023-04-03 02:21:02.966 [job-0] INFO StandAloneJobContainerCommunicator - Total 100000 records, 2600000 bytes | Speed 253.91KB/s, 10000 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.018s | All Task WaitReaderTime 0.035s | Percentage 100.00%
2023-04-03 02:21:02.967 [job-0] INFO JobContainer -
Job start time            : 2023-04-03 02:20:52
Job end time              : 2023-04-03 02:21:02
Total elapsed time        : 10s
Average throughput        : 253.91KB/s
Record write speed        : 10000rec/s
Total records read        : 100000
Total read/write failures : 0
If the following exception appears on CentOS 7, it is caused by the hidden ._* files (macOS AppleDouble metadata bundled in the tarball) under the plugin directories; remove them with:
rm -rf /usr/local/datax/plugin/*/._*
[root@bd-node-05 bin]# python datax.py ../job/job.json
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
2023-04-03 10:30:49.298 [main] WARN ConfigParser - Plugins [streamreader,streamwriter] failed to load, retrying in 1s... Exception:Code:[Common-00], Describe:[There are errors in the configuration file you provided, please check your job configuration.] - Configuration error: the configuration file [/usr/local/datax/plugin/reader/._drdsreader/plugin.json] does not exist. Please check your configuration file.
2023-04-03 10:30:50.304 [main] ERROR Engine -
DataX's analysis suggests the most likely cause of this job's failure is:
com.alibaba.datax.common.exception.DataXException: Code:[Common-00], Describe:[There are errors in the configuration file you provided, please check your job configuration.] - Configuration error: the configuration file [/usr/local/datax/plugin/reader/._drdsreader/plugin.json] does not exist. Please check your configuration file.
at com.alibaba.datax.common.exception.DataXException.asDataXException(DataXException.java:26)
at com.alibaba.datax.common.util.Configuration.from(Configuration.java:95)
at com.alibaba.datax.core.util.ConfigParser.parseOnePluginConfig(ConfigParser.java:153)
at com.alibaba.datax.core.util.ConfigParser.parsePluginConfig(ConfigParser.java:125)
at com.alibaba.datax.core.util.ConfigParser.parse(ConfigParser.java:63)
at com.alibaba.datax.core.Engine.entry(Engine.java:137)
at com.alibaba.datax.core.Engine.main(Engine.java:204)
The configuration template is shown below. The outermost JSON element is a job containing two parts, setting and content: setting configures the job as a whole, while content is used to configure the data source and destination.
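Schematically (placeholders only, not a runnable job):
{
    "job": {
        "setting": { ... },
        "content": [
            {
                "reader": { ... },
                "writer": { ... }
            }
        ]
    }
}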
The specific parameters for each Reader and Writer are covered in the official documentation:
https://github.com/alibaba/DataX/blob/master/README.md
You can also print a configuration template with: python datax.py -r {YOUR_READER} -w {YOUR_WRITER}
cd {YOUR_DATAX_HOME}/bin
# 1. Print the template:
python datax.py -r streamreader -w streamwriter
[root@bd-node-05 bin]# python datax.py -r streamreader -w streamwriter
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
Please refer to the streamreader document:
https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md
Please refer to the streamwriter document:
https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md
Please save the following configuration as a json file and use
python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.
{
"job": {
"content": [
{
"reader": {
"name": "streamreader",
"parameter": {
"column": [],
"sliceRecordCount": ""
}
},
"writer": {
"name": "streamwriter",
"parameter": {
"encoding": "",
"print": true
}
}
}
],
"setting": {
"speed": {
"channel": ""
}
}
}
}
# 2. Create a stream2stream.json file with the following content:
{
"job": {
"content": [
{
"reader": {
"name": "streamreader",
"parameter": {
"sliceRecordCount": 10,
"column": [
{
"type": "long",
"value": "10"
},
{
"type": "string",
"value": "hello,你好,世界-DataX"
}
]
}
},
"writer": {
"name": "streamwriter",
"parameter": {
"encoding": "UTF-8",
"print": true
}
}
}
],
"setting": {
"speed": {
"channel": 5
}
}
}
}
# 3. Run the synchronization job:
python datax.py ../job/stream2stream.json
# 4. The run ends with:
2023-04-03 15:20:23.919 [job-0] INFO JobContainer -
Job start time            : 2023-04-03 15:20:13
Job end time              : 2023-04-03 15:20:23
Total elapsed time        : 10s
Average throughput        : 95B/s
Record write speed        : 5rec/s
Total records read        : 50
Total read/write failures : 0
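Note how the totals follow from the configuration: streamreader is split into one task per channel, and each task emits sliceRecordCount records, so 5 channels × 10 records each = 50 records read in total.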
This completes the deployment and verification of DataX 3.0 on CentOS 7.