oracle goldengate adapter for bigdata 12.2.0.1.0 to hadoop 2.6 hdfs
As a first step, we need to prepare our source database for replication. It becomes easier with every new GoldenGate version. We will need to perform several steps:a) Enable archive logging on our database. This particular step requires downtime.
orcl> alter database mount;
Database altered.
orcl> alter database archivelog;
Database altered.
orcl> alter database open;
b) Enable force logging and minimal supplemental logging. No need to shutdown database for this.
orcl> alter database add supplemental log data;
Database altered.
orcl> alter database force logging;
Database altered.
orcl> SELECT supplemental_log_data_min, force_logging FROM v$database;
SUPPLEME FORCE_LOGGING
-------- ---------------------------------------
YESYES
c) Switch parameter “enable_goldengate_replication” to “TRUE”. Can be done online.
orcl> alter system set enable_goldengate_replication=true sid='*' scope=both;
System altered.
orcl>
And we are almost done. Now we can create a schema for a GoldenGate administrator, and provide required privileges.
I’ve just granted DBA role to the user to simplify process. In any case you will need it in case of integrated capture.
For a production installation I advise you to have a look at the documentation to verify necessary privileges and roles.
orcl> create user ogg identified by welcome1 default tablespace users temporary tablespace temp;
orcl> grant connect, dba to ogg;
Let’s create a test schema to be replicated. We will call it schema on the source as ggtest and I will name the destination schema as bdtest.
It will allow us also to check how the mapping works in our replication.
orcl> create tablespace ggtest; -- optional step
orcl> create user ggtest identified by welcome1 default tablespace ggtest temporary tablespace temp;
orcl> grant connect, resource to ggtest;
Everything is ready on our source database for the replication.
Now we are installing Oracle GoledenGate for Oracle to our database server. We can get the software from the Oracle site on
the download page in the Middleware section, GoldenGate, Oracle GoldenGate for Oracle databases.
We are going to use 12.2.0.1.1 version of the software. The installation is easy –
you need to unzip the software and run Installer which will guide you through couple of simple steps.
The installer will unpack the software to the destination location, create subdirectories, and register GoldenGate in the Oracle global registry.
$ unzip fbo_ggs_Linux_x64_shiphome.zip
$ cd fbo_ggs_Linux_x64_shiphome/Disk1/
$ ./runInstaller
We continue by setting up parameters for Oracle GoldenGate (OGG) manager and starting it up.
You can see that I’ve used a default blowfish encryption for the password.
In a production environment you may consider another encryption like AES256.
I’ve also used a non-default port for the manager since I have more than one GoldenGate installation on my test sandbox.
$ export GGHOME=/u01/ogg
$ cd $GGHOME
$ ./ggsci
GGSCI (sandbox.localdomain) 2> edit params mgr
PORT 7809
GGSCI (sandbox.localdomain) 3> start manager
Manager started.
Let’s prepare everything for initial load, and later online replication.
I’ve decided to use GoldenGate initial load extract as the way for initial load for the sake of consistency for the resulted dataset on Hadoop.
I prepared the parameter file to replicate my ggtest schema and upload all data to the trail file on remote site.
I’ve used a minimum number of options for all my processes, providing only handful of parameters required for replication.
Extract options is a subject deserving a dedicated blog post. Here is my simple initial extract:
$ cat /u01/ogg/dirprm/ini_ext.prm
SOURCEISTABLE
userid ogg, password ogg
--RMTHOSTOPTIONS
RMTHOST Master, MGRPORT 7809
RMTFILE /ogg/dirdat/initld, MEGABYTES 2, PURGE
--DDL include objname ggtest.*
TABLE HR.*;
Then we run the initial load extract in passive node and it will create a trail file with the data. The trail file will be used later for our initial load on the target side.
$ ./extract paramfile dirprm/ini_ext.prm reportfile dirrpt/ini_ext.rpt
$ ll /u01/oggbd/dirdat/initld*
-rw-r-----. 1 oracle oinstall 3028 Feb 16 14:17 /u01/oggbd/dirdat/initld
$
We can also prepare our extract on the source site as well. I haven’t used datapump in my configuration
limiting the topology only by simplest and strait-forward extract to replicat configuration. Of course,
in any production configuration I would advise using datapump on source for staging our data.
Here are my extract parameters, and how I added it. I am not starting it yet because I must have an Oracle
GoldenGate Manager running on the target, and the directory for the trail file should be created.
You may have guessed that the Big Data GoldenGate will be located in /u01/oggbd directory.
$ ggsci
Oracle GoldenGate Command Interpreter for Oracle
Version 12.2.0.1.1 OGGCORE_12.2.0.1.0_PLATFORMS_151211.1401_FBO
Linux, x64, 64bit (optimized), Oracle 12c on Dec 12 2015 02:56:48
Operating system character set identified as UTF-8.
Copyright (C) 1995, 2015, Oracle and/or its affiliates. All rights reserved.
GGSCI (sandbox.localdomain) 1> edit params ggext
extract ggext
userid ogg, password ogg
--RMTHOSTOPTIONS
RMTHOST Master, MGRPORT 7809
RMTFILE /ogg/dirdat/or, MEGABYTES 2, PURGE
DDL include objname HR.*
TABLE HR.*;
GGSCI (node1) 9> dblogin userid ogg,password ogg
Successfully logged into database.
GGSCI (node1 as ogg@orcl) 10> register extract GGEXT database
2016-09-05 10:11:25INFO OGG-02003Extract GGEXT successfully registered with database at SCN 1043211.
GGSCI (node1 as ogg@orcl) 11> add extract ggext, integrated tranlog, begin now
EXTRACT (Integrated) added.
Let’s leave our source site for a while and switch to the target . Our target is going to be a box where we have hadoop client and all requirement java classes.
I used the same box just to save resources on my sandbox environment. You may run different GoldeGate versions on the same box considering, that Manager ports for each of them will be different.
Essentially we need a Hadoop client on the box, which can connect to HDFS and write data there. Installation of Hadoop client is out of the scope for this article, but you can easily get all necessary information from the Hadoop home page .
Having all required Hadoop classes we continue by installing Oracle GoldenGate for Big Data, configuring and starting it up. In the past I received several questions from people struggling to find the exact place where all adapters could be uploaded. The Adapters were well “hidden” on “Oracle edelivery”, but now it is way simpler. You are going to GoldenGate download page on Oracle site and find the section “Oracle GoldenGate for Big Data 12.2.0.1.0” where you can choose the OGG for Linux x86_64, Windows or Solaris. You will need an Oracle account to get it. We upload the file to our linux box, unzip and unpack the tar archive. I created a directory /u01/oggbd as our GoldenGate home and unpacked the tar archive there.
The next step is to create all necessary directories. We start command line utility and create all subdirectories.
目标端:
$ cd /ogg/
$ ./ggsci
Oracle GoldenGate Command Interpreter
Version 12.2.0.1.0 OGGCORE_12.2.0.1.0_PLATFORMS_151101.1925.2
Linux, x64, 64bit (optimized), Generic on Nov 10 2015 16:18:12
Operating system character set identified as UTF-8.
Copyright (C) 1995, 2015, Oracle and/or its affiliates. All rights reserved.
GGSCI (sandbox.localdomain) 1> create subdirs
Creating subdirectories under current directory /u01/oggbd
Parameter files /u01/oggbd/dirprm: created
Report files /u01/oggbd/dirrpt: created
Checkpoint files /u01/oggbd/dirchk: created
Process status files /u01/oggbd/dirpcs: created
SQL script files /u01/oggbd/dirsql: created
Database definitions files /u01/oggbd/dirdef: created
Extract data files /u01/oggbd/dirdat: created
Temporary files /u01/oggbd/dirtmp: created
Credential store files /u01/oggbd/dircrd: created
Masterkey wallet files /u01/oggbd/dirwlt: created
Dump files /u01/oggbd/dirdmp: created
GGSCI (sandbox.localdomain) 2>
We are changing port for our manager process from default and starting it up. I’ve already mentioned that the port was changed due to the existence off several GoldenGate managers running from different directories.
GGSCI (sandbox.localdomain) 2> edit params mgr
PORT 7809
.....
GGSCI (sandbox.localdomain) 3> start manager
Manager started.
Now we have to prepare parameter files for our replicat processes. Let’s assume the environment variable OGGHOME represents the GoldenGate home and in our case it is going to be /u01/oggbd.
Examples for the parameter files can be taken from $OGGHOME/AdapterExamples/big-data directories. There you will find examples for flume, kafka, hdfs, hbase and for metadata providers. Today we are going to work with HDFS adapter.
I copied files to my parameter files directory ($OGGHOME/dirprm) and modified them accordingly:
oracle@sandbox oggbd]$ cp /ogg/AdapterExamples/big-data/hdfs/* /u01/oggbd/dirprm/
oracle@sandbox oggbd]$ vi /u01/oggbd/dirprm/hdfs.props
Here are my values for the hdfs.props file:
$ cat hdfs.props
$ more hdfs.props
gg.handlerlist=hdfs
gg.handler.hdfs.type=hdfs
gg.handler.hdfs.includeTokens=false
gg.handler.hdfs.maxFileSize=1g
gg.handler.hdfs.rootFilePath=/user/oracle/gg
gg.handler.hdfs.fileRollInterval=0
gg.handler.hdfs.inactivityRollInterval=0
gg.handler.hdfs.fileSuffix=.txt
gg.handler.hdfs.partitionByTable=true
gg.handler.hdfs.rollOnMetadataChange=true
gg.handler.hdfs.authType=none
gg.handler.hdfs.format=delimitedtext
gg.handler.hdfs.format.includeColumnNames=true
gg.handler.hdfs.mode=tx
goldengate.userexit.timestamp=utc
goldengate.userexit.writers=javawriter
javawriter.stats.display=TRUE
javawriter.stats.full=TRUE
gg.log=log4j
gg.log.level=INFO
gg.report.time=30sec
#gg.classpath=/var/lib/hadoop/share/hadoop/common/*:/var/lib/hadoop/share/hadoop/common/lib/*:/var/lib/hadoop/share/hadoop/hdfs/*:/var/lib/hadoop/etc/hadoop/:
gg.classpath=/home/hadoop/hadoop/share/hadoop/common/*:/home/hadoop/hadoop/share/hadoop/common/lib/*:/home/hadoop/hadoop/share/hadoop/hdfs/*:/home/hadoop/hadoop/etc/hadoop/:
javawriter.bootoptions=-Xmx512m -Xms32m -Djava.class.path=ggjava/ggjava.jar
You can find information about all those parameters in oracle documentation here, but there are parameters you will most likely need to change from default:
gg.handler.hdfs.rootFilePath – it will tell where the directories and files have to be written on HDFS.
gg.handler.hdfs.format – you can setup one of the four formats supported by adapter.
goldengate.userexit.timestamp – it will depend from your preferences for transactions timestamps written to your hdfs files.
gg.classpath – it will depend from location for your hadoop jar classes and native libraries.
You can see I’ve mentioned the gg.handler.hdfs.format.includeColumnNames parameter. It is supposed to put column name before each value in the output file on HDFS. It may be helpful in some cases, but at the same time it makes the file bigger. If you are planning to create an external Hive table, you may consider commenting on it as I have.
The next parameter file is for our data initialization replicat file. You may consider using a Sqoop or another way to make the initial load for your tables, but I think it makes sense to use the GG replicat if the table size is relatively small. It makes the resulting file-set more consistent with the following replication, since it will be using the same engine and format. So, here is my replicat for initial load:
GGSCI (Master) 1> view params irhdfs
--passive REPLICAT for initial load irhdfs
-- Trail file for this example is located in "dirdat/initld"
-- Command to run REPLICAT:
-- ./replicat paramfile dirprm/irhdfs.prm reportfile dirrpt/ini_rhdfs.rpt
SPECIALRUN
END RUNTIME
EXTFILE /ogg/dirdat/initld
--DDLERROR default discard
setenv HADOOP_COMMON_LIB_NATIVE_DIR=/home/hadoop/hadoop/lib/native
DDL include all
TARGETDB LIBFILE libggjava.so SET property=dirprm/hdfs.props
REPORTCOUNT EVERY 1 MINUTES, RATE
GROUPTRANSOPS 10000
MAP HR.*, TARGET HR.*;
# su - hadoop
$ start-dfs.sh
16/09/05 22:51:46 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on
localhost: starting namenode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-namenode-Master.out
localhost: starting datanode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-datanode-Master.out
Starting secondary namenodes
0.0.0.0: starting secondarynamenode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-secondarynamenode-Master.out
16/09/05 22:52:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop/logs/yarn-hadoop-resourcemanager-Master.out
localhost: starting nodemanager, logging to /home/hadoop/hadoop/logs/yarn-hadoop-nodemanager-Master.out
$ jps
8112 NodeManager
8154 Jps
8011 ResourceManager
7836 SecondaryNameNode
7676 DataNode
7581 NameNode
$ hdfs dfs -mkdir /user/oracle/gg
16/09/05 22:54:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
mkdir: `/user/oracle/gg': No such file or directory
$ hdfs dfs -mkdir /user/
16/09/05 22:54:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
$ hdfs dfs -mkdir /user/oracle/
16/09/05 22:54:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
$ hdfs dfs -mkdir /user/oracle/gg
16/09/05 22:54:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
$ hdfs dfs -ls /user/oracle/gg
16/09/05 22:55:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
$ hdfs dfs -ls /user/oracle
16/09/05 22:55:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2016-09-05 22:54 /user/oracle/gg
$ pwd
/ogg
I was running the initial load in passive mode, without creating a managed process and just running it from command line. Here is an example:
$ ./replicat paramfile dirprm/irhdfs.prm reportfile dirrpt/ini_rhdfs.rpt
$./replicat paramfile dirprm/irhdfs.prm reportfile dirrpt/ini_rhdfs.rpt
$ hdfs dfs -ls /user/oracle/gg
16/09/05 22:56:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 8 items
drwxr-xr-x - hadoop supergroup 0 2016-09-05 22:56 /user/oracle/gg/hr.countries
drwxr-xr-x - hadoop supergroup 0 2016-09-05 22:56 /user/oracle/gg/hr.departments
drwxr-xr-x - hadoop supergroup 0 2016-09-05 22:56 /user/oracle/gg/hr.emp_details_view
drwxr-xr-x - hadoop supergroup 0 2016-09-05 22:56 /user/oracle/gg/hr.employees
drwxr-xr-x - hadoop supergroup 0 2016-09-05 22:56 /user/oracle/gg/hr.job_history
drwxr-xr-x - hadoop supergroup 0 2016-09-05 22:56 /user/oracle/gg/hr.jobs
drwxr-xr-x - hadoop supergroup 0 2016-09-05 22:56 /user/oracle/gg/hr.locations
drwxr-xr-x - hadoop supergroup 0 2016-09-05 22:56 /user/oracle/gg/hr.regions
$ hdfs dfs -ls /user/oracle/gg/hr.jobs
16/09/05 22:57:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r-- 1 hadoop supergroup 3085 2016-09-05 22:56 /user/oracle/gg/hr.jobs/hr.jobs_2016-09-05_22-56-24.338.txt
$ hdfs dfs -tail/user/oracle/gg/hr.jobs
16/09/05 22:57:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
tail: `/user/oracle/gg/hr.jobs': Is a directory
$ hdfs dfs -tail/user/oracle/gg/hr.jobs/hr.jobs_2016-09-05_22-56-24.338.txt
16/09/05 22:58:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
kMIN_SALARY2008MAX_SALARY5000
IHR.JOBS2016-09-05 14:05:35.8378192016-09-05T22:56:24.37301200000000-10000068278JOB_IDSH_CLERKJOB_TITLEShipping ClerkMIN_SALARY2500MAX_SALARY5500
IHR.JOBS2016-09-05 14:05:35.8378192016-09-05T22:56:24.37301300000000-10000068396JOB_IDIT_PROGJOB_TITLEProgrammerMIN_SALARY4000MAX_SALARY10000
IHR.JOBS2016-09-05 14:05:35.8378192016-09-05T22:56:24.37301400000000-10000068509JOB_IDMK_MANJOB_TITLEMarketing ManagerMIN_SALARY9000MAX_SALARY15000
IHR.JOBS2016-09-05 14:05:35.8378192016-09-05T22:56:24.37301500000000-10000068628JOB_IDMK_REPJOB_TITLEMarketing RepresentativeMIN_SALARY4000MAX_SALARY9000
IHR.JOBS2016-09-05 14:05:35.8378192016-09-05T22:56:24.37301600000000-10000068754JOB_IDHR_REPJOB_TITLEHuman Resources RepresentativeMIN_SALARY4000MAX_SALARY9000
IHR.JOBS2016-09-05 14:05:35.8378192016-09-05T22:56:24.37400000000000-10000068886JOB_IDPR_REPJOB_TITLEPublic Relations RepresentativeMIN_SALARY4500MAX_SALARY10500
$
启动源端抽取进程:
GGSCI (node1) 1> info all
Program Status Group Lag at ChkptTime Since Chkpt
MANAGER RUNNING
EXTRACT STOPPED EOLTP01 00:00:00 228:21:34
EXTRACT STOPPED EPUMP01 00:00:00 228:21:34
EXTRACT STOPPED GGEXT 00:00:09 00:49:30
GGSCI (node1) 2> view params ggext
extract ggext
userid ogg, password ogg
--RMTHOSTOPTIONS
RMTHOST Master, MGRPORT 7809
RMTFILE /ogg/dirdat/or, MEGABYTES 2, PURGE
DDL include objname HR.*
TABLE HR.*;
GGSCI (sandbox.localdomain as ogg@orcl) 6>start extract ggext
GGSCI (node1) 5> info all
Program Status Group Lag at ChkptTime Since Chkpt
MANAGER RUNNING
EXTRACT STOPPED EOLTP01 00:00:00 228:54:53
EXTRACT STOPPED EPUMP01 00:00:00 228:54:52
EXTRACT RUNNING GGEXT 00:00:09 01:22:49 ######
Sending START request to MANAGER ...
EXTRACT GGEXT starting
So, moving back to target and preparing our replicat. I used the replicat with the following parameters:
目标端添加复制进程:
GGSCI (Master) 5> view params rhdfs
REPLICAT rhdfs
-- Trail file for this example is located in "dirdat/or" directory
-- Command to add REPLICAT
-- add replicat rhdfs, exttrail dirdat/or
--DDLERROR default discard
setenv HADOOP_COMMON_LIB_NATIVE_DIR=/home/hadoop/hadoop/lib/native
DDL include all
TARGETDB LIBFILE libggjava.so SET property=dirprm/hdfs.props
REPORTCOUNT EVERY 1 MINUTES, RATE
GROUPTRANSOPS 10000
MAP HR.*, TARGET HR.*;
GGSCI (Master) 6> add replicat rhdfs, exttrail dirdat/or
REPLICAT added.
$ more dirprm/rhdfs.props
dirprm/rhdfs.props: No such file or directory
$ more dirprm/hdfs.props
gg.handlerlist=hdfs
gg.handler.hdfs.type=hdfs
gg.handler.hdfs.includeTokens=false
gg.handler.hdfs.maxFileSize=1g
gg.handler.hdfs.rootFilePath=/user/oracle/gg
gg.handler.hdfs.fileRollInterval=0
gg.handler.hdfs.inactivityRollInterval=0
gg.handler.hdfs.fileSuffix=.txt
gg.handler.hdfs.partitionByTable=true
gg.handler.hdfs.rollOnMetadataChange=true
gg.handler.hdfs.authType=none
gg.handler.hdfs.format=delimitedtext
gg.handler.hdfs.format.includeColumnNames=true
gg.handler.hdfs.mode=tx
goldengate.userexit.timestamp=utc
goldengate.userexit.writers=javawriter
javawriter.stats.display=TRUE
javawriter.stats.full=TRUE
gg.log=log4j
gg.log.level=INFO
gg.report.time=30sec
#gg.classpath=/var/lib/hadoop/share/hadoop/common/*:/var/lib/hadoop/share/hadoop/common/lib/*:/var/lib/hadoop/share/hadoop/hdfs/*:/var/lib/hadoop/etc/hadoop/:
gg.classpath=/usr/java/jdk1.8.0_101/jre/lib/amd64/libjsig.so:/usr/java/jdk1.8.0_101/jre/lib/amd64/server/libjvm.so:/usr/java/jdk1.8.0_101/jre/lib/amd64/server:/usr/java/jdk1.8.0_101/jre/lib/amd64:/home/hadoop/h
adoop/share/hadoop/common/*:/home/hadoop/hadoop/share/hadoop/common/lib/*:/home/hadoop/hadoop/share/hadoop/hdfs/*:/home/hadoop/hadoop/etc/hadoop/:
javawriter.bootoptions=-Xmx512m -Xms32m -Djava.class.path=ggjava/ggjava.jar
$ echo $LD_LIBRARY_PATH
/usr/java/jdk1.8.0_101/jre/lib/amd64/libjsig.so:/usr/java/jdk1.8.0_101/jre/lib/amd64/server/libjvm.so:/usr/java/jdk1.8.0_101/jre/lib/amd64/server:/usr/java/jdk1.8.0_101/jre/lib/amd64:
$ ./ggsci
Oracle GoldenGate Command Interpreter
Version 12.2.0.1.0 OGGCORE_12.2.0.1.0_PLATFORMS_151101.1925.2
Linux, x64, 64bit (optimized), Generic on Nov 10 2015 16:18:12
Operating system character set identified as UTF-8.
Copyright (C) 1995, 2015, Oracle and/or its affiliates. All rights reserved.
GGSCI (Master) 1> info all
Program Status Group Lag at ChkptTime Since Chkpt
MANAGER STOPPED
REPLICAT STOPPED RHDFS 00:00:00 00:08:22
GGSCI (Master) 2> start mgr
Manager started.
GGSCI (Master) 3> info all
Program Status Group Lag at ChkptTime Since Chkpt
MANAGER RUNNING
REPLICAT STOPPED RHDFS 00:00:00 00:08:34
GGSCI (Master) 4> start rhdfs
Sending START request to MANAGER ...
REPLICAT RHDFS starting
GGSCI (Master) 5> info all
Program Status Group Lag at ChkptTime Since Chkpt
MANAGER RUNNING
REPLICAT STARTING RHDFS 00:00:00 00:08:38
未整理,问题处理部分暂时未贴
页:
[1]