赞
踩
CHECKPOINT TUNING AND ERROR HANDLING
检查点优化和错误处理
1.What is a Checkpoint?
什么是检查点?
A Checkpoint is a database event which synchronizes the modified
data blocks in memory with the datafiles on
disk. 一个检查点是使在内存中被修改的数据和磁盘上的数据文件进行同步的一个数据库事件。
It offers Oracle the means for ensuring the consistency of data
modified by transactions. 它提供ORACLE作为确报被事务修改的数据的一致性的用途。
The mechanism of writing modified blocks on disk in Oracle is
not synchronized with the commit of the corresponding
transactions.
原理就是在ORACLE中写被修改的块到磁盘中不是和对应事务提交同步的。
A checkpoint has two purposes: (1) to establish data
consistency, and (2) enable faster database
recovery. 一个检查点有两个目的:(1)确保数据一致性,和(2)更快的数据库恢复。
How is recovery faster?
怎么使恢复更快?
Because all database changes up to the checkpoint have been
recorded in the datafiles, making it unnecessary to apply redo log
entries prior to the checkpoint.
原因是所有的数据库变化到检查点已经被记录在数据文件中,使它没有必要再应用检查点之前的日志。
The checkpoint must ensure that all the modified buffers in the
cache are really written to the corresponding datafiles to avoid
the loss of data which may occur with a crash (instance or disk
failure).
检查点必须保证内存中所有被修改的缓存是真正的被写到了对应的数据文件中从而避免因崩溃(实例或者磁盘故障)导致数据丢失。
Oracle writes the dirty buffers to disk only on certain
conditions:
Oracle 只有在某些条件下才把脏缓存写到磁盘:
- A shadow process must scan more than
one-quarter of the db_block_buffer parameter.
后台进程扫描超过1/4
db_block_buffer参数(有点变扭,自己理解为当寻找数据块在链表上超过1/4)
注:database buffer
cache的大小8i以前是db_block_buffers*db_block_size的值,从9i开始就只需要设置db_cache_size
- Every three seconds.
每三秒钟
- When a checkpoint is produced.
当检查点产生
A checkpoint is realized o的n five types of events:
一个检查点在5种事件情况下产生:
- At each switch of the redo log files.
日志文件的每次切换
- When the delay for LOG_CHECKPOINT_TIMEOUT is
reached.
当LOG_CHECKPOINT_TIMEOUT 参数达到
- When the size in bytes corresponding to
:
(LOG_CHECKPOINT_INTERVAL* size of IO OS blocks)
is written on the current redo log file.
当(LOG_CHECKPOINT_INTERVAL * size of OS blocks)字节被写到当前日志文件
- Directly by the ALTER SYSTEM SWITCH LOGFILE
command.
直接使用ALTER
SYSTEM SWITCH LOGFILE 命令
- Directly with the ALTER SYSTEM CHECKPOINT
command.
直接使用ALTER
SYSTEM CHECKPOINT 命令
During a checkpoint the following occurs:
当一个检查点发生:
- The database writer (DBWR)
writes all modified database blocks in the buffer cache back to
datafiles,
DBWR进程把内存中所有修改过的数据块写到数据文件中
- Log writer (LGWR) updates
both the controlfile and the datafiles to indicate when the last
checkpoint occurred (SCN)
LGWR更新控制文件和数据文件的SCN以表明上次检查点发生
2. Checkpoints and Performance
检查点和性能
Checkpoints present a tuning dilemma for the Database
Administrator. 检查点使DBA处于优化进退两难的地步。
Frequent checkpoints will enable faster recovery, but can cause
performance degradation.
频繁的检查点可以使恢复更快,但会导致性能下降。
How then should the DBA address this?
然后DBA怎么处理处理这个?
Depending on the number of datafiles in a database, a checkpoint
can be a highly resource intensive operation,
since all datafile headers are frozen during the
checkpoint. 依赖于数据库中的数据文件的数量,一个检查点可以被认为是高度资源集中的操作,因为在检查点发生时所有数据文件头部都被冻结。
There is a performance trade-off regarding frequency of
checkpoints. 关于频繁的检查点是一个性能权限。
More frequent checkpoints enable faster database recovery after
a crash. 更频繁的检查点使数据库在崩溃后很快恢复。
This is why some customer sites which have a very low tolerance
for unscheduled system downtime will often choose this
option.
这也是一些对于事先未安排的系统宕机有很差容度的服务站点会选择这种操作的原因。
However, the performance degradation of frequent checkpoints may
not justify this philosophy in many cases.
但是在很多情况下,频繁的检查点导致性能的下降
Let''s assume the database is up and running 95% of the time,
and unavailable 5% of the time from infrequent instance
crashes
or hardware failures requiring database
recovery. 让我们来假设数据库95%的时间在运行,和5%由于极少情况下的实例CRASH或者需要数据库恢复的硬件错误而不可用
For most customer sites, it makes more sense to tune for the 95%
case rather than the rare 5% downtime.
对于大多数服务站点,调正95%的情况比5%的停机时间更有意义。
This bulletin assumes that performance is your number one
priority and so recommendations are made accordingly.
无法翻译
Therefore, your goal is to minimize the frequency of checkpoints
through tuning.
因此,你的目标就是通过优化减少检查点的频率。
Tuning checkpoints involves four key initialization
parameters:
优化检查点包括四个核心初始化参数:
- FAST_START_MTTR_TARGET
- LOG_CHECKPOINT_INTERVAL
- LOG_CHECKPOINT_TIMEOUT
- LOG_CHECKPOINTS_TO_ALERT
These parameters are discussed in detail below.
这些参数在下面将详细讨论.
Recommendations are also given for handling "checkpoint not
complete" messages found in the alert log,
which indicate a need to tune redo logs and checkpoints.
处理 在日志文件中表明需要调整日志和检查点的 "checkpoint not complete" 信息的方法也会被建议。
3. Parameters related to incremental checkpointing
和增量检查点相关的参数
Note: Log file switches will always override checkpoints caused by
following parameters.
注意:日志文件切换经常会覆盖有下面参数引起的检查点
3.1 FAST_START_MTTR_TARGET
Since Oracle 9i FAST_START_MTTR_TARGET parameter is the preferred
method of tuning incremental checkpoint target.
从9I以来,参数FAST_START_MTTR_TARGET 是作为优化增加检查点的首选方法。
FAST_START_MTTR_TARGET enables you to specify the number of
seconds the database takes to perform crash recovery of a single
instance.
FAST_START_MTTR_TARGET 使你指定单实例数据库CRASH恢复的秒数。
Based on internal statistics, incremental checkpoint
automatically adjusts the checkpoint target to meet the requirement
of FAST_START_MTTR_TARGET.
根据内部统计,增量检查点自动调整检查点目标从而满足FAST_START_MTTR_TARGET的需要。
V$INSTANCE_RECOVERY.ESTIMATED_MTTR shows the current estimated
mean time to recover (MTTR) in seconds.
V$INSTANCE_RECOVERY.ESTIMATED_MTTR 表明当前MTTR(平均恢复时间)的估算秒数。
This value is shown even if FAST_START_MTTR_TARGET is not
specified.
及时FAST_START_MTTR_TARGET没有被指定,这个值也会被显示。
V$INSTANCE_RECOVERY.TARGET_MTTR shows the effective MTTR target
in seconds enforced by the system.
V$INSTANCE_RECOVERY.TARGET_MTTR 表示被系统执行有效的MTTR目标(秒)
V$MTTR_TARGET_ADVICE shows the number of I/Os resulted by the
current workload under the current MTTR setting
and the estimated number of I/Os that would be resulted by the
current workload under other MTTR settings.
V$MTTR_TARGET_ADVICE
表示在当前MTTR设置下的当前系统负载所产生的I/O数量和在其他MTTR设置下当前负载的IO估算量。
This view helps the user to assess the trade-off between runtime
performance and setting FAST_START_MTTR_TARGET to achieve better
recovery time.
这个视图帮助用户评估运行时候的性能和设置FAST_START_MTTR_TARGET达到更好恢复时间的平衡。
3.2 LOG_CHECKPOINT_INTERVAL
LOG_CHECKPOINT_INTERVAL parameter specifies the maximum number of
redo blocks the incremental checkpoint target should lag the
current log tail.
理解,但无法用语言来表达^_^
If FAST_START_MTTR_TARGET is specified, LOG_CHECKPOINT_INTERVAL
should not be set or set to 0.
如果FAST_START_MTTR_TARGET被指定了,LOG_CHECKPOINT_INTERVAL不应该设置或者设置为0.
On most Unix systems the operating system block size is 512
bytes.
在去多UNIX操作系统上块大小为512字节。
This means that setting LOG_CHECKPOINT_INTERVAL to a value of
10,000 would mean the incremental checkpoint
target should not lag the current log tail by more than 5,120,000
(5M) bytes.
这意思是设置LOG_CHECKPOINT_INTERVAL为10,000 增量检查点不应该落后于当期日志末尾5M
(注意对当期日志末尾的理解,不是日志文件的末尾,是当前在日志文件LGWR写的位置)
If the size of your redo log is 20M, you are taking 4
checkpoints for each log.
如果日志文件的大小为20M,你将在每个日志文件发生4次检查点。
LOG_CHECKPOINT_INTERVAL influences when a checkpoint occurs,
which means careful attention should be given to the
setting of this parameter,keeping it updated as the size of the
redo log files is changed. 当LOG_CHECKPOINT_INTERVAL影响一个检查点的发生,应该更加小心当日志文件的大小被改变,必须更新这个参数的设置
The checkpoint frequency is one of the factors which impacts the
time required for the database to recover from an unexpected
failure. 检查点的频率是数据库从意外故障恢复所需时间的重要一个因素
Longer intervals between checkpoints mean that if the system
crashes, more time will be needed for the database to
recover. 检查点之间较长的间隔 意味着 如果系统CRASH,将花费更多的时间来恢复。
Shorter checkpoint intervals mean that the database will recover
more quickly,
at the expense of increased resource utilization during the
checkpoint operation.
较短的检查点间隔意味着数据库会更快恢复,以在检查点操作更多资源的代价
This parameter also impacts the time required to complete a
database recovery operation during the roll forward phase of
recovery. 这个参数也反映在完成实例前滚恢复操作所需的时间
The actual recovery time is dependent upon this time, and other
factors, such as the type of failure
(instance or system crash, media failure, etc.), and the number of
archived redo logs which need to be applied.
实际恢复的时间依赖于这个时间,和其他因素,比如故障的类型(实例或者系统CRASH,介质故障,等等),还有有多少归档文件需要被恢复的数目。
3.3 LOG_CHECKPOINT_TIMEOUT
The LOG_CHECKPOINT_TIMEOUT parameter specifies the maximum number
of seconds the incremental checkpoint target should lag the current
log tail.
LOG_CHECKPOINT_TIMEOUT 参数指定最大的时间 增量检查点落后于当前日志末尾
In another word, it specifies how long a dirty buffer in buffer
cache can remain dirty.
换句话说,它指定在缓存中赃块能保证多长时间。
Oracle recommends using LOG_CHECKPOINT_INTERVAL to control the
checkpoint interval rather than LOG_CHECKPOINT_TIMEOUT,
which will initiate a checkpoint every "n" seconds, regardless of
the transaction frequency. ORACLE
建议使用LOG_CHECKPOINT_INTERVAL控制检查点的间隔而不是用会每隔N秒发起一个检查点,而不管事务的频率LOG_CHECKPOINT_TIMEOUT。
This can cause unnecessary checkpoints in cases where
transaction volumes vary.
倘若事务 这可能导致不必要的检查点。
Unnecessary checkpoints must be avoided whenever possible for
optimal performance.
为了最优性能应该避免不必要的检查点。
It is a misconception that setting LOG_CHECKPOINT_TIMEOUT to a
given value will
initiate a log switch at that interval, enabling a recovery window
used for a stand-by database configuration. 这是个错误的想法通过设置LOG_CHECKPOINT_TIMEOUT参数值来发起周期的日志切换,为STANDBY数据库能够有恢复窗口。(翻译的有点烂)
Log switches cause a checkpoint,but a checkpoint does not cause a
log switch. 日志切换导致一个检查点,但检查点不会导致日志切换。
The only way to cause a log switch is manually with ALTER SYSTEM
SWITCH LOGFILE
or resizing the redo logs to cause more frequent
switches. 唯一的使日志切换的办法就是手工执行ALTER SYSTEM SWITCH LOGFILE 或者
改变日志文件的大小从而触发更频繁的切换。
This is controlled by operating system blocks, not a timed
interval.
这是由操作系统块控制而不是时间间隔。
Sizing of the online redo logs is critical for performance and
recovery.
日志文件的大小对于性能和恢复很重要。
See additional sections below on redo logs and
checkpoints.
下面有关于日志文件和检查点额外的部分
3.4 LOG_CHECKPOINTS_TO_ALERT
LOG_CHECKPOINTS_TO_ALERT lets you log your checkpoints to the alert
file.
LOG_CHECKPOINTS_TO_ALERT 让你在日志文件中记录你的检查点。
Doing so is useful for determining whether checkpoints are
occurring at the desired frequency.
这样做对于决定检查点在否在周期的发生很有用。
Prior to Oracle9i this parameter was STATIC.
在9I之前这个参数是静态的。
Oracle generally advises this be set to TRUE as the overhead is
negligible but the information in the alert log may be
useful.
ORALCE 通常建议设置为TRUE 因为代价几乎微不足道,而已这个在日志中的信息很有用
See Note 76713.1 to have more detail on How those instance
parameters can influence the checkpoint.
在Note 76713.1 有关于这些实例参数如何影响检查点有更详细的解释
4. Redo logs and Checkpoint
日志文件和检查点
A checkpoint occurs at every log switch. 在每次日志切换都发生检查点。
If a previous checkpoint is already in progress, the checkpoint
forced by the log switch will override the current
checkpoint.
如果先前的检查点已经在进行中,由日志切换引起的检查点会覆盖当期的检查点。
This necessitates well-sized redo logs to avoid unnecessary
checkpoints as a result of frequent log switches.
这个需求精心设计日志文件大小从而不会由于频繁的日志切换避免不必要的检查点。
The lag between the incremental checkpoint target and the log tail
is also limited by 90% of the smallest online log file size.
增量检查点和日志末尾之间的间隔被最小在线日志文件大小的90%所限制。
This makes sure that in most cases log switch would not need to
wait for checkpoint.
这样保证在大多数情况下日志切换没必要等检查点。
Because of this, log file sizes should be configured large
enough.
因为这个,日志文件的大小应该被设置足够大。
A good rule of thumb is to switch logs at most every twenty
minutes.
最好的经验法则就是至多20分钟切换日志。
Having your log files too small can increase checkpoint activity
and reduce performance.
你的日志文件太小会增加检查点的活跃和降低性能。
Oracle recommends the user to set all online log files to be the
same size, and have at least two log groups per
thread. Oracle 建议用户设置所有在线日志相同的大小和每个线程至少两个日志组。
The alert log is a valuabletool for monitoring the rate that log
switches occur, and subsequently, checkpoints occur.
日志文件作为监控日志切换速率和检查点发生的可视化工具 The following is an example of quick log switches from the alert
log:
下面就是日志文件中快速切换的例子
Fri May 16 17:15:43 1997
Thread 1 advanced to log sequence 1272
Current log# 3 seq# 1272 mem# 0:
/prod1/oradata/logs/redologs03.log
Thread 1 advanced to log sequence 1273
Current log# 1 seq# 1273 mem# 0:
/prod1/oradata/logs/redologs01.log
Fri May 16 17:17:25 1997
Thread 1 advanced to log sequence 1274
Current log# 2 seq# 1274 mem# 0:
/prod1/oradata/logs/redologs02.log
Thread 1 advanced to log sequence 1275
Current log# 3 seq# 1275 mem# 0:
/prod1/oradata/logs/redologs03.log
Fri May 16 17:20:51 1997
Thread 1 advanced to log sequence 1276
Current log# 1 seq# 1276 mem# 0:
/prod1/oradata/logs/redologs01.log
If redo logs switch every 3 minutes, you will see performance
degradation.
如果日志每三分钟切换,你将看到性能下降。
This indicates the redo logs are not sized large enough to
efficiently handle
the transaction load. size of the redolog files.
这表明日志文件设置的不够大来有效率的处理事务压力。改变日志文件的大小。
5. Understanding Checkpoint Error messages (“Cannot allocate new log” and “Checkpoint not complete”)
理解检查点错误信息("Cannot allocate new log"和"Checkpoint not complete")
Sometimes, you can see in your alert.log file, the following
corresponding messages:
有时候,你可以在日志文件里看到类型下面的信息
Thread 1 advanced to log sequence 248
Current log#
2 seq# 248 mem# 0: /prod1/oradata/logs/redologs02.log
Thread 1 cannot allocate new log, sequence
249
Checkpoint not complete
This message indicates that Oracle wants to reuse a redo log
file, but
the current checkpoint position is still in that log.
这个信息表明ORACLE想重用一个日志文件,但当前的检查点还是在那个日志中。
In this case, Oracle must wait until the checkpoint position
passes that log.
这种情况下,Oracle必须等待直到检查点经过那个日志。
Because the incremental checkpoint target never lags the current
log tail by more than 90%
of the smallest log file size,
因为增量检查点从不延迟离当期的日志末尾超过最小日志文件的90%
this situation may be encountered if DBWR writes too slowly, or
if a log switch happens before the log is completely full,
or if log file sizes are too small.
这种情况可能在DBWR写比较慢的时候,或者日志在未完全满而发生切换 或者 日志文件太小
When the database waits on checkpoints,redo generation is
stopped until the log switch is done.
当数据库在检查点上等待,日志为停止生产直到日志切换完成。
6. Oracle Release Information
Oracle 版本信息
In Oracle8i initialization parameter FAST_START_IO_TARGET causes
incremental
checkpoint to automatically adjusting its target so that the number
of data
blocks needed by recovery would be no more than
FAST_START_IO_TARGET.
在8I初始化参数FAST_START_IO_TARGET使增量检查点子弟哦那个调整因此
恢复所需要的数据块数量不会超过FAST_START_IO_TARGET
This parameter has been deprecated since Oracle 9i in favor of
parameter FAST_START_MTTR_TARGET.
这个参数从9I开发被废弃,由FAST_START_MTTR_TARGET代替
7. Using Statspack to determine Checkpointing problems
用Statspack确定检查点问题
Statspack snapshots can be taken every 15 minutes or so, these
reports gather useful
information about number of checkpoints started and checkpoints
completed and number
of database buffers written during checkpointing for that window of
time .
Statspack 大概每15分钟被采集,这些报告收集有用的数据,关于检查点开始结束的数目和在这段时间
有多少数据缓存写入。
It also contains statistics about redo activity.
它也包括关于日志活跃的统计信息
Gathering and comparing these snapshot reports gives you a
complete idea about checkpointing performance at different periods
of time.
收集和比较这些snapshot报告可以给你一个完整的关于在不同时间内检查点性能的概念
Another important thing to watch in statspack report is the
following wait events, they could be a good indication
about problems with the redo log throughput and
checkpointing:
另外重要的事情就是在statspack中观察下面等待事件,它们可能是关于日志文件吞吐量和检查点很好的指示
log file switch (checkpoint incomplete)
log file switch (archiving needed)
log file switch/archive
log file switch (clearing log file)
log file switch completion
log switch/archive
log file sync
In the case when one or more of the above wait events is repeated
frequently
with considerable values then you need to take an action like
adding More
online redo log files or increasing their sizes and/or modifying
checkpointing parameters
既然如此,当一个或者多个上面等待事件频繁出现到值得考虑的值时,
你需要采取 如增加更多的日志文件或者增加日志文件大小和/或 修改检查点参数的行动。
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。