0%

(译)JVM Concurrent Mark Sweep (CMS) Collector 1.8

最近线上JDK升级到啦1.8,应用在发布时,总会发生Full GC报警,看了下GC日志,发现应用重启时会接连发生4次Full GC,但是这4次GC后很久一段时间(第二天再次查看GC日志和jstat的统计)没有再发生Full GC,查了下官方资料,顺便翻译出来,方便以后阅读。翻译的不好,请见谅,也欢迎提出建议,在下将不胜感激。

原文链接:https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/cms.html

The Concurrent Mark Sweep (CMS) collector is designed for applications that prefer shorter garbage collection pauses and that can afford to share processor resources with the garbage collector while the application is running. Typically applications that have a relatively large set of long-lived data (a large tenured generation) and run on machines with two or more processors tend to benefit from the use of this collector. However, this collector should be considered for any application with a low pause time requirement. The CMS collector is enabled with the command-line option -XX:+UseConcMarkSweepGC.

CMS收集器特点:1.更短的垃圾回收暂停(stop the world) 2.垃圾收集器能够在应用运行时与其共享处理器资源。通常,一个运行在多处理器机器上并且其中有长期存活的大集合(较大的老生代)的应用可以考虑使用CMS收集器。当然,对于具有低暂停时间要求的应用,都可以考虑使用该收集器。使用-XX:+UseConcMarkSweepGC来启用CMS收集器。

Similar to the other available collectors, the CMS collector is generational; thus both minor and major collections occur. The CMS collector attempts to reduce pause times due to major collections by using separate garbage collector threads to trace the reachable objects concurrently with the execution of the application threads. During each major collection cycle, the CMS collector pauses all the application threads for a brief period at the beginning of the collection and again toward the middle of the collection. The second pause tends to be the longer of the two pauses. Multiple threads are used to do the collection work during both pauses. The remainder of the collection (including most of the tracing of live objects and sweeping of unreachable objects is done with one or more garbage collector threads that run concurrently with the application. Minor collections can interleave with an ongoing major cycle, and are done in a manner similar to the parallel collector (in particular, the application threads are stopped during minor collections).

和其他收集器一样,CMS是分代的,因此,新生代和老年代都会发生回收。CMS尝试通过多线程并发的方式来跟踪对象的可达性,以便减少老生代的收集时间。在老年代垃圾回收期间,CMS会发生两次STW:1.收集开始首次标记时(initial-mark)2.重新标记时(remark)。第二次暂停的时间往往比第一次要长。这两次标记均使用多线程。其余的收集(包括大多存活对象跟踪和不可达对象的清除)使用一个或多个线程和应用并发执行。新生代的回收可以和正在执行老年年代回收交错执行,类似于并行收集器(在新生代回收期间应用程序线程被停止STW)。

Concurrent Mode Failure

The CMS collector uses one or more garbage collector threads that run simultaneously with the application threads with the goal of completing the collection of the tenured generation before it becomes full. As described previously, in normal operation, the CMS collector does most of its tracing and sweeping work with the application threads still running, so only brief pauses are seen by the application threads. However, if the CMS collector is unable to finish reclaiming the unreachable objects before the tenured generation fills up, or if an allocation cannot be satisfied with the available free space blocks in the tenured generation, then the application is paused and the collection is completed with all the application threads stopped. The inability to complete a collection concurrently is referred to as concurrent mode failure and indicates the need to adjust the CMS collector parameters. If a concurrent collection is interrupted by an explicit garbage collection (System.gc()) or for a garbage collection needed to provide information for diagnostic tools, then a concurrent mode interruption is reported.

CMS使用一个或多个线程和应用线程并发进行,目标是在老生代被消耗完之前完成垃圾回收。如前所述,在正常操作中,CMS收集器执行大部分追踪和清理工作时,应用程序线程仍在运行,因此应用线程只会短暂暂停。但是,如果在垃圾回收完成前,老生代被耗尽,或者老生代无法分配足够的空间,此时会暂停所有的应用线程(导致了STW)直到垃圾回收完成。没有在并发期间完成垃圾回收工作称为concurrent mode failure,这个失败表明我们需要调整CMS收集器的参数。如果并发回收被显示调用(System.gc()) 或者为了给诊断工具提供信息而发生中断,则会报告并发模式中断。

Excessive GC Time and OutOfMemoryError

Excessive GC Time and OutOfMemoryError
The CMS collector throws an OutOfMemoryError if too much time is being spent in garbage collection: if more than 98% of the total time is spent in garbage collection and less than 2% of the heap is recovered, then an OutOfMemoryError is thrown. This feature is designed to prevent applications from running for an extended period of time while making little or no progress because the heap is too small. If necessary, this feature can be disabled by adding the option -XX:-UseGCOverheadLimit to the command line.

The policy is the same as that in the parallel collector, except that time spent performing concurrent collections is not counted toward the 98% time limit. In other words, only collections performed while the application is stopped count toward excessive GC time. Such collections are typically due to a concurrent mode failure or an explicit collection request (for example, a call to System.gc).

当GC时间太久(垃圾收集花费时间,超过了总时间的98%,但是回收的堆少于2%,总时间究竟是相对哪个时间而言?在下还未找到解释。),CMS会抛出OutOfMemoryError。这个特性旨在避免由于堆空间过小而导致应用程序业务处理缓慢或者无进展。这个特性可以用-XX:-UseGCOverheadLimit禁用。

这个策略和并行收集器一致,只是并发回收垃圾的时间不计入98%的时间消耗中。换句话讲,只有发生了STW情况下的收集时间(两次标记的时间)会计入这98%的时间限制内。这种长时间回收通常是由于concurrent mode failure或者回收请求(比如:显示调用System.gc)导致的。

Floating Garbage

The CMS collector, like all the other collectors in Java HotSpot VM, is a tracing collector that identifies at least all the reachable objects in the heap. In the parlance of Richard Jones and Rafael D. Lins in their publication Garbage Collection: Algorithms for Automated Dynamic Memory, it is an incremental update collector. Because application threads and the garbage collector thread run concurrently during a major collection, objects that are traced by the garbage collector thread may subsequently become unreachable by the time collection process ends. Such unreachable objects that have not yet been reclaimed are referred to as floating garbage. The amount of floating garbage depends on the duration of the concurrent collection cycle and on the frequency of reference updates, also known as mutations, by the application. Furthermore, because the young generation and the tenured generation are collected independently, each acts a source of roots to the other. As a rough guideline, try increasing the size of the tenured generation by 20% to account for the floating garbage. Floating garbage in the heap at the end of one concurrent collection cycle is collected during the next collection cycle.

CMS和其他Java HotSpot VM中回收器一样,是一个会标记堆中所有可达对象的追踪回收器。在《Algorithms for Automated Dynamic Memory》中CMS被称作增量收集器。因为应用线程和垃圾回收线程在老年代回收时是并发运行的,所以,那些被垃圾回收线程追踪的对象在回收结束时可能会变得不可达。像这些尚未被回收并且不可达的引用被称为浮动垃圾。浮动垃圾的数量取决于并发收集周期的持续时间以及应用程序的引用更新频率(也称为突变)。此外,因为新生代和老年代的垃圾回收是相互独立的,所以,根据经验建议将老年代增加20%的空间来承载浮动垃圾。一个并发回收周期节后产生的浮动垃圾会在下个垃圾回收中回收。

Pauses

The CMS collector pauses an application twice during a concurrent collection cycle. The first pause is to mark as live the objects directly reachable from the roots (for example, object references from application thread stacks and registers, static objects and so on) and from elsewhere in the heap (for example, the young generation). This first pause is referred to as the initial mark pause. The second pause comes at the end of the concurrent tracing phase and finds objects that were missed by the concurrent tracing due to updates by the application threads of references in an object after the CMS collector had finished tracing that object. This second pause is referred to as the remark pause.

CMS在一次回收的过程中会两次暂停(STW)应用。第一次是标记从GC root(比如:被应用线程栈引用的对象和registers,静态的对象等等)可以直接到达的对象和来自堆的其他地方(比如:新生代)的对象。第一次暂停叫做初始标记暂停(initial mark pause)。第二次暂停发生在并发追踪阶段结束后,这次暂停是为了标记那些在并发追踪期间因为被应用线程更新引用而错过的对象。第二次暂停叫做重新标记暂停(remark pause)。

Concurrent Phases

The concurrent tracing of the reachable object graph occurs between the initial mark pause and the remark pause. During this concurrent tracing phase one or more concurrent garbage collector threads may be using processor resources that would otherwise have been available to the application. As a result, compute-bound applications may see a commensurate fall in application throughput during this and other concurrent phases even though the application threads are not paused. After the remark pause, a concurrent sweeping phase collects the objects identified as unreachable. Once a collection cycle completes, the CMS collector waits, consuming almost no computational resources, until the start of the next major collection cycle.

并发追踪可达对象发生在初始标记(initial mark)和重新标记(remark)之间。在并发标记阶段会有一个或者多个垃圾回收线程在使用处理器资源,否则处理器资源对应用而言是可用的。因此,在垃圾回收期间计算密集型应用吞吐量会下降,即使应用线程没有被暂停。在重新标记阶段后,会并发清理那些被被标记为不可达的对象。垃圾回收完成后,CMS进入等待,几乎不消耗任何计算资源,直到下一个老年代回收开始。

Starting a Concurrent Collection Cycle

With the serial collector a major collection occurs whenever the tenured generation becomes full and all application threads are stopped while the collection is done. In contrast, the start of a concurrent collection must be timed such that the collection can finish before the tenured generation becomes full; otherwise, the application would observe longer pauses due to concurrent mode failure. There are several ways to start a concurrent collection.

Based on recent history, the CMS collector maintains estimates of the time remaining before the tenured generation will be exhausted and of the time needed for a concurrent collection cycle. Using these dynamic estimates, a concurrent collection cycle is started with the aim of completing the collection cycle before the tenured generation is exhausted. These estimates are padded for safety, because concurrent mode failure can be very costly.

A concurrent collection also starts if the occupancy of the tenured generation exceeds an initiating occupancy (a percentage of the tenured generation). The default value for this initiating occupancy threshold is approximately 92%, but the value is subject to change from release to release. This value can be manually adjusted using the command-line option -XX:CMSInitiatingOccupancyFraction=, where is an integral percentage (0 to 100) of the tenured generation size.

老年代串行垃圾收集器在老年代即将耗尽时开始工作,此时所有应用线程都会暂停直到垃圾回收完成。与之相反,CMS从开始并发回收时进行计时,以使回收在老年代耗尽之前完成,否则会因为concurrent mode failure使得应用暂停更长时间。有几种方式可以开始一个并发回收。

基于最近的回收历史,CMS会维护两个预估时间:老年代多久会被耗尽;完成一次老年代的回收需要多久。依据这些动态估计值,在老年代耗尽前适时开始老年代的回收。这些预估值是为了避免耗时的concurrent mode failure

如果老年代占用的空间超过了启动设置的大小(一个老生代的占用比)。默认的启动设置的阀值是92%,但是这个值会随着版本发布改变。这个值可以通过-XX:CMSInitiatingOccupancyFraction=配置,N是占用老生代大小的百分比(0-100)。

Scheduling Pauses

The pauses for the young generation collection and the tenured generation collection occur independently. They do not overlap, but may occur in quick succession such that the pause from one collection, immediately followed by one from the other collection, can appear to be a single, longer pause. To avoid this, the CMS collector attempts to schedule the remark pause roughly midway between the previous and next young generation pauses. This scheduling is currently not done for the initial mark pause, which is usually much shorter than the remark pause.

老年代和新生代垃圾回收引起的暂停是独立的。它们不会重叠,但是可能会快速连续发生,比如:一个回收暂停结束后另一个回收又立刻导致暂停,看起来就像单独一个很长的暂停。为了避免这种情况,CMS尝试规划将remark暂停放在两次新生代回收(Young GC)之间。这个规划尚未在初始标记暂停(initial mark pause)使用,因为初始标记暂停(initial mark pause)通常比remark pause要短。

Incremental Mode

Note that the incremental mode is being deprecated in Java SE 8 and may be removed in a future major release.

请注意,增量模式在Java SE 8中已被弃用,并可能在将来的主要版本中被删除,所以不在翻译和i-cms有关部分。

Measurements

Example 8-1, “Output from the CMS Collector” is the output from the CMS collector with the options -verbose:gc and -XX:+PrintGCDetails, with a few minor details removed. Note that the output for the CMS collector is interspersed with the output from the minor collections; typically many minor collections occur during a concurrent collection cycle. CMS-initial-mark indicates the start of the concurrent collection cycle, CMS-concurrent-mark indicates the end of the concurrent marking phase, and CMS-concurrent-sweep marks the end of the concurrent sweeping phase. Not discussed previously is the precleaning phase indicated by CMS-concurrent-preclean. Precleaning represents work that can be done concurrently in preparation for the remark phase CMS-remark. The final phase is indicated by CMS-concurrent-reset and is in preparation for the next concurrent collection.

Example 8-1 Output from the CMS Collector

[GC [1 CMS-initial-mark: 13991K(20288K)] 14103K(22400K), 0.0023781 secs]
[GC [DefNew: 2112K->64K(2112K), 0.0837052 secs] 16103K->15476K(22400K), 0.0838519 secs]
...
[GC [DefNew: 2077K->63K(2112K), 0.0126205 secs] 17552K->15855K(22400K), 0.0127482 secs]
[CMS-concurrent-mark: 0.267/0.374 secs]
[GC [DefNew: 2111K->64K(2112K), 0.0190851 secs] 17903K->16154K(22400K), 0.0191903 secs]
[CMS-concurrent-preclean: 0.044/0.064 secs]
[GC [1 CMS-remark: 16090K(20288K)] 17242K(22400K), 0.0210460 secs]
[GC [DefNew: 2112K->63K(2112K), 0.0716116 secs] 18177K->17382K(22400K), 0.0718204 secs]
[GC [DefNew: 2111K->63K(2112K), 0.0830392 secs] 19363K->18757K(22400K), 0.0832943 secs]
...
[GC [DefNew: 2111K->0K(2112K), 0.0035190 secs] 17527K->15479K(22400K), 0.0036052 secs]
[CMS-concurrent-sweep: 0.291/0.662 secs]
[GC [DefNew: 2048K->0K(2112K), 0.0013347 secs] 17527K->15479K(27912K), 0.0014231 secs]
[CMS-concurrent-reset: 0.016/0.016 secs]
[GC [DefNew: 2048K->1K(2112K), 0.0013936 secs] 17527K->15479K(27912K), 0.0014814 secs]

The initial mark pause is typically short relative to the minor collection pause time. The concurrent phases (concurrent mark, concurrent preclean and concurrent sweep) normally last significantly longer than a minor collection pause, as indicated by Example 8-1, “Output from the CMS Collector”. Note, however, that the application is not paused during these concurrent phases. The remark pause is often comparable in length to a minor collection. The remark pause is affected by certain application characteristics (for example, a high rate of object modification can increase this pause) and the time since the last minor collection (for example, more objects in the young generation may increase this pause).

例子8-1是打开-verbose:gc和-XX:+PrintGCDetails选项后的gc打印日志,例子中略去了一些新生代的收集细节。注意,CMS收集器打印的日志和新生代回收打印的日志是穿插在一起的,通常在一个老年代回收周期中会发生多次新生代回收。CMS-initial-mark表明一个并发收集周期的开始,CMS-concurrent-mark表明并发标记阶段已经完成,CMS-concurrent-sweep说明并发清除阶段已经结束。CMS-concurrent-preclean是预清理阶段。预清理工作并发进行,是CMS-remark阶段的前置操作。CMS-concurrent-reset是老年代垃圾回收的最后一个阶段也是为下次回收做预备工作。

初始标记阶段耗时通常比年轻代暂停时间短。从例8-1我们不难发现,并发阶段(并发标记,并发预清理和并发清除)通常耗时比新生代回收时间长。但是,应用在这些阶段中并没有暂停。重新标记耗时与新生代回收耗时差不多。重新标记的耗时会受某些应用特性(比如:高对象修改率会增加此阶段耗时)和距离上次老年代收集时间(比如:新生代有了更多的对象会增加此阶段耗时)的影响。