Garbage Collection Optimization for High-Throughput and Low-Latency Java Applications
Original article: https://engineering.linkedin.com/garbage-collection/garbage-collection-optimization-high-throughput-and-low-latency-java-applications

High-performance applications form the backbone of the modern web. At LinkedIn, a number of internal high-throughput services cater to thousands of user requests per second. For optimal user experience, it is very important to serve these requests with low latency.

For example, a product our members use regularly is the Feed - a constantly updating list of professional activities and content. Examples of various feeds across LinkedIn include those on company pages, school pages, and most importantly - the homepage feed. The underlying feed data platform that indexes these updates from various entities in our economic graph (members, companies, groups, etc.) has to serve relevant updates with high throughput and low latency.

For taking these types of high-throughput, low-latency Java applications to production, developers have to ensure consistent performance at every stage of the application development cycle. Determining optimal Garbage Collection (GC) settings is critical to achieving these metrics.

This blog post will walk through the steps to identify and optimize GC requirements, and is intended for a developer interested in a systematic method to tame GC to obtain high throughput and low latency. Insights in this post were gathered while building the next-generation feed data platform at LinkedIn. These insights include, but are not limited to, the CPU and memory overheads of the CMS and G1 collectors, avoiding incessant GC cycles due to long-lived objects, performance changes obtained by optimizing task assignment to GC threads, and OS settings that are needed to make GC pauses predictable.

When is the right time to optimize GC?

GC behavior can vary with code-level optimizations and workloads, so it is important to tune GC on a codebase that is near completion and includes performance optimizations.
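To make the generational layout concrete before diving into the tuning steps, the spaces a Hotspot collector manages can be listed at runtime through the standard java.lang.management API. The sketch below is ours, not from the original post; the class name is arbitrary.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.util.ArrayList;
import java.util.List;

// Lists the JVM's memory pools. With ParNew/CMS these typically include
// "Par Eden Space" and "Par Survivor Space" (young generation) and
// "CMS Old Gen" (old generation); exact names vary by collector and JVM.
public class HeapGenerations {
    public static List<String> poolNames() {
        List<String> names = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            names.add(pool.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        for (String name : poolNames()) {
            System.out.println(name);
        }
    }
}
```

Running this under different collector flags is a quick way to see which generations the chosen algorithm actually creates.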
But it is also necessary to perform preliminary analysis on an end-to-end basic prototype with stub computation code and synthetic workloads representative of production environments. This will help capture realistic bounds on latency and throughput based on the architecture and guide the decision on whether to scale up or scale out.

During the prototyping phase of our next-generation feed data platform, we implemented almost all of the end-to-end functionality and replayed the query workload being served by the current production infrastructure. This gave us enough diversity in the workload characteristics to measure application performance and GC characteristics over a long enough period of time.

Steps to optimize GC

Here are some high-level steps to optimize GC for high-throughput, low-latency requirements, along with the details of how this was done for the feed data platform prototype. We saw the best GC performance with ParNew/CMS, though we also experimented with the G1 garbage collector.

1. Understand the basics of GC

Understanding how GC works is important because of the large number of variables that need to be tuned. Oracle's whitepaper on Hotspot JVM Memory Management is an excellent starting point to become familiar with GC algorithms in the Hotspot JVM. To understand the theoretical aspects of the G1 collector, check out this paper.

2. Scope out GC requirements

There are certain characteristics of GC that you should optimize to reduce its overhead on application performance. Like throughput and latency, these GC characteristics should be observed over a long-running test to ensure that the application can handle variance in traffic while going through multiple GC cycles.
We used Hotspot Java7u51 on Linux and started the experiment with a 32 GB heap, a 6 GB young generation, and an initial setting for the old GC triggering threshold (-XX:CMSInitiatingOccupancyFraction). With the initial GC configuration, we incurred a young GC pause of 80 ms once every three seconds and a 99.9th percentile application latency of 100 ms. The GC behavior that we started with is likely adequate for many applications with less stringent latency SLAs. However, our goal was to decrease the 99.9th percentile application latency as much as possible. GC optimization was essential to achieve this goal.

3. Understand GC metrics

Measurement is always a prerequisite to optimization. Understanding the GC logs (generated with options such as -XX:+PrintGCDetails and -XX:+PrintGCDateStamps) is the first step. At LinkedIn, our internal monitoring and reporting systems, including Naarad, generate useful visualizations for metrics such as GC stall time percentiles, maximum duration of a stall, and GC frequency over long durations. In addition to Naarad, there are a number of open source tools that can create visualizations from GC logs. At this stage, it is possible to determine whether GC frequency and pause duration are impacting the application's ability to meet latency requirements.

4. Reduce GC frequency

In generational GC algorithms, collection frequency for a generation can be decreased by (i) reducing the object allocation/promotion rate and (ii) increasing the size of the generation. In the Hotspot JVM, the duration of a young GC pause depends on the number of objects that survive a collection and not on the young generation size itself. The impact of increasing the young generation size on application performance has to be carefully assessed.
For applications that mostly create short-lived objects, you will only need to control the aforementioned parameters. For applications that create long-lived objects, there is a caveat: the promoted objects may not be collected in an old generation GC cycle for a long time. If the threshold at which old generation GC is triggered (expressed as a percentage of the old generation that is filled) is low, the application can get stuck in incessant GC cycles. You can avoid this by triggering GC at a higher threshold. As our application maintains a large cache of long-lived objects in the heap, we increased the threshold for triggering old GC by raising -XX:CMSInitiatingOccupancyFraction.

5. Reduce GC pause duration

The young GC pause duration can be reduced by decreasing the young generation size, as this may lead to less data being copied to survivor spaces or promoted per collection. However, as previously mentioned, we have to observe the impact of a reduced young generation size, and the resulting increase in GC frequency, on the overall application throughput and latency. The young GC pause duration also depends on tenuring thresholds and the old generation size (as shown in step 6).

With CMS, try to minimize the heap fragmentation and the associated full GC pauses in the old generation collection. You can achieve this by controlling the object promotion rate and by reducing the -XX:CMSInitiatingOccupancyFraction value so the old GC starts earlier.

We observed that much of our Eden space was evacuated in the young collection and very few objects died in the survivor spaces over the ages three to eight. So we reduced the tenuring threshold from 8 to 2 (with option -XX:MaxTenuringThreshold=2). We also noticed that the young collection pause duration increased as the old generation filled up; this indicated that object promotion was taking more time due to backpressure from the old generation. We addressed this problem by increasing the total heap size to 40 GB and reducing -XX:CMSInitiatingOccupancyFraction to 80.
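The collection frequency and cumulative pause time that the last few steps revolve around can also be sampled from inside the application via the standard GarbageCollectorMXBean API, without parsing GC logs. This is our own illustrative sketch, not code from the original post; the class and method names are arbitrary.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sums collection counts and accumulated pause time (ms) across all
// collectors (e.g. "ParNew" for the young generation and
// "ConcurrentMarkSweep" for the old generation under ParNew/CMS).
// Sampling these periodically yields GC frequency and average pause.
public class GcStats {
    public static long totalCollections() {
        long count = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long c = gc.getCollectionCount();
            if (c > 0) count += c;   // -1 means the count is unavailable
        }
        return count;
    }

    public static long totalPauseMillis() {
        long millis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) millis += t;  // -1 means the time is unavailable
        }
        return millis;
    }

    public static void main(String[] args) {
        // Allocate some short-lived garbage to make a young collection likely.
        for (int i = 0; i < 500_000; i++) {
            byte[] chunk = new byte[1024];
            if (chunk.length == 0) System.out.println(); // keep allocation live
        }
        System.out.println("collections=" + totalCollections()
                + " pauseMillis=" + totalPauseMillis());
    }
}
```

Comparing these counters before and after a tuning change gives a quick in-process sanity check that frequency and pause time moved in the intended direction.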
6. Optimize task assignment to GC worker threads

To reduce the young generation pause duration even further, we decided to look into options that optimized task binding with GC threads. The -XX:ParGCCardsPerStrideChunk option (used in our final configuration below) controls the granularity of tasks given to GC worker threads. There are also a couple of other interesting options that deal with mapping tasks to GC threads, such as -XX:+BindGCTaskThreadsToCPUs and -XX:+UseGCTaskAffinity.

7. Identify CPU and memory overhead of GC

Concurrent GC typically increases CPU usage. While we observed that the default settings of CMS behaved well, the increased CPU usage due to concurrent GC work with the G1 collector degraded our application's throughput and latency significantly. G1 also tends to impose a larger memory overhead on the application as compared to CMS. For low-throughput applications that are not CPU-bound, high CPU usage by GC may not be a pressing concern.

8. Optimize system memory and I/O management for GC

Occasionally, it is possible to see GC pauses with (i) low user time, high system time, and high real (wallclock) time, or (ii) low user time, low system time, and high real time. This indicates problems with the underlying process/OS settings. Case (i) might imply that the JVM pages are being stolen by Linux, and case (ii) might imply that the GC threads were recruited by Linux for disk flushes and were stuck in the kernel waiting for I/O. Refer to this article to check which settings might help in such cases. We used the JVM option -XX:+AlwaysPreTouch to touch and zero out the heap pages at application startup and avoid the penalty of page faults at runtime. You could potentially use mlock to pin the JVM pages to memory so that they are not swapped out.

GC optimization for the feed data platform at LinkedIn

For the prototype feed data platform system, we tried to optimize garbage collection using two algorithms with the Hotspot JVM: ParNew/CMS and the G1 garbage collector.
With ParNew/CMS, we saw 40-60 ms young generation pauses once every three seconds and a CMS cycle once every hour. We used the following JVM options:

=4096m -XX:PermSize=256m -XX:MaxPermSize=256m
-XX:NewSize=6g -XX:MaxNewSize=6g
-XX:+UseParNewGC -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=8
-XX:+UnlockDiagnosticVMOptions -XX:ParGCCardsPerStrideChunk=32768
-XX:CMSInitiatingOccupancyFraction=80
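To put flags like these into context, here is an illustrative launch line combining a ParNew/CMS configuration of this shape with GC logging, so that regressions stay visible in production. The flag values and app.jar are placeholders for illustration, not the article's exact settings:

```shell
java -Xms40g -Xmx40g \
     -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
     -XX:NewSize=6g -XX:MaxNewSize=6g \
     -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=8 \
     -XX:CMSInitiatingOccupancyFraction=80 -XX:+UseCMSInitiatingOccupancyOnly \
     -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log \
     -jar app.jar
```

Note that -XX:+UseCMSInitiatingOccupancyOnly is worth pairing with the occupancy fraction; without it the JVM may override the threshold with its own heuristics.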
With these options, our application's 99.9th percentile latency was reduced to 60 ms while providing a throughput of thousands of read requests per second.

Acknowledgements

Many hands contributed to the development of the prototype application, and many thanks go to everyone who helped with system optimizations.

Footnotes

[3] There have been a few bugs related to memory leaks with G1; it is possible that Java7u51 might have missed some fixes. For example, this bug was fixed only in Java 8.