Analyzing multi-core processors for a given budget
This section extends our analysis in two ways. It considers a more general workload (designing for an unknown permutation of
a set of applications) and a variety of area and power budgets.
For every fixed area or power limit, an exhaustive search is performed to find the highest performing 4-core multiprocessor. For all budgets, the results shown assume that all contexts are busy. The configuration chosen is the one that gives the best average performance over all permutations of the applications.
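The search described above can be sketched as follows. This is a minimal illustration, not the methodology used to produce the results: the core library `CORES`, its area, power, and speedup numbers, the two application classes in `APPS`, and the policy of scoring each workload by its best thread-to-core assignment are all assumptions made for the example.

```python
from itertools import combinations_with_replacement, permutations, product

# Hypothetical core library: name -> (area_mm2, power_w, {app: speedup vs. a baseline core}).
# All numbers are invented for illustration; none are taken from the study.
CORES = {
    "IO_small":  (5.0,  4.0,  {"mem": 0.6, "cpu": 0.5}),
    "IO_wide":   (8.0,  7.0,  {"mem": 0.7, "cpu": 0.8}),
    "OOO_small": (9.0,  8.0,  {"mem": 0.9, "cpu": 0.9}),
    "OOO_large": (12.0, 13.0, {"mem": 1.2, "cpu": 1.1}),
}
APPS = ["mem", "cpu"]  # hypothetical application classes


def workload_score(config, workload):
    """Total speedup of a workload under its best thread-to-core assignment.

    Picking the best assignment is an assumed scheduling policy.
    """
    return max(
        sum(CORES[core][2][app] for core, app in zip(config, assignment))
        for assignment in permutations(workload)
    )


def best_config(area_budget, power_budget, n_cores=4):
    """Exhaustively search all n-core multisets that fit both budgets."""
    workloads = list(product(APPS, repeat=n_cores))  # every thread is busy
    best, best_score = None, float("-inf")
    for config in combinations_with_replacement(sorted(CORES), n_cores):
        area = sum(CORES[c][0] for c in config)
        power = sum(CORES[c][1] for c in config)
        if area > area_budget or power > power_budget:
            continue  # violates a budget
        # Average performance over all workloads drawn from APPS.
        score = sum(workload_score(config, w) for w in workloads) / len(workloads)
        if score > best_score:
            best, best_score = config, score
    return best, best_score
```

Calling `best_config(40, 30)` on this toy library returns the best feasible 4-core mix and its average score. With both budgets set generously, the search returns four copies of the largest core, which mirrors the "envelope" behavior discussed for unconstrained designs.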
Figure 3 shows the weighted speedup for the highest performing 4-core multiprocessors within an area budget of 40 mm². The three lines correspond to different power budgets for the cores. The results are presented for two workload conditions: all same, when all the threads of a 4-threaded workload are the same, and all different, when all the threads of a 4-threaded workload are different. These two conditions represent two extremes of heterogeneity. The points on the far left represent homogeneous CMP designs; all other points represent varying degrees of heterogeneity. Select points are labeled with a description of the core selection represented by that point, to aid in the following discussion.
The results lead to several interesting observations. First, we notice that the advantages of diversity exist even with the all same workload. This workload might represent parallel workloads with homogeneous threads, or perhaps a server handling requests with little diversity. Previous proposals discussed the advantages of heterogeneity only with heterogeneous workloads; however, we find that even homogeneous workloads achieve their best performance when at least one of the cores is well-suited for the application. A carefully constructed heterogeneous design ensures that, whatever application is being used for the homogeneous runs, such a core likely exists. For example, for an area budget of 40 mm² and a power budget of 30W, the best heterogeneous CMP for all same workloads outperforms the best homogeneous CMP by 4%. Note that such a CMP is exploiting diversity across different homogeneous workloads even though there is no diversity within a workload (that is, we are finding a single best design for all of our all same workloads).
Second, we observe that the advantages due to heterogeneity for a fixed area budget depend largely on the power budget available, as shown by the shape of the lines corresponding to different power budgets. In this case (Figure 3), heterogeneity buys little additional performance with a generous power budget (50W), but is increasingly important as the budget becomes more tightly constrained. For example, in the all different case, the best heterogeneous CMP outperforms the best homogeneous CMP by less than 1% when the power budget is 50W, by 8% when the power budget is 40W, and by 17% when the power budget is 30W. This can be explained by the fact that without constraints, the homogeneous architecture can create “envelope” cores: cores that are over-provisioned for any single application, but able to run most applications with high performance. For example, for an area budget of 40 mm², if the power budget is set high (50W), the “best” homogeneous architecture consists of 4 OOO 64 64 l l cores (i.e., out-of-order, large caches, large window). This architecture is able to run both the memory-bound and processor-bound applications well. When the design is more constrained, we can only meet the needs of each application through heterogeneous designs that are customized to subsets of the applications.
We see these same trends in Figure 4, which shows results for four other area budgets. There is significant benefit to a diversity of cores as long as either area or power is reasonably constrained. For a power budget of 40W, a heterogeneous CMP outperforms the best homogeneous CMP by 8% when the area budget is 50 mm² and by 10% when the budget is 30 mm². An 11% improvement is possible for an area budget of 20 mm² and a power budget of 30W.
The power and area budgets also determine the amount of diversity needed for a multi-core architecture. In general, the more constrained the budget, the more benefits are accrued due to increased diversity. For example, considering the all different results in Figure 3, while having 4 core types results in the best performance when the power limit is 30W (17% improvement over the best homogeneous CMP), two core types (or in some cases, one) are sufficient to get more than 99% of the potential benefits for higher power limits. In some of the regions where moderate diversity is sufficient, two unique cores not only match configurations with higher diversity, but even beat them. In cases where higher diversity is optimal, the gains must still be compared against the design and test costs of more unique cores. In the 30W case above, for instance, the marginal performance of 4 core types over the best 2-type result is 3%, and may not justify the extra effort. Going from one core type to two core types, however, results in a 14% performance improvement and presents a more compelling case.
Our results show that while having two core types is sufficient for getting most of the potential out of moderately power-limited designs, increased diversity results in significantly better performance for highly power-limited designs. These results underscore the increasing importance of single-ISA heterogeneous multi-core architectures for current and future processor designs. As designs become more aggressive, we will want to place more cores on the die (placing area pressure on the design), and power budgets per core will likely tighten even more severely.
Another way to interpret these results is that heterogeneous designs dampen the effects of constrained power budgets significantly. For example, in the 40 mm² results, both homogeneous and heterogeneous solutions provide good performance with a 50W budget. However, the homogeneous design loses 9% performance with a 40W budget and 23% with a 30W budget. With a heterogeneous design, we can drop power to 40W with only a 2% penalty and down to 30W (a 40% power savings) with only a 9% performance loss.
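The arithmetic behind this comparison can be checked with a short calculation. The sketch below uses only the percentages quoted above; the helper `relative_drop` and the "performance per budgeted watt" figure are illustrative additions, and the latter assumes each design consumes its full power budget, which the text does not state.

```python
def relative_drop(before, after):
    """Fractional reduction when going from `before` to `after`."""
    return (before - after) / before


# Dropping the power budget from 50W to 30W is a 40% saving.
power_saving = relative_drop(50.0, 30.0)

# Performance retained at 30W, relative to each design's own 50W result
# (losses of 23% and 9% as quoted in the text).
retained = {"homogeneous": 1 - 0.23, "heterogeneous": 1 - 0.09}

# Performance per budgeted watt, normalized to the 50W design point.
# Assumption: the whole budget is consumed at both design points.
perf_per_watt = {name: r / (1 - power_saving) for name, r in retained.items()}

for name, value in perf_per_watt.items():
    print(f"{name}: {value:.2f}x performance per budgeted watt vs. 50W")
```

Under that assumption, the heterogeneous design gains roughly 1.5x in performance per budgeted watt at 30W, against roughly 1.3x for the homogeneous design.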
Perhaps more illuminating than the raw performance of the best designs is what architectures actually provide the highest performance for a given area and power budget. We observe that there can be a significant difference between the cores of the best heterogeneous multiprocessor and the cores constituting the best homogeneous CMP. That is, the best heterogeneous multiprocessors cannot be constructed only by making slight modifications to the best homogeneous CMP design. Rather, they need to be designed from a clean slate. Consider, for example, the best multiprocessors for an area budget of 40 mm² and a power budget of 40W. The best homogeneous CMP consists of single-issue OOO 16 16 s l cores (out-of-order, 16KB ICache, 16KB DCache, few functional units, large window). On the other hand, the best heterogeneous CMP with two types of cores, for all different workloads, consists of two single-issue in-order cores with 8KB L1 caches (IO 8 8 s) and two single-issue out-of-order cores with 64KB ICache, 32KB DCache and double the number of functional units (OOO 64 32 l s).
Another consistent observation is the reliance on non-monotonicity. In several of our best heterogeneous configurations, we see that no core is a subset of any other core. For example, when the power budget is 30W, the best heterogeneous CMP for two core types for all same workloads consists of superscalar in-order cores (issue-width=2) and scalar out-of-order cores (issue-width=1), and outperforms the best homogeneous CMP by 4%. Even when all the cores are different, the best multiprocessor for all different workloads consists of a collection of one in-order core with 16KB L1 caches (IO 16 16 s), one out-of-order core with 32KB ICache and 16KB DCache (OOO 32 16 s s), one in-order core with 32KB L1 caches (IO 32 32 s), and one out-of-order core with 64KB ICache and 16KB DCache (OOO 64 16 s s). We explore this further in Section 6.3.
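The "no core is a subset of any other core" property can be stated as a simple dominance check. The sketch below is an illustration only: the two-field encoding of a core (`issue_width`, `out_of_order`) is a hypothetical simplification, chosen to match the 30W two-core-type example above.

```python
from itertools import permutations


def is_subset(a, b):
    """True if core `a` has no more of any resource than core `b`.

    Note that two identical cores count as subsets of each other.
    """
    return all(a[key] <= b[key] for key in a)


def is_non_monotonic(cores):
    """True if no core in the design is a subset of any other core."""
    return not any(is_subset(a, b) for a, b in permutations(cores, 2))


# Hypothetical encoding of the 30W two-core-type example:
# a superscalar in-order core and a scalar out-of-order core.
example = [
    {"issue_width": 2, "out_of_order": 0},
    {"issue_width": 1, "out_of_order": 1},
]

# A monotonic pair for contrast: the second core has at least as much
# of every resource as the first.
monotonic = [
    {"issue_width": 1, "out_of_order": 0},
    {"issue_width": 2, "out_of_order": 1},
]
```

Here `is_non_monotonic(example)` holds because each core is stronger than the other in one dimension, while `is_non_monotonic(monotonic)` does not.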
To summarize, the results show that the best heterogeneous CMP is not constructed of cores that make good general-purpose uniprocessor cores, or even those cores that would appear in good homogeneous multiprocessor architectures. Rather, the best way to design a heterogeneous CMP is by tuning each individual core to a class of applications with common characteristics. We see this because the best designs typically contain cores poorly suited for some applications, but these designs will not have all cores poorly designed for a particular application. Such processors are advantageous even for completely homogeneous workloads, and their benefits keep increasing as area and power budgets get tighter.
Note that all our results (even in the following sections) have been presented with various cutoff points (area/power budgets) for the ease of visualization. We analyzed the complete continuous data space, however, and also looked at finer intervals, to ensure that our conclusions were not particular to the cutoffs shown in these graphs.