Introduction
A recent paper by Xing and Stern [1] provides additional performance data of interest on Verification of Calculations, but is misleading in some important respects. Five methods, all starting from Richardson Extrapolation, are compared: three are based on variants of my Grid Convergence Index (GCI) method [2-6], and two are based on the authors’ own work, the earlier “CF method” and the latest “FS method.”
(1) The method referred to as GCI2 is successful, but its source is cited as a “private communication” from me (their Ref. [16]). What they refer to as GCI2 is essentially the GCI method as I have described, discussed, and evaluated at length in Refs. [2-6]. This is the method that, as the authors say [1], “is widely used and recommended by ASME and AIAA,” citing Refs. [6,7]. I shall refer to it herein as the “real GCI method,” for reasons soon to be apparent. The equivalence is obscured because the authors [1] have written their Eq. (12) defining the method using their own ratio term CF, which hides the simplicity of the original and makes it appear that the factor of safety varies continuously.
In the original GCI method [2], Fs does depend on the observed pRE implicitly, in the sense that if pRE is unreasonable one at least reverts to using pth and Fs = 3 rather than blindly applying Fs = 1.25. The preferred approach would be to continue the grid convergence study with additional grids until reasonable values of pRE were obtained. These rather obvious points have been made more explicit in Ref. [3], which was not available to the authors of Ref. [1].
There is further unfortunate confusion caused by the authors’ choice of a descriptor for their method and their use of the symbol FS both for their “Factor of Safety” method and for the factors of safety used in all five methods, including the “factor of safety used in the Factor of Safety method.” Likewise for their symbol CF, which denotes both their Correction Factor method and the “correction factor” used in the Correction Factor method as well as in their defining equations for other methods (Eqs. (11) and (12)).
This leaves the question of what the methods called “GCI” and “GCI1” by the authors actually are. The answer is that I consider these to be misapplications of the GCI formulas, and therefore misrepresentations of the GCI method.
(2) For specificity, I will change notation and refer to the method of their Eq. (10) as GCI0, not as “GCI,” since I claim it is not the real GCI method. This Eq. (10) is written in terms of an error estimate based on the “observed” [2-6] or “estimated” [1] order of convergence pRE.
However, the authors correctly indicate in the text that follows that this is possible only if three grids have been used to calculate a pRE, in which case the recommended factor of safety is Fs = 1.25. If only two grids have been used, pRE in Eq. (10) is replaced by a theoretical value pth (e.g., pth = 2 for a nominally second-order method) and a more conservative value Fs = 3 is recommended [2-6]. This would indeed agree with my GCI method [2-6] in good applications. The trouble is that the authors have applied this regardless of the reasonableness of the estimated pRE. This is contrary to the discussions in Ref. [2], discussions that I claim are part of the real “GCI method,” as contrasted with simplistic application of the formula.
Exactly what constitutes a reasonable value was not defined in my work [2-6], and I would not take issue with (say) a 5% difference between pRE and pth, e.g., using an observed pRE = 2.1 for a nominally second-order method. (Although it should be obvious that the conservative approach would be to use the minimum of pRE and pth, I would not claim that this was a misapplication of the GCI method.) However, the authors [1] have applied the GCI0 formula using observed pRE ≫ theoretical pth, which is imprudent and not recommended [2-6] no matter what the value of Fs. This point was conveyed to the authors of Ref. [1] in their Ref. [16], the private communication from me. (I am glad to make this communication, cited here as Ref. [8], available to any interested party.)
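For concreteness, the following is a minimal sketch (my own illustration in Python, not code from Ref. [1] or Refs. [2-6]) of the logic just described: the standard relative-error form of the GCI with a reasonableness check on the observed order, reverting to the theoretical order and Fs = 3 when pRE is unacceptable. The function names and the 5% tolerance are merely illustrative assumptions.

```python
import math

def observed_order(f1, f2, f3, r):
    """Observed order of convergence p_RE from three grid solutions
    (f1 finest, f3 coarsest) with constant refinement ratio r."""
    return math.log(abs((f3 - f2) / (f2 - f1))) / math.log(r)

def gci_sketch(f1, f2, f3=None, r=2.0, p_th=2.0, tol=0.05):
    """Illustrative use of the GCI as discussed in the text: with three
    grids, use the observed order and Fs = 1.25 only if p_RE is
    reasonably close to p_th (here, within an illustrative 5%);
    otherwise, or with only two grids, revert to p_th with Fs = 3."""
    eps21 = abs((f2 - f1) / f1)              # relative fine-grid error estimate
    if f3 is not None:
        p_re = observed_order(f1, f2, f3, r)
        if abs(p_re - p_th) / p_th <= tol:   # reasonable observed order
            return 1.25 * eps21 / (r**p_re - 1.0)
    # Two-grid study, or unreasonable p_RE: theoretical order, Fs = 3.
    return 3.0 * eps21 / (r**p_th - 1.0)
```

As noted above, the preferred response to an unreasonable pRE is to continue the grid convergence study, not merely to fall back on pth.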
Using the authors’ notation P = pRE/pth, the authors have claimed rational estimates of uncertainty from studies with P as high as 6.01 for the new ship hydrodynamics problem (see Table 7, second line). Since the numerical method used is nominally second order, this indicates an observed order pRE = 12.02, the use of which would be ridiculous. Another instance is the third-row entry in Table 6 [1] for the grid triplet (4,5,6), showing P = 0.08. These are pretty strong hints that something is wrong. There is no sense in performing a grid convergence study and then ignoring the results. No excuse of computing difficulties in industrial applications can justify an attempt to obtain a rational estimate of uncertainty from such meaningless, corrupted studies. To so attempt is to give merit to meritless results. I can only describe this as a misapplication of the GCI formula, not as a poor result for the real GCI method.
(3) The second method considered in Ref. [1] is GCI1, based on Eq. (11). The appearance of the CF term in the equation serves to exchange the error estimate based on observed pRE for one based on theoretical pth under the condition P > 1, i.e., for pRE > pth. This is exactly what is recommended above, and is good conservative practice. However, Eq. (11) uses Fs = 1.25 in either case. This does not make sense. The GCI1 method as described uses three grid solutions to evaluate an observed pRE, but if this value is too large, the method reverts to exactly the error estimator that would have been used for a two-grid study, yet with the nonconservative value Fs = 1.25 instead of the recommended Fs = 3. Thus, the authors did use the correct factor of safety Fs = 3 as recommended [2-6] for two-grid studies, but they used only Fs = 1.25 for three-grid studies in which the observed order of convergence pRE was not at least approximately consistent with theory.
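Since Eq. (11) is not reproduced here, the following is only my paraphrase, in the same sketch style as above, of the GCI1 behavior being criticized: the error estimator is exchanged for the pth-based form when P > 1, but the factor of safety stays at 1.25 rather than the recommended 3.

```python
def gci1_as_described(f1, f2, f3, r=2.0, p_th=2.0):
    """My reading of the GCI1 logic of Ref. [1], Eq. (11) (illustrative
    paraphrase only): switch to the p_th-based error estimator when the
    observed order exceeds the theoretical one, but keep Fs = 1.25."""
    eps21 = abs((f2 - f1) / f1)
    p_re = observed_order(f1, f2, f3, r)   # helper defined in the sketch above
    p = p_th if p_re > p_th else p_re      # exchange the estimator when P > 1
    return 1.25 * eps21 / (r**p - 1.0)     # ...but never Fs = 3
```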
The authors then concluded from their tests that “These facts suggest that the use of the GCI1 method is closer to a 68% than a 95% confidence level.” This is a misleading statement, since the conclusion follows from using Fs = 1.25 and applying it to a dataset from Ref. [9] designed with “intentional choice of grid studies with oscillations in both exponent p and output quantity” [9]. See also Ref. [3], p. 219. The study [9] also did not use Fs = 3, but this would improve the coverage only slightly for this poor data set [10].
(4) Therefore, the evaluation in Ref. [1] of the real GCI method (GCI2) is corrupted by confusion with two distorted methods. The authors’ evaluation also does not agree with my own evaluation of some of the same data, notably the wide-ranging study of Cadafalch et al. [12] (Ref. [33] of Ref. [1]), which I presented in Ref. [4] and to which I had alerted the authors in Ref. [8]. Cadafalch et al. used an original and reasonable variant of the real GCI in which a mesh-averaged pRE was used for local uncertainty estimates. The literature survey in Ref. [1] omitted my evaluation [4], which itself covered ∼200 cases. As I had pointed out in Ref. [8], my evaluation of these uncertainty calculations using the GCI with Fs = 1.25 showed 92% coverage overall, less for upwind differencing, as expected, and 97.7% for adaptive differencing. There was one case out of 176 that was found worrisome; it had pRE = 1.2 for an expected first-order method.
(5) When the confusion over the identity of the real GCI is cleared up, the overall evaluation from Ref. [1] of the real GCI method (GCI2) is excellent. The reporting of the various subsets of data is too convoluted to repeat here completely (and the process is suspect in one aspect — see item (9) below). Briefly, for the largest subsets the real GCI method (GCI2) produces reliability of about 93% (in some subsets, 94% and higher) and the FS method produces about 96%.
The authors claim that only their latest “FS method provides a reliability larger than 95%.” The FS method certainly works, but I consider it virtually meaningless to distinguish between 93% coverage for the real GCI method (GCI2) versus 96% for the FS method when the target is roughly 95% [2-6]. Note that we are not talking about the solution itself, or even its error, but about uncertainty. It is well recognized that the target of 95%, while traditional and useful, is itself somewhat arbitrary. There is an implicit precision level evidenced by the usual lack of distinction between 95%, 20:1 odds, or 2-σ for a Gaussian distribution [4].
Note that no distribution of any kind has been assumed in the development or testing in Ref. [1] nor in any of the other work cited herein.
I applaud the authors’ examination of a large number of cases, but note that some of these are old studies, not dependable, with wildly varying results that are perhaps better discarded. On the other hand, the discarding of outliers (four outliers discarded in Sample 21, Table 5 of Ref. [1]) based on Peirce’s criterion may be acceptable, but is debatable without a justification based on the quality of the studies. See the discussion in Ref. [3], Sec. 5.16.
(6) In this regard, the 93% (and higher) coverage by the real GCI method (GCI2) represents a confirmation of the method (with a tolerance of −2%), but the results for the authors’ FS method cannot strictly be considered a confirmation, because their exercise is actually a calibration. (Here I distinguish confirmation versus calibration of an uncertainty estimator, analogous to the widely accepted distinction between validation and calibration of a model against experimental data [2,3,5].) In this study [1], the three free parameters (see Eq. (15)) of their FS method were tuned with the objective of attaining 95% coverage, so their claim of achieving >95% coverage is somewhat circular. One would expect the tuned parameters to hold generally for other datasets, but strictly speaking their FS method cannot be considered confirmed until it is applied successfully to another dataset not used in the calibration. By contrast, the real GCI method (GCI2), by now applied to probably O(1000) cases by many independent modelers, has been stable for over 12 years.
The real GCI method (GCI2) was published in mature detail in Ref. [2] in 1998. The earlier publications of 1993-1994 [13,14] already considered a likely range of less conservative Fs but recommended Fs = 3 pending further extensive studies: “But much experimentation would be required over an ensemble of problems to determine a near-optimum value and to establish the correspondence with statistical measures such as the 2-σ band” [15]. When those initial studies were completed in 1998, the result was Fs = 1.25 for well-behaved problems, as described in Ref. [2]. The method has been stable since, through a wide range of new applications [3]. The important new development has been the extension by Eça and Hoekstra to a Least-Squares GCI (beginning with Ref. [16] in 2000 and continuing through the decade) suitable for cases of noisy data; see Ref. [3], Sec. 5.11 for additional references and some details.
(7) The latest FS method is an improvement over the authors’ previous CF method, in all its versions.
See Ref. [3], p. 228 for a spotty history of the confusing variations in the earlier versions and descriptors.
Less obvious than the previously noted results for values of P = 6.01 and 0.08, but more problematical, is the result for the FS method on the ship hydrodynamics problem. (The finest grid admirably uses 8.1 × 10⁶ points.) Uncertainty estimates for well-behaved problems should decrease monotonically as the grid is refined. The uncertainty estimates in Table 6 for the FS method for the three finest grid triplets are 7.62, 8.30, and 5.22, which are not monotonically decreasing. Again, something is wrong. All four other methods show monotonic convergence, e.g., the real GCI method (denoted GCI2 in the table) gives 4.98, 4.21, 2.10. In Ref. [17], the authors had noted that the (then new version of the) CF method was “unfortunately invalid” for the finest grid triplet. The authors claimed this was “caused by the contamination of the iterative error on the fine grid.” This is possible, but (a) we do not know how the iterative error was estimated, (b) Eça and Hoekstra [20]
See also Ref. [3], Sec. 5.10.10.3 for additional references and some details.
(8) The statement in Ref. [1] that my previous studies did not use statistical techniques apparently refers to my choice not to use the Student’s t-test or other small-sample corrections to estimate statistical confidence. Instead, I had just reported straightforward “coverage,” i.e., counting of O(500) cases (incrementally accumulated over years) for which the GCI was or was not conservative compared to the actual error. The evaluation of statistical confidence level is based on a conceptual model in which the examined data set of an individual study is taken to represent the entire population, which is inaccessible. If this data set is small, then small-sample correction techniques (like the Student’s t-test) are applicable. The simpler approach just claims (say) “89.2% coverage” for an individual study (perhaps small, perhaps not) with the idea that the results of many studies will eventually be aggregated (probably informally, as I did). If small-sample corrections for “confidence interval” are made to each study, later aggregation is confused because the small-sample corrections are nonlinear and nondistributive. See Ref. [3], p. 229.
The real GCI (GCI2) is indeed widely used and recommended, as noted in Ref. [1]. It is impossible to keep track of its many applications. The number is surely well over 500, as we claimed in Ref. [5], and more likely of O(1000). In Ref. [3], Secs. 6.23.3-6.23.4, there is a list of 14 types of problems solved by Dr. C. J. Freitas and his group at the Southwest Research Institute. They have applied the real GCI to many realistic problems with far-from-ideal convergence behaviors, with good results. These include confined detonations, vortex cavitation, blast-structure interactions, fluidized beds, shock propagation, structural failure, and turbulent multiphase flows, problems solved using FVM, FDM, FEM, ALE, and Riemann solvers. “In all applications the GCI provided a very good estimate of uncertainty” [21].
(9) Although the results in Ref. [1] are consistent with previous studies, their methodology appears to be flawed in one aspect. The authors define the so-called “actual factor of safety” FSA,i for each case i (each grid triplet) and use average values (averaged over a study) to discriminate performance. The five uncertainty estimators considered are all of the form Uest = Fs × |Eest|, where Eest is some estimate of the discretization error. FSA,i was obtained formally by replacing Eest,i by the actual error Ei for each case i and inverting the algebra to solve for FSA,i = Uest,i/|Ei|. But there is no “actual factor of safety” other than the value used. The ratio (Eest,i/Ei) has been properly interpreted in Ref. [1] and elsewhere as an “effectivity index,” and the “actual factor of safety” is similar to an effectivity index but applied to an uncertainty estimate rather than an error estimate. Values of (Eest,i/Ei) ≫ 1 or FSA,i ≫ 1 would indicate excessive conservatism for a single case. But this would say little or nothing about the % coverage attained for an ensemble of cases, which is the agreed-upon target or principal performance metric. Of two uncertainty estimators with the same acceptable coverage, the one with the smaller average or maximum effectivity index would be preferred, but it would not make sense to actually use this “actual factor of safety” in any estimator of uncertainty for a new calculation.
Suspicion of the concept arises when we note that if one case happens to produce the correct solution, the result is an “actual factor of safety” FSA,i = ∞, which is difficult to interpret and suggests something is wrong. Indeed, Fig. 4 of Ref. [1] shows three methods producing maximum values ∼ 30 and two (including the latest FS method) going off-scale on log plots with FSA,i ∼ 60.
A single value of FSA,i < 1 indicates that the actual single error is outside the uncertainty band, which should happen ∼ 1 in 20 times. As the measure of coverage or reliability (Eq. (19) of Ref. [1]) this test is correct. A single value of FSA,i ≫ 1 would indicate large conservatism for that single case, which, combined with experience and intuition, may suggest an evaluation of excessive conservatism for the method.
See discussion in Ref. [3], Sec. 5.15 on “Evaluation of Uncertainty Estimators from Small Sample Studies.”
(10) Consider a data set of 20 cases with the fine-grid solution variables of interest normalized to S1 = 1. For convenience, say all normalized uncertainty estimates are Uest = 0.1, i.e., an estimated 10% “error band.” (This might have been estimated by the real GCI method for well-behaved cases using Fs = 1.25 from an error estimate of 0.08.) Suppose the true values for the first 19 of 20 cases are given by Ti = 1 + i/200, i.e., Ti = 1.005, 1.01, 1.015, …, 1.095, and suppose T20 = 1.11, so this case is not likely to be discarded as an outlier. The GCI uncertainty estimate performs exactly as intended for this synthetic case, with the interval [S1 ± Uest] = [0.9, 1.1] capturing the first 19 of 20 cases for 95% coverage. However, the “actual FSA,i” ranges from 0.91 to 20, and the “average actual FSA” = 3.59, which would seem to suggest that the original method using Fs = 1.25 is much more conservative than intended. It is not; it has performed perfectly for this example.
Other synthetic examples can be constructed to demonstrate that the concept of “average actual factor of safety” can mislead in the other direction, suggesting adequate conservatism when in fact the method is behaving terribly. Consider true values Ti = 1.1 + α with 0 < α ≪ 1 for i = 1 to 19, and T20 = 1.05, which is not an outlier. Then the “average actual FSA” ≈ 21/20 = 1.05, falsely indicating mild conservatism, when in fact the method has missed 19 of 20 cases, giving 5% coverage rather than the intended 95%.
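Both synthetic examples are easily reproduced; the following short Python sketch (my own illustration; the helper name is hypothetical) computes the coverage and the “average actual FSA” directly from the definitions above.

```python
def coverage_and_avg_fsa(true_values, s1=1.0, u_est=0.1):
    """Coverage (fraction of cases with |E_i| <= U_est) and the
    'average actual factor of safety', mean(U_est / |E_i|), for the
    synthetic examples above: every fine-grid solution is S1 = 1 and
    every uncertainty estimate is U_est = 0.1."""
    errors = [abs(t - s1) for t in true_values]
    coverage = sum(e <= u_est for e in errors) / len(errors)
    avg_fsa = sum(u_est / e for e in errors) / len(errors)
    return coverage, avg_fsa

# First example: 19 captured cases plus one miss at T20 = 1.11.
t_first = [1.0 + i / 200 for i in range(1, 20)] + [1.11]
print(coverage_and_avg_fsa(t_first))   # -> (0.95, ~3.59)

# Second example: 19 misses just outside the band, one capture at T20 = 1.05.
alpha = 1.0e-6
t_second = [1.1 + alpha] * 19 + [1.05]
print(coverage_and_avg_fsa(t_second))  # -> (0.05, ~1.05)
```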
These benign synthetic examples demonstrate that the concept used in Ref. [1] of relying on the so-called “average actual factor of safety” as an indicator of quality is misguided.
(11) Finally, some of the confusion of the authors [1] in not identifying the GCI2 method as the real (original) GCI method [2-6] is understandable, because application of the method involves a judgment call as to what constitutes a reasonable value of the observed pRE. This makes it difficult for the authors to compare the performance of different methods without being subject to unfair criticism. The judgment calls implicit in the earlier presentation of the GCI method in Ref. [2] have been made more explicit in Ref. [3] (which of course was not available to the authors of Ref. [1]), but the method is still not free of judgment calls. It would be preferable if all methods were defined in a recipe-like fashion with no room for judgment calls. However, other judgment issues are possible in a grid convergence study, e.g., how many grid triplets are adequate to claim that pRE is reasonably constant, whether or not to use a least-squares evaluation, which detailed least-squares version to use, etc. I do not expect that such judgment calls will be eliminated for poorly behaved convergence studies.
In their new V&V book, Oberkampf and Roy (Ref. [22], p. 326) have proposed a procedure that standardizes the judgment call, at least for the question of what constitutes an acceptable pRE. Their procedure accepts agreement between pRE and pth to within 10%, using p = pth and Fs = 1.25 in the GCI formula. For disagreement >10%, the procedure uses Fs = 3 and limits the p used in the GCI formula to p = min{max(0.5, pRE), pth}. I consider this a reasonable recipe, although of course other cut-offs could be justified as well. The 10% agreement limit and the lower limit of p = 0.5 (provided that pth > 0.5) are obviously judgment calls, but they are reasonable and unambiguous, and therefore repeatable in a comparative study such as Ref. [1]. A more conservative (perhaps overly conservative) procedure would not set a lower limit on p other than p > 0 (allowing for difficult problems such as singularities, as in Ref. [3], Sec. 5.10.4.1) and would use p = min(pRE, pth), instead of pth, when these agree within 10%. These two prescriptive procedures would be additional candidates for evaluation by the authors of Ref. [1] using their extensive dataset.
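As an illustration only (my own sketch in Python, not code from Ref. [22]; the relative-error form of the GCI and the function name are assumptions), the two prescriptive recipes just described might be written as:

```python
import math

def gci_prescriptive(f1, f2, f3, r=2.0, p_th=2.0, conservative=False):
    """Sketch of the two prescriptive recipes described above.
    Default: the Oberkampf-Roy recipe -- if p_RE agrees with p_th within
    10%, use p = p_th with Fs = 1.25; otherwise use Fs = 3 with
    p = min(max(0.5, p_RE), p_th).  With conservative=True: use
    p = min(p_RE, p_th) when in agreement, and impose no lower limit on
    p other than p > 0 (assumed) in the disagreement branch."""
    eps21 = abs((f2 - f1) / f1)                                # relative error estimate
    p_re = math.log(abs((f3 - f2) / (f2 - f1))) / math.log(r)  # observed order
    if abs(p_re - p_th) / p_th <= 0.10:        # agreement within 10%
        fs = 1.25
        p = min(p_re, p_th) if conservative else p_th
    else:                                      # disagreement > 10%
        fs = 3.0
        p = min(p_re, p_th) if conservative else min(max(0.5, p_re), p_th)
    return fs * eps21 / (r**p - 1.0)
```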
Acknowledgment
The author gratefully acknowledges discussions with Professor Luis Eça, Dr. C. J. Freitas, Dr. R. W. Logan, Professor C. J. Roy, Maxwell Agnew and Aaron Donahue, and input from an anonymous reviewer.