Drive for Show Putt for Dough Data Analysis
The phrase “Drive for Show, Putt for Dough” has its roots in the game of South African golfer Bobby Locke. Locke won four Open Championships between 1949 and 1957 and attributed much of his success to his putting ability. For context, this was a time when elite golfers drove the ball around 240 yards (they could sometimes reach up to 300 yards thanks to significant roll on firm, poorly irrigated fairways). Moreover, from the 1950s to the 2010s, the median length of golf courses from the longest tees increased by about 500 yards, from about 6,250 to 6,750 yards.
Today, this phrase is still used, especially when long drivers of the ball, amateurs and professionals alike, are able to bomb their drives but struggle to score. However, the modern game seems to have disproved the claim that putting is more important than off-the-tee performance. This is evident when looking at the world’s best. Consider the most recent World No. 1s (Jon Rahm, Dustin Johnson, Justin Thomas, and Rory McIlroy): while all are certainly great putters, each was also among the longest players on tour. Perhaps the best way to answer the question of whether the driver is in fact mightier than the putter is through data and data analytics.
In any data analysis, the first step is to determine the goals of the analysis. In this case, the goal is pretty obvious: answer the question, “Does driving or putting impact overall performance more?” Fortunately, there is data readily available that will assist in answering this question. The PGA Tour provides an extensive array of stats concerning different aspects of its players’ games as exhibited in PGA Tour events. From these stats, we will use the variable Strokes Gained Total to define a golfer’s overall performance. The higher a golfer’s Strokes Gained Total, the better his performance.
For this analysis, data from the years 2017 to 2021 was collected from the PGA Tour’s website. In addition to collecting the data for Strokes Gained Total, data on Approach Putt Performance, Average Distance of Putts Made, Driving Distance, and Driving Accuracy was collected.
Approach Putt Performance represents the average distance golfers had remaining to the hole after their first putt. The smaller the distance, the better the ability to lag. For example, in 2021, Branden Grace led this category with an average distance of 1’ 11” to the hole after his first putt.
Average Distance of Putts Made represents the average total length of putts that a golfer holed each round. The figure for a round increases whenever a putt is made. If the player made a 10-foot putt, 10 feet would be added to the round’s total. Similarly, if the player missed a 20-foot putt and then tapped in the remaining 2-footer, 2 feet would be added to the total. In 2021, J.T. Poston had the highest average distance of putts made at 83’ 1”.
Driving Distance is a fairly self-explanatory category. Unsurprisingly, Bryson DeChambeau was the longest player on tour with an average drive of 323.7 yards in 2021. It is worth noting, however, that the PGA Tour only collects driving distance data on two holes per round. Despite this, the data available should be accurate enough for our purposes.
Finally, Driving Accuracy indicates the percentage of fairways hit throughout the season. With an accuracy of 75.25% in 2021, Brendon Todd was the most accurate player off the tee.
Ultimately, the following analysis will indicate which of these four stats (which are known as independent variables) is most important when predicting Strokes Gained Total (which is known as the dependent variable as its value is, in theory, dependent on the values of the independent variables).
Having collected all five variables over five years, it is time to prepare the data. Simply put, this means converting the values for Approach Putt Performance and Average Distance of Putts Made from measures in feet and inches (1’ 11”) to a decimal value in feet (1.917). Next, the data needs to be consolidated so that all five stats are represented for each player in a single row. Ultimately, this resulted in 960 data points.
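The two preparation steps can be sketched in a few lines of Python. This is only an illustration: the function names, the dictionary layout, and the rule of keeping only player-seasons that appear in every stat table are assumptions, not a description of how the original data was actually processed.

```python
import re

# Matches strings such as 1' 11" (straight or curly quote marks).
_FEET_INCHES = re.compile(r"""^\s*(\d+)\s*['’‘]\s*(\d+)\s*["”“]\s*$""")


def feet_inches_to_feet(text):
    """Convert a feet-and-inches string such as 1' 11" to decimal feet."""
    match = _FEET_INCHES.match(text)
    if match is None:
        raise ValueError(f"unrecognised distance: {text!r}")
    feet, inches = int(match.group(1)), int(match.group(2))
    return feet + inches / 12


def consolidate(stats):
    """Join per-stat tables into one row per (player, year).

    `stats` maps a stat name to {(player, year): value}. Only keys found in
    every stat table are kept (an assumed rule), so each row is complete.
    """
    shared_keys = set.intersection(*(set(table) for table in stats.values()))
    return [
        {"player": player, "year": year,
         **{name: table[(player, year)] for name, table in stats.items()}}
        for player, year in sorted(shared_keys)
    ]


# Illustrative input: the raw string 2' 5" is an assumed source format that
# rounds to the 2.42 feet shown for DeChambeau later in the article.
raw = {
    "approach_putt_ft": {
        ("Bryson DeChambeau", 2021): feet_inches_to_feet("2' 5\""),
    },
    "driving_distance_yd": {
        ("Bryson DeChambeau", 2021): 323.7,
        ("Phil Mickelson", 2021): 292.4,  # missing from the other table
    },
}
rows = consolidate(raw)

print(round(feet_inches_to_feet("1' 11\""), 3))  # 1.917
print(len(rows))  # 1
```

Dropping incomplete player-seasons is one simple way to guarantee that every row carries all of the stats; a real pipeline might handle missing values differently.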
Now for the fun part! With the data ready to go, it is time to actually perform the analysis by choosing an algorithm with which to generate a model of our dataset. In this case, linear regression is an appropriate choice. For those unfamiliar with linear regression, the idea is that the regression provides a mathematical formula that estimates the value of the dependent variable from the values of the independent variables. Again, in our scenario, the dependent variable is Strokes Gained Total while the independent variables are Approach Putt Performance, Average Distance of Putts Made, Driving Distance, and Driving Accuracy. Thus, if the values of the independent variables are favorable, the regression model will suggest a high Strokes Gained Total.
Using Python and the scikit-learn (sklearn) machine learning library, the following linear regression was generated:
Variable | Coefficient | Beta Coefficient | P-Value |
---|---|---|---|
Approach Putt Performance (APP) | -1.232 | -0.255 | 0.000 |
Distance of Putts Made (DPM) | 0.061 | 0.343 | 0.000 |
Driving Distance (DD) | 0.050 | 0.603 | 0.000 |
Driving Accuracy (DA) | 0.070 | 0.512 | 0.000 |
Intercept | -20.474 | – – | – – |
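For readers who want to try this themselves, a fit of this kind might look roughly like the sketch below. The data here is synthetic, randomly generated stand-in data (not the PGA Tour dataset), and the means and spreads used to generate it are invented for illustration. Note also that scikit-learn's `LinearRegression` does not report p-values; those would have to come from another tool, such as statsmodels.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

FEATURES = ["APP", "DPM", "DD", "DA"]

# Synthetic stand-in data: 200 made-up player-seasons. The means and
# standard deviations below are assumptions chosen only to look plausible.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(2.3, 0.2, 200),    # APP, feet
    rng.normal(72.0, 5.0, 200),   # DPM, feet
    rng.normal(296.0, 9.0, 200),  # DD, yards
    rng.normal(61.0, 5.0, 200),   # DA, percent
])
# Generate the target from the article's coefficients plus random noise.
generating_coefs = np.array([-1.232, 0.061, 0.050, 0.070])
y = -20.474 + X @ generating_coefs + rng.normal(0.0, 0.3, 200)

# Ordinary least squares on the raw units gives the "Coefficient" column.
model = LinearRegression().fit(X, y)

# Refitting on z-scored inputs and target gives standardized ("beta")
# coefficients, which are comparable across variables.
X_std = StandardScaler().fit_transform(X)
y_std = (y - y.mean()) / y.std()
beta_model = LinearRegression().fit(X_std, y_std)

for name, coef, beta in zip(FEATURES, model.coef_, beta_model.coef_):
    print(f"{name}: coefficient={coef:.3f}, beta={beta:.3f}")
print(f"Intercept: {model.intercept_:.3f}")
```

Because the data is synthetic, the printed numbers will not reproduce the table above; the point is only the shape of the workflow.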
As you can see, each independent variable is listed with corresponding values: “Coefficient,” “Beta Coefficient,” and “P-Value.” The regression formula uses each of the coefficients as well as the listed value for the intercept. For example, the values for Bryson DeChambeau and Phil Mickelson in 2021 (Table 1) can be inserted into the regression model to calculate their predicted Strokes Gained Totals.
Golfer Name | Approach Putt Performance | Average Distance of Putts Made | Driving Distance | Driving Accuracy | Strokes Gained Total |
---|---|---|---|---|---|
Bryson DeChambeau | 2.42' | 76.33' | 323.7 Yards | 54.18% | 1.823 |
Phil Mickelson | 2.58' | 69.92' | 292.4 Yards | 54.47% | -0.442 |
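Plugging the Table 1 rows into the regression is plain arithmetic: multiply each stat by its coefficient, add the results, and add the intercept. A minimal sketch, using the coefficients from the regression table and hypothetical variable names:

```python
# Coefficients and intercept as listed in the regression table.
COEFFICIENTS = {"APP": -1.232, "DPM": 0.061, "DD": 0.050, "DA": 0.070}
INTERCEPT = -20.474


def predict_sg_total(stats):
    """Predicted Strokes Gained Total = intercept + sum(coefficient * stat)."""
    return INTERCEPT + sum(COEFFICIENTS[name] * stats[name] for name in COEFFICIENTS)


# 2021 values from Table 1 (feet, feet, yards, percent).
players = {
    "Bryson DeChambeau": {"APP": 2.42, "DPM": 76.33, "DD": 323.7, "DA": 54.18},
    "Phil Mickelson": {"APP": 2.58, "DPM": 69.92, "DD": 292.4, "DA": 54.47},
}

for name, stats in players.items():
    print(name, round(predict_sg_total(stats), 2))
# Bryson DeChambeau 1.18
# Phil Mickelson -0.95
```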
These computed values (1.18 and -0.95) will not necessarily match the actual Strokes Gained Totals that DeChambeau and Mickelson had (1.823 and -0.442). Unfortunately, neither machine learning nor linear regression is a perfect formula, and some error is to be expected. Furthermore, golfers know that there is much more to the game of golf than driving and putting. Using just these two aspects limits the accuracy of the model. Nevertheless, this demonstrates one of linear regression’s primary purposes: making predictions. Consequently, if we did not know a player’s Strokes Gained Total, we could use this model and the historical data from which it was created to predict it, assuming information on the four features was available.
An observation from this model is that all the coefficients except for Approach Putt Performance are positive. Thus, an increase in Distance of Putts Made will increase the predicted Strokes Gained Total. Similarly, increases in Driving Distance and Driving Accuracy will also increase it. Since Approach Putt Performance is best when the first putt is left a short distance from the hole, a greater Approach Putt Performance measure will decrease the predicted Strokes Gained Total.
The other purpose of linear regression is to identify which of the independent variables or features are most important in predicting the dependent variable. For golfers, this knowledge can tell us which aspect of our games we should focus on more. It is also the objective of this article. This is where the Beta Coefficient is useful. Essentially, the variable with the greatest Beta Coefficient in absolute value is the most important variable. Looking at the output from the analysis, Driving Distance has the greatest Beta Coefficient, followed by Driving Accuracy, Distance of Putts Made, and Approach Putt Performance. The Beta Coefficient differs from the Coefficient in that it standardizes each variable so that they can be evaluated on the same scale. Part of the reason this is necessary is that the raw values for Driving Distance (around 300 yards) are much greater than the raw values for Approach Putt Performance (around 2 feet). The regular coefficients do not account for these differences in scale. Therefore, while coefficients are good for making predictions, beta coefficients are necessary for comparing the relative importance of variables.
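To make the comparison concrete, the sketch below ranks the four variables by the absolute value of the Beta Coefficients from the regression table. It also shows the standard relationship between a raw coefficient and its standardized counterpart (raw coefficient times the standard deviation of the variable, divided by the standard deviation of the target). The helper name and the five-point sample data are hypothetical and serve only to demonstrate the formula.

```python
from statistics import pstdev

# Beta Coefficients as listed in the regression table.
BETA = {"APP": -0.255, "DPM": 0.343, "DD": 0.603, "DA": 0.512}

# Rank by magnitude: the sign gives direction, the size gives importance.
ranked = sorted(BETA, key=lambda name: abs(BETA[name]), reverse=True)
print(ranked)  # ['DD', 'DA', 'DPM', 'APP']


def standardized_coefficient(raw_coefficient, x_values, y_values):
    """Rescale a raw coefficient to standard-deviation units."""
    return raw_coefficient * pstdev(x_values) / pstdev(y_values)


# Made-up example in which y depends on x alone: y = 0.05 * x - 14.5.
distances = [280, 290, 300, 310, 320]
strokes_gained = [-0.5, 0.0, 0.5, 1.0, 1.5]
beta_example = standardized_coefficient(0.05, distances, strokes_gained)
```

In the made-up example the relationship is perfectly linear, so the tiny raw coefficient of 0.05 becomes a standardized coefficient of essentially 1, which illustrates why raw coefficients alone say little about importance.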
Briefly, the p-value column indicates whether the corresponding variable is worth including in the model, e.g., is it worth including Driving Distance? While there are no hard and fast rules in data analytics, typically if the p-value is greater than 0.05, the associated variable should be removed from the analysis. In this case, all four p-values are reported as 0.000, well below 0.05, which indicates that every variable contributes to the regression.
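As a rule of thumb, this screening step amounts to a simple filter. A minimal sketch, using the p-values from the regression table and the conventional 0.05 cutoff (the variable names are hypothetical):

```python
# P-values as listed in the regression table (rounded to three decimals,
# so 0.000 means "smaller than 0.0005", not exactly zero).
P_VALUES = {"APP": 0.000, "DPM": 0.000, "DD": 0.000, "DA": 0.000}
ALPHA = 0.05  # conventional cutoff, not a hard rule

kept = [name for name, p in P_VALUES.items() if p <= ALPHA]
dropped = [name for name, p in P_VALUES.items() if p > ALPHA]

print(kept)     # ['APP', 'DPM', 'DD', 'DA']
print(dropped)  # []
```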
Okay, so we have determined our analysis goal, chosen which PGA Tour statistics to use, collected the data, prepared the data, and created a model. It is time to evaluate the model and answer our initial question. So, is driving or putting more important? The Beta Coefficients above indicate that driving is convincingly more crucial to the success of a golfer, at least for PGA Tour golfers. Unfortunately, the games of PGA Tour professionals are not necessarily reflected in those of amateurs. Nonetheless, the data does not lie, and it seems that professional golfers are driving rather than putting for dough.