by Cyril Morong
Special thanks to Cliff Blau for his helpful comments. Special thanks to Clem Comly for checking Retrosheet files to clarify a batting order question.
Statistics like runs scored and RBI’s often tell us little about a hitter’s ability or performance level, since they depend on his teammates’ performance as well. Bill James and Pete Palmer & John Thorn have developed very good formulas to get around this problem and give us an idea of how much an individual player contributes to his team’s scoring (see sources). But the accuracy of those formulas is checked using team statistics and cannot be checked using individual statistics. For example, the Bill James Runs Created (RC) formula is said to be a good yardstick for evaluating individual players because team RC is very highly correlated with team runs and the RC value is often very close to the actual number of team runs.
Batters can contribute to team runs by moving runners along with hits, walks and sacrifices even if they don’t get credited with a run scored or an RBI. By not making outs they also give the next batter a chance for an RBI, again helping their team score. But how much value to place on those events (how much they contribute to team runs) can only be evaluated at the team level in the manner mentioned above.
This paper attempts to analyze just one part of the overall run scoring problem: batting in runs (RBI’s). A formula gives the RBI value of various events (singles, doubles, etc.). Its accuracy is checked using league data and data for individual players (which cannot be done with the RC formula). On a large scale, for league data, the formula is very accurate. It is not as accurate at the individual level, but it still predicts individual RBI totals reasonably well.
Below, I give the formula’s derivation and discuss its meaning. In general, it is home run hitters who will drive in runs. Players who lack home run power must hit for a very high batting average if they are going to drive in runs.
The RBI value of a player’s hitting statistics is determined by the following equation:
RBI Value = 0.2449*1B + 0.4468*2B + 0.6162*3B + 1.6162*HR + 0.0218*(BB + HBP)
1B = singles
2B = doubles
3B = triples
HR = Home runs
BB = non-intentional walks
HBP = the number of times hit by pitch
Sacrifice flies (SF) are not included. This is because each SF, except on very rare occasions, results in one RBI. Walks and being hit by a pitch can result in an RBI only if the bases are loaded. Only non-intentional walks are included since, except on very rare occasions, intentional walks are not issued with the bases loaded and therefore do not result in RBI’s.
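The formula translates directly into a few lines of code. The following is a minimal Python sketch; the function name `rbi_value` and the stat line in the usage example are illustrative assumptions and do not come from the study.

```python
# Coefficients copied from the RBI Value formula above.
W_1B, W_2B, W_3B, W_HR, W_BB_HBP = 0.2449, 0.4468, 0.6162, 1.6162, 0.0218


def rbi_value(singles, doubles, triples, home_runs, walks=0, hbp=0):
    """Predicted RBI's, excluding those from sacrifice flies.

    `walks` should count non-intentional walks only.
    """
    return (W_1B * singles + W_2B * doubles + W_3B * triples
            + W_HR * home_runs + W_BB_HBP * (walks + hbp))


# Hypothetical stat line (not a real player): 100 singles, 30 doubles,
# 5 triples, 20 home runs, 50 non-intentional walks, 5 times hit by pitch.
print(round(rbi_value(100, 30, 5, 20, walks=50, hbp=5), 1))  # 74.5
```

To compare the result with an actual total, subtract sacrifice flies from the player's RBI's first, since the formula leaves them out.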
Using all of the league data for the seasons from 1955-1999, the formula was less than one percent off. This was found by adding up all of the singles, doubles, triples, home runs, walks, number of times a batter was hit by a pitch, and RBI’s for the period and then applying the formula to the data. (For the RBI’s, the actual figure used was RBI - SF.)
Using the values for each type of hit or batting result mentioned above, the formula says that there should have been 629,783 RBI’s in major league baseball during this period. There were actually 631,601, so the formula was only about 0.29% too low. (If SF’s were included and given a value of 1, the formula would have been 0.268% too low.)
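The size of the league-level miss can be checked with a short calculation. This is only an arithmetic sketch using the two totals quoted above; the helper name `percent_below` is an assumption made for illustration.

```python
def percent_below(predicted, actual):
    """How far the prediction falls short of the actual total, in percent."""
    return 100.0 * (actual - predicted) / actual


predicted_rbi = 629_783  # formula total for 1955-1999, from the text
actual_rbi = 631_601     # actual total for 1955-1999, from the text

print(actual_rbi - predicted_rbi)                          # 1818
print(round(percent_below(predicted_rbi, actual_rbi), 2))  # 0.29
```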
Furthermore, when the formula was applied to the individual season totals for each league, the predictions for only nine of the 90 league seasons were off by more than five percent. Only one season was off by more than ten percent, the NL of 1963, which was predicted 14% too high. Fifty-four seasons were predicted too low and 36 too high.
I then “normalized” the prediction for each season by first calculating a normalized on-base percentage (OBP) for each season. The "average" OBP for the 90 seasons was .3239, found as a simple average: adding up the OBP's for each season and dividing by 90. Then the OBP for each season was divided by .3239. This normalized OBP was then multiplied by the RBI values for walks and each hit. For example, if the league OBP was .310, the RBI value of a home run would be 1.6162*0.957 (=1.5467) since .310/.3239 is .957. Then those normalized RBI values were used to predict RBI's for each league in each season.
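The normalization step can be sketched as follows. The names `BASE_WEIGHTS` and `normalized_weights` are assumptions made for this illustration; the baseline OBP and the coefficients are the ones given in the text.

```python
BASELINE_OBP = 0.3239  # simple average of the 90 league-season OBPs

# Un-normalized RBI values from the formula.
BASE_WEIGHTS = {"1B": 0.2449, "2B": 0.4468, "3B": 0.6162,
                "HR": 1.6162, "BB_HBP": 0.0218}


def normalized_weights(league_obp, baseline=BASELINE_OBP):
    """Scale every RBI value by the league OBP relative to the baseline."""
    factor = league_obp / baseline
    return {event: weight * factor for event, weight in BASE_WEIGHTS.items()}


weights = normalized_weights(0.310)
print(round(weights["HR"], 3))  # 1.547
```

The idea behind the scaling is that a league with a lower OBP puts fewer runners on base, so every hit finds fewer runners to drive in.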
In that case, the prediction for 43 of the seasons was within 1%. 69 were within 2%. 82 were within 3%. Only 1 season was off by more than 5%, the NL of 1963 (8.4% too high). Only 3 seasons were off by more than 4%. 44 seasons were under predicted. Only 2 were under predicted by more than 3%.
Source of the formula
The formula is based on the frequency of runner situations. They are:
No runners on base: 55.1%
1 runner on base: 30.3%
2 runners on base: 12.37%
3 runners on base: 2.18%
For specific situations:
Runner on first base: 17.55%
Runners on first and second: 6.78%
Runners on first and third: 3.16%
Bases loaded: 2.18%
Runner on second base: 9.56%
Runners on second and third: 2.43%
Runner on third base: 3.24%
These are frequencies that have been reported on the Society for American Baseball Research daily email digest by Tom Ruane. He has also reported that, in general, singles, doubles, triples, home runs, walks, and being hit by pitch occur with about the same frequency in each of the runner situations.
This means that there are no runners on base for 55.1% of the plate appearances, 1 runner on base for 30.3% of the plate appearances, etc.
Using triples as an example, we can get the RBI value of the typical triple with the following equation:
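0*0.551 + 1*0.303 + 2*0.1237 + 3*0.0218 = 0.6162
(With the rounded frequencies shown, the sum works out to 0.6158. The three one-runner situations listed above add up to 30.35% rather than 30.3%, which gives about 0.6163, so the 0.6162 in the formula presumably reflects less heavily rounded frequencies.)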
The same can be done for the other batting events, like home runs and walks. For singles an assumption was made that 64.4% of the runners scored from second base on singles. For doubles an assumption was made that 42.9% of the runners scored from first base on doubles. These assumptions are based on data presented on John Jarvis’s website.
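The derivation can be reproduced from the specific runner situations listed above. The sketch below is an illustration, not the author's own calculation. It assumes that a runner on third always scores on a single, that a runner on first never scores on a single, and that runners on second and third always score on a double; the text states only the 64.4% and 42.9% figures. The names `STATE_FREQ` and `expected_rbi` are made up for this example.

```python
# (runner on 1st, runner on 2nd, runner on 3rd) -> share of plate appearances
STATE_FREQ = {
    (1, 0, 0): 0.1755,
    (1, 1, 0): 0.0678,
    (1, 0, 1): 0.0316,
    (1, 1, 1): 0.0218,
    (0, 1, 0): 0.0956,
    (0, 1, 1): 0.0243,
    (0, 0, 1): 0.0324,
}


def expected_rbi(p_first, p_second, p_third, batter_scores=False):
    """Average RBI's per event, given each runner's chance of scoring."""
    total = 1.0 if batter_scores else 0.0
    for (on1, on2, on3), freq in STATE_FREQ.items():
        total += freq * (on1 * p_first + on2 * p_second + on3 * p_third)
    return total


single = expected_rbi(0.0, 0.644, 1.0)    # published value: 0.2449
double = expected_rbi(0.429, 1.0, 1.0)    # published value: 0.4468
triple = expected_rbi(1.0, 1.0, 1.0)      # published value: 0.6162
home_run = expected_rbi(1.0, 1.0, 1.0, batter_scores=True)  # 1.6162
walk = STATE_FREQ[(1, 1, 1)]  # only a bases-loaded walk forces in a run

for name, value in [("1B", single), ("2B", double), ("3B", triple),
                    ("HR", home_run), ("BB/HBP", walk)]:
    print(f"{name}: {value:.4f}")
```

Under these assumptions each computed value lands within about 0.0002 of the published coefficient; the small gaps are presumably due to rounding in the reported frequencies.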
Accuracy of the formula for individual players
For a single season, this formula will not work too well since RBI’s can depend so much on the on-base percentage of the hitters who bat in front of a player. But for career data (or something close) such differences will tend to diminish. So my first look at individual players took players from the period of 1988-98. I then eliminated all individual seasons with fewer than 400 plate appearances. Then all of the statistics for the remaining players were totaled for the period (1988-98). Then all players with fewer than 4000 plate appearances were eliminated. This left 77 players.
For each player, I removed their RBI’s from sacrifice flies. I also added up their at-bats, non-intentional walks and number of times hit by pitch (call this adjusted plate appearances or ADJPA). I then calculated (RBI - SF)/ADJPA, or RBI frequency. The same was done for each player’s singles, doubles, triples, etc.
The mean RBI frequency for the 77 players during this period was 0.121. The RBI formula was applied to these frequencies for the 77 individual players with adjustments made for their batting order position. Players who batted mostly first or second had 0.015 subtracted from the value predicted by the formula. Players who batted mostly third through sixth had 0.01 added to their predicted value while those who batted mostly seventh through ninth had 0.01 subtracted. There were only two players in the 7-9 group. This is because players who bat in those slots tend to get fewer plate appearances and might also be below average hitters who tend not to play every game. So it is hard for them to get 4000 plate appearances in a given time period.
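The per-plate-appearance version of the prediction, with the batting order adjustment, might look like the sketch below. It assumes that RBI frequency means RBI's minus sacrifice flies, divided by ADJPA, which is how the surrounding text reads. The function names and the stat line are hypothetical.

```python
# Adjustment to predicted RBI frequency by usual lineup slot (from the text).
# None means no adjustment, for players who moved around the lineup.
ORDER_ADJUSTMENT = {"1-2": -0.015, "3-6": 0.010, "7-9": -0.010, None: 0.0}


def actual_rbi_frequency(rbi, sac_flies, at_bats, walks, hbp):
    """(RBI - SF) / ADJPA, where ADJPA = AB + non-intentional BB + HBP."""
    return (rbi - sac_flies) / (at_bats + walks + hbp)


def predicted_rbi_frequency(singles, doubles, triples, home_runs,
                            walks, hbp, at_bats, order_group=None):
    """Formula value per adjusted plate appearance, plus lineup adjustment."""
    adjpa = at_bats + walks + hbp
    raw = (0.2449 * singles + 0.4468 * doubles + 0.6162 * triples
           + 1.6162 * home_runs + 0.0218 * (walks + hbp)) / adjpa
    return raw + ORDER_ADJUSTMENT[order_group]


# Hypothetical stat line (not a real player), 550 at-bats.
line = dict(singles=100, doubles=30, triples=5, home_runs=20,
            walks=50, hbp=5, at_bats=550)
print(round(predicted_rbi_frequency(**line), 4))                     # 0.1231
print(round(predicted_rbi_frequency(**line, order_group="3-6"), 4))  # 0.1331
```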
The actual RBI frequency for each player was subtracted from the predicted value for each player. The average absolute discrepancy was 0.00605. This is only 5% of the mean for the actual data. Being off by 0.00605 for 700 plate appearances (about a full season) means just 4.24 RBI’s. 43 players were under predicted, meaning the formula predicted that they would have fewer RBI’s than they actually did. 34 were over predicted, meaning the formula predicted that they would have more RBI’s than they actually did.
54 of the 77 players were predicted within 5 RBI’s for a season of 700 plate appearances. Only five were off by more than 10 RBI’s, with the largest being -12.85. The correlation between actual RBI frequency and the predicted frequency was .972. In a regression in which the actual RBI frequency was the dependent variable and the predicted frequency the independent variable, the standard error was just 0.0075. That would be only 5.25 RBI’s for a 700 plate appearance season.
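The conversions from frequency errors to RBI's per season are simple enough to verify directly. In this sketch the helper names are assumptions, and only the summary figures quoted in the text are used; no player data is reproduced.

```python
def mean_abs_discrepancy(actual, predicted):
    """Average absolute gap between predicted and actual RBI frequencies."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def per_season(frequency, plate_appearances=700):
    """Convert a per-plate-appearance figure into RBI's over a season."""
    return frequency * plate_appearances


# Summary figures for the 1988-98 group, taken from the text.
print(round(per_season(0.00605), 1))  # 4.2  (average absolute discrepancy)
print(round(per_season(0.0075), 2))   # 5.25 (regression standard error)
print(round(0.00605 / 0.121, 3))      # 0.05 (share of the mean RBI frequency)
```

`mean_abs_discrepancy` would be fed two equal-length lists, one of actual and one of predicted frequencies, in the same player order.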
The player who was under predicted by 12.85 was Cecil Fielder. The only explanation that I can think of is that he had an exceptionally good leadoff man, Tony Phillips.
The player who was off by +12.73 was Barry Larkin, whom I counted in the 3-6 batting order category. But he was a close call and has often batted second. If I made no batting order adjustment for Larkin (which is reasonable, since he has moved around in the batting order), his prediction would be off by only 5.73 RBI’s.
The next worst predicted player was Marquis Grissom. He was off by 11.25. But initially I made no batting order adjustment for him because he has batted third quite a bit, even though I always thought of him as a leadoff man. If he were put into the 1-2 category, he would only be off by 0.75. (Three other players had no batting order adjustment because they were moved around the lineup: Lance Johnson, Julio Franco and Ray Lankford.)
One other player was off by just a little more than 10 RBI’s. This was Gregg Jefferies, who was placed in the 3-6 category. Jefferies was a pretty close call: he barely had enough at-bats to make it in and has batted second a lot. If I made no batting order adjustment for Jefferies, his prediction would be off by only 3.09 RBI’s.
The formula was also applied to players from the 1960-69 period who had 4000 or more plate appearances. I did not have sources listing batting order position for individual players, so I assumed the top two-thirds of the players in RBI’s per plate appearance batted 3-6 while all others batted 1-2. (For the group of 77 from 1988-98, only five players did not fit this assumption. I will email a list of players who fell into each category to anyone interested. The formula initially gave a very bad prediction for Felipe Alou, being 18 RBI’s too high. But I asked Clem Comly to check Retrosheet files and found out that Alou was mostly a number 1 or 2 batter instead of a 3-6 hitter.) The same adjustment for batting order position was made as for the 1988-98 period. The mean RBI frequency was 0.115.
In checking the accuracy of the formula for the 1960’s, I “normalized” the RBI values of the hits and walks in the same way as I did for season totals for each league mentioned above. 26 of the 43 players were predicted to be within 5 RBI’s of their actual total for 700 plate appearances. 22 players were under predicted and 21 were over predicted. Only three players were off by more than 10 RBI’s, with the worst being off by 11.7. The average absolute discrepancy was 0.005837, or just 5.08% of the mean of 0.115. This is just 4.09 RBI’s for 700 plate appearances.
The correlation between actual RBI frequency and the predicted frequency was .973. In a regression in which the actual RBI frequency was the dependent variable and the predicted frequency the independent variable, the standard error was just 0.00745. That would be only about 5.2 RBI’s for a 700 plate appearance season (0.00745 × 700).
Although a few players were not predicted very accurately by the formula, it was reasonably accurate for both the 1988-98 period and the 1960-69 period.
The formula was also about as accurate when I used RBI’s per out made instead of RBI’s per plate appearance. (Of course, singles, doubles, etc. per out made were also used.) In that case, players were actually being punished for low batting averages, therefore rewarding players with high averages. But a formula that shows home run power to be the most important factor in RBI’s (see the next section) was still reasonably accurate.
Interpreting the RBI formula
What does the formula tell us? It tells us that power hitters drive in runs. Here is an example:
There are two players, A and B. They each get 600 at-bats and 60 walks. Player A has a .360 batting average with 160 singles, 32 doubles, 8 triples and 16 home runs. Player B has a .270 batting average with 106 singles, 20 doubles, 4 triples and 32 home runs.
Applying the formula to each player, we would get 85.6 RBI’s for player A but 90.4 for player B, even though he hits for a much lower batting average. It is the home runs that give him the greater RBI potential. In fact, player A would have to hit .393 to get the same number of RBI’s as player B, assuming that all of the additional hits are singles. In this particular case, it means that each extra home run hit by player B is worth 7.69 points of batting average in terms of RBI potential. In other words, players who don’t hit home runs must hit for a very high batting average to make up for their lack of RBI potential. (If we gave player A 40 doubles instead of 32 and he still had his .360 average, he would still get fewer RBI’s, about 87.2.)
If we assume that player B also had 32 doubles and 8 triples, the same number as player A, then player B would get about 94.3 RBI’s. For player A to match that with just 16 home runs, he would have to bat about .420, again assuming that all of the additional hits are singles. In this case, then, each extra home run hit by player B is worth 9.375 points of batting average in terms of RBI potential.
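The two comparisons above can be reproduced with a short script. This is a sketch under one stated assumption: in the second scenario player B is taken to keep his .270 average, so the extra doubles and triples replace singles and leave him with 90 singles. The helper `break_even_average` is a name invented for this illustration.

```python
W_1B, W_2B, W_3B, W_HR, W_BB = 0.2449, 0.4468, 0.6162, 1.6162, 0.0218


def rbi_value(singles, doubles, triples, home_runs, walks):
    """RBI value from the formula (sacrifice flies excluded)."""
    return (W_1B * singles + W_2B * doubles + W_3B * triples
            + W_HR * home_runs + W_BB * walks)


def break_even_average(target_rbi, singles, doubles, triples, home_runs,
                       walks, at_bats):
    """Batting average needed to reach target_rbi by adding singles only."""
    shortfall = target_rbi - rbi_value(singles, doubles, triples,
                                       home_runs, walks)
    extra_singles = max(shortfall, 0.0) / W_1B
    hits = singles + doubles + triples + home_runs + extra_singles
    return hits / at_bats


AT_BATS, WALKS = 600, 60
player_a = (160, 32, 8, 16)   # .360 hitter
a = rbi_value(*player_a, WALKS)
b = rbi_value(106, 20, 4, 32, WALKS)       # .270 hitter
b_more_xbh = rbi_value(90, 32, 8, 32, WALKS)  # assumed: still 162 hits

print(round(a, 1), round(b, 1), round(b_more_xbh, 1))  # 85.6 90.4 94.3
print(round(break_even_average(b, *player_a, WALKS, AT_BATS), 3))  # 0.393
print(round(break_even_average(b_more_xbh, *player_a, WALKS, AT_BATS), 3))  # 0.419
```

Working from unrounded RBI values, the second break-even average comes out near .419, about a point below the .420 quoted in the text, which appears to have been derived from rounded RBI totals.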
I encourage readers to try other examples. But in general, players who hit a low number of home runs will need to hit for a very high batting average or hit an extremely high number of doubles and triples to make up for the lack of home runs.
This paper has shown that hitting home runs is the main determinant of major league RBI totals. This is based on a formula that is extremely accurate for large, aggregate data sets and still reasonably accurate at the individual level. And, as mentioned above, the formula is reasonably accurate at the individual level on an RBI per plate appearance basis as well as an RBI per out made basis. This means even players who have low batting averages yet hit for home run power will still drive in a large number of runs.
Of course, players still need to get on base in order for other players to get RBI’s. So this paper has only examined one part of the overall run scoring issue. But it helps clarify which players will be valuable to major league teams in their efforts to score runs and win games.
The website of Tom Ruane’s that I mentioned earlier is called “RBI Production--A New Look at an Old Stat.” The address is:
For the major league season by season statistics for the period of 1955-99, I used the STATS, INC. All-Time Baseball Sourcebook. (Bill James is associated with STATS, INC.)
For individual player statistics, I used the Sean Lahman Baseball Archive.
Palmer & Thorn have published several editions of Total Baseball, which is now considered to be the official encyclopedia of baseball. One of their statistics is called “Batting Runs” and it assigns a run value to each event (single, double, etc.). That is, they attempt to value each event in terms of the runs it contributes to the team. What I have done here is much less ambitious, since I only look at RBI’s. I have not attempted to analyze runs scored or how a player could contribute to team runs by moving runners along or just getting on base and giving another player a chance.
John F. Jarvis website: “A Collection of Team-Season Statistics.” The address is at: