Background
A client recently approached us about the feasibility of using predictive modeling techniques to evaluate "equine pari-mutuel investments" … in other words, forecasting horse races. After quite a bit of discussion, the objective was clarified as the ability to invest small to moderate amounts of capital to earn a +15% ROI. Could this be a conservative investment option for the average person?
Was it possible? Frankly, we had no idea …
In short order, after ascertaining that the raw data needed for analysis was commercially available, the EPM (Equine Pari-Mutuel) modeling project was kicked off with equal measures of anticipation and skepticism. We were very curious whether it could be done, but we had our suspicions that this might be wasted effort.
Our first step for a new modeling project is to use a data mining methodology (CRISP-DM) to break the project down into manageable tasks:
Phase I
Design and load the EPM data warehouse
Both pre-race (past performances) and post-race (results and pay-offs) data were available electronically from several vendors. After a product/pricing evaluation, a vendor was selected and a download schedule for racing programs was developed.
It was
determined that a minimum
of 10,000 modeling observations would be used for the
forecast, so a
target was established to download 150+ race programs based on an
estimate of 9
races per program, and 8 horses per race; the data for each horse for
each race
would comprise a single observation.
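The sizing estimate above is simple back-of-the-envelope arithmetic. A minimal sketch, using only the figures quoted in the text (the variable names are ours, chosen for illustration):

```python
# Figures taken from the estimate above; names are illustrative only.
programs = 150          # race programs to download (the target was "150+")
races_per_program = 9   # estimated races per program
horses_per_race = 8     # estimated horses per race
minimum_needed = 10_000 # minimum modeling observations

# One observation = the data for one horse in one race.
expected_observations = programs * races_per_program * horses_per_race

print(expected_observations)                    # 10800
print(expected_observations >= minimum_needed)  # True
```

At exactly 150 programs the estimate clears the 10,000 minimum by a modest margin, which is presumably why the target was stated as "150+".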
ETL
(extract/transform/load)
processes were designed to load the raw data into a relational model in
the
newly-designed EPM data warehouse; this would serve as the source for
the datamart
itself as it went through the cleansing/transformation process.
Phase II
Load, clean, and profile the data
Data profiling
techniques were used
to evaluate the accuracy and cleanliness of the data as it was loaded
into the
target EPM datamart.
A series of data filters was developed to either clean or reject the
data as
needed. The final datamart
contained over 17,000 observations with 3+
million data points.
| Measure | Count | Notes |
| Total Races | 1,811 | |
| Horses Entered | 16,630 | |
| Horses Raced | 14,218 | 7.9 per race; 14.13% scratched |
| Past performance entries | 68,469 | |
| Data points | 3,423,450 | |

| Measure | Avg | Min | Max |
| Win payoff | $12.61 | $2.20 | $208.20 |
| Place payoff | $6.01 | $2.10 | $76.40 |
| Show payoff | $4.08 | $2.10 | $32.80 |
| Winning margin (lengths) | 2.49 | 0.05 | 32.75 |

| Race Type | Races | % of Races |
| Sprint races (< 1M) | 1,055 | 58.26% |
| Route races (>= 1M) | 756 | 41.74% |
| Dirt races | 1,442 | 79.62% |
| Turf races | 369 | 20.38% |
Phase III
Establish a
model baseline
We created ROI (Return On Investment) baselines by using the datamart to evaluate other widely used handicapping techniques; if one or more of these methods met the +15% ROI target, the project objective would already be achieved.
The “morning line” is the pre-race prediction of each horse’s odds as established by a track handicapper. The lowest morning line in each race is considered the pre-race favorite and was the basis for this first selection method; if more than one horse in a race shared the lowest odds, the race was excluded from evaluation. Here are the results:
| Analysis Type | Bets | Bet$ | M/L | Odds | Wins | Win% | Avg$ | Win$ | Net$ | Win ROI |
| M/L Single Favorite - ALL Races | 1579 | $3,158 | 2.43 | 1.99 | 503 | 31.86% | $5.27 | $2,651 | ($507) | -16.06% |
A brief explanation of the data columns:
| Column | Description |
| Bets | Total number of races bet where a single selection could be used. |
| Bet$ | Total amount wagered based on a $2 win bet. |
| M/L | The average morning line odds (to 1) for the selected horses. |
| Odds | The average post-time (final) odds for the selected horses. |
| Wins | Total number of wins. |
| Win% | Percentage of wins to number of bets. |
| Avg$ | Average win payoff for $2 bet. |
| Win$ | Total payoff for all wins. |
| Net$ | Net win or (loss) for all bets. |
| Win ROI | Net return/loss percentage for total amount wagered. |
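The derived columns follow from just three inputs: the number of bets, the number of wins, and the average winning payoff. As a rough sketch (the function name is ours, and rebuilding Win$ as Wins × Avg$ is an assumption; because the inputs are already rounded, the totals can differ slightly from the published figures), the morning line favorite row can be approximated like this:

```python
def win_bet_summary(bets, wins, avg_payoff, stake=2.0):
    """Approximate the derived table columns for flat win bets.

    avg_payoff is the average payoff per winning ticket at the given stake.
    Totals are rebuilt from rounded inputs, so they are approximate.
    """
    bet_total = bets * stake        # Bet$
    win_total = wins * avg_payoff   # Win$ (approximate)
    net = win_total - bet_total     # Net$
    return {
        "bet_total": bet_total,
        "win_pct": 100.0 * wins / bets,      # Win%
        "win_total": win_total,
        "net": net,
        "roi_pct": 100.0 * net / bet_total,  # Win ROI
    }

# Inputs from the "M/L Single Favorite - ALL Races" row.
summary = win_bet_summary(bets=1579, wins=503, avg_payoff=5.27)
print({name: round(value, 2) for name, value in summary.items()})
```

The same function can be pointed at any of the other result rows to check how the winning percentage and average payoff combine into the ROI.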
As you can see, betting on the morning line favorite selects almost 32% winners, but at an average $2 win payoff of only $5.27, the net loss of $507 creates a negative ROI (-16.06%). One point to note is that the average odds dropped 18% from the morning line to post time (2.43 to 1.99), meaning that the betting public agreed with the selections and drove down the winning payoffs.
Next we examined selections based on the post-time betting favorites. Post-time odds are set by the share of the total amount wagered that the crowd bets on each horse; the more money bet on a horse, the lower its resulting odds:
| Analysis Type | Bets | Bet$ | M/L | Odds | Wins | Win% | Avg$ | Win$ | Net$ | Win ROI |
| Actual Odds Favorite - ALL Races | 1603 | $3,206 | 2.85 | 1.59 | 529 | 33.00% | $4.82 | $2,550 | ($657) | -20.48% |
If you bet along with the crowd, you will actually select 33% winners, but the meager average $2 win payoff of $4.82 is less than the minimum payoff of about $6 needed to break even; the net result is a $657 loss for a -20.48% ROI. Not a good long-term investment strategy.
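The break-even figure comes from a one-line relationship: with flat bets, the average winning payoff multiplied by the win rate has to equal the stake. A minimal sketch (the function name is ours):

```python
def break_even_payoff(win_rate, stake=2.0):
    """Average winning payoff needed for flat win bets to break even.

    Break-even means win_rate * payoff == stake, so payoff = stake / win_rate.
    """
    if not 0 < win_rate <= 1:
        raise ValueError("win_rate must be in (0, 1]")
    return stake / win_rate

needed = break_even_payoff(0.33)  # post-time favorites won 33% of the time
print(f"${needed:.2f}")           # $6.06
```

Against that threshold, the observed $4.82 average payoff falls well short, which is what produces the loss.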
Next we examined the first of two well-known published speed rating numbers. This rating number was developed a number of years ago to evaluate past races across different distances and different tracks, and seems to be frequently used by the public, as well as the media for comparing horses.
| Analysis Type | Bets | Bet$ | M/L | Odds | Wins | Win% | Avg$ | Win$ | Net$ | Win ROI |
| Speed Rating ‘A’ - ALL Races | 1581 | $3,162 | 3.27 | 2.44 | 459 | 29.03% | $5.59 | $2,566 | ($596) | -18.85% |
The results were a slightly lower win percentage and a slightly higher win payoff, but still a net loss and a negative ROI (-18.85%). The earlier reported accuracy of this particular rating may have caused the payoff odds to erode as it gained popularity with the betting public, but it has limited value at this point. Again, the drop from average morning line odds of 3.27 to post-time odds of 2.44 may indicate that the public over-bets these selections.
| Analysis Type | Bets | Bet$ | M/L | Odds | Wins | Win% | Avg$ | Win$ | Net$ | Win ROI |
| Speed Rating ‘B’ - ALL Races | 1488 | $2,976 | 5.19 | 6.13 | 343 | 23.05% | $7.24 | $2,482 | ($494) | -16.61% |
The bets selected by this strategy returned the highest average payoffs of the four ($7.24), but at a lower winning percentage (23.05%). Again, the combination of payoff odds and winning percentage created a net loss of $494 and again a negative ROI (-16.61%). Interestingly, the post-time odds rose 18% from the morning line, indicating that the public doesn't singularly use this particular method for selection to any great degree.
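The movement from morning line to post-time odds is a useful signal in all four baselines, so it is worth making the calculation explicit. A small sketch using the average odds reported for each baseline (the function and dictionary names are ours):

```python
def odds_change_pct(morning_line, post_time):
    """Percent change from average morning line odds to average post-time odds."""
    return 100.0 * (post_time - morning_line) / morning_line

# (average morning line, average post-time odds) from the four baselines
baselines = {
    "M/L favorite": (2.43, 1.99),
    "Post-time favorite": (2.85, 1.59),
    "Speed Rating A": (3.27, 2.44),
    "Speed Rating B": (5.19, 6.13),
}

for name, (ml, post) in baselines.items():
    # Negative = the public bet the selections down; positive = odds drifted up.
    print(f"{name}: {odds_change_pct(ml, post):+.1f}%")
```

Only Speed Rating 'B' shows a positive change, consistent with the observation that the public does not lean heavily on that method.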
Model Baseline Summary
Regardless of the individual model differences, we were 0 for 4 in trying to identify a currently available metric that produces a positive ROI on forecasting horse races.
Having eliminated all of the obvious selection criteria as acceptable methods, we now moved into the predictive analytics phase of the project.
Phase IV
Predictive Modeling
The biggest
challenge in predictive
modeling is always to answer the right question.
Through a combination of cluster analysis and regression, we would try to identify and group segments of races together to create separate forecasting models for each. This technique offered 2 benefits:
As
an example, it would be easier to forecast food supplies needed for a
restaurant if past customer data is categorized by meal rather than
lumped together.
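The restaurant analogy can be made concrete with a toy comparison. The sketch below is not the EPM model and involves no clustering or regression; the customer counts are invented purely for illustration, and all names are ours. It only shows why a forecast built per segment tends to have a smaller error than one built on pooled data:

```python
from statistics import mean

# Hypothetical daily customer counts, invented purely for illustration.
history = {
    "breakfast": [40, 44, 38, 42],
    "lunch": [85, 80, 90, 83],
    "dinner": [120, 126, 118, 124],
}

# Pooled approach: a single average used as the forecast for every meal.
all_counts = [c for counts in history.values() for c in counts]
pooled_forecast = mean(all_counts)
pooled_error = mean(abs(c - pooled_forecast) for c in all_counts)

# Segmented approach: each meal is forecast with its own average.
segment_errors = [
    abs(c - mean(counts)) for counts in history.values() for c in counts
]
segmented_error = mean(segment_errors)

print(f"pooled mean absolute error:    {pooled_error:.1f}")
print(f"segmented mean absolute error: {segmented_error:.1f}")
```

In the racing context the segments would be groups of similar races rather than meals, each with its own forecasting model.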
Phase V
Model Evaluation
Of course, the big question we had in the beginning … “Is it possible?”
| Analysis Type | Bets | Bet$ | M/L | Odds | Wins | Win% | Avg$ | Win$ | Net$ | Win ROI |
| EPM model - ALL clusters | 1740 | $3,480 | 5.66 | 7.10 | 410 | 23.56% | $9.55 | $3,917 | $437 | +12.55% |
| Analysis Type | Bets | Bet$ | M/L | Odds | Wins | Win% | Avg$ | Win$ | Net$ | ROI |
| Cluster A Model | 43 | $86 | 10.29 | 12.72 | 13 | 30.23% | $13.92 | $181 | $95 | +110.47% |
| Cluster B Model | 195 | $390 | 8.35 | 11.37 | 43 | 22.05% | $15.95 | $686 | $296 | +75.90% |
| Cluster C Model | 305 | $610 | 5.83 | 7.11 | 75 | 24.59% | $10.01 | $751 | $141 | +23.11% |
| Analysis Type | Bets | Bet$ | M/L | Odds | Wins | Win% | Avg$ | Win$ | Net$ | ROI |
| Cluster X Model | 91 | $182 | 6.16 | 8.82 | 18 | 19.78% | $8.39 | $151 | ($31) | -17.03% |
| Cluster Y Model | 41 | $82 | 8.83 | 11.78 | 4 | 9.76% | $7.25 | $29 | ($53) | -64.63% |
| Cluster Z Model | 528 | $1,056 | 6.96 | 7.72 | 92 | 17.42% | $8.72 | $802 | ($254) | -24.05% |
| Analysis Type | Bets | Bet$ | M/L | Odds | Wins | Win% | Avg$ | Win$ | Net$ | ROI |
| Top 2 Clusters Combined | 238 | $476 | 8.70 | 11.61 | 56 | 23.53% | $14.95 | $837 | $391 | +75.84% |
| Top 3 Clusters Combined | 543 | $1,086 | 7.09 | 9.08 | 131 | 24.13% | $12.35 | $1,618 | $532 | +48.99% |
| Top 4 Clusters Combined | 1,050 | $2,100 | 6.09 | 7.47 | 264 | 25.14% | $10.63 | $2,805 | $705 | +33.57% |
To use the 4-Cluster model as an example, by filtering out clusters predicted less profitable, the investor is able to select 25% winners at an average $2 win payoff of $10.63 which creates an ROI of +33.57%.
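A combined row is the sum of its cluster rows, with the percentages recomputed from the totals. As a sketch (the helper name is ours, and the inputs are the rounded Bets, Wins and Win$ values published for Clusters A, B and C), the Top 3 Clusters line can be rebuilt like this:

```python
# (bets, wins, Win$) copied from the Cluster A, B and C rows.
clusters = {
    "A": (43, 13, 181),
    "B": (195, 43, 686),
    "C": (305, 75, 751),
}

def combine(rows, stake=2):
    """Aggregate per-cluster results into one combined summary."""
    rows = list(rows)
    bets = sum(r[0] for r in rows)
    wins = sum(r[1] for r in rows)
    win_total = sum(r[2] for r in rows)
    bet_total = bets * stake
    net = win_total - bet_total
    return {
        "bets": bets,
        "bet_total": bet_total,
        "wins": wins,
        "win_pct": 100.0 * wins / bets,
        "avg_payoff": win_total / wins,
        "win_total": win_total,
        "net": net,
        "roi_pct": 100.0 * net / bet_total,
    }

top3 = combine(clusters.values())
print({name: round(value, 2) for name, value in top3.items()})
```

Note that the combined ROI is not an average of the cluster ROIs; larger clusters carry more weight because the totals are summed before the ratio is taken.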
Additionally, the morning line odds may be used as a dynamic filter. By only actively betting the selections that meet or exceed a predetermined morning line level, the investor may increase his potential ROI at the expense of fewer betting opportunities.
| Analysis Type | Bets | %Races | Bet$ | M/L | Odds | Wins | Win% | Avg$ | Win$ | Net$ | ROI |
| EPM Model - Top 4 Clusters | 1050 | 100.00% | $2,100 | 6.09 | 7.47 | 264 | 25.14% | $10.63 | $2,805 | $705 | 33.57% |
| EPM Model - M/L > 1/1 | 1045 | 99.52% | $2,090 | 6.11 | 7.51 | 259 | 24.78% | $10.78 | $2,793 | $703 | 33.64% |
| EPM Model - M/L > 2/1 | 965 | 91.90% | $1,930 | 6.49 | 8.03 | 218 | 22.59% | $11.97 | $2,610 | $680 | 35.23% |
| EPM Model - M/L > 3/1 | 797 | 75.90% | $1,594 | 7.37 | 9.32 | 159 | 19.95% | $14.47 | $2,300 | $706 | 44.29% |
| EPM Model - M/L > 4/1 | 614 | 58.48% | $1,228 | 8.62 | 11.23 | 106 | 17.26% | $18.15 | $1,924 | $696 | 56.68% |
| EPM Model - M/L > 5/1 | 465 | 44.29% | $930 | 10.05 | 13.43 | 74 | 15.91% | $22.00 | $1,628 | $698 | 75.05% |
| EPM Model - M/L > 6/1 | 386 | 36.76% | $772 | 11.08 | 14.92 | 56 | 14.51% | $24.89 | $1,394 | $622 | 80.57% |
| EPM Model - M/L > 8/1 | 274 | 26.10% | $548 | 13.16 | 17.91 | 40 | 14.60% | $28.50 | $1,140 | $592 | 108.03% |
| EPM Model - M/L > 10/1 | 200 | 19.05% | $400 | 15.06 | 21.00 | 31 | 15.50% | $31.55 | $978 | $578 | 144.50% |
For example, if the investor decides to bet only selections whose M/L odds are 3/1 or greater …
· there were 797 races that met that criterion (about 76% of the bettable races) …
· the average M/L for those horses was over 7-1 (7.37) …
· by the time the horses went to post, the average final odds were over 9-1 (9.32) …
· the win percentage was just under 1 out of 5 (19.95%) …
· the average $2 win pay-off was $14.47 (making the average winner's final odds 6.23) …
· the net ROI goes up to +44.29%
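The figures in this example can be checked directly from the filter table. A short sketch (the function and variable names are ours) that converts a $2 payoff to odds and recomputes the percentages for the 3/1 filter:

```python
def payoff_to_odds(payoff, stake=2.0):
    """Convert a win payoff (which includes the returned stake) to odds-to-1."""
    return (payoff - stake) / stake

# Figures from the 3/1 filter row and the unfiltered Top 4 Clusters row.
filtered_bets, filtered_wins = 797, 159
filtered_bet_total, filtered_net = 1594, 706
unfiltered_bets = 1050

share_pct = 100.0 * filtered_bets / unfiltered_bets  # share of bettable races kept
win_pct = 100.0 * filtered_wins / filtered_bets      # win percentage
roi_pct = 100.0 * filtered_net / filtered_bet_total  # net ROI
winner_odds = payoff_to_odds(14.47)                  # average winner's odds-to-1

print(f"share of races: {share_pct:.2f}%")
print(f"win percentage: {win_pct:.2f}%")
print(f"ROI:            {roi_pct:.2f}%")
print(f"winner odds:    about {winner_odds:.1f} to 1")
```

The trade-off is visible in the numbers: roughly a quarter of the betting opportunities are given up, the win percentage falls, and the higher average payoff more than makes up the difference.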
PROJECT SUMMARY
In summary, we were able to create a forecasting model that not only doubled the initial ROI objective of +15%, but also allowed flexible filter options to increase the margin if desired. The EPM model ranks as a fairly successful predictive analytics project.
Phase VI
Model Implementation
Discussions have begun on the best way to implement the model for commercial delivery. There are plans to offer weekly selections for some of the meets later this year.
For additional information on this or additional predictive modeling projects, please contact us at info@southwood.com
Model and results © 2009, Southwood & Black. All rights reserved.