SOUTHWOOD & BLACK - EPM Project Summary

Background 

A client recently approached us about the feasibility of using predictive modeling techniques to evaluate "equine pari-mutuel investments" … in other words, forecasting horse races. After quite a bit of discussion, the objective was clarified as the ability to invest small to moderate amounts of capital to earn a +15% ROI.  Could this be a conservative investment option for the average person?

Was it possible? Frankly, we had no idea …

The key point to remember is that "pari-mutuel" means that the bettor is competing directly against other gamblers, unlike casino games where the bettor competes against the house's predetermined odds advantage ... in pari-mutuel wagering, you don't have to be smarter than the "house" ... or the trainers ... or the jockeys ... or even the horses ... you just have to be smarter than the guy next to you with the bad haircut and polyester pants ... he's the only one you have to beat to make a profit.

In short order, after ascertaining that the raw data needed for analysis was commercially available, the EPM (Equine Pari-Mutuel) modeling project was kicked off with equal measures of anticipation and skepticism. We were very curious whether it could be done, but we had our suspicions that this might be wasted effort.

Our first step for a new modeling project is to use a data mining methodology (CRISP-DM) to break the project down into manageable tasks:

 

Phase I            Design and load the EPM data warehouse

Both pre-race data (past performances) and post-race data (results and pay-offs) were available electronically from several vendors. After product/pricing evaluation, a vendor was selected, and a download schedule for racing programs was developed.

It was determined that a minimum of  10,000 modeling observations would be used for the forecast, so a target was established to download 150+ race programs based on an estimate of 9 races per program, and 8 horses per race; the data for each horse for each race would comprise a single observation.
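The sizing arithmetic is easy to check. The sketch below uses only the planning estimates quoted above; the variable names are ours, chosen for illustration.

```python
# Rough sample-size estimate from the planning figures quoted above.
programs = 150          # race programs targeted for download
races_per_program = 9   # estimated races per program
horses_per_race = 8     # estimated horses per race

# One observation = the data for one horse in one race.
expected_observations = programs * races_per_program * horses_per_race

print(expected_observations)            # 10800
print(expected_observations >= 10_000)  # True
```

At these estimates, 150 programs clear the 10,000-observation minimum with a modest margin.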

ETL (extract/transform/load) processes were designed to load the raw data into a relational model in the newly-designed EPM data warehouse; this would serve as the source for the datamart itself as it went through the cleansing/transformation process.

 

Phase II           Load, clean, and profile the data

 
Data profiling techniques were used to evaluate the accuracy and cleanliness of the data as it was loaded into the target EPM datamart. A series of data filters was developed to either clean or reject the data as needed. The final datamart contained over 17,000 observations with 3+ million data points.

 Here are some of the data profiling results:

Total Races                      1,811
Horses Entered                   16,630
Horses Raced                     14,218       7.9 / race    14.13% scratched
Past performance entries         68,469
Data points                      3,423,450

                                 Avg          Min           Max
Payoffs:
  Win                            $12.61       $2.20         $208.20
  Place                          $6.01        $2.10         $76.40
  Show                           $4.08        $2.10         $32.80

Winning margin (lengths)         2.49         0.05          32.75

Number of sprint races (< 1M)    1,055        58.26%
Number of route races (>= 1M)    756          41.74%

Dirt races                       1,442        79.62%
Turf races                       369          20.38%

 

 

 

 

Phase III          Establish a model baseline

 

We created ROI (Return On Investment) baselines by using the datamart to evaluate other widely-used handicapping techniques; if one or more of these techniques met the project objective, that objective would already be achieved.

 We built a series of queries and aggregations to calculate the wagering performance based on the morning line favorite (lowest odds), the post-time betting favorite, and 2 common speed ratings available in most track publications.

MORNING LINE MODEL 

The “morning line” is the pre-race prediction of each horse’s odds as established by a track handicapper. The lowest morning line in each race is considered the pre-race favorite and was the basis for this first selection method; if more than one horse in a race shared the lowest odds, the race was excluded from evaluation. Here are the results:

Analysis Type Bets Bet$ M/L Odds Wins Win% Avg$ Win$ Net$ Win ROI
M/L Single Favorite - ALL Races 1579 $3,158 2.43 1.99 503 31.86% $5.27 $2,651 ($507) -16.06%

 

 

A brief explanation of the data columns:

Bets: Total number of races bet where a single selection could be used.
Bet$: Total amount wagered based on a $2 win bet.
M/L: The average morning line odds (to 1) for the selected horses.
Odds: The average post-time (final) odds for the selected horses.
Wins: Total number of wins.
Win%: Percentage of wins to number of bets.
Avg$: Average win payoff for $2 bet.
Win$: Total payoff for all wins.
Net$: Net win or (loss) for all bets.
Win ROI: Net return/loss percentage for total amount wagered.

 

As you can see, betting on the morning line favorite selects almost 32% winners, but at an average $2 win payoff of only $5.27, the net loss of $507 creates a negative ROI (-16.06%).  One point to notice is that the odds dropped 18% from morning line to final odds (2.43 to 1.99) meaning that the betting public agreed with the selection, and drove down the price on the winning payoff.
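To make the column definitions concrete, here is a minimal sketch that derives the summary columns from the raw counts in the morning line row. The function and key names are ours (hypothetical); the inputs come from the table. Because the published dollar totals are rounded, a recomputed figure may differ from the table in the last decimal place.

```python
def win_bet_summary(bets, wins, total_payoff, bet_size=2.0):
    """Derive the summary columns for a series of flat win bets."""
    wagered = bets * bet_size         # Bet$
    net = total_payoff - wagered      # Net$
    return {
        "bet_total": wagered,
        "win_pct": 100.0 * wins / bets,      # Win%
        "avg_payoff": total_payoff / wins,   # Avg$
        "net": net,
        "roi_pct": 100.0 * net / wagered,    # Win ROI
    }

# Morning line favorite row: 1,579 bets, 503 wins, $2,651 returned.
summary = win_bet_summary(bets=1579, wins=503, total_payoff=2651.0)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```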

 

POST-TIME FAVORITE MODEL

Next we examined selections based on the post-time betting favorite. Post-time odds are set by the share of the total amount wagered that the crowd bets on each horse; the more money bet on a horse, the lower its resulting odds:

Analysis Type Bets Bet$ M/L Odds Wins Win% Avg$ Win$ Net$ Win ROI
Actual Odds Favorite - ALL Races 1603 $3,206 2.85 1.59 529 33.00% $4.82 $2,550 ($657) -20.48%

 

If you bet along with the crowd, you will select 33% winners, but the meager average $2 win payoff of $4.82 is well short of the roughly $6 payoff needed to break even at that win rate; the net result is a $657 loss for a -20.48% ROI. Not a good long-term investment strategy.
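The break-even figure follows directly from the win rate: with flat $2 bets, the average payoff per winning ticket must equal the bet size divided by the fraction of bets that win. A minimal sketch (the function name is ours):

```python
def break_even_payoff(win_pct, bet_size=2.0):
    """Average payoff per winning ticket needed for a 0% ROI on flat bets."""
    return bet_size / (win_pct / 100.0)

needed = break_even_payoff(33.00)   # win rate from the table above
actual = 4.82                       # average $2 win payoff from the table

print(f"break-even payoff: ${needed:.2f}")   # break-even payoff: $6.06
print(f"shortfall per winning ticket: ${needed - actual:.2f}")
```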

 

SPEED RATING "A" MODEL

Next we examined the first of two well-known published speed rating numbers. This rating number was developed a number of years ago to evaluate past races across different distances and different tracks, and seems to be frequently used by the public, as well as the media for comparing horses. 

Analysis Type Bets Bet$ M/L Odds Wins Win% Avg$ Win$ Net$ Win ROI
Speed Rating ‘A’ - ALL Races 1581 $3,162 3.27 2.44 459 29.03% $5.59 $2,566 ($596) -18.85%

 

The results were a slightly lower win percentage and a slightly higher win payoff, but still a net loss and a negative ROI (-18.85%). The earlier reported accuracy of this particular rating may have caused the payoff odds to erode as it gained popularity with the betting public, leaving it with limited value at this point. Again, the drop from average morning line odds of 3.27 to post-time odds of 2.44 may be an indication of the public over-betting these selections.

SPEED RATING "B" MODEL 

Finally we examined a more established speed rating number for its effectiveness in selecting a profitable segment of winners. This number has been published in one of the leading racing publications for a number of years as a way to compare the running times of past races:

 

Analysis Type Bets Bet$ M/L Odds Wins Win% Avg$ Win$ Net$ Win ROI
Speed Rating ‘B’ - ALL Races 1488 $2,976 5.19 6.13 343 23.05% $7.24 $2,482 ($494) -16.61%

 

The bets selected by this strategy returned the highest average payoff of the four ($7.24), but at a lower winning percentage (23.05%). Again, the combination of payoff odds and winning percentage created a net loss of $494 and a negative ROI (-16.61%). Interestingly, the post-time odds rose 18% from the morning line (5.19 to 6.13), indicating that the public does not rely heavily on this particular method for its selections.

Model Baseline Summary

Regardless of the individual model differences, we were 0 for 4 in trying to identify a currently available metric that produces a positive ROI on forecasting horse races.
 
By eliminating all of the obvious selection criteria as acceptable methods, we now moved into the predictive analytics phase of the project.

 (As a footnote, we also evaluated the results of place and show wagering; we’ll cover this facet in more detail later, but none of the strategies here produced a positive ROI. Losses ranged from -11.55% to -17.64% for place betting, and from -13.22% to -15.38% for show betting.)

 

Phase IV          Predictive Modeling

 

The biggest challenge in predictive modeling is always to answer the right question. Our initial attempt was:

             How can we predict the most winners?

 After initial modeling efforts created predictive models that could do no better than a -11.24% ROI, we finally realized that we were asking the wrong question. We went back and examined our project objective (generating a positive ROI) and realized that the right question was actually:

             How can we predict the optimal combination of winning percentage and payoff odds?

 In other words, even if you predict 8 out of 10 races correctly, but the average $2 win payoff is only $2.20, you’ve bet a total of $20, collected $17.60 in winnings for a net loss of $2.40, with a negative ROI (-12%). Conversely, you could lose 9 of 10, but if the payoff on that single winner is $22, you have a net gain of $2 for a positive ROI (+10%).
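The two scenarios above can be reproduced with a few lines of arithmetic. This is a minimal sketch assuming flat $2 win bets; the function name is ours.

```python
def flat_bet_roi(bets, wins, avg_payoff, bet_size=2.0):
    """ROI (%) for a series of flat win bets."""
    wagered = bets * bet_size
    collected = wins * avg_payoff
    return 100.0 * (collected - wagered) / wagered

many_small_winners = flat_bet_roi(bets=10, wins=8, avg_payoff=2.20)
one_big_winner = flat_bet_roi(bets=10, wins=1, avg_payoff=22.00)

print(f"{many_small_winners:+.1f}%")  # -12.0%
print(f"{one_big_winner:+.1f}%")      # +10.0%
```

Win percentage alone says nothing about profitability; it is the product of win rate and average payoff that has to exceed the amount bet.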

Through a combination of cluster analysis and regression, we would try to identify and group segments of races together to create separate forecasting models for each. This technique offered 2 benefits:

As an example, it would be easier to forecast food supplies needed for a restaurant if past customer data is categorized by meal rather than lumped together.

 We would expend a great deal of time and effort designing the regression models, identifying raw and derived variables, identifying new clusters, and validating the profitability of each of the models.

 

Phase V           Model Evaluation

 

Of course, the big question we had in the beginning … “Is it possible?”

 Here are the preliminary results:

 

Analysis Type Bets Bet$ M/L Odds Wins Win% Avg$ Win$ Net$ Win ROI
EPM model - ALL clusters 1740 $3,480 5.66 7.10 410 23.56% $9.55 $3,917 $437 +12.55%

 

This was considered a MAJOR milestone. For the first time, after weeks of modeling and testing, we had created a combined model (ALL clusters) with a positive ROI (+12.55%). Let's examine how this was possible by comparing our results to one of the baseline models:

EPM Model versus Morning Line Favorite

Although the baseline model had a higher winning percentage (31.86% to 23.56%), the EPM model's average $2 payout was 81.2% higher ($9.55 to $5.27), which is what created the positive ROI. As a side note, the public over-bet the morning line favorite, as shown by the decrease from morning line to post-time odds (2.43 to 1.99), while the EPM model selected horses that were out of favor with the betting public: their average morning line odds of 5.66 rose to 7.10 by post time. It seems that a contrarian approach is needed to achieve a positive risk-reward ratio in this environment.

Cluster Analysis

The next step was to evaluate the cluster models independently to determine the optimum way to combine the models for the individual clusters.

TOP 3 CLUSTERS

Analysis Type Bets Bet$ M/L Odds Wins Win% Avg$ Win$ Net$ ROI
Cluster A Model 43 $86 10.29 12.72 13 30.23% $13.92 $181 $95 +110.47%
Cluster B Model 195 $390 8.35 11.37 43 22.05% $15.95 $686 $296 +75.90%
Cluster C Model 305 $610 5.83 7.11 75 24.59% $10.01 $751 $141 +23.11%


BOTTOM 3 CLUSTERS

Analysis Type Bets Bet$ M/L Odds Wins Win% Avg$ Win$ Net$ ROI
Cluster X Model 91 $182 6.16 8.82 18 19.78% $8.39 $151 ($31) -17.03%
Cluster Y Model 41 $82 8.83 11.78 4 9.76% $7.25 $29 ($53) -64.63%
Cluster Z Model 528 $1,056 6.96 7.72 92 17.42% $8.72 $802 ($254) -24.05%


We began the cluster analysis by ranking the cluster models in terms of profitability (ROI). Listed above are the top and bottom 3 from that ranking, but there were a number of other clusters that fell between the two.
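The ranking step can be sketched as follows. The bet counts and total payoffs are taken from the two tables above; the function and variable names are ours, and only the six published clusters are included because the figures for the middle clusters are not shown.

```python
# (name, bets, total payoff in dollars) from the tables above; flat $2 win bets.
clusters = [
    ("Cluster A", 43, 181), ("Cluster B", 195, 686), ("Cluster C", 305, 751),
    ("Cluster X", 91, 151), ("Cluster Y", 41, 29), ("Cluster Z", 528, 802),
]

def roi_pct(bets, payoff, bet_size=2):
    """Net return as a percentage of the total amount wagered."""
    wagered = bets * bet_size
    return 100.0 * (payoff - wagered) / wagered

# Sort from most to least profitable.
ranked = sorted(clusters, key=lambda c: roi_pct(c[1], c[2]), reverse=True)
for name, bets, payoff in ranked:
    print(f"{name}: {bets} bets, ROI {roi_pct(bets, payoff):+.2f}%")
```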

At first glance someone might suggest using only the Cluster A Model because of its high potential return (+110.47%), but that cluster occurred in only about 2.3% of the races, so investment opportunities would not arise very frequently. An interesting observation is that not only is this cluster highly predictable (30.23% winners), but at an average $2 win payoff of $13.92, the winners are paying at just under 6-1 odds. Furthermore, the public seems to be overlooking this type of bet entirely, based on the increase from morning line to post-time odds (10.29 to 12.72).

The Cluster B Model wasn't nearly as predictive in terms of  winning percentage (22.05%), but at $2 win payoff odds of almost 7-1, the ROI was still at a very healthy  +75.90%. These clusters occur in about 10.8% of the races.

Cluster C Model opportunities occurred about 16.8% of the time, and the 4-1 win payoff ($10.01) returns an acceptable +23.11% ROI.

On the flip side, Cluster Z races happen about 29% of the time, and their -24.05% ROI is to be avoided. Finally, Cluster Y races only occur about as frequently as Cluster A, but the -64.63% ROI is a killer for anyone's capital.
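The odds quoted in this discussion are derived from the average $2 win payoff, which includes the returned $2 stake. A minimal sketch of the conversion (the function name is ours):

```python
def payoff_to_odds(payoff, bet_size=2.0):
    """Convert the total payoff on a winning ticket to odds-to-1."""
    return (payoff - bet_size) / bet_size

# Average $2 win payoffs for the top 3 clusters, from the table above.
for name, avg_payoff in [("A", 13.92), ("B", 15.95), ("C", 10.01)]:
    print(f"Cluster {name}: ${avg_payoff:.2f} pays about {payoff_to_odds(avg_payoff):.1f}-1")
```

So a $13.92 payoff is $11.92 of profit on a $2 stake, or just under 6-1.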

By the way, there are a number of clusters between the top 3 and bottom 3.

 Here are the results of combining a number of the top clusters in order to maximize ROI plus investment opportunities:


CLUSTER COMBINED MODELS

Analysis Type Bets Bet$ M/L Odds Wins Win% Avg$ Win$ Net$ ROI
Top 2 Clusters Combined 238 $476 8.70 11.61 56 23.53% $15.48 $867 $391 +82.14%
Top 3 Clusters Combined 543 $1,086 7.09 9.08 131 24.13% $12.35 $1,618 $532 +48.99%
Top 4 Clusters Combined 1,050 $2,100 6.09 7.47 264 25.14% $10.63 $2,805 $705 +33.57%

 

The best investment performance may be achieved by combining a number of the top performing cluster models to reach the desired combination of margin and investment opportunities. This may be somewhat of a subjective issue depending on the investors' own level of risk/reward comfort.

To use the 4-Cluster model as an example, by filtering out clusters predicted less profitable, the investor is able to select 25% winners at an average $2 win payoff of $10.63 which creates an ROI of +33.57%. 

Additionally, the morning line odds may be used as a dynamic filter. By only actively betting the selections that meet or exceed a predetermined morning line level, the investor may increase his potential ROI at the expense of fewer betting opportunities.


EPM MODEL - Morning Line Analysis


Analysis Type Bets %Races Bet$ M/L Odds Wins Win% Avg$ Win$ Net$ ROI
EPM Model  - Top 4 Clusters 1050 100.00% $2,100 6.09 7.47 264 25.14% $10.63 $2,805 $705 33.57%
EPM Model  - M/L > 1/1 1045 99.52% $2,090 6.11 7.51 259 24.78% $10.78 $2,793 $703 33.64%
EPM Model  - M/L > 2/1 965 91.90% $1,930 6.49 8.03 218 22.59% $11.97 $2,610 $680 35.23%
EPM Model  - M/L > 3/1 797 75.90% $1,594 7.37 9.32 159 19.95% $14.47 $2,300 $706 44.29%
EPM Model  - M/L > 4/1 614 58.48% $1,228 8.62 11.23 106 17.26% $18.15 $1,924 $696 56.68%
EPM Model  - M/L > 5/1 465 44.29% $930 10.05 13.43 74 15.91% $22.00 $1,628 $698 75.05%
EPM Model  - M/L > 6/1 386 36.76% $772 11.08 14.92 56 14.51% $24.89 $1,394 $622 80.57%
EPM Model  - M/L > 8/1 274 26.10% $548 13.16 17.91 40 14.60% $28.50 $1,140 $592 108.03%
EPM Model  - M/L > 10/1 200 19.05% $400 15.06 21.00 31 15.50% $31.55 $978 $578 144.50%

For example, if the investor decides to only bet selections if their M/L odds are 3/1 or greater …

·         there were 797 races that met that criterion (about 76% of the bettable races)

·         the average M/L for those horses was over 7-1 (7.37) …

·         by the time the horses went to post, the average final odds were over 9-1 (9.32) …

·         the win percentage was just under 1 out of 5 (19.95%) …

·         the average $2 win pay-off was $14.47 (making the average winner's final odds 6.23) …

·         the net ROI goes up to +44.29%
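The filter itself is simple to express. The sketch below is our illustration, not the project's implementation: the `Selection` record, the function name, and the sample data are all hypothetical, and the made-up records are NOT project data. It follows the "meet or exceed" wording above, so a selection exactly at the threshold is kept.

```python
from dataclasses import dataclass

@dataclass
class Selection:
    ml_odds: float   # morning line odds (to 1)
    payoff: float    # $2 win payoff collected; 0.0 if the horse lost

def roi_with_ml_filter(selections, min_ml, bet_size=2.0):
    """Return (ROI %, bet count) when betting only selections with M/L >= min_ml."""
    kept = [s for s in selections if s.ml_odds >= min_ml]
    if not kept:
        return None, 0
    wagered = bet_size * len(kept)
    collected = sum(s.payoff for s in kept)
    return 100.0 * (collected - wagered) / wagered, len(kept)

# Made-up records for illustration only.
sample = [
    Selection(2.0, 0.0), Selection(2.5, 6.40), Selection(3.0, 0.0),
    Selection(4.0, 11.20), Selection(6.0, 0.0), Selection(8.0, 0.0),
]

for threshold in (1.0, 3.0, 5.0):
    roi, count = roi_with_ml_filter(sample, threshold)
    print(f"M/L >= {threshold}: {count} bets, ROI {roi:+.1f}%")
```

Raising the threshold always reduces the number of bets; whether it raises the ROI depends on the data, which is why the table above reports both.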

 

PROJECT SUMMARY

In summary, we were able to create a forecasting model that not only more than doubled the initial ROI objective of +15% (the Top 4 Clusters model returned +33.57%), but also allowed flexible filter options to increase the margin if desired. We rank the EPM model as a fairly successful predictive analytics project.

 

Phase VI          Model Implementation

 

Discussions have begun for the optimum way to implement the model for commercial delivery. There are plans to offer weekly selections for some of the meets later this year.

 

 

For additional information on this or other predictive modeling projects, please contact us at info@southwood.com

Model and results © 2009, Southwood & Black. All rights reserved.