Continuing the definition series, today's topic is an important one: predictive modeling. Predictive modeling (or predictive analytics) is a type of mathematical modeling that a wide variety of organizations use to predict the outcomes of processes, events and pretty much anything else one cares to know about. It is based on probabilistic and statistical modeling and underpins most of what data science and analytics stand for. It is more concerned with the model side of analysis, though, which practitioners and data engineers do not always treat as a main focus. I increasingly think that further work on mathematical models for data science will matter more for the future of the subject.

Its place in the information technology business sector is central. Advanced manufacturing, retail and government sectors of the economy also rely on good predictive modeling to enhance decision-making and better manage resources in their organizations. So here is another definition not to be forgotten or missed, one that might help whenever we need it:

### Predictive Modeling

Predictive modeling is a process that uses data mining and probability to forecast outcomes. Each model is made up of a number of predictors, which are variables that are likely to influence future results. Once data has been collected for relevant predictors, a statistical model is formulated. The model may employ a simple linear equation or it may be a complex neural network, mapped out by sophisticated software. As additional data becomes available, the statistical analysis model is validated or revised.
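To make the "simple linear equation" case concrete, here is a minimal sketch of a model with a single predictor, fitted by ordinary least squares and then checked against data it has not seen. The data is synthetic and the names (`fit_line`, `mean_squared_error`, the 150/50 split) are my own assumptions for illustration, not part of the definition above.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def mean_squared_error(model, xs, ys):
    """Average squared gap between the model's predictions and the outcomes."""
    intercept, slope = model
    return mean((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Synthetic data (an assumption for this sketch): y = 2 + 3x + Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]

# Formulate the model on the first 150 points ...
train_x, train_y = xs[:150], ys[:150]
model = fit_line(train_x, train_y)

# ... and validate it on the 50 points that arrive "later".
test_x, test_y = xs[150:], ys[150:]
holdout_mse = mean_squared_error(model, test_x, test_y)

print("intercept, slope:", model)
print("hold-out MSE:", holdout_mse)
```

If the hold-out error were much larger than the error on the training points, that would be the signal to revise the model rather than keep it.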

Predictive modeling is often associated with meteorology and weather forecasting, but it has many applications in business. Bayesian spam filters, for example, use predictive modeling to identify the probability that a given message is spam. In fraud detection, predictive modeling is used to identify outliers in a data set that point toward fraudulent activity. And in customer relationship management (CRM), predictive modeling is used to target messaging to those customers who are most likely to make a purchase. Other applications include capacity planning, change management, disaster recovery, engineering, physical and digital security management and city planning.
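The spam filter example comes down to Bayes' rule. The sketch below computes the probability that a message is spam given that it contains one particular word. The three input probabilities are invented for illustration, and a real filter would combine evidence from many words rather than a single one.

```python
def spam_probability(p_word_given_spam, p_word_given_ham, p_spam):
    """Bayes' rule: P(spam | word) from the word likelihoods and the spam prior."""
    p_ham = 1.0 - p_spam
    # Total probability of seeing the word in any message.
    evidence = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / evidence


# Hypothetical numbers: the word shows up in 40% of spam, 1% of legitimate
# mail, and 20% of all incoming mail is spam.
posterior = spam_probability(p_word_given_spam=0.4,
                             p_word_given_ham=0.01,
                             p_spam=0.2)
print(f"P(spam | word) = {posterior:.3f}")
```

A word that is equally common in spam and legitimate mail leaves the prior unchanged, which is a quick sanity check on the function.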

Although it may be tempting to think that with the advent of big data, predictive models will be more accurate, statistical theorems show that after a certain point, feeding more data into a predictive analytics model will not provide more accurate results. Analyzing representative portions of the available information (sampling) can help speed development time on models and allow them to be deployed more quickly.

This is in line with what I said above about the relative importance of the modeling approach versus the accumulation of data that might needlessly increase the dimension of a dataset: to improve the performance of a predictive modeling algorithm, it may be preferable to focus on techniques such as sampling.
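One way to see the diminishing returns is through the standard error of an estimated proportion, which shrinks only with the square root of the sample size. The sketch below draws samples of increasing size from a synthetic "population" of purchase flags (the 30% purchase rate and the population size are assumptions of mine) and prints each estimate next to its theoretical standard error. It illustrates sampling error only and is not the specific theorem the quoted definition has in mind.

```python
import random
from math import sqrt

rng = random.Random(42)

# Hypothetical population: 1 means the customer made a purchase.
population = [1 if rng.random() < 0.3 else 0 for _ in range(200_000)]
true_rate = sum(population) / len(population)


def standard_error(p, n):
    """Standard error of a proportion p estimated from n independent draws."""
    return sqrt(p * (1 - p) / n)


print("true rate:", round(true_rate, 4))
for n in (100, 1_000, 10_000, 100_000):
    sample = rng.sample(population, n)
    estimate = sum(sample) / n
    print(n, round(estimate, 4), round(standard_error(true_rate, n), 4))
```

Each hundredfold increase in data buys only a tenfold reduction in sampling error, so a representative sample of a few thousand records is often close enough to the full dataset for model development.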

Another important feature of these techniques is their potential to increase the productivity of any activity, understood here as anything that lets us do more with less input or without unnecessary extra effort. From the Wikipedia page on predictive modelling we can read:

Generally, predictive modelling in archaeology is establishing statistically valid causal or covariable relationships between natural proxies such as soil types, elevation, slope, vegetation, proximity to water, geology, geomorphology, etc., and the presence of archaeological features. Through analysis of these quantifiable attributes from land that has undergone archaeological survey, sometimes the “archaeological sensitivity” of unsurveyed areas can be anticipated based on the natural proxies in those areas. Large land managers in the United States, such as the Bureau of Land Management (BLM), the Department of Defense (DOD), and numerous highway and parks agencies, have successfully employed this strategy. By using predictive modelling in their cultural resource management plans, they are capable of making more informed decisions when planning for activities that have the potential to require ground disturbance and subsequently affect archaeological sites.

The same page also lists some known caveats of using predictive modelling:

Possible fundamental limitations of predictive models based on data fitting:

1) History cannot always predict future: Using relations derived from historical data to predict the future implicitly assumes there are certain steady-state conditions or constants in the complex system. This is almost always wrong when the system involves people.

2) The issue of unknown unknowns: In all data collection, the collector first defines the set of variables for which data is collected. However, no matter how extensive the collector considers his selection of the variables, there is always the possibility of new variables that have not been considered or even defined, yet are critical to the outcome.

3) Self-defeat of an algorithm: After an algorithm becomes an accepted standard of measurement, it can be taken advantage of by people who understand the algorithm and have the incentive to fool or manipulate the outcome. This is what happened to the CDO rating. The CDO dealers actively fulfilled the rating agencies’ input to reach an AAA or super-AAA on the CDO they were issuing by cleverly manipulating variables that were “unknown” to the rating agencies’ “sophisticated” models.

*Image: Predictive modelling (Wikipedia)*