2015 - LeanBI

Understanding Through Detailed Example: Technical Procedure in Predictive Analysis with Specific Example of “Storage of Beer”

In Predictive Analytics, historical and current data are statistically analyzed, and the insights gained are used to predict trends or behaviors (usually in the future). The predictions are statistical in nature and look something like this: there is a 90% probability that beer consumption in early August 2016 will rise by 21% in Rio and by 25% in Bern. Predictive Analytics is already finding applications today:

Predictive Maintenance: The timing of a product defect can be predicted -> The company responsible for maintenance can react in advance
Customer satisfaction: A company can predict a customer's satisfaction level in advance -> The customer is proactively assisted with their problems -> The hope is that the customer remains loyal
Credit Scoring: The bank can estimate the likelihood of a borrower repaying a loan on time -> A decision is taken whether to extend the credit
Predictive Policing: Burglaries can be predicted in Zurich -> Police can be deployed in a targeted way

A successful Predictive Analytics project consists of the following three key elements and we will focus on their technical implementation in this blog.

Access to relevant data
A question that is required to be answered with the data
And a technical implementation of algorithms and models

Overview of the procedure

Image: Process for Predictive Analytics.

The data is cycled through the process of Feature Extraction and Feature Selection. Here, it is decided how certain information should be processed before it is used (Feature Extraction) and which of the existing information should actually be used (Feature Selection). The selected features then pass through an algorithm that is trained using known results from the past. As shown in the image, the process is optimized step by step: better, more relevant features are added and features that do not help the algorithm are removed. Different algorithms are also tested until a satisfactory prediction is obtained. The best-trained algorithm is then used to make high-quality predictions on new data. We will now discuss these points in greater detail.

Feature Extraction

A date, for example, can carry many kinds of information: the weekday, whether it is a weekend, whether it falls in the school holidays, that August 1st is just around the corner, or that the Olympics are about to start. Depending on the question, all this information can be very useful: weekend -> more time for drinking; August 1 -> large retail sales the day before; Olympics -> spike in beer demand; etc. Thus, Feature Extraction can be a decisive factor in whether an algorithm can correctly predict the demand for beer.
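As a minimal sketch of what such an extraction step might look like, the following Python function turns a single date into a handful of features. The function name, the feature names, and the hard-coded event calendar are illustrative assumptions, not part of any particular tool.

```python
from datetime import date

# Assumed event calendar for illustration: the Rio 2016 Summer Olympics.
OLYMPICS_2016 = (date(2016, 8, 5), date(2016, 8, 21))


def extract_date_features(day: date) -> dict:
    """Derive simple calendar features from a single date (hypothetical helper)."""
    start, end = OLYMPICS_2016
    return {
        "weekday": day.weekday(),                # 0 = Monday ... 6 = Sunday
        "is_weekend": day.weekday() >= 5,        # Saturday or Sunday
        "is_day_before_aug_1": (day.month, day.day) == (7, 31),
        "is_aug_1": (day.month, day.day) == (8, 1),
        "olympics_running": start <= day <= end,
    }


features = extract_date_features(date(2016, 8, 10))
```

School holidays and similar regional information would need an external calendar and are left out of this sketch.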

Feature Selection

The selection of the data features (Feature Selection) also plays a significant role in the process. A sea of all kinds of data may be available, but it is not wise to apply the algorithm to all of it, as this can lead to the problem known as Overfitting. The information that the Bern Onion Market took place on 23 November 2015 can have an enormous influence on the consumption of Glühwein (mulled wine), but it is not relevant for predicting the consumption of beer and is more likely to add noise. The algorithm must then also learn to ignore this information. Irrelevant features may even acquire high weights in the algorithm, leading to sub-optimal predictions.
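One simple way to automate part of this selection is a correlation filter: keep only those features whose correlation with the target exceeds a threshold. The sketch below is one of many possible techniques and is not necessarily what a production project would use. The data, the feature names, and the threshold of 0.3 are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no usable signal
    return cov / (sx * sy)


def select_features(features, target, threshold=0.3):
    """Keep the names of features whose |correlation| with the target is high enough."""
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, target)) >= threshold
    ]


# Invented toy data: 14 days of beer consumption, Monday to Sunday, twice.
beer = [10, 11, 10, 12, 11, 20, 21, 10, 12, 11, 10, 11, 19, 22]
candidates = {
    "is_weekend": [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1],
    "onion_market": [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}
selected = select_features(candidates, beer)
```

On this toy data only the weekend flag survives the filter; the one-off market day correlates too weakly with beer consumption and is dropped.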

A question of dataset

In principle, the task of Feature Extraction and Feature Selection can be handled entirely by the computer. If the amount of data is large enough and sufficient computing capacity is available, no human intervention may be needed. An algorithm should have no problem finding out that the last Summer Olympic Games in London boosted the consumption of beer and that this was not a one-off occurrence (the previous Olympics also led to increased consumption of beer). Thus, a computer can find the feature “Olympic Games” independently, without any human interaction. Of course, this only works if data covering the period of the last few Olympic Games (15-20 years) is available. This requires an adequate database (mostly Big Data) and large computing power (up to data centers such as Google's). With large amounts of data and high computing power, overfitting tends to lose its impact.

If either the computing power or the required amount of data is not available, humans have to work out at least some of the features (although possibly not on the 1st of August). One tool for finding the right features is data visualization. Visualizations bring new insights and an understanding of the data, and they can provide clues as to whether a relationship exists between a feature and the required prediction. In particular, one can plot a possible feature against the desired prediction (for example, beer production against beer consumption). It may then turn out that, within the 7-day cycle, beer production drifts away from beer consumption, since not much beer is produced on weekends, when consumption is highest. A good data scientist will therefore suggest using the weekend as a feature.

The guiding principle for medium-sized projects is a combination of human and automated Feature Extraction and Feature Selection. If humans identify both the relevant features and the irrelevant, noisy ones, the work becomes that much easier for the computer. The algorithms can then detect the remaining features themselves. Limiting the amount of automated processing required for Feature Selection reduces both the computational complexity and the risk of Overfitting.
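The comparison of beer production and beer consumption over the 7-day cycle can be approximated even without a plotting library by averaging both series per weekday. The sketch below assumes daily values are available as plain lists; all numbers are invented for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta


def mean_by_weekday(days, values):
    """Average a daily series per weekday (0 = Monday ... 6 = Sunday)."""
    buckets = defaultdict(list)
    for day, value in zip(days, values):
        buckets[day.weekday()].append(value)
    return {wd: sum(vs) / len(vs) for wd, vs in sorted(buckets.items())}


# Two invented weeks starting on Monday, 4 July 2016.
days = [date(2016, 7, 4) + timedelta(days=i) for i in range(14)]
consumption = [10, 11, 10, 12, 11, 20, 21, 10, 12, 11, 10, 11, 19, 22]
production = [15, 15, 15, 15, 15, 2, 2] * 2

avg_consumption = mean_by_weekday(days, consumption)
avg_production = mean_by_weekday(days, production)

# A positive gap means more beer is consumed than produced on that weekday.
gap = {wd: avg_consumption[wd] - avg_production[wd] for wd in avg_consumption}
```

In this toy data the gap is negative from Monday to Friday and strongly positive on Saturday and Sunday, which is the kind of pattern that suggests the weekend as a feature.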

Choosing and training machine learning algorithms

The data and the selected features are now handed over to the algorithm for training. From past data, the algorithm can learn, among other things, by how much the last Summer Olympics increased beer consumption. The computer can therefore estimate the impact of the Games on beer consumption and predict higher beer consumption on 10 August 2016, when the Olympic Games in Rio will be underway. Of course, this learning does not happen on a single feature (the Olympics) in isolation but is the result of training on all the selected features (Olympics, Weekend, 1 August, FIFA World Cup, Oktoberfest, Weather, Season, Gurten Festival, etc.). The next question is which machine learning algorithm should be used. We would like to suggest a few important criteria for choosing one:

Required accuracy: Sometimes a very accurate prediction is not required. In this case, less precise algorithms that run much faster can be used.
Available data volumes: When the amount of data is small, only an algorithm with few parameters can be used. More complex, more accurate and more computationally intensive algorithms with more parameters become possible with larger amounts of data.
Explainable algorithms: In addition to a prediction, one may also want an explanation of how the prediction was obtained. The algorithm should, for example, communicate that the predicted beer consumption is very high because the Olympic Games in Rio will be underway at that time. Many machine learning algorithms are not able to do this.
Custom algorithms: Not every algorithm is available in the commonly used programs and libraries. The effort required to program an algorithm oneself or to adopt a new library can be high. For this reason, there is a clear incentive to make use of existing algorithms.

In practice, no single algorithm fulfills all wishes, not least because the “perfect” algorithm is often not implemented in the tools being used. It is therefore reasonable to start with a simple algorithm that is available at the touch of a button and performs its calculations quickly. If such an algorithm does not deliver a reasonable prediction, it is unlikely that a more sophisticated algorithm will arrive at a good one. At that point, it is more productive to bring in additional data and improved features. If, however, the prediction quality of a simple algorithm only needs to be improved a little, more sophisticated algorithms can be quite helpful, and one can search step by step for better algorithms and features.
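To illustrate what a "simple algorithm" can mean here, the following sketch fits a straight line to a single feature by ordinary least squares. The choice of temperature as the feature and all numbers are assumptions made for the example; a real project would typically use several features and a library implementation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("feature is constant; a slope cannot be estimated")
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new feature value."""
    slope, intercept = model
    return slope * x + intercept


# Invented training data: daily temperature in degrees Celsius vs. litres of beer sold.
temperature = [10, 20, 30]
litres = [100, 150, 200]

model = fit_line(temperature, litres)
forecast = predict(model, 25)
```

A baseline like this is also easy to explain: the slope states directly how many additional litres are expected per degree.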

Testing machine learning algorithms

Data is required not only for training an algorithm but also for testing it (in other words, for measuring its quality). The existing data is therefore divided into two groups. The algorithm is trained with one group, the training data, and tested with the other, the test data. In such a test, the algorithm is asked to deliver a certain prediction, which is then compared with the real measurements in the test data. If the predictions closely match the actual measurements, the algorithm is considered to have performed well. A few words on the division of the data into training and test data: the division should be as realistic as possible so that one is not misled by the test. For example, a realistic test of the beer demand prediction would use last month's data as test data. The algorithm is trained with all the data that was already available a month ago. The trained algorithm then predicts the month that followed, which was still the future at that point. In the test, this prediction is compared against the values actually measured in that month.
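The time-based division described above can be sketched as follows: everything before a cutoff date is training data, everything from the cutoff onward is test data, and the prediction quality is measured as the mean absolute error. The helper names, the data, and the deliberately naive "predict the training mean" model are assumptions for illustration.

```python
from datetime import date


def temporal_split(rows, cutoff):
    """Split (day, value) rows into training data before the cutoff and test data from it."""
    train = [row for row in rows if row[0] < cutoff]
    test = [row for row in rows if row[0] >= cutoff]
    return train, test


def mean_absolute_error(actual, predicted):
    """Average absolute difference between measured and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Invented measurements: (day, litres of beer consumed).
rows = [
    (date(2016, 6, 1), 10.0),
    (date(2016, 6, 15), 14.0),
    (date(2016, 7, 1), 13.0),
    (date(2016, 7, 15), 15.0),
]

train, test = temporal_split(rows, cutoff=date(2016, 7, 1))

# Naive stand-in for a trained algorithm: always predict the training mean.
baseline = sum(value for _, value in train) / len(train)
actual = [value for _, value in test]
error = mean_absolute_error(actual, [baseline] * len(actual))
```

Splitting by date rather than at random keeps information from the test period out of the training data, which is what makes the test realistic.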

Answer questions / make predictions

With the features selected and the algorithm trained, the machine can now make the desired predictions. It will then be known, for example, how much beer to produce. Logistics can also be optimized, as we now know where the demand for beer will arise and how large it will be. Fortunately, the prediction is made automatically, leaving sufficient time for you to enjoy the Olympics with your cool, favorite beer in your hand.

Questions

Write to us with any questions or if you are interested in a live demonstration (info@leanbi.ch) or call us at (+41 79 247 99 59).

Discover the Power of Your Data

We do not want our blog visitors to miss our free half-day on-site workshop, an offer that was very well received in our email campaign. You may be interested in the contents of this campaign for one or more of the following reasons:

Do you want to …

Optimize your logistics?
Improve the quality of your products?
Reduce the cost of product development and production?
Offer more value added services with your products?

We optimize your products and processes and trace unknown interrelations.

You give us your mountain of data and we show you new ways to improve your logistics, development, products and processes and to create high-performance default-prediction systems. Our approach increases your production efficiency while reducing your maintenance costs. Our Data Scientists analyze your data with specialized algorithms based on in-memory Big Data technologies, not at large programming cost but with lots of visualizations that you understand and that open up completely new possibilities for you. Why not discover the power of your data through a demo with us and begin putting it to use?

Book our free half-day workshop at your premises

Please feel free to contact us (without obligation) by phone at +41 79 247 99 59 or email at info@leanbi.ch.

Analytics 3.0 and Industry 4.0 Should Marry to Give Birth to a New Star, Analytics Industry 4.0.

Analytics 3.0: After predictive analytics, now the prescriptive

Fig. 1: The development of analytics according to Davenport

Analytics 3.0 is regarded as the new era in big data. It was developed in the USA by Prof. Thomas Davenport. In addition to “What will happen?” as in predictive analytics, one can also ask “Why will it happen?” in prescriptive analytics. These “what” and “why” lead to new relationships and insights into the chain of cause and effect.

Besides conventional BI technology, Analytics 3.0 also incorporates new big data technologies that allow you to stream large quantities of live data in various data formats. The data analysis is done on distributed infrastructures and with in-memory technologies using machine learning and data mining algorithms. As opposed to conventional data warehousing, there are no limits on data formats, and data modeling is dramatically simplified.

Algorithms lie at the heart of Analytics 3.0. Machine learning algorithms automatically extract information from data. This is done without any human-machine interface. To make this possible, people first train the algorithms on smaller amounts of data and build models. The algorithms are also partly self-learning: over time, the models improve and so do their predictions.

Data mining also has an important place within Analytics 3.0. Here, however, a person is always involved in the discovery or prediction process. Typically, the focus is on solving a concrete, complex problem. One might, for example, use pattern recognition to gain a better understanding of a complex situation with several unknown influencing factors. Data mining uses many machine learning algorithms and vice versa.

Analytics 3.0 is a combination of technology and mathematics. It is reality and future at the same time. Analytics 3.0 has been used for many years and many universities and companies are doing intensive research on it.

Today, the number of algorithms that can be used is already very large and a strong transformation process is underway. Every day, new algorithms are added and existing ones are improved. Most of these algorithms are public and can be acquired through several open source packages such as R, Mahout, Weka, etc. Other algorithms are encapsulated in commercial products and therefore proprietary.

Additional software is required for the algorithms to function optimally with big data technology (distribution on CPU and RAM). Again, there are open source options or purchasable software products which are continuously developed.

One thing is clear: The possibilities of predictive and prescriptive analytics are far from being exhausted.

Industry 4.0: The computerization of industry

As a German/European project since 2011, Industry 4.0 includes the computerization of manufacturing technology. The automation technology necessary for Industry 4.0 should become smarter with the introduction of procedures for self-optimization, self-configuration, self-diagnosis and cognition. It should help people better in their increasingly complex work. This creates an intelligent factory (Smart Factory), which is adaptable and resource-efficient and optimally integrates into the business processes of a company.

The ideas of Industry 4.0 are omnipresent in Swiss industry and are not limited to manufacturing technology. The degree of maturity of Swiss industry with regard to Industry 4.0 varies hugely. A few pioneers are already running remote maintenance systems in which machines installed somewhere in the world feed data to the manufacturer, which in turn triggers, for example, manufacturing processes. Some operators have connected their facilities at various locations, or even entire factories, centrally so that their data can be evaluated jointly. But these implementations are only the first steps into the world of Industry 4.0, because the “self”, that is, the logical and physical networking of machines, is growing only very selectively.

An important element of Industry 4.0 is the development of sensor technology itself. For example, sensors in “machine vision” (the field of image acquisition, including infrared and X-ray wavelengths) are bringing new possibilities to online quality measurement, and they are at the same time very data intensive. Spectroscopy is increasingly built directly into processes and delivers large amounts of data. Ever more modern sensor technology produces ever larger data streams, which must be dealt with.

From our perspective, analytics still receives too little emphasis within Industry 4.0. Analytics 3.0 and Industry 4.0 are largely separate worlds. Why? Both worlds are complex and only partially controllable. The intersection is going to be large, but the skills necessary to unite both worlds are missing today.

Analytics Industry 4.0: A new star is born

Figure 3: Towards Analytics Industry 4.0 with the Cloud

If we now consider the intersection of both worlds, we may call it Analytics Industry 4.0. The cloud is going to be central in bringing these worlds together, because it ultimately boils down to networking data to a central location. Analytics Industry 4.0 is a branch of Industry 4.0 emphasizing the analytical part of this fourth industrial revolution. What is the purpose of such a branch? To answer this, let us go back to the definition of Industry 4.0 and understand the importance of analytics:

Self-optimization: The self-optimization of the manufacturing process is, in addition to the physical machine operation, a mathematical optimization process based on data. It relies on nothing other than the algorithms described in Analytics 3.0.
Self-optimization has two aspects: on the one hand the optimization of the production process itself, and on the other the manufactured product. Self-optimization of a manufactured product can be described as automated quality optimization. It requires automated quality measurements, which generate huge amounts of data that must be processed. Large, high-performance analytical infrastructures are therefore necessary so that this data flows back into the production process in time.
Self-diagnosis: The purpose of self-diagnosis is to detect possible machine breakdowns in advance. This extends far beyond notifications. Self-diagnosis can only happen through the combination of measurement data, its algorithmic processing, and the return of the derived information to the production process for further physical processing.
Cognition: Cognition is the totality of the mental activities associated with thinking, knowledge, remembering and communicating. Just as the human brain needs one, industry also needs a data pool as a basis for generating knowledge. This basis is the (cloud) infrastructure of Analytics 3.0.
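As a rough illustration of the self-diagnosis idea (measurement data, algorithmic processing, and a signal that can be fed back into the process), the sketch below flags sensor readings that deviate strongly from the readings just before them. The rolling z-score used here is only one simple technique chosen for the example. The window size, the limit of 3 standard deviations, and the readings themselves are assumptions.

```python
from statistics import mean, stdev


def flag_anomalies(readings, window=5, z_limit=3.0):
    """Return the indices of readings that deviate strongly from the preceding window."""
    flagged = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Skip perfectly flat histories to avoid dividing by zero.
        if sigma > 0 and abs(readings[i] - mu) / sigma > z_limit:
            flagged.append(i)
    return flagged


# Invented vibration readings from a machine; index 6 is an obvious outlier.
readings = [1.0, 1.1, 0.9, 1.0, 1.1, 1.0, 5.0, 1.0]
suspicious = flag_anomalies(readings)
```

In a real plant, the flagged indices would be the information that is returned to the production process, for example to schedule an inspection before a breakdown occurs.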

The aim is therefore aligning Analytics 3.0 with the ongoing fourth industrial revolution. It not only affects manufacturing technology, but also storage technology, process engineering, air conditioning technology and energy technology. Both the data-infrastructure and the algorithms are specific tools of these industries and have to be developed. In our view, open source is going to play a crucial role in this. We believe that the goal of Industry 4.0 will be achieved fastest through existing and new open source projects in the analytics and big data area. Open source initiatives generate new products for Analytics Industry 4.0.