Rolling Forecast

A rolling forecast is performed to predict binary popularity values for each week. Each week's data is appended to the existing data and a new model is built, following the notion that more data leads to better analysis. Predictions are produced by several machine learning algorithms, namely Naive Bayes, Stochastic Gradient Descent, and Random Forest. Their models are then combined into an ensemble to check which algorithm offers the best true positive, true negative, false positive, and false negative counts. The project also plots the results against the time scale to show how the scores, including sensitivity, specificity, precision, recall, and fall-out rate, change with each week.

A rolling forecast is a machine learning technique that builds upon a pre-existing model each time a new dataset is added, making future predictions from past and present trends. Incremental learning suits this problem: as each new week's data is produced, it has to be added to the model. A new model is then generated that combines the features of the previous model with the learnings from the recently added week, yielding an enhanced model. The window of the model can either be kept at a constant size or allowed to grow with each new week. In this problem the window grows by one week at a time: it starts with the 52 weeks of 2013, i.e. one year, and one week is then appended to the existing model at each step.

The procedure is described in the diagram below. 2013.csv contains the complete data of 2013; a machine learning algorithm is applied to it to generate a model. Predictions are made for a week of 2014 and then cross-validated against the calculated target values. The learnings obtained from cross-validation are carried over to the next model, so with every week the model learns more and performs better for the upcoming weeks.

Week 20140107 is added to the complete 2013 dataset. The model is generated by applying the machine learning algorithms, namely Naive Bayes, Stochastic Gradient Descent, and Random Forest. The following week, 20140108, is then predicted and cross-validated against the target values obtained by computing naccess and nusers. The learning from this cross-validation is folded back into the model, and the process continues each week.
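A minimal sketch of this expanding-window loop is shown below. The file names, week labels, and the train_model, predict, and evaluate helpers are illustrative assumptions standing in for the MLlib steps described later, not the project's actual code.

```python
# Sketch of the expanding-window rolling forecast (illustrative only).
training_files = ["2013.csv"]                      # initial window: the 52 weeks of 2013
weeks_2014 = ["20140107", "20140108", "20140115"]  # placeholder week labels

for current, upcoming in zip(weeks_2014[:-1], weeks_2014[1:]):
    training_files.append(current + ".csv")   # grow the window by one week
    model = train_model(training_files)        # rebuild the model on all data seen so far
    predictions = predict(model, upcoming)     # forecast the following week
    scores = evaluate(predictions, upcoming)   # cross-validate against computed targets
    # scores feed the weekly metric plots and inform the next iteration
```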

A pipeline of Naive Bayes, Stochastic Gradient Descent, and Random Forest is run on the Apache Spark platform using pyspark, the Python binding of Apache Spark. Spark requires a cluster manager and a distributed storage system: Mesos is used for cluster management, and Spark is interfaced with the Hadoop Distributed File System (HDFS) for distributed storage. MLlib is a distributed machine learning framework on top of Spark that, because of Spark's distributed memory-based architecture, is nine times as fast as the Hadoop disk-based version of Apache Mahout. Apache Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or in Spark's standalone mode, and it can process data in HDFS. It is designed to handle both batch processing (training one year's data) and newer workloads such as streaming (live prediction of each week), interactive queries, and machine learning.
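As a rough illustration of this setup, a pyspark context could be configured along the following lines; the Mesos master URL, application name, and HDFS path are placeholders, not the actual cluster addresses.

```python
from pyspark import SparkConf, SparkContext

# Placeholder master URL and HDFS path; the real cluster addresses differ.
conf = (SparkConf()
        .setAppName("cms-popularity-rolling-forecast")
        .setMaster("mesos://mesos-master.example.org:5050"))
sc = SparkContext(conf=conf)

# Data stored in HDFS is read as a distributed dataset.
raw_2013 = sc.textFile("hdfs:///path/to/2013.csv")
```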

Data is read from the csv file into a pyspark dataframe. For parallelization to occur, the algorithms must take their input as RDDs, so the target values and the dataframe are converted into an RDD of LabeledPoint objects. Apache Spark performs parallel analysis on distributed data, which greatly reduces the total execution time of the whole algorithm or ensemble.
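A hedged sketch of that conversion is given below, assuming the dataframe has numeric feature columns plus a binary target column; the column name "popular" is invented for illustration.

```python
from pyspark.mllib.regression import LabeledPoint

# Assumed layout: feature columns plus a binary "popular" target column.
feature_cols = [c for c in df.columns if c != "popular"]

labeled_rdd = df.rdd.map(
    lambda row: LabeledPoint(float(row["popular"]),
                             [float(row[c]) for c in feature_cols])
)
```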

This project allowed several features and functionalities of Spark to be tested. These include streaming data and Spark Streaming, since each week's data is added to the entire dataset. Spark's cache and persistence functionality allows an RDD, once created, to persist in memory, so on the next iteration of the algorithm the new week can be appended straight to the RDD rather than repeating the complete procedure. Machine learning is performed using MLlib, which provides Random Forest, Naive Bayes, and Stochastic Gradient Descent.
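For example, caching the LabeledPoint RDD from the previous sketch and training the three MLlib models might look roughly like this; the tree count and the choice of SVMWithSGD as the Stochastic Gradient Descent learner are assumptions.

```python
from pyspark.mllib.classification import NaiveBayes, SVMWithSGD
from pyspark.mllib.tree import RandomForest

# Keep the training RDD in memory so next week's rows can be appended
# (e.g. via union) without re-reading and re-converting everything.
training = labeled_rdd.cache()

nb_model = NaiveBayes.train(training)
sgd_model = SVMWithSGD.train(training, iterations=100)   # SGD-based linear classifier
rf_model = RandomForest.trainClassifier(training,
                                        numClasses=2,
                                        categoricalFeaturesInfo={},
                                        numTrees=50)
```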

Ensembles

Machine learning ensemble methods combine multiple learning algorithms, here Naive Bayes, Random Forest, and Stochastic Gradient Descent, to obtain better predictive performance than could be obtained from any of the constituent algorithms alone. The ensemble refers to a concrete, finite set of alternative models and offers a flexible structure in which the user can manipulate the chosen features and parameters and choose among the alternatives (see the sketch below). A bagging and boosting approach is applied: the program derives a rule of thumb, or finds some structure in the data, and applies it to subsets of the training set; with each subset the rule is modified, producing a new rule that is then applied to the next subset, and the process is repeated for all subsets. AdaBoost is attractive here as it combines bagging and boosting, so the advantages of a random forest and of gradient boosting can both be obtained in one technique.
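One simple way to realise such a combination, sketched here as a majority vote over the three fitted models from the earlier sketch, is shown below; test_features_rdd (an RDD of feature vectors for the held-out week) and the voting rule are illustrative assumptions, not the project's exact combination scheme.

```python
# Each MLlib model predicts over the whole test RDD; tree-based models in
# particular should be applied this way rather than inside a worker-side map().
nb_preds = nb_model.predict(test_features_rdd)
sgd_preds = sgd_model.predict(test_features_rdd)
rf_preds = rf_model.predict(test_features_rdd)

# Majority vote: a dataset is labelled popular if at least two models agree.
ensemble_preds = (nb_preds.zip(sgd_preds).zip(rf_preds)
                  .map(lambda p: 1.0 if p[0][0] + p[0][1] + p[1] >= 2 else 0.0))
```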

Performance Metrics

The performance metrics describe how well each algorithm performs in terms of the time taken to train the model and the accuracy of its predictions.

CPU Time

Using psutil (python system and process utilities) and the wide range of functions it provides, the CPU time consumed can be measured to determine how long a process has taken. psutil is a cross-platform Python library for retrieving information on running processes and system utilization (CPU, memory, disks, and network). It is also used for system monitoring, profiling, limiting process resources, and managing running processes. The algorithms consistently use approximately 25% of the CPU.
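A small sketch of how psutil can record the CPU time around a training run is given below; the process being measured and the step inside the comments are placeholders.

```python
import psutil

proc = psutil.Process()          # the current (driver) process
cpu_before = proc.cpu_times()

# ... run the training / prediction step being measured ...

cpu_after = proc.cpu_times()
used = ((cpu_after.user - cpu_before.user)
        + (cpu_after.system - cpu_before.system))
print("CPU time consumed: %.2f s, current CPU utilisation: %.1f%%"
      % (used, psutil.cpu_percent(interval=1)))
```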

Accuracy

Accuracy is measured in terms of true positives, true negatives, false negatives, and false positives. Combining these values yields parameters such as precision, recall, F1, true negative rate, true positive rate, and fall-out rate, which together with the plots offer a greater degree of insight. A system with high recall but low precision returns many results, but most of its predicted labels are incorrect when compared to the training labels. A system with high precision but low recall is just the opposite: it returns very few results, but most of its predicted labels are correct. An ideal system with high precision and high recall returns many results, all labelled correctly. Underfitting and overfitting also need to be checked using cross-validation. Underfitting occurs when the features are not learnt properly because the model, for example a linear function, is not expressive enough to fit the training samples. Overfitting occurs when the model learns the noise in the data and treats it as a feature. Overfitting can be pictured as a high-order polynomial fit, whereas underfitting resembles a low-degree polynomial such as a linear function.
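Given the four confusion counts, these metrics reduce to a few ratios; a minimal helper (the function name and dictionary layout are mine, not the project's) could look like this.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Derive the reported metrics from raw confusion-matrix counts."""
    precision   = tp / float(tp + fp) if (tp + fp) else 0.0
    recall      = tp / float(tp + fn) if (tp + fn) else 0.0  # sensitivity / TPR
    specificity = tn / float(tn + fp) if (tn + fp) else 0.0  # true negative rate
    fallout     = fp / float(fp + tn) if (fp + tn) else 0.0  # false positive rate
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "fallout": fallout, "f1": f1}
```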

Conclusion

Apache Spark is a framework that provides fast, parallel processing of distributed data in real time, along with powerful cache and persistence capabilities, which makes it well suited to data analytics problems. Components such as Spark Streaming and MLlib (Spark's native machine learning library) make the analysis possible and are useful for CMS data analysis. Spark thus covers many of the features that CMS data analyses require, such as streaming and distributed data.


