Running experiments in Azure Machine Learning
Lukas Beran
Microsoft Azure is probably well known. Azure is a cloud platform with a lot of platform and infrastructure services. Machine Learning is one of the newest Azure services. Azure Machine Learning offers machine learning techniques with great documentation to customers with just basic needs and a basic knowledge of math and statistics, as well as to advanced researchers working with Python and R, with an SLA guarantee and technical support.
Theory of Machine Learning
Machine Learning has become a really popular technique in recent years. Simply said, machine learning transforms datasets into pieces of software called models. A model represents a dataset, generalizes it and produces predictions for new data. The Bing search engine, for example, also uses this technique.
Machine learning is also used by spam filters, which constantly learn new spam rules and apply them to new incoming emails, as well as for anomaly or error detection and very effective noise filters. And of course machine learning powers personal assistants such as Cortana.
But the mentioned examples are only a few of the areas where machine learning is used. Generally we can divide them into three categories:
- Data Mining, where machine learning is used for finding patterns in large databases.
- Statistical Engineering, where machine learning is used for transforming data into software that can make decisions even on incomplete data.
- Artificial Intelligence, where machine learning is used to emulate the human brain, so computers can “see, hear and understand”.
But obviously machine learning is not a simple technique. It usually requires complicated software, powerful computers and experienced researchers. Because of that, companies usually can’t afford to build their own machine learning algorithms and systems, and it can be much more suitable to buy machine learning as a service.
Machine Learning at Microsoft
Because Azure Machine Learning is a new service, some of you may think that Microsoft does not have much experience with machine learning. But it’s not true 🙂 Researchers at Microsoft have been using machine learning for more than 20 years. And this is much more experience than most of their competitors have.
The first machine learning techniques appeared in 1992, when Bayesian networks started being used for modelling natural language and for voice recognition. Thanks to this, they found that a lot of problems are solvable using machine learning with linear classification and Bayesian networks. The result was the first spam filter based on content analysis.
The research of machine learning algorithms brought new opportunities in computer vision and voice recognition. Decision trees were used for pixel-wise classification in human pose recognition, which is also used in the Xbox Kinect sensor.
Machine Learning Studio
A big advantage of Azure Machine Learning is its simplicity: even people without a deep knowledge of data analysis can predict new data and values. Machine Learning Studio uses drag-and-drop gestures for building an experiment and shows simple graphs of the data flow. Thanks to this you can create your own experiment without a single line of source code. Besides this, you can use prepared sample experiments and algorithms created by Microsoft Research.
For advanced researchers there are more than 350 R packages and full Python support.
Experiments allow sharing and collaboration. You can work with your colleagues on one experiment and check the results together. You can also share your experiments with the public via the Azure Machine Learning gallery, where you can also find a lot of already created experiments.
You can upload your data to Machine Learning Studio from local files (supported formats are CSV, TSV, TXT, SvmLight, ARFF, ZIP and RData), or you can read your data using the Reader module, which can read data from Azure Table, Azure Blob Storage, Azure SQL Database, Hive Query, a Data Feed Provider (OData) or over HTTP (CSV, TSV, ARFF and SvmLight).
For accessing your experiment from your application or web, you can use the API with support for the Request-Response Service (RRS) and the Batch Execution Service (BES).
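As a sketch of what an RRS call looks like from Python: the service expects a JSON body listing input column names and row values that match the schema of the published experiment. The endpoint URL, API key and column names below are placeholders, not real values; after deploying a web service, the dashboard shows the real URL and key.

```python
import json

# Hypothetical values -- replace with the URL and API key shown on your
# web service's dashboard after you deploy the experiment.
SCORING_URL = "https://<region>.services.azureml.net/workspaces/<ws-id>/services/<svc-id>/execute?api-version=2.0"
API_KEY = "<your-api-key>"

# RRS expects the input columns and one or more rows of values.
payload = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["make", "body-style", "engine-size", "horsepower"],
            "Values": [["toyota", "sedan", "110", "100"]],
        }
    },
    "GlobalParameters": {},
}

body = json.dumps(payload)
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer " + API_KEY,
}

# The actual call (needs the `requests` package and a deployed service):
# import requests
# response = requests.post(SCORING_URL, data=body, headers=headers)
# print(response.json())
```

BES works similarly but submits a job over a whole dataset instead of a single request-response round trip.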
For your inspiration, you can read the official documentation at https://azure.microsoft.com/en-us/documentation/services/machine-learning/, and examples are available at https://studio.azureml.net/.
Price of Azure Machine Learning
For your experiments, you can use the free version of Azure Machine Learning, which requires only a Microsoft Account and no Azure subscription or credit card. But this version is limited.
Creating first experiment in Azure Machine Learning
The steps below are for the paid version of Azure Machine Learning. If you want to use the free version, click the Start button on the Azure Machine Learning web page and log in using your Microsoft Account. Afterwards your free workspace will be available. The approach is then the same as for the paid version.
In the left menu of the Azure classic portal (https://manage.windowsazure.com) choose Machine Learning. First you need to create a new workspace. A machine learning workspace provides basic separation of your tasks or customers. Each workspace has separate statistics and sharing options.
Let’s create a new workspace by clicking New – Data Services – Machine Learning – Quick Create and filling in the required information. Workspace Name is just a name for the workspace. Workspace Owner is the account that owns the workspace. Location is the datacenter where your experiments will be running; it’s a good idea to choose the geographically closest datacenter because of connection speed and latency. Storage Account is the name of the storage where you want your data stored. You can choose an existing one or create a new one.
Confirm the creation by clicking Create an ML workspace.
Now we have our workspace created and we can continue by creating our first experiment. Clicking the name of a workspace opens it. On the Dashboard we see usage of the workspace and basic information about it, as well as a link to Machine Learning Studio and a link to the documentation.
On the Configure tab you can block users’ access to this workspace (Allow or deny access to workspace), change the owner of the workspace (Workspace Owner) or check the storage name assigned to the workspace (Storage Name).
On the last tab you can check or change the Web Services settings for your experiments. This setting is available once you have published some web services. You can read more about web services in Deploy an Azure Machine Learning web service.
Now let’s open Machine Learning Studio. On the Dashboard click Sign-in to ML Studio. You will see that you don’t have any experiments yet, so create a new one. Click New and choose Experiment – Blank Experiment. You can also select from the available example experiments, or open the gallery and select an experiment from there.
After that you will see the workspace of the experiment. In the left pane are datasets (your own and examples) and the available modules. The main part of the screen is the workspace; here you add modules and datasets and connect them together. The right part of the screen is the configuration pane for the modules you have added to your experiment. Here you can also go directly to the documentation of the selected module.
The experiment is controlled by drag and drop.
Experiments contain parts for data, training the model, scoring the model and evaluation. We can create an experiment which takes data, trains a model and applies this model to new, unknown data. If your data is in plain text or contains mistakes or missing values, you have to prepare and repair it first. Machine Learning Studio offers modules for cleaning missing data, projecting columns or replacing some values, for example. Then we need to divide the data into training and testing parts, because we need to validate the outputs (supervised learning).
Experiment creation consists of these parts:
- Creating a model
  - Get the data
  - Prepare the data
  - Choose a set of features
- Training the model
  - Choose and apply an appropriate machine learning algorithm
- Evaluating the model
- Scoring the data and predicting new values
In the following example we will try to predict the price of used cars. We will use the example car data which is available in Machine Learning Studio.
The example dataset is already in Machine Learning Studio, but you can also upload your own dataset. In the left pane find the dataset Automobile price data and insert the module into the workspace of your experiment.
Now we can visualize the data. Right click on the output of the module (dataset) and choose Visualize.
The columns of the window are features of each record; in this example the features are make, number of wheels, fuel type and so on, and of course the price which we want to predict. By clicking a column you can show statistical information about it, such as the number of unique values, the number of missing values and so on.
The main tasks of data preparation are cleaning, integration, transformation, reduction and discretization or quantification. For this we have the Data Transformation modules. We can also use the Clean Missing Data module for removing/replacing missing values and Remove Duplicate Rows for removing duplicates.
For effective predictions we need to remove or replace missing values. Usually we can remove the entire row of the table or replace a missing value with the median or mean. But if a feature has a lot of missing values, it may be better to remove the whole column.
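Since Machine Learning Studio also supports Python, the same cleaning steps can be sketched in pandas. The data frame below is a tiny made-up stand-in for the Automobile price dataset (the column names follow the article, the values are invented):

```python
import pandas as pd
import numpy as np

# Toy stand-in for the Automobile price dataset; values are made up.
df = pd.DataFrame({
    "make": ["audi", "bmw", "audi", "toyota"],
    "normalized-losses": [np.nan, 164, np.nan, 104],
    "horsepower": [102, 115, np.nan, 92],
    "price": [13950, 16430, 15250, 9988],
})

# Drop a column with many missing values (like excluding
# normalized-losses with the Project Columns module)...
df = df.drop(columns=["normalized-losses"])

# ...then either remove the rows with remaining gaps
# (Clean Missing Data -> Remove entire row)...
cleaned = df.dropna()

# ...or replace the gaps with the column median instead:
imputed = df.fillna({"horsepower": df["horsepower"].median()})

print(len(cleaned))  # 3 rows survive the row removal
```

Whether to drop rows or impute values is the same trade-off described above: dropping loses data, imputing invents it.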
Let’s find the Project Columns module, with which we can remove unneeded columns. Add this module to the workspace under the dataset and connect the input of the module with the output of the dataset.
By clicking the Project Columns module we can see its settings in the right pane. In Select Columns click Launch column selector, and in the new window choose Begin with All columns to start with all columns and Exclude the column normalized-losses, because this column has 20% of its values missing.
Now find the Clean Missing Data module, add it under the Project Columns module and connect them.
Select the Clean Missing Data module, click the Selected columns button in the right pane and again start with All columns. In Cleaning mode change the mode to Remove entire row.
We can add comments to modules by double-clicking the module.
Now we are ready to clean the data. Run the experiment by clicking the Run button in the bottom menu. We can check the status of the experiment with the simple indicator in each module: an hourglass icon means the module is in a queue, a spinning circle means the module is currently running and a green check icon means the module has finished.
The output of every module can be visualized or saved as a dataset.
Choosing set of features
In machine learning, a set of features is something measurable which describes the problem we are trying to solve. In our example, each row of the dataset is one car and each column (feature) is a parameter of the car (price, make, engine size, highway mpg, …). Finding the best set of features requires experimenting and knowledge of the real problem. Some features are better than others, and some of them can decrease the quality of our experiment (prediction). A feature can also be strongly correlated with another feature, in which case we can remove one of them. In our example, city mpg is strongly correlated with highway mpg, so we can remove one of them.
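Correlation between two features is easy to check numerically. A quick pandas sketch with invented mpg figures (the real dataset's values will differ, but city and highway mpg track each other similarly):

```python
import pandas as pd

# Made-up mpg figures illustrating two strongly correlated features.
df = pd.DataFrame({
    "city-mpg":    [21, 19, 24, 30, 38],
    "highway-mpg": [27, 25, 30, 38, 47],
})

# Pearson correlation; values near 1 mean one feature carries
# almost no information the other doesn't already have.
corr = df["city-mpg"].corr(df["highway-mpg"])
print(corr)  # close to 1, so one of the two columns can be dropped
```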
Now the experimenting starts. It’s very common that we won’t find the best set of features on the first attempt and we will need to try a different set of features and run the experiment again.
In the first run we start with the following set of features: make, body-style, wheel-base, engine-size, horsepower, peak-rpm, highway-mpg and of course price. Add these features to another Project Columns module which we connect to the last module. In the settings of this module, start with no columns and Include the mentioned columns.
Choosing and applying Machine Learning algorithm
Now we have our data ready for training and testing. The basic tasks in machine learning are classification and regression. Simply said, we use classification when we have a known set of defined values and every input belongs to one of them (gender – man/woman, boolean – yes/no, etc.). Regression is a continuous prediction model (age, price, etc.).
In our example we want to predict a price, so our model is based on regression. And for faster training and evaluation, we will use two algorithms and compare them.
First we need to divide the data into training and testing parts, because we will use supervised learning (we know the price in our training dataset). The testing part of the data will be used to evaluate how good or bad the model is: we remove the prices from the data, predict them and compare the predictions with the real known prices. The dividing ratio can range from 50:50 to 80:20; we will use 60:40.
For dividing (splitting) the data there is the Split module, which we add to the end of the model, setting Splitting mode to Split Rows and the fraction to 0.6 (60% of the data in the first part).
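What the Split module does can be sketched in a few lines of NumPy: shuffle the row indices and cut them at 60% (scikit-learn's `train_test_split` does the same in one call). The row count here is arbitrary.

```python
import numpy as np

# 60:40 split of row indices, mirroring the Split module's
# "Split Rows" mode with fraction 0.6.
rng = np.random.default_rng(seed=0)  # fixed seed for reproducibility
n_rows = 200                         # arbitrary example size
indices = rng.permutation(n_rows)

cut = int(0.6 * n_rows)              # 120 rows go to training
train_idx, test_idx = indices[:cut], indices[cut:]

print(len(train_idx), len(test_idx))  # 120 80
```

Shuffling before cutting matters: if the rows are sorted (for example by price), a plain head/tail split would give training and testing sets with very different distributions.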
Now we run the experiment again to split the data and get the results on the outputs.
Because the model has only a few features (fewer than 100) and the matrix is not sparse, the decision boundary will probably be non-linear, so the basic linear regression algorithm is not suitable. We therefore choose two non-linear regression algorithms and compare them – Poisson Regression and Decision Forest Regression. Both algorithms have settings which can be modified, but for our example we will use the default settings.
If you want to find the best settings, you can use the Sweep Parameters module, which tests different settings and compares the results. But this approach requires a lot of time (tens of hours, depending on the size of the input data).
So let’s add both algorithms to our model.
Then we add modules for training (Train Model) and connect them to the regression modules. The left input is connected to the regression module, the right input to the left output of the Split module.
In both training modules we select the column which we want to predict (price) and run the experiment again.
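The Studio modules Poisson Regression and Decision Forest Regression are not available as a library, but scikit-learn's `PoissonRegressor` and `RandomForestRegressor` play similar roles, so the training step can be sketched locally. The data below is synthetic (two invented numeric features standing in for engine-size and horsepower), and the library choice is an assumption of this sketch, not part of the original article.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import PoissonRegressor

# Synthetic stand-in for the prepared car data:
# two numeric features and a positive price-like target.
rng = np.random.default_rng(seed=1)
X = rng.uniform(60, 300, size=(100, 2))
y = 50 * X[:, 0] + 30 * X[:, 1] + rng.normal(0, 500, size=100)

X_train, y_train = X[:60], y[:60]  # 60:40 split as in the article
X_test, y_test = X[60:], y[60:]

# Poisson regression needs a non-negative target; prices qualify.
poisson = PoissonRegressor(max_iter=300).fit(X_train, y_train)
forest = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Each trained model can now score the held-out 40% of the rows.
print(poisson.predict(X_test)[:3])
print(forest.predict(X_test)[:3])
```

This mirrors the two Train Model branches in the experiment: both models are fitted on the same training split and later compared on the same test split.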
Predictions of new data
Now our model is trained and we can score the rest of the data to validate how accurate our model is. So let’s add two Score Model modules and connect the left input to the output of the Train Model module and the right input to the right output of the Split module.
By running the experiment we score the testing data. The results are available from the output of the Score Model modules in the column Scored Labels.
Now we can evaluate the results in the Evaluate Model module, which we connect to the scoring modules.
On its output we can see the evaluation.
For comparing both algorithms we can use another Evaluate Model module.
After running the experiment we will see simple comparison of both algorithms.
As we can see from the table, the Decision Forest Regression algorithm provides better results.
Work with results
The results are available from the Score Model module. We can save these results to a new dataset (right-click on the output and Save as Dataset) or we can download the results to our computer in one of the supported file formats (ARFF, CSV, SvmLight, TSV) by adding the required module (for example Convert to CSV) connected to Score Model.
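The CSV conversion itself is trivial to reproduce in pandas. The data frame below is a hypothetical scored output; the Score Model module appends a Scored Labels column with the predicted price next to the original columns:

```python
import pandas as pd

# Hypothetical scored output with invented values; "Scored Labels"
# holds the predicted price produced by the model.
scored = pd.DataFrame({
    "make": ["audi", "bmw"],
    "engine-size": [109, 164],
    "price": [13950, 16430],
    "Scored Labels": [13500.0, 16900.0],
})

# Equivalent of adding a Convert to CSV module and downloading the file.
csv_text = scored.to_csv(index=False)
print(csv_text)
```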
For direct access to the results we can use the Writer module and the API.
- Mean Absolute Error (MAE) is a quantity used to measure how close forecasts or predictions are to the eventual outcomes.
- Root Mean Squared Error (RMSE) is the square root of the mean of the squares of all the errors.
- Relative Absolute Error (RAE) takes the total absolute error and normalizes it by dividing by the total absolute error of the simple predictor.
- Relative Squared Error (RSE) similarly takes the total squared error and normalizes it by dividing by the total squared error of the simple predictor.
- Coefficient of Determination (R2, pronounced R squared) is a number that indicates the proportion of the variance in the dependent variable that is predictable from the independent variable.
For all these statistics, the lower the better, except the coefficient of determination, where the closer to one (1), the better.
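All five metrics are easy to compute by hand, which also makes their definitions concrete. The true and predicted prices below are invented, and the "simple predictor" in RAE/RSE is taken to be the mean of the true values:

```python
import numpy as np

# Toy true and predicted prices to illustrate the evaluation metrics.
y_true = np.array([13950.0, 16430.0, 9988.0, 15250.0])
y_pred = np.array([13500.0, 16900.0, 10400.0, 14800.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))         # Mean Absolute Error
rmse = np.sqrt(np.mean(errors ** 2))  # Root Mean Squared Error

# The "simple predictor" baseline: always guess the mean price.
baseline = y_true - y_true.mean()
rae = np.sum(np.abs(errors)) / np.sum(np.abs(baseline))  # Relative Absolute Error
rse = np.sum(errors ** 2) / np.sum(baseline ** 2)        # Relative Squared Error
r2 = 1 - rse                          # Coefficient of Determination

print(mae, rmse, rae, rse, r2)
```

Note how R² falls directly out of RSE: a model no better than guessing the mean gets RSE = 1 and therefore R² = 0.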