“Data, data everywhere…”: Leveraging IBM Watson Studio for private data with Federated Learning
Written by Yair Schiff and Jim Rhyness
To (mis)quote a line from a famous poem about a sailor stranded on a raft in the middle of an ocean of information:
“Data, data everywhere,
Nor any way to use them.”
Organizations are increasingly turning to machine learning and AI to facilitate and make critical business decisions on sensitive data. However, while the power of these models has steadily increased, the need to protect certain private data has remained constant. This represents a significant hurdle for many organizations across sectors, from healthcare to financial services, that want to tap into the potential of AI but require safety and privacy protocols for their data. Releasing data to a machine learning platform in exchange for enhanced decision making is often not an option for certain entities, and even intra-organization data sharing can often incur significant lead times and overhead, if it’s possible at all.
To help circumvent the issue of releasing sensitive data needed to train machine learning models, IBM Watson Studio, in collaboration with IBM Research, is proud to announce the general availability release of Federated Learning for IBM Watson® Machine Learning. Federated Learning is a new training paradigm in which a single model is trained collaboratively by remote parties that each hold onto their respective data, thus unlocking the potential of machine learning without sacrificing data privacy.
With this latest state-of-the-art offering, we are continuing to lower the barrier to adoption for cognitive computing solutions. Moreover, with full integration into the Cloud Pak for Data™ platform, IBM Federated Learning can leverage all the data preprocessing, model deployment, and monitoring tools that IBM Watson has to offer.
Read more about the technical aspects of IBM Federated Learning and get started with Watson Studio today!
Let’s get this (remote training) party started!
To see how we can unlock IBM Watson Machine Learning for private data, we will walk through an example of Federated Learning in action. Let’s assume we’re a two-sided online platform that connects travelers with hosts for overnight stays (do any of those come to mind?). As a service that aims to maximize user experience, we want to anticipate which hosts provide their guests with stellar service. Using listing information and guest reviews, we want to predict who will become a ‘super host’ in order to better connect guests and hosts in the future. However, because the listings include location information and the comments may contain other sensitive details, our corporate governance prohibits the data for each location from leaving pre-approved data centers. Using Federated Learning with IBM Watson Machine Learning, we finally have a way to leverage the power of AI and make use of our private data. Read more about the dataset we will use below and find instructions for download.
IBM Federated Learning relies on two kinds of entities: remote parties, which each train on their own local private data, and an aggregator, which is hosted on Watson Studio and fuses the model updates from each party into a combined trained model. We will walk through the process of setting up these entities and training below.
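To make the division of labor concrete, here is a minimal, self-contained sketch of one such training loop in plain Python. It is a toy simulation, not IBM's implementation: the `Party` class, the `fuse` function, and the tiny linear model are all hypothetical names chosen for illustration. The point to notice is that only model weights cross the party boundary; the `(x, y)` pairs never do.

```python
from dataclasses import dataclass
from typing import List, Tuple

Weights = List[float]  # [slope, intercept] of a toy linear model


@dataclass
class Party:
    """Hypothetical remote party; `data` stays local and is never sent anywhere."""
    name: str
    data: List[Tuple[float, float]]  # private (x, y) pairs

    def local_update(self, weights: Weights, lr: float = 0.05, steps: int = 20) -> Weights:
        """Run a few gradient-descent steps on local data and return only the weights."""
        w, b = weights
        n = len(self.data)
        for _ in range(steps):
            grad_w = sum(2 * (w * x + b - y) * x for x, y in self.data) / n
            grad_b = sum(2 * (w * x + b - y) for x, y in self.data) / n
            w -= lr * grad_w
            b -= lr * grad_b
        return [w, b]


def fuse(updates: List[Weights]) -> Weights:
    """Aggregator side: combine party updates (a plain mean, for illustration)."""
    return [sum(column) / len(updates) for column in zip(*updates)]


def run_federated_round(global_weights: Weights, parties: List[Party]) -> Weights:
    """One round: send out the current weights, collect local updates, fuse them."""
    updates = [party.local_update(global_weights) for party in parties]
    return fuse(updates)


# Two parties hold different samples of the same relationship, y = 2x + 1.
parties = [
    Party("party_a", [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]),
    Party("party_b", [(-1.0, -1.0), (0.5, 2.0), (1.5, 4.0)]),
]
weights = [0.0, 0.0]
for _ in range(50):
    weights = run_federated_round(weights, parties)
print(weights)  # should end up close to slope 2, intercept 1
```

In a real deployment the parties run on separate machines and the fusion step is whichever method the experiment is configured with; the loop structure is the part this sketch is meant to convey.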
Creating the Experiment
Our first step is to add a Federated Learning experiment to a Watson Studio project.
Federated Learning in IBM Watson Studio currently supports the TensorFlow 2, PyTorch, Scikit-Learn (including regression, classification, and KMeans models), and XGBoost frameworks. For our super host prediction task, we choose the Scikit-Learn framework to train a linear classification model. We then upload, as a pickle or zip file, a randomly initialized, untrained model that will be updated with stochastic gradient descent.
Fusion Method
Next, we choose a fusion method, which is the algorithm used by the aggregator to combine the training results from the parties. IBM Federated Learning supports different methods depending on the framework and model type that have been selected. In this example, we select Weighted average, since we assume the remote training parties will have datasets of different sizes and we want to weight their contributions accordingly.
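The idea behind weighting by dataset size can be sketched in a few lines. The function below is a generic illustration of a sample-count-weighted average of parameter vectors; it is not the internals of the Weighted average option in Watson Studio, and the name `weighted_average` and the example numbers are made up for this sketch.

```python
from typing import List, Sequence


def weighted_average(
    updates: Sequence[Sequence[float]], sample_counts: Sequence[int]
) -> List[float]:
    """Average parameter vectors, weighting each by its party's sample count."""
    if not updates or len(updates) != len(sample_counts):
        raise ValueError("need exactly one sample count per update")
    total = sum(sample_counts)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(updates[0])
    fused = [0.0] * dim
    for params, count in zip(updates, sample_counts):
        if len(params) != dim:
            raise ValueError("all updates must have the same length")
        for i, value in enumerate(params):
            fused[i] += value * count / total
    return fused


# Hypothetical updates from three parties with very different dataset sizes.
updates = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
sample_counts = [100, 300, 600]
print(weighted_average(updates, sample_counts))
```

With these numbers the fused parameters land at roughly [4.0, 3.0], pulled toward the largest party, whereas an unweighted mean would give [3.0, 2.0].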
Hyperparameters
Each framework and fusion method exposes several training hyperparameters that we can set for all the parties and for the aggregator. In our experiment, we will set the number of training rounds; the quorum for the parties, meaning that training can continue even if a certain percentage of remote training systems become unresponsive; and the accuracy termination threshold, which ends training once the aggregated model has achieved some minimum level of accuracy.
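How these three settings interact is easiest to see as control flow. The sketch below simulates it under stated assumptions: `run_rounds`, its parameter names, and the status strings are hypothetical, the quorum is treated as the minimum fraction of parties that must reply in a round, and a round that misses quorum simply stops the run here. The actual aggregator may handle that case differently.

```python
from typing import Callable, Dict, List


def run_rounds(
    num_parties: int,
    rounds: int,
    quorum: float,
    termination_accuracy: float,
    collect_replies: Callable[[int], List[str]],
    evaluate: Callable[[int], float],
) -> Dict[str, object]:
    """Simulate the aggregator's control flow for rounds, quorum, and early stop."""
    history: List[float] = []
    for round_no in range(1, rounds + 1):
        replies = collect_replies(round_no)
        # Quorum (assumed meaning): fraction of parties that must respond.
        if len(replies) < quorum * num_parties:
            return {"status": "quorum_not_met", "round": round_no, "history": history}
        accuracy = evaluate(round_no)
        history.append(accuracy)
        # Stop early once the fused model is accurate enough.
        if accuracy >= termination_accuracy:
            return {"status": "accuracy_reached", "round": round_no, "history": history}
    return {"status": "rounds_exhausted", "round": rounds, "history": history}


# Simulated run: six parties, one of which drops out after the first round.
cities = ["Sydney", "Toronto", "Trentino", "Vancouver", "Venice", "Vienna"]
accuracy_by_round = {1: 0.70, 2: 0.80, 3: 0.90, 4: 0.92}  # made-up numbers

result = run_rounds(
    num_parties=len(cities),
    rounds=4,
    quorum=0.5,
    termination_accuracy=0.85,
    collect_replies=lambda r: cities if r == 1 else cities[:5],
    evaluate=lambda r: accuracy_by_round[r],
)
print(result["status"], result["round"])
```

In this simulated run, five of six parties still satisfy a quorum of 0.5, so training proceeds and stops in round 3, the first round whose accuracy meets the 0.85 threshold.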
Remote Training Systems
We now need to designate which parties can collaborate on this training by creating or selecting Remote Training Systems. Using the ‘Allowed users’ list, we restrict who is permitted to act as a remote training party. Each party must be a collaborator in the aggregator’s project and authorized to join the training experiment.
The remote party component of the training process is provided in the IBM Watson Machine Learning Client. The party can be run in a Cloud Pak for Data™ instance or on a stand-alone machine; in short, anywhere that you can run a Python process and maintain connectivity to the aggregator.
We will create a Remote Training System for each of the host locations: Sydney, Toronto, Trentino, Vancouver, Venice, and Vienna. These will be the parties that train using the city-specific datasets that they each own.
Aggregator
With the experiment settings finalized, we start the aggregator, which waits for the parties to connect. Once all the parties have joined, the aggregator sends them the model and receives their updates via WebSockets.
Party Configuration
On the aggregator UI, we have access to remote training template scripts that each party can download to start training local copies of the model. Each script will contain a similar template to the one below, and the parties are responsible for filling in the relevant credentials and data paths.
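As a rough illustration of the kind of values a party has to supply (this is not the actual template from the aggregator UI, whose field names may differ), the sketch below keeps the party's settings in a plain dictionary and checks that no placeholder has been left unfilled before training starts. Every key name, the `PARTY_API_KEY` environment variable, and the helper `unfilled_fields` are hypothetical.

```python
import os
from typing import Dict, List

# Hypothetical settings; the real template defines its own field names.
PARTY_CONFIG: Dict[str, str] = {
    # Read secrets from the environment rather than hard-coding them.
    "api_key": os.environ.get("PARTY_API_KEY", "<YOUR_API_KEY>"),
    "project_id": "<PROJECT_ID>",
    "remote_training_system_id": "<REMOTE_TRAINING_SYSTEM_ID>",
    "aggregator_id": "<AGGREGATOR_ID>",
    "data_handler_path": "<PATH_TO_DATA_HANDLER>",
    "data_path": "<PATH_TO_LOCAL_DATA>",
}


def unfilled_fields(config: Dict[str, str]) -> List[str]:
    """Return the names of settings that are empty or still look like <PLACEHOLDER>."""
    return sorted(
        key
        for key, value in config.items()
        if not value or (value.startswith("<") and value.endswith(">"))
    )


missing = unfilled_fields(PARTY_CONFIG)
if missing:
    print("Fill in before connecting to the aggregator:", ", ".join(missing))
```

A check like this fails fast on the party's own machine instead of partway through connecting to the aggregator.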
Data Handlers
One of the key fields in the remote training script above is the data handler path. For each party to process its respective data, it must create a ‘Data handler’. Find out more about this process by watching our tutorials.
A Data Handler extends the ibmfl.data.data_handler.DataHandler class, which is available in the IBM Watson Machine Learning client and gives each party the flexibility to load and preprocess data in different formats. Below is an example data handler that could be used by the remote parties in our training.
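The sketch below shows roughly what such a handler might look like for one city's listings stored in a local CSV file. Treat it as an illustration under assumptions rather than the handler used in this experiment: the base class path comes from the text above, but the `get_data` method name and its `(x_train, y_train), (x_test, y_test)` return shape should be checked against the client documentation, and the class name, the configuration keys, and the `is_superhost` column are hypothetical. A stand-in base class lets the sketch run where the client is not installed.

```python
import csv
from typing import List, Tuple

try:
    # Class path as named in the text; present when the client is installed.
    from ibmfl.data.data_handler import DataHandler
except ImportError:
    class DataHandler:  # stand-in so the sketch runs without the client
        def __init__(self, **kwargs):
            pass


class ListingsDataHandler(DataHandler):
    """Hypothetical handler that loads one city's listings from a local CSV."""

    def __init__(self, data_config=None):
        super().__init__()
        config = data_config or {}
        self.csv_path = config.get("csv_file")  # hypothetical key
        self.label_column = config.get("label_column", "is_superhost")  # hypothetical
        self.test_fraction = float(config.get("test_fraction", 0.2))

    def _load(self) -> Tuple[List[List[float]], List[int]]:
        """Read numeric feature columns and the 0/1 label column from the CSV."""
        features, labels = [], []
        with open(self.csv_path, newline="") as handle:
            for row in csv.DictReader(handle):
                labels.append(int(row.pop(self.label_column)))
                features.append([float(value) for value in row.values()])
        return features, labels

    def get_data(self):
        """Return (x_train, y_train), (x_test, y_test), all read from local disk."""
        features, labels = self._load()
        # Hold out the last rows for testing (at least one row).
        split = len(features) - max(1, int(len(features) * self.test_fraction))
        return (features[:split], labels[:split]), (features[split:], labels[split:])
```

Because loading and preprocessing happen inside the party's own process, each city can keep its data in whatever format it already uses, as long as its handler returns arrays the shared model can consume.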
Training
The parties are now ready to connect to the aggregator and start training. With its script and data handler set up, each party can run training locally and receive updates and progress from the aggregator.
Back in the UI, the aggregator takes model updates from each party, fuses them together, reports back to the systems, and displays training progress.
Once the experiment has completed, we have access to the full jointly trained model. As part of Watson Studio and Cloud Pak for Data™, the saved Federated Learning model can leverage the full suite of data science lifecycle solutions offered on these platforms, including deployment either as a REST API that returns predictions online or as batch scoring jobs.
With this real-world example, we’ve demonstrated how Federated Learning on the IBM Watson Studio platform can be used to harness the power of AI without sacrificing data privacy. Learn more about IBM Federated Learning and get started with this latest offering from IBM Watson Studio today.
Happy modeling!