Posted by Yi Wang on March 27, 2017
Baidu contributes to the evolution of deep learning with PaddlePaddle (Parallel Distributed Deep Learning), a platform that emphasizes simplicity and efficiency. PaddlePaddle makes it easier for developers without extensive deep learning experience to put these tools to use.
Before we open sourced the project in September 2016, PaddlePaddle was an internal project for three years. Our driving force in creating it was for use with internal products at Baidu. Since September 2016, the project has grown very quickly, driven largely by GitHub issues, most of which came from outside of Baidu’s PaddlePaddle team. Potential clients also came and asked if we could provide a complete solution. By “complete solution,” they meant more than frameworks like PaddlePaddle that ease the programming work; the question remained: how should users run the programs? We’d like to provide a solution for users to set up on-premises clusters that run PaddlePaddle programs alongside the other distributed programs that form Internet businesses. Based on suggestions from our users, we are working with the Kubernetes community on cluster management.
Today there isn’t an open source deep learning framework that officially integrates with Kubernetes. We would like to be the first to help our users develop and run artificial intelligence (AI) on Kubernetes. PaddlePaddle also plans to integrate tightly with Kubernetes to achieve fault-recoverable distributed training.
Fault tolerance is critical because users want to run a Web server, a log collector, a data pipeline of Hadoop MapReduce and Spark, and AI on the same cluster, where higher-priority jobs (such as Web server jobs) might preempt processes belonging to lower-priority jobs (for example, AI jobs). However, users don’t like it when AI jobs just stop or fail; instead, they want them to continue running, just slower. This blog post explains the technical advantages of working with Kubernetes to achieve fault recovery.
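To make "continue running, just slower" concrete, here is a minimal sketch of the idea in Python. It is a toy simulation, not PaddlePaddle's or Kubernetes' actual mechanism: the function `run_job`, its parameters, and the round-based model are all assumptions made for illustration. Work is split into small tasks; when a worker is preempted, its unfinished task goes back into the queue and the remaining workers pick it up later.

```python
from collections import deque


def run_job(num_tasks, num_workers, preemptions):
    """Toy model of a training job that survives worker preemption.

    preemptions maps a round number to how many workers are taken away
    during that round (a hypothetical input for this sketch).
    Returns (completed task ids, number of rounds used).
    """
    if num_workers < 1:
        raise ValueError("need at least one worker")
    pending = deque(range(num_tasks))
    completed = []
    workers = num_workers
    rounds = 0
    while pending:
        # Each live worker picks up at most one pending task this round.
        in_flight = [pending.popleft() for _ in range(min(workers, len(pending)))]
        # Assumption: at least one worker always survives preemption.
        lost = min(preemptions.get(rounds, 0), workers - 1)
        # Tasks held by preempted workers are put back instead of failing the job.
        pending.extend(in_flight[:lost])
        completed.extend(in_flight[lost:])
        workers -= lost
        rounds += 1
    return completed, rounds


if __name__ == "__main__":
    print(run_job(8, 4, {}))      # no preemption
    print(run_job(8, 4, {0: 2}))  # two workers preempted in the first round
```

In this model the preempted run still completes every task; it simply needs more rounds because fewer workers remain.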
When we put this project on GitHub in September 2016 we started with 14 contributors. Now GitHub shows over 50 contributors. Our team inside Baidu has grown since September 2016, but many of our contributors now come from outside of Baidu. I’m happy that the open source community is growing fast. The traction for this project comes mostly from Chinese, Japanese, and US companies.
The community has really helped our team overcome some challenges, such as how to make the system scalable and fault-tolerant. Our original community didn’t have much knowledge of distributed systems or distributed operating systems. CoreOS, a primary contributor to Kubernetes alongside Google and Red Hat, has also helped us a lot. The collaboration of the Kubernetes and PaddlePaddle communities has brought a lot of experience and knowledge to distributed machine learning. We’re still working on this, but we’ve seen rapid progress.
The community we’ve built is mostly researchers, engineers, and college students; we don’t yet have strong marketing or enterprise solution experience. It’s clear, however, that what we’re creating has great potential value in the enterprise solution market. We would like to see people with relevant experience push this innovation.
Kubernetes is a cluster management system, or a distributed operating system, designed to run all kinds of jobs on the same set of computers. On a Kubernetes cluster, users can run all kinds of jobs that form a business, like a search engine or an e-commerce site. This includes Web servers that generate logs, Hadoop and Spark programs that process logs, and PaddlePaddle jobs that extract knowledge from the processed data and improve the Web server’s serving quality. Kubernetes schedules containers instead of virtual machines (VMs). This saves the overhead of running per-VM operating systems.
We’d like to offer a complete solution — which means not only a framework to ease the programming, but also infrastructure technologies to run the programs. Also, we want to be very open. We are not going to maintain two versions of PaddlePaddle — internal and external. The project has been Baidu-led. But as a first step toward having it be community-led, we are forming an advisory board.
In short, our aspirations include:
We’re also considering working with some universities and labs to keep evolving the API and help bring their innovations to the market.
Recently we’ve been discussing how engineering and research can work together to help the world. Existing innovations in deep learning have mostly found applications in internet search, advertising, recommendations, speech recognition, and image recognition. Others outside of Baidu work in these areas too, and they’ve seen some astonishing achievements, but there is still a gap in bringing this technology to traditional industries such as manufacturing and banking.
In China, many traditional industries are undergoing an AI revolution. For example, Gree and Midea have voice-controlled air conditioners. LeTV and Changhong’s TV sets are voice-controllable as well. Airlines and banks are also actively looking for efficiency gains from AI. In fact, HNA has a business unit called EchTech, which is actually a big data and AI research group. To support the manufacturers, we are working on pruning deep learning models and deploying them onto Internet of Things devices.
At PaddlePaddle, we’re excited for the developments we plan to bring to the deep learning market in 2017.