Angel: A Large-Scale Distributed Machine Learning Platform Based on the Parameter Server
At the end of last December, Angel graduated from the LF AI Foundation, becoming the first open-source project from China to do so. This means Angel has been recognized by technical experts around the world as one of the top AI open-source projects.
Now that it has graduated from LF AI, the license headers in the code will need to be updated; keep an eye on the open-source community for the latest developments.
LF AI is the Linux Foundation's top-level foundation for the AI field.
The image above shows the LF AI website's introduction to Angel. Interestingly, Alink, open-sourced by Alibaba, has also joined LF AI, as shown in the image below.
Overview
Angel is a large-scale distributed machine learning platform open-sourced by Tencent, focused on training high-dimensional models on sparse data. Angel is currently a Linux Foundation AI (LF AI) incubation project. Compared with industry peers such as TensorFlow, PyTorch, and Spark, it has the following characteristics:
Angel is a high-performance distributed machine learning platform built on the Parameter Server (PS) paradigm. Its flexible, user-definable PS Functions (PSFs) allow part of the computation to be pushed down to the PS side. The strong horizontal scalability of the PS architecture lets Angel efficiently handle models with hundreds of billions of parameters.
Angel includes a math library specially optimized for high-dimensional sparse features, with performance more than 10x that of the Breeze math library. Both Angel's PS and its built-in algorithm kernels are built on top of this library.
Angel excels at recommendation models and graph models (e.g., social network analysis). Figure 1 compares Angel with several mainstream platforms along five dimensions: sparse data, model dimensionality, performance, deep models, and ecosystem. TensorFlow and PyTorch have clear advantages in deep learning and ecosystem building, but their ability to handle sparse data and high-dimensional models is relatively weak; Angel complements them, and the PyTorch On Angel project introduced in version 3.0 attempts to combine the strengths of PyTorch and Angel.
Comparison between Angel and mainstream platforms
Overall architecture of Angel 3.0
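The push/pull/PSF interaction described above can be sketched in a few lines. This is a conceptual illustration only, not Angel's actual API; the class and method names are hypothetical. The point of a PSF is that a user-defined function runs on the server, so only its small result crosses the network instead of a full parameter slice.

```python
# Conceptual sketch of the Parameter Server pattern (hypothetical API, not Angel's):
# workers pull sparse parameter slices, compute gradients locally, and push
# updates back; a PSF runs server-side and returns only its result.

class ParameterServer:
    def __init__(self, dim):
        self.weights = [0.0] * dim

    def pull(self, indices):
        # Workers fetch only the sparse slice they need.
        return {i: self.weights[i] for i in indices}

    def push(self, grads, lr=0.1):
        # Apply sparse gradient updates in place.
        for i, g in grads.items():
            self.weights[i] -= lr * g

    def psf(self, func):
        # PS Function: the computation is pushed down to the server,
        # and only the (small) result travels back to the worker.
        return func(self.weights)

ps = ParameterServer(dim=5)
ps.push({0: -1.0, 3: -2.0})   # worker pushes a sparse gradient
print(ps.pull([0, 3]))        # {0: 0.1, 3: 0.2}
print(ps.psf(lambda w: sum(x * x for x in w)))  # squared L2 norm, computed on the PS
```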
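To illustrate why a sparse-optimized math library matters, here is a minimal sketch (not Angel's library): when vectors live in a space with billions of dimensions but only a handful of non-zero entries, storing index/value pairs makes a dot product cost proportional to the number of non-zeros rather than the full dimension.

```python
# Illustrative sparse dot product: O(nnz) instead of O(dim).
# Vectors are dicts mapping feature index -> value.

def sparse_dot(a, b):
    # Iterate over the smaller vector and probe the larger one.
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    return sum(v * large.get(i, 0.0) for i, v in small.items())

# Two vectors in a billion-dimensional space, each with only a few non-zeros.
a = {7: 1.0, 42: 2.0, 999_999_999: 3.0}
b = {42: 0.5, 100: 4.0}
print(sparse_dot(a, b))  # 1.0 (only index 42 overlaps: 2.0 * 0.5)
```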
Angel ChangeLog
Release-2.2.0 - 2019-05-06
In this release, we have enhanced the graph algorithms: (1) we refactored the existing K-Core algorithm, significantly improving its performance and stability; (2) we added the Louvain algorithm, also known as Fast-Unfolding. Test results show that both the K-Core and Louvain algorithms are more than 10x faster than their GraphX counterparts. This release also officially ships Vero, a new GBDT implementation on Spark On Angel, whose main advantage is strong support for high-dimensional models and multi-classification problems. We also add Kerberos support in this release.
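For readers unfamiliar with K-Core: the algorithm assigns each vertex the largest k such that it belongs to a subgraph where every vertex has degree at least k. Angel's version is a distributed Spark-on-Angel implementation; the single-machine sketch below only shows the underlying peeling idea.

```python
from collections import defaultdict

# Minimal k-core decomposition sketch: repeatedly peel vertices whose
# remaining degree falls below k, assigning core number k - 1 on removal.

def core_numbers(edges):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    core, remaining, k = {}, set(adj), 0
    while remaining:
        k += 1
        changed = True
        while changed:
            changed = False
            for u in list(remaining):
                if len(adj[u] & remaining) < k:
                    core[u] = k - 1       # u survives the (k-1)-core but not the k-core
                    remaining.discard(u)
                    changed = True
    return core

# A triangle (1, 2, 3) with a pendant vertex 4: triangle gets core 2, pendant core 1.
print(core_numbers([(1, 2), (2, 3), (1, 3), (3, 4)]))
```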
New features in Release-2.2.0:
Add Fast Unfolding algorithm in Spark-on-Angel
Support predict for FTRL-LR in Spark-on-Angel
Support predict for FTRL-FM in Spark-on-Angel
Add Vero, a feature parallelism version GBDT on Spark-on-Angel
Support regression for GBDT on Spark-on-Angel
Add a new data split input format, BalanceInputFormatV2
Support running over Kubernetes.
Bugs fixed in Release-2.2.0:
Fix the failure to load a model after the model has been moved; close the csc check
Fix the problem that parameter servers exit with errors in Spark-on-Angel
Fix the problem that the sparse index pull interface may block when the given parameters are invalid
Fix the problem that saving results would fail if the parent path does not exist
Fix the problem that the BalanceInputFormat would sometimes return empty splits
Fix the problem when saving json configuration files
Fix the problem when requesting the resources for Angel workers
Release-2.1.0 - 2019-03-08
In this release, we add an intelligent model partitioning method, named "LoadBalancePartitioner", in Spark-on-Angel. By analyzing the distribution of features in the training data in advance, the number of features on each partition can be precisely controlled, which leads to a balanced load on each server. Empirical tests demonstrate that the efficiency of model training can be greatly improved in many cases. Further, we add three algorithms in this release: FM solved by the FTRL optimizer, the K-Core algorithm, and a feature-parallel GBDT that supports high-dimensional tree models.
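The idea behind load-balanced partitioning can be sketched as follows. This is a simplified, hypothetical illustration of the approach described above (the function name and inputs are not Angel's API): count how often each feature index appears in the training data, then cut the index range so every partition carries roughly the same total access load rather than the same index span.

```python
# Sketch of load-balanced model partitioning: split a sorted list of
# (feature_index, occurrence_count) pairs into ranges of roughly equal load.

def balanced_ranges(feature_counts, num_partitions):
    total = sum(c for _, c in feature_counts)
    target = total / num_partitions          # desired load per partition
    ranges, start, acc = [], feature_counts[0][0], 0
    for idx, cnt in feature_counts:
        acc += cnt
        if acc >= target and len(ranges) < num_partitions - 1:
            ranges.append((start, idx))      # close the current partition
            start, acc = idx + 1, 0
    ranges.append((start, feature_counts[-1][0]))
    return ranges

# A skewed distribution: feature 0 dominates, so it gets a partition alone
# while the four rare features share the other one.
counts = [(0, 90), (1, 5), (2, 3), (3, 1), (4, 1)]
print(balanced_ranges(counts, 2))  # [(0, 0), (1, 4)]
```

A naive partitioner that splits the index range evenly would put half the skewed load on one server; balancing by observed counts is what evens out server load.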
New features in Release-2.1.0:
Add a load-balanced model partitioner, called "LoadBalancePartitioner", in Spark-on-Angel
Add FTRL-FM algorithm
Add K-Core algorithm
Add a feature-parallel version of the GBDT algorithm
Release-2.0.2 - 2019-01-30
In this release, we optimize the performance of the FTRL algorithm and add support for the float data type. We limit the maximum number of retries for remote requests to avoid unrecoverable blocking. We also improve the performance of the math library.
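The retry cap mentioned above is a generic pattern worth showing in miniature. The sketch below is illustrative only, not Angel's RPC code: bounding the number of attempts turns an endlessly-blocking remote request into a surfaced error the caller can handle.

```python
# Generic bounded-retry sketch: retry a failing request a fixed number of
# times, then raise instead of blocking forever.

def with_retries(request, max_retries=3):
    last_err = None
    for _ in range(max_retries):
        try:
            return request()
        except IOError as err:
            last_err = err                 # remember the failure and retry
    raise RuntimeError(f"request failed after {max_retries} attempts") from last_err

# Simulated flaky endpoint: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient")
    return "ok"

print(with_retries(flaky, max_retries=3))  # ok
```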
New features in Release-2.0.2
Optimize the model partitioning for FTRL algorithm
Support float data type for FTRL algorithm
Avoid rehashing in math library to obtain performance improvement
Add a maximum retry count for remote requests on servers
Release-2.0.1 - 2019-01-11
In this release, we add support for incremental training for FTRL. We implement some new optimizers and learning rate scheduling strategies. Documentation on how to choose optimizers or scheduling strategies, and how to accelerate deep learning algorithms with OpenBLAS, is provided in this release.
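For context, the FTRL-Proximal update from the literature is sketched below. This is the standard per-coordinate algorithm, not Angel's implementation; the hyperparameter names (alpha, beta, l1, l2) are conventional, not Angel configuration keys. Incremental training, as added in this release, amounts to saving and restoring the per-feature (z, n) state between runs.

```python
import math

# Per-coordinate FTRL-Proximal sketch: L1 regularization keeps weights
# exactly zero until the accumulated signal |z| exceeds the l1 threshold,
# which is why FTRL produces sparse models.

class FTRL:
    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z, self.n = {}, {}   # per-feature state; persist these for incremental training

    def weight(self, i):
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0            # below the L1 threshold: weight stays zero
        n = self.n.get(i, 0.0)
        return -(z - math.copysign(self.l1, z)) / (
            (self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def update(self, i, g):
        # g is the gradient for feature i on the current example.
        n = self.n.get(i, 0.0)
        sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
        self.z[i] = self.z.get(i, 0.0) + g - sigma * self.weight(i)
        self.n[i] = n + g * g

opt = FTRL()
for _ in range(3):
    opt.update(0, 1.0)
print(opt.weight(0))  # negative, since the gradients were positive
```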
New features in Release-2.0.1
Add documentation on how to use OpenBLAS to accelerate deep learning algorithms
Optimize the performance for FTRL
Support incremental training for FTRL
Add optimizers with L1 penalty: Adagrad/Adadelta
Add some scheduling strategies for learning rate
Bug Fix:
Fix the problem of inconsistent node counts in network embedding
Fix a casting problem in quantile compression
Owner
Angel is a big project consisting of a series of sub-projects; each sub-project has an owner and a backup owner:
Angel: paynie, leleyu
sona: fitzwang, leleyu
mlcore: fitzwang, endymecy
math: rachelsunrh, fitzwang
serving: ouyangwen, fitzwang
format: paynie, raohuaming
PyTorchOnAngel: leleyu, ouyangwen