Distributed Data Processing¶

Description	Tools for scaling data transformations and analyses across multiple servers.
Projects	10
Lines Committed vs. Age Chart (click to view)

Projects¶

Project	Size Score	Trend Score	Byline
Analytics Zoo	4.25	1.75	Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
Dask	6.75	4.75	Parallel computing with task scheduling.
Hadoop	9.0	8.25	The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
HPCC	6.0	6.25	HPCC Systems (High Performance Computing Cluster) is an open source, massive parallel-processing computing platform for big data processing and analytics.
Mars	6.75	4.5	Mars is a tensor-based unified framework for large-scale data computation which scales Numpy, Pandas and Scikit-learn.
Modin	5.0	7.5	Speed up your Pandas workflows by changing a single line of code
Pig	4.0	5.0	Apache Pig is a platform to create programs on top of Apache Hadoop.
Ray	9.0	8.75	An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library.
RayDP	2.0	5.75	Distributed data processing library on Ray by running popular big data frameworks like Apache Spark on Ray. RayDP seamlessly integrates with other Ray libraries to make it simple to build E2E data analytics and AI pipeline.
Spark	9.25	4.5	A unified analytics engine for large-scale data processing.