Distributed Data Processing

Description

Tools for scaling data transformations and analyses across multiple servers.

Projects

10

Projects

| Project | Size Score | Trend Score | Byline |
|---|---|---|---|
| Analytics Zoo | 5.0 | 8.25 | Distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray. |
| Dask | 6.0 | 6.25 | Parallel computing with task scheduling. |
| Hadoop | 8.5 | 7.5 | The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. |
| HPCC | 5.5 | 7.0 | HPCC Systems (High Performance Computing Cluster) is an open source, massively parallel-processing computing platform for big data processing and analytics. |
| Mars | 6.75 | 6.25 | Mars is a tensor-based unified framework for large-scale data computation which scales NumPy, pandas and scikit-learn. |
| Modin | 4.5 | 6.25 | Speed up your pandas workflows by changing a single line of code. |
| Pig | 3.25 | 1.75 | Apache Pig is a platform to create programs on top of Apache Hadoop. |
| Ray | 8.0 | 9.0 | An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library. |
| RayDP | 1.75 | 6.5 | A distributed data processing library that runs popular big data frameworks such as Apache Spark on Ray. RayDP integrates with other Ray libraries to make it simple to build end-to-end data analytics and AI pipelines. |
| Spark | 9.5 | 6.75 | A unified analytics engine for large-scale data processing. |
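
The Hadoop entry mentions "simple programming models". The best known of these is MapReduce, and the idea can be illustrated without a cluster. The following is a minimal single-process sketch in plain Python, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are hypothetical and chosen only for illustration. On a real cluster the framework would distribute the map and reduce calls across machines and perform the grouping step itself.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key (the step a framework runs between map and reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values seen for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence, in one process (illustration only)."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input: each string stands in for one record of a large data set.
    sample = ["spark ray dask", "ray spark", "ray"]
    print(word_count(sample))
```

Because each `map_phase` call depends only on its own line and each `reduce_phase` call only on its own key, both steps can in principle run in parallel on separate servers, which is the property the tools listed above exploit in different ways.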