Spark
Spark is one of the most popular big data processing tools and the successor to Hadoop MapReduce. It is mostly used for ETL (extract, transform, load), but it also includes facilities for streaming aggregations, graph computations, and machine learning.
Logo | |
---|---|
Website | |
Repository | |
Byline | A unified analytics engine for large-scale data processing. |
License | Apache 2.0 |
Project age | 12 years 8 months |
Backers | Apache (Governed by), Databricks (Commercial Product By) |
Latest News (2021-10-13) | 3.2.0: We are happy to announce the availability of Spark 3.2.0! Visit the release notes to read about the new features, or download the … |
Size score (1 to 10, higher is better) | 9.25 |
Trend score (1 to 10, higher is better) | 4.5 |
Education Resources
| URL | Resource Type | Description |
|---|---|---|
| | Documentation | Official project documentation. |
Git Commit Statistics
Statistics computed using Git data through November 30, 2022.
Statistic | Lifetime | Last 12 Months |
---|---|---|
Commits | 108,271 | 24,253 |
Lines committed | 47,197,033 | 5,113,840 |
Unique committers | 2,735 | 307 |
Core committers | 14 | 21 |
Similar Projects
| Project | Size Score | Trend Score | Byline |
|---|---|---|---|
| | 6.75 | 4.75 | Parallel computing with task scheduling. |
| | 6.0 | 6.25 | HPCC Systems (High Performance Computing Cluster) is an open source, massive parallel-processing computing platform for big data processing and analytics. |
| | 9.0 | 8.25 | The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. |
| | 6.75 | 4.5 | Mars is a tensor-based unified framework for large-scale data computation which scales Numpy, Pandas and Scikit-learn. |
| | 9.0 | 8.75 | An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library. |