Spark

Spark is one of the most popular big data processing tools and is widely regarded as the successor to Hadoop MapReduce. It is mostly used for ETL (extract, transform, load), but also includes facilities for streaming aggregations, graph computations, and machine learning.
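To make the ETL pattern mentioned above concrete, here is a minimal sketch in plain Python using only the standard library. It does not use Spark's API; the sample data, column names, and the `extract`, `transform`, and `load` helpers are all hypothetical and exist only to illustrate the three stages. In a Spark job, each stage would typically correspond to reading a dataset, applying transformations to it, and writing the result to a sink.

```python
import csv
import io
from collections import defaultdict

# Hypothetical input: a tiny CSV held in memory instead of a distributed file system.
RAW_CSV = """user,country,amount
alice,US,10.5
bob,DE,7.0
carol,US,4.5
dave,DE,not_a_number
"""


def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: skip rows with unparseable amounts, then total amounts per country."""
    totals = defaultdict(float)
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # malformed record, dropped during cleaning
        totals[row["country"]] += amount
    return dict(totals)


def load(totals):
    """Load: serialize the aggregated result as CSV text (a stand-in for a real sink)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["country", "total_amount"])
    for country in sorted(totals):
        writer.writerow([country, totals[country]])
    return out.getvalue()


result = load(transform(extract(RAW_CSV)))
print(result)
```

Keeping the three stages as separate functions mirrors how such pipelines are usually organized: each stage can be tested on its own and swapped out when the data source or sink changes.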

Website	http://spark.apache.org
Repository	https://github.com/apache/spark
Byline	A unified analytics engine for large-scale data processing.
License	Apache 2.0
Project age	12 years 8 months
Backers	Apache (Governed by), Databricks (Commercial Product By)
Latest News (2021-10-13)	3.2.0 We are happy to announce the availability of Spark 3.2.0! Visit the release notes to read about the new features, or download the …
Size score (1 to 10, higher is better)	9.25
Trend score (1 to 10, higher is better)	4.5

Education Resources

URL	Resource Type	Description
https://spark.apache.org/docs/3.1.1/	Documentation	Official project documentation.

Git Commit Statistics

Statistics computed using Git data through November 30, 2022.

Statistic	Lifetime	Last 12 Months
Commits	108,271	24,253
Lines committed	47,197,033	5,113,840
Unique committers	2,735	307
Core committers	14	21

Figure: Apache Spark monthly commits.

Similar Projects

Project	Size Score	Trend Score	Byline
Dask	6.75	4.75	Parallel computing with task scheduling.
HPCC	6.0	6.25	HPCC Systems (High Performance Computing Cluster) is an open source, massive parallel-processing computing platform for big data processing and analytics.
Hadoop	9.0	8.25	The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
Mars	6.75	4.5	Mars is a tensor-based unified framework for large-scale data computation which scales Numpy, Pandas and Scikit-learn.
Ray	9.0	8.75	An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library.