Amazon Migrates Large-Scale Data Operations to Ray

Amazon has successfully migrated its large-scale data operations from Apache Spark to Ray, reporting an 82% efficiency gain. The move is expected to save the company an estimated $100 million annually in compute costs.

In a recent talk at the All Things Open 2024 conference, Patrick Ames, a Principal Engineer at Amazon, discussed the company's migration to Ray and its benefits. Ames highlighted that Ray is a general-purpose distributed compute framework that can be applied across many areas of distributed systems.

The move was driven by Amazon's need for more efficient data compaction jobs, which are essential to its in-house business intelligence services. The company had previously used Apache Spark, but Spark struggled with duplicate detection and with scaling as the data sets grew.
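To make the workload concrete: compaction typically merges incremental delta files into a base table while keeping only the latest version of each row. The sketch below is a minimal, hypothetical illustration of that idea (the function and data shapes are assumptions, not Amazon's actual pipeline):

```python
# Minimal sketch of table compaction: merge delta files into a base table,
# keeping only the newest version of each row by primary key.
# Rows are (key, value, version) tuples; names and shapes are hypothetical.
def compact(base, deltas):
    latest = {}
    for row in base + [r for delta in deltas for r in delta]:
        key, _value, version = row
        # Keep the row with the highest version number per key.
        if key not in latest or version > latest[key][2]:
            latest[key] = row
    return sorted(latest.values())

base = [("a", 1, 0), ("b", 2, 0)]
deltas = [[("a", 10, 1)], [("c", 3, 1)]]
print(compact(base, deltas))  # → [('a', 10, 1), ('b', 2, 0), ('c', 3, 1)]
```

At Amazon's scale the challenge is doing this across petabytes of partitions in parallel, which is where the distributed framework matters.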

Ray offered a solution: it allowed Amazon to decouple storage from compute, storing database tables in S3 buckets. Ames described how Ray's Pythonic APIs and its ability to handle large data sets enabled efficient parallelization of compaction tasks.

Early results showed Ray compacting a gigabyte of data in under 0.1 seconds, an 82% improvement over Spark. The new framework also consumed significantly less memory, reducing the number of servers required.

Reliability was initially a concern, but Amazon's team improved it to 99.15%, approaching Spark's 99.91%. The migration is expected to reduce Amazon's computational needs by roughly 250,000 years of vCPU time annually.

Ames concluded that while Spark retains advantages, such as its general-purpose data processing features, Ray offers flexibility that can be tailored to specific problems. He emphasized the importance of investing in Ray for large-scale data operations.

Source: https://thenewstack.io/amazon-to-save-millions-moving-from-apache-spark-to-ray