PyArrow to Replace NumPy in Python Pandas 3.0

Python Pandas is set for a performance boost with the upcoming version 3.0, which will rely on PyArrow instead of NumPy for faster data loading and reading. PyArrow can be as much as 10 times faster than NumPy for some operations, and it stores data in a columnar format, which avoids converting data back and forth between layouts during processing.

Pandas was originally built on top of NumPy and added PyArrow support in version 2. What began as an optional performance enhancement will become a required dependency starting with Pandas 3.0, with pyarrow.string being the default type inferred for string data.

According to Python instructor Reuven Lerner, PyArrow is significantly faster than NumPy, especially when handling columnar data. Pandas itself relies on vectorized operations implemented in C, which makes it more efficient than pure Python code. However, Pandas still faces challenges with dates and compression techniques.

Lerner demonstrated the difference between NumPy and PyArrow using a 2.2GB CSV file, which took 55.8 seconds to read into memory with NumPy but only 11.8 seconds with PyArrow, nearly five times faster. The data also takes up less space when saved in the Feather or Parquet formats, reducing storage requirements.

Pandas 3.0’s release date remains uncertain, but the inclusion of PyArrow is a welcome change for organizations seeking to speed up their data-crunching operations without migrating to new platforms. This update allows users to keep their existing Pandas API while benefiting from improved performance and memory efficiency with PyArrow.

Source: https://thenewstack.io/python-pandas-ditches-numpy-for-speedier-pyarrow