
Comparing changes

base repository: pandas-dev/pandas, base: v2.2.1
head repository: pandas-dev/pandas, compare: v2.2.2

Commits on Mar 1, 2024

  1. Backport PR #57689 on branch 2.2.x (CI: fix ci (calamine typing)) (#57692)

    Backport PR #57689: CI: fix ci (calamine typing)
    MarcoGorelli authored Mar 1, 2024
    SHA: 9a07184

Commits on Mar 3, 2024

  1. Backport PR #57668 on branch 2.2.x (CLN: More numpy 2 stuff) (#57707)

    Backport PR #57668: CLN: More numpy 2 stuff
    
    Co-authored-by: Thomas Li <47963215+lithomas1@users.noreply.github.com>
    meeseeksmachine and lithomas1 authored Mar 3, 2024
    SHA: 4ac5ee2

Commits on Mar 4, 2024

  1. SHA: 6db283c
  2. Backport PR #57721 on branch 2.2.x (update from 2022 to 2024 image) (#57729)

    Backport PR #57721: update from 2022 to 2024 image

    Co-authored-by: Thomas Baumann <thbaumann90@gmail.com>
    meeseeksmachine and lopof authored Mar 4, 2024
    SHA: 3cc5afa

Commits on Mar 6, 2024

  1. Backport PR #57172: MAINT: Adjust the codebase to the new np.array's copy keyword meaning (#57740)

    Co-authored-by: Mateusz Sokół <8431159+mtsokol@users.noreply.github.com>
    mroeschke and mtsokol authored Mar 6, 2024
    SHA: 301f914

Commits on Mar 7, 2024

  1. Backport PR #57759 on branch 2.2.x (DOC: add whatsnew for v2.2.2) (#57763)

    * Backport PR #57759: DOC: add whatsnew for v2.2.2
    * [skip-ci]

    Co-authored-by: Marco Edward Gorelli <marcogorelli@protonmail.com>
    Co-authored-by: Marco Gorelli <33491632+MarcoGorelli@users.noreply.github.com>
    3 people authored Mar 7, 2024
    SHA: 63b9eba

Commits on Mar 8, 2024

  1. Backport PR #57665 on branch 2.2.x (BUG: interchange protocol with nullable datatypes a non-null validity) (#57769)

    BUG: interchange protocol with nullable datatypes a non-null validity (#57665)

    * BUG: interchange protocol with nullable datatypes a non-null validity provides nonsense results
    * whatsnew
    * 🏷️ typing
    * parametrise over more types
    * move whatsnew

    (cherry picked from commit 03717bc)
    MarcoGorelli authored Mar 8, 2024
    SHA: e44f91d
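    For context, a minimal sketch of the interchange API this fix concerns. The nullable Int64 column is an illustrative assumption; its missing-value mask is the "validity" buffer named in the title.

    ```python
    import pandas as pd
    from pandas.api.interchange import from_dataframe

    # A nullable-dtype column; its NA mask is the validity buffer the fix concerns.
    df = pd.DataFrame({"a": pd.array([1, None, 3], dtype="Int64")})

    # Round-trip through the DataFrame interchange protocol.
    result = from_dataframe(df.__dataframe__())
    print(result)
    ```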
  2. Backport PR #57780 on branch 2.2.x (COMPAT: Adapt to Numpy 2.0 dtype changes) (#57784)

    Backport PR #57780: COMPAT: Adapt to Numpy 2.0 dtype changes

    Co-authored-by: Sebastian Berg <sebastianb@nvidia.com>
    meeseeksmachine and seberg authored Mar 8, 2024
    SHA: d600189

Commits on Mar 12, 2024

  1. Backport PR #57821 on branch 2.2.x (Fix doc build) (#57822)

    Backport PR #57821: Fix doc build
    
    Co-authored-by: Trinh Quoc Anh <trinhquocanh94@gmail.com>
    meeseeksmachine and tqa236 authored Mar 12, 2024
    SHA: 33006cd

Commits on Mar 13, 2024

  1. Backport PR #57830 on branch 2.2.x (DOC: Pin dask/dask-expr for scale.rst) (#57832)

    Backport PR #57830: DOC: Pin dask/dask-expr for scale.rst

    Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
    meeseeksmachine and mroeschke authored Mar 13, 2024
    SHA: 9ed5382

Commits on Mar 14, 2024

  1. Backport PR #57796 on branch 2.2.x (Fix issue with Tempita recompilation) (#57834)

    Backport PR #57796: Fix issue with Tempita recompilation

    Co-authored-by: William Ayd <will_ayd@innobi.io>
    meeseeksmachine and WillAyd authored Mar 14, 2024
    SHA: 4fdbe56

Commits on Mar 15, 2024

  1. Backport PR #57848 on branch 2.2.x (DOC: Remove duplicated Series.dt.normalize from docs) (#57854)

    Backport PR #57848: DOC: Remove duplicated Series.dt.normalize from docs

    Co-authored-by: Marc Garcia <garcia.marc@gmail.com>
    meeseeksmachine and datapythonista authored Mar 15, 2024
    SHA: b6488af
  2. Backport PR #57843: DOC: Remove Dask and Modin sections in scale.rst in favor of linking to ecosystem docs. (#57861)

    Co-authored-by: Yuki Kitayama <47092819+yukikitayama@users.noreply.github.com>
    mroeschke and yukikitayama authored Mar 15, 2024
    SHA: 962e233

Commits on Mar 18, 2024

  1. Backport PR #57883 on branch 2.2.x (Bump pypa/cibuildwheel from 2.16.5 to 2.17.0) (#57888)

    Backport PR #57883: Bump pypa/cibuildwheel from 2.16.5 to 2.17.0

    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    meeseeksmachine and dependabot[bot] authored Mar 18, 2024
    SHA: cd6eeae
  2. Backport PR #57892 on branch 2.2.x (CI: xfail Pyarrow slicing test) (#57898)

    Backport PR #57892: CI: xfail Pyarrow slicing test

    Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
    meeseeksmachine and mroeschke authored Mar 18, 2024
    SHA: 71a6797
  3. Backport PR #57889 on branch 2.2.x (BUG: Handle Series construction with Dask, dict-like, Series) (#57899)

    Backport PR #57889: BUG: Handle Series construction with Dask, dict-like, Series

    Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
    meeseeksmachine and mroeschke authored Mar 18, 2024
    SHA: cc56321
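    A minimal sketch of the constructor path this fix concerns; a plain mapping stands in here for the dict-like (e.g. Dask) objects named in the title.

    ```python
    import pandas as pd

    # Constructing a Series from a dict-like: keys become the index, and an
    # explicit index selects values by key.
    data = {"a": 1.0, "b": 2.0, "c": 3.0}
    s = pd.Series(data, index=["c", "a"])
    print(s)  # c -> 3.0, a -> 1.0
    ```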

Commits on Mar 19, 2024

  1. Backport PR #57905 on branch 2.2.x (Revert "Fix issue with Tempita recompilation (#57796)") (#57907)

    Backport PR #57905: Revert "Fix issue with Tempita recompilation (#57796)"

    Co-authored-by: William Ayd <will_ayd@innobi.io>
    meeseeksmachine and WillAyd authored Mar 19, 2024
    SHA: 83497f5
  2. Backport PR #57886 on branch 2.2.x (CI: Remove ASAN job) (#57910)

    Backport PR #57886: CI: Remove ASAN job
    
    Co-authored-by: William Ayd <will_ayd@innobi.io>
    meeseeksmachine and WillAyd authored Mar 19, 2024
    SHA: 2a6d800

Commits on Mar 21, 2024

  1. Backport PR #57029 on branch 2.2.x (DOC: Add DataFrame.to_numpy method) (#57940)

    Backport PR #57029: DOC: Add `DataFrame.to_numpy` method

    Co-authored-by: Zhengbo Wang <77875500+luke396@users.noreply.github.com>
    meeseeksmachine and luke396 authored Mar 21, 2024
    SHA: 78f7a02
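    The newly documented method in a small sketch; the values and dtype argument are illustrative.

    ```python
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"x": [1, 2], "y": [3.0, 4.0]})

    # Mixed integer/float columns upcast to a common dtype (float64 here).
    arr = df.to_numpy()
    print(arr.dtype)  # float64

    # An explicit result dtype can also be requested.
    print(df.to_numpy(dtype=np.int64))
    ```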
  2. Backport PR #57764 on branch 2.2.x (BUG: PyArrow dtypes were not supported in the interchange protocol) (#57947)

    MarcoGorelli authored Mar 21, 2024
    SHA: 7e8d492

Commits on Mar 27, 2024

  1. Backport PR #57548 on branch 2.2.x (Fix accidental loss-of-precision for to_datetime(str, unit=...)) (#58034)

    Backport PR #57548: Fix accidental loss-of-precision for to_datetime(str, unit=...)

    Co-authored-by: Elliott Sales de Andrade <quantum.analyst@gmail.com>
    meeseeksmachine and QuLogic authored Mar 27, 2024
    SHA: 40e621f
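    A sketch of the call shape the fix concerns: a string holding an integer epoch value passed with an explicit unit, where an intermediate float conversion previously could drop nanosecond precision. The epoch value is illustrative.

    ```python
    import pandas as pd

    # 2024-01-01 00:00:00 UTC plus one nanosecond, passed as a string.
    ts = pd.to_datetime("1704067200000000001", unit="ns")
    print(ts)  # expected: 2024-01-01 00:00:00.000000001
    ```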
  2. Backport PR #57758 on branch 2.2.x (BUG: DataFrame Interchange Protocol errors on Boolean columns) (#58036)

    Backport PR #57758: BUG: DataFrame Interchange Protocol errors on Boolean columns

    Co-authored-by: Marco Edward Gorelli <marcogorelli@protonmail.com>
    meeseeksmachine and MarcoGorelli authored Mar 27, 2024
    SHA: e1a7302

Commits on Mar 28, 2024

  1. Backport PR #57974 on branch 2.2.x (BUG: Fixed ADBC to_sql creation of table when using public schema) (#58050)

    Backport PR #57974: BUG: Fixed ADBC to_sql creation of table when using public schema

    Co-authored-by: Shabab Karim <shababkarim93@gmail.com>
    meeseeksmachine and shabab477 authored Mar 28, 2024
    SHA: f455401
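    A hedged sketch of the affected call, assuming the ADBC PostgreSQL driver is installed; the connection URI and table name are placeholders.

    ```python
    import pandas as pd
    from adbc_driver_postgresql import dbapi  # assumed available

    df = pd.DataFrame({"a": [1, 2, 3]})

    # The fix concerns creating the table when an explicit schema such as
    # "public" is passed together with an ADBC connection.
    with dbapi.connect("postgresql://localhost:5432/postgres") as conn:
        df.to_sql("pandas_table", conn, schema="public", if_exists="replace", index=False)
    ```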

Commits on Apr 1, 2024

  1. Backport PR #57553 on branch 2.2.x (API: avoid passing Manager to subclass init) (#58008)

    * Backport PR #57553: API: avoid passing Manager to subclass __init__
    * whatsnew, type ignores
    * merge 2.2.2 file from main
    * rebase on 2.2.x whatsnew

    jbrockmendel authored Apr 1, 2024
    SHA: 810b2d0
  2. Backport PR #58075 on branch 2.2.x (DOC: whatsnew note for #57553) (#58080)

    Backport PR #58075: DOC: whatsnew note for #57553

    Co-authored-by: jbrockmendel <jbrockmendel@gmail.com>
    meeseeksmachine and jbrockmendel authored Apr 1, 2024
    SHA: 822d285

Commits on Apr 3, 2024

  1. SHA: e9b81ee
  2. Revert "BLD: Pin numpy on 2.2.x" (#58093)

    Revert "BLD: Pin numpy on 2.2.x (#56812)"
    
    This reverts commit 24ea67f.
    lithomas1 authored Apr 3, 2024
    SHA: 0f83d50
  3. Backport PR #58100 on branch 2.2.x (MNT: fix compatibility with beautifulsoup4 4.13.0b2) (#58137)

    Backport PR #58100: MNT: fix compatibility with beautifulsoup4 4.13.0b2

    Co-authored-by: Clément Robert <cr52@protonmail.com>
    meeseeksmachine and neutrinoceros authored Apr 3, 2024
    SHA: b56842d
  4. Backport PR #58138 on branch 2.2.x (BLD: Fix nightlies not building) (#58140)

    Backport PR #58138: BLD: Fix nightlies not building

    Co-authored-by: Thomas Li <47963215+lithomas1@users.noreply.github.com>
    meeseeksmachine and lithomas1 authored Apr 3, 2024
    SHA: a947587

Commits on Apr 8, 2024

  1. Backport PR #58181 on branch 2.2.x (CI: correct error msg in test_view_index) (#58187)

    Backport PR #58181: CI: correct error msg in `test_view_index`
    natmokval authored Apr 8, 2024
    SHA: 691fc88

Commits on Apr 9, 2024

  1. Backport PR #58087 on branch 2.2.x (BLD: Build wheels using numpy 2.0rc1) (#58105)

    Backport PR #58087: BLD: Build wheels using numpy 2.0rc1

    Co-authored-by: Thomas Li <47963215+lithomas1@users.noreply.github.com>
    meeseeksmachine and lithomas1 authored Apr 9, 2024
    SHA: c7ec566

Commits on Apr 10, 2024

  1. Backport PR #58203 on branch 2.2.x (DOC: Add release date/contributors for 2.2.2) (#58206)

    Backport PR #58203: DOC: Add release date/contributors for 2.2.2

    Co-authored-by: Thomas Li <47963215+lithomas1@users.noreply.github.com>
    meeseeksmachine and lithomas1 authored Apr 10, 2024
    SHA: 45b0b32
  2. Backport PR #58202: DOC/TST: Document numpy 2.0 support and add tests for string array (#58208)

    Backport PR #58202: DOC/TST: Document numpy 2.0 support and add tests for string array
    lithomas1 authored Apr 10, 2024
    SHA: 5466f15
  3. SHA: 98aeac9
  4. RLS: 2.2.2

    Pandas Development Team authored and lithomas1 committed Apr 10, 2024
    SHA: d9cdd2e
Showing with 771 additions and 387 deletions.
  1. +2 −6 .circleci/config.yml
  2. +1 −8 .github/actions/run-tests/action.yml
  3. +0 −14 .github/workflows/unit-tests.yml
  4. +2 −15 .github/workflows/wheels.yml
  5. +2 −0 ci/deps/actions-310.yaml
  6. +2 −0 ci/deps/actions-311-downstream_compat.yaml
  7. +0 −32 ci/deps/actions-311-sanitizers.yaml
  8. +2 −0 ci/deps/actions-311.yaml
  9. +2 −0 ci/deps/actions-312.yaml
  10. +2 −0 ci/deps/actions-39-minimum_versions.yaml
  11. +2 −0 ci/deps/actions-39.yaml
  12. +2 −0 ci/deps/circle-310-arm64.yaml
  13. +1 −0 doc/source/reference/frame.rst
  14. +0 −1 doc/source/reference/series.rst
  15. +7 −157 doc/source/user_guide/scale.rst
  16. +1 −0 doc/source/whatsnew/index.rst
  17. +1 −1 doc/source/whatsnew/v2.2.1.rst
  18. +59 −0 doc/source/whatsnew/v2.2.2.rst
  19. +2 −0 environment.yml
  20. +4 −0 pandas/_libs/src/datetime/pd_datetime.c
  21. +8 −4 pandas/_libs/src/vendored/numpy/datetime/np_datetime.c
  22. +1 −3 pandas/_libs/src/vendored/ujson/python/objToJSON.c
  23. +1 −1 pandas/_libs/tslib.pyx
  24. +2 −0 pandas/compat/__init__.py
  25. +2 −0 pandas/compat/pyarrow.py
  26. +3 −3 pandas/core/array_algos/quantile.py
  27. +3 −1 pandas/core/arrays/arrow/array.py
  28. +4 −1 pandas/core/arrays/base.py
  29. +3 −1 pandas/core/arrays/categorical.py
  30. +3 −1 pandas/core/arrays/datetimelike.py
  31. +3 −3 pandas/core/arrays/datetimes.py
  32. +3 −1 pandas/core/arrays/interval.py
  33. +3 −1 pandas/core/arrays/masked.py
  34. +10 −4 pandas/core/arrays/numeric.py
  35. +3 −1 pandas/core/arrays/numpy_.py
  36. +7 −2 pandas/core/arrays/period.py
  37. +3 −1 pandas/core/arrays/sparse/array.py
  38. +5 −2 pandas/core/arrays/timedeltas.py
  39. +11 −3 pandas/core/construction.py
  40. +5 −2 pandas/core/dtypes/cast.py
  41. +1 −1 pandas/core/dtypes/missing.py
  42. +30 −17 pandas/core/frame.py
  43. +4 −1 pandas/core/generic.py
  44. +1 −1 pandas/core/indexes/base.py
  45. +3 −3 pandas/core/indexes/multi.py
  46. +58 −0 pandas/core/interchange/buffer.py
  47. +72 −10 pandas/core/interchange/column.py
  48. +5 −0 pandas/core/interchange/dataframe.py
  49. +10 −7 pandas/core/interchange/from_dataframe.py
  50. +31 −0 pandas/core/interchange/utils.py
  51. +2 −0 pandas/core/internals/managers.py
  52. +2 −1 pandas/core/resample.py
  53. +27 −18 pandas/core/series.py
  54. +1 −3 pandas/io/excel/_calamine.py
  55. +1 −7 pandas/io/html.py
  56. +1 −1 pandas/io/pytables.py
  57. +3 −1 pandas/io/sql.py
  58. +1 −0 pandas/tests/arrays/integer/test_arithmetic.py
  59. +14 −9 pandas/tests/arrays/test_datetimelike.py
  60. +2 −2 pandas/tests/dtypes/test_inference.py
  61. +4 −1 pandas/tests/extension/array_with_attr/array.py
  62. +5 −3 pandas/tests/extension/json/array.py
  63. +4 −1 pandas/tests/extension/list/array.py
  64. +5 −3 pandas/tests/extension/test_common.py
  65. +1 −1 pandas/tests/frame/methods/test_select_dtypes.py
  66. +1 −1 pandas/tests/frame/test_arithmetic.py
  67. +19 −0 pandas/tests/frame/test_constructors.py
  68. +11 −0 pandas/tests/frame/test_subclass.py
  69. +11 −1 pandas/tests/indexes/object/test_indexing.py
  70. +4 −1 pandas/tests/indexes/test_base.py
  71. +1 −1 pandas/tests/indexes/test_index_new.py
  72. +189 −9 pandas/tests/interchange/test_impl.py
  73. +1 −1 pandas/tests/io/pytables/test_timezones.py
  74. +24 −0 pandas/tests/io/test_sql.py
  75. +1 −1 pandas/tests/scalar/timestamp/test_formats.py
  76. +19 −0 pandas/tests/series/test_constructors.py
  77. +8 −0 pandas/tests/tools/test_to_datetime.py
  78. +12 −10 pyproject.toml
  79. +1 −1 scripts/generate_pip_deps_from_conda.py
  80. +4 −1 scripts/validate_min_versions_in_sync.py
8 changes: 2 additions & 6 deletions .circleci/config.yml

@@ -3,7 +3,7 @@ version: 2.1
 jobs:
   test-arm:
     machine:
-      image: ubuntu-2004:2022.04.1
+      image: default
     resource_class: arm.large
     environment:
       ENV_FILE: ci/deps/circle-310-arm64.yaml
@@ -46,7 +46,7 @@ jobs:
       cibw-build:
         type: string
     machine:
-      image: ubuntu-2004:2022.04.1
+      image: default
     resource_class: arm.large
     environment:
       TRIGGER_SOURCE: << pipeline.trigger_source >>
@@ -72,10 +72,6 @@ jobs:
           no_output_timeout: 30m # Sometimes the tests won't generate any output, make sure the job doesn't get killed by that
           command: |
             pip3 install cibuildwheel==2.15.0
-            # When this is a nightly wheel build, allow picking up NumPy 2.0 dev wheels:
-            if [[ "$IS_SCHEDULE_DISPATCH" == "true" || "$IS_PUSH" != 'true' ]]; then
-              export CIBW_ENVIRONMENT="PIP_EXTRA_INDEX_URL=https://pypi.anaconda.org/scientific-python-nightly-wheels/simple"
-            fi
             cibuildwheel --prerelease-pythons --output-dir wheelhouse
           environment:
9 changes: 1 addition & 8 deletions .github/actions/run-tests/action.yml

@@ -1,16 +1,9 @@
 name: Run tests and report results
-inputs:
-  preload:
-    description: Preload arguments for sanitizer
-    required: false
-  asan_options:
-    description: Arguments for Address Sanitizer (ASAN)
-    required: false
 runs:
   using: composite
   steps:
     - name: Test
-      run: ${{ inputs.asan_options }} ${{ inputs.preload }} ci/run_tests.sh
+      run: ci/run_tests.sh
       shell: bash -el {0}

     - name: Publish test results
14 changes: 0 additions & 14 deletions .github/workflows/unit-tests.yml

@@ -96,14 +96,6 @@ jobs:
         - name: "Pyarrow Nightly"
           env_file: actions-311-pyarrownightly.yaml
           pattern: "not slow and not network and not single_cpu"
-        - name: "ASAN / UBSAN"
-          env_file: actions-311-sanitizers.yaml
-          pattern: "not slow and not network and not single_cpu and not skip_ubsan"
-          asan_options: "ASAN_OPTIONS=detect_leaks=0"
-          preload: LD_PRELOAD=$(gcc -print-file-name=libasan.so)
-          meson_args: --config-settings=setup-args="-Db_sanitize=address,undefined"
-          cflags_adds: -fno-sanitize-recover=all
-          pytest_workers: -1 # disable pytest-xdist as it swallows stderr from ASAN
       fail-fast: false
     name: ${{ matrix.name || format('ubuntu-latest {0}', matrix.env_file) }}
     env:
@@ -190,18 +182,12 @@ jobs:
     - name: Test (not single_cpu)
       uses: ./.github/actions/run-tests
       if: ${{ matrix.name != 'Pypy' }}
-      with:
-        preload: ${{ matrix.preload }}
-        asan_options: ${{ matrix.asan_options }}
       env:
         # Set pattern to not single_cpu if not already set
         PATTERN: ${{ env.PATTERN == '' && 'not single_cpu' || matrix.pattern }}

     - name: Test (single_cpu)
       uses: ./.github/actions/run-tests
-      with:
-        preload: ${{ matrix.preload }}
-        asan_options: ${{ matrix.asan_options }}
       env:
         PATTERN: 'single_cpu'
         PYTEST_WORKERS: 0
17 changes: 2 additions & 15 deletions .github/workflows/wheels.yml

@@ -139,27 +139,14 @@ jobs:
       shell: bash -el {0}
       run: echo "sdist_name=$(cd ./dist && ls -d */)" >> "$GITHUB_ENV"

-    - name: Build normal wheels
-      if: ${{ (env.IS_SCHEDULE_DISPATCH != 'true' || env.IS_PUSH == 'true') }}
-      uses: pypa/cibuildwheel@v2.16.5
+    - name: Build wheels
+      uses: pypa/cibuildwheel@v2.17.0
       with:
         package-dir: ./dist/${{ startsWith(matrix.buildplat[1], 'macosx') && env.sdist_name || needs.build_sdist.outputs.sdist_file }}
       env:
         CIBW_PRERELEASE_PYTHONS: True
         CIBW_BUILD: ${{ matrix.python[0] }}-${{ matrix.buildplat[1] }}

-    - name: Build nightly wheels (with NumPy pre-release)
-      if: ${{ (env.IS_SCHEDULE_DISPATCH == 'true' && env.IS_PUSH != 'true') }}
-      uses: pypa/cibuildwheel@v2.16.5
-      with:
-        package-dir: ./dist/${{ startsWith(matrix.buildplat[1], 'macosx') && env.sdist_name || needs.build_sdist.outputs.sdist_file }}
-      env:
-        # The nightly wheels should be build witht he NumPy 2.0 pre-releases
-        # which requires the additional URL.
-        CIBW_ENVIRONMENT: PIP_EXTRA_INDEX_URL=https://pypi.anaconda.org/scientific-python-nightly-wheels/simple
-        CIBW_PRERELEASE_PYTHONS: True
-        CIBW_BUILD: ${{ matrix.python[0] }}-${{ matrix.buildplat[1] }}
-
     - name: Set up Python
       uses: mamba-org/setup-micromamba@v1
       with:
2 changes: 2 additions & 0 deletions ci/deps/actions-310.yaml

@@ -24,6 +24,8 @@ dependencies:

   # optional dependencies
   - beautifulsoup4>=4.11.2
+  # https://github.com/conda-forge/pytables-feedstock/issues/97
+  - c-blosc2=2.13.2
   - blosc>=1.21.3
   - bottleneck>=1.3.6
   - fastparquet>=2022.12.0
2 changes: 2 additions & 0 deletions ci/deps/actions-311-downstream_compat.yaml

@@ -26,6 +26,8 @@ dependencies:

   # optional dependencies
   - beautifulsoup4>=4.11.2
+  # https://github.com/conda-forge/pytables-feedstock/issues/97
+  - c-blosc2=2.13.2
   - blosc>=1.21.3
   - bottleneck>=1.3.6
   - fastparquet>=2022.12.0
32 changes: 0 additions & 32 deletions ci/deps/actions-311-sanitizers.yaml

This file was deleted.

2 changes: 2 additions & 0 deletions ci/deps/actions-311.yaml

@@ -24,6 +24,8 @@ dependencies:

   # optional dependencies
   - beautifulsoup4>=4.11.2
+  # https://github.com/conda-forge/pytables-feedstock/issues/97
+  - c-blosc2=2.13.2
   - blosc>=1.21.3
   - bottleneck>=1.3.6
   - fastparquet>=2022.12.0
2 changes: 2 additions & 0 deletions ci/deps/actions-312.yaml

@@ -24,6 +24,8 @@ dependencies:

   # optional dependencies
   - beautifulsoup4>=4.11.2
+  # https://github.com/conda-forge/pytables-feedstock/issues/97
+  - c-blosc2=2.13.2
   - blosc>=1.21.3
   - bottleneck>=1.3.6
   - fastparquet>=2022.12.0
2 changes: 2 additions & 0 deletions ci/deps/actions-39-minimum_versions.yaml

@@ -27,6 +27,8 @@ dependencies:

   # optional dependencies
   - beautifulsoup4=4.11.2
+  # https://github.com/conda-forge/pytables-feedstock/issues/97
+  - c-blosc2=2.13.2
   - blosc=1.21.3
   - bottleneck=1.3.6
   - fastparquet=2022.12.0
2 changes: 2 additions & 0 deletions ci/deps/actions-39.yaml

@@ -24,6 +24,8 @@ dependencies:

   # optional dependencies
   - beautifulsoup4>=4.11.2
+  # https://github.com/conda-forge/pytables-feedstock/issues/97
+  - c-blosc2=2.13.2
   - blosc>=1.21.3
   - bottleneck>=1.3.6
   - fastparquet>=2022.12.0
2 changes: 2 additions & 0 deletions ci/deps/circle-310-arm64.yaml

@@ -25,6 +25,8 @@ dependencies:

   # optional dependencies
   - beautifulsoup4>=4.11.2
+  # https://github.com/conda-forge/pytables-feedstock/issues/97
+  - c-blosc2=2.13.2
   - blosc>=1.21.3
   - bottleneck>=1.3.6
   - fastparquet>=2022.12.0
1 change: 1 addition & 0 deletions doc/source/reference/frame.rst

@@ -49,6 +49,7 @@ Conversion
    DataFrame.infer_objects
    DataFrame.copy
    DataFrame.bool
+   DataFrame.to_numpy

Indexing, iteration
~~~~~~~~~~~~~~~~~~~
1 change: 0 additions & 1 deletion doc/source/reference/series.rst

@@ -342,7 +342,6 @@ Datetime properties
    Series.dt.tz
    Series.dt.freq
    Series.dt.unit
-   Series.dt.normalize

Datetime methods
^^^^^^^^^^^^^^^^
164 changes: 7 additions & 157 deletions doc/source/user_guide/scale.rst

@@ -156,7 +156,7 @@ fits in memory, you can work with datasets that are much larger than memory.

 Chunking works well when the operation you're performing requires zero or minimal
 coordination between chunks. For more complicated workflows, you're better off
-:ref:`using another library <scale.other_libraries>`.
+:ref:`using other libraries <scale.other_libraries>`.

 Suppose we have an even larger "logical dataset" on disk that's a directory of parquet
 files. Each file in the directory represents a different year of the entire dataset.
@@ -219,160 +219,10 @@ different library that implements these out-of-core algorithms for you.

 .. _scale.other_libraries:

-Use Dask
---------
+Use Other Libraries
+-------------------

 pandas is just one library offering a DataFrame API. Because of its popularity,
 pandas' API has become something of a standard that other libraries implement.
 The pandas documentation maintains a list of libraries implementing a DataFrame API
 in `the ecosystem page <https://pandas.pydata.org/community/ecosystem.html>`_.

-For example, `Dask`_, a parallel computing library, has `dask.dataframe`_, a
-pandas-like API for working with larger than memory datasets in parallel. Dask
-can use multiple threads or processes on a single machine, or a cluster of
-machines to process data in parallel.
-
-We'll import ``dask.dataframe`` and notice that the API feels similar to pandas.
-We can use Dask's ``read_parquet`` function, but provide a globstring of files to read in.
-
-.. ipython:: python
-   :okwarning:
-
-   import dask.dataframe as dd
-   ddf = dd.read_parquet("data/timeseries/ts*.parquet", engine="pyarrow")
-   ddf
-
-Inspecting the ``ddf`` object, we see a few things
-
-* There are familiar attributes like ``.columns`` and ``.dtypes``
-* There are familiar methods like ``.groupby``, ``.sum``, etc.
-* There are new attributes like ``.npartitions`` and ``.divisions``
-
-The partitions and divisions are how Dask parallelizes computation. A **Dask**
-DataFrame is made up of many pandas :class:`pandas.DataFrame`. A single method call on a
-Dask DataFrame ends up making many pandas method calls, and Dask knows how to
-coordinate everything to get the result.
-
-.. ipython:: python
-
-   ddf.columns
-   ddf.dtypes
-   ddf.npartitions
-
-One major difference: the ``dask.dataframe`` API is *lazy*. If you look at the
-repr above, you'll notice that the values aren't actually printed out; just the
-column names and dtypes. That's because Dask hasn't actually read the data yet.
-Rather than executing immediately, doing operations build up a **task graph**.
-
-.. ipython:: python
-   :okwarning:
-
-   ddf
-   ddf["name"]
-   ddf["name"].value_counts()
-
-Each of these calls is instant because the result isn't being computed yet.
-We're just building up a list of computation to do when someone needs the
-result. Dask knows that the return type of a :class:`pandas.Series.value_counts`
-is a pandas :class:`pandas.Series` with a certain dtype and a certain name. So the Dask version
-returns a Dask Series with the same dtype and the same name.
-
-To get the actual result you can call ``.compute()``.
-
-.. ipython:: python
-   :okwarning:
-
-   %time ddf["name"].value_counts().compute()
-
-At that point, you get back the same thing you'd get with pandas, in this case
-a concrete pandas :class:`pandas.Series` with the count of each ``name``.
-
-Calling ``.compute`` causes the full task graph to be executed. This includes
-reading the data, selecting the columns, and doing the ``value_counts``. The
-execution is done *in parallel* where possible, and Dask tries to keep the
-overall memory footprint small. You can work with datasets that are much larger
-than memory, as long as each partition (a regular pandas :class:`pandas.DataFrame`) fits in memory.
-
-By default, ``dask.dataframe`` operations use a threadpool to do operations in
-parallel. We can also connect to a cluster to distribute the work on many
-machines. In this case we'll connect to a local "cluster" made up of several
-processes on this single machine.
-
-.. code-block:: python
-
-   >>> from dask.distributed import Client, LocalCluster
-   >>> cluster = LocalCluster()
-   >>> client = Client(cluster)
-   >>> client
-   <Client: 'tcp://127.0.0.1:53349' processes=4 threads=8, memory=17.18 GB>
-
-Once this ``client`` is created, all of Dask's computation will take place on
-the cluster (which is just processes in this case).
-
-Dask implements the most used parts of the pandas API. For example, we can do
-a familiar groupby aggregation.
-
-.. ipython:: python
-   :okwarning:
-
-   %time ddf.groupby("name")[["x", "y"]].mean().compute().head()
-
-The grouping and aggregation is done out-of-core and in parallel.
-
-When Dask knows the ``divisions`` of a dataset, certain optimizations are
-possible. When reading parquet datasets written by dask, the divisions will be
-known automatically. In this case, since we created the parquet files manually,
-we need to supply the divisions manually.
-
-.. ipython:: python
-   :okwarning:
-
-   N = 12
-   starts = [f"20{i:>02d}-01-01" for i in range(N)]
-   ends = [f"20{i:>02d}-12-13" for i in range(N)]
-   divisions = tuple(pd.to_datetime(starts)) + (pd.Timestamp(ends[-1]),)
-   ddf.divisions = divisions
-   ddf
-
-Now we can do things like fast random access with ``.loc``.
-
-.. ipython:: python
-   :okwarning:
-
-   ddf.loc["2002-01-01 12:01":"2002-01-01 12:05"].compute()
-
-Dask knows to just look in the 3rd partition for selecting values in 2002. It
-doesn't need to look at any other data.
-
-Many workflows involve a large amount of data and processing it in a way that
-reduces the size to something that fits in memory. In this case, we'll resample
-to daily frequency and take the mean. Once we've taken the mean, we know the
-results will fit in memory, so we can safely call ``compute`` without running
-out of memory. At that point it's just a regular pandas object.
-
-.. ipython:: python
-   :okwarning:
-
-   @savefig dask_resample.png
-   ddf[["x", "y"]].resample("1D").mean().cumsum().compute().plot()
-
-.. ipython:: python
-   :suppress:
-
-   import shutil
-   shutil.rmtree("data/timeseries")
-
-These Dask examples have all be done using multiple processes on a single
-machine. Dask can be `deployed on a cluster
-<https://docs.dask.org/en/latest/setup.html>`_ to scale up to even larger
-datasets.
-
-You see more dask examples at https://examples.dask.org.
-
-.. _Dask: https://dask.org
-.. _dask.dataframe: https://docs.dask.org/en/latest/dataframe.html
+There are other libraries which provide similar APIs to pandas and work nicely with pandas DataFrame,
+and can give you the ability to scale your large dataset processing and analytics
+by parallel runtime, distributed memory, clustering, etc. You can find more information
+in `the ecosystem page <https://pandas.pydata.org/community/ecosystem.html#out-of-core>`_.
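
As an aside on the chunking guidance that scale.rst keeps, here is a minimal sketch of chunk-at-a-time processing with pandas alone; the file name, column, chunk size, and aggregation are illustrative assumptions.

```python
import pandas as pd

# Process a large CSV piece by piece: each chunk is an ordinary DataFrame,
# and per-chunk results are combined at the end.
counts = None
for chunk in pd.read_csv("data/timeseries.csv", chunksize=100_000):
    part = chunk["name"].value_counts()
    counts = part if counts is None else counts.add(part, fill_value=0)

if counts is not None:
    print(counts.sort_values(ascending=False).head())
```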