Groups > linux.debian.maint.python > #15644 > unrolled thread

Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)

Started by	Julian Gilbey <jdg@debian.org>
First post	2024-03-25 19:20 +0100
Last post	2024-04-06 09:52 +0200
Articles	6 — 5 participants

Back to article view | Back to linux.debian.maint.python

  Seeking a small group to package Apache Arrow (was: Bug#970021: RFP:  apache-arrow -- cross-language development platform for in-memory analytics) Julian Gilbey <jdg@debian.org> - 2024-03-25 19:20 +0100
    Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics) Rene Engelhard <rene@debian.org> - 2024-03-29 21:00 +0100
    Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics) Julian Gilbey <julian@d-and-j.net> - 2024-03-30 21:30 +0100
      Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics) Julian Gilbey <julian@d-and-j.net> - 2024-03-31 13:30 +0200
        Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics) Dirk Eddelbuettel <edd@debian.org> - 2024-03-31 14:00 +0200
    Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics) Richard Duivenvoorde <richard@duiv.nl> - 2024-04-06 09:52 +0200

#15644 — Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)

From	Julian Gilbey <jdg@debian.org>
Date	2024-03-25 19:20 +0100
Subject	Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)
Message-ID	<IlX5n-1Dj0-1@gated-at.bofh.it>

Hi all,

[NB: sent to d-science, d-python, d-devel and the RFP bug; reply-to
set to d-science and the RFP bug only]

An update on Apache Arrow, and in particular the Python library
PyArrow.  For those who don't know:

  Apache Arrow is a development platform for in-memory analytics. It
  contains a set of technologies that enable big data systems to
  process and move data fast. It specifies a standardized
  language-independent columnar memory format for flat and
  hierarchical data, organized for efficient analytic operations on
  modern hardware.

  The project is developing a multi-language collection of libraries
  for solving systems problems related to in-memory analytical data
  processing. This includes such topics as:

  * Zero-copy shared memory and RPC-based data movement

  * Reading and writing file formats (like CSV, Apache ORC, and Apache
    Parquet)

  * In-memory analytics and query processing

  (from: https://arrow.apache.org/docs/index.html)

Pandas has announced that Pandas 3.x will depend on PyArrow
in a critical way (it will back the "string" datatype), and it is due
to be released imminently.

So this is a plea for anyone looking for something really helpful to
do: it would be great to have a group of developers finally package
this!  There was some initial work done (see the RFP bug report for
details: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970021),
but that is fairly old now.  As Apache Arrow supports numerous
languages, it may well benefit from having a group of developers with
different areas of expertise to build it.  (Or perhaps it would make
more sense to split the upstream source into a collection of different
Debian source packages for the different supported languages.  I don't
know.)  Unfortunately I don't have the capacity to devote any time to
it myself.

Thanks in advance for anyone who can step forward for this!

Best wishes,

   Julian

[toc] | [next] | [standalone]

#15682 — Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)

From	Rene Engelhard <rene@debian.org>
Date	2024-03-29 21:00 +0100
Subject	Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)
Message-ID	<Inqyl-2AVn-11@gated-at.bofh.it>
In reply to	#15644

Hi,

Am 25.03.24 um 19:17 schrieb Julian Gilbey:
>    * Reading and writing file formats (like CSV, Apache ORC, and Apache
>      Parquet)

liborcus supports this (Apache Parquet) if built with Apache Arrow. And 
thus makes LibreOffice being able to handle it.

I didn't invest any time in Apache Arrow since I am already too low on 
time anyway and I deemed it too a "low popularity" thing anyway.

> So this is a plea for anyone looking for something really helpful to
> do: it would be great to have a group of developers finally package
> this!
Indeed.
> There was some initial work done (see the RFP bug report for
> details: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970021),
> but that is fairly old now.  As Apache Arrow supports numerous
> languages, it may well benefit from having a group of developers with
> different areas of expertise to build it.  (Or perhaps it would make
> more sense to split the upstream source into a collection of different
> Debian source packages for the different supported languages.  I don't
> know.)

Would definitely make transitions easier.

>   Unfortunately I don't have the capacity to devote any time to
> it myself.

Dito.


Regards,


Rene

[toc] | [prev] | [next] | [standalone]

#15689 — Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)

From	Julian Gilbey <julian@d-and-j.net>
Date	2024-03-30 21:30 +0100
Subject	Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)
Message-ID	<InNuV-2S5d-3@gated-at.bofh.it>
In reply to	#15644

Hi Diane,

On Fri, Mar 29, 2024 at 11:49:07AM -0700, Diane Trout wrote:
> On Mon, 2024-03-25 at 18:17 +0000, Julian Gilbey wrote:
> > 
> > 
> > So this is a plea for anyone looking for something really helpful to
> > do: it would be great to have a group of developers finally package
> > this!  There was some initial work done (see the RFP bug report for
> > details: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970021),
> > but that is fairly old now.  As Apache Arrow supports numerous
> > languages, it may well benefit from having a group of developers with
> > different areas of expertise to build it.  (Or perhaps it would make
> > more sense to split the upstream source into a collection of
> > different
> > Debian source packages for the different supported languages.  I
> > don't
> > know.)  Unfortunately I don't have the capacity to devote any time to
> > it myself.
> > 
> > Thanks in advance for anyone who can step forward for this!
> 
> I've been maintain dask and anndata and saw that apache arrow was
> getting increasingly popular.
> 
> I took the current science-team preliminary packaging 7.0.0 packaging
> and managed to get it to build through a combination of patches and
> turning off features.
> 
> I even mostly managed to get pyarrow to build. (Though some tests fail
> due to pytest lazy-fixture being abandoned).
> 
> I pushed my current work in progress to.
> 
> https://salsa.debian.org/diane/arrow.git
> 
> Was anyone else planning on working on it or should I push my updates
> to the science-team package?

Lovely to hear from you, and oh wow, that's amazing, thank you!

I can't speak for anyone else, but I suggest that pushing your updates
to the science-team package would be very sensible; it would be silly
for someone else to have to redo your work.

What more is needed for it to be ready for unstable?

Best wishes,

   Julian

[toc] | [prev] | [next] | [standalone]

#15690 — Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)

From	Julian Gilbey <julian@d-and-j.net>
Date	2024-03-31 13:30 +0200
Subject	Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)
Message-ID	<Io1xT-30TQ-1@gated-at.bofh.it>
In reply to	#15689

Hi Diane,

On Sat, Mar 30, 2024 at 08:59:39PM -0700, Diane Trout wrote:
> Hi Julian,
> 
> On Sat, 2024-03-30 at 20:22 +0000, Julian Gilbey wrote:
> > Lovely to hear from you, and oh wow, that's amazing, thank you!
> > 
> > I can't speak for anyone else, but I suggest that pushing your
> > updates
> > to the science-team package would be very sensible; it would be silly
> > for someone else to have to redo your work.
> > 
> > What more is needed for it to be ready for unstable?
> 
> 
> The things I think are kind of broken are:
> 
> We've got 7.0.0 and upstreams current version is 15.0.2.

Yes, that does seem a little less than ideal!

> the pyarrow 7.0.0 tests fail because it depends on a python test
> library that breaks with pytest 8.0. Either I need to disable the
> python tests or upgrade to a newer version.

It may well be that newer versions would work with pytest 8.x.  I
don't think it's worth spending time trying to patch such a relatively
old version.

> My upgrade didn't go smoothly because uscan found also upstreams debian
> watch file which is too loose and matches some other tar balls on their
> distribution site.
> 
> (Though I don't know why uscan keeps looking for watch files after
> finding one in debian/watch)

Oh dear.  uscan(1) does say:

       Unless --watchfile is given, uscan looks recursively for valid source
       trees starting from the current directory (see the below section
       "Directory name checking" for details).

and then:

       For each valid source tree found, typically the following happens:
       [...]

so yes, it will look at more than one location.

> And you were probably right in that arrow needs to be a team, because I
> have no idea how to get other the other languages interfaces packaged.

I suggest that without anyone else volunteering to do those other
language interfaces (perhaps it's not a pressing need for people
working with language X), I wonder whether it's worth just packaging
the Python (and presumably C++) interfaces for now, and then if others
want to join the effort to support language X later on, a new version
of the Debian package can be uploaded with a new binary package for
language X.  It does mean more trips through the NEW queue if and when
that happens, but given that no-one's shown interest in language X for
the last several years, this is unlikely to be much of an issue.

Version 7.0 provided support (it seems) for: GLib (seems that a draft
framework for building this is already in the Debian package, and it
can then be used in lots of languages), C++ (this is the core
libraries), C# (not of interest to us), Go, Java, JavaScript, Julia,
Matlab (not of interest to us), Python, R, Ruby.

> Oh and I probably need to get the pyarrow installed somewhere, since it
> was stopping at the tests I hadn't run into dh_missing errors yet.

Oh.  Would pybuild do that automatically (perhaps specifying
PYBUILD_PACKAGE)?

Best wishes,

   Julian

[toc] | [prev] | [next] | [standalone]

#15692 — Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)

From	Dirk Eddelbuettel <edd@debian.org>
Date	2024-03-31 14:00 +0200
Subject	Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)
Message-ID	<Io1Rf-310B-9@gated-at.bofh.it>
In reply to	#15690

Julian,

Arrow is a complicated and large package. We use it at work (where there is a
fair amount of Python, also to Conda etc) and do have issues with more
complex builds especially because it is 'data infrastructure' and can come in
from different parts. I would recommend against packaging at old one -- we
also have seen issues with different (py)arrow version biting.

Have you seen https://github.com/apache/arrow-nanoarrow ?

It works via the C API to Arrow which interchanges data via two void* to the
the two structs for arrow array and schema -- and avoids linkage issue. (In
user space the pyarrow or R arrow packages can still be used also interfacing
via these.)  I have been using it for R package bindings for some time and we
plan to expand that (again, at work) -- as do others. It is already use by
duckdb, by the Arrow 'ADBC' interfaces (which are generic in the ODBC/JDBC
sense but for Arrow, and also by a python interface to snowflake.

Dirk

-- 
dirk.eddelbuettel.com | @eddelbuettel | edd@debian.org

[toc] | [prev] | [next] | [standalone]

#15736 — Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)

From	Richard Duivenvoorde <richard@duiv.nl>
Date	2024-04-06 09:52 +0200
Subject	Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)
Message-ID	<Iq8ZS-4oGU-3597@gated-at.bofh.it>
In reply to	#15644

On 3/25/24 7:17 PM, Julian Gilbey wrote:
> So this is a plea for anyone looking for something really helpful to
> do: it would be great to have a group of developers finally package
> this!  There was some initial work done (see the RFP bug report for
> details: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970021),
> but that is fairly old now.  As Apache Arrow supports numerous
> languages, it may well benefit from having a group of developers with
> different areas of expertise to build it.  (Or perhaps it would make
> more sense to split the upstream source into a collection of different
> Debian source packages for the different supported languages.  I don't
> know.)  Unfortunately I don't have the capacity to devote any time to
> it myself.
> 
> Thanks in advance for anyone who can step forward for this!

As someone from the Debian-GIS community, I would also be very interested in this!

The Apache Arrow C++ library is one of the dependencies to make GDAL/OGR able to read/write (geo)parquet files, a data format with a lot traction in the geo community [0]. Thereby making it possible for QGIS to handle those (on Debian).

[0] https://cloudnativegeo.org/blog/2023/09/duckdb-the-indispensable-geospatial-tool-you-didnt-know-you-were-missing/

Regards,

Richard Duivenvoorde

[toc] | [prev] | [standalone]

csiph-web

Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)

Contents

#15644 — Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)

#15682 — Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)

#15689 — Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)

#15690 — Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)

#15692 — Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)

#15736 — Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)