Thursday, March 31, 2016

What is a path?

Wisecracker.

pathlib is a provisional stdlib module. However, as the current threads (here, here, here and here) on python-ideas show, it is not as easy to work with as originally intended. Once you have a Path object, it's quite easy to use what Path offers which is a lot.

One big problem, though, is the interaction of Path objects with existing stdlib functions. Most of the later are string-consuming functions whereas the former are no strings at all. As far as I can see, this is one reason why pathlib lacks broader adoption and many agree. This situation leads to the following possible resolutions:
  1. make Path objects compatible with strings (basically make them inherit from strings)
  2. make existing stdlib functions accept Path objects (basically make them accept both and convert if needed)
  3. do both but that seems superfluous
Solution 2 would also affect third-party libraries as noted here.

In order to decide appropriately, it becomes necessary to answer the following question:

What is a path in the first place?

Brett Cannon made a good point of why PEP 428 (that's the one which introduced pathlib as a stdlib module) deliberately chose Path not to inherit from string. I pondered over it for a while and saw that from his perspective a path is actually not a string but rather a complex data-structure consisting of parts with distinct meanings: literally the steps (of which a path consists) to a resource. The string which represents a path as most people know it is just that: a representation of a more complex object, just like a dict or a list. Let me make this a bit clearer. 

I think we agree on the following: writing down 21 characters in a row is a string, right? So, what about these 21 characters?

{1: 11, 2: 12, 3: 13}

If you see that in a Python program (and presumably in many other modern programming languages), you associate that with a dictionary, mapping, hash, etc. So, these 21 characters are a mere representation of a complex object with a very rich functionality.

The following paragraphs summarizes what makes the discussion about paths and strings to hard. Depending on whom you ask there are different interpretations of what a path actually is.

Paths as complex objects and strings as their representation

Let's put this analogy to work with paths. If you come from a language that treats strings as file paths (like Python), you can imagine and categorize the facilities of pathlib like so:
  • pure path - operating on the path string
  • concrete path - operating on files corresponding to the given path
The classic "extract the file extension" issue is done easily with the pure path methods. Writing to a file is also easily done with concrete path operations. So, it seems paths are pretty complex objects with some internal structure and a lot of functionality.

Paths as monolithic object for addressing resources

The previous interpretation is not the only one. Despite all the fine functionality of extracting file extensions, concatenating parts to a larger path, etc., building a path is not an end in itself. When you got a path, it addresses a resource on a machine. When doing so for reading or writing that resource, you actually don't care about whether the path consists of parts or not. To you, it's a monolithic structure, an address.

But, you might say, each part of a path represents a directory in hierarchical file systems. Sure that is true for many file systems but not for all. Moreover, how often do you really care about the underlying directory structure? It needs to be there to make things work, of course. When it's there, you mostly don't care. How often do you need to create a subtree in an directory in order to create a single config file? I encounter this once in a while and to be honest: it sucks.

touch /home/me/on/your/ssd.conf will fail if the directory "on/your/" has not been created by somebody before me.

Especially for me, as a Web developer, it's quite hard to understand what purpose this restriction serves. Within a Web application the hierarchy of URLs is an emergent property not a prerequisite.

Users of git are accustomed to not-committing directories in. Why? Because it's unnecessary and the directory structure is again emerging from the files names themselves (aka from the content).

This said, it's rather cumbersome to attribute semantics to the parts of a string that happens to be separated by "/" or "\". At least to me, a path made of one piece.

What about security then?

One can further argue that Web development and git repositories are different here. There is a clear boundary where a path can lead. A URL path cannot address a foreign resource on another domain. git file paths are contained within the repository root.

See the common theme? There is a container from where the path of a resource cannot escape.

If you have a complete file system available at your fingertips,  a lot harm can be done when malicious user input is concatenated unattendedly as a subpath; actually to address a resource within a container but misused to gain access to the complete file system.

I cannot say if the container pattern would work for everybody but it's definitely worth exploring as there are some prominent working examples out there.

Conclusion

I really like pathlib since it solves many frequently asked questions in the right way once and for all.

But I don't like using it as an argument again inheriting paths from strings saying paths have internal structure in contrast to strings. At least to me, they do not. That, on the other hand, does not necessarily mean inheriting path from string is a good idea but it makes it no worse one than it was before.

Best,
Sven

No comments:

Post a Comment