Commit Graph

12 Commits

Author SHA1 Message Date
Kristóf Tóth a06a9c4bb8 Use __main__.py as the CLI tool entry point 2022-02-19 13:36:27 +01:00
Kristóf Tóth 8597e44d3b Add setup.py and make cli easily installable via entry_points 2022-02-15 22:38:02 +01:00
Kristóf Tóth bde2b0067e Implement independent directory tree hashing without tar
This patch implements the os.walk() based directory hashing mentioned
in 3fb80a43. Rough benchmarking shows that while it's not terrible
performance wise, the GNU tar based solution almost always outperforms
it. Therefore it is used as a fallback implementation in case GNU tar is
not found when trying to generate an identicon from a directory tree.
2022-02-14 22:32:00 +01:00
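A minimal sketch of what an os.walk() based tree hash could look like. This is not the code from the patch: the function name hash_directory_tree, the record tags and the use of hashlib.blake2b (standing in for blake3, which is a third-party package) are all assumptions made for illustration. It sorts the walk order to make it deterministic and length-prefixes every name so that different directory structures cannot collapse into the same bytestream.

```python
import hashlib
import os

CHUNK_SIZE = 64 * 1024  # 64 KiB read buffer, as used elsewhere in this log


def _hash_file(path):
    """Digest of one file's content, read in fixed-size chunks."""
    hasher = hashlib.blake2b()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            hasher.update(chunk)
    return hasher.digest()


def _add_record(hasher, tag, name, payload=b""):
    """Feed one unambiguous record: tag, 8-byte name length, name, payload."""
    hasher.update(tag)
    hasher.update(len(name).to_bytes(8, "big"))
    hasher.update(name)
    hasher.update(payload)


def hash_directory_tree(root):
    """Hypothetical helper: hex digest of the tree rooted at `root`.

    Assumption: only regular files and directory names are hashed;
    symlinks, permissions and timestamps are ignored in this sketch.
    """
    hasher = hashlib.blake2b()
    for dirpath, dirnames, filenames in os.walk(root):
        # Sorting in place fixes the order in which os.walk descends.
        dirnames.sort()
        rel_dir = os.fsencode(os.path.relpath(dirpath, root))
        _add_record(hasher, b"D", rel_dir)
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            # The per-file digest has a fixed length, so no extra framing
            # is needed for the file content itself.
            _add_record(hasher, b"F", os.fsencode(name), _hash_file(path))
    return hasher.hexdigest()
```

Hashing each file separately and feeding only its fixed-length digest into the outer hash keeps the framing simple, at the cost of one extra hash object per file.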
Kristóf Tóth 3179be1f76 Implement --fingerprint option for manual fingerprint checking 2022-02-14 01:32:12 +01:00
Kristóf Tóth 7de587148e Enforce consistent string quoting style 2022-02-14 01:29:30 +01:00
Kristóf Tóth 36c29627af Fix broken -f option on single files instead of directories 2022-02-14 00:08:10 +01:00
Kristóf Tóth f90ac943cf Use blake3 instead of blake2 for a considerable performance increase
Based on some rough benchmarking performed on reasonably modern,
but not over the top laptop hardware (i7-8665U + PCIe3 NVMe SSD)
this results in raw disk io (+ tar overhead) becoming the performance
bottleneck instead of hashing rate.
2022-02-14 00:07:30 +01:00
Kristóf Tóth 3fb80a4394 Add support for identifying whole directory trees using tar
This patch uses a clever invocation of GNU tar to produce a
deterministic bytestream from a directory tree. This stream is fed to a
hash function in 64KiB chunks (== Linux default pipe capacity) to
produce a fingerprint which can be displayed as an identicon.
Why would we do this instead of using the tarfile stdlib package or just
using os.walk plus some code?

The tarfile package is not capable of producing the output as a stream
of N byte chunks. The most granular mode of operation it offers is
producing all of the chunks belonging to a given file at once.
This is problematic, because we could run out of memory or be forced to
write the tar archive to a temporary file, which would be painfully slow,
could run out of disk space, would wear out SSDs and would outright fail in a
container with a read-only rootfs and no tmpfs mounted.

An os.walk solution is doable, but will require some problem solving
which I am too lazy to do right now:
  - Forcing os.walk to walk in a deterministic order (should be easy)
  - Walks on different directory structures could theoretically produce
    the same bytestream (doable but requires some thinking)
The GNU tar solution is far from ideal (it forces an external dependency
and requires a subprocess call and some pipe juggling) but is very easy
to implement and should be fine performance wise:
  - The bottleneck on reasonable hardware configurations should
    be hashing or disk IO
  - The cost of doing a fork/exec is negligible compared to either

TL;DR os.walk: maybe in a future patch
2022-02-13 22:24:04 +01:00
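The approach described in this commit message could be sketched roughly as follows. The exact tar invocation used by the patch is not shown in the log, so the flags below (--sort=name, --mtime=@0, --owner=0, --group=0, --numeric-owner) are an assumption based on the usual GNU tar recipe for reproducible archives, and the function names are hypothetical. hashlib.blake2b is used so the sketch only needs the standard library.

```python
import hashlib
import subprocess

CHUNK_SIZE = 64 * 1024  # 64 KiB, the default Linux pipe capacity


def hash_stream(stream, hasher=None):
    """Feed a binary stream to a hash function in CHUNK_SIZE pieces."""
    if hasher is None:
        hasher = hashlib.blake2b()
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        hasher.update(chunk)
    return hasher.hexdigest()


def hash_tree_with_tar(path):
    """Hypothetical helper: fingerprint a directory tree via GNU tar.

    Assumption: these flags are one common way to make GNU tar output
    deterministic; they are not taken from the patch itself.
    """
    cmd = [
        "tar",
        "--create",
        "--file=-",          # write the archive to stdout
        "--sort=name",       # fixed member order (GNU tar 1.28+)
        "--mtime=@0",        # normalise timestamps
        "--owner=0",
        "--group=0",
        "--numeric-owner",
        "-C", path,          # archive paths relative to the tree root
        ".",
    ]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
        digest = hash_stream(proc.stdout)
    if proc.returncode != 0:
        raise RuntimeError(f"tar exited with status {proc.returncode}")
    return digest
```

Because the archive is consumed straight from the pipe, memory use stays bounded by the chunk size and nothing is written to disk.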
Kristóf Tóth ce52ea6e58 Increase buffer size to 64k (Linux default pipe capacity) 2022-02-12 14:02:35 +01:00
Kristóf Tóth 67772abe6d Fix STDIN pipe detection logic 2022-02-10 01:46:35 +01:00
Kristóf Tóth 5e4ac9681b Add trivial CLI tool for calculating identicons 2022-02-10 00:43:52 +01:00
Kristóf Tóth 0dd22bd7bc Implement OpenSSH style identicon calculation 2022-02-10 00:43:17 +01:00