identicon/identicon/__main__.py


Add support for identifying whole directory trees using tar

This patch uses a clever invocation of GNU tar to produce a deterministic
bytestream from a directory tree. The stream is fed to a hash function in
64 KiB chunks (the Linux default pipe capacity) to produce a fingerprint
which can be displayed as an identicon.

Why do this instead of using the tarfile stdlib package, or os.walk plus
some code?

The tarfile package cannot produce its output as a stream of N-byte chunks.
The most granular mode of operation it supports is producing all of the
chunks belonging to a given file at once. This is problematic: we could run
out of memory, or be forced to write the tar archive to a temporary file -
which would be painfully slow, could run out of disk space, wear out SSDs,
or outright refuse to run in a container with a read-only rootfs and no
tmpfs mounted.

An os.walk solution is doable, but requires some problem solving which I am
too lazy to do right now:

- Forcing os.walk to walk in a deterministic order (should be easy)
- Walks on different directory structures could theoretically produce the
  same bytestream (doable, but requires some thinking)

The GNU tar solution is far from ideal (it forces an external dependency
and requires a subprocess call and some pipe juggling) but is very easy to
implement and should be fine performance-wise:

- The bottleneck on reasonable hardware configurations should be hashing or
  disk IO
- The cost of a fork/exec is negligible compared to either

TL;DR os.walk: maybe in a future patch
2022-02-13 19:14:02 +00:00
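
For reference, a minimal sketch of the tar invocation described above. This
is an illustration only: the real implementation is get_deterministic_stream
in the package __init__, which is not shown in this file, and the exact flag
set below is an assumption, not a copy of it (--sort=name needs GNU tar 1.28
or newer).

    # Hypothetical sketch of a deterministic tar bytestream (assumed flags).
    import subprocess

    def tar_stream_sketch(path):
        # Pin the usual sources of nondeterminism: member order (--sort),
        # timestamps (--mtime) and ownership (--owner/--group).
        proc = subprocess.Popen(
            ['tar', '--sort=name', '--mtime=@0',
             '--owner=0', '--group=0', '--numeric-owner',
             '-cf', '-', path],
            stdout=subprocess.PIPE,
        )
        # Caller reads this in BUF_SIZE chunks, feeds them to the hasher,
        # then waits on the process.
        return proc.stdout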
#!/usr/bin/env python3
from sys import stdin
from sys import exit as sysexit
from io import BytesIO

import click
from blake3 import blake3

from . import Identicon, get_deterministic_stream, ClosableStream

DIGEST_SIZE = 20
BUF_SIZE = 65536 # Linux default pipe capacity is 64KiB (64 * 2^10)


@click.command(
    help=(
        'Generate OpenSSH style randomart identicon for arbitrary data.\n\n'
        'If TEXT or --file is not supplied, data is read from STDIN.'
    )
)
@click.argument('text', default=None, type=str, required=False)
@click.option(
    '--file', '-f', default=None, type=click.Path(exists=True),
    help='Calculate from file or directory (recursive).'
)
@click.option(
    '--fingerprint', '-p', default=False, required=False, is_flag=True,
    help='Print fingerprint instead of identicon.'
)
def main(**kwargs):
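    # click passes the parsed parameters ('text', 'file', 'fingerprint')
    # as keyword arguments.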
    if not (stream := get_input_stream(kwargs)):
        print_usage_and_exit()
    digest = get_digest(stream.stream)
    stream.close()
    if not kwargs.get('fingerprint'):
        i = Identicon(digest)
        i.calculate()
        print(i)
    else:
        print(digest.hex())


def get_input_stream(kwargs):
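    # Input source precedence: positional TEXT, then --file, then piped
    # stdin. Returns None when no input is available.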
    stream = None
    if (text := kwargs['text']) is not None:
        stream = ClosableStream(BytesIO(text.encode()))
    elif file := kwargs['file']:
        stream = get_deterministic_stream(file, BUF_SIZE)
    elif not stdin.isatty():
        stream = ClosableStream(stdin.buffer)
    return stream


def print_usage_and_exit():
    command = main
    with click.Context(command) as ctx:
        click.echo(command.get_help(ctx))
    sysexit(1)


def get_digest(stream):
    # pylint: disable=not-callable
    hasher = blake3()
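    # Hash in pipe-sized chunks so even large inputs (e.g. a whole directory
    # tree streamed by tar) never have to fit in memory at once.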
    while data := stream.read(BUF_SIZE):
        hasher.update(data)
    return hasher.digest(length=DIGEST_SIZE)


if __name__ == '__main__':
    main()
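
Usage examples (assuming the package is run as a module with python -m;
an installed console-script name, if any, is not shown in this file):

    python -m identicon 'hello world'          # identicon for literal text
    python -m identicon -f ./some/directory    # identicon for a directory tree
    echo -n 'hello world' | python -m identicon -p   # hex fingerprint only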