I have been working on a program to consolidate many, many TB of backup files.
So far it’s been coded in native bsh - shell on IOS. I have a ton of experience with it and C/C++. The problem is it will take forever to complete.
My basic algorithm/flow is to generate an MD5 for every file, record the checksum and path (including volume), and then manually process the resulting file to come up with unique copies that I can copy to a single volume (actually three - images, videos, and other documents) on a NAS RAID array. The final output would probably only run in the 2-4TB range (I have close to 2TB of video and images alone, and those would result in a manageable number of file “descriptors”…)
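For what it’s worth, the flow you describe (hash every matching file, record checksum and path, post-process the list) is only a few lines in a scripting language. Here’s a rough Python sketch; the function names, chunk size, and extension filter are my own illustration, not your actual code:

```python
import hashlib
import os
import sys

def md5_of_file(path, chunk_size=1 << 20):
    """MD5 a file in 1 MiB chunks so memory stays flat on huge files."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def checksum_manifest(roots, extensions=None):
    """Yield (md5, path) for every matching file under the given volumes."""
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                # Optional filter, e.g. extensions=(".jpg", ".mov")
                if extensions and not name.lower().endswith(extensions):
                    continue
                path = os.path.join(dirpath, name)
                try:
                    yield md5_of_file(path), path
                except OSError as e:
                    print(f"skip {path}: {e}", file=sys.stderr)
```

Printing each pair as `digest<TAB>path` gives you exactly the kind of flat file you can sort and post-process by hand afterwards.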
So, my question for the hive is this: would Perl be a better option to code this thing? Given that the bulk of the processing time is generating the MD5 for every matching file type, I doubt it, but I figured I’d ask.
I want to answer this question 37 different ways and I’m vapor locking on where to start. But I think the practical one is good: I don’t think switching from (whatever “bsh - shell on IOS” is) to Perl is the obvious solution to speed up your program.
I don’t know how you’re generating the checksums (I mean, is the MD5 algorithm a built-in library in native code, or are you shelling out to an external program, etc.), but that could be a place to optimize. And in general, for something like this, there are some standard faster ways to do it if you’re not already. For example, look at the file size first, and if it’s different, you know the files aren’t identical, so you don’t have to waste time generating a checksum. Refining the algorithm probably leads to larger gains than switching languages, unless the one you’re using is pathologically slow for some reason.
There are also existing programs to do exactly this task of identifying duplicate files. I would probably not write my own, but would likely spend as long researching and picking one out as it would take to cobble something together with a bunch of pipes. Or a Perl one-liner, because I’m an old-school Perl fan. Possibly one of the last, though, so merely mentioning it is likely to lead to a flood of suggestions to use something else. And again, I don’t think the problem here is using the wrong language.
I’ve looked into this but not found anything that works across multiple volumes to compile into one single one.
Well, the MD5 calculation is native IOS so that’s why I doubt switching language will help much.
Yeah, that’s the real bottleneck. The spindle drives most of this stuff is on vary between 70 and 140 MB/s (I have some server-class hard drives)… the SSDs I have are 250-600, even over USB. The fastest ones are almost as fast as the built-in SSD on my MacBook Pro (they have the SSD on the main board…) The slower ones are some of the first commercial SSDs that were available.
To calculate a checksum, you need to read the entire file, so it makes sense that it will be I/O bound. If you’re MD5ing every file, you have to read every single byte. That’s why checking the sizes first makes sense: the size comes from file metadata, which is far cheaper than reading the contents.
I’m confused by the IOS but I think I just clued in that you’re not doing this on an iPhone. Cisco router?
The default shell is either zsh or bash depending on the version of macOS (or tcsh in the really ancient versions). I’m not aware of any shell named bsh.
The basis of Apple’s OSes is called Darwin. It uses the XNU kernel inherited from NeXTStep (which is a BSD/Mach mashup with a lot of changes over the years) and userspace libraries and utilities that are a mix of old NeXTStep stuff, newer Apple stuff, and things borrowed from other flavors of BSD (FreeBSD, NetBSD, OpenBSD).
IOS is Cisco’s operating system and iOS is the iPhone operating system.
As for speeding up MD5 calculations, that’s going to depend on the details of how you’re doing it.
If you’re calling out to /sbin/md5 in both cases then I don’t think it’ll matter whether you do so from Perl or a shell; the overhead of forking and execing it is probably higher than the speed difference between Perl and the shell so you probably won’t be able to measure much difference.
If you’re comparing invoking /sbin/md5 from a shell script vs doing the MD5 calculation within Perl, without calling out to an external program, it’ll depend on which MD5 implementation you use in Perl. The standard Digest::MD5 module is an XS module with its core loop written in C, so it should be roughly competitive with the external md5 utility; it’s the pure-Perl fallbacks (like Digest::Perl::MD5) that are dramatically slower and worth avoiding on anything but the shortest files.
If instead you use Net::SSLeay::MD5 in order to make use of the OpenSSL (actually LibreSSL on recent versions of macOS) implementation of MD5 directly within Perl that might very well be faster than calling out to the md5 utility or it might be slower. Hard to say though. You’d need to test it to be sure. Which one is faster may well depend on the sizes of the files, as there’s a fixed overhead involved in forking and execing an external program that would probably make the external program slower for small files but on larger files it’ll just come down to which implementation is better optimized.
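To make that fixed fork/exec overhead concrete, here’s a small Python sketch (not Perl, but the trade-off is the same): it times an in-process MD5 of a 4 KiB buffer against merely spawning a do-nothing child process. The exact numbers will vary by machine; the point is that process startup alone typically dwarfs the cost of hashing a small file in-process.

```python
import hashlib
import subprocess
import sys
import time

def time_it(fn, repeats=20):
    """Return the average wall-clock seconds for one call of fn."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

small_file = b"x" * 4096  # stand-in for a small file already in memory

in_process = time_it(lambda: hashlib.md5(small_file).hexdigest())
# A child that does nothing still pays the full fork+exec+startup cost
# that calling out to an external md5 utility would.
fork_exec = time_it(lambda: subprocess.run([sys.executable, "-c", "pass"]))

print(f"in-process MD5 of 4 KiB: {in_process * 1e6:10.1f} us")
print(f"fork/exec of a no-op:    {fork_exec * 1e6:10.1f} us")
```

On large files both approaches spend nearly all their time reading and hashing bytes, so the startup cost washes out and it comes down to which MD5 implementation is better optimized, as described above.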
Well, I know that you specified two tools: bsh or Perl. I have used both. When Larry Wall created Perl, he was trying to mash together several tools including sed, awk, grep, the Bourne shell, and csh, with the goal of creating a superset language that could do it all. Unfortunately, each of those tools solved overlapping needs with completely different syntax. You find that almost universally, each place with a large volume of Perl code has created its own dialect by choosing certain syntax idioms as its preferred ones. While Perl will give you a richer environment, the syntax is a mess and I don’t think you will get much, if any, speed improvement.
If you are looking to speed up the process, I agree with what others have already said: start with the file size since you can get that from a directory listing without having to read the entire file. Once you have that, group files with identical sizes and then calculate the MD5 only when you have multiple files of the same size. This gives you a significant speed improvement because you only have to read the file contents if you have a size collision.
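That two-pass idea (stat everything first, then hash only the size collisions) might look like this in Python; the helper name and 1 MiB chunk size are my own choices, and you’d adjust the roots and filtering to your layout:

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(roots):
    """Group files by size first; MD5 only the size-collision groups."""
    by_size = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    by_size[os.path.getsize(path)].append(path)
                except OSError:
                    continue  # vanished or unreadable; skip it

    duplicates = defaultdict(list)  # digest -> all paths with that content
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # unique size means unique content: never read the file
        for path in paths:
            h = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            duplicates[h.hexdigest()].append(path)
    return {d: ps for d, ps in duplicates.items() if len(ps) > 1}
```

Since most files in a typical collection have a unique size, the expensive full-read hashing only happens for the small minority that actually collide.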
If you really want to step out of bsh into a more pure programming language, I would suggest Python. Available almost anywhere and has libraries to do more than just about any other language. The syntax is so clean you can eat off of it and it is very concise, i.e. it takes very little code to do the work. It has taken the top spot as the most popular programming language for the last two years and would be a very usable skill in today’s market and a good resume builder. It is my language of choice for these kinds of tasks.
I know a lot of people who use Python. I’ll look into it. Somebody chastised me for referring to the OSX shell as bsh; I guess after over thirty years of coding on UNIX shells, I just never really noticed a difference.
I like the idea of an initial pass using just file size as well. Thanks for the idea…
There’s not that huge of a difference. macOS and Linux are both Unix-derived. It’d be like complaining that the Windows command line isn’t Command Prompt or DOS.
While technically correct (the best kind of correct), it’s functionally the same.
Yes, and no. macOS is derived from the BSD branch of Unix and inherits a lot of that BSD code, while Linux was written from the ground up as a Unix work-alike. They behave similarly, but they have completely different origins.
You are not completely wrong. Although the code bases were developed separately, Linux quickly gained open-source versions of all of the standard Unix utilities. If you are a Unix user and sit down at a Linux system, you will likely feel pretty much at home until you want to start changing system-level behaviors. For regular interactive use, you have many clones of Unix shells, utilities, and tools available. I can sit down at a big-iron system running Unix, a Mac running macOS, or a PC running Linux and be right at home. My bash shell is there, and the vi (or vim) editor that I have used for 40 years works on all of them. I can use grep, sed, awk, etc., and they generally function identically. You may find some separately evolved fringe features, but the core just works across all of these systems.