Coding question - Perl vs UNIX Shell (with heavy MD5 computation)

I have been working on a program to consolidate many, many TB of backup files.

So far it’s been coded in native bsh - shell on IOS. I have a ton of experience with it and C/C++. The problem is it will take forever to complete.

My basic algorithm/flow is to generate an MD5 for every file, record the checksum and path (including volume), and then manually process the resulting file to come up with unique copies that I can copy to a single volume (actually three - images, videos, and other documents) on a NAS RAID array. The final output would probably only run in the 2-4TB range (I have close to 2TB of video and images alone, and those would result in a manageable number of file “descriptors”…)
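For what it’s worth, that first hash-and-record pass is only a few lines in a scripting language. Here’s a minimal Python sketch (the volume roots and extension list are made-up placeholders, not anything from this thread):

```python
import hashlib
import os

def md5_of_file(path, chunk_size=1 << 20):
    """MD5 a file by reading it in 1 MiB chunks, so multi-GB videos
    never have to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def index_files(roots, extensions):
    """Walk each volume root and yield (md5, path) for every file
    whose extension is in the given set."""
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if os.path.splitext(name)[1].lower() in extensions:
                    path = os.path.join(dirpath, name)
                    yield md5_of_file(path), path

if __name__ == "__main__":
    # Placeholder roots/extensions -- substitute your real volumes.
    for checksum, path in index_files(["/Volumes/Backup1", "/Volumes/Backup2"],
                                      {".jpg", ".png", ".mp4", ".mov"}):
        print(checksum, path)
```

The one output line per file (“checksum path”) can then be sorted and grouped by checksum, which is the manual-processing step described above.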

So, my question for the hive is this: would Perl be a better option to code this thing? Given that the bulk of the processing time is generating the MD5 for every matching file type, I doubt it, but I figured I’d ask.

1 Like

I wonder what the bottleneck is? If it’s disk I/O then it won’t matter what you pick.

If it’s single cpu limits then parallelizing might help regardless of what you’re using.
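If the disks can keep up, hashing several files at once is a small change. A sketch using Python’s thread pool (threads are enough here, assuming CPython: hashlib releases the GIL during large updates, and the reads spend much of their time waiting on I/O anyway):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def md5_of_file(path, chunk_size=1 << 20):
    """Chunked MD5 so large files never sit wholly in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return path, h.hexdigest()

def hash_all(paths, workers=4):
    """Hash many files concurrently; returns {path: md5hex}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(md5_of_file, paths))
```

Whether this helps at all comes back to the bottleneck question: four workers against one spinning disk may just make the heads seek more.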

1 Like

I want to answer this question 37 different ways and I’m vapor locking on where to start. But I think the practical one is good: I don’t think switching from (whatever “bsh - shell on IOS” is) to Perl is the obvious solution to speed up your program.

I don’t know how you’re generating the checksums (I mean, is the MD5 algorithm a built-in library in native code, or are you shelling out to an external program, etc.), but that could be a place to optimize. And in general, for something like this, there are some standard faster approaches if you’re not already using them. For example, look at the file sizes first: if two files differ in size, you know they aren’t identical, so you don’t have to waste time generating checksums for them. Refining the algorithm probably leads to larger gains than switching languages, unless the one you’re using is pathologically slow for some reason.
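The size short-circuit is a one-line guard in front of the expensive read. A hypothetical Python helper, just to make the idea concrete:

```python
import hashlib
import os

def same_contents(path_a, path_b, chunk_size=1 << 20):
    """True only if the two files are byte-identical.
    The size comparison costs one stat() per file and reads nothing;
    only size-matched pairs pay for full hashing."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False

    def md5(path):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.digest()

    return md5(path_a) == md5(path_b)
```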

There are also existing programs that do exactly this task of identifying duplicate files. I would probably not write my own, but would likely spend as long researching and picking one out as it would take to cobble something together with a bunch of pipes. Or a Perl one-liner, because I’m an old-school Perl fan. Possibly one of the last, though, so merely mentioning it is likely to lead to a flood of suggestions to use something else. And again, I don’t think the problem here is using the wrong language.

4 Likes

I’ve looked into this but not found anything that works across multiple volumes to compile into one single one.

Well, the MD5 calculation is native IOS so that’s why I doubt switching language will help much.

Yeah, that’s the real bottleneck. The spindle drives most of this stuff is on vary between 70-140 MB/s (I have some server-class hard drives)… the SSDs I have do 250-600, even over USB. The fastest ones are almost as fast as the built-in SSD on my MacBook Pro (they have SSD on the main board…). The slower ones are some of the first commercial SSDs that were available.

1 Like

GitHub - jbruchon/jdupes: A powerful duplicate file finder and an enhanced fork of ‘fdupes’?

To calculate a checksum, you need to read the entire file, so it makes sense that this will be I/O bound. If you’re MD5ing every file, you have to read every single byte. That’s why checking the sizes first makes sense: it can be done much more cheaply.

I’m confused by the IOS but I think I just clued in that you’re not doing this on an iPhone. Cisco router?

3 Likes

Mac OS.

Native UNIX, with a bsh shell, C compiler, and I have Perl loaded as well.

1 Like

I was confused by this as well – I’ve been bashing in MacOS for years, but had never heard of a shell in IOS.

1 Like

Ja. Native UNIX and bsh is the “terminal” app, like CMD in DOS. That’s one reason I love it, as I started on and worked with UNIX for over 3 decades.

Native C compiler and easily installed extensions like Perl are a bonus.

Edit - I guess it’s more generally known as OSX.

1 Like

It’s macOS now, for the pedants who follow Apple’s various respellings.

3 Likes

Correct. I never paid much attention. It’s the basis of all Apple products’ operating systems, including mobile devices.

I just know it’s UNIX and includes a native Shell.

1 Like

The default shell is either zsh or bash depending on the version of macOS (or tcsh in the really ancient versions). I’m not aware of any shell named bsh.

The basis of Apple’s OSes is called Darwin. It uses the XNU kernel inherited from NeXTStep (a BSD/Mach mashup with a lot of changes over the years) and userspace libraries and utilities that are a mix of old NeXTStep stuff, newer Apple stuff, and things borrowed from other flavors of BSD (FreeBSD, NetBSD, OpenBSD).

IOS is Cisco’s operating system and iOS is the iPhone operating system.

As for speeding up MD5 calculations, that’s going to depend on the details of how you’re doing it.

If you’re calling out to /sbin/md5 in both cases then I don’t think it’ll matter whether you do so from Perl or a shell; the overhead of forking and execing it is probably higher than the speed difference between Perl and the shell so you probably won’t be able to measure much difference.
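That fixed fork/exec cost is easy to see with a toy measurement. The sketch below times an in-process MD5 of a small buffer against spawning a no-op /bin/sh; the shell stands in for the external md5 program, since the point is the process-launch overhead rather than the hashing itself:

```python
import hashlib
import subprocess
import time

def time_inprocess_md5(data, runs=50):
    """Average seconds per in-process MD5 of `data`."""
    start = time.perf_counter()
    for _ in range(runs):
        hashlib.md5(data).hexdigest()
    return (time.perf_counter() - start) / runs

def time_process_spawn(runs=50):
    """Average seconds to fork+exec a child that does nothing."""
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run(["/bin/sh", "-c", ":"], check=True)
    return (time.perf_counter() - start) / runs

if __name__ == "__main__":
    data = b"x" * 4096  # roughly a small file's worth of bytes
    print(f"in-process md5 of 4 KiB: {time_inprocess_md5(data) * 1e6:.1f} us")
    print(f"fork+exec of a no-op:    {time_process_spawn() * 1e6:.1f} us")
```

On a typical machine the spawn costs on the order of a millisecond against microseconds for the hash, which is why the launch overhead swamps any Perl-vs-shell difference on small files.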

If you’re comparing invoking /sbin/md5 from a shell script vs doing the MD5 calculation within Perl, without calling out to an external program, it’ll depend on which MD5 implementation you use in Perl. The standard Digest::MD5 module is implemented in C (as an XS extension), so it should be roughly competitive with the external md5 utility; only the pure-Perl fallback, Digest::Perl::MD5, would be dramatically slower on anything but the shortest files.

If instead you use Net::SSLeay::MD5 to get at the OpenSSL (actually LibreSSL on recent versions of macOS) implementation of MD5 directly within Perl, that might very well be faster than calling out to the md5 utility, or it might be slower; you’d need to test it to be sure. Which one wins may well depend on the sizes of the files: the fixed overhead of forking and execing an external program would probably make it slower for small files, but on larger files it’ll just come down to which implementation is better optimized.

3 Likes

Well, I know that you specified two tools: bsh or Perl. I have used both. When Larry Wall created Perl, he was trying to mash together several tools, including sed, awk, grep, sh, and csh, with the goal of creating a superset language that could do it all. Unfortunately, each of those tools solved overlapping needs with completely different syntax. You’ll find that almost universally, each place with a large volume of Perl code has created its own dialect by choosing certain syntax idioms as its preferred ones. While Perl will give you a richer environment, the syntax is a mess and I don’t think you will get much, if any, speed improvement.

If you are looking to speed up the process, I agree with what others have already said: start with the file size since you can get that from a directory listing without having to read the entire file. Once you have that, group files with identical sizes and then calculate the MD5 only when you have multiple files of the same size. This gives you a significant speed improvement because you only have to read the file contents if you have a size collision.
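As a sketch of that two-pass idea in Python (the function name and shape are mine, not a library API):

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(paths):
    """Pass 1: bucket paths by file size (one cheap stat each).
    Pass 2: MD5 only the buckets holding two or more paths.
    Returns {md5hex: [paths...]} for groups of true duplicates."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    by_digest = defaultdict(list)
    for group in by_size.values():
        if len(group) < 2:
            continue  # unique size => unique content, never read the file
        for path in group:
            h = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_digest[h.hexdigest()].append(path)

    return {d: g for d, g in by_digest.items() if len(g) > 1}
```

Every file whose size is unique across the whole collection is never read at all, which is where the big savings come from on photo/video archives.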

If you really want to step out of bsh into a more pure programming language, I would suggest Python. It’s available almost anywhere and has libraries for just about anything. The syntax is so clean you can eat off of it, and it’s very concise, i.e. it takes very little code to do the work. It has taken the top spot as the most popular programming language for the last two years, and it would be a very usable skill in today’s market and a good résumé builder. It is my language of choice for these kinds of tasks.

3 Likes

I know a lot of people that use Python. I’ll look into it. Somebody chastised me for referring to the OSX shell as bsh; I guess after over thirty years of coding on UNIX shells, I just never really noticed a difference. :roll_eyes:

I like the idea of an initial pass using just file size as well. Thanks for the idea…

1 Like

There’s not that huge of a difference. macOS and Linux are both Unix-derived. It’d be like complaining that the Windows command line isn’t Command Prompt or DOS.
While technically correct (the best kind of correct), it’s functionally the same.

1 Like

Yes, and no. macOS is derived from the BSD branch of Unix, while Linux was written from the ground up as a Unix work-alike. Linux shares no code with the BSD or AT&T lines; it was written from scratch and has a completely different origin.

2 Likes

Well, guess I’m this many days old when I learned they aren’t exactly related.

You are not completely wrong. Although the code bases were developed separately, Linux quickly gained open-source versions of all of the standard Unix utilities. If you are a Unix user and sit down at a Linux system, you will likely feel pretty much at home until you want to start changing system-level behaviors. For regular interactive use, you have many clones of Unix shells, utilities, and tools available. I can sit down at a big-iron system running Unix, a Mac running macOS, or a PC running Linux and be right at home. My bash shell is there, and the vi (or vim) editor that I have used for 40 years works on all of them. I can use grep, sed, awk, etc., and they generally function identically. You may find some separately evolved fringe features, but the core just works on every one of these systems.

3 Likes

hehe… just realized I said three years above. Autocorrect must have changed it because I meant to say 30.

I worked with various versions since the late 80’s, had Amiga Unix in college and at home until I moved over here. Every step of my career I worked with or coded on *IX systems of some type.

It’s not a badge of honor or anything, I just always liked the flexibility of the shell combined with easy coding, mostly in C for me.

1 Like

This topic was automatically closed 32 days after the last reply. New replies are no longer allowed.