Cool Commands: GNU find

A couple of days ago I noticed that my samba server running under Linux was listening on all interfaces, including the wireless interface for the public network that I share where I am, which is something I don’t want.  Why?  Because this server contains file shares for a large amount of my data.  This is not stuff I want publicly accessible.

Strictly speaking, the shares can only be accessed by a remote user who authenticates against samba, which requires that their credentials have been added to the samba password database with the smbpasswd command.  So even though the server was visible, it was not actually accessible.  For security purposes, however, it is better not to have the server listening on the wireless interface at all, since I only need to access it locally through an Ethernet connection here.

Once I told samba to listen only on the Ethernet interface, however, I noticed that it had created a large number of log files: for every host that tries to access the server, samba creates a separate log file for that host in /var/log/samba.  Having sat exposed on the public network here meant that every machine that automatically browsed the network for available shares had tried to communicate with my samba server (I don’t think any of the connections were malicious attempts to access my data).  These per-host log files did not get cleaned up automatically by logrotate.  I was looking through the logs to check on some issues I was having, and it was annoying to have all these old log files lying around, since they contained only irrelevant information about failed connection attempts.

To get rid of all these useless log files I broke out one of the most useful command-line utilities ever made: the venerable GNU find command, which is part of the findutils suite of tools.  find can usually do in one terse line what would otherwise take several commands patched together, if not more.  It can recurse through a directory tree, look for files matching a pattern, and then perform actions on whatever matches.  This is one of those things in Information Technology that will always be useful, no matter how advanced things seem to get and no matter how many fancy applications get developed, because ultimately information gets stored in nested directories of files which have attributes, and whose contents need to be processed in one way or another.  It simply does not get more essential than this*.

To perform my task, I simply had to run:

find . -type f ! -mtime -2 -print0 | xargs -0 rm -f

To translate this command into English: I told it to find everything in the current directory and below (find .) that is a regular file (-type f).  The -type f test was a safety precaution: there could have been directories under this one which met the criteria, and I wanted only files deleted, not directories.  As it turns out, there were no directories.
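As a small illustration of the -type f test, here is a sketch on a throwaway directory.  The file and directory names are made up for the example and are not taken from my server.

```shell
# Scratch directory with one file at the top level and one in a subdirectory.
demo=$(mktemp -d)
mkdir "$demo/subdir"
touch "$demo/log.a" "$demo/subdir/log.b"

# Without a test, find lists everything: the start directory, subdir, and both files.
everything=$(find "$demo" | wc -l | tr -d ' ')

# With -type f, only regular files are listed; directories are skipped.
files_only=$(find "$demo" -type f | wc -l | tr -d ' ')

echo "all entries: $everything, regular files: $files_only"
rm -rf "$demo"
```

Note that find still descends into subdir; -type f only filters what is reported.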

The next part is what is really cool about find, and why I love it.  You can craft expressions to match exactly what you want.  In this case, I wanted to match all files that had not been modified within the last two days (48 hours).  Anything that samba had not needed to log in that time was probably irrelevant to me.  Here is the expression that did it: ! -mtime -2 It’s so simple that it’s elegant!  In Unix filesystems mtime means modification time, which is the most recent time that a file’s contents were changed.  Any time a log file is created or written to, its mtime is naturally updated.  For what it’s worth, there are also ctime, which is the time of the last status change (a change to the file’s metadata such as its permissions or owner, not its creation time), and atime, which is the last access time.  mtime is usually the most important one for administrative purposes.

To explain this expression: the value after -mtime is a number n counted in units of 24 hours.  find takes the file’s age, divides it by 24 hours, and discards any fraction.  Therefore -mtime 0 means anything modified within the last 24 hours, -mtime 1 means anything modified between 24 and 48 hours ago, -mtime 2 anything modified between 48 and 72 hours ago, and so on.  Note that -mtime 2 does not mean anything within the last 72 hours, only the single 24-hour period that starts n*24 hours ago!  The expression syntax is very strict about this!  Since I wanted everything from now back to 48 hours ago, I put a minus sign in front of n, therefore -2, which means an age of less than 2 whole days, that is, less than 48 hours.  If I had put +2 instead, that would have meant an age of more than 2 whole days, that is, anything modified 72 hours ago or earlier.
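The way -mtime counts whole 24-hour periods is easy to check on a scratch directory.  The following is only a sketch: it assumes GNU touch (for the relative dates given to -d), and the file names are invented for the demonstration.

```shell
# Create files with known ages.
demo=$(mktemp -d)
touch -d '1 hour ago'    "$demo/fresh.log"      # age 0 whole days
touch -d '30 hours ago'  "$demo/yesterday.log"  # age 1 whole day
touch -d '60 hours ago'  "$demo/older.log"      # age 2 whole days
touch -d '100 hours ago' "$demo/ancient.log"    # age 4 whole days

# n alone: exactly n whole days old (here 48 to 72 hours).
exact_two=$(find "$demo" -type f -mtime 2 | sort)

# -n: fewer than n whole days old (here less than 48 hours).
under_two=$(find "$demo" -type f -mtime -2 | sort)

# +n: more than n whole days old (here 72 hours or more).
over_two=$(find "$demo" -type f -mtime +2 | sort)

printf 'exactly 2:\n%s\n' "$exact_two"
printf 'less than 2:\n%s\n' "$under_two"
printf 'more than 2:\n%s\n' "$over_two"
rm -rf "$demo"
```

The 60-hour-old file appears only under -mtime 2: it is neither "less than 2" nor "more than 2" whole days old.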

But wait a minute.  Now I have -mtime -2 to indicate everything that has been modified within the previous 48 hours.  But I actually want to delete everything that is NOT that.  Easy, just put a ! in front of the expression: ! -mtime -2 Now I have matched everything that has not been modified within the past 48 hours.
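Before deleting anything, it is worth previewing what a negated expression matches.  Here is a sketch of such a dry run on a scratch directory, again assuming GNU touch and using invented file names.

```shell
demo=$(mktemp -d)
touch -d '5 hours ago' "$demo/log.recent"
touch -d '4 days ago'  "$demo/log.stale"

# Dry run: -print lists the matches and removes nothing.
stale=$(find "$demo" -type f ! -mtime -2 -print)

# The same selection without negation: +1 means "more than one whole day old",
# which is the same set as "not less than two whole days old".
same=$(find "$demo" -type f -mtime +1 -print)

echo "would delete: $stale"
rm -rf "$demo"
```

In an interactive shell with history expansion enabled, the ! may need to be written as \! so the shell leaves it alone.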

To get an idea of how useful this is, imagine if you had to perform this same task with Windows Explorer, and the directory contained hundreds of files, some of which matched, some of which didn’t.  Yes, you could manually go through and select all the candidates you want to delete, hopefully not making any mistakes, but it would be a tedious, arduous process at best.  Now imagine you have to do this on 5 different machines!  With the eminent find command we have reduced a potentially very arduous task that would be error prone and tedious to something very simple and fast.

The remainder of the command relates to the actual processing of the files that match, that is, deleting them: -print0 | xargs -0 rm -f The -print0 | xargs -0 part takes the output of find (all the files that matched ! -mtime -2) and prepares it for the action we want to perform on those files, in this case deletion, which is accomplished with rm -f at the end.  Note that instead of deleting them with rm -f we could have performed any number of other actions, such as renaming them, moving them to a different location, etc.
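As an example of a different action, stale files could be moved aside rather than deleted.  This is a sketch only: the directory layout is hypothetical, and it relies on GNU extensions (touch -d, mv -t, and xargs -r, which skips running the command when nothing matched).

```shell
demo=$(mktemp -d)
mkdir "$demo/logs" "$demo/archive"
touch -d '4 days ago' "$demo/logs/log.old1" "$demo/logs/log.old2"
touch "$demo/logs/log.new"

# Same selection as before, but the matches are moved into archive/ instead of removed.
find "$demo/logs" -type f ! -mtime -2 -print0 | xargs -0 -r mv -t "$demo/archive"

moved=$(ls "$demo/archive" | sort | tr '\n' ' ')
kept=$(ls "$demo/logs")
echo "moved: $moved"
echo "kept: $kept"
rm -rf "$demo"
```

For plain deletion, GNU find also has a built-in -delete action that can replace the pipe to xargs rm -f.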

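Why -print0 | xargs -0 is required is a little esoteric.  In short, -print0 separates the file names with a null character instead of a newline, and xargs -0 splits its input on that same character, so names containing spaces or newlines are passed to rm intact.  The reader can find out more by consulting the excellent manual page for find (type man find in a console window).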
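To see the difference that the null separator makes, the following sketch counts how many arguments xargs hands to a command, with and without it.  The file names are invented; one of them deliberately contains spaces.

```shell
demo=$(mktemp -d)
touch "$demo/plain.log" "$demo/name with spaces.log"

# Newline-separated output: xargs splits on whitespace, so the second
# name is broken into three separate arguments.
split_count=$(cd "$demo" && find . -type f -print | xargs printf '%s\n' | wc -l | tr -d ' ')

# Null-separated output: each file name arrives as exactly one argument.
safe_count=$(cd "$demo" && find . -type f -print0 | xargs -0 printf '%s\n' | wc -l | tr -d ' ')

echo "arguments without -print0: $split_count"
echo "arguments with -print0:    $safe_count"
rm -rf "$demo"
```

With rm -f in place of printf, the split version would try to remove files called ./name, with and spaces.log, none of which is the intended file.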

The GNU find command is so useful that I think it would be a good idea for every child to learn it in grade school, because we are always going to have to process information, and knowing how to do so efficiently is a skill that will always be valuable.

* This is also interesting from a philosophical viewpoint.  The fact that archiving and accessing data electronically involves maintaining and managing certain attributes of that data almost reminds me of certain a priori types of knowledge which exist, for example in mathematics the fact that the sum of the internal angles of a triangle always equals 180 degrees.  There seem to be certain a priori aspects of information technology: no matter how certain problems are addressed or reduced, there always seem to be certain aspects associated with them which ultimately must be accounted for in order to make things like archival and access of data possible.

