Unix utilities, Part 2
Peter Seebach (unixcomponents@seebs.plethora.net)

We explore how to combine the components of the Unix programming environment to solve a variety of tasks.

Unix pipes are effectively a component architecture, as we discussed in Part 1 of this series. In this installment, we'll strengthen the argument for saying so. But first, we'll look at how Unix supports a design goal that underlies component architecture, object-oriented architecture before it, and structured programming before that. This is the principle of decomposition, which allows us to build full results from partial ones.

Partial results are better than none

Sometimes the goal is simply to produce an immediate result: you don't care much about performance (within broad limits), and you don't need to worry about special cases because you know a lot about your input stream. In these cases, focus on the simplicity and convenience of the system. Unix is very good at mostly solving a problem in a couple of steps, then taking a couple more to clean up the results into the desired format. Unix tools are built with the idea that it's fine for the first tool in a pipeline to solve only 90% of the problem. Indeed, it's fine if it solves only 10% of the problem, as long as that helps break the problem down into bite-sized chunks.

So let's say I want to know how common a word is in a file. Rather than looking for a "utility that counts occurrences of each word in a file" (which would be rarely used), I break the problem down into steps: split the stream into one word per line, select the lines that match the word, and count them.
Thus, for instance, I might use the following pipeline:
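One way to write that pipeline, assuming the word I'm counting is "foo" (the sample file here is purely hypothetical):

```shell
# A small sample file, purely for illustration
printf 'foo bar foo\nbaz\nfoo\n' > /tmp/sample.txt

# Break the stream into one word per line, then count the exact matches
tr -cs 'A-Za-z' '\n' < /tmp/sample.txt | grep -c '^foo$'
# prints 3
```

Note that "foo bar foo" contributes two matches, because tr has already split it into separate lines before grep counts anything.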
What does that do? It breaks the input into individual words, one per line, and then counts the lines that match the word I care about.

Throughout this, you may find yourself wondering how you'll learn all these weird commands. Well, you learn them the same way you learn any other system of tools: gradual exposure and study, and heavy use of documentation. The Unix man pages, which every newbie hates, are beautiful references if you already know (mostly) what you want and you're trying to remember how to spell it. Unix isn't necessarily harder to learn than anything else; it just looks hard if it's not the thing you learned first. The man pages are made especially useful by the apropos command (equivalently, man -k), which searches their one-line summaries for a keyword when you can't remember a name.

Looking only for lines that contain "foo" wouldn't have solved my problem; it wouldn't have counted "foo bar foo" on the same line as two instances. Counting lines would have been useless. However, once I broke the stream into individual records matching the kind of data I was looking for, everything fell into place.

Pipes are cheap

Unix favors initial solutions that are cheap in programmer time, even if they aren't maximally efficient. Obviously, a carefully tuned program customized for a given task is likely to outperform a series of generalized programs communicating through pipes. (Of course, this will require careful tuning -- a badly tuned program may do all of its tasks poorly and end up being slower.) Experienced Unix users aren't afraid to add a tool to a pipeline that does only one trivial thing: it's cheaper than doing the work by hand, and it's not going to slow you down measurably while you're solving the problem. If you end up repeating a task frequently, and you have a performance problem, then you try to improve it. But you don't worry about it while you're looking at the problem for the first time. So, for instance, with the "counting occurrences of foo" example above, I could have written a single, more elaborate command to do the whole job, but stringing together a handful of small tools was quicker to get right.

In some systems, pipes are implemented using hidden temporary files, and the first program in a pipe must complete before the second can start running.
Unix systems don't do this -- all of the programs in a pipe run simultaneously, processing data as it comes through. Distributed components using network sockets depend on the same feature -- both sides of a socket can be working at once. Indeed, some Unix tools end up providing a pipe that runs to another machine over a network.

Build your own tools

For instance, let's say I frequently want a list of the words in a file, sorted by frequency. I might create a shell script that looks something like:
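A classic way to write it (the script's name is up to you; I'll call it wordfreq here):

```shell
#!/bin/sh
# wordfreq: print each word of standard input with its frequency,
# least common first
tr -cs 'A-Za-z' '\n' |   # one word per line
sort |                   # bring identical words together
uniq -c |                # count each run of identical lines
sort -n                  # order by count, ascending
```

No single stage does anything clever; the combination is the tool.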
Once again, none of these utilities is doing anything very hard, but the series ends up being quite powerful. Better still, the resulting utility can itself be used quite effectively in another pipeline.

A note to readers who aren't used to Unix: that is the whole program -- no IDL wrappers, no frameworks, no templates, no makefiles, no nothing. If you type it in and flag it as executable, it's a program, and it will do exactly what it's supposed to do. You don't need a special tool to make this program work as a component. It already is a component.

Some of you may not think this is a useful application, but it's really the same as, say, an application sorting a customer base by the number of times customers have missed payments. You hand the application the list of people who missed payments, and the output is the same list, sorted by how often they missed.

Let's say I want to check on the least and most used words in a given file. I can now pipe the script's output through head to see one end of the frequency list and through tail to see the other. Having found a useful combination, I treat it as a stepping stone, not just as a goal.

Tools as plug-ins

Editors show the same pattern. Instead of providing every feature you could ever want in an editor, Unix tends to give you a way to run the contents of a file through a tool, producing corrected output. In vi, for instance, the ! command replaces a range of lines with the result of running those lines through an arbitrary filter. Want arithmetic evaluation? Shove the current line through a calculator. This is one of the things Unix does very well. In most systems, merely knowing that a component exists wouldn't make it practical for a user to plug it into a word processor and apply it in place to a block of text.

Challenge your expectations

The user can set up combinations of tools that no tool's author ever anticipated. It turns out that everything works just fine. Imagine that we keep the human resources database as a tab-delimited flat file with several sections: the employee listings, the department listings, and the performance evaluation section. We want to remove all the various records associated with an employee who has left the company. (Note that, as we'll see later, it doesn't matter whether we actually keep the data tab delimited; we can export and import to make it temporarily tab delimited.) Or perhaps we maintain a customer contact database, and now we have to comply with a federal regulation requiring us to remove all information for customers who have signed on to an opt-out list. I can do this by hand. It will take a while, but I can do it. Or, I can create a shell script:
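A sketch of such a script. The name delrecs and the record layout (key in the first tab-delimited field) are assumptions for illustration; real databases will differ:

```shell
#!/bin/sh
# delrecs (hypothetical name): delete every tab-delimited record whose
# first field is exactly $1 from the file named by $2, rewriting it in place
key=$1 file=$2
tab=$(printf '\t')
if [ -n "$key" ] && [ -f "$file" ]; then
    grep -v "^${key}${tab}" "$file" > "$file.new" &&
        mv "$file.new" "$file"
fi
```

Because every section of the file keys its records the same way, one grep invocation cleans all of them at once.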
If I run that script with the departed employee's key, every record for that employee disappears from every section in a single pass. Essentially, to the tools involved, the database is just a stream of text records; they neither know nor care what the records mean.

Switching modes

Consider what it costs, in most component architectures, to package a piece of code as a component at all: interface definitions, registration, glue code. Of course, for a large component in a traditional component architecture, this is hardly a detectable load. However, it makes small components prohibitively expensive. Adding 1% to the code base of a large and powerful program is under the radar. Tripling the cost of writing a small component means you don't write it -- or that you end up writing components that try to do it all, instead of components that follow the excellent example set by the traditional Unix utilities: do one thing, and do it well.

Take our example above, in which we used a pipeline of small programs: the only glue any of them needed was the pipe itself. Unix comes with a variety of wrappers that perform key translations; others can be written easily, as we have seen in this article (and will see more of in the next). The most powerful, and mind-bending, of all the Unix wrappers is probably xargs, which turns the names it reads on standard input into command-line arguments for another program. For instance, let's say I wish to look for the string "foo" in every C source file in or under the current directory:
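One way to do it, with find generating the list of file names and xargs handing them to grep (the sample tree is hypothetical):

```shell
# A tiny source tree, purely for illustration
mkdir -p /tmp/src/sub
printf 'int foo;\n' > /tmp/src/a.c
printf 'int bar;\n' > /tmp/src/sub/b.c
cd /tmp/src

# find produces the file names; xargs turns them into arguments to grep
find . -name '*.c' -print | xargs grep foo
# prints ./a.c:int foo;
```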
Perhaps I just wish to know which files contain the string?
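grep's -l option reports only the names of the files that match, rather than the matching lines:

```shell
# Another small sample tree, purely for illustration
mkdir -p /tmp/src2
printf 'int foo;\n' > /tmp/src2/a.c
printf 'int bar;\n' > /tmp/src2/b.c
cd /tmp/src2

# -l makes grep print only the names of matching files
find . -name '*.c' -print | xargs grep -l foo
# prints ./a.c
```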
Note that, in the Unix environment, the same wrapper works unchanged with any command; it doesn't need to know anything about the program it is wrapping. Let's say you want to run a given program on a series of files, but have the output land in place rather than in a stream. A simple version of this utility might take file names on standard input, and the command to run as its arguments. It might look like this:
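A bare-bones sketch (the name inplace is my own, and error handling is deliberately omitted):

```shell
#!/bin/sh
# inplace (hypothetical name): for each file named on standard input,
# run the given command with that file as its input, then replace the
# file with the command's output
while read name
do
    "$@" < "$name" > "$name.new" && mv "$name.new" "$name"
done
```

You'd use it as, say, `ls *.txt | inplace tr 'a-z' 'A-Z'` to uppercase a batch of files in place.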
Note that this is not safe or reliable. For a "real" application, you'd want to do a lot more testing, and protect against some common mistakes. However, if you know the inputs and the data set, you can do this and it'll work fine. Not sure you got it right? Change the middle line to echo the command instead of running it,
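Assuming the loop variable is called name and the command arrives as the positional parameters (as in a typical while read loop), the echoing middle line might be:

```shell
echo "$@" "<" "$name" ">" "$name.new"
```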
and see whether the list of commands it generates looks right. Better yet, if you want to run a test case, just pipe the utility's output into sh. The shell itself is quite willing to be used as a tool: it takes commands on standard input, and puts the output of those commands on standard output.

In other words, every tool is designed to be used to build other tools, and the resulting tools are themselves still designed to play well with others. If a tool doesn't work the way you want it to, there's probably a tool to perform the "translation" of interfaces. If not, you can write one easily. In Part 3, I'll discuss building new tools.