How can I maximize storage problems for me, my group, and all other users?
Here's a collection of best practices to follow if you prefer
- high risk of severe data loss
- lousy performance
- minimal availability
That's ridiculous? Maybe. But a significant fraction of users must like it that way, or the following would not keep happening time and time again:
Prefer NFS over AFS
NFS is a great source of problems. It does not scale at all, does not support replication or transparent relocation of data, and its weak caching semantics occasionally produce interesting and surprising effects. It is very easy to make an NFS server grind to a halt, and an additional effect is often that many clients won't recover without being rebooted.
- All you have to do in order to render a server unusable is to create O(10000) files in the same directory, and then access that. Try it, it's fun.
- Writing to the same filesystem from many clients (farm nodes) simultaneously will result in very good fragmentation of files, causing slow read performance on them ever after.
There is no way to move data to a different server while keeping it available to clients. During the process, there's always an extended period when it's read-only. Afterwards, you either have to access your old data under a different path, or all clients need to be reconfigured (which usually does not work smoothly), or there's some period when the data is not available at all.
In addition, NFS is very insecure. Anyone with (a) a Linux notebook, (b) intermediate knowledge, and (c) access to any network wall socket providing a subnet with at least one client to which a filesystem is exported read-write can read, wipe out, or (most interesting) modify your data on that filesystem with very little effort.
Maximize the size of your AFS volumes
Be tidy, keep it all in one place. This very effectively prevents any load balancing because it's impossible to distribute the data across multiple servers and RAID sets. It also prevents load sharing by making it impractical to replicate such volumes. Since it takes so long to move such a volume to a different server (even if the required huge chunk of space is available on one), you run no risk of being offered uninterrupted access to part of your data in case a fileserver needs to be serviced. Finally, you maximize your risk of data loss: if the volume ever gets corrupted (say, due to a hardware malfunction), it will probably affect all your files in the volume.
So, next time you create a repository for some event data or Monte Carlo,
- do not create one volume per year or run period, even though it may be very tempting, especially if the data is to be organized in corresponding subdirectories anyway
- if you've got 15000 numbered files making up 150 GB, do not consider creating 15 subdirectories/volumes for them - you could get several times higher throughput that way and would even avoid the overhead due to very many entries in a single directory
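For those who insist on the forbidden alternative, the split could look roughly like the sketch below. The numbering scheme (0 to 14999), the `partNN` directory names, and the helper `subdir_for` are assumptions made purely for illustration; each subdirectory would be the mount point of its own volume of about 10 GB.

```python
def subdir_for(file_number: int, files_per_dir: int = 1000) -> str:
    """Return the (hypothetical) subdirectory name for a numbered file."""
    if file_number < 0:
        raise ValueError("file numbers must be non-negative")
    # Integer division groups consecutive file numbers into blocks.
    return f"part{file_number // files_per_dir:02d}"


# 15000 files numbered 0..14999 are spread over 15 subdirectories,
# so no single directory holds more than 1000 entries.
subdirs = sorted({subdir_for(n) for n in range(15000)})
print(len(subdirs), subdirs[0], subdirs[-1])  # 15 part00 part14
```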
Best results are obtained by creating a single volume per vice partition, as large as possible, with names like "disk1", "disk2" etc. and mounting them under similarly meaningless paths. This inhibits many beneficial AFS features, and provides a user experience as close as possible to the familiar NFS one. Think disks! Fashionable concepts like volumes are for propellerheads only.
Store your most valuable data in scratch space without backup
It's most exciting to store the only copy of your source code, your publication, or next week's conference presentation in /usr1/scratch on your desktop PC. DV appreciates the occasion to kill some idle time and/or delay other tasks while trying to recover your data.
Next best is AFS or NFS scratch space: Even though this is hosted on RAID5 arrays, we can expect an average loss of a few hundred gigabytes of such data per year due to multiple disk failures, defective controllers or cache memory, firmware bugs, and even mistakes made by administrators.
Both also offer maximum vulnerability to the not-so-uncommon rm -rf subdir/ * mistake (note the stray space before the asterisk).
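The stray space matters because the shell splits the command line into words before it expands wildcards. As a small illustration, the sketch below uses Python's `shlex.split`, which follows POSIX-style word splitting but does not expand globs, to show which arguments rm would receive in each case. In a real shell, the lone `*` is then expanded to every entry in the current directory.

```python
import shlex

# What was meant: one argument, a glob inside subdir/.
intended = shlex.split("rm -rf subdir/*")

# What was typed: subdir/ itself, plus a separate "*" argument.
mistyped = shlex.split("rm -rf subdir/ *")

print(intended)  # ['rm', '-rf', 'subdir/*']
print(mistyped)  # ['rm', '-rf', 'subdir/', '*']
```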
Remember: Redoing it all from scratch usually yields better results each time.
Use the best storage with backup for scratch data
It's good practice to store huge amounts of easily retrievable data in your home directory or group space with daily backup. ISO images of CDs or DVDs are a good example, another one is large software packages or tarballs you can download from the internet anytime again.
Building large applications on remote filesystems is not only a very effective way of wasting bandwidth and fileserver throughput, it also provides you with your well deserved coffee breaks. Don't run make clean afterwards, or you will waste less disk and tape space than possible.
Your AFS home directory is the perfect place to install your personal copy of your experiment's analysis framework. Simply request an increased quota if it runs full. Enterprise-class Fibre Channel RAID systems are quite cheap these days, and so are tapes, drives and robots. Hence, unlimited amounts of these resources are available.
Copy and move data around as much as possible
Never write data to the correct location in the first place. Instead, just stow it somewhere temporarily (preferably in a huge AFS volume), and start organizing your data later. Change directory structures often, copying or moving data from the old to the new location every time. This method easily triples the consumption of network bandwidth, server throughput, and disk wear. Avoid having related data grouped in specific volumes, because that would make it possible to just change the mountpoint to make it appear in the new location. Instead, have a few very large general purpose volumes and reorganize them regularly.
For reference, here are two examples of how (not) to do large-scale Monte Carlo production:
Good: Have two large volumes, one for production and one for later use. With each set of jobs, write the Monte Carlo files to the first volume. After all jobs have finished, move the files into the second one. It doesn't really matter whether the volumes are on the same or different servers.
Bad: For each set of jobs, create a new volume - or even a few new volumes on different RAID sets if many jobs will run in parallel. After all jobs have finished, simply change the mount points from the production path to the path where your analysis jobs expect their input files. If all input files must be found in a single directory, mount the volumes as subdirectories of this one and create symlinks to the files. Maybe create a number of read-only replicas of this (very small) volume.
The latter method is awfully simple, fast and efficient. Avoid. Refining the first method by introducing a third volume for accumulating the output of several sets of jobs before moving everything to the final destination is left as an exercise to the reader.
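Should you ever be tempted by the second method anyway, its symlink step might be sketched as follows. This assumes the volumes are already mounted as subdirectories of the input directory (mounting itself is not shown); the function name `link_files` and the choice of relative links are illustrative assumptions, not an established procedure.

```python
import os
from pathlib import Path


def link_files(input_dir: Path) -> int:
    """Symlink every file found in the immediate subdirectories of
    input_dir into input_dir itself. Returns the number of links created."""
    created = 0
    for sub in sorted(p for p in input_dir.iterdir() if p.is_dir()):
        for f in sorted(p for p in sub.iterdir() if p.is_file()):
            link = input_dir / f.name
            if link.exists() or link.is_symlink():
                continue  # never overwrite an existing entry
            # Relative target, so the links survive remounting input_dir
            # under a different path.
            os.symlink(os.path.join(sub.name, f.name), link)
            created += 1
    return created
```

No file data is copied at any point: only directory entries are created, which is exactly why this approach is so regrettably cheap.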
Hint: If you want to unpack tar archives into central storage, it's most inefficient to first copy them into the destination directory and then unpack them there. This way you achieve almost (alas, not quite, if compression is involved) three times the amount of I/O compared to simply unpacking them from wherever they are (temporarily) stored.
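The factor of three can be checked with a little arithmetic. The sketch below counts only the bytes read from or written to central storage, assuming the temporary location of the archive is not on central storage and ignoring tar header overhead; the function name and the sizes used are hypothetical.

```python
def central_io_bytes(archive: int, unpacked: int, copy_first: bool) -> int:
    """Bytes moved to or from central storage when unpacking an archive."""
    if copy_first:
        # Write the archive, read it back, then write the unpacked files.
        return archive + archive + unpacked
    # Unpack straight from the temporary location: only the files are written.
    return unpacked


# Uncompressed archive: archive size equals unpacked size.
print(central_io_bytes(100, 100, True) / central_io_bytes(100, 100, False))  # 3.0

# Compressed archive (assumed 80% of unpacked size): slightly less waste.
print(central_io_bytes(80, 100, True) / central_io_bytes(80, 100, False))  # 2.6
```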
Your colleagues will especially appreciate it when you do all this on the public login systems, instead of getting yourself an interactive session on a farm node with qrsh.