As I've worked this script into successively more sophisticated versions, the basic vision of the manner in which the records would get into the database has evolved as well. As I started, my basic assumption was that the script would have an insert routine that would, in essence, occupy the same role within the script as does the store subroutine. There are times, however, when the delay associated with connecting with a database server can be noticeable, and whatever fail-over action that should be structured into such to execute if the database connection is not available would likely represent an additional delay. In most contets, even those data-entry oriented, such delays would not represent a big deal. We are all accustomed to the occassional delay when a button is clicked, and we generally take such moments in stride. There are, however, a couple of considerations that accentuate the importance of even relatively minor additions of overhead. First, these pages are intended to represent an entry environment to be used to record an environment subject to bursts of activity. I've already I've already discussed measures intended to make the interface as responsive in that regard as possible; the ability to post partial pages during quiet periods is a sample of the kind of thing that goes a long way to smoothing the users's experience with the interface. It is important, however, to place as little additional load on the interface as possible to maximize its responsiveness. The second element that emphasizes the importance of keeping the user interface script as lean as possible is the fact that surplus equipment is not likely to have a massive amount of horsepower to devote to the execution of the script. The current Ralphzilla controller (the original bit the dust) is a P75 with 32 MB of ram and relatively slow hard drives. While it is quite likely that if I were to actually begin to use the system to record games I would be able to easily replace this machine with something in the P133-P200 range with a faster hard drive, my point is that the things you do to make the interface more responsive can become even more noticeable when the web server is on a faster machine. This perception rises from the fact that waiting for a response from a slow machine is part of the expected environment ... the same pause from a faster machine represents a noticeable hiccup and affects whatever rhythm the user is able to create with the entry environment.. Making a script as responsive as possible on a slower machine results in a script that, when run on a faster machine, the user experiences as smoooth ... and that tends to make them happy. <grin>
I identified many of the concerns regarding filenames I mentioned early in the previous section as I worked through implementation under a forked-process model. Obviously, it cannot be taken for granted that a forked process will have completed by the time you want it to be done, and if a running process is blocked to wait for the completion of another process that rationale for forking the process has been largely defeated. Therefore, the unique filename is required to avoid having to wait for a forked process to finish. As there is always a chance that a forked process will get out there and simply stall out, I was thinking through the steps to take to track the status of any given process and its associated file, and how to clean up orphaned processes, when I was struck by the blinding light of epiphany. Why was I doing this at all?
If you think about this for a minute you'll see what I mean. The script is writing a unique file to a common directory on each pass. Rather than spawning a process to handle an individual file, why not let the files pile up for a bit, the sweep through and handle everything present in the directory? Since this is a cluster environment I can start the script off on whichever machine I want and mosix can move it as it wishes. I've noticed that under normal loads mosix will frequently ignore slower machines in the cluster, and that one of those machines would be a good host for a script that simply performed the insert operation. So that's what I did.
The general structure of this script is as follows:
--Set up the environment for the script's execution.
--Create a connection to the directory holding the files, and read the contents of that directory into an array.
--Establish a connection to the database. If that connection fails, set a flag to that effect.
--Prepare the database insert statement. If that prepare statement fails, indicate that in a flag.
--For each element in the array of directory elements, open that element (file). If the database connection exists, execute the subroutine that parses each line and inserts the resultant record. If the connection does not exist, write the records to an external file that serves as a repository. Delete the source file.
|
#!/usr/bin/perl -w ##recs_insert.pl ##this script loops through the files of entered records and enters the records into the baseball database. if the connection to the baseball database ##cannot be made, the records are appended to a comma delimited ascii file BEGIN { # Set the DISPLAY variable to the name of the local machine # where the debugger window and web browser appear. $ENV{DISPLAY} = "mymachine:0" ; } ##use the database interface and postgresql driver use DBI; use DBD::Pg; DBI->trace(2,'/home/www/dbi.log'); |
The first part of the script should look pretty familiar to you by now. Note that the shebang line invokes perl with the warning pragma, but no mention of taint checking. The data we're going to be working with here has already been accepted from the user in a process that will ultimately be embellished with further validity checks. Although there are those who would argue that taint checking is beneficial in any script, I suspect that most would argue that if the data entered into the pages drawn by the entry script contained malicious code, it would likely have done its damage by this point. This is not to say that it is not relevant to perform additional checks if the environment in which the client is housed cannot be trusted, but that taint-checking what has already been checked represents redundancy to no real purpose. The client on which the script will be run, being interior to the cluster, is behind a firewall and should ultimately be invisible from outside of the cluster, and should represent as trusted an environment as is possible to envision. The very operation of the cluster depends on that. You'll note that after using the DBI module and the postgresql driver the scriipt sets up a DBI trace log in the same fashion as we used in an earlier script. This specific command was used to provide a resource for debugging efforts, but it may be worth leaving it operational after the script begins production use as a source of diagnostic information should difficulties arise with the database connection. For all practical purposes, however, once a database connection is up and running something has to change for the connection to fail, and this would not necessarily be reflected in the log beyond a "database server not found" message. The most likely cause of failure of a working database connection would have something to do with the failure of some element of the system infrastructure. The database server software might have been unloaded, for example, or a network card might have given out. In fact, I tests the fail-over process, described below, by simply unloading postgresql on Ralphzilla-raider and executing the insert script. While the trace log can be a valuable source of information for locating software problems, the primary difficulty with logs like this is that they can grow to substantial size, and finding the specific bits of information that brought about the failure you are tracking can be a chore in and of itself, especially if it started happening at some point before you caught it. (Allowing recovery from problems like that is, of course, the role of the fail-over procedure.) Soon, probably in the next iteration, I'll discuss trapping DBI error messages and what should be done with them when they are in hand. The trace log does, however, provide that basic capability. |
|
##the following scalars define the parameters for the connect statement my $host='ralphzilla-raider'; my $driver='Pg'; my $db='baseball'; my $user='baseball'; my $password='baseball'; ##the directory name stored in the following scalar is the ## source directory for the stored records. it is specified through the ##mfs mount point so the script can be run from any machine ## in the cluster. in the current incarnation of the system the ##script will be run as a superuser cron job, so there will be ## no problem with access rights. as this is happening entirely within ##the cluster, this is no more of a security problem than any ## of the intra-cluster communication used by mosix. my $source_dir='/mfs/1/home/www/save'; ##read the contents of the source directory into an array opendir(SOURCE_DIR,$source_dir); my @source = readdir(SOURCE_DIR); | After using the database modules a number of scalars are defined that represent the details used for specifying the connect statement for the database connection. Following these the script initializes the scalar holding the location of the directory in which the data files are located, $source_dir. Worthy of note here is that the directory specified, /mfs/1/home/www/save, is the directory location of the storage directory on Ralphzilla specified through the mfs mount point. (Remember that the mfs file system is mounted at /mfs in the /etc/fstab on each machine.) In other words, the script is reading the files from the space where the cgi script stored them. On one hand, this isn't such a big deal, we get stuff from network drives all the time. On the other hand, I still think it is very cool <grin>. It is important to note, however, that the script must be run with root privileges for this to work. Under mosix, root privileges on one machine transfer to other machines in the cluster, the rsh packages we installed some time ago enable this. While a non-root id on an interior machine could access the mfs mount point (remember the samba discussion?), it would not be able to read and delete the data files. While it would be possible to create a set of user rights that would allow that, it's far more appropriate to simply run the script with root privileges. After all, mosix-migrated process move between machines with root privileges, this is part of the system writ large, and if you have sufficient access to get logged on to an interior machine with root privileges to kick off the script and you are not sure of what you are doing, then concern about the script itself is probably the least of our worries, don't you think? The script then uses the open_dir() function to create a directory handle on the location held in the $source_dir scalar. Just as a filehandle creates a hook to a file that you can use to grab stuff from the file or throw stuff into it, a directory handle creates the same sort of thing at the directory level. Once the directory handle is in place I use the readdir() function to populate the array with the contents of that directory. |
|
my $conn=1; ##create the database handle my $dbh=DBI->connect("DBI:Pg:dbname=baseball;host=ralphzilla-raider","baseball") or $conn=0; ##prepare insert statement ... this statement will fail if there was a ## problem connecting to the database, but that error is trapped in a different ##fashion below ##if ($conn) { ##my $ins=$dbh->prepare("insert into participant_game (event_code,role_code,participant_id,event_result_code,event_text) values (?,?,?,?,?)") or $conn=0; ##} | At this point I initialize the scalar $conn to the value "1", followed by the creation of a database handle on the connection to the baseball database. As you can see, if the creation of the database handle fails, the value of the $conn scalar is set to "0". This is the basis of the fail-over procedure, as you'll se in a bit. The next step in my initial version of the script involved preparing a statement handle on an insert statement using placeholders, as we did in the very first cgi script that was part of this system. In this instance, however, the prepare() method simply did not work, refusing to create a statement handle. For a time I traced execution through the driver module before determining to perform a test by submitting a statement with the do() method, which worked just fine. Rather than spending further time tracking the bug through the module, I implemented this version around the do() method. While it involves greater overhead than to submit a do() many times than it does to execute a prepared statement handle many times, the manner of this script's execution renders that concern largely moot. I have, however, left the statements that would implement that method in the code, commented out, to make it easier to switch back and forth to facilitate further testing. You should feel free to try both methods. It may be that the versions of the libraries installed on your system will not exhibit this bug. Regardless, you can see that in the version using the prepare() method the statement handle is prepared if the $conn scalar is non-zero, which means that the database handle has to exist before the statement handle is prepared. Should the statement handle creation fail, the $conn scalar is set to "0". |
|
##the following scalar will hold the name of the specific file being processed my $recs_file; ##initialize the scalar to hold a line from the relevant file my $rec; ##initialize the scalars to hold a set of values my ($ec,$rc,$pc,$erc,$et); ##the following scalar will be used as a control variable. ## it's initial state (1), will signify that an exclusive lock was set on the ##file being processed. the rationale for trying to grab an exclusive ## lock on the file is very straightforward ... this is an easy way ##to make sure that this script doesn't grab a partially-written ## file before it is completed by the cgi script. since the cgi script ##also locks the file exclusively as it is writing to it, this script ## will not be able to get a lock on any file that is being written. ##as a result, that file will be left befind, to be swept up in the next ## run of the script. this should be an extraordinarily ##unusual event. my $lock=1; | The script now creates the scalars that will be key to the operation of the loop in which the contents of the source directory are processed. $recs_file will hold the name of the specific file being processed, $rec will successively hold each line of that file, and the set ($ec,$rc,$pc,$erc,$et) will hold the seperated set of values from a line. The final scalar, $lock, controls whether a file is processed, as you'll see shortly. |
|
##for each file in the source directory ... foreach $recs_file (@source) { if ($recs_file =~ /txt/) { ##open that file and assign it the filehandle RECS $recs_file = "/mfs/1/home/www/save/".$recs_file; open(RECS,"< $recs_file") or die "can't open $recs_file"; ##try to get an exclusive lock on it ... if not, store 0 to the $lock scalar flock(RECS,2) or $lock=0; ##if the value of $lock is still non-zero if ($lock) { if (! $conn) {&write_recs;} ##if the database handle is not defined, execute the fallback procedure elsif ($conn) {&insert_recs;} ##otherwise, execute the insert procedure } ##close the filehandle close RECS; ##for now, this script assumes that if the lock was ## obtained on the source file, the records in that file were either ##inserted into the database or written into the repository ## file, and the source file can be deleted if ($lock) {system("rm $recs_file");} } } | The main body of the script loops through each of the directory entries held in the @source array, associating each with the scalar $recs_file in turn. Within the loop, the first segment tests to make sure that the $recs_file scalar contains the string "txt", making sure that the element is not either the current directory (".") or parent directory ("..") designators. I could as easily have tested for the string "recs", of course. Regardless, if this test is passed the element qualifies as one of the files of interest, and the next statement expands the scalar to include the full path to the file as well as the file name. The scalar is then passed to the open() function to associate the file with the filehandle RECS, in read-only mode. After the filehandle is created, the script attempts to obtain an exclusive lock on the file with the flock() function. This is the other side of the rationale for obtaining an exclusive lock on the file as it is being written to by the cgi script. While the file creation operation is on-going within the cgi script, an attempt to obtain an exclusive lock on that file by this script will fail, and the value "0" will be assigned to the $lock scalar, resulting in the file being skipped and not deleted in the last line of the loop. This is how the system as a whole avoids the very small but non-zero probability that the insert script would process a file only partially-written and thus inadvertently drop some records. Since the file remains behind after the script's execution, it will be swept up and processed in the next execution of the script. The following if statement checks to make sure that the lock condition exists, and if it does processng of the file continues. This is straightforward ... if an active database connection does not exist, the write_recs subroutine is executed, which appends the lines in the file being processed to a repository file. If the connection does exist, the insert_recs subroutine is executed and the records inserted into the database. Following this if statement, the RECS filehandle is closed, and another if statement again checks the lock condition. and if it exists the file is deleted by calling the system rm command. If the file cannot be locked, of course, we want to leave it behind to be captured by the next run of the script. |
There are, of course, other ways to handle situations in which the cgi script has a file locked at the same time that the insert script is trying to read it. In the current incarnation, the cgi script writes at most eight records to a given file, and something like that happens faster than you can think about it, even writing to a relatively slow drive. It would not be a big deal to put this script into a short idle loop before trying to get the lock again. If the demands of the application were such that updates to the database should be visible as quickly as possible, I'd probably make that change. The resultant complication to the script would be relatively minor. At the same time, it is apprpriate to keep all of this in context. On the development Ralphzilla system, with a P75 client accessing both the storage directory and the database server through a 10 Mbps shared hub, the script will initialize itself, process four or five files, and return to the command line in two or three seconds. Most of that time is spent in initializing the interpreter and compiling the script; the execution time required to process twenty or thirty files would likely not be appreciably greater than the time to process four or five. The point to all that is that while the script currently executes once every five minutes on the Ralphzilla cluster, it could easily be run once a minute, which would mean that posted records would be into the database within two minutes at the outside, with most updates clustered somewhere within a minute. It might be possible to run the cron job every thirty seconds, but then we would be pushing up against the point at which allowance would have to be made for handling a script starting before the preceding one had finished, and realistically the application does not require that. (If it did, you'd probably want a heftier web server spawning insert processes with individual files.) Another reason for now waiting for a released lock on a file is that every now and then an operating system will think a lock exists on a file after it has been released. In the current execution framework that file stays in the original storage directory, but everything else gets stored in either the database or the repository file. If the script wsere waiting for a release, it would just appear to hang. While we could time that out at some point and move on, the benefits of whatever small decrease in processing time has been achieved would fast be out-weighed by increasing complexity of the code. Do what you wish. Just remember that the more complicated you make any given component, the greater the likelihood that it will break.