I got stuck with the task of extracting a large number of text strings from an old DOS program. At first I wasn't worried at all, but my optimism quickly faded when the .exe would not execute on any modern Windows installation. I followed a number of guides for compatibility mode but nothing I tried resulted in success. I decided to try another approach.
My next attempt was to run the program in Wine, but I gave up after many promising web searches ended in failure with the same issue. I then moved on to DOS emulators. I knew Linux had a number of them, and after a few tests with DOSBox and DOSEMU I settled on DOSEMU. I particularly liked DOSEMU because of its -dumb option, which seemed to indicate I could just capture the output. It turned out the DOS program was writing to the console, which appears to be nearly impossible to redirect without writing Visual Basic. So, another dead end, but I did have one more idea to try.
Could I run the program in DOSEMU, grab a screenshot and OCR the result? I initially tried fbcat, but it didn't quite match what I was after, so I quickly switched to scrot. I like scrot because it has an option to capture only the active window (-u). I put together a quick Perl script that forks into two processes: the parent repeatedly spawns DOSEMU while the child runs scrot and captures the screenshots.
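#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Expect;

my $count = 1000;

# Fork: the parent drives DOSEMU, the child takes the screenshots.
my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid) {
    # Parent: start the DOS program, give it 2 seconds, then kill it.
    my $times = $count;
    while ($times-- > 0) {
        my $dosemu = Expect->spawn('/usr/bin/dosemu', '"/home/doby/crusty_old.exe"');
        sleep 2;
        $dosemu->hard_close();
    }
} else {
    # Child: capture the focused window (-u) after a 1 second delay (-d 1),
    # then move the resulting PNG out of the way.
    my $times = $count;
    while ($times-- > 0) {
        `/usr/bin/scrot -u -d 1`;
        sleep 1;
        `/bin/mv /home/doby/*.png /home/doby/scrot/`;
    }
}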
Anyone who has ever written any code in their life should be screaming right now. Yes, this code has synchronization issues (the first run of 1000 captured 780 successful screens) and, no, I didn't have to use Expect. I considered Expect because I was going to interact with DOSEMU, but killing it ended up being easier so Expect just stuck around. Also, yes, the extra double quotes in the argument to dosemu are there for a reason.
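For anyone who wants to avoid the synchronization problem, here is a minimal sketch of a sequential version in shell. It is not what I actually ran. The idea is that a single process launches the emulator, waits, takes the screenshot and then kills the emulator, so the capture can never drift out of step with the program. The names capture_once and SHOT_CMD are made up for this sketch, and I have not tested whether DOSEMU behaves the same without the pty that Expect provided.

```shell
#!/bin/sh
# Sketch only: launch, wait, capture, kill, all in one process.
# capture_once and SHOT_CMD are hypothetical names, not part of the
# original script.

capture_once() {
    out=$1      # where to save the screenshot
    delay=$2    # seconds to give the program to draw its screen
    shift 2     # everything left is the command to launch

    "$@" &      # start the emulator in the background
    pid=$!

    sleep "$delay"

    # Capture the focused window; SHOT_CMD can swap in another tool.
    status=0
    ${SHOT_CMD:-scrot -u} "$out" || status=$?

    # Stop the emulator and reap it before the next iteration.
    kill "$pid" 2>/dev/null || true
    wait "$pid" 2>/dev/null || true

    return "$status"
}

# Example (assumes an X session where the DOSEMU window gets focus):
#   mkdir -p scrot
#   n=1
#   while [ "$n" -le 1000 ]; do
#       capture_once "scrot/$n.png" 2 /usr/bin/dosemu '"/home/doby/crusty_old.exe"'
#       n=$((n + 1))
#   done
```

Because each iteration finishes its capture before the next launch, a failed screenshot shows up as a non-zero return value instead of a silently missing image.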
Now I had 1000 images and needed to pull my precious information from them. I read a bit about open source OCR software and decided Tesseract was the way to go. I was wrong: it had about an 80% hit rate, and I did not want to proofread every single image. I then gave Ocrad a shot, but its hit rate was even worse. Finally, I tried GOCR and got 100% correct text. I checked a few more images against their OCR text just to be sure, then fired this off.
for i in *.png; do gocr "$i" | perl -nle 'if(/:\s+(\w+)$/){ print $1 }' >> out.txt; done
This command sends each PNG file to GOCR and pipes the result to a Perl one-liner, whose regular expression extracts the desired information and appends it to the output file.
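To make the regular expression concrete, here is a small sketch that feeds it a few made-up lines in place of real GOCR output. The wrapper name extract_value and the sample text are invented for illustration. The pattern only fires when a line ends with a colon, some whitespace and a single word.

```shell
#!/bin/sh
# extract_value is a hypothetical wrapper around the same Perl one-liner.
extract_value() {
    # -n loops over input lines, -l strips the newline and re-adds it on
    # print; the capture group is the word after the final colon.
    perl -nle 'if (/:\s+(\w+)$/) { print $1 }'
}

# Made-up lines standing in for GOCR output; only the first one matches.
printf '%s\n' 'Item name:   WIDGET' 'a line with no label' 'Code: A1 B2' | extract_value
```

The third sample line is skipped because two words follow the colon, so anything the OCR splits into several words would be dropped silently.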
I seem to have taken the hard way and tried the wrong tool for the job at every turn. But, in the end, I have all the data I need and know that when all else fails I can always rely on open source tools, a little ingenuity and Perl.