Converting Legacy Data to Drupal

In a project to develop a job search site, we needed to use official data from a government site which lists job vacancies. Using a combination of shell scripting, wget, perl and sql, we were able to load the data reliably into Drupal, including several taxonomies.

load-data: minsk.gov.data scheds.prl specialties.prl 

minsk.gov.data:
	sh get-minsk.gov.by > $@ 
	
scheds.prl:
	( echo "use utf8;"; \
	echo "sub get_sched { "; \
	echo 'my $$string=shift;'; \
	echo "%sched = ("; \
	echo 'select name, tid from term_data where vid = 3;' | \
		mysql --default-character-set=utf8 -u user -ppassword trud | \
	sed 1d |\
	awk -F\	 '{print "\"" $$1 "\", " $$2 ",";}'; \
	echo '); return($$sched{$$string}); } 1;' \
	) > $@

specialties.prl: /proc/cpuinfo
	( echo "use utf8;"; \
	echo "sub get_specialty { "; \
	echo 'my $$string=shift;'; \
	echo "%specialties = ("; \
	echo 'select name, tid from term_data where vid = 1;' | \
		mysql --default-character-set=utf8 -u user -ppassword trud | \
	sed 1d |\
	awk -F\	 '{print "\"" $$1 "\", " $$2 ",";}'; \
	echo '); return($$specialties{$$string}); } 1;' \
	) > $@


load-data:
	PERL_UNICODE=S perl load-db.prl < minsk.gov.data | mysql --default-character-set=utf8 -u user -ppassword trud

Makefile used for capturing data and loading it into Drupal

The makefile creates three files which are then used to load the data into Drupal. The minsk.gov.data is the main data file with one line per job opening. The files scheds.prl and specialties.prl are perl modules which are created on the fly to handle the schedule and specialty taxonomies.

The top level script queries the government site (which contains almost 80 pages of job listings) to get the link to each individual job. The URL's for the site were determined empirically, and found to remain unchanged over several years. The listings, in Russian, used an old Windows coding scheme that had to be converted to UTF.
The string which uniquely identified the URL for individual job listings ("find.job.more.pl") was also determined empirically.


for PAGE in `echo | awk 'BEGIN {for(i=1;i<=80;i++) print i;} {exit;}';`
do
# Get each page of job listings
wget -O - "http://minsk.gov.by/cgi-bin/find_job.pl?f_filter=yes&&page=$PAGE"  2> /dev/null | 
	iconv -c -f cp1251 -t utf8 | 
	grep find.job.more.pl |  
	tr '?' '\n' | 
	grep vks=  | 
	sed 's/.*vks=//' | 
	sed s/\'.*// |
# Run the get-1-job  script for every line of input
	awk '{system("sh get-1-job " $0);}'
done

get-minsk.gov.by - top level script to capture data into minsk.gov.data file


# Get the specific Job ID numbers from a page at minsk.gov.by

JOB_ID=$1
LINK=http://minsk.gov.by/cgi-bin/find_job_more.pl?k_vks=$JOB_ID

wget -O - $LINK 2> /dev/null  |
iconv -c -f cp1251 -t utf8 |
awk '
/<.b><.a><.div>/ {gsub(".*<b>", ""); gsub("<.*", ""); printf("%s|", $0); }
/<nobr>/ {gsub(".*<nobr>", ""); gsub("<.*", ""); printf("%s|", $0); }
/Должность/ {gsub(".*<b>", "");  gsub("<.*", ""); printf("%s|", $0); }
/Разряд/  {gsub(".*<b>", "");  gsub("<.*", ""); printf("%s|", $0); }
/Характер работы/ {gsub(".*<b>", "");  gsub("<.*", ""); printf("%s|", $0); }
/Образование/ {gsub(".*<b>", "");  gsub("<.*", ""); printf("%s|", $0); }
/Вид занятости/ {gsub(".*<b>", "");  gsub("<.*", ""); printf("%s|", $0); }
/Зарплата/ {gsub(".*<b>", "");  gsub("<.*", ""); printf("%s|", $0); }
/Количество вакантных мест/ {gsub(".*<b>", "");  gsub("<.*", ""); printf("%s|", $0); }
/Дополнительная информация/ {gsub(".*<b>", "");  gsub("<.*", ""); printf("%s|", $0); }
END {print "'$LINK'";}
'

get-1-job script


use utf8;
require "/media/disk/Trud/specialties.prl";
require "/media/disk/Trud/scheds.prl";
require "/media/disk/Trud/educ.prl";

$date = `date +%s`; chop($date);
$vid = `sh ./get-nid`; chop($vid);

$vid = $vid + 0;

while($REC = ) {
$REC =~ s/'//g; chop($REC);
@vacancy = split(/\|/, $REC);

if ($vacancy[7] eq '') {next;}
$vid++;

$specialty = $vacancy[2]; $specialty =~ tr/А-Я/а-я/; $specialty = ucfirst($specialty);

$sched = $vacancy[6]; $sched =~ tr/А-Я/а-я/; $sched = ucfirst($sched);
$educ = $vacancy[5];

print "insert into content_type_vacancy values (" . $vid . ", " . $vid . ", " .  ucfirst($vacancy[7]) . ", '" . ucfirst($vacancy[1]) . "', '"  . ucfirst($vacancy[0]) . "', 1, '" . $vacancy[10] . "', '', '');\n" ;
print "insert into node_revisions values (" . $vid . ", " .  $vid . ", 1, '" .  ucfirst($vacancy[0]) . "', '" . ucfirst($vacancy[9]) . "', '" .  $vacancy[9] . "', '', " . $date  . ", 1);\n";

print "insert into node values (" . $vid . ", " .  $vid . ", 'vacancy', 'ru', '" .  $specialty . "',1, 1, " . $date . ", " . $date . ", 2, 0, 0, 0, 0, 0);\n";


print "insert into term_node values (" . $vid . ", " .  $vid . ", " . get_specialty($specialty) . ");\n";
print "insert into term_node values (" . $vid . ", " .  $vid . ", " . get_sched($sched) . ");\n";
print "insert into term_node values (" . $vid . ", " .  $vid . ", " . get_educ($educ) . ");\n";

}

load-db.prl script

Summary: 
In a project to develop a job search site, we needed to use official data from a government site which lists job vacancies. Using a combination of shell scripting, wget, perl and sql, we were able to load the data reliably into Drupal, including several taxonomies.