Converting Legacy Data to Drupal
In a project to develop a job search site, we needed to use official data from a government site which lists job vacancies. Using a combination of shell scripting, wget, perl and sql, we were able to load the data reliably into Drupal, including several taxonomies.
load-data: minsk.gov.data scheds.prl specialties.prl
minsk.gov.data:
sh get-minsk.gov.by > $@
scheds.prl:
( echo "use utf8;"; \
echo "sub get_sched { "; \
echo 'my $$string=shift;'; \
echo "%sched = ("; \
echo 'select name, tid from term_data where vid = 3;' | \
mysql --default-character-set=utf8 -u user -ppassword trud | \
sed 1d |\
awk -F\ '{print "\"" $$1 "\", " $$2 ",";}'; \
echo '); return($$sched{$$string}); } 1;' \
) > $@
specialties.prl: /proc/cpuinfo
( echo "use utf8;"; \
echo "sub get_specialty { "; \
echo 'my $$string=shift;'; \
echo "%specialties = ("; \
echo 'select name, tid from term_data where vid = 1;' | \
mysql --default-character-set=utf8 -u user -ppassword trud | \
sed 1d |\
awk -F\ '{print "\"" $$1 "\", " $$2 ",";}'; \
echo '); return($$specialties{$$string}); } 1;' \
) > $@
load-data:
PERL_UNICODE=S perl load-db.prl < minsk.gov.data | mysql --default-character-set=utf8 -u user -ppassword trudMakefile used for capturing data and loading it into Drupal
The makefile creates three files which are then used to load the data into Drupal. The minsk.gov.data is the main data file with one line per job opening. The files scheds.prl and specialties.prl are perl modules which are created on the fly to handle the schedule and specialty taxonomies.
The top level script queries the government site (which contains almost 80 pages of job listings) to get the link to each individual job. The URL's for the site were determined empirically, and found to remain unchanged over several years. The listings, in Russian, used an old Windows coding scheme that had to be converted to UTF.
The string which uniquely identified the URL for individual job listings ("find.job.more.pl") was also determined empirically.
for PAGE in `echo | awk 'BEGIN {for(i=1;i<=80;i++) print i;} {exit;}';`
do
# Get each page of job listings
wget -O - "http://minsk.gov.by/cgi-bin/find_job.pl?f_filter=yes&&page=$PAGE" 2> /dev/null |
iconv -c -f cp1251 -t utf8 |
grep find.job.more.pl |
tr '?' '\n' |
grep vks= |
sed 's/.*vks=//' |
sed s/\'.*// |
# Run the get-1-job script for every line of input
awk '{system("sh get-1-job " $0);}'
doneget-minsk.gov.by - top level script to capture data into minsk.gov.data file
# Get the specific Job ID numbers from a page at minsk.gov.by
JOB_ID=$1
LINK=http://minsk.gov.by/cgi-bin/find_job_more.pl?k_vks=$JOB_ID
wget -O - $LINK 2> /dev/null |
iconv -c -f cp1251 -t utf8 |
awk '
/<.b><.a><.div>/ {gsub(".*<b>", ""); gsub("<.*", ""); printf("%s|", $0); }
/<nobr>/ {gsub(".*<nobr>", ""); gsub("<.*", ""); printf("%s|", $0); }
/Должность/ {gsub(".*<b>", ""); gsub("<.*", ""); printf("%s|", $0); }
/Разряд/ {gsub(".*<b>", ""); gsub("<.*", ""); printf("%s|", $0); }
/Характер работы/ {gsub(".*<b>", ""); gsub("<.*", ""); printf("%s|", $0); }
/Образование/ {gsub(".*<b>", ""); gsub("<.*", ""); printf("%s|", $0); }
/Вид занятости/ {gsub(".*<b>", ""); gsub("<.*", ""); printf("%s|", $0); }
/Зарплата/ {gsub(".*<b>", ""); gsub("<.*", ""); printf("%s|", $0); }
/Количество вакантных мест/ {gsub(".*<b>", ""); gsub("<.*", ""); printf("%s|", $0); }
/Дополнительная информация/ {gsub(".*<b>", ""); gsub("<.*", ""); printf("%s|", $0); }
END {print "'$LINK'";}
'get-1-job script
use utf8; require "/media/disk/Trud/specialties.prl"; require "/media/disk/Trud/scheds.prl"; require "/media/disk/Trud/educ.prl"; $date = `date +%s`; chop($date); $vid = `sh ./get-nid`; chop($vid); $vid = $vid + 0; while($REC =) { $REC =~ s/'//g; chop($REC); @vacancy = split(/\|/, $REC); if ($vacancy[7] eq '') {next;} $vid++; $specialty = $vacancy[2]; $specialty =~ tr/А-Я/а-я/; $specialty = ucfirst($specialty); $sched = $vacancy[6]; $sched =~ tr/А-Я/а-я/; $sched = ucfirst($sched); $educ = $vacancy[5]; print "insert into content_type_vacancy values (" . $vid . ", " . $vid . ", " . ucfirst($vacancy[7]) . ", '" . ucfirst($vacancy[1]) . "', '" . ucfirst($vacancy[0]) . "', 1, '" . $vacancy[10] . "', '', '');\n" ; print "insert into node_revisions values (" . $vid . ", " . $vid . ", 1, '" . ucfirst($vacancy[0]) . "', '" . ucfirst($vacancy[9]) . "', '" . $vacancy[9] . "', '', " . $date . ", 1);\n"; print "insert into node values (" . $vid . ", " . $vid . ", 'vacancy', 'ru', '" . $specialty . "',1, 1, " . $date . ", " . $date . ", 2, 0, 0, 0, 0, 0);\n"; print "insert into term_node values (" . $vid . ", " . $vid . ", " . get_specialty($specialty) . ");\n"; print "insert into term_node values (" . $vid . ", " . $vid . ", " . get_sched($sched) . ");\n"; print "insert into term_node values (" . $vid . ", " . $vid . ", " . get_educ($educ) . ");\n"; }
load-db.prl script
- Article Tags:

Drupal
OpenStreetMap
Ubuntu
PostgreSQL