A mass validation shellscript
by Pascal Opitz on December 14 2008, 19:45
I was looking for a CLI script that validates a whole site for me, but I couldn't find one that would work without installation issues. So I hacked together a shell script that does the job: it downloads the whole site and then runs each file through the validator.
Prerequisites
The shell script uses curl and wget (wget for OS X, in my case), plus the "SOAP API" of the W3C validator.
I am putting "SOAP API" in quotation marks because it doesn't really support SOAP calls; it merely wraps the response in a SOAP envelope. That's why I am using curl to post the files.
For this example I installed Validator S.A.C. and followed the instructions to get it running as a local service. Of course, if you are on Linux, you can install it from source or as a package. Alternatively you can change the script to use validator.w3.org/check instead of localhost/w3c-validator/check, but that might run pretty slowly and create a lot of traffic.
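To see why the "SOAP API" is easy to consume without a SOAP client, here is a minimal sketch of pulling the error count out of the wrapped response with plain sed. The envelope fragment below is an assumption, reduced to the m:errorcount element the validator is known to return; a real response carries a lot more markup around it.

```shell
#!/bin/sh
# Sample fragment of the validator's SOAP-wrapped response
# (assumed shape, reduced to the element we care about)
response='<env:Envelope xmlns:m="http://www.w3.org/2005/10/markup-validator"><m:errorcount>3</m:errorcount></env:Envelope>'

# Extract the number between the errorcount tags
errorcount=`echo "$response" | sed 's/.*<m:errorcount>\([0-9]*\)<\/m:errorcount>.*/\1/'`
echo "errors: $errorcount"
```

Since the envelope is just text, grep and sed are all you need; no SOAP library required.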
The script
Also a word of warning: the script deletes and recreates a temp directory and a log.txt file on every run. I am in no way responsible for any of your stuff getting deleted by running this.
But hey: feel free to alter this to fit your needs (and maybe post improvements in the comments, for example for my sloppy way of detecting whether a file is HTML).
#!/bin/sh
#
# Script to validate files in a directory
#

# Crude check: treat any file containing an <html> tag as HTML
is_html() {
    htmlstart=`grep '<html' "$1"`
    if [ "$htmlstart" != "" ];
    then
        echo "1";
    fi
}

# Post a file to the local validator and return the SOAP response
validate_file() {
    curl -s -F uploaded_file=@"$1" -F output=soap12 localhost/w3c-validator/check
}

# Mirror the site into ./temp
download_site() {
    cd temp || exit 1
    echo 'downloading files ...'
    wget -r -q -k -x -E -l 0 "$1"
    echo 'done downloading files'
    cd ..
}

# Start from a clean slate: remove old results, recreate temp and log
setup() {
    rm -f log.txt
    rm -Rf temp
    mkdir temp
    touch log.txt
}

run_validation() {
    for file in `find "$1" -type f`;
    do
        htmltrue=`is_html "$file"`
        if [ "$htmltrue" = "1" ];
        then
            echo "request validation: $file"
            rpc=`validate_file "$file"`
            echo "checking response: $file"
            noerror=`echo "$rpc" | grep '<m:errorcount>0</m:errorcount>'`
            if [ "$noerror" = "" ];
            then
                echo "Error in file $file"
                echo "----------------" >> log.txt
                printf "Error in file %s\n" "$file" >> log.txt
                echo "$rpc" >> log.txt
                printf "\n" >> log.txt
                echo "----------------" >> log.txt
            fi
        fi
    done;
    has_errors=`grep Error ./log.txt`
    if [ "$has_errors" = "" ];
    then
        echo "no errors found" >> log.txt
    fi
}

setup
download_site "$1"
run_validation ./temp/
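Since I invited improvements for the sloppy HTML detection above, here is one possible sketch of a sturdier is_html. This is not the version the script uses; it checks the file extension first and falls back to a case-insensitive grep for the tag.

```shell
#!/bin/sh
# Sketch of a sturdier is_html: trust common extensions first,
# then fall back to a case-insensitive, quiet grep for an <html> tag.
is_html() {
    case "$1" in
        *.html|*.htm|*.xhtml) echo "1"; return ;;
    esac
    if grep -qi '<html' "$1" 2>/dev/null;
    then
        echo "1"
    fi
}
```

This also catches uppercase `<HTML>` tags, which the plain `grep '<html'` in the script above would miss.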
Update
I slightly modified it so I get better error messages. I use xsltproc for parsing the SOAP envelope returned by the validator. Here is the updated script:
#!/bin/sh
#
# Script to validate files in a directory
#

# Crude check: treat any file containing an <html> tag as HTML
is_html() {
    htmlstart=`grep '<html' "$1"`
    if [ "$htmlstart" != "" ];
    then
        echo "1";
    fi
}

# Post a file to the local validator and return the SOAP response
validate_file() {
    curl -s -F uploaded_file=@"$1" -F output=soap12 localhost/w3c-validator/check
}

# Mirror the site into ./temp
download_site() {
    cd temp || exit 1
    echo 'downloading files ...'
    wget -r -q -k -x -E -l 0 "$1"
    echo 'done downloading files'
    cd ..
}

# Start from a clean slate: remove old results, recreate temp and log
setup() {
    rm -f log.txt
    rm -Rf temp
    mkdir temp
    touch log.txt
}

run_validation() {
    for file in `find "$1" -type f`;
    do
        htmltrue=`is_html "$file"`
        if [ "$htmltrue" = "1" ];
        then
            echo "request validation: $file"
            rpc=`validate_file "$file"`
            echo "checking response: $file"
            noerror=`echo "$rpc" | grep '<m:errorcount>0</m:errorcount>'`
            if [ "$noerror" = "" ];
            then
                # collapse double slashes in the path for nicer output
                filelocation=`echo "$file" | sed "s/\/\//\//g"`
                echo "$rpc" > temp_error.xml
                xsltproc --stringparam location "$filelocation" error_template.xsl temp_error.xml >> log.txt
                rm temp_error.xml
                echo "Error in file $file"
            fi
        fi
    done;
    has_errors=`grep Error ./log.txt`
    if [ "$has_errors" = "" ];
    then
        echo "no errors found" >> log.txt
    fi
}

setup
download_site "$1"
run_validation ./temp/
As you can see, we also need a file called error_template.xsl. Here is an example:
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/TR/xhtml1/strict"
    xmlns:m="http://www.w3.org/2005/10/markup-validator"
    xmlns:env="http://www.w3.org/2003/05/soap-envelope"
>
    <xsl:output
        method="text"
        omit-xml-declaration="yes"
    />
    <xsl:param name="location" />
    <xsl:template match="/">
        <xsl:call-template name="divider" />
        <xsl:value-of select="//m:errorcount" />
        <xsl:text> Errors in </xsl:text>
        <xsl:value-of select="$location" />
        <xsl:call-template name="lb" />
        <xsl:apply-templates select="//m:error" />
    </xsl:template>
    <xsl:template match="m:error">
        <xsl:text> Line </xsl:text>
        <xsl:value-of select="m:line" />
        <xsl:text>, Col </xsl:text>
        <xsl:value-of select="m:col" />
        <xsl:text>:</xsl:text>
        <xsl:call-template name="lb" />
        <xsl:value-of select="m:message" />
        <xsl:call-template name="lb" />
    </xsl:template>
    <xsl:template name="lb"><xsl:text>
</xsl:text></xsl:template>
    <xsl:template name="divider">
        <xsl:text>--------------</xsl:text><xsl:call-template name="lb" />
    </xsl:template>
</xsl:stylesheet>
I think this would be easily adaptable to produce XML or HTML files. I'd like to figure out which URL wget downloaded each file from, so I could insert that into the output, as a hyperlink for example. But apart from that, I think it works pretty neatly.
Comments
With Validator-SAC, you can also run the validator directly from the command line without setting up a server. The simplest form is just to call it with an http or file URL:
A full query string can also be used with added parameters:
The weblet script outputs a CGI response, so the first few lines are CGI headers:
by Chuck Houpt on December 15 2008, 01:35 #
Thanks Chuck, very helpful insight. Also thanks for creating Validator-SAC in the first place. A great tool to have for us Mac dummies.
by Pascal Opitz on December 15 2008, 10:51 #
By the way, various people on MSN told me I should have avoided downloading the whole site with wget and instead passed the URLs directly, taken from a sitemap. Good idea, and maybe worth implementing in the future, perhaps with a sitemap as an optional parameter?
Also, my experience with wget is limited, but there is a spider mode. Maybe it's worth just taking the URLs that the spider finds instead of downloading everything to a temp folder?
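The spider idea could be sketched like this: run wget with `--spider` and scrape the visited URLs from its log instead of saving files. The invocation would be something like `wget -r --spider -nv http://example.com 2>&1`; the log format below is an assumption (it varies between wget versions), so the snippet parses a captured sample to keep the pipeline reproducible.

```shell
#!/bin/sh
# Captured sample of wget spider-mode log output (assumed format;
# check against your wget version before relying on it)
log='2008-12-15 11:01:02 URL: http://example.com/ 200 OK
2008-12-15 11:01:03 URL: http://example.com/about.html 200 OK'

# Pull out just the URL from each log line
urls=`echo "$log" | sed -n 's/.*URL: \([^ ]*\).*/\1/p'`
echo "$urls"
```

Each URL could then be posted to the validator with the `uri` parameter instead of `uploaded_file`, skipping the temp folder entirely.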
Comments welcome!
by Pascal Opitz on December 15 2008, 11:01 #
by Chuck Houpt on December 20 2008, 14:32 #
by iñigo medina on August 10 2009, 09:47 #
by Jérôme Jaglale on October 28 2009, 21:06 #
by Connie Chung on September 4 2009, 02:46 #