2015-10-11

Pyslet goes https

After months of being too busy to sort this out I have finally moved the Pyslet website to SSL. This is a quick post to explain how I've done this.

Firstly, I've wanted to do this for a while because I want to use the above website to host a web version of the QTI migration tool, but encouraging users to upload their precious assessment materials to a plain old HTTP URL should (hopefully would) have proved a challenge. I saw an advert for free SSL certificates for open source projects from GlobalSign so, in a rush of enthusiasm, I applied and got my certificate. There's a checklist of rules that the site must comply with to be eligible (see previous link), which I'll summarise here:

  1. OSI license: Pyslet uses the BSD 3-Clause License: check!
  2. Actively maintained: well, Pyslet is a spare-time activity but I'm going to give myself a qualified tick here.
  3. Not used for commercial purposes: the Pyslet website is just a way of hosting demos of Pyslet in action, no adverts, no 'monetization' of any kind: check!
  4. Must get an A rating with GlobalSign's SSL Checker...

That last one is not quite as easy as you might think. Here's what I did to make it happen. I'll assume you have already done some openssl magic, applied for and received your crt file.

  • Download the intermediate certificate chain file from GlobalSign here. The default one for SHA-256 Orders was the correct one for me.
  • Put the following files into /var/www/ssl (your location may vary):

    www.pyslet.org.key
    www.pyslet.org.crt
    globalsign-intermediate.crt

    The first one is the key I originally created with:

    openssl genrsa -des3 -out www.pyslet.org.key.encrypted 2048
    openssl req -new -key www.pyslet.org.key.encrypted -out www.pyslet.org.csr
    openssl rsa -in www.pyslet.org.key.encrypted -out www.pyslet.org.key

    The second file is the certificate I got from GlobalSign themselves. The third one is the intermediate certificate I downloaded above.

  • Set permissions (as root):
    chown -R root:root /var/www/ssl/*.key
    chmod 700 /var/www/ssl/*.key
  • Add a virtual host to Apache's httpd.conf (suitable for Apache/2.2.31):
    Listen 443
    
    <VirtualHost *:443>
        ServerName www.pyslet.org
        SSLEngine on
        
        SSLCertificateFile /var/www/ssl/www.pyslet.org.crt
        SSLCertificateKeyFile /var/www/ssl/www.pyslet.org.key
        SSLCertificateChainFile /var/www/ssl/globalsign-intermediate.crt
        
        SSLCompression off
        SSLProtocol all -SSLv3 -SSLv2
        SSLCipherSuite AES128+EECDH:AES128+EDH    
        SSLHonorCipherOrder on
        
    #   Rest of configuration goes here....
    
    </VirtualHost>

This is a relatively simple configuration designed to get an A rating while not worrying too much about compatibility with really old browsers.
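A quick way to sanity-check the result from the client side is to look at what a Python client actually negotiates with the server. This is only a rough sketch and no substitute for the SSL Checker: the function names are made up for illustration and the 'reasonable' test is a deliberate simplification of the cipher settings above.

```python
import socket
import ssl


def negotiated_parameters(host, port=443, timeout=10.0):
    """Connect to host:port and return (protocol, cipher name, secret bits)."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cipher_name, _protocol, bits = tls.cipher()
            return tls.version(), cipher_name, bits


def looks_reasonable(protocol, cipher_name):
    """Simplified check: no SSLv2/SSLv3 and an ephemeral (EC)DH key exchange."""
    if protocol in ("SSLv2", "SSLv3"):
        return False
    return cipher_name.startswith(("ECDHE-", "DHE-"))

# Example (needs network access):
# protocol, cipher_name, bits = negotiated_parameters("www.pyslet.org")
# print(protocol, cipher_name, bits, looks_reasonable(protocol, cipher_name))
```

The check only looks at the prefix of the OpenSSL cipher name, which is enough to tell a forward-secret suite from a plain RSA key exchange.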

2015-02-16

Accessing the ESA Sentinel Mission Data with Python and OData

I've had a couple of enquiries now about how to access the OData feeds on the ESA Sentinel mission science data hub. Sentinel 1 is the first of a new group of satellites in the Copernicus programme to monitor the Earth. That's about all I know I'm afraid. This data is not pretty desktop pictures (though doubtless there are some pretty pictures buried in there somewhere) but raw scientific data from instruments currently orbiting the Earth.

The source code described here is available in the samples directory on GitHub. You must be using the latest Pyslet from master, which enables the metadata override technique this script relies on.


The data hub advertises access to the data through OData (version 1) but my Python library, Pyslet, was not able to access the feeds properly: hence the enquiries.

Turns out that the data feeds use a concept called containment in OData. The model of OData is one of entity sets (think SQL tables) with relations between them modelled by navigation properties. There's one particular use case that doesn't work very well in this scenario but seems popular. Given an entity (think table row or record) people want to add arbitrary key-value pairs. The ESA's data model does this by creating 'sub-tables' which define collections of attributes that hang off of each entity. The attribute name is the key in these collections. This doesn't really work in OData v1 (or v2) because these attribute values should still be entities in their own right and therefore they need a unique key and an entity set definition to contain them.

This isn't the only schema I've seen that attempts to do something like this either: SAP have published a similar schema, suggesting that some early Java tools exposed OData this way.

The upshot is that you get nasty errors when you try and load these services with Pyslet. It complains of a rather obscure condition concerning (possibly multiple) unbound principals. When I wrote that error message, I didn't expect anyone to ever actually see it.

There's a proper way to do containment in earlier versions of OData, described in Containment is Coming with OData v4 which explains how to use composite keys. As the name of the article suggests though, this is written with hindsight after a better solution has been found for this use case in OData v4.
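To make the composite-key idea concrete, here is a toy model in plain Python with no OData library involved. The attributes live in a single 'entity set' of their own and each one is keyed by the owning product's id together with the attribute name. The function names and sample values are invented for illustration.

```python
# One entity set holds all attributes.  The key is composite: the
# parent's key plus the attribute name, so every entry is unique.
attributes = {}


def set_attribute(product_id, name, value):
    attributes[(product_id, name)] = value


def attributes_of(product_id):
    """Follow the 'navigation property' from a product to its attributes."""
    return {name: value for (pid, name), value in attributes.items()
            if pid == product_id}


set_attribute("product-1", "Format", "SAFE")
set_attribute("product-1", "Size", "150 MB")
set_attribute("product-2", "Format", "SAFE")
```

Two products can both have a "Format" attribute without clashing because the parent id is part of the key.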

The fix for the ESA data feed is to download and edit a copy of the advertised metadata to get around the errors reported by Pyslet and then to initialise your OData client using this modified schema instead. It isn't a perfect fix, as far as Pyslet knows those attributes really are unique and do reside in their own entity set but it doesn't really matter for the purposes of using the OData client. You can navigate and formulate queries without tripping over data inconsistencies.
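Before editing a saved copy of the metadata it helps to see which associations it declares and with what multiplicities. The sketch below uses only the standard library and is not the edit applied to the ESA schema. The sample XML is made up: the association name is borrowed from the log output further down, but the roles, types and multiplicities are assumptions.

```python
import xml.etree.ElementTree as ET

# Invented sample, not the real ESA metadata.
SAMPLE = """<Schema xmlns="http://schemas.microsoft.com/ado/2006/04/edm" Namespace="Demo">
  <Association Name="Attribute_Product">
    <End Role="Product" Type="Demo.Product" Multiplicity="1"/>
    <End Role="Attribute" Type="Demo.Attribute" Multiplicity="*"/>
  </Association>
</Schema>"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def association_ends(xml_text):
    """Return {association name: [(role, multiplicity), ...]}."""
    result = {}
    for element in ET.fromstring(xml_text).iter():
        if local_name(element.tag) == "Association":
            result[element.get("Name")] = [
                (end.get("Role"), end.get("Multiplicity"))
                for end in element
                if local_name(end.tag) == "End"]
    return result
```

Matching on the local tag name keeps the function independent of which EDM namespace version the service happens to advertise.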

I've written a little script that I've added to Pyslet's sample code directory to illustrate the technique, along with a fixed up metadata file. The result is a little UNIX-style utility for downloading products from the ESA data hub:

$ ./download.py --help
Usage: download.py [options]

Options:
  -h, --help            show this help message and exit
  -u USER, --user=USER  user name for basic auth credentials
  -p PASSWORD, --password=PASSWORD
                        password for basic auth credentials
  -v                    increase verbosity of output up to 3x
  -c, --cert            download and trust site certificate

The data is available via https and requires a user name and password (you'll have to register on the data hub site but it's free to do so). To make it easier to set up the trust aspect I've added a -c option to download the site certificate and store it. If you don't have the site certificate you'll get an error like this:

ERROR:root:scihub.esa.int: closing connection after error failed to build secure connection to scihub.esa.int

Subsequent downloads verify that the site certificate hasn't changed: a bit like the way ssh offers to store a fingerprint the first time you connect to a remote host. Only use the -c option if you trust the network you are running on (you can use Firefox or some other 'trusted' browser to download the certificate too of course).
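The trust-on-first-use idea can be sketched with the standard library alone. This is not how Pyslet implements it, just an illustration of the principle: store the certificate once, then compare fingerprints on later runs. The function names and the `trust_first_use` flag are hypothetical.

```python
import hashlib
import os
import ssl


def fingerprint(pem_text):
    """SHA-256 fingerprint of a PEM-encoded certificate, as hex."""
    return hashlib.sha256(ssl.PEM_cert_to_DER_cert(pem_text)).hexdigest()


def check_pinned(presented_pem, store_path, trust_first_use=False):
    """Compare a presented certificate with the stored copy.

    Returns True if it matches.  If nothing is stored yet, the
    certificate is saved (and trusted) only when trust_first_use is set.
    """
    if not os.path.exists(store_path):
        if trust_first_use:
            with open(store_path, "w") as f:
                f.write(presented_pem)
        return trust_first_use
    with open(store_path) as f:
        return fingerprint(f.read()) == fingerprint(presented_pem)

# Fetching the certificate needs network access, for example:
# pem = ssl.get_server_certificate(("scihub.esa.int", 443))
# ok = check_pinned(pem, "scihub.esa.int.crt", trust_first_use=True)
```

As with ssh, the first fetch is only as trustworthy as the network it was made on.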

The password is optional: if you don't provide it you'll be prompted to enter it using Python's getpass function for privacy.
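For readers who want to build something similar, the option handling might look roughly like this. It is a reconstruction from the help text above using optparse and getpass, not a copy of the actual script, and the function names are assumptions.

```python
import getpass
from optparse import OptionParser


def build_parser():
    """Build a parser with the same options as the help text above."""
    parser = OptionParser(usage="%prog [options]")
    parser.add_option("-u", "--user", dest="user",
                      help="user name for basic auth credentials")
    parser.add_option("-p", "--password", dest="password",
                      help="password for basic auth credentials")
    parser.add_option("-v", action="count", dest="verbose", default=0,
                      help="increase verbosity of output up to 3x")
    parser.add_option("-c", "--cert", action="store_true", dest="cert",
                      default=False,
                      help="download and trust site certificate")
    return parser


def get_credentials(options):
    """Return (user, password), prompting only if the password is missing."""
    password = options.password
    if options.user is not None and password is None:
        # getpass reads from the terminal without echoing the input
        password = getpass.getpass()
    return options.user, password
```

Anything left over after option parsing (the second value returned by `parse_args`) is treated as a list of product identifiers.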

You pass the product identifiers as command line arguments. Here is an example of a successful first-time run:

$ ./download.py -c -u swl10 8bf64ff9-f310-4027-b31f-8e95dd9bbf82
Password: 
ERROR:root:Entity set Attributes has more than one unbound principal
dropping mutliplicity of Attribute_Node to 0..1.  Continuing
ERROR:root:Entity set Attributes has more than one unbound principal
dropping mutliplicity of Attribute_Product to 0..1.  Continuing
S1A_EW_GRDM_1SDH_20150207T084156_20150207T084218_004515_0058AE_3051 150751068

After running this command I had a scihub.esa.int.crt file (from the -c option) and a 150MB zip file downloaded to the current directory.

If you run with -vv to provide a bit more information you can see the OData magic in operation:

$ ./download.py -vv -u swl10 8bf64ff9-f310-4027-b31f-8e95dd9bbf82
Password: 
INFO:root:Sending request to scihub.esa.int
INFO:root:GET /dhus/odata/v1/ HTTP/1.1
INFO:root:Connected to scihub.esa.int with DHE-RSA-AES256-SHA, TLSv1/SSLv3, key length 256
INFO:root:Finished Response, status 401
INFO:root:Resending request to: https://scihub.esa.int/dhus/odata/v1/
INFO:root:Sending request to scihub.esa.int
INFO:root:GET /dhus/odata/v1/ HTTP/1.1
INFO:root:Connected to scihub.esa.int with DHE-RSA-AES256-SHA, TLSv1/SSLv3, key length 256
INFO:root:Finished Response, status 200
WARNING:root:Entity set Attributes has an unbound principal: Nodes
WARNING:root:Entity set Attributes has an unbound principal: Products
ERROR:root:Entity set Attributes has more than one unbound principal
dropping multiplicity of Attribute_Node to 0..1.  Continuing
ERROR:root:Entity set Attributes has more than one unbound principal
dropping multiplicity of Attribute_Product to 0..1.  Continuing
INFO:root:Sending request to scihub.esa.int
INFO:root:GET /dhus/odata/v1/Products('8bf64ff9-f310-4027-b31f-8e95dd9bbf82') HTTP/1.1
INFO:root:Connected to scihub.esa.int with DHE-RSA-AES256-SHA, TLSv1/SSLv3, key length 256
INFO:root:Finished Response, status 200
S1A_EW_GRDM_1SDH_20150207T084156_20150207T084218_004515_0058AE_3051 150751068
INFO:root:Sending request to scihub.esa.int
INFO:root:GET /dhus/odata/v1/Products('8bf64ff9-f310-4027-b31f-8e95dd9bbf82')/$value HTTP/1.1
INFO:root:Connected to scihub.esa.int with DHE-RSA-AES256-SHA, TLSv1/SSLv3, key length 256
INFO:root:Finished Response, status 200

As you can see, the fixed up metadata still generates error messages but these are no longer critical and the client is able to interact with the service.
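The two product requests in the log follow the usual OData pattern: the entity is addressed by its key in parentheses and the media stream by appending /$value. A small sketch of how those URLs are formed, using the service root seen in the log (the helper name is made up):

```python
from urllib.parse import quote

SERVICE_ROOT = "https://scihub.esa.int/dhus/odata/v1/"


def product_urls(product_id, service_root=SERVICE_ROOT):
    """Return (entity URL, media stream URL) for a product id."""
    # OData string keys are wrapped in single quotes; a quote inside
    # the value is escaped by doubling it.
    literal = "'%s'" % product_id.replace("'", "''")
    entity = "%sProducts(%s)" % (service_root, quote(literal, safe="'"))
    return entity, entity + "/$value"
```

The first URL returns the product's metadata (name and size in the output above) and the second returns the zip file itself.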

I was given this product identifier as an example of something small to test with. I haven't researched what the data actually represents but the resulting zip file does contain a 'quick_look' image.

2015-01-19

Yosemite Spotlight issues with HP drivers: check your console

I recently imported a bunch of email into Outlook for OS X and was disappointed that I was unable to search its contents. Outlook uses Apple's native spotlight search so, in theory, all I need to do is wait for spotlight to churn through the new material and I should be done. Hours passed and nothing seemed to happen.

The first thing I tried was to simply force a re-index of my hard-disk. I just added my main drive to the list of places to exclude from spotlight searching and then after waiting a minute or two (for superstitious reasons) I removed that item again and sat back waiting for the inevitable slowdown as mdworker launches into life and starts scanning all my data.

Nothing.

The next step I took was to try to figure out how to see if Spotlight was actually doing any indexing at all. There's no simple control panel or dashboard view of the indexing process. The only way I could find was to press command-space and search for something. It should show the indexing progress-bar if it is indexing (but if it is complete you'll see nothing). I was still getting nowhere and now I'd lost the ability to search for anything.

Check the Console

In these situations it is always worth checking the console. I don't mean the terminal, just the console utility app that spools system messages onto your screen and allows you to see what is happening on your Mac. There's a handy search box (which doesn't use Spotlight!) at the top which filters the current day's logs. Putting just 'md' in that box was enough to filter out all the other stuff, enabling me to see a constant stream of output from Spotlight's indexing application: mdworker.

Here's a sample:

Jan 12 23:06:45 LernaBookPro com.apple.xpc.launchd[1] (com.apple.mdworker.bundles[24329]): Could not find uid associated with service: 0: Undefined error: 0 1422
Jan 12 23:06:45 LernaBookPro com.apple.xpc.launchd[1] (com.apple.mdworker.bundles): Service only ran for 0 seconds. Pushing respawn out by 10 seconds.

You don't need to be an expert to see that this is some type of unexpected condition, and the second line tells me that the resulting process exited straight away. True to its word, every 10 seconds I got a pair of lines like this in my console log. Interestingly, the problem had been going on for some time: when I searched back through my logs I realised that Spotlight had probably not been indexing properly for ages. I have a feeling that things like new mail arriving in your inbox get indexed through some other mechanism and this can mask the fact that the long-running indexer has not made a proper index of your hard disk. But that's just a hunch.

Check the web

Armed with this information I was able to find a thread on the internet that clearly dealt with the problem I was having: [...] com.apple.mdworker.bundles pollute logs with errors. Clearly the person who posted this thread didn't think (or perhaps realise) that Spotlight had actually failed and was chasing up unexpected slowness.

Interestingly, from this thread it is clear that the last number in the log entry is a numeric user id. In the case of the poster this was 502 which is typical of the range Apple uses for real users you add to your machine from the control panel. I think they start at 501. If you delete a user from your machine but leave lots of data lying around that is owned by that user then Yosemite seems to be having trouble and it is killing the Spotlight indexer.
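If you have a lot of log output to wade through, the trailing user id can be pulled out with a few lines of Python. This is a sketch that assumes the log lines look like the sample above; the pattern and function name are illustrative only.

```python
import re

# Matches the launchd complaint quoted above and captures the trailing uid.
UID_ERROR = re.compile(
    r"Could not find uid associated with service: .* (\d+)\s*$")


def missing_uids(log_lines):
    """Return the set of numeric user ids named in matching log lines."""
    found = set()
    for line in log_lines:
        match = UID_ERROR.search(line)
        if match:
            found.add(int(match.group(1)))
    return found
```

Feeding it the lines from an exported console log gives the list of user ids worth searching for on disk.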

The user id causing trouble for me is 1422, which is outside of this range so although the remedy might be similar the origin of my problem is different. I put 1422 into my search and found this thread: 27 inch iMac suddenly running very slowly where the person erases their disk and re-installs (yeah, that fixed it!).

Now use the Terminal

With the clues in the first thread it seems like I need to find some files on my disk owned by user id 1422. The terminal can do that using the Unix find command:

cd /
sudo find . -user 1422

This type of thing takes ages, especially if you have a large backup drive.
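The same search can be done from Python, which is handy if you want to post-process the results (for example, to group them by top-level folder). A minimal sketch, equivalent in spirit to the find command above; the function name is made up and you would still need to run it with sufficient privileges to read everything.

```python
import os


def files_owned_by(root, uid):
    """Yield paths under root whose owner is the numeric user id uid."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat looks at a symlink itself rather than its target
                if os.lstat(path).st_uid == uid:
                    yield path
            except OSError:
                # unreadable or vanished entries are skipped
                continue

# Example:
# for path in files_owned_by("/Library", 1422):
#     print(path)
```

Restricting the starting directory, rather than walking from /, avoids crawling a large backup drive.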

The Culprit

Turns out that the files owned by user 1422 are all part of the HP printer drivers. These may have been inherited from a previous installation, I'm not sure, but either way the search turned up HP files in /Library and inside the HP application bundles themselves. I had to use chown -R to change those to root ownership instead (I'm not sure what they are supposed to be).

You'll also find a few files in /private/var/ with paths similar to:

/private/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/0/com.apple.Spotlight/1422

The exact names will be different but the critical thing is the end part, which is a directory created just for Spotlight. It is named after the missing user and it has its ownership set to this non-existent user. It is actually these latter files which are causing the problem: Spotlight is trying to create an index for that user but is surprised when it finds the user doesn't exist. Just removing this directory isn't enough though, because Spotlight will re-index your disk and as soon as it finds a file owned by 1422 again it will create this folder in /private/var and grind to a halt again. You must remove or re-own everything that Spotlight might see: a real hassle if you have a large backup drive because of the way Time Machine works. I've solved that problem by just excluding my backup drive from Spotlight.

FWIW, Unix systems are usually very tolerant of non-existent user ids. Many archive programs will restore files from other machines and systems and, if run as root, will update their ownership to match the original ownership before the files were archived. On networks of Unix workstations that share a user directory this is useful because you can tar up files on one machine and transfer them to another and all the ownership information comes across too. On a personal machine this is less useful and perhaps even dangerous, hence the '--insecure' option on tar.

I considered removing and reinstalling the HP software but I'm not convinced that it isn't a problem with the installer itself. It works fine on the machine I upgraded from 10.8, through 10.9 to 10.10 (after these fixes) but I noticed that on a different machine that came with Mavericks and was upgraded to Yosemite I had to re-install the driver even though I used the migration assistant to set it up from its predecessor (running 10.8) which did have the HP drivers installed. I struggled to find the right download for my HP Officejet 6310 and perhaps now I know why!

Restart

I couldn't find any way of telling mdworker to give up on user id 1422. Removing its special directory didn't seem to help, so I assume that somewhere a process has a cache of that information. The only way I could figure to get it going again was a restart.

Conclusion

If you are using Yosemite and have an HP printer or scanner check your console just in case Spotlight has died for you too. Do battle with the terminal. Restart. Enjoy Spotlight indexing and smoother performance from your Mac.