Thursday, December 1, 2011

Fit testing the crawlers

Testing the crawlers has proved quite difficult.
The operation of each crawler can be divided into two separate actions:
  • Fetch a webpage
  • Process the contents of that webpage
For the fetch-part, I use Mojo::UserAgent so the correct way of testing would be to create a mock object for it and mock the get method.

But:
get returns a Mojo::Transaction::HTTP object.
The res attribute of Mojo::Transaction::HTTP is a Mojo::Message::Response which then in turn contains the parsed contents of the webpage.

So, if I were to use that particular method of testing, I would have to create a Mojo::Message::Response and feed it with the data I want to use in the test. Then I would have to create a Mojo::Transaction::HTTP and add the response to it.
It would look like this:

  1. use Test::Mock::Class ':all';
  2. use Mojo::Transaction::HTTP;
  3. use Mojo::Message::Response;
  4. use File::Slurp;

  5. my $mockUserAgent = mock_anon_class('Mojo::UserAgent')->new_object;
  6. my $getLastPageResponse = Mojo::Message::Response->new;
  7. $getLastPageResponse->code(200);
  8. $getLastPageResponse->body(scalar read_file('testpage.html'));
  9. my $getLastPageTx = Mojo::Transaction::HTTP->new;
  10. $getLastPageTx->res($getLastPageResponse);
  11. $mockUserAgent->mock_return('get', $getLastPageTx, args => ['http://url.under.test']);

  12. ... actual test...

I would have to do that for each and every test that I wanted to write.
I didn't like this at all, so I started looking for a better option. A more generic one that I would be able to use during the rest of my testing (or even other projects).

After some mucking about, I came across the idea of a mock server. Instead of mocking the useragent itself, I would just mock the webserver on the other end!
Granted, the test depends on the good functioning of Mojo::UserAgent and on the good functioning of the mock server itself before testing can be done.
But then again, it's a fit test, not a unit test! I want to test the entire application chain!

The MockServer
The idea is quite simple. I create a small Mojolicious::Lite application which is started from within the test via Mojo::Server. The application is given a dictionary of urls and corresponding pages. So when the server receives a request, it looks in the dictionary to see if it knows the url that is being requested and if it does, it returns the contents of the corresponding page.

  1. #!/usr/bin/env perl
  2. use Mojolicious::Lite;
  3. use File::Slurp;
  4. my $config = {};
  5. helper SetConfig => sub {
  6.   shift;
  7.   $config = shift;
  8. };
  9. get '/*path'  => sub {
  10.   my $self = shift;
  11.   my $path = $self->stash('path');
  12.   my @params = $self->param;
  13.   if (scalar @params > 0) {
  14.     $path .= '?';
  15.     foreach my $param (@params) {
  16.       $path .= $param . '=' . $self->param($param) . '&';
  17.     }
  18.     # chop off final '&'
  19.     $path = substr($path, 0, length($path) - 1);
  20.   }
  21.   if (defined $config->{$path}) {
  22.     my $data = read_file($config->{$path});   # scalar context: whole file, unchanged
  23.     $self->render_data($data);
  24.   } else {
  25.     $self->render_text("unrecognized param: $path");
  26.   }
  27. };
  28. app->start;

Since I'm using Mojolicious's wildcard placeholder, you would think that just fetching the path parameter would suffice. Unfortunately, the wildcard placeholder does not really match everything: it matches everything except the url query string (here is part of my quest to find that out).

So the query string needs to be parsed separately and then appended to $path to rebuild the url that was originally requested (lines 12 - 20).

Lines 21 - 26 show the actual rerouting. If the requested url is found in the dictionary ($config) then we read in the corresponding file and return that. If we can't find the requested url, then we just return an error message.
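
The key-rebuilding step can also be pulled out into a small standalone function, which makes it easy to check without starting a server. This is only a sketch: build_lookup_key is a hypothetical name that does not appear in the MockServer above, and it takes the parameters as a flat list of name/value pairs instead of reading them from the controller. It uses join rather than appending '&' and chopping the last one off again.

```perl
use strict;
use warnings;

# Hypothetical helper: rebuild the dictionary key from the wildcard path
# and the query parameters, given as a flat list of name/value pairs.
sub build_lookup_key {
    my ($path, @pairs) = @_;
    my @parts;
    # Take the pairs two at a time; an empty splice ends the loop.
    while (my ($name, $value) = splice(@pairs, 0, 2)) {
        push @parts, "$name=$value";
    }
    return @parts ? $path . '?' . join('&', @parts) : $path;
}

print build_lookup_key('zoeken', page => 0), "\n";    # zoeken?page=0
print build_lookup_key('dummyrecept'), "\n";          # dummyrecept
```

Note that the rebuilt key has to match the dictionary key character for character, so the order of the parameters matters and nothing is url-encoded here.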

The MockServerRunner
Mojo::Server's run method is blocking. That means that, if I wanted to start it from within the test, I had to run the server in a separate thread. This means that starting the server needed a little bit of code too, but since that's just boilerplate, I threw that in a separate module: the MockServerRunner:

  1. package MockServerRunner;
  2. use strict;
  3. use warnings;
  4. use threads;
  5. use Mojo::Server::Daemon;
  6. sub new {
  7.   my ($class, $config) = @_;
  8.   my $server = Mojo::Server::Daemon->new();
  9.   $server->load_app('t/Crawlers/MockServer')->SetConfig($config);
  10.   my $self = bless {THREAD => 0,
  11.                     SERVER => $server}, $class;
  12.   return $self;
  13. }
  14. sub start {
  15.   my $self = shift;
  16.   $self->{THREAD} = threads->create(sub {
  17.                                       my $self = shift;
  18.                                       $self->{SERVER}->run;
  19.                                     }, $self);
  20.   sleep(3);                     # fukit
  21.   $self->{THREAD}->detach;      # live long and prosper!
  22. }
  23. 1;

You can ignore line 20 :-)
I still need to figure out how I can detect that the server is actually running. But for now, it always starts in under three seconds.

After the server is running I detach the thread (line 21), which essentially means that I don't care about it any more. It would be cleaner to send some sort of stop signal to the thread and then wait for it to exit via join, but that would require more programming, shared data and the whole bunch. I didn't want that.
I just want the server to exit when testing is done. So I detach and I trust that Perl will clean up after me.
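
One possible way to detect that the server is up, instead of sleeping for a fixed time, is to poll the port until a TCP connection succeeds. The sketch below is an assumption on my part and not part of the MockServerRunner: wait_for_port is a hypothetical helper, and it only uses the core modules IO::Socket::INET and Time::HiRes.

```perl
use strict;
use warnings;
use IO::Socket::INET;
use Time::HiRes ();

# Hypothetical helper: poll until something accepts TCP connections on
# $host:$port, or give up after $timeout seconds.
# Returns 1 on success and 0 on timeout.
sub wait_for_port {
    my ($host, $port, $timeout) = @_;
    my $deadline = Time::HiRes::time() + $timeout;
    while (Time::HiRes::time() < $deadline) {
        my $socket = IO::Socket::INET->new(
            PeerAddr => $host,
            PeerPort => $port,
            Proto    => 'tcp',
            Timeout  => 1,
        );
        if ($socket) {
            close $socket;
            return 1;
        }
        Time::HiRes::sleep(0.1);    # brief pause before the next attempt
    }
    return 0;
}
```

In start, the sleep(3) could then be replaced by something like wait_for_port('localhost', 3000, 10) or die "mock server did not start", assuming the server listens on port 3000.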

Rerouting the request
Okay, so my server is running and ready to serve the pages that I want to run my tests on. There is only one thing left now, and that is making sure that the requests are sent to the mock server, and not to the original website! That's where Moose comes in with its fantastic method modifiers! In this case I'll use before to decorate the get method of Mojo::UserAgent so that the request is re-routed to my mock server.

The code to do that is added to the initialisation of the test. It needs to be repeated exactly once for every crawler. The sample below is from the testing of the ZestaCrawler (the one that crawls zesta.be).

  1. {
  2.   package UserAgent;
  3.   use Moose;
  4.   extends 'Mojo::UserAgent';
  5.   before 'get' => sub {
  6.     $_[1] =~ s/zesta\.be/localhost:3000/;
  7.   };
  8. }
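
As far as I can tell, this one-liner works because Perl passes arguments by alias: the elements of @_ are aliases for the caller's variables, so the substitution on $_[1] changes the url that the original get goes on to see. The same effect can be shown in plain Perl, without Moose or Mojo. The name reroute is hypothetical, and the index is 0 here because there is no invocant.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the 'before' modifier: it edits its argument
# in place, through the alias in @_.
sub reroute {
    $_[0] =~ s/zesta\.be/localhost:3000/;
    return;
}

my $url = 'http://zesta.be/zoeken?page=0';
reroute($url);
print "$url\n";    # http://localhost:3000/zoeken?page=0
```

One consequence is that the trick relies on the url being passed in a variable. A string literal is read-only, so modifying it through @_ would most likely die with a "Modification of a read-only value" error.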

The test
Now all that remains is the test itself, which should by now be a lot shorter!
Let's have a look:

  1. my $mockServer = MockServerRunner->new({'zoeken' => 't/Crawlers/zestacrawler_getlastpage_response.html',
  2.                                         'zoeken?page=0' => 't/Crawlers/zestacrawler_getpage_response.html',
  3.                                         'dummyrecept' => 't/Crawlers/zestacrawler_getrecipe_response.html'});
  4. $mockServer->start;

That's it! I want to test with three different pages, so I just pass the urls and the corresponding files to the MockServerRunner and I start it. Testing is now a breeze and my test files are a lot cleaner!
Setting up a new test (with a different url) is now reduced to creating an .html file with the desired contents and adding the url and the filename to the list that's passed to the MockServerRunner.

This method is of course far from perfect, but I really like it. It took me a couple of hours of work but I'm quite sure that it'll save time in the long run. Feel free to discuss!


ldx


