Wednesday, November 30, 2011

Hello there...

Hi everybody, and welcome to my blog!

I am currently developing a search engine specifically for recipes.
As a start, I am only targeting the Belgian market (since I am from Belgium myself).
For a couple of years now, television cooks and cooking shows have been a booming business in Belgium.
A natural consequence is that there are countless recipe sites.

This all seems very nice, but they all work separately. None of these recipe sites link to each other, so if you're looking to make a certain dish, you have to go through a whole list of bookmarks to see all the different recipes they have to offer.

The intention I have with my website is to make all the other websites searchable.
This way, you can go to a single webpage and find all the different recipes for the dish you want to cook in one place! Now wouldn't that be nice :-)

On this blog, I will let you people in on the progress of the creation of the website (or sometimes, the attempt at progress :-))

At the moment, the site hasn't launched yet, but I have been working on it on and off for the last couple of months. When it's finished, you'll be able to visit it at moam.be.

Name
You probably all think that moambe is a really strange name. Well, you're right, it is a strange name!
Since I'm just targeting Belgium for now, I naturally wanted a .be TLD.
I also wanted a domain name where the TLD is part of the website name
(you know, like mojolicio.us or youtu.be).
After a bit of thinking I came up with moam.be, derived from the traditional Congolese dish moambe.

Technology
The technology used to develop the site is currently:
There will probably still be additions to this list over time, but for now, that's it :-)

Architecture
The basic architecture of the application looks like this


There is a server-side program (the crawler) that crawls all the recipe websites (for now that's just zesta.be and njam.tv, but that list should become a lot larger over time). For each recipe site I write a separate crawler that scrapes the site and normalizes the recipe to an internal data structure.
That structure is saved as a document in CouchDB. If the site provides a thumbnail of the recipe, then that's also scraped and sent to a content delivery network (I'm trying to think about scaling at an early stage ;-))
The CDN is a separate webservice. For now it just normalizes the size (100 x 100px) and filetype (png) of the thumbnail and stores it to disk in a structured directory layout.
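
To make the two ideas above a bit more concrete, here is a minimal sketch in Python (chosen purely for illustration; it is not the language the site is written in). Everything in it is an assumption: the `Recipe` field names, the use of a SHA-1 of the source URL as the document id, and the two-level directory layout in `thumbnail_path` are hypothetical and only show one way such a normalized document and thumbnail location could look.

```python
import hashlib
from dataclasses import dataclass, asdict, field
from pathlib import PurePosixPath
from typing import List, Optional


@dataclass
class Recipe:
    """Hypothetical normalized recipe; all field names are assumptions."""
    source_site: str
    source_url: str
    title: str
    ingredients: List[str] = field(default_factory=list)
    instructions: str = ""
    thumbnail_url: Optional[str] = None

    def to_document(self) -> dict:
        # CouchDB stores JSON documents, so a plain dict is enough here.
        doc = asdict(self)
        # Assumption: derive the id from the source URL, so the same
        # recipe URL always maps to the same document id.
        doc["_id"] = hashlib.sha1(self.source_url.encode("utf-8")).hexdigest()
        return doc


def thumbnail_path(doc_id: str, root: str = "thumbs") -> PurePosixPath:
    """Spread thumbnails over nested directories keyed by the id prefix.

    Assumed layout: <root>/<first 2 chars>/<next 2 chars>/<id>.png
    """
    return PurePosixPath(root) / doc_id[:2] / doc_id[2:4] / f"{doc_id}.png"
```

Each site-specific crawler would then only have to produce a `Recipe`, and everything downstream could treat all sites the same way. Splitting the thumbnails over subdirectories keeps any single directory from growing too large.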

When a document is added to CouchDB, Elasticsearch is automatically notified (via a river) and immediately indexes the new document.
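
For the curious: a river is set up by storing a small JSON document in Elasticsearch. The sketch below builds such a document in Python, following the shape used by the CouchDB river plugin as far as I know it. The database name `recipes`, the host, the port and the bulk settings are placeholder assumptions, not the values used for this site.

```python
import json


def couchdb_river_meta(db: str, host: str = "localhost", port: int = 5984) -> dict:
    """Build the _meta document for a CouchDB river (values are assumptions)."""
    return {
        "type": "couchdb",
        "couchdb": {
            "host": host,
            "port": port,
            "db": db,
            "filter": None,  # no CouchDB filter function: index every change
        },
        "index": {
            "index": db,  # assumption: index named after the database
            "type": db,
            "bulk_size": "100",
            "bulk_timeout": "10ms",
        },
    }


# The resulting JSON would be sent with an HTTP PUT to
# /_river/<river name>/_meta on the Elasticsearch node.
body = json.dumps(couchdb_river_meta("recipes"), indent=2)
```

Once that document is in place, the river follows the CouchDB changes feed and pushes new documents into the index, so the crawler never has to talk to Elasticsearch directly.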

So when you visit the site and search for something, the webapp talks to Elasticsearch to get the desired results, gets the corresponding thumbnails from the CDN, and renders that back to your browser. Et voilà!

Because I don't want to infringe any copyrights, I don't show the full recipe, just the first 100 characters. To read the full recipe, you need to click it, and that sends you to the original recipe on the site from which it was scraped.
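
The 100-character rule is simple enough to sketch in a few lines of Python (again just an illustration). The field names `title`, `instructions` and `source_url`, the whitespace clean-up and the trailing `...` are my own assumptions for the example.

```python
EXCERPT_LENGTH = 100  # number of characters shown on the results page


def excerpt(text: str, limit: int = EXCERPT_LENGTH) -> str:
    """Return at most `limit` characters of text, marking any cut with '...'."""
    # Collapse runs of whitespace so newlines don't eat into the limit.
    text = " ".join(text.split())
    if len(text) <= limit:
        return text
    return text[:limit] + "..."


def search_result(doc: dict) -> dict:
    """Reduce a stored recipe document to what a results page would show."""
    return {
        "title": doc["title"],
        "excerpt": excerpt(doc["instructions"]),
        # The link always points back to the original site.
        "link": doc["source_url"],
    }
```

The important part is that the full text never reaches the browser: only the excerpt and the link back to the source do.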
 
Next steps
At the moment, I'm writing unit tests for all the code that has already been written (I know, I know, I should have done that a lot sooner), so actual development is going very slowly right now.
Once everything is tested properly, all I need to do is find a hosting service for all my stuff and launch!
I'm sure that it will be a fantastic and interesting journey.
Wanna join?

See you guys later,
ldx
 

2 comments:

  1. If you're at a really early stage like that I'm not _so_ sure that mega unit test coverage is so important.

    Probably better to hack like crazy, write some mechanize tests to verify site functionality and get it live so you can validate the idea with real users.

    I've thought about similar ideas before but I was never clear on the legal implications. Is it okay to just scrape data from sites and redisplay it? I mean, obviously lots of people do it, but it always left me wondering whether the project would get shot down if I actually got going with it.

    Anyway, nice to see someone starting a new venture in Perl; good luck!

  2. Thanks Adam!

    Actually, the project is already much further along than just the part that's described above. I have two crawlers ready and the site itself is at about 80% completion for an initial launch. I just need to do some more graphics.

    The problem now is that the crawlers generate quite a few errors, so that's why I'm now unit-testing and fit-testing everything.

    I'm not sure about the legal issues either.
    But it's impossible to read the entire recipe on my site, so people _have_ to click the link back to the original site to get the full recipe. So all I do for those sites is generate more traffic! I'd be surprised if they didn't like that :-)
