Nils Mulvad, editor at Kaas & Mulvad and associate professor at the Danish School of Media and Journalism. Session: When to Scrape – Tools and Techniques. Saturday, 1 March 2014, Baltimore, NICAR.
Scraping data on a daily basis on foreign workers in Denmark – the purpose is to help control that fair payment is paid to the workers. Photo: Sidse Buch
Big data is here, meaning we have access to more and more data from different sources. We also have more and more tools to extract, match, analyze and present the stories in the data.
Here are 11 tips for what to be aware of. We use Kapow software for our scrapers and run them on a Linux server; we collect the data in MySQL databases. We have two servers on Rackspace Cloud.
1. Don’t be scared of big data
There are so many different definitions, normally from the perspective of either job type or the possibilities in the data.
Data experts define it by the tools they need to handle the data. Data providers see it as new sources of data. Analysts describe it by what you can do with the data.
Look at this as a big new area to gather and combine your material from. Take your data from sensors, GPS, authority data, social media, corporate data and scientific data.
2. Data journalists are the kings
For years we have been working with data with a single purpose. Can we extract stories from them? This skill of looking for content in data is perhaps the most important to add to the possibilities right now.
Data analysts are too narrow-minded – they look at all data as equal. Journalists are too narrow-minded as well – they look at data as an incomprehensible part of life.
We operate in between. Very few can do that well. Your time has come. Go for the stories in the data.
3. Combine scraping with other tools
Often you need to scrape the same source on a schedule. If it's an official website, you can use the content directly.
Negotiating with the authority might get you the material as an XML file or via API access.
This makes it a lot easier for both you and them, saves server time on both sides, and ensures higher data quality.
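To illustrate why a structured feed beats scraping HTML, here is a minimal sketch of parsing a hypothetical XML delivery with Python's standard library. The payload and field names are invented for the example, not from the actual Danish registry.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML payload, as an authority might deliver it
# instead of an HTML page that would have to be scraped.
XML_SAMPLE = """
<companies>
  <company>
    <name>Example Byg ApS</name>
    <rut>RUT123456</rut>
  </company>
  <company>
    <name>Sample Anlaeg A/S</name>
    <rut>RUT654321</rut>
  </company>
</companies>
"""

def parse_companies(xml_text):
    """Extract (name, rut) pairs from the structured feed."""
    root = ET.fromstring(xml_text)
    return [(c.findtext("name"), c.findtext("rut"))
            for c in root.findall("company")]

records = parse_companies(XML_SAMPLE)
```

With a feed like this there is no fragile HTML parsing to break when the authority redesigns its website.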
4. Matching data
Sometimes the scraping combines material from different sources to add more context. Sometimes the best solution combines different tools – if so, use them.
WHERE (rutOK.IdCompany LIKE CONCAT('%', rutCompany.RUTNumber, '%') OR rutCompany.RUTNumber LIKE CONCAT('%', rutOK.IdCompany, '%') OR rutCompany.ForeignCompanyRegistration LIKE CONCAT('%', rutOK.IdCompany, '%') OR rutOK.RutNumber = rutCompany.RUTNumber)
The code above is the main part of a script combining our scraper language with MySQL to match fields in the extraction against fields in a table – either as a perfect match or where the content of one field is part of another. Code can do a lot.
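The same either-contains-the-other logic as in the SQL clause can be sketched in Python. This is a simplified stand-in, not the actual matching script; the field names follow the SQL snippet above.

```python
def ids_match(id_company, rut_number, foreign_reg=None):
    """Mirror of the SQL WHERE clause: match if either identifier
    contains the other, if the foreign company registration contains
    the company id, or if the two values are identical."""
    if rut_number and rut_number in id_company:
        return True
    if id_company and id_company in rut_number:
        return True
    if foreign_reg and id_company and id_company in foreign_reg:
        return True
    return id_company == rut_number
```

Substring matching like this catches records where one source stores an identifier with a prefix or suffix that the other source omits.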
5. The perfect scraper doesn’t exist – log errors
You can do many things to optimize your scraper, but there will always be a risk of errors. In some cases the authority builds – and later rebuilds – its website slightly differently, and you couldn't foresee that when you built your scraper.
You need an instinct for reverse engineering: finding patterns on the website and spotting possible sources of errors. You also need to keep records of your scraping, so you have a warning system for errors.
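A minimal sketch of that warning system: wrap each record's extraction in a try/except, log the failure, and keep going. The row format and parser here are invented for illustration.

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("scraper")

def extract_record(row):
    """Parse one scraped row (hypothetical 'name;amount' format)."""
    name, amount = row.split(";")
    return {"name": name, "amount": int(amount)}

def run_scrape(rows):
    """Process all rows, logging failures instead of crashing.
    The error log doubles as an alarm for site layout changes."""
    records, errors = [], []
    for i, row in enumerate(rows):
        try:
            records.append(extract_record(row))
        except ValueError as exc:
            log.warning("row %d failed: %r (%s)", i, row, exc)
            errors.append((i, row))
    return records, errors
```

A sudden spike in logged errors is usually the first sign that the source website has been rebuilt.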
6. Always include metadata in your scraping
Scraping data into the same table on a schedule requires you to keep track of each record, meaning you have to include at least the date and time of the scrape and the URL of the source.
This should be in every record.
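A sketch of stamping every record with that metadata before it goes into the table. The field names (`scraped_at`, `source_url`, `batch_id`) are illustrative, not from the original setup.

```python
from datetime import datetime, timezone

def with_metadata(record, source_url, batch_id):
    """Attach the metadata every scraped record should carry:
    a scrape timestamp, the source URL, and a run/batch id.
    (Field names are assumptions for this example.)"""
    stamped = dict(record)
    stamped.update({
        "scraped_at": datetime.now(timezone.utc).isoformat(),
        "source_url": source_url,
        "batch_id": batch_id,
    })
    return stamped

row = with_metadata({"company": "Example ApS"},
                    "https://example.org/registry", 42)
```

With a timestamp and batch id on every row, you can later tell which scheduled run produced which records and roll back a bad batch.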
7. Scraping thousands of websites a day, several levels down
Monitoring changes on big websites means you have to go several levels down. With, for instance, 20 websites to follow and 50 URLs on each page, you have 20 URLs on level 1, 1,000 URLs on level 2, 50,000 URLs on level 3 and, in the end, 2,500,000 URLs on level 4.
You then need to build a system so you only open a URL once at each level and never open a URL on a given level if it has been opened before. A clear structure is the answer.
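The bookkeeping behind that structure can be sketched as a per-level set of already-opened URLs. This is a simplified illustration without any actual fetching.

```python
from collections import defaultdict

class LevelCrawler:
    """Track which URLs have been opened at each crawl level,
    so no URL is fetched twice on the same level (sketch only)."""

    def __init__(self):
        self.seen = defaultdict(set)  # level -> set of opened URLs

    def should_open(self, url, level):
        """Return True and record the URL if it is new at this level."""
        if url in self.seen[level]:
            return False
        self.seen[level].add(url)
        return True
```

At 250,000-plus URLs, skipping duplicates this way is the difference between a crawl that finishes overnight and one that never finishes.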
8. Tuning scrapers for loading URLs
If you can edit the way your scraper handles URLs, turn off everything you don't need – for instance loading of images and CSS, and JavaScript execution.
It's simply the most effective way to minimize server time and speed up the scraping. Sometimes you need some of it switched on to load the data – then it is necessary.
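In a browser-based scraper such as Kapow, asset loading is switched off in the tool's settings. For a hand-rolled scraper, the same effect can be approximated by filtering out asset URLs before fetching. The extension list is an assumption for this sketch.

```python
# Assumed asset extensions to skip; adjust to the sites you scrape.
SKIP_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".css", ".js")

def worth_fetching(url):
    """Only spend server time on URLs likely to contain data,
    not on images, stylesheets or scripts."""
    path = url.split("?", 1)[0].lower()
    return not path.endswith(SKIP_EXTENSIONS)
```

If the data itself is rendered by JavaScript, this filter would exclude the very responses you need – that is the "sometimes you need to have it on" case above.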
9. Always focus on the story – the context
As data journalists we love data – we simply can't get enough. But never be satisfied with the data alone. Always think of the user, the viewer, anybody who will interact with your material.
Make it as simple and easy to use as possible. Rethink and rethink. Mobile means simple and clearly focused.
10. Know the limits of the scraping and machine-generated content
In some situations the scraped material will feed an output – either an email or a presentation. But be very keen on the finish.
Sometimes it demands a human touch at the end to make it better and even more focused on the end user. Follow your products closely.
11. Make scraper-operations scalable
If you begin scheduling daily scraping jobs, make sure you have a system that is easy to scale up.
We run all scraping on scalable cloud servers, so we can upscale everything on the fly. Make expanding easy.