API Explorer for MediaWiki

API Explorer: As part of my work trying to make the Wikia API more accessible to developers, I’ve created a basic version of an API Explorer.

Yo dawg: the entire API Explorer is actually written in JavaScript, using the MediaWiki API to build the documentation about the API. Introspection win!
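
If you’re curious how that introspection works: the MediaWiki API can describe its own modules via action=paraminfo. As a rough sketch (PHP here rather than the extension’s actual JavaScript, using Wikipedia’s api.php as a stand-in for any MediaWiki wiki), this is the kind of request the explorer builds its documentation from:

<?php
// Minimal sketch of MediaWiki API introspection: ask api.php to describe some
// of its own modules. Any MediaWiki install's api.php would work here.
$apiUrl = "http://en.wikipedia.org/w/api.php";
$params = array(
	"action"  => "paraminfo",
	"modules" => "query|parse",
	"format"  => "json",
);
$response = file_get_contents($apiUrl."?".http_build_query($params));
$info = json_decode($response, true);

// Each module comes back with its description and its list of parameters,
// which is exactly the raw material needed to render documentation about the API.
foreach($info["paraminfo"]["modules"] as $module){
	print $module["name"].": ".count($module["parameters"])." parameters\n";
}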

Open source: There is no reason this should be Wikia-specific, so I contributed it upstream to MediaWiki. You can find more info on the ApiExplorer Extension page on MediaWiki.org.

Future plans: when there is time, I’d like to make the MediaWiki API return its example URLs as well. When that is working, hopefully we can make the API Explorer let the user issue those example requests and see the results live (and modify the URLs to issue new test requests).

Don’t forget to follow @WikiaAPI on twitter for more updates!

Image Compression at Wikia

At Wikia, we have a ton of images and we serve them up constantly across our 1 billion+ pageviews per month. Since we weren’t compressing our images, this left a fairly big potential area for improvement.

Stating what is probably obvious: lossless image compression gives smaller file sizes, which means lower bandwidth costs; but more importantly, it speeds up page load times since the user doesn’t have to spend as much time downloading the images.

Compressing user-generated images

Since I’d recently seen smush.it (which is built into Google Page Speed) give me a “lossless” image that looked way worse than the original, it didn’t seem right to just bulldoze the images that users uploaded. Instead, it seemed best to use a bot to upload better versions. If it turned out that one of our lossless compression algorithms actually hurt the image quality, the community could roll it back quite easily.

This means that we won’t save file-storage space (because we’ll actually keep the old version also), but we still get the other benefits.

Research!

There are a ton of tools for image compression floating around and it wasn’t clear from secondary research (i.e., googling) which were the best. So I decided to do some primary research. Side-note: A few years ago at a previous startup (which was later acquired by LinkedIn) I did some similar research on PNG compression with a much smaller dataset but with more compressors.

This all started as a Wikia Hackathon project in which I wrote a bot that could download an image, compress it, and re-upload it. There was a lot of buy-in for this idea, so my first tests were built from that script. I chose 10 wikis and used the Wikia API (via the Perl MediaWiki API library that I co-author) to find 100 images from each wiki. I compressed PNGs and JPGs and ignored the others.

The raw data can be found on ImageBot’s User Page. But here are some takeaways:

  • We could save 11.10% across all images
  • pngcrush compressed more than optipng (when pngcrush was given a good long time to do its best methods) and jpegoptim compressed more than jpegtran.
  • But it really doesn’t matter which was better, because using BOTH was better than either alone. Meaning that if we compressed with both tools and then chose the best compression for each image, the result was better. pngcrush saved 20.89% on average, but always choosing the smaller PNG of the two resulted in 23.83% savings. Similarly, jpegoptim averaged 5.40%, but JPGs in general compressed by 5.94% when using the better method for each image. (There’s a rough sketch of this “run both, keep the best” approach right after this list.)
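
To make that “use both, keep the best” idea concrete, here is a rough sketch of the PNG side. This is my own illustration rather than ImageBot’s actual code (the helper name and temporary filenames are made up), and the JPG side would look the same with jpegoptim and jpegtran swapped in:

<?php
// Rough sketch: compress a PNG with both pngcrush and optipng, then keep
// whichever result is smallest. Helper name and temp filenames are made up.
function compressPng($inputFile){
	$crushed = $inputFile.".pngcrush.png";
	$optimized = $inputFile.".optipng.png";

	// pngcrush -brute tries a large number of filter/compression combinations
	// (slow, but that's the "good long time to do its best methods" from above).
	exec("pngcrush -brute ".escapeshellarg($inputFile)." ".escapeshellarg($crushed));

	// optipng at its highest optimization level, writing to a separate file.
	exec("optipng -o7 -out ".escapeshellarg($optimized)." ".escapeshellarg($inputFile));

	// Keep whichever candidate beats the original (if neither does, keep the original).
	$best = $inputFile;
	foreach(array($crushed, $optimized) as $candidate){
		if(file_exists($candidate) && filesize($candidate) < filesize($best)){
			$best = $candidate;
		}
	}
	return $best;
}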

Open sourcing a MediaWiki bot framework

In my last post I asked what my readers wanted me to write about, and all of the responses I got (on the post or in person) put “How to Write a MediaWiki Bot in 10 Minutes or Less” at the top of the list.

I have that post mostly written, but in order to make that whole process easier, I’ve finally made the bot framework that I now use open source and easily accessible online.

Background

I used to use custom scripts for my bot, but this summer when LyricWiki transitioned over to Wikia, they all broke. My scripts pre-dated the MediaWiki API, so they depended on screen-scraping, which no longer worked when we switched to Wikia’s skins and their completely different layout.

When I had to get my bots running again, I looked at a few Perl frameworks for connecting to the MediaWiki API, and the one that seemed to have significantly fewer bugs than the others was a Perl module by CBM.

Over the months, I realized that there was some functionality I needed that wasn’t implemented yet – deleting pages, issuing purges, finding all templates included on a page – so I added it to the module. I tried to get access to the MediaWiki Tool Server, where the project is currently hosted, but they must be really busy because they haven’t replied to the JIRA issue (my request for an account) and it’s been months.

Since it has become quite a waiting game, I decided to just fork the project. Hopefully CBM will want access to the repository and we can just keep working on it together. Regardless, I’ve created all of the usual suspects for a project such as this (see next section).
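
For reference, the functionality I mentioned above is mostly thin wrapping around plain MediaWiki API requests. Here is a quick sketch of what two of those calls look like at the HTTP level (in PHP rather than Perl, against Wikipedia’s api.php as a placeholder; deleting pages works similarly via action=delete but also requires a token):

<?php
// Sketch of the raw MediaWiki API requests behind a couple of the operations
// mentioned above (placeholder wiki URL; the real module wraps these in Perl).
$apiUrl = "http://en.wikipedia.org/w/api.php";

// Find all templates included on a page (prop=templates).
$templatesUrl = $apiUrl."?".http_build_query(array(
	"action" => "query",
	"prop"   => "templates",
	"titles" => "Main Page",
	"format" => "json",
));
$templates = json_decode(file_get_contents($templatesUrl), true);

// Purge a page's cache (action=purge). Write-style actions like purge and
// delete should be sent as POST requests (and delete also needs a token).
$purgeResponse = file_get_contents($apiUrl, false, stream_context_create(array(
	"http" => array(
		"method"  => "POST",
		"header"  => "Content-Type: application/x-www-form-urlencoded",
		"content" => http_build_query(array(
			"action" => "purge",
			"titles" => "Main Page",
			"format" => "json",
		)),
	),
)));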

Project links

So, without further delay, here are the beginnings of the Perl MediaWiki API:

The links (especially the wiki) need a lot of work before it becomes obvious how to quickly get set up and use the module. The next blog post will take care of that!

However, if you’re curious & are already comfortable with Perl (and to some extent MediaWiki), you can jump right in. Let me know if you have any feedback. Thanks!

What do you want to see?

Note: If you’re seeing this on Facebook, it is just pulled in from my blog at http://seancolombo.com

I’m in the mood to do some blogging in the next couple of days but have more ideas than time. What would YOU find most interesting? I’m thinking along the lines of either analyzing LyricWiki statistics or doing quick tutorials (“How to Write a MediaWiki Bot in 10 Minutes or Less”, or something similar).

Here are some ideas for stats I could do. They each take a decent amount of time, so please let me know which ones you are most interested in:

  • Views / #Songs by Genre
  • Views / #Pages by Language
  • Views / #Pages by Publisher
  • Infographic of Labels/Publishers in the Music Industry, how they relate to each other, and their prevalence in the market.
  • Our prevalence in a country vs. that country’s prevalence online
  • Impact on page-views of being SOTD / AOTW / FMOM vs. not.
  • Views from songs that were on the iTunes Top 100 during the month vs. those that weren’t. Views/page for that same group.
  • Views by page-age and views/page by page-age.
  • Views by page freshness (last touched) and views/page by page freshness. Include histogram of freshness across all pages.

Let me know in the comments what you want to see! Those stats, the tutorial mentioned, or anything else are all fair game. Since I don’t have many readers yet, if you comment then I’ll probably do the post you’re asking for.

Quick Tip: Do huge MySQL queries in batches when using PHP

When using PHP to make MySQL queries, it is significantly better for performance to break one extremely large query into smaller queries. In my testing, there was a query which returned 1 million rows and took 19,275 seconds (about 5 hours and 21 minutes) to traverse the results. By breaking that query up into a series of queries that each had about 10,000 results, the total time dropped to 152 seconds… yeah… less than 3 minutes.

While MySQL provides LIMIT and/or OFFSET functionality to do batching, if you have numeric ids on the table(s) you’re querying, I’d recommend rolling your own offsets (code example below) rather than using the built-in MySQL functionality, since the hand-rolled method is much faster. See the table in the next section for performance comparisons.
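
To make the comparison concrete, here are the two query shapes side by side. This is just a sketch with placeholder values; the full, runnable version of the hand-rolled approach is in the Example Code section below:

<?php
// Example batch boundaries (placeholder values).
$batchSize = 10000;
$offset = 500000; // pretend we're deep into the table

// LIMIT/OFFSET batching: MySQL still has to walk past the first $offset
// matching rows before returning each batch, so later batches keep getting slower.
$limitQuery = "SELECT page_title FROM wiki_page ".
	"WHERE page_namespace=0 ".
	"LIMIT $offset, $batchSize";

// Hand-rolled id-range batching: each batch is a range scan on the primary
// key (page_id), so every batch costs about the same no matter how deep it is.
$rangeQuery = "SELECT page_title FROM wiki_page ".
	"WHERE page_namespace=0 ".
	"AND page_id >= $offset AND page_id < ".($offset + $batchSize);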

Timing table

I’ll provide some example code below to show you how I did the batching (based on potentially sparse, unique, numeric ids). To start, here is a table of query batch size vs. total runtime for my particular loop. All timings are for traversing approximately 1,000,000 rows.

Query Batch Size      Hand-rolled method      MySQL “LIMIT” syntax
1,000,000+            19,275 seconds          19,275 seconds
10,000                152 seconds             1,759 seconds
5,000                 102 seconds             1,828 seconds
1,000                 43 seconds              ?
750                   40 seconds              ?

By the end, it was pretty clear that no more data was needed to demonstrate that the LIMIT method was slow. Each of those runs was taking about half an hour, and about halfway through the 1,000-row test for the LIMIT method it started causing the database to get backed up. Since this was on a live production system, I decided to stop before it caused any lag for users.

Example Code

This code is an example of querying for all of the pages in a MediaWiki database. I used similar code to make a list of all of the pages (and redirects) in LyricWiki. In the code, you’ll notice that doing the offsets manually based on id, instead of using the MySQL “LIMIT” syntax, doesn’t guarantee that each batch is the same size, since ids might be sparse (i.e., some may be missing if rows were deleted). That’s completely fine in this case, and there is a significant performance boost from using this method. This test code just writes out a list of all of the “real” pages in a wiki (where “real” means that they are not redirects and they are in the main namespace, as opposed to Talk pages, Help pages, Categories, etc.).


<?php

// Batch size for each query. Bigger batches mean fewer round-trips, but
// (as the timing table above shows) very large batches get slow again.
$QUERY_BATCH_SIZE = 10000;
$titleFilenamePrefix = "wiki_pageTitles";

// Configure these database settings to use this example code.
$db_host = "localhost";
$db_name = "";
$db_user = "";
$db_pass = "";

$db = mysql_connect($db_host, $db_user, $db_pass);
mysql_select_db($db_name, $db);

$TITLE_FILE = fopen($titleFilenamePrefix."_".date("Ymd").".txt", "w");
$offset = 0;
$done = false;
$startTime = time();
while(!$done){
	// Hand-rolled batching: grab the next range of page_ids rather than using
	// LIMIT/OFFSET. Batches may be smaller than $QUERY_BATCH_SIZE if ids are sparse.
	$queryString = "SELECT page_title, page_is_redirect FROM wiki_page ".
		"WHERE page_namespace=0 AND page_id >= $offset AND page_id < ".($offset + $QUERY_BATCH_SIZE);
	if($result = mysql_query($queryString, $db)){
		if(($numRows = mysql_num_rows($result)) && $numRows > 0){
			for($cnt=0; $cnt < $numRows; $cnt++){
				$title = mysql_result($result, $cnt, "page_title");
				$isRedirString = mysql_result($result, $cnt, "page_is_redirect");
				$isRedirect = ($isRedirString != "0");
				if(!$isRedirect){
					fwrite($TITLE_FILE, "$title\n");
				}
			}
			$offset += $QUERY_BATCH_SIZE;
			print "\tDone with $offset rows.\n";
		} else {
			// Empty batch: assume we've passed the last page_id and stop.
			// (Caveat: a batch-sized gap in the ids would also end the loop early.)
			$done = true;
		}
		mysql_free_result($result);
	} else {
		// The query failed; bail out rather than looping forever.
		print "Query error: ".mysql_error($db)."\n";
		$done = true;
	}
}
$endTime = time();
print "Total time to cache results: ".($endTime - $startTime)." seconds.\n";
fclose($TITLE_FILE);

?>

Hope that helps!