<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>SeanColombo.com &#187; mediawiki</title>
	<atom:link href="http://www.seancolombo.com/tag/mediawiki/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.seancolombo.com</link>
	<description>My little corner of the internet.</description>
	<lastBuildDate>Thu, 29 Jul 2010 17:30:31 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Open sourcing a MediaWiki bot framework</title>
		<link>http://www.seancolombo.com/2010/03/18/open-sourcing-a-mediawiki-bot-framework/</link>
		<comments>http://www.seancolombo.com/2010/03/18/open-sourcing-a-mediawiki-bot-framework/#comments</comments>
		<pubDate>Fri, 19 Mar 2010 00:18:18 +0000</pubDate>
		<dc:creator>Sean</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[API]]></category>
		<category><![CDATA[mediawiki]]></category>
		<category><![CDATA[ohloh]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Perl]]></category>
		<category><![CDATA[Perl MediaWiki API]]></category>
		<category><![CDATA[trac]]></category>
		<category><![CDATA[wikia]]></category>

		<guid isPermaLink="false">http://www.seancolombo.com/?p=229</guid>
		<description><![CDATA[In my last post I asked what readers wanted me to write about, and every response I got, on the post or in person, put &#8220;How to Write a MediaWiki Bot in 10 Minutes or Less&#8221; at the top of the list. I have that post mostly written, but in order [...]]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://www.seancolombo.com/2010/03/04/what-do-you-want-to-see/">my last post</a> I asked what readers wanted me to write about, and every response I got, on the post or in person, put &#8220;How to Write a MediaWiki Bot in 10 Minutes or Less&#8221; at the top of the list.</p>
<p>I have that post mostly written, but in order to make that whole process easier, I&#8217;ve finally open sourced the bot framework that I now use and made it easily accessible online.</p>
<h2>Background</h2>
<p>I used to use custom scripts for my bot, but this summer when <a href='http://lyrics.wikia.com'>LyricWiki</a> transitioned over to Wikia, they all broke.  My scripts pre-dated the <a href="http://www.mediawiki.org/wiki/API">MediaWiki API</a>, so they depended on screen-scraping, which no longer worked once we switched to Wikia&#8217;s skins with their completely different layout.</p>
<p>When I had to get my bots running again, I looked at a few Perl frameworks for connecting to the MediaWiki API, and the one that seemed to have significantly fewer bugs than the others was a Perl module by <a href='http://en.wikipedia.org/wiki/User:CBM'>CBM</a>.</p>
<p>Over the months, I&#8217;ve realized that there was some functionality that wasn&#8217;t implemented yet but which I needed &#8211; deleting pages, issuing purges, finding all templates included on a page &#8211; so I added it to the module.  I tried to get access to the Wikimedia Toolserver where the project is currently hosted, but they must be really busy because they haven&#8217;t replied to the JIRA issue (request for an account) and it&#8217;s been months.</p>
<p>Since it has become quite a waiting game, I decided to just fork the project.  Hopefully CBM will want access to the repository and we can just keep working on it together.  Regardless, I&#8217;ve created all of the usual suspects for a project such as this (see next section).</p>
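<p>For context, the operations mentioned above correspond to modules of the MediaWiki API itself.  The following is only a rough sketch, in PHP rather than Perl, of what the request parameters for two of them look like.  It does not show the Perl module&#8217;s own interface; the endpoint URL, page title, and the <code>buildApiQuery</code> helper are made up for illustration.</p>
<pre><code>
&lt;?php

// Hypothetical endpoint; substitute your own wiki's api.php URL.
$apiUrl = "http://example.org/w/api.php";

// Hypothetical helper: builds the query string for a MediaWiki API request.
function buildApiQuery(array $params){
	$params["format"] = "json";
	return http_build_query($params);
}

// List the templates included on a page.
$templatesQuery = buildApiQuery(array("action" => "query", "prop" => "templates", "titles" => "Main Page"));

// Purge the cached copy of a page.
$purgeQuery = buildApiQuery(array("action" => "purge", "titles" => "Main Page"));

print $apiUrl."?".$templatesQuery."\n";
print $apiUrl."?".$purgeQuery."\n";

?>
</code></pre>
<p>Deleting a page is left out of the sketch because it needs a logged-in session, an edit token, and a POST request.</p>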
<h2>Project links</h2>
<p>So, without further delay, here are the beginnings of the <strong>Perl MediaWiki API</strong>:</p>
<ul>
<li><a href='http://svn.seancolombo.com/perlmediawikiapi'>Perl MediaWiki API SVN Repository</a></li>
<li><a href='http://perlmediawikiapi.wikia.com/wiki/Perl_MediaWiki_API_Wiki'>Perl MediaWiki API Wiki</a></li>
<li><a href='http://svn.seancolombo.com/trac/perlmediawikiapi/timeline'>Perl MediaWiki API Trac</a></li>
<li><a href='http://www.ohloh.net/p/perlmediawikiapi/contributors/2069628775835382'>Perl MediaWiki API Ohloh project page</a></li>
</ul>
<p>The links (especially the wiki) need a lot of work before it becomes obvious how to quickly get set up and use the module.  The next blog post will take care of that!</p>
<p>However, if you&#8217;re curious and already comfortable with Perl (and to some extent MediaWiki), you can jump right in.  Let me know if you have any feedback.  Thanks!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.seancolombo.com/2010/03/18/open-sourcing-a-mediawiki-bot-framework/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>What do you want to see?</title>
		<link>http://www.seancolombo.com/2010/03/04/what-do-you-want-to-see/</link>
		<comments>http://www.seancolombo.com/2010/03/04/what-do-you-want-to-see/#comments</comments>
		<pubDate>Thu, 04 Mar 2010 04:46:52 +0000</pubDate>
		<dc:creator>Sean</dc:creator>
				<category><![CDATA[Blogs]]></category>
		<category><![CDATA[blogging]]></category>
		<category><![CDATA[lyricwiki]]></category>
		<category><![CDATA[mediawiki]]></category>
		<category><![CDATA[musicIndustry]]></category>
		<category><![CDATA[stats]]></category>

		<guid isPermaLink="false">http://www.seancolombo.com/?p=224</guid>
		<description><![CDATA[Note: If you&#8217;re seeing this on Facebook, it is just pulled in from my blog at http://seancolombo.com I&#8217;m in the mood to do some blogging in the next couple of days but have more ideas than time. What would YOU find most interesting? I&#8217;m thinking along the lines of either analyzing LyricWiki statistics or doing [...]]]></description>
			<content:encoded><![CDATA[<p><em>Note: If you&#8217;re seeing this on Facebook, it is just pulled in from my blog at <a href="http://seancolombo.com">http://seancolombo.com</a></em></p>
<p>I&#8217;m in the mood to do some blogging in the next couple of days but have more ideas than time.  What would YOU find most interesting?  I&#8217;m thinking along the lines of either analyzing LyricWiki statistics or doing quick tutorials (&#8220;How to Write a MediaWiki Bot in 10 Minutes or Less&#8221;, or something similar).</p>
<p>Here are some ideas for stats I could do.  They each take a decent amount of time, so please let me know which ones you are most interested in:</p>
<ul>
<li>Views / #Songs by Genre</li>
<li>Views / #Pages by Language</li>
<li>Views / #Pages by Publisher</li>
<li>Infographic of Labels/Publishers in the Music Industry, how they relate to each other, and their prevalence in the market.</li>
<li>Our prevalence in a country vs. its prevalence online</li>
<li>Impact on page-views of being SOTD / AOTW / FMOM vs. not.</li>
<li>Views from songs that were on the iTunes Top 100 during the month vs. those that weren&#8217;t.  Views/page for that same group.</li>
<li>Views by page-age and views/page by page-age.</li>
<li>Views by page freshness (last touched) and views/page by page freshness.  Include histogram of freshness across all pages.</li>
</ul>
<p><strong>Let me know in the comments what you want to see!</strong>  Those stats, the tutorial mentioned, or <em>anything</em> else are all fair game.  Since I don&#8217;t have many readers yet, if you comment then I&#8217;ll probably do the post you&#8217;re asking for.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.seancolombo.com/2010/03/04/what-do-you-want-to-see/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Quick Tip: Do huge MySQL queries in batches when using PHP</title>
		<link>http://www.seancolombo.com/2009/07/05/quick-tip-do-huge-mysql-queries-in-batches-when-using-php/</link>
		<comments>http://www.seancolombo.com/2009/07/05/quick-tip-do-huge-mysql-queries-in-batches-when-using-php/#comments</comments>
		<pubDate>Sun, 05 Jul 2009 18:52:47 +0000</pubDate>
		<dc:creator>Sean</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[mediawiki]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[php]]></category>
		<category><![CDATA[quickTip]]></category>

		<guid isPermaLink="false">http://www.seancolombo.com/?p=135</guid>
		<description><![CDATA[When using PHP to make MySQL queries, it is significantly better for performance to break one extremely large query into smaller queries. In my testing, there was a query which returned 1 million rows and took 19,275 seconds (about 5 hours, 21 minutes) to traverse the results. By breaking that query up into a series of [...]]]></description>
			<content:encoded><![CDATA[<p>When using PHP to make MySQL queries, it is significantly better for performance to break one extremely large query into smaller queries.  In my testing, there was a query which returned 1 million rows and took 19,275 seconds (about 5 hours, 21 minutes) to traverse the results.  By breaking that query up into a series of queries that had about 10,000 results each, the total time dropped to 152 seconds&#8230; yeah&#8230; less than 3 minutes.</p>
<p><strong>While MySQL provides LIMIT and OFFSET to do batching, if you have numeric ids on the table(s) you&#8217;re querying, I&#8217;d recommend using your own system for offsets (code example below) rather than the built-in MySQL syntax, since the hand-rolled method is much faster.  See the table in the next section for performance comparisons.</strong></p>
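<p>To make the difference concrete, here is a rough sketch of the two query shapes for a given batch.  The function names are made up for illustration, and the table and column names are the ones from the MediaWiki example later in this post.  With LIMIT/OFFSET, MySQL generally still has to step over all of the skipped rows on every batch, whereas an id range can be resolved directly through the primary key.</p>
<pre><code>
&lt;?php

// Hypothetical helpers, for illustration only.

// Batch number $batch (0-based) using MySQL's LIMIT/OFFSET syntax.
function limitQuery($batch, $batchSize){
	$skip = $batch * $batchSize;
	return "SELECT page_title FROM wiki_page ORDER BY page_id LIMIT $batchSize OFFSET $skip";
}

// Batch number $batch using a hand-rolled id range.
// If ids are sparse, this batch may hold fewer than $batchSize rows.
function idRangeQuery($batch, $batchSize){
	$low = $batch * $batchSize;
	$high = $low + $batchSize;
	return "SELECT page_title FROM wiki_page WHERE page_id > $low AND page_id <= $high";
}

print limitQuery(2, 10000)."\n";
print idRangeQuery(2, 10000)."\n";

?>
</code></pre>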
<h2>Timing table</h2>
<p>I&#8217;ll provide some example code below to show you how I did the batching (based on potentially sparse, unique, numeric ids).  To start, here is a table of query-result-size vs. the total runtime for my particular loop.  All timings are for traversing approximately 1,000,000 rows.</p>
<table border='1px'>
<tr>
<th>Query Batch Size</th>
<th>Hand-rolled method</th>
<th>MySQL &#8220;LIMIT&#8221; syntax</th>
</tr>
<tr>
<td>1,000,000+</td>
<td>19,275 seconds</td>
<td>19,275 seconds</td>
</tr>
<tr>
<td>10,000</td>
<td>152 seconds</td>
<td>1,759 seconds</td>
</tr>
<tr>
<td>5,000</td>
<td>102 seconds</td>
<td>1,828 seconds</td>
</tr>
<tr>
<td>1,000</td>
<td>43 seconds</td>
<td>?</td>
</tr>
<tr>
<td>750</td>
<td><strong>40 seconds</strong></td>
<td>?</td>
</tr>
</table>
<p><small><em>At the end, it was pretty clear that no more data was needed to continue to demonstrate that the LIMIT method was slow.  Each one of those runs was taking about half an hour and about halfway through the 1,000 row test for the LIMIT method, it started causing the database to be backed up.  Since this was on a live production system, I decided to stop before it caused any lag for users.</em></small></p>
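<p>As a quick sanity check, the speedups implied by the table can be worked out with a few lines of PHP.  This is only arithmetic on the timings listed above; the <code>speedup</code> function name is made up for illustration.</p>
<pre><code>
&lt;?php

// Timings in seconds, copied from the table above.
$unbatched = 19275;
$handRolled = array(10000 => 152, 5000 => 102, 1000 => 43, 750 => 40);
$limitSyntax = array(10000 => 1759, 5000 => 1828);

// How many times faster $fast is than $slow, rounded to one decimal place.
function speedup($slow, $fast){
	return round($slow / $fast, 1);
}

foreach($handRolled as $batchSize => $seconds){
	print "Batch size $batchSize: ".speedup($unbatched, $seconds)."x faster than one huge query";
	if(isset($limitSyntax[$batchSize])){
		print ", ".speedup($limitSyntax[$batchSize], $seconds)."x faster than LIMIT";
	}
	print "\n";
}

?>
</code></pre>
<p>At a batch size of 10,000 that works out to roughly 127 times faster than the single huge query and roughly 12 times faster than the LIMIT syntax.</p>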
<h2>Example Code</h2>
<p>This code is an example of querying for all of the pages in a <a href="http://mediawiki.org">MediaWiki</a> database.  I used code similar to this to make a list of all of the pages (and redirects) in <a href="http://lyricwiki.org">LyricWiki</a>.  In the code, you&#8217;ll notice that doing the offsets manually by id, instead of using the MySQL &#8220;LIMIT&#8221; syntax, doesn&#8217;t guarantee that each batch is the same size, since ids might be sparse (i.e., some may be missing if rows were deleted).  That&#8217;s completely fine in this case and there is a significant performance boost from using this method.  This test code just writes out a list of all of the &#8220;real&#8221; pages in a wiki (where &#8220;real&#8221; means that they are not redirects and they are in the main namespace as opposed to Talk pages, Help pages, Categories, etc.).</p>
<pre><code>
&lt;?php

$QUERY_BATCH_SIZE = 10000;
$titleFilenamePrefix = "wiki_pageTitles";

// Configure these database settings to use this example code.
$db_host = "localhost";
$db_name = "";
$db_user = "";
$db_pass = "";

// Uses mysqli, since the older mysql_* functions were removed in PHP 7.
$db = mysqli_connect($db_host, $db_user, $db_pass, $db_name);
if(!$db){
	die("Could not connect: ".mysqli_connect_error()."\n");
}

// Ids can be sparse, so loop up to the highest id rather than stopping at the first empty batch.
$maxId = 0;
if($result = mysqli_query($db, "SELECT MAX(page_id) AS max_id FROM wiki_page")){
	$row = mysqli_fetch_assoc($result);
	$maxId = (int)$row["max_id"];
	mysqli_free_result($result);
}

$TITLE_FILE = fopen($titleFilenamePrefix."_".date("Ymd").".txt", "w");
$offset = 0;
$startTime = time();
while($offset < $maxId){
	// The upper bound is inclusive so that ids on a batch boundary are not skipped.
	$queryString = "SELECT page_title, page_is_redirect FROM wiki_page WHERE page_namespace=0 AND page_id > $offset AND page_id <= ".($offset+$QUERY_BATCH_SIZE);
	$result = mysqli_query($db, $queryString);
	if(!$result){
		die("Query failed: ".mysqli_error($db)."\n");
	}
	while($row = mysqli_fetch_assoc($result)){
		// Only write out "real" pages, i.e. skip redirects.
		$isRedirect = ($row["page_is_redirect"] != "0");
		if(!$isRedirect){
			fwrite($TITLE_FILE, $row["page_title"]."\n");
		}
	}
	mysqli_free_result($result);
	$offset += $QUERY_BATCH_SIZE;
	print "\tDone with ids up to $offset.\n";
}
$endTime = time();
print "Total time to cache results: ".($endTime - $startTime)." seconds.\n";
fclose($TITLE_FILE);
mysqli_close($db);

?>
</code></pre>
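<p>If uneven batch sizes are a problem, one variation is to remember the last id seen and combine it with LIMIT (often called keyset pagination), which keeps every batch full while still avoiding OFFSET.  This is not the method that was timed above and it has not been benchmarked here.  The sketch below assumes the pdo_sqlite extension is available and uses an in-memory SQLite table as a stand-in for the wiki database so that it runs on its own; the <code>fetchAllTitles</code> function is made up for illustration.</p>
<pre><code>
&lt;?php

// In-memory SQLite table standing in for the wiki's page table (an assumption for this demo).
$pdo = new PDO("sqlite::memory:");
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec("CREATE TABLE wiki_page (page_id INTEGER PRIMARY KEY, page_title TEXT)");

// Insert 10 rows with sparse ids: 1, 4, 7, ... 28.
$insert = $pdo->prepare("INSERT INTO wiki_page (page_id, page_title) VALUES (?, ?)");
for($id = 1; $id <= 28; $id += 3){
	$insert->execute(array($id, "Page_$id"));
}

// Fetch every title in batches of $batchSize, resuming after the last id seen.
function fetchAllTitles(PDO $pdo, $batchSize){
	$titles = array();
	$lastId = 0;
	$select = $pdo->prepare("SELECT page_id, page_title FROM wiki_page WHERE page_id > ? ORDER BY page_id LIMIT ".(int)$batchSize);
	do {
		$select->execute(array($lastId));
		$rows = $select->fetchAll(PDO::FETCH_ASSOC);
		foreach($rows as $row){
			$titles[] = $row["page_title"];
			$lastId = (int)$row["page_id"];
		}
		// A batch that comes back short means there is nothing left to fetch.
	} while(count($rows) == $batchSize);
	return $titles;
}

$titles = fetchAllTitles($pdo, 4);
print count($titles)." titles fetched\n";

?>
</code></pre>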
<p>Hope that helps!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.seancolombo.com/2009/07/05/quick-tip-do-huge-mysql-queries-in-batches-when-using-php/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
