Wiki2html

Dumping is REALLY time consuming! Depending on the wikipedia you want to prepare this can take DAYS to WEEKS!

All benchmark results I present here were done on an Intel Core2Quad Q6600 overclocked to 3GHz.

=Synopsis=

You will import the Wikipedia database snapshot into a correctly configured and patched local mediawiki, then dump everything onto your harddrive as a postgresql data dump containing optimized and stripped-down html.

=install prerequisites=

sudo apt-get install apache2 php5 php5-mysql mysql-server php5-xsl php5-tidy php5-cli subversion gij bzip2

or

yum install httpd php php-mysql mysql-server mysql-client php-xml php-tidy php-cli subversion java-1.5.0-gcj bzip2

apache2 is optional and only needed if you want to install via the web interface or want to check whether your data import looks correct.

=get to run a local mediawiki=

check out the latest mediawiki into whatever folder your webserver of choice publishes and install everything you need to set mediawiki up on localhost.

svn checkout http://svn.wikimedia.org/svnroot/mediawiki/trunk/phase3 /var/www

delete the extensions dir and import the official extensions:

rm -rf /var/www/extensions
svn checkout http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions /var/www/extensions

optional: configure /etc/apache2/sites-enabled/000-default so that the mediawiki websetup loads when you access localhost. Go to http://localhost and finish the mediawiki install via the web installer. To copy/paste the rest of this walkthrough, use the root account for mysql. You only have to fill in the values marked red. When everything works, proceed to the next step.
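A minimal sketch of what that 000-default could look like, assuming the checkout location from above (the directive values are illustrative, not taken from the original walkthrough):

```apache
# Sketch of /etc/apache2/sites-enabled/000-default serving the mediawiki checkout
<VirtualHost *:80>
    ServerName localhost
    DocumentRoot /var/www
    <Directory /var/www>
        Options FollowSymLinks
        AllowOverride None
    </Directory>
</VirtualHost>
```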

An easier setup is to configure mediawiki manually:

echo "CREATE DATABASE wikidb DEFAULT CHARACTER SET binary;" | mysql -u root

then import the table structure

mysql -u root wikidb < wikidb.sql

and put LocalSettings.${LANG}.php in place

GRANT SELECT, INSERT, UPDATE, DELETE, CREATE TEMPORARY TABLES ON `${LANG}wiki`.* TO 'wikiuser'@'%';

=configure/modify mediawiki=

append to your LocalSettings.php

$wgLanguageCode = "${LANG}";
ini_set( 'memory_limit', 80 * 1024 * 1024 );
require_once( $IP.'/extensions/ParserFunctions/ParserFunctions.php' );
require_once( $IP.'/extensions/Poem/Poem.php' );
require_once( $IP.'/extensions/wikihiero/wikihiero.php' );
require_once( $IP.'/extensions/Cite/Cite.php' );
$wgUseTidy = true;
$wgExtraNamespaces[100] = "Portal"; # also to be changed according to your language
$wgSitename = "Wikipedia";

Edit AdminSettings.php and set mysql user and password so that you can run the maintenance scripts:

cp AdminSettings.sample AdminSettings.php
vim AdminSettings.php

Patch the DumpHTML extension to produce correct output with MediawikiPatch:

patch -p0 < mediawikipatch.diff

You may also enable embedded LaTeX formulas as base64 png images. Just follow these instructions: EnablingLatex

=import wikipedia to your mediawiki install=

get the template for huge databases

gunzip -c /usr/share/doc/mysql-server-5.0/examples/my-huge.cnf.gz > /etc/mysql/my.cnf

additionally set the following in /etc/mysql/my.cnf

[mysqld]
max_allowed_packet=16M
log-bin=mysql-bin

and restart mysql-server

check out the available dumps for your language at http://download.wikimedia.org/${WLANG}wiki/ ($WLANG being de, en, fr and so on). Set the appropriate language and the desired timestamp as variables:

export WLANG=
export WDATE=

clean existing tables:

echo "DELETE FROM page;DELETE FROM revision;DELETE FROM text;" | mysql -u root wikidb

add interwiki links

wget -O - http://download.wikimedia.org/${WLANG}wiki/${WDATE}/${WLANG}wiki-${WDATE}-interwiki.sql.gz | gzip -d | sed -ne '/^INSERT INTO/p' > ${WLANG}wiki-${WDATE}-interwiki.sql
mysql -u root wikidb < ${WLANG}wiki-${WDATE}-interwiki.sql

download and import database dump

wget http://download.wikimedia.org/${WLANG}wiki/${WDATE}/${WLANG}wiki-${WDATE}-pages-articles.xml.bz2
bunzip2 ${WLANG}wiki-${WDATE}-pages-articles.xml.bz2
wget http://download.wikimedia.org/tools/mwdumper.jar
java -Xmx600M -server -jar mwdumper.jar --format=sql:1.5 ${WLANG}wiki-${WDATE}-pages-articles.xml | mysql -u root wikidb
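Since the download is huge and the import takes hours, you may want to test the archive's integrity before decompressing and importing. A hypothetical pre-flight check, not part of the original walkthrough (`bzip2 -t` only tests, it does not extract):

```shell
# Test the downloaded archive before spending hours on bunzip2 and the import.
f="${WLANG}wiki-${WDATE}-pages-articles.xml.bz2"
if bzip2 -t "$f"; then
    echo "archive OK, safe to decompress and import"
else
    echo "archive corrupt, re-download it" >&2
fi
```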

add category links

wget -O - http://download.wikimedia.org/${WLANG}wiki/${WDATE}/${WLANG}wiki-${WDATE}-categorylinks.sql.gz | gzip -d | sed -ne '/^INSERT INTO/p' > ${WLANG}wiki-${WDATE}-categorylinks.sql
mysql -u root wikidb < ${WLANG}wiki-${WDATE}-categorylinks.sql

if you installed and configured apache you can now access http://localhost and check if everything is set up as desired.

=dump it all=

get the highest page id to estimate how best to split the work across your cores

echo "SELECT MAX(page_id) FROM page" | mysql -u root wikidb -sN

On a multicore setup you can dump with multiple processes using the start (-s) and end (-e) id options. Be aware that the first articles take longer than the later ones because they are generally bigger.

The following splits were found to be useful:

How long it takes depends very much on your hardware. For example, my Core2Quad Q6600@3GHz is overall four times faster dumping mokopedia than my old Athlon 64 X2 3600+ when using all four as opposed to two cores.

Dumping is also independent of harddisk speed: even when dumping with a quadcore, the bottleneck is still the processor, so there is zero speed loss when a process runs on every core in parallel.

php extensions/DumpHTML/dumpHTML.php -d /folder/to/dump -s <startid> -e <endid> --interlang
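The per-core start/end ids for the command above can be derived from the MAX(page_id) result. A sketch with an even split (MAXID and CORES are placeholder values I made up; since the early articles are bigger, you may want to make the first ranges smaller):

```shell
# Evenly split the page id range across CORES dump processes.
# MAXID comes from the SELECT MAX(page_id) query above; CORES is your core count.
MAXID=1000000
CORES=4
STEP=$(( (MAXID + CORES - 1) / CORES ))   # ceiling division so nothing is dropped
i=0
while [ $i -lt $CORES ]; do
    START=$(( i * STEP + 1 ))
    END=$(( START + STEP - 1 ))
    # print one dump command per core; pipe to sh (and add a final 'wait') to run them
    echo "php extensions/DumpHTML/dumpHTML.php -d /folder/to/dump -s $START -e $END --interlang &"
    i=$(( i + 1 ))
done
```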

create categories

php extensions/DumpHTML/dumpHTML.php -d /folder/to/dump --categories --interlang

=Appendix=

for debian you might want to remove the database check on every boot - this can take ages with the german or english wikipedia. Just comment out check_for_crashed_tables; in /etc/mysql/debian-start
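A one-line sketch of that edit, assuming the stock Debian file layout (sed writes a .bak backup first):

```shell
# Comment out the crash-table check in Debian's mysql startup script.
sed -i.bak 's/^\([[:space:]]*\)check_for_crashed_tables;/\1# check_for_crashed_tables;/' /etc/mysql/debian-start
```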