Difference between revisions of "2009-11-23 Report Developers Meeting 2009-2"

From openZIM

Revision as of 17:52, 27 November 2009

The second openZIM Developers Meeting took place November 20th to 22nd.

Participants

  1. Tommi Mäkitalo (tntnet)
  2. Emmanuel Engelhart (Kiwix)
  3. Tomasz Finc (Wikimedia Foundation)
  4. Mirko Lindner (Qi Hardware)
  5. Mirko Voigt (OpenWrt)
  6. Pascal Martin (Linterweb)
  7. Guillaume Duhamel (Linterweb)
  8. Manuel Schneider (Wikimedia CH)

Topics

better suitability for small devices

Small devices are low on memory and don't have a powerful CPU. The OpenWrt team discovered a few problems when working with openZIM to display Wikipedia content:

HTML parsing overhead

Using a full-blown HTML parser consumes a lot of resources. The available HTML engines are much more powerful than needed on these small devices. But as the content is stored in HTML format, using one of the available HTML engines is the logical way to go.

In fact, on such small devices only very little markup is really needed: headlines, bold, italic and anchors/links.

The idea of OpenWrt was to use a special markup for the content which is stable (HTML was considered unstable, as the standard changes once in a while) and much more reduced.

After a long discussion we came up with the solution to stick with HTML (to give all features of Wikipedia to users on full-blown computers) but to use a special parser that ignores everything fancy in the markup and only renders the most necessary things. That way we would still have some overhead in the ZIM file for small devices due to unused (ignored) HTML code, but it would make no difference in efficiency.
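
The reduced-parser idea can be illustrated with a minimal sketch. It is not the parser discussed at the meeting: it assumes Python's standard `html.parser` module as the tokenizer, and the class name `ReducedRenderer`, the function `render_reduced` and the plain-text output notation (`#`, `*`, `_`, `[text](target)`) are hypothetical choices made only for this example. It keeps headlines, bold, italic and links, and ignores every other tag.

```python
from html.parser import HTMLParser

# Tags the reduced renderer understands; every other tag is ignored.
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
# Elements whose text content is dropped entirely (assumption of this sketch).
SKIP_CONTENT = {"script", "style"}


class ReducedRenderer(HTMLParser):
    """Hypothetical renderer: headlines, bold, italic and links only."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []            # pieces of rendered output
        self.skip_depth = 0      # > 0 while inside <script> or <style>
        self.link_targets = []   # stack of href values for open <a> tags

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif tag in HEADINGS:
            self.out.append("\n# ")
        elif tag in ("b", "strong"):
            self.out.append("*")
        elif tag in ("i", "em"):
            self.out.append("_")
        elif tag == "a":
            self.link_targets.append(dict(attrs).get("href"))
            self.out.append("[")
        # Any other tag (div, table, span, ...) is silently ignored.

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in HEADINGS:
            self.out.append("\n")
        elif tag in ("b", "strong"):
            self.out.append("*")
        elif tag in ("i", "em"):
            self.out.append("_")
        elif tag == "a" and self.link_targets:
            href = self.link_targets.pop()
            self.out.append("](%s)" % (href or ""))

    def handle_data(self, data):
        # Text is kept unless it belongs to a skipped element.
        if self.skip_depth == 0:
            self.out.append(data)


def render_reduced(html):
    """Render an HTML fragment using only the minimal markup set."""
    renderer = ReducedRenderer()
    renderer.feed(html)
    renderer.close()
    return "".join(renderer.out).strip()
```

For example, `render_reduced('<h1>Title</h1><div><b>bold</b> and <a href="A/Foo">link</a></div>')` returns the heading line `# Title` followed by `*bold* and [link](A/Foo)`; the `div` wrapper leaves no trace in the output, which is the "ignore everything fancy" behaviour described above.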

Memory Footprint / Caches

As articles are clustered and stored in larger compressed chunks, these clusters must not become too big; otherwise the memory available on small devices would be exhausted. The default cluster size is currently 1 MB. This is the optimal size, as the compression algorithms themselves use blocks of 1 MB to compress data.

To reduce the memory footprint, the streaming mode of compression libraries offers a nice solution: only those parts of a cluster that are needed have to be kept. In streaming mode, reading starts from the beginning of a compressed data stream, and all data is discarded until the position in the uncompressed data stream is reached where the requested content starts.
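
The streaming approach can be sketched as follows. This is an illustration under stated assumptions, not zimlib code: it uses Python's standard `lzma` module as a stand-in for whichever compression library is in use, the function name `read_range` and its parameters are hypothetical, and the real cluster layout of a ZIM file is not modelled. Only one small output chunk is held in memory at a time; everything before the requested offset is decompressed and thrown away.

```python
import io
import lzma


def read_range(stream, offset, size, chunk_size=4096):
    """Return `size` uncompressed bytes starting at uncompressed `offset`.

    `stream` is a file-like object positioned at the start of the
    compressed data. Memory use is bounded by `chunk_size` plus `size`.
    """
    decomp = lzma.LZMADecompressor()
    pos = 0                 # current position in the uncompressed stream
    wanted = bytearray()    # the requested bytes collected so far
    end = offset + size

    while pos < end and not decomp.eof:
        if decomp.needs_input:
            data = stream.read(chunk_size)
            if not data:
                break       # compressed stream ended prematurely
        else:
            data = b""      # decompressor still has buffered input
        # max_length caps how much output is produced per call.
        out = decomp.decompress(data, max_length=chunk_size)

        # Keep only the part of this chunk that overlaps [offset, end).
        lo = max(offset - pos, 0)
        hi = min(end - pos, len(out))
        if lo < hi:
            wanted += out[lo:hi]
        pos += len(out)     # everything else is discarded

    return bytes(wanted)
```

The trade-off is CPU time for memory: an article near the end of a cluster still requires decompressing everything in front of it, but the full uncompressed cluster never has to fit into RAM.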

more flexible MIME type list

In prior versions of ZIM the MIME type is specified by an integer; the list of available MIME types is hard-coded in the zimlib.

To be more flexible in the future, the hard-coded list will be replaced by a list of zero-terminated strings inside the ZIM data file. For this purpose a mimeListPos field is added to the ZIM header to specify the position of this MIME type list inside the ZIM file.
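
Reading such a list could look like the following minimal sketch. The report only states that the list consists of zero-terminated strings located at mimeListPos; that an empty string marks the end of the list, that the strings are UTF-8, and that an article's integer MIME type becomes an index into this list are assumptions of this example, and the function name `read_mime_list` is hypothetical.

```python
def read_mime_list(data, mime_list_pos):
    """Parse zero-terminated MIME type strings starting at mime_list_pos.

    `data` is the file content as bytes. The list is assumed to end with
    an empty string (two consecutive zero bytes).
    """
    mime_types = []
    pos = mime_list_pos
    while True:
        end = data.index(b"\x00", pos)   # raises ValueError if unterminated
        if end == pos:                   # empty string: assumed end marker
            break
        mime_types.append(data[pos:end].decode("utf-8"))
        pos = end + 1
    return mime_types
```

With the list loaded once, an integer stored with an article can be resolved by a simple lookup such as `mime_types[index]`, so new MIME types only require a new entry in the file and no change to the library.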

addressing articles; title vs. URL

global metadata

article metadata

fulltext search

integer encoding

lzma compression

future planning

Developers Meetings

As the personal meetings are vital for the project, we think that at least two meetings per year would be helpful. The next meeting should take place around April 2010.

As always, the location and organisation are open to everyone's ideas; the planning page is already available at Developers Meeting/2010-1.

For this meeting the plan was to rent an apartment big enough to accommodate all participants, with internet access, a meeting room and a kitchen for cheap on-site catering. It was great that the number of participants increased that much, but we were not able to find a suitable apartment for that many people. As an alternative we wanted to rent a conference venue with full service. Even though we got good offers, they were still too expensive to fit into our budget. So we stuck with the cheap self-made solution.

Marketing

As the target group of LinuxTag overlaps considerably with ours but is not the one we are aiming at, we are unsure whether we should participate in LinuxTag 2010. We will decide closer to the date, as this depends mainly on volunteers offering support to run a booth.

Peering with other groups that have a similar mission is considered to be more fruitful. If possible we would like to see openZIM presented at SkoleLinux, Linux4Africa, OLPC and of course Wikimania. If you have contacts with these groups or are willing to participate in conferences and give talks about openZIM, please go ahead and get in touch with us.

Budget

As described above, we are planning to increase our budget to be more flexible in how we organise the Developers Meetings. As the team and the expectations of the project grow, we need to professionalise our organisation. Especially as we are all volunteers, we have to reduce all work which does not directly support the development of ZIM and the openZIM software.

The plan is to calculate the cost of two Developers Meetings plus two participations in other conferences, based on our experience and the offers we got this year. These calculations will be the basis for the budget we want to request from Wikimedia CH.

The budget will also include costs for the openzim.org domain and server hosting, but these make up only a very small percentage (about 250 EUR per year).

project lead

Manuel offered his position as project lead to anyone interested in taking it over, since he feels committed to the project and has given some talks but does not actively work on the development.

The team decided to keep Manuel as project lead, with the mandate to keep giving talks, taking care of marketing and maintaining contacts between openZIM and other projects.