Simon Burkhardt


The other day while browsing the webz I stumbled upon the website, a dictionary of special Bernese words. I was happy to find such a website, since I had just had a discussion with a coworker about how we both miss some of those words. Today's Swiss German dialects are heavily influenced by the English language, and the very specific Swiss words are turning into "grandma language". My personal favorite is "reise" - not meaning "to travel" but "to steer" or "to aim" ("Der Traktor über s'Fäud reise.").

While browsing this online dictionary I soon had the idea to download all of those words and perhaps build a Telegram channel that sends out a new word each day. So I coded a Python script and downloaded the 7'899 words from the A-Z list. But then I noticed that each word has a unique ID in its URL, with values far beyond 7'900. So I tested whether the unlisted IDs would return a word definition as well.
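That test is easy to automate: fetch the page for an unlisted ID and check whether it still contains the two <dd> elements that hold the word and its explanation. Here is a minimal, stdlib-only sketch of that check (Python 3 here; `DdExtractor` and `probe` are hypothetical helpers, the page structure is assumed from my scraper below, and the sample fragment is made up - the site's URL is left out just as everywhere else):

```python
from html.parser import HTMLParser

class DdExtractor(HTMLParser):
    """Collects the text of all <dd> elements - the tags that (by
    assumption) hold the word and its explanation on an entry page."""
    def __init__(self):
        super().__init__()
        self.in_dd = False
        self.dds = []

    def handle_starttag(self, tag, attrs):
        if tag == 'dd':
            self.in_dd = True
            self.dds.append('')

    def handle_endtag(self, tag):
        if tag == 'dd':
            self.in_dd = False

    def handle_data(self, data):
        if self.in_dd:
            self.dds[-1] += data

def probe(html):
    """Return (word, explanation) if the page defines a word, else None."""
    p = DdExtractor()
    p.feed(html)
    if len(p.dds) >= 2 and p.dds[0].strip():
        return p.dds[0].strip(), p.dds[1].strip()
    return None

# Hypothetical fragment in the shape the scraper relies on:
sample = "<body><dl><dd>gugele</dd><dd>lachen</dd></dl></body>"
print(probe(sample))  # -> ('gugele', 'lachen')
```

In real use you would feed `probe()` the response body of an HTTP request for each unlisted ID; a `None` result means the ID carries no definition.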

To me this looks like there are over 30'000 definitions, instead of the 7'900 words listed on the index.

Scraping the site

I don't have a clean script to just put online - it's a very hacky mess, but it downloaded all 30'381 words into a JSON archive.

Below is roughly the Python code I used to do the job, as well as a snippet from the final JSON file.

import math
import sys
import os
import urllib2
from bs4 import BeautifulSoup
import json

# IDs of the listed words, taken from the previously scraped A-Z URLs
id_list = []
with open('urls.txt', 'r') as ff:
	for l in ff:
		id_list.append(l.split('/')[4].replace('\n', ''))

list_json = {}
list_json['words'] = []

id_min = 10607.0
id_max = 40998.0
percentage_val = math.ceil((id_max - id_min) / 100)

try:
	for i in range(int(id_min), int(id_max + 1.0)):
		try:
			url = '' + str(i)  # base URL omitted
			request = urllib2.Request(url)
			response = urllib2.urlopen(request).read()
			parsed_html = BeautifulSoup(response, "lxml")

			# the word sits in the first <dd>, its explanation in the second
			word = parsed_html.body.find_all('dd')[0].text.strip()
			expl = parsed_html.body.find_all('dd')[1].text.strip()

			print(str(round((i - id_min) / percentage_val, 2))
				+ " % \t word: " + str(i) + ' - "'
				+ word.encode('utf-8') + '"')

			# append a complete entry, so earlier failures don't shift indices
			list_json['words'].append({
				'index': i,
				'word': word,
				'explanation': expl,
			})
		except Exception, ex:
			print url
			print ex
			with open('failed.all.json', 'w') as f:
				json.dump(list_json, f, indent=4)

	with open('words.all.json', 'w') as f:
		json.dump(list_json, f, indent=4)

except KeyboardInterrupt:
	# dump what we have so far if the run is aborted
	with open('failed.all.json', 'w') as f:
		json.dump(list_json, f, indent=4)
    "words": [
            "index": 10608, 
            "explanation": "fragend, ob jemand 
            gegen eine Behauptung wetten w\u00fcrde", 
            "word": "hiufsch e Wett", 
            "listed": "true"
            "index": 10610, 
            "explanation": "lachen", 
            "word": "gugele", 
            "listed": "false"

Official Statement

Of course I went on to investigate the matter through the official channels. I sent an e-mail asking about the IDs in the URL and why only 7'900 words are listed - and whether it had anything to do with the empty definitions, double entries and garbage popping up among the words.

The answer is simple: the unlisted IDs were "deleted" entries, which sometimes contain form spam. These entries were not actually deleted but merely unlisted. They thanked me for the notice and said that the issue is now fixed.

Grüessdi Simon
We certainly didn't reckon with computer freaks like you.

The "missing" words that can nevertheless still be accessed were quite simply "deleted". But since deleting doesn't really delete but only hides, they could still be retrieved via their ID. Thanks to your hint this has now been fixed.

You could say you've been digging through the dictionary's recycle bin :-) That's also why you found complete nonsense (form spam) and lots of duplicates.

Thanks for the notice and have a nice weekend
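Given that explanation, tidying the archive is straightforward: drop entries with an empty word or explanation and collapse exact duplicates. A small sketch of such a cleanup pass (`clean` is a hypothetical helper; the sample entries are made up):

```python
def clean(words):
    """Drop entries with an empty word or explanation and collapse
    exact duplicates (same word + explanation), keeping the first."""
    seen = set()
    kept = []
    for w in words:
        key = (w.get("word", "").strip(), w.get("explanation", "").strip())
        if not key[0] or not key[1] or key in seen:
            continue
        seen.add(key)
        kept.append(w)
    return kept

sample = [
    {"word": "gugele", "explanation": "lachen"},
    {"word": "gugele", "explanation": "lachen"},   # exact duplicate
    {"word": "", "explanation": "form spam"},      # empty word field
]
print(len(clean(sample)))  # -> 1
```

This only catches exact duplicates and empty fields, of course - the form spam itself would still need a human (or a much smarter filter) to spot.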

What do we learn from this?

Always do your Vorratsdatenspeicherung (data retention) before you talk about it!

And while we're at it...

Always do your backups!