wiktionary-de-parser

A Python module to extract data from German Wiktionary XML files (for Python 3.11+).

Features

Extracts IPA transcriptions, hyphenation, language, basic part-of-speech information, grammatical gender (Genus), and inflection (flexion) tables for each word.
Yields results per entry, not per page: a page can contain multiple entries, because a word can have several meanings.

Installation

pip install wiktionary-de-parser

Or with Poetry:

poetry add wiktionary-de-parser

Usage

Loading the XML dump file

from wiktionary_de_parser import WiktionaryParser
from wiktionary_de_parser.dump_processor import WiktionaryDump

# To download the dump file, specify the directory where the
# dump file should be stored.
dump = WiktionaryDump(dump_dir_path="directory-of-dump-file")

# This will download "dewiktionary-latest-pages-articles-multistream.xml.bz2" to
# the directory specified in `dump_dir_path`.
dump.download_dump()

# Alternatively you can specify a different dump file to download.
dump = WiktionaryDump(
    dump_dir_path="directory-of-dump-file",
    dump_download_url="url-to-dump-file.xml.bz2",
)
dump.download_dump()

# If you already have the dump file locally, specify the path to the file.
dump = WiktionaryDump(dump_file_path="path-to-dump-file.xml.bz2")
dump.download_dump()

Parsing the dump file

from pprint import pprint
from wiktionary_de_parser import WiktionaryParser

# ... (see above)

parser = WiktionaryParser()

for page in dump.pages():
    # Skip redirects
    if page.redirect_to:
        continue

    if page.name == "Abend":
        # Parse all entries for "Abend"
        for entry in parser.entries_from_page(page):
            results = parser.parse_entry(entry)
            pprint(results)
        break

Output

All page entries for "Abend":

ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion={
        "Genus": "m",
        "Nominativ Singular": "Abend",
        "Nominativ Plural": "Abende",
        "Genitiv Singular": "Abends",
        "Genitiv Plural": "Abende",
        "Dativ Singular": "Abend",
        "Dativ Plural": "Abenden",
        "Akkusativ Singular": "Abend",
        "Akkusativ Plural": "Abende",
    },
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": []},
    rhymes=["aːbn̩t"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion=None,
    ipa=["ˈaːbn̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": ["Nachname"]},
    rhymes=["aːbn̩t"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion=None,
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": ["Toponym"]},
    rhymes=["aːbn̩t"],
)

Development

This project uses Poetry.

Install Poetry.
Clone this repository.
Run poetry install inside the project folder to install the dependencies.
Use notebook.ipynb to try out the parser.
Run poetry run pytest to run the tests.
