
Comparing changes

base repository: adbar/py3langid
base: v0.2.2
...
head repository: adbar/py3langid
compare: v0.3.0
  • 5 commits
  • 8 files changed
  • 2 contributors

Commits on Sep 2, 2022

  1. update setup and workflow

    adbar committed Sep 2, 2022
    08ae413

Commits on May 13, 2024

  1. maintenance: update CI and setup (#9)

    adbar authored May 13, 2024
    9508f09

Commits on Jun 17, 2024

  1. update setup, switch to Python 3.8+ and pyproject.toml (#11)

    * update setup, switch to Python 3.8+ and pyproject.toml
    
    * specify numpy version
    
    * fix syntax
    adbar authored Jun 17, 2024
    1817a4c

Commits on Jun 18, 2024

  1. maintenance: better syntax and simplified code (#10)

    * better syntax and simplified code
    
    * simplify syntax
    adbar authored Jun 18, 2024

    Verified

    This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
    1aa0c67
  2. prepare version 0.3.0 (#12)

    adbar authored Jun 18, 2024
    0d673d7
Showing with 110 additions and 110 deletions.
  1. +10 −19 .github/workflows/tests.yml
  2. +7 −0 HISTORY.rst
  3. +0 −16 MANIFEST.in
  4. +2 −1 README.rst
  5. +1 −1 py3langid/__init__.py
  6. +15 −12 py3langid/langid.py
  7. +75 −0 pyproject.toml
  8. +0 −61 setup.py
29 changes: 10 additions & 19 deletions .github/workflows/tests.yml
@@ -17,48 +17,39 @@ jobs:
       fail-fast: false
       matrix:
         os: [ubuntu-latest]
-        python-version: [3.6, 3.7, 3.8, 3.9, "3.10"]
+        # https://github.com/actions/python-versions/blob/main/versions-manifest.json
+        python-version: [3.8, 3.9, "3.10", "3.11", "3.12", "3.13-dev"]
         include:
-          # custom tests
-          - python-version: "3.11-dev"
-            os: ubuntu-latest
-            experimental: true
-            allowed_failure: true
-          - python-version: pypy3
-            os: ubuntu-latest
-            experimental: true
-            allowed_failure: true
           # other OS version necessary
           - os: macos-latest
-            python-version: 3.7
+            python-version: "3.10"
           - os: windows-latest
-            python-version: 3.7
-            experimental: true
-            allowed_failure: true
+            python-version: "3.10"
     steps:
     # Python and pip setup
     - name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@v2
+      uses: actions/setup-python@v5
       with:
         python-version: ${{ matrix.python-version }}

     - name: Upgrade pip
-      run: python -m pip install --upgrade pip setuptools wheel
+      run: python -m pip install --upgrade pip

     - name: Get pip cache dir
       id: pip-cache
       run: |
         echo "::set-output name=dir::$(pip cache dir)"
     - name: pip cache
-      uses: actions/cache@v2
+      uses: actions/cache@v4
       with:
         path: ${{ steps.pip-cache.outputs.dir }}
         key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
         restore-keys: |
           ${{ runner.os }}-pip-
     # package setup
-    - uses: actions/checkout@v2
+    - uses: actions/checkout@v4

     - name: Install dependencies
       run: python -m pip install -e "."
@@ -67,4 +58,4 @@ jobs:
     - name: Test with pytest
       run: |
         python -m pip install pytest pytest-cov
-        pytest --cov=./
+        pytest --cov=./ --cov-report=xml
7 changes: 7 additions & 0 deletions HISTORY.rst
@@ -2,6 +2,13 @@
 History
 =======

+0.3.0
+-----
+
+* Modernized setup, dropped support for Python 3.6 & 3.7
+* Simplified inference code
+* Support for Numpy 2.0
+

 0.2.2
 -----
16 changes: 0 additions & 16 deletions MANIFEST.in

This file was deleted.

3 changes: 2 additions & 1 deletion README.rst
@@ -20,6 +20,8 @@ Execution speed has been improved and the code base has been optimized for Pytho

 For implementation details see this blog post: `How to make language detection with langid.py faster <https://adrien.barbaresi.eu/blog/language-detection-langid-py-faster.html>`_.

+For more information and older Python versions see `changelog <https://github.com/adbar/py3langid/blob/master/HISTORY.rst>`_.
+

 Usage
 -----
@@ -97,7 +99,6 @@ On the command-line
('it', 0.97038305)
Legacy documentation
--------------------

2 changes: 1 addition & 1 deletion py3langid/__init__.py
@@ -1,3 +1,3 @@
 from .langid import classify, rank, set_languages

-__version__ = '0.2.2'
+__version__ = '0.3.0'
27 changes: 15 additions & 12 deletions py3langid/langid.py
@@ -17,6 +17,7 @@

 from base64 import b64decode
 from collections import Counter
+from operator import itemgetter
 from pathlib import Path
 from urllib.parse import parse_qs

@@ -33,6 +34,9 @@
 # affect the relative ordering of the predicted classes. It can be
 # re-enabled at runtime - see the readme.

+# quantization: faster but less precise
+DATATYPE = "uint16"
+

 def load_model(path=None):
     """
@@ -60,7 +64,7 @@ def set_languages(langs=None):
     return IDENTIFIER.set_languages(langs)


-def classify(instance, datatype='uint16'):
+def classify(instance, datatype=DATATYPE):
     """
     Convenience method using a global identifier instance with the default
     model included in langid.py. Identifies the language that a string is
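
Note on the `datatype=DATATYPE` defaults: Python evaluates default arguments once, when the function is defined, so rebinding the module-level constant afterwards would not be expected to change what `classify` uses. The sketch below illustrates that language rule with a hypothetical stand-in function (not the real `classify`); passing `datatype` explicitly remains the per-call override.

```python
DATATYPE = "uint16"  # module-level default, mirroring the constant added in the diff


def classify(instance, datatype=DATATYPE):
    # Hypothetical stand-in for the real classify(): it only reports
    # which datatype it would use for the feature vector.
    return datatype


default_before = classify("some text")

# Rebinding the name later does not affect the default, because the
# default value was captured when the function was defined.
DATATYPE = "uint32"
default_after = classify("some text")

# Passing the argument explicitly is what takes effect for a given call.
explicit = classify("some text", datatype="uint32")

print(default_before, default_after, explicit)
```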
@@ -198,9 +202,7 @@ def set_languages(self, langs=None):
         nb_ptc, nb_pc, nb_classes = self.__full_model

         if langs is None:
-            self.nb_classes = nb_classes
-            self.nb_ptc = nb_ptc
-            self.nb_pc = nb_pc
+            self.nb_classes, self.nb_ptc, self.nb_pc = nb_classes, nb_ptc, nb_pc

         else:
             # We were passed a restricted set of languages. Trim the arrays accordingly
@@ -209,12 +211,12 @@ def set_languages(self, langs=None):
                 if lang not in nb_classes:
                     raise ValueError(f"Unknown language code {lang}")

-            subset_mask = np.fromiter((l in langs for l in nb_classes), dtype=bool)
+            subset_mask = np.isin(nb_classes, langs)
             self.nb_classes = [c for c in nb_classes if c in langs]
             self.nb_ptc = nb_ptc[:, subset_mask]
             self.nb_pc = nb_pc[subset_mask]

-    def instance2fv(self, text, datatype='uint16'):
+    def instance2fv(self, text, datatype=DATATYPE):
         """
         Map an instance into the feature space of the trained model.
@@ -227,11 +229,12 @@ def instance2fv(self, text, datatype='uint16'):

         # Convert the text to a sequence of ascii values and
         # Count the number of times we enter each state
-        state = 0
-        indexes = []
-        for letter in list(text):
+        state, indexes = 0, []
+        extend = indexes.extend
+
+        for letter in text:
             state = self.tk_nextmove[(state << 8) + letter]
-            indexes.extend(self.tk_output.get(state, []))
+            extend(self.tk_output.get(state, []))

         # datatype: consider that less feature counts are going to be needed
         arr = np.zeros(self.nb_numfeats, dtype=datatype)
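
For readers unfamiliar with the loop in `instance2fv`, here is a minimal, self-contained sketch of the same pattern. The transition table and output map are toy values invented for illustration (the real ones come from the packaged model), and `feature_indexes` is a hypothetical helper name. It shows why `list(text)` can be dropped when the input is a `bytes` object: iterating over bytes already yields integers. It also shows the bound-method shortcut `extend = indexes.extend`.

```python
from collections import Counter

# Toy automaton (hypothetical values): 3 states, 256 possible byte values each.
NUM_STATES = 3
tk_nextmove = [0] * (NUM_STATES * 256)
for s in range(NUM_STATES):
    tk_nextmove[(s << 8) + ord("a")] = 1   # any 'a' leads to state 1
tk_nextmove[(1 << 8) + ord("b")] = 2       # 'b' directly after 'a' leads to state 2

# Entering a state may emit feature indexes: feature 0 = "a", feature 1 = "ab".
tk_output = {1: (0,), 2: (1,)}


def feature_indexes(data: bytes):
    state, indexes = 0, []
    extend = indexes.extend  # look the method up once instead of on every iteration
    for byte in data:        # iterating over bytes yields ints, no list() needed
        state = tk_nextmove[(state << 8) + byte]
        extend(tk_output.get(state, []))
    return indexes


indexes = feature_indexes("abab a".encode("utf-8"))
counts = Counter(indexes)
print(indexes, counts)
```

Counting the emitted indexes (here with `Counter`) gives the per-feature counts that a fixed-size array would then hold.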
@@ -247,7 +250,7 @@ def nb_classprobs(self, fv):
         # compute the partial log-probability of the document in each class
         return pdc + self.nb_pc

-    def classify(self, text, datatype='uint16'):
+    def classify(self, text, datatype=DATATYPE):
         """
         Classify an instance.
         """
@@ -262,7 +265,7 @@ def rank(self, text):
         """
         fv = self.instance2fv(text)
         probs = self.norm_probs(self.nb_classprobs(fv))
-        return [(str(k), float(v)) for (v, k) in sorted(zip(probs, self.nb_classes), reverse=True)]
+        return sorted(zip(self.nb_classes, probs), key=itemgetter(1), reverse=True)

     def cl_path(self, path):
         """
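
A note on the `rank` change: the two return expressions are not strictly equivalent. The sketch below runs both on plain Python lists with made-up probabilities (the real method operates on the model's class list and normalized probabilities). With tied scores the old form breaks ties by language code in descending order, while the new form keeps the original class order because `sorted` is stable even with `reverse=True`. The new form also drops the explicit `str()`/`float()` conversions, so element types are whatever the inputs hold.

```python
from operator import itemgetter

# Made-up inputs for illustration only.
classes = ["de", "en", "fr", "it"]
probs = [0.10, 0.40, 0.10, 0.40]

# Old expression: sort (prob, lang) tuples, then swap and convert.
old_style = [(str(k), float(v)) for (v, k) in sorted(zip(probs, classes), reverse=True)]

# New expression: sort (lang, prob) pairs by the probability only.
new_style = sorted(zip(classes, probs), key=itemgetter(1), reverse=True)

print(old_style)
print(new_style)
```

Both orderings agree whenever all probabilities are distinct; they differ only in how ties are resolved.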
75 changes: 75 additions & 0 deletions pyproject.toml
@@ -0,0 +1,75 @@
# https://pip.pypa.io/en/stable/reference/build-system/pyproject-toml/
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "py3langid"
description = "Fork of the language identification tool langid.py, featuring a modernized codebase and faster execution times."
readme = "README.rst"
license = { text = "BSD" }
dynamic = ["version"]
requires-python = ">=3.8"
authors = [
{name = "Marco Lui"},
{name = "Adrien Barbaresi", email = "barbaresi@bbaw.de"}
]
keywords=[
"language detection",
"language identification",
"langid",
"langid.py"
]
classifiers = [
# As from http://pypi.python.org/pypi?%3Aaction=list_classifiers
'Development Status :: 5 - Production/Stable',
#'Development Status :: 6 - Mature',
"Environment :: Console",
"Intended Audience :: Developers",
"Intended Audience :: Information Technology",
"Intended Audience :: Science/Research",
"License :: OSI Approved :: BSD License",
"Operating System :: MacOS :: MacOS X",
"Operating System :: Microsoft :: Windows",
"Operating System :: POSIX :: Linux",
"Programming Language :: Python",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: 3.13",
"Topic :: Scientific/Engineering :: Artificial Intelligence",
"Topic :: Text Processing :: Linguistic",
]
dependencies = [
"numpy >= 2.0.0 ; python_version >= '3.9'",
"numpy >= 1.24.3 ; python_version == '3.8'",
]

# https://setuptools.pypa.io/en/latest/userguide/pyproject_config.html
[tool.setuptools]
packages = ["py3langid"]

# https://packaging.python.org/en/latest/guides/single-sourcing-package-version/
[tool.setuptools.dynamic]
version = {attr = "py3langid.__version__"}

[tool.setuptools.package-data]
py3langid = ["data/model.plzma"]

[project.scripts]
langid = "py3langid.langid:main"

[project.urls]
"Homepage" = "https://github.com/adbar/py3langid"
"Blog" = "https://adrien.barbaresi.eu/blog/language-detection-langid-py-faster.html"
"Tracker" = "https://github.com/adbar/py3langid/issues"

# Development extras
[project.optional-dependencies]
dev = [
"pytest",
"pytest-cov",
]
61 changes: 0 additions & 61 deletions setup.py

This file was deleted.