WeebSearch/worker

⚒ Web crawler that analyzes and dissects subtitles into database entries

An open source database of anime episode and character transcripts.


Why?

Anime is great, and while there's a lot of anime metadata out there on great sites like Anilist, there's no way to know what your favorite characters have said without going through all the episodes yourself. What exactly did Aoba say in S1 E1 of New Game!? How often did Louise speak in the first season of Familiar of Zero compared to the last? ¯\_(ツ)_/¯

These are interesting things to be able to answer. Why do I want to answer them? Stop asking so many questions.

How does (will) it work?

  • Crawlers fetch subtitles from websites

  • Subs that don't match one of the handful of known and consistent formats are filtered out

  • Some subtitles include information on speakers; those are parsed as well (see the sketch after this list)

  • Anime, episode and character information is looked up on MAL and Anilist

  • Data is given structure and saved to Postgres

  • Solr is updated with new information as it's added to Postgres

  • GraphQL is used as an API to interface with Solr

  • Each query is first checked against Redis, and its results are cached there
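
As a taste of what the speaker-parsing step involves, here's a minimal sketch of extracting the speaker from an ASS-format Dialogue event. It assumes the standard ASS event field layout (Layer, Start, End, Style, Name, margins, Effect, Text); the crawler's actual parser isn't shown here and may differ.

```typescript
interface DialogueLine {
  start: string;
  end: string;
  speaker: string | null; // the ASS "Name" field, often left empty
  text: string;
}

// Parse a single "Dialogue:" event line from an ASS subtitle file.
function parseAssDialogue(line: string): DialogueLine | null {
  if (!line.startsWith("Dialogue:")) return null;
  // Standard ASS event fields: Layer, Start, End, Style, Name,
  // MarginL, MarginR, MarginV, Effect, Text. Only Text may contain commas.
  const fields = line.slice("Dialogue:".length).split(",");
  if (fields.length < 10) return null;
  const [, start, end, , name] = fields.map(f => f.trim());
  const text = fields
    .slice(9)
    .join(",")
    .replace(/\{[^}]*\}/g, "") // strip ASS override tags like {\i1}
    .trim();
  return { start, end, speaker: name || null, text };
}
```

For a line like `Dialogue: 0,0:01:02.00,0:01:04.00,Default,Aoba,0,0,0,,Good morning!`, this yields a speaker of `"Aoba"` and a text of `"Good morning!"`.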

Todo and Planned Features

Workers (TypeScript)

  • Support multiple sub groups

  • Support multiple file types: rar, zip, 7z, tar.gz (see the sketch after this list)

  • Support Japanese subtitles

  • Add more sub websites to crawl
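
For the multi-format item, a hypothetical first step is just classifying the download so it can be routed to a format-specific extractor. The extraction libraries themselves aren't shown, and none of these names exist in the repo yet:

```typescript
type ArchiveKind = "rar" | "zip" | "7z" | "tar.gz";

// Classify a downloaded file by extension so the worker can pick an extractor.
function archiveKind(filename: string): ArchiveKind | null {
  const lower = filename.toLowerCase();
  if (lower.endsWith(".tar.gz")) return "tar.gz"; // check the compound suffix first
  if (lower.endsWith(".rar")) return "rar";
  if (lower.endsWith(".zip")) return "zip";
  if (lower.endsWith(".7z")) return "7z";
  return null; // unknown formats get skipped
}
```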

Backend (TypeScript)

  • Integrate Hifumi's API or start the API from scratch with Prisma

  • User authentication (JWT? Sessions?)

  • Internal GraphQL API to expose ORM features to the workers

  • Solr integration for indexing dialogues

  • Redis integration for caching user queries (sketched below)
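
A minimal sketch of what the query cache could look like, assuming the ioredis client and JSON-serializable results; the repo may end up with a different client or key scheme:

```typescript
import Redis from "ioredis"; // assumed client; any Redis library would do

const redis = new Redis(); // assumes Redis on the default localhost port

// Return the cached value for `key` if present; otherwise compute it,
// store it with a TTL so stale entries expire, and return it.
async function cached<T>(
  key: string,
  ttlSeconds: number,
  compute: () => Promise<T>,
): Promise<T> {
  const hit = await redis.get(key);
  if (hit !== null) return JSON.parse(hit) as T;
  const fresh = await compute();
  await redis.set(key, JSON.stringify(fresh), "EX", ttlSeconds);
  return fresh;
}
```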

Frontend (Angular, see the Frontend Repo)

  • Start a website with Angular

  • Create a web-based transcript editor to fix parsing mistakes or add new information (a sketch of its edit operations follows this list)

    • Available to users designated as data mods

    • Supports:

      • Marking lines with the correct speakers [color coded]

      • Editing existing character information

      • Editing episode and character metadata

      • Deleting unnecessary dialogues and characters (of which there are a lot)

      • Merging animes, dialogues, characters and more
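
To pin down the editor's scope, here is one hypothetical shape for its edit operations; these types are illustrative only and don't exist in the codebase yet:

```typescript
// Illustrative edit operations for the planned transcript editor.
type EditorAction =
  | { kind: "assignSpeaker"; lineId: number; characterId: number }
  | { kind: "editCharacter"; characterId: number; changes: Partial<{ name: string; aliases: string[] }> }
  | { kind: "editMetadata"; episodeId: number; changes: Record<string, string> }
  | { kind: "deleteLine"; lineId: number }
  | { kind: "deleteCharacter"; characterId: number }
  | { kind: "merge"; entity: "anime" | "dialogue" | "character"; fromId: number; intoId: number };
```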

Getting Started

Manual

  1. Copy `.env.example` to `.env`
  2. Run `npm install`
  3. Install Postgres
  4. Install Redis
  5. Run `prisma deploy`

Docker

  1. Copy `.env.example` to `.env`
  2. Download Docker
  3. Run `docker-compose up -d`
  4. Run `prisma deploy`

Tools

  • `npm run subs` starts the sub crawler

  • `npm start` starts the API to serve data

  • `npm run lint` checks the code for TSLint violations

  • `npm test` runs Jest tests against the `spec.ts` files

    • Remember to include tests for new changes

Contributing

Yes, I know the TSLint rules are very restrictive if you're not used to functional style. But you can do it, I believe in you; you don't need silly for loops when you have map, reduce, and recursion.
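
For example, counting how many lines each character speaks is a reduce, not a loop (a toy sketch, not code from the repo):

```typescript
const lines = [
  { speaker: "Aoba", text: "Good morning!" },
  { speaker: "Kou", text: "Morning." },
  { speaker: "Aoba", text: "I'm not late, right?" },
];

// Fold the line list into a { speaker: count } record, no mutation needed.
const perSpeaker = lines.reduce<Record<string, number>>(
  (counts, { speaker }) => ({ ...counts, [speaker]: (counts[speaker] ?? 0) + 1 }),
  {},
);
// => { Aoba: 2, Kou: 1 }
```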

I do expect the linter to pass before commits get merged, so keep an eye on that.


Note:

This service is still a work in progress, meaning any documentation or service component may change or be added literally overnight.
