https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection
Problem:
Create a REST API that allows you to explore the data set above. You can use any one of the .sgm files in the data set, and you may import the data into a data store if you want to. You are expected to use Java or Python, REST, and any other technology of your choice.
Expected APIs:
- API to list content
- API to search content
- API to get a specific content item by id or any other identifier
Please share the code in a git repo and be ready to demonstrate and discuss it on a call.
Install pyenv:
https://github.com/pyenv/pyenv-installer
Install pipenv:
sudo -H pip install -U pipenv
Set the Python version for the repo:
pyenv install 3.7.1
pyenv local 3.7.1
Install dependencies:
make install
Run server:
make start
Test functionality:
make verify
Test health:
curl http://localhost:8000/reuters/health
TBD:
- list articles according to time .....
- increase readability of the code
- make tests debuggable in VS Code as well (PYTHONPATH)
- add regex search for fulltext
- add a pre-commit hook
- create better documentation for the API (Swagger)
Postman:
Import h2o.postman_collection into Postman to play around with the REST API.
APIs:
localhost:8000/reuters/articles/<int:newid>
returns the detail view of an article, including its body, for display to readers
localhost:8000/reuters/articles?
returns a list of articles; you can use the query string to filter the articles, e.g.
http://localhost:8000/reuters/articles?metadata.topics=YES&places=usa
The following keys can be used as filters:
- metadata.newid
- metadata.oldid
- metadata.cgisplit
- metadata.lewissplit
- metadata.topics
- places
- people
- orgs
- exchanges
- companies
- topics
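A filter URL for the list endpoint can be assembled from a plain dict of keys and values. The snippet below is only a sketch: the helper name build_articles_url is hypothetical and not part of the repo, and the base URL is taken from the example above.

```python
from urllib.parse import urlencode

# Base URL as used in the example above (local development server).
BASE_URL = "http://localhost:8000/reuters/articles"


def build_articles_url(filters):
    """Return the list endpoint URL with the filters encoded as a query string.

    Hypothetical helper for illustration; keys are the filter names listed above.
    """
    if not filters:
        return BASE_URL
    return f"{BASE_URL}?{urlencode(filters)}"


url = build_articles_url({"metadata.topics": "YES", "places": "usa"})
print(url)
```

The dots in keys such as metadata.topics are left untouched by urlencode, so the result has the same shape as the example URL above.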
http://localhost:8000/reuters/search?fulltext.body=businessmen
returns full-text search results; the searchable fields are fulltext.title, fulltext.dataline and fulltext.body
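A client might want to check the field name before calling the search endpoint. The following is a minimal sketch under that assumption: build_search_url and FULLTEXT_FIELDS are hypothetical names, and the field list is copied from the line above. How the server itself treats unknown fields is not stated here.

```python
from urllib.parse import urlencode

# Search endpoint as shown in the example above.
SEARCH_URL = "http://localhost:8000/reuters/search"

# Field names copied from the README; spelled exactly as listed there.
FULLTEXT_FIELDS = ("fulltext.title", "fulltext.dataline", "fulltext.body")


def build_search_url(field, term):
    """Return a search URL for one fulltext field (hypothetical client helper)."""
    if field not in FULLTEXT_FIELDS:
        raise ValueError(f"unknown fulltext field: {field!r}")
    # urlencode escapes spaces and other special characters in the term.
    return f"{SEARCH_URL}?{urlencode({field: term})}"


print(build_search_url("fulltext.body", "businessmen"))
```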
Notes:
- trying out Node-like tools for Python
- the SGML data do not have unique keys, which is why I am using dot-based selectors; IMO this is the fastest approach
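The dot-based selectors mentioned above could work roughly as follows. This is a sketch and not the code from the repo: the function names select and matches, and the sample article dict, are made up for illustration, and it assumes articles are stored as nested dicts with list values for fields such as places.

```python
def select(record, path, default=None):
    """Resolve a dotted path such as 'metadata.newid' against nested dicts."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def matches(record, filters):
    """Return True if every dotted filter key matches the record.

    List values (e.g. places) match when they contain the expected value.
    """
    for path, expected in filters.items():
        value = select(record, path)
        if isinstance(value, list):
            if expected not in value:
                return False
        elif value != expected:
            return False
    return True


# Made-up sample record; the real storage layout is an assumption.
article = {"metadata": {"newid": "1", "topics": "YES"}, "places": ["usa"]}

print(select(article, "metadata.newid"))
print(matches(article, {"metadata.topics": "YES", "places": "usa"}))
```

With this shape, a query string like metadata.topics=YES&places=usa maps directly onto a filters dict without any lookup table of field names.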