Better address #196

Open
brianvoe opened this issue Jan 16, 2022 · 12 comments

Comments

@brianvoe
Owner

Right now the city, state, and zip code may not match each other, and the full address isn't a good representation of actual usage.

@dhartford

Most of the implementations I've seen with this capability require a data file (CSV, pipe-delimited, whatever) that has all 3 columns for US-based city, state, and zip code, so a row can be returned as a single tuple/struct equivalent. Different countries... different solutions.

Pseudo example for the U.S.:
mycity, ZZ, 98765
mysuburb, ZZ, 98765
mymetro, ZZ, 98789

Notes:
Yes, in the U.S., multiple cities can share the same zip code.

Usually the state is either spelled out or given as a state code, so a helper function may be needed to convert back and forth depending on the use case.
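A conversion helper like the one described could be sketched as follows. This is only an illustration, not part of the library: the names `StateName` and `StateCode` are hypothetical, and the lookup table holds just three states to keep it short.

```go
package main

import (
	"fmt"
	"strings"
)

// stateNames maps USPS state codes to full names.
// Illustrative subset only; a real table would cover every state.
var stateNames = map[string]string{
	"CA": "California",
	"NY": "New York",
	"TX": "Texas",
}

// stateCodes is the reverse lookup, keyed by lowercase full name.
var stateCodes = func() map[string]string {
	m := make(map[string]string, len(stateNames))
	for code, name := range stateNames {
		m[strings.ToLower(name)] = code
	}
	return m
}()

// StateName returns the full name for a state code (case-insensitive).
func StateName(code string) (string, bool) {
	name, ok := stateNames[strings.ToUpper(strings.TrimSpace(code))]
	return name, ok
}

// StateCode returns the two-letter code for a full state name (case-insensitive).
func StateCode(name string) (string, bool) {
	code, ok := stateCodes[strings.ToLower(strings.TrimSpace(name))]
	return code, ok
}

func main() {
	name, _ := StateName("ny")
	code, _ := StateCode("new york")
	fmt.Println(name, code) // New York NY
}
```

Returning a second `ok` value lets callers decide what to do with unknown input instead of silently getting an empty string.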

It should be parameterized to support range-based randomization focused on the desired need, i.e. only from one or several state codes, or one or several zip codes, to avoid over-sized randomization and trimming later.
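As a rough sketch of that idea, each row can be kept as one struct so the city, state, and zip code always agree, with an optional filter applied before the random pick. Everything here is hypothetical: the `Place` and `Filter` types, the `RandomPlace` function, and the data, which reuses the placeholder rows from the pseudo example plus one made-up row for a second state.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Place is one city/state/zip row, returned as a unit so the parts always agree.
type Place struct {
	City  string
	State string // two-letter code
	Zip   string
}

// places is placeholder data, not real locations.
var places = []Place{
	{"mycity", "ZZ", "98765"},
	{"mysuburb", "ZZ", "98765"},
	{"mymetro", "ZZ", "98789"},
	{"othertown", "YY", "12345"}, // made-up extra row for a second state
}

// Filter narrows the pool before picking; an empty slice means "no restriction".
type Filter struct {
	States []string
	Zips   []string
}

func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// RandomPlace picks one row matching the filter. intn must return a value in
// [0, n), for example rand.Intn. ok is false when nothing matches.
func RandomPlace(intn func(n int) int, f Filter) (p Place, ok bool) {
	var pool []Place
	for _, c := range places {
		if len(f.States) > 0 && !contains(f.States, c.State) {
			continue
		}
		if len(f.Zips) > 0 && !contains(f.Zips, c.Zip) {
			continue
		}
		pool = append(pool, c)
	}
	if len(pool) == 0 {
		return Place{}, false
	}
	return pool[intn(len(pool))], true
}

func main() {
	if p, ok := RandomPlace(rand.Intn, Filter{States: []string{"ZZ"}}); ok {
		fmt.Printf("%s, %s %s\n", p.City, p.State, p.Zip)
	}
}
```

Filtering first and picking second avoids generating a large random set and trimming it afterwards.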

I'm not sure where other implementations get those data files, however... so if someone knows, that might be the first step!

P.S. I would not recommend 'real' street addresses for this library, but 'real' city/state/zip would at least work with any geo-heatmap analytics kind of work, even if randomized, as long as the street address is ignored.

@brianvoe
Owner Author

I completely agree. If anyone has that data and can make it easily usable, I'm open to seeing that PR.

@rashmi-tondare

Hi @brianvoe, I'd be interested in working on this issue. I was trying to find some data sources for addresses and came across this repo - https://github.com/dr5hn/countries-states-cities-database
Wanted to get your thoughts on this, do you think it would be a good idea to use the countries-states-cities.json file as a data source? We could use a random int to fetch the country first, then state, then city etc.
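The country, then state, then city selection could look roughly like this. It is a sketch under assumptions: the `Country`, `State`, and `Location` types are hypothetical and do not mirror the actual layout of the JSON file, and the data is a tiny hand-written sample rather than anything loaded from that repo.

```go
package main

import (
	"fmt"
	"math/rand"
)

// State and Country form a simple nested hierarchy (hypothetical shape).
type State struct {
	Name   string
	Cities []string
}

type Country struct {
	Name   string
	States []State
}

// countries is a tiny hand-written sample for illustration.
var countries = []Country{
	{Name: "United States", States: []State{
		{Name: "California", Cities: []string{"Los Angeles", "San Diego"}},
		{Name: "Texas", Cities: []string{"Houston", "Austin"}},
	}},
	{Name: "India", States: []State{
		{Name: "Maharashtra", Cities: []string{"Mumbai", "Pune"}},
	}},
}

// Location is a consistent country/state/city triple.
type Location struct {
	Country, State, City string
}

// RandomLocation walks down the hierarchy, drawing one random index per level.
// intn must return a value in [0, n), for example rand.Intn.
// It assumes every country has at least one state and every state one city.
func RandomLocation(intn func(n int) int) Location {
	c := countries[intn(len(countries))]
	s := c.States[intn(len(c.States))]
	return Location{Country: c.Name, State: s.Name, City: s.Cities[intn(len(s.Cities))]}
}

func main() {
	loc := RandomLocation(rand.Intn)
	fmt.Printf("%s, %s, %s\n", loc.City, loc.State, loc.Country)
}
```

Because each level is drawn uniformly, cities in countries with few states come up more often than they would in a flat list; whether that matters depends on the use case.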

@brianvoe
Owner Author

I do like the idea. I'm going to give you some of the challenges I see, and you (or anyone else) can let me know your thoughts.

  1. Data size - I just looked at the countries-states-cities.json file and it's 36 MB. Even if we removed all the parts we didn't need and got the size down to 10 MB, I feel like that is still too much to add to everyone's app, especially for most people who may not even use address data. If we could get it to sub 1 MB then maybe that's something we can work with.
  2. Licensing - I'm not too familiar with Open Data Commons licensing, but we would have to figure that out as well.

@rashmi-tondare

  1. Data size - We could use go-bindata, which converts any file into Go code where the file data is stored as a byte slice, along with helper functions to fetch and decode this data into in-memory structs. I tried out a small PoC and it generated a 13.8 MB file from the 36 MB JSON file. We can clean up the data to only have the fields that we need, which would considerably bring down the generated file size too. However, this data doesn't contain zip codes, so if that is essential we will need to look for another source.
  2. Licensing - Would that really be an issue since ODC allows for both commercial and private use?

@brianvoe
Owner Author

  1. So one of the things that I was trying to cross-reference with is a population for each city. If I knew population sizes then I could include only the top 20 or so cities in each state, and that would significantly lower the size of the data. My town only has 10,000 or so people in it and I don't think I would care if it wasn't included in a random data selection. I want to try to stay away from doing something like go-bindata just for simplicity's sake.
  2. As far as the license goes, I don't know if I can maintain an MIT license and import something else that isn't MIT. But if we only use a subset of that data, I'm not sure licensing is still an issue. I never really know where that line is. If a package has a simple function that, let's say, adds two numbers together, can no one else have a function that does the same thing? I don't know.
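The population cut-off from point 1 could be sketched like this, assuming a population figure were available for each city. The `City` type and `TopCities` function are hypothetical, and the names and populations in the sample are made up purely to show the trimming.

```go
package main

import (
	"fmt"
	"sort"
)

// City is one row of a hypothetical population-annotated data set.
type City struct {
	Name       string
	State      string
	Population int
}

// TopCities groups cities by state and keeps at most n per state,
// ordered from largest to smallest population.
func TopCities(cities []City, n int) map[string][]City {
	byState := make(map[string][]City)
	for _, c := range cities {
		byState[c.State] = append(byState[c.State], c)
	}
	for state, list := range byState {
		sort.Slice(list, func(i, j int) bool {
			return list[i].Population > list[j].Population
		})
		if len(list) > n {
			list = list[:n]
		}
		byState[state] = list
	}
	return byState
}

func main() {
	// Made-up names and populations, for illustration only.
	sample := []City{
		{"smalltown", "ZZ", 10000},
		{"mymetro", "ZZ", 900000},
		{"mycity", "ZZ", 250000},
		{"othertown", "YY", 40000},
	}
	for state, list := range TopCities(sample, 2) {
		fmt.Println(state, list)
	}
}
```

Running a trim like this once, ahead of time, and committing only the result would keep the shipped data small without needing a tool such as go-bindata.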

@rashmi-tondare

  1. We could clean it up to just include 20 random cities in each state. I think it should be fine even if that means we end up skipping some popular cities as long as we have the right city-state-country relation. We could mention this explicitly in the documentation.
  2. This is a very valid point and honestly I don't know what the right answer is. But as indicated in these 2 issues, the author seems to be fine with people using modified versions of the data as long as it's credited in the README:
    2.1 Project LICENSE dr5hn/countries-states-cities-database#179
    2.2 License clarification dr5hn/countries-states-cities-database#272

@brianvoe
Owner Author

OK, sounds good to me. I'll try to see if I can get this file size as small as possible. I'll see if I can find some sort of list that indicates population or popularity and cross-reference that. We'll figure it out.

@rashmi-tondare

Hi @brianvoe, just wanted to touch base with you regarding the data clean up. Is there anything I can do to help with that? Were you able to find a data source for city-wise population?

@brianvoe
Owner Author

Sorry, I haven't been able to look into this. I am trying to finish a new implementation of another open source project I have called SlimSelect. Once I am done over there I can switch back and try to figure this out.

If anyone has time, please try to look into this. Even just finding population data would be a huge help in getting this feature implemented.

@rashmi-tondare

Found something here: world-cities. From the description:

This datapackage only list cities above 15,000 inhabitants

The JSON file is ~2 MB and has city, state, and country info. We'll still have to cross-reference some other data for the pincodes, but let me know if this seems like a usable, reliable data source.

@brianvoe
Owner Author

brianvoe commented Dec 4, 2022

I think this is a good try, but it still doesn't allow me to lower the output limit based on population. I think 15,000 would still be too large for this open source package.


3 participants