Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tldjs.getDomain('acc1sub1-dot-moisestest.appspot.com') returns 'acc1sub1-dot-moisestest.appspot.com' #120

Open
mbelchin opened this issue Mar 16, 2018 · 3 comments
Labels

Comments

@mbelchin
Copy link

Hi,

I'm just testing this library and I'm surprised by this results:

tldjs.getDomain('acc1sub1-dot-moisestest.appspot.com') returns 'acc1sub1-dot-moisestest.appspot.com'

Shouldn't return appspot.com instead ?

tldjs.getSubdomain('acc1sub1-dot-moisestest.appspot.com') returns 'acc1sub1-dot-moisestest.appspot.com'

Shouldn't return acc1sub1-dot-moisestest instead ?

Thanks in advance.

@remusao
Copy link
Collaborator

remusao commented Mar 16, 2018

Hi @mbelchin,

Thanks for reaching out! I get slightly different results for this domain (with latest version of the rules):

> tldjs.parse('acc1sub1-dot-moisestest.appspot.com')
{ hostname: 'acc1sub1-dot-moisestest.appspot.com',
  isValid: true,
  isIp: false,
  tldExists: true,
  publicSuffix: 'appspot.com',
  domain: 'acc1sub1-dot-moisestest.appspot.com',
  subdomain: '' }
> tldjs.getSubdomain('acc1sub1-dot-moisestest.appspot.com')
''

The result is still surprising, but it can be explained by the fact that appspot.com is a valid public suffix according to the rule-set we use. It's a long standing issue that is not trivial to fix unfortunately. It was recently reported in #117 (and others before). The current state is this:

This is a long-standing and known issue which is not trivial to fix. It stems from the fact that the public suffix list was originally designed to check under which domains, sub-domains can be registered, and cookies can be set. In turn, it can lead to surprising/un-intuitive results such as the ones you encountered.

We've thought about this situation in the past, and I can see a few solutions, none of which is perfect. But maybe it would be "good enough":

  1. One hacky fix you can use right now without any update of TLD is to detect when domain is null, and instead use the value publicSuffix as the domain. This will work for a lot of domains (I will try to investigate more how many domain would return the wrong result with this solution, but I expect not so many).
  2. There are currently two parts in the public suffix list: ICANN and PRIVATE. I suspect that most of the surprising cases come from the PRIVATE part. We could add an option in tld.js to only take ICANN domains into account (this would fix the examples you found and many others).
  3. Combine both 1. and 2. (most of the counter-examples seem to be japan domain, but we need to investigate a bit more to see if there are some non-trivial cases).

None of the solution is perfect as there are known counter-examples. If this is an option for you, I would suggest you give a try to 1. and I will try to implement 2..
Also, as far as I know, this should be a limitation for all libraries using the public suffix lists unfortunately.

I did not have time to experiment with the proposed solution since then unfortunately. But if 1. works for you, that could be a reasonable work-around until then.

@mbelchin
Copy link
Author

Hi, thanks for that quick and detailed answer.
I'm gonna try the 1. approach.
Just one clarification, when you say

One hacky fix you can use right now without any update of TLD is to detect when domain is null, and instead use the value publicSuffix as the domain

you meant subdomain right?

becase

> tldjs.parse('acc1sub1-dot-moisestest.appspot.com')
{ hostname: 'acc1sub1-dot-moisestest.appspot.com',
  isValid: true,
  isIp: false,
  tldExists: true,
  publicSuffix: 'appspot.com',
  domain: 'acc1sub1-dot-moisestest.appspot.com',
  subdomain: '' }

parse will return empty for subdomain

@remusao
Copy link
Collaborator

remusao commented Mar 16, 2018

@mbelchin Right, I think the "trick" works for a slightly different use-case. If you would try to parse appspot.com, then publicSuffix would be appspot.com and domain would be empty. In this case you can use publicSuffix as the domain.

On the other hand, in your case, checking if subdomain is empty is not enough information to know what the actual domain is. Which means we will have to implement something to allow retrieving only ICANN domains and not PRIVATE ones, but even that will not work 100% of the time. I will try to find time for that in the next few days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants