Small script for analyzing Drupal content types on sites.

kyletaylored/drupal-nodejs-sitemap

Drupal NodeJS Sitemap Analyzer

This is a simple Node.js script that processes a sitemap to extract content types and other information for research purposes.

Scraper

All you need to do is install the node modules, then run the sitemap script.

npm install
npm run fetch-sitemap

You will be prompted to enter the URL of the sitemap.xml file. The script then analyzes each page to detect metadata, node type body classes, forms, and any pages returned with non-success HTTP statuses.

Alternatively, you can skip the prompt by passing the sitemap URL as an argument.

npm run fetch-sitemap -- https://www.example.com/sitemap.xml

The script does not yet detect the CMS automatically, so it cannot choose the matching content type regex on its own. If you know the CMS (drupal or wordpress), you can pass it in via an argument.

npm run fetch-sitemap -- https://www.example.com/sitemap.xml --cms="drupal | wordpress"

You can also pass in a custom regex match to detect your own content types regardless of CMS. This will always override any CMS detection.

npm run fetch-sitemap -- https://www.example.com/sitemap.xml --regex="/(?:.*node[-]+type-)/g"
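As a rough illustration of what such a regex does, the sketch below strips the matched prefix from each class on a page's body tag and keeps what remains as the content type. The helper name `extractContentTypes` and the per-class matching are assumptions made for this example; the script's own implementation may differ.

```javascript
// Hypothetical helper, not taken from the script itself.
// Assumption: the regex is applied to each body class separately.
function extractContentTypes(bodyClass, prefixPattern) {
  // Rebuild the pattern without the "g" flag so repeated test() calls
  // do not carry lastIndex state from one class to the next.
  const prefix = new RegExp(prefixPattern.source);
  return bodyClass
    .split(/\s+/)
    .filter((token) => token && prefix.test(token))
    .map((token) => token.replace(prefix, ""));
}

const types = extractContentTypes(
  "html not-front node-type-press-release page-node",
  /(?:.*node[-]+type-)/g
);
console.log(types); // [ 'press-release' ]
```

Classes that do not contain the prefix, such as `page-node`, are ignored, so only the part after `node-type-` is reported.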

The output will look similar to the following:

{
  "metadata": {
    "host": "www.example.com",
    "path": "/",
    "title": "Example website description.",
    "charset": "utf-8",
    "feeds": []
  },
  "nodeTypes": {
    "press-release": {
      "count": 889,
      "urls": ["Array"]
    },
    "gallery": {
      "count": 764,
      "urls": ["Array"]
    },
    "video": {
      "count": 977,
      "urls": ["Array"]
    },
    "publication": {
      "count": 1812,
      "urls": ["Array"]
    },
    "country": {
      "count": 85,
      "urls": ["Array"]
    },
    "article": {
      "count": 6318,
      "urls": ["Array"]
    },
    "page": {
      "count": 1362,
      "urls": ["Array"]
    }
  },
  "formTypes": {
    "mc-embedded-subscribe-form": {
      "count": 7,
      "urls": ["Array"]
    },
    "search-block-form": {
      "count": 4,
      "urls": ["Array"]
    },
    "webform-client-form-16622": {
      "count": 1,
      "urls": ["Array"]
    },
    "newsletter": {
      "count": 1,
      "urls": ["Array"]
    }
  },
  "statusCodes": {
    "403": {
      "count": 1235,
      "urls": ["Array"]
    },
    "404": {
      "count": 4,
      "urls": ["Array"]
    }
  },
  "langCodes": {
    "Vietnamese": {
      "count": 139,
      "urls": ["Array"]
    },
    "English": {
      "count": 205,
      "urls": ["Array"]
    },
    "Serbian": {
      "count": 1,
      "urls": ["Array"]
    },
    "Albanian": {
      "count": 233,
      "urls": ["Array"]
    },
    "Romanian": {
      "count": 1,
      "urls": ["Array"]
    },
    "Portuguese": {
      "count": 1,
      "urls": ["Array"]
    },
    "Mongolian": {
      "count": 4,
      "urls": ["Array"]
    },
    "Khmer": {
      "count": 14,
      "urls": ["Array"]
    }
  }
}
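Each section of the output maps a key to a `count` and a list of `urls`, so it is straightforward to rank entries once the JSON has been parsed. The following is a minimal sketch that assumes the report is already loaded into an object; the `rankByCount` helper and the trimmed `report` sample are made up for this example.

```javascript
// Hypothetical helper: sort one section of the report by count, highest first.
function rankByCount(section) {
  return Object.entries(section)
    .map(([name, entry]) => ({ name, count: entry.count }))
    .sort((a, b) => b.count - a.count);
}

// Trimmed sample in the same shape as the output above (urls left empty).
const report = {
  nodeTypes: {
    "press-release": { count: 889, urls: [] },
    article: { count: 6318, urls: [] },
    page: { count: 1362, urls: [] },
  },
};

for (const { name, count } of rankByCount(report.nodeTypes)) {
  console.log(`${name}: ${count}`);
}
// article: 6318
// page: 1362
// press-release: 889
```

The same helper works for `formTypes`, `statusCodes`, and `langCodes`, since they share the `count` and `urls` shape.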

Dashboard

To view the dashboard (currently in development), visit localhost:3000.
