July 8th, 2024 × #SEO #Sitemaps #Web Development
Perfect Sitemaps for SEO
Wes and Scott discuss why you need a sitemap, what should be in it, and how to generate and submit it properly for SEO.
- Wes is back from paternity leave
- Scott almost missed the boat to a conference
- Scott built a new website and realized he needs to optimize for SEO
- Scott wonders why we need sitemaps and what should be in them nowadays
- Sitemaps help search engines index and monitor pages better
- Sitemaps don't help ranking but help crawlers find relevant content
- Sitemap formats: XML, RSS, plain text file
- Plain text sitemap is just URLs, one per line
- Sitemap can be named anything, not just sitemap.xml
- But sitemap.xml is a standard worth following
- XML sitemap is most flexible and allows more metadata
- Last mod date is the only sitemap field search engines use now
- Bing uses sitemap priority but Google ignores it
- We should update the Syntax sitemap fields
- Getting all Syntax content indexed recently got much easier
- Parameters and future/unpublished pages should not be in the sitemap
- Only published, non-redirect pages should be included
- Ways to generate a sitemap: meta framework built-ins or custom route
- Hand-writing a large sitemap takes too much effort
- Store sitemap pages as data objects first before outputting as XML
- Validate sitemap with online tools before submitting it
- Submit sitemap to Bing and Google webmaster tools
- Cache sitemaps to avoid heavy DB and bandwidth loads
Transcript
Wes Bos
Shout out to CJ for filling in there. But it feels good to be back on the horse.
Wes is back from paternity leave
Wes Bos
I love that because you don't have to fight with the time zones either. You just, like, look at your phone. Where do I need to be right now? Although when I was there, I literally missed the boat.
Scott wonders why we need sitemaps and what should be in them nowadays
Wes Bos
Yeah. That's good. And you've built the initial sitemap for the Syntax website as well, and that was really nice because I've been going through... not anymore, but probably over the last 6 months, I've been watching the Google Webmaster Tools trying to get our content indexed. Since we made, like, a pretty major shift in terms of, like, additional pages, there was a lot more content on the website than on the old one. It was kinda interesting to see, like, how do you tell Google, hey.
Wes Bos
There's now, what, an extra 1500 pages on this website.
Sitemaps don't help ranking but help crawlers find relevant content
Wes Bos
and understand, like, what the general structure of your application is without it having to guess. Right? Yeah. Yeah. You can't use it to trick Google into indexing pages that are not linked from anywhere. Like, Google still has to be able to find it: this is a page you're telling me about, but where have you linked to it from? Right? Is it being linked from another website? Is it being internally linked from inside of the page? Like, for us, it was the transcript page, which is... Mhmm. It was a brand new page, and I wanted all of that to be indexed because it's a lot of good information. And that's, like, very good for SEO if you're searching for a specific topic. In fact, I find that when I Google for a specific Syntax episode, the transcript page will often come up before the actual show notes page because the transcript page has literally every word we've spoken inside of it. Yeah. But, initially, I had a hard time getting those, like, indexed by Google. And it was a mix of, like, how often is it updated, should I be crawling this page, and all the stuff we'll talk about today.
Plain text sitemap is just URLs, one per line
Wes Bos
And does a sitemap have to be... I might be getting ahead of ourselves right now, but does a sitemap have to be named sitemap.xml? Or... That's a good question. Like, a meta tag that you can put?
Sitemap can be named anything, not just sitemap.xml
Wes Bos
That's really handy if, for whatever reason, you don't have control over top-level routes because your application doesn't allow you to do that. It would be nice to be able to do that. I probably would still try my darnedest to make it sitemap.xml because, like robots.txt, it's a standard.
XML sitemap is most flexible and allows more metadata
Wes Bos
Yeah. If you go and peruse... I do this a little bit myself to find unlisted URLs on websites.
Wes Bos
Like, my wife was really excited about this dress coming out once, so I wrote a little scraper that would download the sitemap every so often. The sitemap often lists even all the images that were uploaded to the website, all of the pages that are on the website. And, often, those pages are public, but they're not linked anywhere just yet. So it's kinda security by obscurity. So you can download the sitemap and see all of the pages of the website, and you can sort of peruse through that looking for unlisted pages.
Wes Bos
But often, especially with, like, Shopify websites, you'll find sort of like an index sitemap, and then it links off to a tags sitemap and a product sitemap and a blog post sitemap. Each one has its own sitemap.
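The index-plus-children layout described here can be sketched roughly as follows. This is a minimal illustration, not how Shopify actually generates its files: the `ChildSitemap` shape, the `buildSitemapIndex` helper, and the example.com URLs are all assumptions. The `sitemapindex` element and namespace come from the sitemaps.org protocol.

```typescript
// Hypothetical description of one child sitemap; field names are illustrative.
interface ChildSitemap {
  loc: string;      // absolute URL of the child sitemap file
  lastmod?: string; // optional date string such as "2024-07-08"
}

// Escape the five characters that are special in XML text.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build a sitemap index document that points at each child sitemap.
function buildSitemapIndex(children: ChildSitemap[]): string {
  const entries = children.map((child) => {
    const lastmod = child.lastmod
      ? `<lastmod>${escapeXml(child.lastmod)}</lastmod>`
      : "";
    return `  <sitemap><loc>${escapeXml(child.loc)}</loc>${lastmod}</sitemap>`;
  });
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    "</sitemapindex>",
  ].join("\n");
}

// Example: one index linking to product, tag, and blog sitemaps (made-up URLs).
const indexXml = buildSitemapIndex([
  { loc: "https://example.com/sitemap-products.xml", lastmod: "2024-07-08" },
  { loc: "https://example.com/sitemap-tags.xml" },
  { loc: "https://example.com/sitemap-blog.xml" },
]);
console.log(indexXml);
```

Each child file is then an ordinary sitemap of its own, so one section of a site can be regenerated without touching the others.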
Last mod date is the only sitemap field search engines use now
Wes Bos
Priority, change frequency, and last modified.
Wes Bos
I would say, like, priority doesn't matter because the days of telling Google what's important are over. They can figure that out themselves.
Wes Bos
I'm gonna say frequency is important, because you might have a page that is frequently updated and needs to be reindexed every hour, or something that's like a blog post that you'll never update again. The answer is that change frequency
We should update the Syntax sitemap fields
Wes Bos
are on the Syntax one. And you're telling us you only need last mod?
Getting all Syntax content indexed recently got much easier
Wes Bos
the Syntax website, and it's crazy looking at the webmaster tools over, like, the last 6 months, a year, getting all of the pages indexed and finally getting to a point where Google knows about every single page. Because, like, even when we migrated, there was a point where, like, you couldn't find specific episodes on Google. Like, it was not finding them at all, so we had to really work at that. But then Google changed their algorithm recently, and I posted a tweet about this. We mentioned it on the last episode as well.
Wes Bos
The amount that we're showing up in search results is just... we went right up with that algorithm change. So you Sanity.
Wes Bos
We're not even doing the best practice here, and Google's, like, obviously showing our stuff a lot more frequently,
Wes Bos
URLs on the Syntax website is we have /shows, and that needs to be indexed.
Parameters and future/unpublished pages should not be in the sitemap
Wes Bos
And then we also have /shows with type equals hasty, tasty, or supper. That needs to be indexed.
Wes Bos
But the pages of every single one of them, like page 1, page 2, etcetera, those don't need to be indexed because... well, no. The pages do need to be indexed, but some of the search filters do not need to be indexed. And I remember I had to write a very complex thing to sort of figure out what the canonical URL was because there's unlimited combinations of the query params of, like, pages, how many per page. That was the other one. And a couple other filters that, like, there's unlimited. And if you go into the Google Webmaster Tools, it says something like 6,000 pages are not being indexed... Yeah. Because you told us not to.
Wes Bos
And I was like, good. Like, those yeah. We don't want you to index the page 4 of 15 per page.
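The idea of deciding which query-param combinations are worth indexing can be sketched with an allow-list. This is a simplified illustration and not the "very complex thing" described above. The parameter names `type`, `page`, and `perPage`, the example.com host, and both helper functions are assumptions made for the example.

```typescript
// Assumption: only these query parameters lead to pages worth indexing.
const INDEXABLE_PARAMS: readonly string[] = ["type", "page"];

// True when the URL carries no query parameters outside the allow-list.
function isIndexable(rawUrl: string): boolean {
  let indexable = true;
  new URL(rawUrl).searchParams.forEach((_value, key) => {
    if (INDEXABLE_PARAMS.indexOf(key) === -1) indexable = false;
  });
  return indexable;
}

// Rebuild the URL with only allow-listed parameters, in a fixed order,
// so equivalent URLs collapse to a single canonical string.
function canonicalUrl(rawUrl: string): string {
  const url = new URL(rawUrl);
  const kept = new URLSearchParams();
  for (const key of INDEXABLE_PARAMS) {
    const value = url.searchParams.get(key);
    if (value !== null) kept.set(key, value);
  }
  url.search = kept.toString();
  url.hash = "";
  return url.toString();
}

// Parameter order is normalized:
console.log(canonicalUrl("https://example.com/shows?page=2&type=hasty"));
// → https://example.com/shows?type=hasty&page=2

// "Page 4 of 15 per page" style URLs are flagged as not worth indexing:
console.log(isIndexable("https://example.com/shows?page=4&perPage=15"));
// → false
```

One way to use the two together: emit the canonical URL in the page's canonical link tag, and mark anything that fails `isIndexable` as noindex so it is reported as excluded on purpose.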
Wes Bos
things that you have, like, being blocked in your robots.txt. I got one more here, and this is a problem we had: these shows.
Only published, non-redirect pages should be included
Wes Bos
Basically, the way that we create our sitemap is we just query the database for all the shows, and we query the database for all the guests. And, basically, anything that's a page, we just query it and use a function to generate the URL for it. Right? But in that case, we forgot to filter out future shows, and it was telling Google, hey. There's a page here.
Wes Bos
And then Google would go to that URL, and it would find this page is coming soon.
Wes Bos
And that was a bit of an issue because when it was published, then Google would not know about the content until it eventually crawled that page again. So we had to filter that out and say,
Hand-writing a large sitemap takes too much effort
Wes Bos
Yeah. Like, they have this concept of pages. If it's, like, totally from scratch, like your personal website or the Syntax website where, like, there's no concept of a page, right? You can just, like Scott said, concatenate a string and throw it out the door. I would probably keep an array of pages and just store them as, like, objects and then grab some sort of, like, JSON-to-XML plugin off npm and then convert it on the way out the door.
Store sitemap pages as data objects first before outputting as XML
Wes Bos
Sitemaps are pretty simple, so I don't know if that's overkill versus just concatenating a string or not. But when it comes to, like, oh, did I already add this one? You know, does this URL exist previously? Well, let me search for it in the array.
Wes Bos
If that's the case, then it's sometimes nicer to deal with, like, an actual data object first. And then right before you kick it out the door, convert it to XML because working with XML sucks.
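The "objects first, XML last" approach might look something like the sketch below. The `SitemapPage` shape, the helper names, and the example.com URLs are assumptions for illustration. It also serializes with a small hand-written template rather than the JSON-to-XML package from npm mentioned above, and it emits only `loc` and `lastmod`, the field the episode says search engines still use.

```typescript
// Hypothetical page record; the field names are assumptions for illustration.
interface SitemapPage {
  loc: string;   // absolute URL of the page
  lastmod: Date; // when the page content last changed
}

// Add a page unless its URL is already in the array ("did I already add this one?").
function addPage(pages: SitemapPage[], page: SitemapPage): boolean {
  if (pages.some((existing) => existing.loc === page.loc)) return false;
  pages.push(page);
  return true;
}

// Escape the five characters that are special in XML text.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Convert the plain objects to XML only at the very end.
function toSitemapXml(pages: SitemapPage[]): string {
  const urls = pages.map((page) => {
    const lastmod = page.lastmod.toISOString().slice(0, 10); // YYYY-MM-DD in UTC
    return `  <url><loc>${escapeXml(page.loc)}</loc><lastmod>${lastmod}</lastmod></url>`;
  });
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...urls,
    "</urlset>",
  ].join("\n");
}

const sitemapPages: SitemapPage[] = [];
addPage(sitemapPages, { loc: "https://example.com/shows/1", lastmod: new Date("2024-07-08") });
// Same URL again: ignored, so the sitemap never lists a page twice.
addPage(sitemapPages, { loc: "https://example.com/shows/1", lastmod: new Date("2024-07-09") });
console.log(toSitemapXml(sitemapPages));
```

Because the pages are plain objects until the last step, filtering out future or unpublished shows is just an array filter before calling `toSitemapXml`.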
Submit sitemap to Bing and Google webmaster tools
Wes Bos
I was just looking at our search console, and it says discovered videos.
Wes Bos
That's probably worth doing. I often wonder that. You know, like, you go to the video tab of Google search, how do you get your video to show up on that tab? I thought it was a mix of, like, the proper XML or... what's that? LD JSON? Yeah. JSON-LD, which is for linking data. That's used sort of like a meta tag. But instead of putting it in the head of the document, you simply just dump the JSON into the body, and Google will pick it up there. But it looks like there's also specific video tags for sitemap.xml, which will tell Google about videos, which is neat.
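As a rough idea of what "dumping the JSON into the body" means for a video, here is a sketch that renders a schema.org `VideoObject` as a JSON-LD script tag. The `EpisodeVideo` shape, the `videoJsonLd` helper, and the sample values are assumptions. The property names (`name`, `description`, `thumbnailUrl`, `uploadDate`, `contentUrl`) are schema.org `VideoObject` properties, but which ones a search engine requires is not covered in the episode.

```typescript
// Hypothetical episode video record; field names are illustrative.
interface EpisodeVideo {
  title: string;
  description: string;
  thumbnailUrl: string;
  uploadDate: string; // ISO 8601 date, e.g. "2024-07-08"
  contentUrl: string;
}

// Render a JSON-LD <script> tag describing the video as a schema.org VideoObject.
function videoJsonLd(video: EpisodeVideo): string {
  const data = {
    "@context": "https://schema.org",
    "@type": "VideoObject",
    name: video.title,
    description: video.description,
    thumbnailUrl: video.thumbnailUrl,
    uploadDate: video.uploadDate,
    contentUrl: video.contentUrl,
  };
  // Escape "<" so text containing "</script>" cannot close the tag early.
  const json = JSON.stringify(data).replace(/</g, "\\u003c");
  return `<script type="application/ld+json">${json}</script>`;
}

console.log(
  videoJsonLd({
    title: "Perfect Sitemaps for SEO",
    description: "Why you need a sitemap and what should be in it.",
    thumbnailUrl: "https://example.com/thumb.jpg",
    uploadDate: "2024-07-08",
    contentUrl: "https://example.com/video.mp4",
  })
);
```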
Wes Bos
Cool. Yeah.
Wes Bos
One more tip I have here is cache them. Your sitemap can be one of the largest files that is accessible on your website. And if they are generated on demand, that can be very taxing on your database. Yes. It's literally querying every single record in your database in a lot of cases and looping over it, or at least pages. And then that file itself is fairly large because it's all text. Right? And it's possibly an attack vector against your bill, both your database bill as well as, if you're using something like Render or Vercel to generate the sitemap.xml and you don't have the proper caching headers on those, then that could be somewhere where somebody could just continually hit it, and it will cause a very large bandwidth bill on your end. So throwing caching headers on it, putting a CDN in front of it, probably a good idea.
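A minimal sketch of both halves of that tip: caching headers so a CDN can serve the file, plus an in-memory memo so repeated requests do not re-query the database. The function names, the one-hour TTL, and the stand-in `build` function are all assumptions. A real setup would depend on the framework and host.

```typescript
// Headers that let browsers and a CDN reuse the sitemap for ttlSeconds.
function sitemapCacheHeaders(ttlSeconds: number): Record<string, string> {
  return {
    "Content-Type": "application/xml; charset=utf-8",
    "Cache-Control": `public, max-age=${ttlSeconds}, s-maxage=${ttlSeconds}`,
  };
}

// Wrap an expensive sitemap builder so it runs at most once per TTL window.
// Caching the promise also means concurrent requests share one build.
function memoizeSitemap(
  build: () => Promise<string>,
  ttlSeconds: number,
  now: () => number = Date.now
): () => Promise<string> {
  let cached: { xml: Promise<string>; expiresAt: number } | null = null;
  return () => {
    const currentTime = now();
    let entry = cached;
    if (entry === null || entry.expiresAt <= currentTime) {
      const xml = build();
      // If the build fails, forget it so the next request retries.
      xml.catch(() => {
        if (cached !== null && cached.xml === xml) cached = null;
      });
      entry = { xml, expiresAt: currentTime + ttlSeconds * 1000 };
      cached = entry;
    }
    return entry.xml;
  };
}

// Usage sketch: `build` stands in for the database queries plus XML generation.
const getSitemapXml = memoizeSitemap(async () => "<urlset></urlset>", 3600);
getSitemapXml().then((xml) => {
  console.log(sitemapCacheHeaders(3600)["Cache-Control"], xml.length);
});
```

The in-memory memo only helps within one long-running process. On serverless hosts, the `Cache-Control` header and a CDN in front do most of the work.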
Cache sitemaps to avoid heavy DB and bandwidth loads
Wes Bos
Grab a T-shirt. Sentry.shop.
Wes Bos
Peace.