Converting a string to slug with JavaScript
090503
Recently I've been working on implementing slugs in my CMS to be able to generate nicer URLs. In order to do so I've created a little JavaScript function that converts a string to a slug. I'll first give you the code and then explain it a bit. [update: 2.07.10 Fixed IE issue][update: 18.09.10 Minor improvements]
Note that you should not trust any JavaScript validation or processing. The submitted data should always be validated on the server. The reason to do JavaScript validation or processing is to provide an enhanced user experience; it is not a security measure.
function string_to_slug(str) {
    str = str.replace(/^\s+|\s+$/g, ''); // trim
    str = str.toLowerCase();
    // remove accents, swap ñ for n, etc
    var from = "àáäâèéëêìíïîòóöôùúüûñç·/_,:;";
    var to   = "aaaaeeeeiiiioooouuuunc------";
    for (var i=0, l=from.length ; i<l ; i++) {
        str = str.replace(new RegExp(from.charAt(i), 'g'), to.charAt(i));
    }
    str = str.replace(/[^a-z0-9 -]/g, '') // remove invalid chars
        .replace(/\s+/g, '-') // collapse whitespace and replace by -
        .replace(/-+/g, '-'); // collapse dashes
    return str;
}
Here's a step by step description:
- The first thing we do is trim the string, that is, remove any whitespace at the beginning and end. The regular expression /^\s+|\s+$/g does exactly that:
  - / marks the start of the regular expression
  - ^\s+ means "one or more whitespace characters at the beginning of the string"
  - | means "or"
  - \s+$ means "one or more whitespace characters at the end of the string"
  - /g ends the regular expression, and sets the global flag (otherwise only one substitution would be performed)
- Next, we convert the string to lower case
- We are going to remove any invalid characters, but first we'll replace any 'special' letters with their 'plain' versions. For example, in Spanish we have á, é and so on, and even though these are not valid characters in a slug, we don't want to simply remove them, so instead we replace them with a, e, etc. The JavaScript has nothing fancy here: it loops over the characters in from and replaces each one with the character at the same position in to.
  Note that I also chose to replace ·/_,:; with dashes (the first dot is the middle dot, used for example in Catalan). I think this will generate better slugs than if we simply removed these characters.
  You might need/want to adjust this part of the function to suit your needs (your language might have other symbols that I haven't included here).
- Now we're ready to remove any remaining invalid characters. The regular expression /[^a-z0-9 -]/g will match any character that is not a lowercase letter, a digit, a space or a dash. I won't explain this regexp in detail, this post is getting way too long! :) Do a search for "character classes", there's plenty of info around. Note that we include spaces as a valid character. Don't worry, we'll get rid of them in the next step. We can't just remove them from the string, because we want to replace them with dashes.
- Now it's time to replace any spaces with dashes. But we'll collapse any whitespace as well, so multiple spaces will be converted to a single dash. The expression /\s+/g should be easy if you understood the one about trimming the string.
- Almost there! The expression /-+/g matches any series of consecutive dashes (which may occur as a result of the previous substitutions), so we replace it with a single dash. Job done!
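To make the steps above concrete, here is a small trace that applies each substitution separately to one sample string. The sample input and the variable names s0 to s6 are only illustrative assumptions; the substitutions themselves are the ones from the function.

```javascript
// Same character tables as in string_to_slug.
var from = "àáäâèéëêìíïîòóöôùúüûñç·/_,:;";
var to   = "aaaaeeeeiiiioooouuuunc------";

var s0 = "  Año Nuevo: ¡Feliz 2010!  ";          // illustrative input
var s1 = s0.replace(/^\s+|\s+$/g, '');           // "Año Nuevo: ¡Feliz 2010!"
var s2 = s1.toLowerCase();                       // "año nuevo: ¡feliz 2010!"

// Swap accented letters and some punctuation for plain equivalents.
var s3 = s2;
for (var i = 0, l = from.length; i < l; i++) {
    s3 = s3.replace(new RegExp(from.charAt(i), 'g'), to.charAt(i));
}                                                // "ano nuevo- ¡feliz 2010!"

var s4 = s3.replace(/[^a-z0-9 -]/g, '');         // "ano nuevo- feliz 2010"
var s5 = s4.replace(/\s+/g, '-');                // "ano-nuevo--feliz-2010"
var s6 = s5.replace(/-+/g, '-');                 // "ano-nuevo-feliz-2010"
```

Note how the double dash only appears after the whitespace step, which is why collapsing dashes has to come last.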
There's room for improvement. For instance, we could replace the & sign with "and", but that causes problems on multilingual sites. One could detect the language being used and substitute the appropriate word, but that seems a bit overkill to me... As it is, this should generate nice slugs in most cases.
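For what it's worth, the per-language ampersand idea could look roughly like the sketch below. Everything in it is an assumption for illustration: the helper name replace_ampersand, the AMPERSAND_WORDS table and its language codes are hypothetical, and how the current language gets detected is left out entirely. It would have to run before string_to_slug, since the function strips & as an invalid character.

```javascript
// Hypothetical lookup table: language code -> word that replaces "&".
var AMPERSAND_WORDS = { en: 'and', es: 'y', ca: 'i' };

// Hypothetical helper: replaces every "&" (and any whitespace around it)
// with the word for the given language, surrounded by single spaces.
// For an unknown language the string is returned unchanged.
function replace_ampersand(str, lang) {
    if (!Object.prototype.hasOwnProperty.call(AMPERSAND_WORDS, lang)) {
        return str;
    }
    return str.replace(/\s*&\s*/g, ' ' + AMPERSAND_WORDS[lang] + ' ');
}

var english = replace_ampersand('Tom & Jerry', 'en'); // "Tom and Jerry"
var spanish = replace_ampersand('Tom & Jerry', 'es'); // "Tom y Jerry"
```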
Posted in: English, Web, JavaScript
Tags: regular expressions
Hi, Thanks for your post. Saved me a lot of time and work that I would have had if I had to convert my similar function from Ruby to JS. http://snippets.dzone.com/posts/show/2384
Thanks again for sharing,
Tiago
Looks great except does not work in IE browsers.
You're right David, I'll have to look into that. The strange thing is, I remember it working in my CMS... I'll try to sort it out soon.
Awesome! Thank you. Saves me having to try get my head around more
regex... ;)
Glad you find it useful Shane. But make sure you test it in IE (see comments 2 and 3). I still haven't gotten around to checking that out.
@David Prek, I managed to fix that, there was a problem with Regular expressions that was triggering an Out of memory message. In fact I don't know why I was using a regular expression in the loop, it's not necessary at all. I must have done it for a reason, but obviously it was not a _good_ reason. :)
Hi! Thanks for sharing this snippet. Here are two little fixes for it. Replace the line:
str = str.replace(from[i], to[i]);
with:
str = str.replace(new RegExp(from[i], 'gi'), to[i]);
since the original one will replace just the first occurrence of the given char. Also, with the 'gi' modifiers you don't need to repeat UPPER and lower case chars, so you can change:
var from = "ÀÁÄÂÈÉËÊÌÍÏÎÒÓÖÔÙÚÜÛàáäâèéëêìíïîòóöôùúüûÑñÇç·/_,:;";
to:
var from = "ÀÁÄÃÂÈÉËÊ?ÌÍÏÎ?ÒÓÖÔÕÙÚÜÛ?ÑÇ·/_,:;";
(and its pair, respectively). Enjoy!
@Thomas Lopes: actually, you can't do that, it will crash IE6/7 (see comment #2 by David Prek). In fact that was my first version of the script, but I later got rid of the regular expression (see comment #6, btw thanks to Pat Allan for the fix). Another consideration is that RegExp replacement might be less efficient (although in this case that probably wouldn't be a problem, this is not likely to be a function that's executed repeatedly).
But your comment made me think about the function, and I've made it a bit more compact (post has been updated): now it doesn't need to replace upper and lower case characters.
Hey, thanks to myself for writing this, now I can reuse it in my current Rails project. :)
This script does not replace repeated non-Latin letters, i.e.: xxxààà will become xxxa because of str.replace behaviour. You should use a RegExp object to replace these strings globally: for (var i=0, l=from.length ; i
@Paulius: you're absolutely right, gotta fix that. The problem is that using a RegExp crashes IE6 (see comment 6 - it was a good reason after all).
@Paulius: in fact it crashes both IE6 and IE7 (but not IE8). Here's the code I'm using:
slug = slug.replace(new RegExp(from[i], 'g'), to[i]);
Maybe there is a more efficient way to do this? Is this definitely an IE bug?
Btw, sorry for the lack of proper formatting in comments, I know it's annoying. It works well when I post comments though, another thing to look at...
@Paulius: found it! The problem wasn't the RegExp, but the square bracket notation for strings. Now it works, with:
str = str.replace(new RegExp(from.charAt(i), 'g'), to.charAt(i));
Maybe you were already suggesting that, since your comment got cut half-way through.
Polish language support added:
var from =
"àáäâèéëêìíïîòóöôùúüûñç·/_,:;??????ó??"; var to =
"aaaaeeeeiiiioooouuuunc------aceslnozz";
@DOgi: sorry, the polish characters got all messed up. Feel free to email me (blog AT dense13 DOT com), and I'll fix it.
Edit: Mmh, not easy, I can't just add them here, some of the characters you sent still get ignored (not sure if it's a WordPress or a browser issue). I'll try to sort that out.
Thanks a lot! Nice snippet
Thank you so much for this, I was going mad to achieve exactly this
result in sanitizing a string (but didn't succeed of course...).
You saved my day :) Muchissimas gracias!! Flavia
replace line 8 with:
str = str.split(from.charAt(i)).join(to.charAt(i));
and you've cut down on most of the regexp. It turns out that it's faster too.
@Drew: thanks! I'll try that when I get a chance.
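As a rough sketch of Drew's split/join suggestion, the replacement loop could be pulled into a small helper like the one below. The name replace_chars is hypothetical, and the speed claim has not been measured here, so take it as an assumption that may vary between JavaScript engines.

```javascript
// Hypothetical helper: replaces every occurrence of each character in
// `from` with the character at the same position in `to`, without
// building any regular expressions.
function replace_chars(str, from, to) {
    for (var i = 0, l = from.length; i < l; i++) {
        // split() cuts the string at every occurrence of the character,
        // join() glues the pieces back together with the replacement.
        str = str.split(from.charAt(i)).join(to.charAt(i));
    }
    return str;
}

var from = "àáäâèéëêìíïîòóöôùúüûñç·/_,:;";
var to   = "aaaaeeeeiiiioooouuuunc------";
var result = replace_chars("xxxààà", from, to); // "xxxaaa"
```

Because split works on every occurrence, this also covers the repeated-letter case from the earlier comment.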
I've written a very extensive "slugify.js" that binds directly to the String object within Javascript. It's quite robust because it handles any character, in any language (see the comments in the link below):
https://gist.github.com/demoive/4249710
for archive, there is a module available for browser & server
(nodejs,...) https://github.com/pid/speakingurl
you can trim front/back dashes at the end: "my fine title!" ->
"my-fine-title-"
@Dimitar Raev: that's a good idea! I'll try to add it to the script, but no promises, doing very different things these days and not sure if I'll find the time. Thanks!
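A minimal sketch of that final cleanup step, assuming it runs on the slug after the dashes have been collapsed. The helper name trim_dashes is hypothetical and not part of the original function.

```javascript
// Hypothetical helper: strips any dashes left at the start or end of a slug.
function trim_dashes(slug) {
    return slug.replace(/^-+|-+$/g, '');
}

var cleaned = trim_dashes("my-fine-title-"); // "my-fine-title"
```

The pattern mirrors the trim expression used at the top of the function, with a dash in place of \s.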
Great! but it has a problem.
It's not working for Persian or Arabic languages!
Any idea, please?