The syntax of an “email address” is defined by the ‘Mailbox’ grammar rule from Section 4.1.2 of RFC-5321 and extended to include Unicode characters (aka “internationalized”) in Section 3.3 of RFC-6531.
An email “address” is a character string that identifies a user to whom mail will be sent or a location into which mail will be deposited.
The general format is Local-part@Domain, where the Local-part can be a Dot-string (e.g. joe.q.public) or a Quoted-string (e.g. "Joe Q. Public"), and the Domain can be either a DNS domain name qualified up to the top level but without the trailing dot (‘.’) for the root node (e.g. example.com), or an IP address literal in brackets (e.g. [192.168.3.4]).
A Dot-string is one or more strings joined by dot (Full Stop ‘.’) characters. The characters that can be used in the string part of the Dot-string (called an ‘atom’) include letters, digits, and any of ‘!#$%&'*+-/=?^_`{|}~’. These are the printable characters excluding those (‘()<>[]:;@\,."’) with special meaning within either the SMTP Protocol or the Internet Message Format.
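As a minimal sketch of the Dot-string rule described above, the Python regular expression below accepts one or more atoms joined by single dots. It is ASCII-only (the RFC-6531 extension is not covered), and the names ATEXT, DOT_STRING, and is_dot_string are illustrative choices, not taken from either RFC.

```python
import re

# One atom character: a letter, a digit, or one of !#$%&'*+-/=?^_`{|}~
ATEXT = r"[A-Za-z0-9!#$%&'*+\-/=?^_`{|}~]"

# Dot-string: one or more atoms joined by single dots.
DOT_STRING = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")


def is_dot_string(local_part: str) -> bool:
    """Return True if local_part is a syntactically valid ASCII Dot-string."""
    return DOT_STRING.fullmatch(local_part) is not None


for candidate in ["joe.q.public", "name-", ".user", "user.", "john..doe"]:
    print(candidate, is_dot_string(candidate))
```

A leading dot, a trailing dot, or two dots in a row all fail because every dot must sit between two non-empty atoms.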
Within a quoted string, any graphic character or space is permitted, except that the double-quote and backslash characters must be escaped with a backslash. The backslash can be used to escape other characters, but that does not change their meaning; a ‘\x’ is the same as a plain ‘x’.
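The escaping rule above can be sketched in Python as follows. This is an ASCII-only illustration; the function name unquote and the choice to raise ValueError on bad input are assumptions made for the example.

```python
import re

# Unescaped content: space or printable ASCII, except '"' and '\'.
QTEXT = r"[\x20-\x21\x23-\x5B\x5D-\x7E]"
# Escaped content: a backslash followed by a space or any printable ASCII.
QPAIR = r"\\[\x20-\x7E]"
QUOTED_STRING = re.compile(rf'"(?:{QTEXT}|{QPAIR})*"')


def unquote(local_part: str) -> str:
    """Validate a Quoted-string and return its content with escapes removed."""
    if QUOTED_STRING.fullmatch(local_part) is None:
        raise ValueError(f"not a valid Quoted-string: {local_part!r}")
    # Drop the surrounding quotes, then replace each '\x' with plain 'x'.
    return re.sub(r"\\(.)", r"\1", local_part[1:-1])


print(unquote('"Joe Q. Public"'))
print(unquote('"user \\" with"'))
```

Since an escaped character means the same as the plain one, comparing two quoted local-parts is safest after unquoting both.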
The rules for the Domain part are much the same as for other uses of domains on the Internet.
The interpretation of the Local-part is up to the domain owner: they can set any rules or policies within the defined syntax.
Some domain owners define “Foo.Bar@example.com” as the same mailbox as “foo.bar@example.com”; and some define “foo.bar+baz@example.com” as the same mailbox as “foobar@example.com”.
Message Transfer Agents (MTAs) use a protocol (aka “language”) called Simple Mail Transfer Protocol. This language is defined in RFC-5321, and extended to include Unicode characters (aka “internationalized”) in RFC-6531.
The syntax used by MTAs for email addresses on the Internet is the ‘Mailbox’ grammar rule from Section 4.1.2 of RFC-5321. I believe this is what most people think of as an “email address.”
Sometimes people refer to the addresses provided in the transport protocol that are not part of the message data content as “envelope” addresses.
No. The data content of an email message is specified by RFC-5322 which defines the syntax for header fields such as ‘From:’ and ‘To:’ which include email addresses (called ‘addr-spec’s in RFC-5322), as well as other things like ‘groups’ and ‘display-name’s — these other things are not parts of an email address, and are not used for mail delivery.
Since internet messages allow for comments (enclosed in ‘()’) and white space (including “folding” white space which is a CRLF pair followed by a space or a tab) inserted between most tokens to allow a more free-form layout that might be more pleasing to the human reader, the grammar rules in RFC-5322 include these elements. None of this stuff is supposed to have any semantic value.
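Because RFC-5322 comments can nest, removing them takes a depth counter rather than a simple pattern match. The Python sketch below shows one way to do that; the function name strip_comments is hypothetical, and the code neither unfolds lines nor validates anything.

```python
def strip_comments(value: str) -> str:
    """Remove parenthesized (possibly nested) comments from a header value."""
    out = []
    depth = 0          # current comment nesting level
    in_quotes = False  # inside a quoted-string, where '(' is just a character
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            # Escaped pair: keep it outside comments, drop it inside them.
            if depth == 0:
                out.append(value[i:i + 2])
            i += 2
            continue
        if depth == 0 and ch == '"':
            in_quotes = not in_quotes
            out.append(ch)
        elif not in_quotes and ch == "(":
            depth += 1
        elif not in_quotes and ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
        i += 1
    return "".join(out)


print(strip_comments(r"(a (nested) \(comment) pete@silly.test (his host is silly)").strip())
```

The counter is the part a finite-state machine cannot provide, which is why this step is kept separate from any regular-expression check.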
Additionally, RFC-5322 defines a very lax syntax for domains and address-literals.
Every valid RFC-5321 ‘Mailbox’ is also a valid RFC-5322 ‘addr-spec’.
Software extracting an address from the contents of a message typically ignores display-names, comments, and white space in message header bodies, in this way converting an RFC-5322 ‘addr-spec’ into a valid RFC-5321 ‘Mailbox’.
RFC-5322 also requires acceptance of the various obsolete forms of the local-part (left of the ‘@’ sign) that should no longer be used in SMTP, but still “MUST be accepted and parsed by a conformant receiver.” Data in these forms must be further processed.
The standards for email address syntax that should be used for most (if not all) validation cases are RFC-5321 (+RFC-6531 to add Unicode) — NOT RFC-5322 (+RFC-6532).
Email addresses not conforming to RFC-5321 syntax are almost universally rejected by MTAs and thus can't be used to send or receive mail on the public Internet today.
The grammar in RFC-5321 is a regular language (type-3 in the Chomsky hierarchy) so, in principle, it can be parsed by a regular expression (i.e. a finite-state machine). The grammar in RFC-5322 is recursive, because comments can nest, which makes it context-free (type-2) rather than regular and causes implementers much grief if they don't appreciate that fact.
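To illustrate that the ‘Mailbox’ rule fits in a single regular expression, here is a hedged Python sketch. It is deliberately incomplete: ASCII only (no RFC-6531 extension), no IPv6 or general address literals, and no length limits. Rejecting leading zeros in IPv4 octets is an assumption chosen so that [127.00.0.1] is treated as invalid. All names are illustrative.

```python
import re

ATEXT = r"[A-Za-z0-9!#$%&'*+\-/=?^_`{|}~]"
DOT_STRING = rf"{ATEXT}+(?:\.{ATEXT}+)*"
QUOTED_STRING = r'"(?:[\x20-\x21\x23-\x5B\x5D-\x7E]|\\[\x20-\x7E])*"'

# A label starts and ends with a letter or digit; hyphens only in between.
SUB_DOMAIN = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
DOMAIN = rf"{SUB_DOMAIN}(?:\.{SUB_DOMAIN})*"

# Decimal octet 0-255 without leading zeros (an assumption, see above).
SNUM = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"
IPV4_LITERAL = rf"\[{SNUM}(?:\.{SNUM}){{3}}\]"

MAILBOX = re.compile(
    rf"(?:{DOT_STRING}|{QUOTED_STRING})@(?:{DOMAIN}|{IPV4_LITERAL})"
)


def is_mailbox(address: str) -> bool:
    """Return True if address matches this partial, ASCII-only Mailbox sketch."""
    return MAILBOX.fullmatch(address) is not None


for candidate in ["simple@example.com", "admin@mailserver1", "user@example.com."]:
    print(candidate, is_mailbox(candidate))
```

Using fullmatch rather than search matters here: the whole string has to be a Mailbox, not merely contain one.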
Validation should be done exactly or not (very much) at all. It is especially egregious to reject valid addresses based on “common practice” or the use of “simplified regexes” that developers find easier to implement.
Validation should not reject addresses based on the presence of Unicode. We should not recognize two different formats for addresses, one allowing Unicode and one ASCII only. It's easy to scan an address for code points at or above U+0080 after the validation step if you know you can't work with those addresses. Unicode should be the norm now.
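The post-validation scan mentioned above can be very small. A Python sketch, where the function name has_non_ascii is hypothetical:

```python
def has_non_ascii(address: str) -> bool:
    """Return True if any code point lies outside ASCII (U+0080 and up)."""
    return any(ord(ch) >= 0x80 for ch in address)


print(has_non_ascii("simple@example.com"))
print(has_non_ascii("我買@屋企.香港"))
```

Python's built-in str.isascii() gives the inverse answer, so `not address.isascii()` is an equivalent check.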
TL;DR you don't have to; but if you do, don't fuck it up.
“Validation on the front-end creates more ways to lose rather than more ways to win, and doesn't really protect the backend from vulnerabilities.”
“The user should able to enter an email address verbatim, with no second-guessing by input forms.”
Yes, validation done wrongly is bad. Admittedly many of our real world examples, including the HTML spec, do it wrongly. And admittedly syntactic correctness is of limited use in detecting erroneous user input.
A very common error people make is to enter “some.user@gmil.com” with a missing “a” in the middle. That address is not rejected by any grammar. A server will also accept a message to that address with “250 Ok, queued as […]” so be careful!
If the address is to be validated by sending an opt-in message to that mailbox address, your local submission agent (or bulk sender API, or whatever) will very likely do that syntax check for you. But that does not cover many use cases for syntactic validation; it's not always the case that you are dealing with a person entering their own email address at a form.
There are many use cases for email address validation: you may be entering records from a legacy database into some new system, and flagging bad records before they're put into the new system can be very useful. A syntactically invalid email address is an excellent indication that something is wrong with that record.
As long as no real, working email address is rejected due to syntax validation, it is generally helpful to detect errors sooner rather than later.
A comment by an actual email expert about this topic can be seen here.
simple@example.com
very.common@example.com
disposable.style.email.with+symbol@example.com
admin@mailserver1
" "@example.org
"user \" with"@example.com
"john..doe"@example.org
"<john-doe>"@example.org
"john.doe@example.com"@example.org
john.doe%example.com@example.org
name-@example.org
simple@[127.0.0.1]
simple@[IPv6:::1]
我買@屋企.香港
The domain ‘mailserver1’ is not a fully-qualified DNS domain, but the syntax of RFC-5321 does not forbid single label domain names.
user.with.no.at.domain
"user " with"@example.com
user@[300.0.0.1]
Manuéla@[127.0.0.0.1]
moe@[127.0.1]
Jacqueline@[127.00.0.1]
user@example.com#
<user@example.com>
user@example.com.
.user@example.com
user.@example.com
john..doe@example.com
foo\ bar@example.com
foo.bar@bad=domain.com
To: "Joe &
J. Harvey" < JJV@BBN.com >
The above ‘To:’ header field has a ‘display-name’ of "Joe & J. Harvey" (including the double quote characters) folded across two lines, followed by an ‘angle-addr’ with an embedded ‘addr-spec’ of “ JJV@BBN.com ” including the white space. You can divine a valid ‘Mailbox’ address from that text (it would be “JJV@bbn.com”) but the full contents of the ‘To:’ header include much more than just an email address.
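As a rough sketch of that extraction, the Python below unfolds the header value, uses the standard library's email.utils.parseaddr to separate the display-name from the addr-spec, and lowercases only the domain. The function name extract_mailbox is hypothetical, the example assumes the continuation line begins with a space (as folding requires), and parseaddr handles only a single address per call.

```python
import re
from email.utils import parseaddr


def extract_mailbox(header_value: str) -> str:
    """Pull one address out of a header value and lowercase its domain."""
    # Unfold: drop each line break that is followed by a space or tab.
    unfolded = re.sub(r"\r?\n(?=[ \t])", "", header_value)
    # parseaddr returns (display name, address); the display name is ignored.
    _display_name, addr_spec = parseaddr(unfolded)
    local_part, at, domain = addr_spec.rpartition("@")
    # Only the domain is case-insensitive; the local-part is left untouched.
    return f"{local_part}{at}{domain.lower()}"


raw = '"Joe &\r\n J. Harvey" < JJV@BBN.com >'
print(extract_mailbox(raw))
```

The result still needs to pass a ‘Mailbox’ syntax check; this step only discards the parts of the header that are not an address.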
The contents of email message headers can also include liberal use of comments (in parens) and (potentially “folding”) white space; which are also not parts of an email address. Some examples from the original RFC-822 document:
To: ":sysmail"@ Some-Group. Some-Org,
Muhammed.(I am the greatest) Ali @(the)Vegas.WBA
Bcc: (Empty list)(start)Hidden recipients :(nobody(that I know));
Those including comments and white space:
(a (nested) \(comment) pete@silly.test (his host is silly)
Obsolete forms MUST be accepted and parsed says RFC-5322.
MUST."be"."accepted"@example.com
Lax requirements for domains:
jane.doe@-lax-domain-
Very lax requirements for address literals:
jane.doe@[Almost anything goes here!]
The Wikipedia page gets the basic standard for email address syntax wrong up front, citing ‘addr-spec’ from RFC-5322. It then proceeds to mix in RFC-5321, yet still makes many factual misstatements about address syntax.
The Wikipedia page has been wrong about this for decades, and I suspect that is the reason many others have made the same mistake. Such are the dangers of taking Wikipedia as authoritative without a thorough reading of the source documents.
Please see the talk page for my attempts to get this corrected.
“This requirement is a willful violation of RFC 5322, which defines a syntax for email addresses that is simultaneously too strict (before the "@" character), too vague (after the "@" character), and too lax (allowing comments, whitespace characters, and quoted strings in manners unfamiliar to most users) to be of practical use here.”
The HTML spec also gets the basic standard wrong. Even worse, they realize that the grammar of RFC-5322 is not what they want but, apparently unaware that the ‘Mailbox’ grammar rule from RFC-5321 is what they're looking for, they make up their own grammar for email addresses and include it in the spec.
See this comment and the follow-up comment for my attempt to get this issue worked out.