Base64 Encoding

Abstract

Nowadays, we all send emails with binary attachments such as images. However, the mail protocol was designed for plain ASCII text only. So how does it work? Thanks to Base64 encoding, attachments are transformed from their binary format into a text representation when sent, and then back into their original binary format when received. This article describes the principle of Base64 encoding, then illustrates it with a simple Java implementation. (August 2003).

Base64 content-transfer-encoding, commonly called Base64 encoding, is defined in RFC 2045 [4]. It is a method designed to represent an arbitrary sequence of octets (8-bit bytes) in a printable text form. As a result, Base64 encoding allows binary data to pass through channels that are designed for plain ASCII text, such as SMTP [3] [2]. It also allows binary data to be embedded in media that support ASCII text only, such as XML files (see reference [6] for how to do this).

Alphabet

An alphabet of 64 encoding characters is used (hence the name Base64), so each encoding character represents a 6-bit value (2^6 = 64). The alphabet is chosen as a printable subset of US-ASCII. Table 1 shows the Base64 alphabet, with the correspondence between the values and the encoding characters. A 65th character, '=', is used only for padding and carries no value.

**Table 1** - Base64 alphabet.
Value	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16
Character	A	B	C	D	E	F	G	H	I	J	K	L	M	N	O	P	Q
Value	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31	32	33
Character	R	S	T	U	V	W	X	Y	Z	a	b	c	d	e	f	g	h
Value	34	35	36	37	38	39	40	41	42	43	44	45	46	47	48	49	50
Character	i	j	k	l	m	n	o	p	q	r	s	t	u	v	w	x	y
Value	51	52	53	54	55	56	57	58	59	60	61	62	63	(pad)
Character	z	0	1	2	3	4	5	6	7	8	9	+	/	=

The subset for the Base64 alphabet is carefully chosen so that it is represented identically in all versions of ISO 646 [1], including US-ASCII, and in all versions of EBCDIC.
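Because the alphabet in Table 1 consists of three contiguous ASCII runs plus two symbols, the mapping from a 6-bit value to its character can also be computed rather than looked up. The following is a minimal sketch of that idea; the class name AlphabetDemo and the method toChar are hypothetical names chosen for illustration.

```java
// Hypothetical helper class; the names are illustrative and not from the article.
class AlphabetDemo {
    static final String ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    /** Returns the encoding character for a 6-bit value (0 to 63). */
    static char toChar(int value) {
        if (value < 0 || value > 63) {
            throw new IllegalArgumentException("value out of range: " + value);
        }
        if (value < 26) return (char) ('A' + value);         // 0..25  -> A..Z
        if (value < 52) return (char) ('a' + (value - 26));  // 26..51 -> a..z
        if (value < 62) return (char) ('0' + (value - 52));  // 52..61 -> 0..9
        return value == 62 ? '+' : '/';                      // 62 and 63
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int v = 0; v < 64; v++) sb.append(toChar(v));
        System.out.println(sb.toString().equals(ALPHABET)); // true
        System.out.println(toChar(49));                      // x
    }
}
```

In practice a lookup array is usually simpler and at least as fast; the computed form is shown only to make the structure of the table explicit.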

Encoding

The encoding process consists of representing groups of 3 input octets (24 bits) as output strings of 4 encoded characters. Let's consider the input as a linear stream of octets. Proceeding from left to right, the input is divided into 24-bit groups, each formed by 3 consecutive octets of the input stream. Each 24-bit group is then treated as 4 concatenated 6-bit groups. Each 6-bit group is a binary number representing a decimal value between 0 and 63. That value is used as an index into the array of the Base64 alphabet shown in Table 1, and the corresponding encoded character is placed in the output string.

As a simple example, let's consider as input a sequence of 3 octets whose decimal values are 197, 22 and 251. The 24-bit group formed from these 3 octets is 110001010001011011111011. It is treated as 4 6-bit groups: 110001, 010001, 011011 and 111011. Their respective decimal values are 49, 17, 27 and 59. Using the Base64 alphabet, the resulting output string is xRb7. Table 2 decomposes the encoding process, the first row being a representation of the input in ISO Latin-1 characters.

**Table 2** - Base64 encoding process example.
Å								SYN								û
197								22								251
1	1	0	0	0	1	0	1	0	0	0	1	0	1	1	0	1	1	1	1	1	0	1	1
49						17						27						59
x						R						b						7
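The arithmetic of Table 2 can be reproduced with a few shifts and masks. The sketch below packs the three octets 197, 22 and 251 into one 24-bit group and extracts the four 6-bit values; the class name SextetDemo and its methods are hypothetical and serve only to illustrate the steps.

```java
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative and not from the article.
class SextetDemo {
    static final char[] EN64 =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/".toCharArray();

    /** Packs 3 octets into one 24-bit group and returns its four 6-bit values. */
    static int[] sextets(int o0, int o1, int o2) {
        int group = ((o0 & 0xFF) << 16) | ((o1 & 0xFF) << 8) | (o2 & 0xFF);
        return new int[] {
            (group >> 18) & 0x3F,   // bits 23..18
            (group >> 12) & 0x3F,   // bits 17..12
            (group >> 6) & 0x3F,    // bits 11..6
            group & 0x3F            // bits 5..0
        };
    }

    /** Maps the four 6-bit values of one group to Base64 characters. */
    static String encodeGroup(int o0, int o1, int o2) {
        StringBuilder sb = new StringBuilder(4);
        for (int v : sextets(o0, o1, o2)) sb.append(EN64[v]);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sextets(197, 22, 251))); // [49, 17, 27, 59]
        System.out.println(encodeGroup(197, 22, 251));              // xRb7
    }
}
```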

The Base64 encoding rules specify that the output stream (the resulting encoded characters) must be represented in lines of no more than 76 characters each, where a line break is the sequence CR LF.
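The line-length rule can be checked with a short sketch. The helper wrap and the class name LineLengthDemo are hypothetical. The java.util.Base64 class is the JDK's own encoder (Java 8 and later); it is not part of this article's implementation and is used here only as a cross-check.

```java
import java.util.Base64;

// Hypothetical demo class; the names are illustrative and not from the article.
class LineLengthDemo {
    /** Inserts CR LF after every 76 characters of an unbroken Base64 string. */
    static String wrap(String encoded) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < encoded.length(); i += 76) {
            if (i > 0) sb.append("\r\n");   // break between lines, none after the last
            sb.append(encoded, i, Math.min(i + 76, encoded.length()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = new byte[60];   // 60 octets give 80 encoded characters
        String flat = Base64.getEncoder().encodeToString(data);
        String wrapped = wrap(flat);
        for (String line : wrapped.split("\r\n")) {
            System.out.println(line.length());   // 76, then 4
        }
        // The JDK MIME encoder applies the same 76-character CR LF rule.
        System.out.println(wrapped.equals(Base64.getMimeEncoder().encodeToString(data))); // true
    }
}
```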

Padding

What if the length of the input stream is not a multiple of 3 octets? If fewer than 24 bits are available at the end, zero bits are added on the right to form an integral number of 6-bit groups. Since Base64 input is an integral number of octets, there are 3 possible endings:

The input ends with a whole 24-bit group. The output is a multiple of 4 Base64 encoded characters. No special action is needed.
The input ends with two octets (a 16-bit group). Two zero bits are added to form three whole 6-bit groups, which translate into 3 Base64 encoded characters. One padding character '=' is added to make the output a multiple of 4 characters.
The input ends with one octet (an 8-bit group). Four zero bits are added to form two whole 6-bit groups, which translate into 2 Base64 encoded characters. Two padding characters '=' are added.
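The three endings can be illustrated with a small sketch that encodes only the final group of an input. The class PaddingDemo and the method encodeTail are hypothetical names; the ASCII strings "M", "Ma" and "Man" are used as sample input because their encodings are easy to check by hand.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative and not from the article.
class PaddingDemo {
    static final char[] EN64 =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/".toCharArray();

    /** Encodes a final group of 1, 2 or 3 octets into exactly 4 characters. */
    static String encodeTail(byte[] in) {
        if (in.length < 1 || in.length > 3) {
            throw new IllegalArgumentException("expected 1 to 3 octets");
        }
        int b0 = in[0] & 0xFF;
        int b1 = in.length > 1 ? in[1] & 0xFF : 0;   // a missing octet contributes zero bits
        int b2 = in.length > 2 ? in[2] & 0xFF : 0;
        char[] c = new char[4];
        c[0] = EN64[b0 >> 2];
        c[1] = EN64[((b0 & 0x3) << 4) | (b1 >> 4)];
        c[2] = in.length < 2 ? '=' : EN64[((b1 & 0xF) << 2) | (b2 >> 6)];
        c[3] = in.length < 3 ? '=' : EN64[b2 & 0x3F];
        return new String(c);
    }

    public static void main(String[] args) {
        System.out.println(encodeTail("Man".getBytes(StandardCharsets.US_ASCII))); // TWFu
        System.out.println(encodeTail("Ma".getBytes(StandardCharsets.US_ASCII)));  // TWE=
        System.out.println(encodeTail("M".getBytes(StandardCharsets.US_ASCII)));   // TQ==
    }
}
```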

Decoding

The decoding process is the reverse of the encoding process: each group of 4 encoded characters yields 4 6-bit values, which form a 24-bit group that is then split into 3 octets. In our previous example, the bottom row of Table 2 is the input and the top row is the output.

All line breaks or other characters not in the Base64 alphabet are to be ignored by the decoding software. The same applies to any illegal sequences of characters in the Base64 encoding, such as "====".
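One possible way to follow these rules is to accumulate 6-bit values in a bit buffer and emit an octet whenever 8 bits are available, skipping every character that is not in the alphabet. The sketch below is only one of several reasonable designs. The class DecodeDemo and the method decode are hypothetical names, and as a simplifying assumption the padding character '=' is skipped like any other non-alphabet character rather than validated.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical demo class; the names are illustrative and not from the article.
class DecodeDemo {
    static final String ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    /** Decodes Base64 text, ignoring characters outside the alphabet. */
    static byte[] decode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int buffer = 0;   // holds the bits not yet written out
        int bits = 0;     // number of valid bits currently in buffer
        for (int i = 0; i < text.length(); i++) {
            int v = ALPHABET.indexOf(text.charAt(i));
            if (v < 0) continue;               // CR, LF, '=' and anything else: skip
            buffer = (buffer << 6) | v;        // append one 6-bit value
            bits += 6;
            if (bits >= 8) {                   // a full octet is available
                bits -= 8;
                out.write((buffer >> bits) & 0xFF);
                buffer &= (1 << bits) - 1;     // keep only the leftover bits
            }
        }
        return out.toByteArray();              // leftover zero bits from padding are dropped
    }

    public static void main(String[] args) {
        byte[] d = decode("xR\r\nb7");         // the line break is ignored
        System.out.println((d[0] & 0xFF) + " " + (d[1] & 0xFF) + " " + (d[2] & 0xFF)); // 197 22 251
    }
}
```

A linear search with indexOf keeps the sketch short; a 128-entry reverse lookup table would be the usual choice when speed matters.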

Java Implementation

First things first: let's start by defining the maximum length of the encoded data lines, as well as the Base64 alphabet as an array of char.

1: private static final int LINE_MAX_LEN = 76;
2: private static final char[] EN64 =
3:   "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/".toCharArray();

For simplicity's sake, let's assume in the following code that the input src is a java.io.InputStream and the output dest is a java.io.Writer. Note that for better efficiency, src and dest should in practice be some sort of buffered stream and buffered writer respectively.

The encoding process takes 3 octets from the input and represents them as 4 encoded characters. We will use an array of byte to hold the 3 input octets, and an array of char for the 4 output characters as defined in lines 4-5. Each 3-octet group is read from the src, which is then translated into 4 encoded characters by using the 6-bit value as index into the Base64 alphabet array EN64 (lines 10-16). Note that there could be one or two padding characters '=', depending on the number of octets n read from the input.

 4: byte[] b = new byte[3];
 5: char[] c = new char[4];
 6: int k = 0;
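 7: while (true) {
 8:    // Collect up to three bytes from the source
 9:    Arrays.fill(b, (byte)0);            // needs import java.util.Arrays
10:    int n = src.readNBytes(b, 0, 3);    // Java 9+: blocks until 3 bytes or end of stream
11:    if (n == 0) break;                  // end of stream; available() is only an estimate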
12:    // Convert into 4 base64 characters
13:    c[0] = EN64[(b[0]>>2)&0x3F];
14:    c[1] = EN64[ ((b[0]&0x3)<<4) | (b[1]>>4)&0xF ];
15:    c[2] = (n < 2)? '=' : EN64[ ((b[1]&0xF)<<2) | (b[2]>>6)&0x3 ];
16:    c[3] = (n < 3)? '=' : EN64[(b[2]&0x3F)];
17:
18:    // Ensure that the encoded data have a maximum of
19:    // LINE_MAX_LEN (76) characters per line.
20:    if (k < LINE_MAX_LEN) k += 4;
21:    else {
22:        dest.write("\r\n");
23:        k = 4;
24:    }
25:    dest.write(c, 0, 4);
26: }

Even though RFC 2045 [4] allows output lines of fewer than 76 characters, we will break each output line at exactly 76 characters (lines 20-25). Using shorter lines increases the size of the output, as more CR LF line breaks are needed.
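The effect of the line length on the output size can be estimated with a little arithmetic: 4 characters per 3-octet group (the last one possibly padded), plus 2 characters for each CR LF between lines. The sketch below assumes, as in the code above, that no line break follows the last line; the class EncodedSizeDemo and the method encodedLength are hypothetical names.

```java
// Hypothetical demo class; the names are illustrative and not from the article.
class EncodedSizeDemo {
    /** Output size in characters, CR LF pairs included, for n input octets. */
    static long encodedLength(long n, int lineLen) {
        long chars = 4 * ((n + 2) / 3);                        // 4 characters per started 3-octet group
        long breaks = chars == 0 ? 0 : (chars - 1) / lineLen;  // no break after the last line
        return chars + 2 * breaks;
    }

    public static void main(String[] args) {
        System.out.println(encodedLength(3000, 76)); // 4104
        System.out.println(encodedLength(3000, 40)); // 4198
    }
}
```

For 3000 input octets, 40-character lines cost 94 more characters than 76-character lines, which is the overhead the paragraph above refers to.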

The decoding process works in reverse to the encoding process, and is left as an exercise to the reader.

Conclusion

There are other popular methods for binary data encoding, such as the hexadecimal representation, uuencode or Base85 [5]. However, Base64 is relatively compact and more portable, as its alphabet is represented identically in all versions of ISO 646 [1] and in all versions of EBCDIC. These properties make Base64 a premier cross-platform binary transport encoding method.

References

[1] ISO 646: "Information technology -- ISO 7-bit coded character set for information interchange", 1991.
[2] RFC 822: "Standard for the format of ARPA Internet text messages", David H. Crocker, August 1982.
[3] RFC 821: "Simple Mail Transfer Protocol", Jonathan B. Postel, August 1982.
[4] RFC 2045: "Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies", N. Freed and N. Borenstein, November 1996.
[5] RFC 1924: "A Compact Representation of IPv6 Addresses", R. Elz, 1 April 1996.
[6] Java Tip 117: "Transfer binary data in an XML document", Odysseas Pentakalos, September 2001.

Last updated: 2004-09-11 12:03:08 -0700