A Conversation for Digital Codes

Couple of Comments

Post 1

manolan


Excellent article. Very clear and well-structured.

Two things, though:

1. 'Delimiter' is 'er', not 'or'.
2. I wonder if you should mention SGML rather than (or as well as) XML.


Post 2

Martin Harper

Perhaps 'delimitor' is the American spelling? I'm sure I got it from somewhere...

I confess I know next to nothing about SGML, or how it differs from XML. Could you enlighten me? ;-)


Post 3

Ausnahmsweise, wie üblich (Consistently inconsistent)

Hi, good job. It's a huge subject and you've summarized it nicely. Over here ... http://www.bbc.co.uk/h2g2/guide/F20631?thread=88472 ... we've been talking about Huffman compression in more detail. What about MIME and UUENCODING, for transmitting binary data as characters in emails? And there's base64 for representing binary in the HTTP protocol and, I think, within an XML document.


Post 4

Martin Harper

I am pleased to say I know absolutely nothing about MIME, UUEncoding, and base64. Do I get a prize? :-)


Post 5

Ausnahmsweise, wie üblich (Consistently inconsistent)

Hi,

MIME I'd have to look up to refresh my memory.

UUENCODE and UUDECODE date back to Unix but are also used for encoding email attachments. You take six bits of binary at a time and pad them out to a full byte by putting two zero bits in front of that bit pattern.

So, you could have anything ranging from 0x00 to 0x3F from each 6 bits of binary. You add 0x20 to that putting it in the displayable range for ASCII (0x20 - 0x5F). The result is about 33% larger than the original, because 3 bytes (24 bits) become 4 bytes. The receiver does the opposite: consumes a byte at a time, subtracts 0x20 and moves the lower order 6 bits to the destination, stringing them together to form the binary file.
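For anyone who wants to see the bit-shuffling spelled out, here is a minimal Python sketch of the step described above: three bytes in, four printable characters out. The function names are made up for illustration, and it only handles one complete 3-byte group; real uuencode also adds a length character per line and the begin/end lines.

```python
def uu_encode_group(three_bytes: bytes) -> str:
    """Encode exactly 3 bytes as 4 printable characters (hypothetical helper)."""
    if len(three_bytes) != 3:
        raise ValueError("expected exactly 3 bytes")
    # Pack the 24 bits into one integer, then slice off four 6-bit values.
    n = int.from_bytes(three_bytes, "big")
    sextets = [(n >> shift) & 0x3F for shift in (18, 12, 6, 0)]
    # Adding 0x20 moves each value (0x00-0x3F) into the range 0x20-0x5F.
    return "".join(chr(s + 0x20) for s in sextets)


def uu_decode_group(four_chars: str) -> bytes:
    """Reverse the mapping: subtract 0x20 and string the 6-bit values together."""
    if len(four_chars) != 4:
        raise ValueError("expected exactly 4 characters")
    n = 0
    for ch in four_chars:
        # The mask keeps only the low-order 6 bits of each character.
        n = (n << 6) | ((ord(ch) - 0x20) & 0x3F)
    return n.to_bytes(3, "big")


print(uu_encode_group(b"abc"))
print(uu_decode_group(uu_encode_group(b"abc")))
```

Four output characters for every three input bytes is where the roughly 33% growth comes from.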

I bet you've seen stuff like this inline in an email?

ICAgIOwgIFoGRCDxUVVBRENBU0UgTTtBO0oNClsxXSAgICCmQ29udmVydHMgdGhlIGNhc2Ug
b2YgcXVhZCBuYW1lcyBpbiBhcnJheSAXDQpbMl0gICAgpiBDb252ZXJ0cyBuYW1lcyB0byB1
cHBlcmNhc2UgaWYgRD0xLCB0byBsb3dlcmNhc2UgaWYgRD39MQ0KWzNdICAgIKYgSWYgRD0w
LCBjb252ZXJ0cyBuYW1lcyB0byB0aGUgcHJpbWFyeSBhbHBoYWJldCAo8UFMUEhTWzE7XSkN

Base64 is similar. You take groups of 6 bits from the binary file and use the number as an index into a table containing A-Z, a-z, 0-9, + and /. The = character is used as a pad. So again, it takes 8 bits to represent 6 bits of binary. It's better than uuencode because you have control over the very narrow range of characters used. No risk of weird characters ({}[], etc.) getting translated to something else.
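The table lookup can be sketched in a few lines of Python. This is only an illustration of the idea (the names `ALPHABET` and `b64_encode` are my own), checked against the standard library's `base64.b64encode`; in practice you would just call the library.

```python
import base64
import string

# The 64-entry table: A-Z, a-z, 0-9, then + and /.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def b64_encode(data: bytes) -> str:
    """Illustrative base64 encoder: 6 bits at a time, used as a table index."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        pad = 3 - len(chunk)  # 0, 1 or 2 missing bytes in the final group
        n = int.from_bytes(chunk + b"\x00" * pad, "big")
        chars = [ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        if pad:
            # Characters that cover only padding bytes are written as '='.
            chars[-pad:] = "=" * pad
        out.append("".join(chars))
    return "".join(out)


sample = b"Any binary data"
print(b64_encode(sample))
print(b64_encode(sample) == base64.b64encode(sample).decode("ascii"))
```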


Post 6

manolan


MIME is an e-mail (document) structuring mechanism. MIME = Multipurpose Internet Mail Extensions. The basic idea is to represent the general structure of an e-mail in standard ASCII (7 bits). I don't know all the technical stuff (i.e. how it is defined formally - it runs to 4 or 5 RFCs), but basically MIME extends the simple internet e-mail protocol where the addressing information is contained in a set of header fields. The MIME extensions allow for non-textual data through the use of Content types (e.g. text/html) and multipart messages (e-mail plus attachments).
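To make the "header fields plus content types plus multipart" idea concrete, here is a small sketch using Python's standard `email` package. The addresses, subject, filename and the function name `build_message` are invented for the example; the point is only to show what the resulting structure looks like.

```python
from email.message import EmailMessage


def build_message() -> EmailMessage:
    """Build a text e-mail with one binary attachment (illustrative values only)."""
    msg = EmailMessage()
    # Ordinary header fields carry the addressing information.
    msg["From"] = "alice@example.com"
    msg["To"] = "bob@example.com"
    msg["Subject"] = "Report attached"
    # The body becomes a text/plain part.
    msg.set_content("Plain-text body.")
    # Adding an attachment turns the message into a multipart one.
    msg.add_attachment(
        b"\x00\x01\x02 some binary bytes",
        maintype="application",
        subtype="octet-stream",
        filename="data.bin",
    )
    return msg


message = build_message()
print(message.get_content_type())
for part in message.iter_parts():
    print(part.get_content_type(), part["Content-Transfer-Encoding"])
```

Printing `message.as_string()` shows the whole thing as 7-bit ASCII text, with the binary attachment carried as encoded characters.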

Uuencoding and Base 64 are both common ways of encoding MIME content (strictly speaking they are transfer encodings rather than content types, and Base 64 is the standard one). 'UU' stands for UNIX-UNIX. The description above is basically right, but uuencoded files would never contain lower case letters and each line starts with a control character indicating how many bytes are encoded on the line (max. 45 = 'M'). If I took the file:

newfile.txt
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ

and encoded it:

uuencode newfile.txt newfile.uu

I would get:

begin 664 newfile.uu
M86)C9&5F9VAI:FML;6YO<'%R<W1U=G=X>7I!0D-$149'2$E*2TQ-3D]045)3
M5%565UA96@IA8F-D969G:&EJ:VQM;F]P<7)S='5V=WAY>D%"0T1%1D=(24I+
03$U.3U!14E-455976%E:"FEJ

end


The result is actually about 35% bigger because of control overheads. As you see, you can always recognise a uuencoded file.
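If you want to reproduce the body lines yourself, Python's standard `binascii.b2a_uu` encodes one line (up to 45 bytes) at a time. The sketch below is an assumption about how you might chop a file into lines, not how the `uuencode` program is actually written, and `uu_lines` is a made-up name. It does not produce the `begin` and `end` lines, and the padding characters at the end of the last line can differ between implementations.

```python
import binascii


def uu_lines(data: bytes, line_bytes: int = 45) -> list[str]:
    """Split data into chunks of at most 45 bytes and uuencode each chunk."""
    lines = []
    for i in range(0, len(data), line_bytes):
        chunk = data[i:i + line_bytes]
        # b2a_uu returns the length character, the encoded data and a newline.
        lines.append(binascii.b2a_uu(chunk).decode("ascii").rstrip("\n"))
    return lines


# The same two-line alphabet file as in the example above.
contents = b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\n" * 2
for line in uu_lines(contents):
    print(line)
```

Each full line starts with 'M' (0x20 + 45), and the short final line starts with the character for its own byte count.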

The example in the previous mail looks more like Base 64 encoding to me.

SGML is Standard Generalized Markup Language and is the parent of XML (i.e. XML is a specific implementation of SGML).


Post 7

Ausnahmsweise, wie üblich (Consistently inconsistent)

I stand corrected. You should never see lower case letters in UUENCODED data because they are of course 0x61 and higher, and we're only going to see 0x20 to 0x5F. For speed I copied & pasted the sample off a web page, though. Must have been a bad example.

The 35% is mainly because every 6 bits take 8 bits to represent (that would be 4/3, or 133% of the original). Then, as you say, there are the control characters. They weigh more heavily the smaller the file is, and can bring the total overhead up to 35%.
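A rough way to check the arithmetic, as a Python sketch. The function name and the choice of what to count are my own assumptions: it counts one length character per line plus four characters per 3-byte group, optionally the newlines, and ignores the begin/end lines.

```python
import math


def uu_body_size(n_bytes: int, line_bytes: int = 45, count_newlines: bool = False) -> int:
    """Estimate the size of the uuencoded body for n_bytes of input."""
    full_lines, rest = divmod(n_bytes, line_bytes)
    # Each full line: 1 length character + 4 characters per 3-byte group.
    size = full_lines * (1 + 4 * math.ceil(line_bytes / 3))
    lines = full_lines
    if rest:
        # The final short line is padded up to a whole number of groups.
        size += 1 + 4 * math.ceil(rest / 3)
        lines += 1
    if count_newlines:
        size += lines
    return size


for n in (106, 4_500, 45_000):
    ratio = uu_body_size(n) / n
    print(f"{n} bytes -> {uu_body_size(n)} characters ({ratio:.3f}x)")
```

For a large file the per-line length character lifts the plain 4/3 ratio to 61/45, which is where a figure in the region of 35% comes from; counting the newlines as well pushes it a little higher.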

