
Not sure if this is an Ubuntu or OS X question, but I'll start here. I'll leave it to the mods to move the question to AskDifferent if more appropriate.

I moved a file from Ubuntu to OS X using scp on the Apple machine. I edited the file on the Apple machine. Then I moved the file back, again using scp on the Apple machine.

The filename of the source file was Documents/trettiårsfirarätare.

  • Sourcecode: Documents/trettiårsfirarätare

The filename I got back had the name Documents/trettiårsfirarätare.

  • Sourcecode: Documents/trettia˚rsfirara¨tare

While these might look similar, the letters å and ä are actually different between them. At no point did I change the name of the file.

This makes little technical difference to me; I just changed the name of the file back to what Ubuntu considers å and ä, but it tickled my curiosity.

Can you explain to me why this happened?

Takkat
azzid
  • This issue will likely involve Unicode. *What happens if you **scp** (or equivalent) copy from **OS X** to **Ubuntu** (or Ubuntu to OS X), but run it on the Ubuntu machine?* – david6 Aug 22 '13 at 07:49
  • I looked at this question from a Mac and didn't see any difference between the lines, but when I came back to my Ubuntu laptop I saw the squares immediately, even before Takkat's edit. – Alvar Aug 22 '13 at 08:17
  • I won't try scp-ing from Ubuntu to OS X on Ubuntu, because the Apple machine doesn't have sshd, but scp-ing on OS X is enough to change the file name. I only copied it back and forth once and the name was changed, so it seems that scp is the application changing the name. – azzid Aug 22 '13 at 14:04

1 Answer


In the original name “Documents/trettiårsfirarätare”, the letter “å” is internally represented as U+00E5 LATIN SMALL LETTER A WITH RING ABOVE. This is the common representation of this character. In the filename you got back, it has been turned into the character pair U+0061 LATIN SMALL LETTER A U+030A COMBINING RING ABOVE. This is permissible, but not common; it means decomposing “å” into the base character “a” and a combining diacritic mark. These representations are declared to be canonically equivalent in Unicode; this means that the visual presentation is normally expected to be the same, but it need not be (here, at SO, as viewed in Firefox, it is not; this depends on the font and on the rendering software). Programs may treat the two representations as equivalent, but they need not. In a file system, for example, they might well be treated as different.

Similarly, the letter “ä” gets decomposed to U+0061 LATIN SMALL LETTER A U+0308 COMBINING DIAERESIS.
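To see the two representations concretely, here is a minimal Python sketch that lists the code points of the precomposed and the decomposed “å”. The helper name `describe` and the variable names are my own, chosen for illustration; the code only inspects strings and says nothing about what scp or either operating system actually does.

```python
import unicodedata

def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

composed = "\u00e5"      # "å" as a single precomposed code point
decomposed = "a\u030a"   # "a" followed by a combining ring above

print(describe(composed))
print(describe(decomposed))

# The two strings render alike but are different code point sequences,
# so a plain comparison treats them as different.
print(composed == decomposed)
```

The same check with `"a\u0308"` shows the decomposed form of “ä”.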

The reason for this is not obvious. Possibly some software “thinks” it should convert strings to a normalization form that decomposes all decomposable characters, probably Unicode Normalization Form D (NFD).
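As a rough illustration of what such a conversion would do, the following Python sketch applies NFD to the base filename and then converts it back with NFC. It assumes that NFD is indeed the form involved, which is only a guess; the variable names are illustrative.

```python
import unicodedata

# The base filename with precomposed å and ä, as on the Ubuntu side.
original = "tretti\u00e5rsfirar\u00e4tare"

# Decompose every decomposable character (NFD), then recompose (NFC).
nfd_name = unicodedata.normalize("NFD", original)
nfc_name = unicodedata.normalize("NFC", nfd_name)

# NFD adds one combining mark for each of å and ä.
print(len(original), len(nfd_name))

# The decomposed name no longer compares equal to the original ...
print(original == nfd_name)

# ... but recomposing it restores the original code points.
print(original == nfc_name)
```

If the name that came back really is plain NFD, normalizing it to NFC before comparing or renaming would give back the name Ubuntu started with.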

The rest is a bit more mysterious. In what you specify as “Sourcecode” for the filename you got back, “Documents/trettia˚rsfirara¨tare”, the decomposed forms have been munged: the combining diacritic marks have been replaced by their spacing clones, the characters “˚” and “¨”. This is not normal, and it changes both the identity of the data and its rendering.
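The difference between a combining mark and its spacing clone can be checked with Python's `unicodedata` module. This is only a sketch with illustrative variable names; it shows that the spacing clone U+02DA RING ABOVE is a separate symbol character, so recomposing with NFC does not turn “a˚” back into “å”.

```python
import unicodedata

combining = "a\u030a"   # a + COMBINING RING ABOVE (canonically equivalent to å)
spacing = "a\u02da"     # a + RING ABOVE, the spacing clone

for label, text in (("combining", combining), ("spacing", spacing)):
    mark = text[1]
    # category: "Mn" is a nonspacing mark, "Sk" is a modifier symbol.
    # combining(): nonzero only for characters that attach to a base.
    print(label, unicodedata.name(mark),
          unicodedata.category(mark), unicodedata.combining(mark))

# Only the true combining sequence recomposes to the precomposed letter.
print(unicodedata.normalize("NFC", combining) == "\u00e5")
print(unicodedata.normalize("NFC", spacing) == "\u00e5")
```

So once the marks have been replaced by spacing clones, canonical normalization alone cannot repair the name.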

  • The SourceCode part was not added by me. I can see that there is a difference between the letters in their visual representation, the first å has a smaller ring than the second, but other than that the strings look the same. They are not equivalent when using bash tab completion though. – azzid Aug 22 '13 at 13:37
  • Actually *å* is a letter on its own, it's not just an *a* with a diacritic mark, just like *h* is a letter and not just an *n* with a diacritic mark. – kasperd Dec 16 '18 at 23:09