Tuesday 23 June 2015

Files with non-standard encodings in Salesforce

This post is about one of my latest challenges at work. It looked like something simple to achieve, but it turned out to be quite difficult...

THE REQUIREMENTS

The requirement was to build attachments in Salesforce that could then be downloaded. Those files would then be imported into third-party software for processing. It doesn't seem to be a difficult task; on paper it is a straightforward thing to achieve...

The problem was that we were dealing with a Korean third party, and the content of the files was in Korean (really funny looking characters from my point of view...). They were plain text files, but written in Korean...

The obvious solution was to provide a button with a link built with URLFOR and pointing to the attachment Id, which would download the file directly from Salesforce.

The following snippet shows the simple solution that would download the attachment in one click:
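
As an illustration only (this is not the original Visualforce markup), the same one-click idea can be sketched in plain JavaScript. It assumes the org serves attachments through the classic file-download servlet path; the function names are hypothetical.

```javascript
// Sketch only: builds a download URL for a Salesforce attachment.
// ASSUMPTION: the classic servlet path below is available in your org;
// the original post used URLFOR in Visualforce markup instead.
function attachmentDownloadUrl(attachmentId) {
  return '/servlet/servlet.FileDownload?file=' + encodeURIComponent(attachmentId);
}

// Hypothetical click handler (browser only): navigating to the URL
// lets the browser download the file exactly as Salesforce stores it.
function downloadAttachment(attachmentId) {
  window.location.href = attachmentDownloadUrl(attachmentId);
}
```

Either way, the file comes down byte for byte as Salesforce stores it, so there is no opportunity to change its encoding on the way.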



Once the file was downloaded, it was ready to be imported by the third-party software. That software provides a preview feature that lets the user see what is going to be imported before submitting.

Unfortunately, the result was not the expected one. Compare the expected result with the actual result in the following images: some characters could not be read successfully by the third-party software... What are those question marks there?

Expected result
Actual Result

THE PROBLEM

After that came investigation time... What might be going wrong? Comparing the text files with comparison utilities did not provide any clues... Checking for additional blank spaces... Checking how the line breaks were implemented... It all looked the same!!!

Finally, after inspecting every single aspect of the files I made a binary comparison of the files with Beyond Compare, and the bytes were not the same for the same shown text. The problem was that the files had different encodings!!!

THE SOURCE OF THE PROBLEM

So we got to the root of the problem: the third party software was expecting an EUC-KR encoding, whereas the file provided by Salesforce was created with a UTF-8 encoding.
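
To make the difference concrete, here is a small sketch that prints the bytes of the same two Korean characters in both encodings. The sample text is my own choice for illustration. The EUC-KR bytes are written out by hand, because the native TextEncoder only produces UTF-8.

```javascript
// Render a byte sequence as space-separated uppercase hex.
function toHex(bytes) {
  return Array.from(bytes, b => b.toString(16).toUpperCase().padStart(2, '0')).join(' ');
}

// Sample text chosen for illustration: the two characters of "Hangul".
const text = '\uD55C\uAE00';

// UTF-8 bytes, produced by the native encoder (three bytes per character here).
const utf8Hex = toHex(new TextEncoder().encode(text));

// EUC-KR bytes for the same characters (two bytes per character),
// hard-coded because the native TextEncoder cannot produce them.
const eucKrHex = toHex(Uint8Array.of(0xC7, 0xD1, 0xB1, 0xDB));

console.log(utf8Hex);  // ED 95 9C EA B8 80
console.log(eucKrHex); // C7 D1 B1 DB
```

The text looks identical on screen, but a program that expects EUC-KR and receives the UTF-8 bytes cannot map them back to the right characters.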

Now, I had to find a way to create my downloadable files with the EUC-KR encoding, and the body of those files was the body of the Salesforce attachments.

Salesforce only uses the standard UTF-8 encoding for all the files. That stopped me from using the solution in the snippet above, because there was no way Salesforce was going to create the files in the format I needed in just one click and one line of code...

But that's where the challenge is, and that's what we developers love!! We love challenges!! And this case became one for me, hence this blog entry...

Another problem I found with this specific case was that EUC-KR is a legacy encoding: browsers do not offer a built-in way to encode text into it from JavaScript.

THE SOLUTION

Thankfully I found this GitHub repository with a polyfill for the Encoding Living Standard's API by inexorabletash (many thanks for your solution). It is an implementation of the TextEncoder API.

The GitHub repo includes the source files for the encoding.js and encoding-indexes.js libraries. The latter is required when working with non-standard encodings, and the repository's README explains how to use both libraries.

This solution has been really helpful for me because it did what I needed, although it didn't work with the exact source code from the repository, so I had to apply a tiny tweak to the code in the encoding.js library.

I was using a version of Chrome that natively supports the TextEncoder API, so the polyfill wasn't being used, and therefore the EUC-KR encoding wasn't being recognised as a valid entry.
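
A quick way to check this is to ask the native encoder for a label and see what it actually gives back. This is a minimal sketch with a hypothetical helper name. It assumes the implementation either falls back to UTF-8 or throws when it is given a label it does not support.

```javascript
// Returns true only if the native TextEncoder honours the requested label.
// ASSUMPTION: unsupported labels either fall back to UTF-8 (current
// behaviour) or throw (some older implementations); both count as "no".
function nativeEncoderSupports(label) {
  if (typeof TextEncoder === 'undefined') {
    return false; // no native implementation at all
  }
  try {
    const encoder = new TextEncoder(label);
    return encoder.encoding.toLowerCase() === label.toLowerCase();
  } catch (e) {
    return false;
  }
}

console.log(nativeEncoderSupports('utf-8'));
console.log(nativeEncoderSupports('euc-kr'));
```

If the second call reports false while a native TextEncoder exists, a polyfill that only installs itself when TextEncoder is missing never gets a chance to run.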

I had to modify the library to implement my TextEncoderCustom API, which was easily done by just replacing the TextEncoder string with the TextEncoderCustom string in the encoding.js file. Once this is done, the browser will use your own implementation of the TextEncoder API, which is your TextEncoderCustom API.

In the following code snippet, the variable uint8array holds the text encoded as EUC-KR and can be saved to a file:
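
A minimal sketch of that step, under a few assumptions: the encoder constructor is passed in (in the page it would be the renamed TextEncoderCustom), the NONSTANDARD_allowLegacyEncoding option is the one described in the polyfill's README (check the version you download), and saveBytes, attachmentBody and the file name are hypothetical.

```javascript
// EncoderCtor is whichever constructor the modified encoding.js exposes
// (TextEncoderCustom in this post). It is passed in so this sketch does
// not depend on a particular global being present.
function encodeWith(EncoderCtor, label, text) {
  // ASSUMPTION: the polyfill requires this opt-in flag for legacy encodings.
  const encoder = new EncoderCtor(label, { NONSTANDARD_allowLegacyEncoding: true });
  return encoder.encode(text); // a Uint8Array with the encoded bytes
}

// Hypothetical browser-only helper: offers the bytes as a file download.
function saveBytes(uint8array, fileName) {
  const blob = new Blob([uint8array], { type: 'application/octet-stream' });
  const url = URL.createObjectURL(blob);
  const link = document.createElement('a');
  link.href = url;
  link.download = fileName;
  document.body.appendChild(link);
  link.click();
  document.body.removeChild(link);
  // Release the object URL once the click has been dispatched.
  setTimeout(() => URL.revokeObjectURL(url), 0);
}

// In the page, with encoding-indexes.js and the modified encoding.js loaded:
// const uint8array = encodeWith(TextEncoderCustom, 'euc-kr', attachmentBody);
// saveBytes(uint8array, 'export.txt');
```

Saving the bytes through a Blob keeps the browser from re-encoding them, so the downloaded file contains exactly what the encoder produced.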


You need to upload encoding-indexes.js and the modified encoding.js as static resources in Salesforce so that your Visualforce page can include them.

I hope you find this post useful if you need to work with non-standard encodings in Salesforce. Once again, many thanks to inexorabletash for the GitHub contribution.