Wednesday, August 26, 2009

Reading A Text File

Today I had to read the content of a text file, and send that content to a service running on another server (using an existing client library to communicate with the server).

Doesn’t sound hard, does it? Well, as I’ve said before, nothing is ever easy.

To be more specific about the problem, I needed to read a .sql file containing a script that applied various schema updates, then execute the script using the ExecuteNonQuery method of a connection object.

Still doesn’t sound difficult, does it?

Now I won’t go into the details, but suffice it to say I had the content provided to me as a byte array, and therefore needed to convert it to a string so I could execute it.

The first problem was that I assumed the byte array contained ASCII characters, when in fact it contained Unicode (UTF-16, which is what .NET calls “Unicode”). That was simple enough to fix: I just used the System.Text.Encoding.Unicode.GetString() method.
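To illustrate the difference, here is a minimal sketch that decodes the same UTF-16 bytes both ways. The sample statement and the class name are made up for illustration. Decoded as ASCII, every other character comes out as a NUL, because each ASCII-range character occupies two bytes in UTF-16.

```csharp
using System;
using System.Text;

public static class DecodeDemo
{
    public static void Main()
    {
        // Hypothetical script text, encoded as UTF-16 little endian with no header.
        byte[] data = Encoding.Unicode.GetBytes("SELECT 1");

        string asAscii = Encoding.ASCII.GetString(data);
        string asUnicode = Encoding.Unicode.GetString(data);

        Console.WriteLine(data.Length);     // 16 bytes for 8 characters
        Console.WriteLine(asAscii.Length);  // 16: one character per byte, half of them '\0'
        Console.WriteLine(asUnicode);       // SELECT 1
    }
}
```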

Unfortunately that wasn’t the end of it. I now had two problems:

1. The command still wouldn’t execute. SQL Server returned the error “incorrect syntax near ‘’”, yet when I viewed the string with the debugger’s text visualiser, copied it to SQL Server Management Studio and executed it there, it worked fine.

2. I realised I couldn’t rely on a .sql file using any particular encoding. Which encoding had actually been used would depend on who last edited the file, with which application and with which configuration.

It turns out the cause of problem #1 was the solution to problem #2.
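Both come down to a few bytes at the very start of the data. A minimal sketch of the symptom, assuming the script was saved as UTF-16 little endian with a byte order mark (the helper and class names here are hypothetical): decoding the whole array turns the header into the character U+FEFF, which has no visible width, so the string looks right on screen but no longer starts with the first letter of the script.

```csharp
using System;
using System.Text;

public static class BomDemo
{
    // Hypothetical helper: encodes text as UTF-16 little endian, preceded by its byte order mark.
    public static byte[] EncodeWithHeader(string text)
    {
        byte[] header = Encoding.Unicode.GetPreamble();   // FF FE
        byte[] body = Encoding.Unicode.GetBytes(text);
        byte[] data = new byte[header.Length + body.Length];
        Array.Copy(header, 0, data, 0, header.Length);
        Array.Copy(body, 0, data, header.Length, body.Length);
        return data;
    }

    public static void Main()
    {
        byte[] data = EncodeWithHeader("SELECT 1");

        // Decoding from offset 0 keeps the header as an extra leading character.
        string text = Encoding.Unicode.GetString(data);
        Console.WriteLine(text.Length);            // 9, not 8
        Console.WriteLine(text[0] == '\uFEFF');    // True

        // Skipping the two header bytes gives the clean script.
        string clean = Encoding.Unicode.GetString(data, 2, data.Length - 2);
        Console.WriteLine(clean == "SELECT 1");    // True
    }
}
```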

Although I couldn’t see it in my decoded string, the data began with a header of up to 4 bytes (a byte order mark) that identified the encoding. A quick Google for how to read a text file in various encodings gave me some sample code, which I rewrote for clarity and placed into a static method on a utility class in one of our shared libraries. Here’s the code:

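/// <summary>
/// Converts a byte array (usually read from a text file) to text, based on the encoding indicated by the header (byte order mark) in the first 2 to 4 bytes of the file.
/// </summary>
/// <remarks>
/// <para>Supports UTF-8, UTF-16 (little and big endian, which also covers UCS-2) and UCS-4/UTF-32 (little and big endian). Data with no recognised header is decoded as ASCII.</para>
/// <para>The string returned is stripped of the header, as well as decoded using the relevant encoding.</para>
/// </remarks>
/// <param name="data">A byte array containing the data to convert to text.</param>
/// <returns>A string containing the decoded text from the specified byte array.</returns>
public static string ByteArrayToText(byte[] data)
{
  // No recognised header: fall back to ASCII and decode from the first byte.
  System.Text.Encoding encoding = System.Text.Encoding.ASCII;
  int startOffset = 0;
  // Note: the UTF32 classes may not exist on every framework profile; drop those two branches if unavailable.
  if (IsUtf8Header(data))
  {
    encoding = System.Text.Encoding.UTF8;
    startOffset = 3;
  }
  else if (IsUcs4BigEndianHeader(data))
  {
    encoding = new System.Text.UTF32Encoding(true, false);
    startOffset = 4;
  }
  else if (IsUcs4LittleEndianHeader(data))
  {
    // Must be tested before UTF-16 little endian, which shares the FF FE prefix.
    encoding = System.Text.Encoding.UTF32;
    startOffset = 4;
  }
  else if (IsUtf16BigEndianHeader(data))
  {
    encoding = System.Text.Encoding.BigEndianUnicode;
    startOffset = 2;
  }
  else if (IsUtf16LittleEndianHeader(data))
  {
    encoding = System.Text.Encoding.Unicode;
    startOffset = 2;
  }
  // Skip the header bytes so they don't end up in the returned string.
  return encoding.GetString(data, startOffset, data.Length - startOffset);
}
private static bool IsUcs4BigEndianHeader(byte[] data)
{
  return (data.Length >= 4 && data[0] == 0 && data[1] == 0 && data[2] == 0xfe && data[3] == 0xff);
}
private static bool IsUcs4LittleEndianHeader(byte[] data)
{
  return (data.Length >= 4 && data[0] == 0xff && data[1] == 0xfe && data[2] == 0 && data[3] == 0);
}
private static bool IsUtf16BigEndianHeader(byte[] data)
{
  return (data.Length >= 2 && data[0] == 0xfe && data[1] == 0xff);
}
private static bool IsUtf16LittleEndianHeader(byte[] data)
{
  return (data.Length >= 2 && data[0] == 0xff && data[1] == 0xfe);
}
private static bool IsUtf8Header(byte[] data)
{
  return (data.Length >= 3 && data[0] == 0xef && data[1] == 0xbb && data[2] == 0xbf);
}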


You can then get your file contents however you like, so long as the result is a byte array. Pass it to the ByteArrayToText function and it will return a string with the content decoded correctly and the encoding header removed.
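For comparison, and as an alternative rather than the method above, the framework’s own StreamReader can also detect a byte order mark when asked to. A minimal sketch, with a hypothetical helper name and made-up sample bytes; the encoding passed in is only the fallback used when no header is found.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamReaderDemo
{
    // Hypothetical helper, not the ByteArrayToText method from this post.
    public static string DecodeWithStreamReader(byte[] data)
    {
        using (MemoryStream stream = new MemoryStream(data))
        // Third argument: detect the encoding from the byte order mark, if one is present.
        using (StreamReader reader = new StreamReader(stream, Encoding.ASCII, true))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // UTF-8 header (EF BB BF) followed by the two ASCII characters "GO".
        byte[] data = { 0xEF, 0xBB, 0xBF, (byte)'G', (byte)'O' };
        Console.WriteLine(DecodeWithStreamReader(data));   // GO
    }
}
```

Which headers StreamReader recognises depends on the framework version, so it is worth testing against the encodings you actually expect to receive.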



This code should also work in the Compact Framework 3.5.





