Re: file.ReadAll - another quirk

From	"Mayayana" <mayayana@invalid.nospam>
Newsgroups	microsoft.public.scripting.vbscript
Subject	Re: file.ReadAll - another quirk
Date	2019-11-23 10:09 -0500
Organization	A noiseless patient Spider
Message-ID	<qrbi41$7gk$1@dont-email.me>
References	(6 earlier) <qr8t33$1j3e$1@gioia.aioe.org> <qr8tqi$tsl$1@dont-email.me> <qr96d7$v89$1@gioia.aioe.org> <qr9rm7$qmh$1@dont-email.me> <qraq3k$1sh$1@gioia.aioe.org>

"R.Wieser" <address@not.available>

| > Unicode - 2-byte characters as used in Windows,
|
| Which I remember as "Wide character" (as the name used in the conversion
| function).
|

   Yes. That one confuses me. It took me a long
time to figure out which was wide and which was multi.
Dual and multi would have made more sense. I don't
find a spatial description of bytes to be intuitive. A
2-byte character is not "fat".

| > which may not be the same as all unicode 16 and
|
| Did I already mention I'm hazy with those names ?  Whell, stuff like that
| (two-byte unicode != unicode 16) certainly does that to me. :-)
|

   For a long time I had no awareness of anything other
than "unicode", which was the 16-bit, 2-bytes-per-character
encoding that Win32 uses internally. VB/COM strings (BSTRs)
are notable not only for the double-byte characters but also
for the prepended, 4-byte length indicator that allows for
embedded nulls.

  The term "unicode 16" is only made necessary by the invention
of unicode 32. If everyone would just speak English like normal
people we wouldn't have this mess. :)

  I always heard/read Windows programming people talking
about simply "unicode". My assumption is that at the time it
was thought that, to paraphrase the Gatester, "64,000
characters should be enough for anyone". And anyway, no
one actually used unicode, except *maybe* if they were writing
software for Asians, Africans, Israelis, etc. So basically it was
preparation for the future.
    My text files are all ASCII/ANSI to this day.
Not all software even recognizes unicode. Then UTF-8 brings
in further complication because an encoding indicator is
discouraged. So Notepad can see a file as plain text unless
there's a BOM at the beginning, in which case it's unicode.
But how does Notepad or anything else recognize UTF-8?
If I save a file as UTF-8 in Notepad it will be prepended with
EF BB BF, but webpages don't have that. So it ends up creating
a politically correct culture war: We shouldn't use ANSI because
it's language-specific. We should use UTF-8, even if it screws
things up, because UTF-8 respects "diversity".
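
  A rough sketch of the BOM check described above, in Python only
because it's quick to try. The name sniff_bom is made up for
illustration. It looks at the leading bytes and nothing else, so a
BOM-less UTF-8 webpage comes back as "unknown", which is exactly the
problem. (A UTF-32 LE BOM also starts with FF FE; this sketch ignores
that case.)

```python
import codecs

def sniff_bom(data: bytes) -> str:
    """Guess an encoding from a leading byte order mark, if there is one."""
    if data.startswith(codecs.BOM_UTF8):       # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be"
    return "unknown"                           # no BOM: could be ANSI or UTF-8

print(sniff_bom(b"\xef\xbb\xbf<html>"))  # file saved as UTF-8 by Notepad
print(sniff_bom(b"<html>"))              # typical webpage, no BOM
```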


| Yep, I do know.  And for some reason I got the idea that UTF-8 was referring
| to that encoding scheme.   I normally refer to it as "multi byte" (again,
| from the conversion function).

  I suppose it is multi-byte. And there is a UTF-8 codepage.
But it's unicode insofar as it assigns unique numbers for
all characters. So it's not really a codepage in the ANSI sense
of detailing what characters bytes 128-255 should map to.
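
  To illustrate the difference, here is a small Python sketch (the
variable names are mine, and the three codepages are just examples).
The same byte above 127 means a different character in each ANSI
codepage, while in UTF-8 a given character always gets the same byte
sequence:

```python
raw = bytes([0xE9])  # one byte in the 128-255 range

# The ANSI codepages each map that byte to a different character.
by_codepage = {cp: raw.decode(cp) for cp in ("cp1252", "cp1251", "cp1253")}
for cp, ch in by_codepage.items():
    print(cp, ch)

# UTF-8 assigns the character one fixed byte sequence.
utf8_bytes = "é".encode("utf-8")
print(utf8_bytes)  # b'\xc3\xa9'
```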

| Then again, I seem to vaguely remember that UTF-16 (two bytes per character)
| could do the same "multi byte" encoding ...
|
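  I didn't think so, at least not on Windows. Strictly speaking,
though, UTF-16 does use two 16-bit units (a surrogate pair) for
characters beyond the first 65,536.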

| > and if they used UTF-8 it would potentially change
| > the number of characters when rendered as ANSI
|
| Yep.  Which I would/do not find strange in any way.   The same happens with
| C strings, in which you have to escape certain characters (gave me quite a
| puzzle the first time I encountered it). :-)
|
   Not strange, but problematic. In the world of the late
90s and early 00s, when people were mostly only thinking
about Euro languages, where there was either one byte
or 2 bytes per character, it wasn't too hard to convert
between ANSI and unicode. The high byte was always 0. :)
But if real-world ANSI usage were actually multibyte then
it would quickly get complicated to deal with text. Of
course it is complicated now, in theory, but mostly only
for Asians, in practice.
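
  Both points can be seen in a few lines of Python (a sketch only;
the sample word and variable names are my own). For Western European
text every other byte of the 2-byte form is 0, and UTF-8 read back
as ANSI gains a character:

```python
text = "café"

# 2-byte unicode as Windows stores it (little-endian): the high byte is 0.
utf16 = text.encode("utf-16-le")
high_bytes = utf16[1::2]
print(list(utf16))

# UTF-8 bytes rendered as ANSI (codepage 1252): é becomes two characters.
as_ansi = text.encode("utf-8").decode("cp1252")
print(len(text), len(as_ansi), as_ansi)
```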

  I ended up writing a VB6 function for my HTML editor to
check for UTF-8. I find it takes less than 15 ms to check
up to 100KB of data, so it's an almost instant ID, which
allows me to support UTF-8 seamlessly. I open the file
and inspect the bytes before loading it into the RichEdit,
at which point I have to tell the RichEdit how to load it.
But there are still complications. This is for HTML so it
assumes an ANSI-type file. In other words, not unicode-16
and without a BOM. It only searches until it finds, or
doesn't find, a byte combination invalid in UTF-8.

Public Function IsItUTF8(sFile As String) As Boolean
  Dim bFile() As Byte
  Dim iB As Long, SizFile As Long, LenF As Long
  Dim FF As Integer
  Dim BooU8 As Boolean, BooU8Char As Boolean

    IsItUTF8 = False
     On Error Resume Next
        FF = FreeFile()
        Open sFile For Binary As #FF
          LenF = LOF(FF)
          If LenF > 100000 Then
             ReDim bFile(100000) As Byte
          Else
             ReDim bFile(LenF) As Byte
          End If
           Get #FF, , bFile()
        Close #FF


       '--just quit and call it ansi if there's an error opening file.
     If Err.Number <> 0 Then Exit Function
      SizFile = UBound(bFile) - 3
        If SizFile < 10 Then Exit Function '-- don't go negative for a tiny file.
      BooU8Char = False
      BooU8 = True
        iB = 0
         '-- UTF-8 characters will be: 240+/128+/128+/128+  224+/128+/128+  194+/128+
         '-- Anything not fitting that pattern will not be a UTF-8 character: a single
         '-- byte over 127, a byte of 240 or more not followed by 3 bytes in 128-191, etc.
         '-- Most functions like this are designed to default to UTF-8: If it's not
         '-- *faulty* UTF-8 then it's UTF-8. This function does it the other way:
         '-- If it's faulty UTF-8 or if it's ASCII then it's not UTF-8.
      Do While iB < SizFile
        Select Case bFile(iB)
          Case Is < 128  'ascii range
            iB = iB + 1

          Case Is < 194, Is > 244 '128-191 can only appear as continuation bytes.
            BooU8 = False         '245 to 255 are invalid in utf-8. 192, 193 are invalid.
            Exit Do

          Case Is > 239
            If ((bFile(iB + 1) < 128) Or (bFile(iB + 1) > 191)) _
               Or ((bFile(iB + 2) < 128) Or (bFile(iB + 2) > 191)) _
                Or ((bFile(iB + 3) < 128) Or (bFile(iB + 3) > 191)) Then
              BooU8 = False
              Exit Do
            Else
              BooU8Char = True
            End If
             iB = iB + 4

          Case Is > 223
            If ((bFile(iB + 1) < 128) Or (bFile(iB + 1) > 191)) _
               Or ((bFile(iB + 2) < 128) Or (bFile(iB + 2) > 191)) Then
              BooU8 = False
              Exit Do
            Else
              BooU8Char = True
            End If
             iB = iB + 3

          Case Else  ' > 193 and < 224
            If (bFile(iB + 1) < 128) Or (bFile(iB + 1) > 191) Then
              BooU8 = False
              Exit Do
            Else
              BooU8Char = True
            End If
             iB = iB + 2

         End Select
       Loop


   If BooU8 = False Or BooU8Char = False Then
      IsItUTF8 = False
   Else
      IsItUTF8 = True
   End If
End Function
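
  For anyone who wants to try the same byte ranges outside VB6, here
is a rough Python port. It is a sketch, not a line-for-line
translation: the name looks_like_utf8 is made up, it takes a byte
buffer rather than a file path, it has no 100 KB cap, and it treats
a sequence cut off at the end of the buffer as invalid, where the VB
version simply stops a few bytes early. Like the VB version it only
checks lead and continuation byte ranges, so it is a quick heuristic
rather than a strict validator.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True only if at least one multi-byte sequence is found and none is malformed."""
    i = 0
    saw_multibyte = False
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                      # ASCII range
            i += 1
            continue
        if lead < 0xC2 or lead > 0xF4:       # stray continuation byte or invalid lead
            return False
        extra = 3 if lead >= 0xF0 else 2 if lead >= 0xE0 else 1
        tail = data[i + 1 : i + 1 + extra]
        if len(tail) < extra or any(b < 0x80 or b > 0xBF for b in tail):
            return False                     # missing or bad continuation bytes
        saw_multibyte = True
        i += 1 + extra
    return saw_multibyte                     # pure ASCII counts as "not UTF-8"

print(looks_like_utf8("café".encode("utf-8")))
print(looks_like_utf8("café".encode("cp1252")))
print(looks_like_utf8(b"plain ascii"))
```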
