Wrong encoding -> bad EOL handling

  • Wrong encoding -> bad EOL handling

    Summary: Beyond Compare 3 fails to match two files that differ only in CRLF/LF modes.

    I have two files with identical textual content but differing EOL modes. They're attached as "a.txt" (PC: CRLF) and "b.txt" (Unix: LF).

    Beyond Compare 3 fails to match these two files unless the settings are tweaked, which makes comparing directories containing numerous such files an extreme pain.

    I've attached my result of the "a.txt <-> b.txt" comparison as "comparison.png". Note that "ANSI" on my system is the same as "Japanese (Shift-JIS)". To get the same result with an English locale OS, one would probably need to select "Japanese (Shift-JIS)" for the files' encodings.

    The problem seems to occur because 1) the files are "Japanese (EUC)" but are decoded as Shift-JIS, and 2) the Shift-JIS decoder consumes the EOL as the "second byte" of a multi-byte Shift-JIS character.

    These are the flaws that I think arise in this scenario:
    1. The files are auto-detected as "ANSI" when they should be "Japanese (EUC)".
    2. Code-page decoding and EOL handling are done in such a way that code-page decoding takes precedence. (Otherwise a failed decoding wouldn't matter, as is evident from a "b.txt <-> b.txt" comparison.)
    3. The Shift-JIS decoder is able to consume an EOL as a second byte, when neither LF nor CR should be part of a double-byte character[1] to begin with.

    I know that 1 is probably unfixable, but perhaps 2 or, preferably, 3 could be remedied?
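
To illustrate flaws 2 and 3, here is a minimal Python sketch of a byte-level line splitter. It is not Beyond Compare's actual code; `split_lines`, its `strict` flag, and the simplified lead/trail byte ranges are assumptions made for this example. With `strict=False` a Shift-JIS lead byte always swallows the next byte, which reproduces the reported symptom; with `strict=True` the next byte is consumed only if it is a valid trail byte, so CR and LF survive as line endings.

```python
def is_sjis_lead(b: int) -> bool:
    # Simplified Shift-JIS lead-byte ranges (assumption: includes vendor extensions).
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def is_sjis_trail(b: int) -> bool:
    # Valid trail bytes never include CR (0x0D) or LF (0x0A).
    return 0x40 <= b <= 0x7E or 0x80 <= b <= 0xFC


def split_lines(data: bytes, strict: bool) -> list[bytes]:
    """Split bytes into lines while scanning them as Shift-JIS.

    strict=False mimics the reported behaviour: a lead byte consumes
    whatever byte follows it, even an EOL byte.
    strict=True pairs a lead byte only with a valid trail byte.
    """
    lines: list[bytes] = []
    cur = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        nxt = data[i + 1] if i + 1 < len(data) else None
        if is_sjis_lead(b) and nxt is not None and (not strict or is_sjis_trail(nxt)):
            cur += bytes((b, nxt))  # treat as one double-byte character
            i += 2
            continue
        if b == 0x0D and nxt == 0x0A:
            i += 1  # drop the CR of a CRLF pair; the LF ends the line
            continue
        if b == 0x0A:
            lines.append(bytes(cur))
            cur = bytearray()
        else:
            cur.append(b)  # single byte (or an unpaired lead byte)
        i += 1
    if cur:
        lines.append(bytes(cur))
    return lines


# EUC-JP bytes whose last byte (0xEC) falls in the Shift-JIS lead-byte range.
line = "日本語".encode("euc_jp")
crlf_data = line + b"\r\n" + line + b"\r\n"
lf_data = line + b"\n" + line + b"\n"

naive_differs = split_lines(crlf_data, strict=False) != split_lines(lf_data, strict=False)
strict_matches = split_lines(crlf_data, strict=True) == split_lines(lf_data, strict=True)
```

In this sketch the mis-decoded text is still garbage under `strict=True`, but both files split into the same lines, so a CRLF/LF-insensitive comparison would match them.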

    [1] http://en.wikipedia.org/wiki/Shift_J...t_JIS_byte_map

    The attached files are textually identical. Their modes are:
    a.txt = Japanese EUC, CRLF
    b.txt = Japanese EUC, LF
    c.txt = Japanese Shift-JIS, CRLF
    d.txt = Japanese Shift-JIS, LF

    The last two are attached merely for convenience.
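
For anyone without the attachments, four files with the same encoding and EOL combinations can be generated with a short Python sketch. The sample text is hypothetical (the real attachments' content is not reproduced here), and `build_samples` and `write_samples` are names invented for this example.

```python
from pathlib import Path

# Hypothetical sample content; the original attachments may contain other text.
SAMPLE_TEXT = "日本語\nテスト\n"


def build_samples(text: str = SAMPLE_TEXT) -> dict[str, bytes]:
    """Return the four encoding/EOL variants keyed by file name."""
    crlf_text = text.replace("\n", "\r\n")
    return {
        "a.txt": crlf_text.encode("euc_jp"),     # Japanese EUC, CRLF
        "b.txt": text.encode("euc_jp"),          # Japanese EUC, LF
        "c.txt": crlf_text.encode("shift_jis"),  # Japanese Shift-JIS, CRLF
        "d.txt": text.encode("shift_jis"),       # Japanese Shift-JIS, LF
    }


def write_samples(directory: str) -> None:
    """Write the variants as raw bytes so no newline translation occurs."""
    for name, data in build_samples().items():
        Path(directory, name).write_bytes(data)
```

Calling `write_samples(".")` writes the four files into the current directory.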

  • #2
    Thank you for reporting the problem.

    We'll add this to our example list for improving the automatic detection of character encoding.

    If you manually select the correct character encoding, the files should display correctly. You can select the character encoding by changing the dropdown that says "ANSI" to "Japanese (EUC)".

    You can also force BC to use a specific character encoding for certain file extensions. To force .txt files to open with EUC encoding, select "Tools > File Formats". Click "New". Select "Text Format" as the type. Name the format "TXT". In the General tab, enter *.txt as the mask. In the Conversion tab, change the encoding from "Detect" to the encoding that matches your files.
    Chris K Scooter Software
