Enhanced character set support in DCMTK

When the DCMTK module “dcmdata”, which is used for managing DICOM data structures and files, was designed in the mid-1990s, there was already a distinct class for DICOM data elements that are affected by the attribute Specific Character Set (0008,0005), i.e. for the value representations Long String (LO), Long Text (LT), Person Name (PN), Short String (SH), Short Text (ST) and Unlimited Text (UT). To be more precise: the last of these, UT, was only introduced a few years later, in 1998, with CP-122. Nevertheless, support for different character sets in DCMTK remained quite limited until recently.

The string handling in the dcmdata module is still what we call “transparent”: you can put whatever you want into the element values, and there is still no check whether you included the Specific Character Set (0008,0005) element with the correct value. However, with the latest development version, which is available in the public git repository, you can now convert any DICOM file or dataset into an equivalent UTF-8 (Unicode) encoded version. As input, DCMTK supports all DICOM character sets that are currently defined in Part 5 of the DICOM standard, i.e. the ISO 8859 family (used for European languages, Arabic, Hebrew and others), Japanese, Thai, Korean and Chinese. This includes single-byte and multi-byte character sets as well as the code extension technique using special escape (ESC) sequences based on ISO 2022. The latter is mainly used for Asian languages, where you need to switch between different character sets within an element value.
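To give an idea of what such a conversion does in the simplest case, here is a minimal sketch that turns an ISO 8859-1 (Latin-1) string into UTF-8 by hand. This is not how DCMTK implements it; the function name latin1ToUtf8 is made up for this illustration, and the multi-byte and ISO 2022 cases are far more involved.

```cpp
#include <string>

// Hypothetical helper, for illustration only (not part of DCMTK).
// Every Latin-1 byte is also the Unicode code point of the character,
// so values below 0x80 are copied as-is and values from 0x80 to 0xFF
// become a two-byte UTF-8 sequence.
std::string latin1ToUtf8(const std::string& latin1)
{
    std::string utf8;
    utf8.reserve(latin1.size() * 2);
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            // plain ASCII: identical in both encodings
            utf8.push_back(static_cast<char>(c));
        } else {
            // lead byte 110xxxxx carries the upper two bits,
            // continuation byte 10xxxxxx carries the lower six bits
            utf8.push_back(static_cast<char>(0xC0 | (c >> 6)));
            utf8.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return utf8;
}
```

For example, the Latin-1 byte 0xFC (“ü”) becomes the two bytes 0xC3 0xBC in UTF-8, while plain ASCII text stays unchanged.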

The conversion is based on the “libiconv” toolkit, so this library has to be enabled during the configure process in order to make use of this new feature. In addition to a new convertToUTF8() method at various class levels, there is a new --convert-to-utf8 option for many DCMTK command line tools like “dcmdump” and “dcmconv”. There is also a general character encoding class that allows for converting between arbitrary character sets (if supported by libiconv). This class can, for example, be used to convert a UTF-8 encoded character string back to ISO 8859-1 (Latin-1) if the communication partner does not (yet) support UTF-8.
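The following sketch shows the underlying idea of converting a UTF-8 string back to Latin-1 by calling the iconv API directly. It deliberately does not use the DCMTK classes; the helper name convertEncoding is an assumption made for this example, and I assume a platform where iconv() takes a char** input pointer (as in glibc).

```cpp
#include <iconv.h>
#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a DCMTK function): converts "input" from the
// encoding "fromCode" to the encoding "toCode" using plain iconv calls.
std::string convertEncoding(const std::string& input,
                            const char* fromCode,
                            const char* toCode)
{
    // note the argument order: target encoding first, then source encoding
    iconv_t cd = iconv_open(toCode, fromCode);
    if (cd == reinterpret_cast<iconv_t>(-1))
        throw std::runtime_error("conversion not supported by iconv");

    std::string in = input;   // mutable copy, since iconv() wants char**
    char* inPtr = &in[0];
    std::size_t inLeft = in.size();

    std::string output;
    char buffer[256];
    while (inLeft > 0) {
        char* outPtr = buffer;
        std::size_t outLeft = sizeof(buffer);
        const std::size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
        const int err = errno;   // save errno before any other call
        output.append(buffer, sizeof(buffer) - outLeft);
        // E2BIG only means the output buffer was full, so keep looping;
        // anything else (e.g. an unconvertible character) is an error
        if (rc == static_cast<std::size_t>(-1) && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("character conversion failed");
        }
    }
    iconv_close(cd);
    return output;
}
```

Calling convertEncoding("M\xC3\xBCller", "UTF-8", "ISO-8859-1") should give back the Latin-1 bytes of “Müller”. Characters that do not exist in the target character set make the conversion fail, which is exactly the situation you have to handle when a communication partner is limited to Latin-1. For stateful encodings such as the ISO 2022 variants, a final flush call would be needed as well, which this sketch leaves out.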

Of course, this is only a first step of enhanced character set support …
