if you use
$TWiki::cfg{Site}{CharSet} = 'utf8';
then
---++ This is a headline with some _emphasis_ in it
will render
<h2><a name="This is a headline with some <em>em"></a> This is a headline with some _emphasis</em> in it </h2>
which is fatal.
Patch:
--- lib/TWiki/Render.pm (revision 11645)
+++ lib/TWiki/Render.pm (working copy)
@@ -400,10 +400,7 @@
# For most common alphabetic-only character encodings (i.e. iso-8859-*),
# remove non-alpha characters
- if( defined($TWiki::cfg{Site}{CharSet}) &&
- $TWiki::cfg{Site}{CharSet} =~ /^iso-?8859-?/i ) {
- $anchorName =~ s/[^$TWiki::regex{mixedAlphaNum}]+/_/g;
- }
+ $anchorName =~ s/[^$TWiki::regex{mixedAlphaNum}]+/_/g;
$anchorName =~ s/__+/_/g; # remove excessive '_' chars
if ( !$compatibilityMode ) {
$anchorName =~ s/^[\s\#\_]*//; # no leading space nor '#', '_'
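For illustration, the effect of the patched lines can be sketched in isolation. This is a minimal sketch, not the actual TWiki code path: `normalise_anchor` is a hypothetical helper, `$mixedAlphaNum` is an assumed ASCII-only stand-in for `$TWiki::regex{mixedAlphaNum}` (the real value depends on the site locale), and the input is assumed to be heading text whose emphasis has already been rendered as `<em>` tags.

```perl
use strict;
use warnings;

# Assumed stand-in for $TWiki::regex{mixedAlphaNum}; the real value is locale dependent.
my $mixedAlphaNum = 'a-zA-Z0-9';

# Hypothetical helper that applies the three substitutions kept by the patch,
# unconditionally (no charset check).
sub normalise_anchor {
    my ($anchorName) = @_;
    $anchorName =~ s/[^${mixedAlphaNum}]+/_/g;  # replace runs of other characters
    $anchorName =~ s/__+/_/g;                   # remove excessive '_' chars
    $anchorName =~ s/^[\s\#\_]*//;              # no leading space nor '#', '_'
    return $anchorName;
}

# Prints: This_is_a_headline_with_some_em_emphasis_em_in_it
print normalise_anchor('This is a headline with some <em>emphasis</em> in it'), "\n";
```

With the substitution applied regardless of charset, no `<`, `>` or `"` character can survive into the anchor name.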
Why are iso-8859-* charsets treated specially?
MD
I believe it was due to performance, as discussed in
Item2032 - not sure how much impact we are talking about, though.
--
SP
Hm, but the check above that distinguishes iso-*** charsets from others only happens when normalizing anchor names (replacing suspicious chars with an underscore). So it can't be related to form values.
MD
This isn't a performance issue. However, answering MD's question is surprisingly complex since this is a complex area...
On a side note: I'm surprised there aren't more bugs with use of underscores for emphasis in TOC entries - never thought this would work, and semantics of what
TWikiML is allowed in TOC entries should be better defined anyway.
I18N of TOC entries is quite painful and really needs some work - what it should really do is look at the
intended language of the page (e.g. French or Chinese) and then check whether that language is alphabetic. (This could perhaps be configured per site, as {langType} set to alphabetic or nonalphabetic.)
- UPDATE: Although we can't mix Perl locale features with Perl Unicode support features (see TWiki:Codev.UnicodeSupport
for reason), we could just use current locale setting to get a simple view of site-wide language settings, even though the UnicodeSupport will ignore the locale. However, this doesn't address sites where multiple translated versions of pages are available, in which case the language of the page can sometimes be deduced from that. It also doesn't address pages that include multiple languages, of course.
Then, for TOCs in pages (or just specific TOC entries) using alphabetic languages, all non-alphabetic characters (except for allowed
TWikiML such as emphasis) are stripped, so that the anchor is normalised (but can still include accented characters).
For non-alphabetic languages such as Chinese (see
TWiki:Support.TOCnotWorkingForChineseHeadings
), all non-script characters need to be stripped in a similar way, but with a different regex. This really needs full Unicode support turned on in Perl (
TWiki:Codev.UnicodeSupport
) to be practical - e.g. the Unicode regexes enable
[\p{Letter}\p{Mark}]
to match any letter or accent from alphabetic and non-alphabetic scripts.
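As a rough sketch of what such a Unicode-aware filter could look like (assuming the heading text has already been decoded into Perl character strings, which is what full Unicode support would provide), the hypothetical helper `unicode_anchor` below keeps only letters and combining marks. The helper name and the underscore replacement are assumptions for illustration, not existing TWiki code.

```perl
use strict;
use warnings;

# Hypothetical helper: keep any letter or combining mark from any script,
# replace every other run of characters with a single underscore.
sub unicode_anchor {
    my ($text) = @_;
    $text =~ s/[^\p{Letter}\p{Mark}]+/_/g;  # non-letters become '_'
    $text =~ s/^_+|_+$//g;                  # trim leading and trailing '_'
    return $text;
}

# Input written with \x{...} escapes so the example does not depend on the
# encoding of this source file: "Resume" with accents, then two Chinese characters.
my $heading = "R\x{e9}sum\x{e9}: \x{4e2d}\x{6587}!";

binmode STDOUT, ':encoding(UTF-8)';
print unicode_anchor($heading), "\n";
```

The same regex handles the accented Latin letters and the Chinese characters, which is the point of using Unicode properties instead of a per-charset character list.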
Without full Unicode support in TWiki, non-alphabetic mode would need to be done in a less safe 'filter out bad characters only' mode, as now - otherwise most Chinese TOC entries simply don't work as they are all 'invalid' characters. See
TWiki:Codev.InternationalisationIssues
for links to Chinese TOC issues, and
TWiki:Support.TOCnotWorkingForChineseHeadings
in particular.
Mixing alphabetic and non-alphabetic languages is no worse than just non-alphabetic and would definitely need Unicode.
Today, UTF-8 can of course be used with alphabetic or non-alphabetic characters - however, since
I18N for TWiki doesn't support Unicode yet (WikiWord
I18N doesn't work, for example), it's best to assume that you are using non-alphabetic characters.
So - the test for
iso-8859-*
is a rather Western-European-biased and sub-optimal way of checking for "current language is alphabetic", and it will not work with UTF-8 (or with many 8-bit alphabetic character sets used for Cyrillic and so on).
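To make the limitation concrete, here is a small sketch that applies the charset test from the patch context to a few charset names. `looks_iso8859` is a hypothetical wrapper around the regex from lib/TWiki/Render.pm, and the list of charset names is only an example.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the charset test removed by the patch.
sub looks_iso8859 {
    my ($charset) = @_;
    return ( defined($charset) && $charset =~ /^iso-?8859-?/i ) ? 1 : 0;
}

# Example charset names: two ISO-8859 variants, UTF-8, and a Cyrillic 8-bit set.
for my $cs (qw(iso-8859-1 ISO8859-15 utf-8 koi8-r)) {
    printf "%-12s %s\n", $cs,
        looks_iso8859($cs) ? 'anchor normalised' : 'anchor left as-is';
}
```

Only the two ISO-8859 names pass the test; a UTF-8 or KOI8-R site skips the normalisation even when its content is entirely alphabetic.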
UPDATE: There's a comment from me on the topic of non-alphabetic languages in
TWiki:Codev.InternationalCharactersInFormFields
.
--
RD
Perhaps related:
Item2455
- RD - not related, that's a configuration error.
AC
In fact, configuring the following is also an error - please configure based on the docs using locale, then try again:
$TWiki::cfg{Site}{CharSet} = 'utf8';
RD
Some updates above, flagged as
UPDATE - the more I think about this, the harder it is to really determine the language used in a specific TOC entry. Some compromise is necessary.
RD
This topic was lost from the lists due to not having a codebase field. Rediscovered 3/2/07. Just set it to "No Action" if it is dead.
CC
In any case, please make sure to keep anchor links compatible; there are many URLs out there pointing to
.../SomeWeb/SomeTopic#Some_auto_generated_anchor_from_subject
--
PTh
It was agreed at the release meeting of 02 Jul 2007 that, to fix this, it is OK to filter out chars and break old anchor links that point to headings with strange formatting in them.
KJL
No interest in this issue since February, so assuming the patch that was applied is working.
Closing, re-open as new bug if more work needs to be done.
--
TWiki:Main.SteffenPoulsen
- 09 Sep 2007