Ticket #407 (closed defect: fixed)

Opened 4 years ago

Last modified 4 years ago

{{{darcs_buildbot}}} uses {{{.encode('ascii')}}, but {{{.encode('utf-8')}}} works better

Reported by: zooko Owned by:
Priority: major Milestone: undecided
Version: 0.7.9 Keywords: darcs
Cc: dustin

Description

I committed a darcs patch written by François Deppierraz. The darcs_buildbot script failed with:

Traceback (most recent call last):
  File "/usr/local/bin/darcs_buildbot", line 164, in <module>
Finished applying...
    sendChanges(MASTER)
  File "/usr/local/bin/darcs_buildbot", line 131, in sendChanges
    changes = findNewChanges()
  File "/usr/local/bin/darcs_buildbot", line 114, in findNewChanges
    changes = getSomeChanges(lookback)
  File "/usr/local/bin/darcs_buildbot", line 100, in getSomeChanges
    return getChangesFromCommand(cmd, count)
  File "/usr/local/bin/darcs_buildbot", line 95, in getChangesFromCommand
    changes.append(makeChange(p))
  File "/usr/local/bin/darcs_buildbot", line 59, in makeChange
    comments = comments.encode("ascii")
UnicodeEncodeError: 'ascii' codec can't encode character u'xe7' in position 217: ordinal not in range(128)
Posthook failed!
Apply failed!

Then I hacked that script by replacing all instances of the string 'ascii' with the string 'utf-8' and it worked.

Change History

comment:1 Changed 4 years ago by dustin

  • Cc dustin added

I think that, for the moment, the more appropriate solution is to add the "replace" error-handling flag to the encode() call. Presumably someone added this encode() because the target does not expect unicode strings.

comments = comments.encode("ascii", "replace")

What do you think?

comment:2 Changed 4 years ago by zooko

I don't understand the specific situation, other than that I tried this change and it worked for me.

I have the vague notion that utf-8 is always better than ascii and that everything would have been a lot easier if Python had just made utf-8 its default encoding instead of ascii, but I don't know whether that general notion is true in this case.

comment:3 Changed 4 years ago by dustin

Yeah, Py3k makes utf-8 the default encoding, but that's really another story.

In this case, I see:

 60     # note that these are all unicode. Because PB can't handle unicode, we
 61     # encode them into ascii, which will blow up early if there's anything we
 62     # can't get to the far side. When we move to something that *can* handle
 63     # unicode (like newpb), remove this.

which means that we're stuck with the ascii charset -- we just can't get anything that's not ascii over to the other side. The other side is not expecting utf-8-encoded data, and will not know to decode such data back to a unicode string. Adding "replace" just means that any non-ascii characters are handled gracefully (replaced with a placeholder character) instead of triggering a traceback. It's not pretty, but it gets the commit into buildbot.

Obviously, the larger solution is to support unicode universally, and make arrangements to encode and decode strings when they pass through ascii-only channels, but that's going to be a much more significant task.

What do you think?

comment:4 Changed 4 years ago by zooko

Hm, but when I told it to encode this patch with utf-8, it worked, even though it contained a non-ascii char (in the comment, toward the end, in the name "François")

 http://allmydata.org/trac/tahoe/changeset/3274

Buildbot received the notification about that patch and triggered appropriately, and even included the sedilla:

 http://allmydata.org/buildbot/changes/694

Could it be that Twisted PB has fixed this limitation since that buildbot code comment was written?

comment:5 Changed 4 years ago by zooko

Whoops, sorry, wrong hyperlink. The changeset patch in question was this one:

 http://allmydata.org/trac/tahoe/changeset/3309

Which led to this buildbot event (already hyperlinked correctly in my previous message):

 http://allmydata.org/buildbot/changes/694

comment:6 Changed 4 years ago by dustin

I think that's actually the browser interpreting the UTF-8, rather than Buildbot. But it suggests that the worst possible outcome here is some "funny" ASCII characters in status displays. So I'll commit :)

comment:7 Changed 4 years ago by dustin

  • Status changed from new to closed
  • Resolution set to fixed
Note: See TracTickets for help on using tickets.