byte compiler problem: miscompiling emacs-w3m

Aidan Kehoe kehoea at parhasard.net
Tue Jan 15 16:50:19 EST 2008


 On the fifteenth of January, Stephen J. Turnbull wrote: 

 > Stephen J. Turnbull writes:
 >  > Aidan Kehoe writes:
 >  > 
 >  >  > Katsumi, Mike and Dieter: this change
 >  >  > http://hg.debian.org/hg/xemacs/xemacs?cs=e8f448f997ac 
 >  >  > fixes the problem for me. I'd appreciate confirmation of this from your end,
 >  >  > if you have the time to check. 
 >  > 
 >  > Aidan, would you please take the time to explain why this DTRTs?  A
 >  > good place would be as a comment in the code, and you should feel free
 >  > to point that out to a reviewer (probably me ;-) who asks for more
 >  > explanation.
 > 
 > Oh, I see; I gather this is a follow-on to "bind print-gensym to a cons".

Aye. Relatedly, I’ve just committed a fix to the bug I described in 
http://mid.gmane.org/18315.32323.858144.704976@parhasard.net , with a test
case; see http://hg.debian.org//hg/xemacs/xemacs/?cs=cacc942c0d0f . 

 > But
 >
 >  > The reason in this case is that byte compiler expertise is now a rare
 >  > commodity.  Even a small thing like explaining why that
 >  > print-gensym-alist interferes with byte compilation would help raise
 >  > consciousness (speaking for myself, of course, and possibly others).
 > 
 > still applies.  A note that it is conceptually part of that earlier
 > patch might be appropriate?

It’s hard to judge how much detail I need to go into, though. If I’m just
writing for you, then I can be a bit more terse and put a bit less effort
into it than if I’m writing for the general case, and writing for the
general case is hard, because I’m still learning this stuff myself. But I
*should* be writing for the general case, if other people are going to be
able to dive in.

Anyway, when it comes to that question in particular:

   #'gensym returns an uninterned symbol. This symbol is not in obarray [=
the global name->symbol table for elisp], and as such does not have a
canonical name. If nothing else refers to it, it will be garbage collected,
like most Lisp objects but unlike most symbols. See footnote [1] for example
code. 

   When serialising [printing] an interned symbol [= normally one in obarray,
with a canonical name], deciding on a representation for it is trivial; use
its canonical name. This means that the first time you encounter this symbol
on deserialising [reading], you intern the symbol [= create an entry for it
in obarray], and each subsequent time you encounter it, you look it up in
obarray and re-use it.
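
As a small illustration of that round trip (a sketch only; the function name
demo-interned-round-trip is made up for this example and is not part of any
Emacs):

```elisp
;; Hypothetical helper: print a list holding the interned symbol `foo'
;; twice, read the text back, and compare identities with `eq'.
(defun demo-interned-round-trip ()
  (let ((round-trip (read (prin1-to-string '(foo foo)))))
    (list
     ;; The symbol read back is the very same object as the original.
     (eq (nth 0 round-trip) 'foo)
     ;; Both occurrences resolve to the one entry in obarray.
     (eq (nth 0 round-trip) (nth 1 round-trip)))))

(demo-interned-round-trip)
;; => (t t)
```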

   Deciding on a representation for serialising an uninterned symbol is
harder. For one, it is entirely possible to create two distinct uninterned
symbols with the same name:

(let ((a (make-symbol "hi there"))
      (b (make-symbol "hi there")))
  (eq a b))
=> nil

So we can’t use only the name, if we’re to preserve the property that the
two are distinct. For another, this property (= that uninterned symbols with
the same name are distinct) is constantly relied on in Lisp programs. The
gensym counter, which tries to generate distinct names for distinct
uninterned symbols, is dumped, so when byte-compiling one file from a
-vanilla start it will have the same initial value as when byte-compiling
another file also from a -vanilla start. We want the #'gensym calls in Gnus
to refer to different objects than the #'gensym calls in AUCTeX, otherwise
using both at the same time will lead to subtle bugs.

   The CL macros make heavy use of #'gensym calls, and there’s nothing wrong
with that; #'gensym is the best way to create temporary variable names at
macro-expansion time without polluting the Lisp namespace, or indeed
stepping on your own toes if you’re a heavy macro user.

   The really old-school Emacs Lisp way to serialise [print] an uninterned
symbol was just to print its name. Then, at deserialisation [read] time it
is interned [= inserted into obarray]. This sucks, and leads to subtle bugs
with large code bases.
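
A sketch of how that goes wrong, assuming print-gensym is nil so that only
the name is printed (demo-name-only-round-trip is a made-up name for this
example):

```elisp
;; Hypothetical helper: two distinct uninterned symbols share the name
;; "tmp".  Printed by name only, they collapse into the single interned
;; symbol `tmp' when the text is read back.
(defun demo-name-only-round-trip ()
  (let* ((sym-a (make-symbol "tmp"))
         (sym-b (make-symbol "tmp"))
         (print-gensym nil)
         (round-trip (read (prin1-to-string (list sym-a sym-b)))))
    (list
     ;; Before printing, the two symbols are distinct objects.
     (eq sym-a sym-b)
     ;; After the round trip they have become the same object ...
     (eq (nth 0 round-trip) (nth 1 round-trip))
     ;; ... namely the interned symbol `tmp'.
     (eq (nth 0 round-trip) 'tmp))))

(demo-name-only-round-trip)
;; => (nil t t)
```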

   The slightly less old-school Emacs Lisp way is to bind print-gensym to
t. With that binding, within a single #'print [serialise] call, the first
time an uninterned Lisp symbol has to be printed, it’s printed as
#N=#:SYMBOL-NAME, where N is a counter, and SYMBOL-NAME is the name of the
symbol ("hi there" for my two examples above). The symbol and its
corresponding counter value are stored in a table (an alist) for later
retrieval. The next time that same symbol is to be printed within that
#'print call, the code looks through the alist to see if the symbol itself
is there; it finds it, and instead of printing #N=#:SYMBOL-NAME, it just
prints #N#.

The Lisp reader interprets #N=#:SYMBOL-NAME as a directive ‘create an
uninterned symbol with the name SYMBOL-NAME, and store N as its index in a
table of uninterned symbols specific to this top-level form’. It interprets
#N# as a directive ‘look up the Nth entry in the uninterned symbol table for
this form, and use that.’
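
The reader side can be tried directly by handing #'read a string in that
syntax. This is only a sketch; demo-read-labelled-gensym is a made-up name:

```elisp
;; Hypothetical helper: read one form that defines label 1 as an
;; uninterned symbol named "tmp", refers back to it with #1#, and also
;; mentions the ordinary interned symbol `tmp'.
(defun demo-read-labelled-gensym ()
  (let ((form (read "(#1=#:tmp #1# tmp)")))
    (list
     ;; #1# gives back the same uninterned symbol that #1= created.
     (eq (nth 0 form) (nth 1 form))
     ;; The uninterned symbol is not the interned `tmp' ...
     (eq (nth 0 form) (nth 2 form))
     ;; ... while the bare name is interned as usual.
     (eq (nth 2 form) 'tmp))))

(demo-read-labelled-gensym)
;; => (t nil t)
```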

So, for example, expanding a call to one of the CL macros gives this:

(let ((print-readably t)
      (print-gensym t))
  (cl-prettyexpand '(loop for i in '(a b c d e f g h)
                     do (message "hi there %S" i)))
  nil)

=> (block nil
     (let* ((#1=#:G32994 '(a b c d e f g h))
            (i nil))
       (while (consp #1#)
         (setq i (car #1#))
         (message "hi there %S" i)
         (setq #1# (cdr #1#)))
       nil))

   The table used by the printer is called print-gensym-alist. If
print-gensym is bound to t, this is reset on entry to and exit from all
printing functions--pretty much #'prin1-to-string, #'prin1, #'princ and
#'print. With print-gensym bound to t, you can’t byte-compile two different
functions and expect state stored in uninterned symbols to be preserved
across the function boundaries, because the byte compiler uses two separate
calls to #'prin1 when writing the functions to the byte compile output
buffer, and the index decided on when printing the first function is no
longer available when printing the second. 
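
A sketch of that loss of identity across two separate printer calls.
Assumptions: demo-two-print-calls is a made-up name, and print-circle is
bound as well only because GNU Emacs wants it for the #N= syntax; where the
variable is unknown the extra binding should be harmless.

```elisp
;; Hypothetical helper: the same uninterned symbol is used in two forms
;; that are printed by two separate calls to `prin1-to-string' and read
;; back by two separate calls to `read'.
(defun demo-two-print-calls ()
  (let* ((shared (make-symbol "state"))
         (print-gensym t)
         (print-circle t)
         (text-1 (prin1-to-string (list 'setq shared 1)))
         (text-2 (prin1-to-string (list 'symbol-value shared)))
         (form-1 (read text-1))
         (form-2 (read text-2)))
    ;; Each read creates its own fresh uninterned symbol, so the two
    ;; forms no longer talk about the same variable.
    (eq (nth 1 form-1) (nth 1 form-2))))

(demo-two-print-calls)
;; => nil
```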

   If print-gensym is bound to a cons, then print-gensym-alist is preserved
across calls to the printer functions. It’s not reset. So if you bound
print-gensym to a cons during the entire byte compilation process, you
could re-use state stored in a single uninterned symbol across function
boundaries, except that:

   The table used by the reader [deserialiser] is called Vread_objects (it’s
not visible to Lisp) and is reset with each top-level form encountered. So
with two functions and a single uninterned symbol, the use of the uninterned
symbol in the second function will provoke an error when the reader looks
through Vread_objects and doesn’t see any entry with the corresponding
index. 
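
The failure can be provoked by hand by reading a form that uses #1# without
defining it. A sketch, with the made-up name demo-read-dangling-reference;
the exact error signalled is not assumed, only that it is an error:

```elisp
;; Hypothetical helper: this string is what a second top-level form
;; would look like if the printer had kept its table from the first
;; form.  The reader starts with an empty table, so label 1 is unknown.
(defun demo-read-dangling-reference ()
  (condition-case nil
      (progn
        (read "(symbol-value #1#)")
        'read-succeeded)
    (error 'read-failed)))

(demo-read-dangling-reference)
;; => read-failed
```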

   What’s actually done (now) by the byte compiler is it binds print-gensym
to a cons and print-gensym-alist to nil for each top-level form it
outputs. This preserves the identity of uninterned symbols within top level
forms, and does not preserve it across them. This suits the Lisp reader
quite well.
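
The per-form behaviour can be approximated outside the byte compiler. This
sketch is not the byte compiler's code: it binds print-gensym to t rather
than a cons and adds print-circle for GNU Emacs, and the name
demo-one-form-round-trip is made up.

```elisp
;; Hypothetical helper: one top-level form uses the same uninterned
;; symbol twice.  It goes through a single print call and a single read
;; call, so the printer's and the reader's tables line up.
(defun demo-one-form-round-trip ()
  (let* ((tmp (make-symbol "tmp"))
         (print-gensym t)
         (print-circle t)
         (form (read (prin1-to-string
                      (list 'let (list (list tmp 1)) tmp)))))
    ;; The binding and the body still refer to one and the same symbol.
    (eq (car (car (nth 1 form))) (nth 2 form))))

(demo-one-form-round-trip)
;; => t
```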

   GNU have gone and extended this syntax a little; they now serialise and
deserialise circular objects with it, as well as gensyms. So this:

(read
 (let ((thing '(1 2 3 4 5 5))
       (print-circle t))
   (setf (cddr thing) thing)
   (prin1-to-string thing)))

doesn’t error for them, and gives back a circular list. This is something we
should merge, though I hope people will not make active use of it.
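
For reference, the text produced on the GNU side uses the same label syntax
for the list itself. A sketch assuming a reader that has this extension
(going by the above, ours signals an error instead);
demo-read-circular-list is a made-up name:

```elisp
;; Hypothetical helper: label 1 names the whole list, and the tail of
;; the list refers back to it, giving a two-element circular list.
(defun demo-read-circular-list ()
  (let ((thing (read "#1=(1 2 . #1#)")))
    ;; Two cdrs along, we are back at the first cons cell.
    (eq thing (nthcdr 2 thing))))

(demo-read-circular-list)
;; => t, under a reader with the extension
```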

Hope that helped a little--I’m not sure if, at all, it should be committed
to the documentation somewhere. 

[1] Code to demonstrate that an uninterned symbol will be garbage collected:

(setq box (make-weak-box (gensym)))
=> #<weak_box>

;; At this point, the only reference to the symbol that #'gensym returned is
;; by means of box, and since box is a weak box, references by means of it
;; do not count for the sake of garbage collection.

(weak-box-ref box)
=> G32999

(garbage-collect)
=> [value omitted]

(weak-box-ref box)
=> nil

-- 
Where might my nephew Yoghurtu Nghé be now, who had to flee the village
in a hurry because of the shortage of rhinoceroses?


