[Bug: 21.5-b26+CVS] Crash in GC

Tue May 2 11:03:22 EDT 2006

Jerry,

sorry for the delay; my time for XEmacs development is currently very
limited.

>>>>>"JJ" == Jerry James <james at xemacs.org> writes:
>> Contrary to the print statement above, the crash is not due to address
>> 0x8019c.  It is due to address 0x10008019c (the value of irdata), which
>> is 32 bytes past 0x10008017c (the value of the idata parameter in frame
>> 6).  This is because the print statement in vdb-posix.c that prints the
>> addresses prints them as integers.  On my platform, integers are 32 bits
>> and addresses are 64 bits, so the print statement chops off the upper 32
>> bits.

Thanks for pointing this out, I'll fix the output.
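
To illustrate the truncation Jerry describes, here is a minimal sketch.
It is not the actual code from vdb-posix.c; the helper names
format_truncated and format_full are made up for this example, and the
address is carried in a uint64_t so the sketch behaves the same on any
platform.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: formats an address the way a print through a
   32-bit integer would.  Only the low 32 bits survive the cast. */
static int format_truncated(char *buf, size_t n, uint64_t addr)
{
    assert(buf != NULL);
    return snprintf(buf, n, "0x%" PRIx32, (uint32_t) addr);
}

/* Hypothetical helper: formats all 64 bits of the address. */
static int format_full(char *buf, size_t n, uint64_t addr)
{
    assert(buf != NULL);
    return snprintf(buf, n, "0x%" PRIx64, addr);
}
```

With the address from the report, 0x10008019c, format_truncated yields
"0x8019c" and format_full yields "0x10008019c".  For a real pointer,
printing with "%p" and a cast to (void *) avoids the problem.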

JJ> As mentioned in the initial report, address 0x10008017c is an invalid
JJ> address.  The information in data2 above (assuming it is a
JJ> sized_memory_description) is:
[...]
JJ> Any theories on how that bad pointer got into a kkcc_gc_stack_ptr
JJ> element in the first place?  I still have the core file.

Starting from the root set, kkcc_marking examines all outgoing
pointers.  Every outgoing pointer it comes across is wrapped into a
kkcc_gc_stack_ptr element and pushed on the mark stack.  In your case,
kkcc_marking came across an object (= the parental object) that
contains the bad pointer in a position where the object's description
denotes an outgoing pointer.
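
The marking loop described above can be sketched roughly as follows.
This is a simplified model, not the XEmacs implementation: the types
obj and stack_entry, the fixed-size stack, and the function mark_from
are all assumptions made for illustration, and the children array
stands in for the positions that an object's description declares as
outgoing pointers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a heap object. */
typedef struct obj {
    int marked;
    size_t n_children;
    struct obj *children[4];   /* positions described as outgoing pointers */
} obj;

/* Hypothetical stand-in for a mark stack element. */
typedef struct {
    obj *data;
} stack_entry;

#define STACK_MAX 64
static stack_entry mark_stack[STACK_MAX];
static size_t stack_top = 0;

static void push(obj *o)
{
    assert(stack_top < STACK_MAX);
    mark_stack[stack_top++].data = o;
}

static obj *pop(void)
{
    assert(stack_top > 0);
    return mark_stack[--stack_top].data;
}

/* Marks everything reachable from root and returns the number of
   objects newly marked. */
static size_t mark_from(obj *root)
{
    size_t count = 0;
    push(root);
    while (stack_top > 0) {
        obj *o = pop();
        if (o == NULL || o->marked)
            continue;
        o->marked = 1;          /* a bad pointer would crash here */
        count++;
        for (size_t i = 0; i < o->n_children; i++)
            push(o->children[i]);   /* every outgoing pointer is pushed */
    }
    return count;
}
```

The point of the sketch is that the collector trusts the description:
whatever value sits in a described pointer position gets pushed and is
later dereferenced, whether or not it is a valid address.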

The first scenario is that the bad pointer is caused by memory
corruption, maybe due to a gc-unrelated bug.  In this case, your best
chance to catch the bug is to reproduce it reliably: `make check',
building the core-.ell-files, building the packages, or the XEmacs
benchmark suite bench.el may be your friends.  Then, find out where
the bad pointer comes from.  To identify the parental object, you
might use `call kkcc_backtrace' in gdb.  Then, put a watch on the
location in the parental object and see if you can identify the code
that corrupts the pointer and fix it.

The second possibility is that you've found a hole in the write
barrier: Some time ago, the object at 0x10008017c was a separately
allocated object (not a Lisp object managed by the Lisp allocator).
In a previous cycle of the incremental mark phase, the mark algorithm
pushed the address 0x10008017c that was valid at that time on the mark
stack.  Then, XEmacs suspended the mark phase, regained control, and
freed the separately allocated object at 0x10008017c while the
collector was not looking.  When the mark phase resumed its work, it
found the now-invalid pointer on the mark stack and crashed.
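
The sequence of events in this second scenario can be modelled with a
small sketch.  Everything in it is hypothetical: objects live in a
fixed pool and "freeing" only clears a flag, so the stale entry can be
detected instead of causing a crash.  The names slot, mark_step and
simulate are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a separately allocated object. */
typedef struct {
    int live;
    int marked;
} slot;

#define POOL 8
static slot pool[POOL];
static int stack[POOL];     /* mark stack of pool indices */
static int top = 0;

static void push_index(int i)
{
    assert(top < POOL);
    stack[top++] = i;
}

/* One bounded increment of marking.  Returns the index of the first
   stale entry found on the mark stack, or -1 if none was seen. */
static int mark_step(int budget)
{
    while (budget-- > 0 && top > 0) {
        int i = stack[--top];
        if (!pool[i].live)
            return i;       /* a real collector would touch freed memory */
        pool[i].marked = 1;
    }
    return -1;
}

/* Plays through the scenario: mark, suspend, free behind the
   collector's back, resume. */
static int simulate(void)
{
    pool[0].live = 1;
    pool[1].live = 1;
    push_index(0);
    push_index(1);
    if (mark_step(1) != -1)     /* first increment marks slot 1 */
        return -2;
    pool[0].live = 0;           /* freed without notifying the collector */
    return mark_step(1);        /* resumed marking pops the stale entry */
}
```

In this model simulate returns 0, the index of the slot that was freed
while it was still on the mark stack.  A write barrier that covered the
object would have told the collector about the change before the mark
phase resumed.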

Unfortunately, there are still some separately malloc'ed objects in
XEmacs that should actually be converted to Lisp objects so that they
are covered by the write barrier.  But why doesn't this known problem
crash XEmacs more often?  These objects are long-living objects that
are never or rarely modified; I already converted all other objects to
Lisp objects.  That's why my XEmacs with NEWGC enabled runs very, very
stably.  But of course, it is on my list to patch all the holes in the
write barrier; it is the only way to go.  BTW, Ben's pending "unicode
internal" patch may cover most of it.

It is hard to say which scenario bites you here.  For many old bugs
that I have since fixed, I first suspected that a hole in the write
barrier caused them.  But in fact, I *never* found a hole; it always
was some kind of memory corruption.

I hope this helps, thank you for looking into this,
-- 
Marcus