Ticket #587 (closed defect: wontfix)

Opened 4 years ago

Last modified 4 years ago

Buildbot slave raises MemoryError while monitoring log file

Reported by: mike23w Owned by: warner
Priority: blocker Milestone: 0.7.12
Version: 0.7.10 Keywords:
Cc: dustin

Description

This bug is consistently reproducible on my system.

Problem Description

When buildbot monitors a log file (via buildbot/slave/commands.py:LogFileWatcher) if the file grows to a size more than 1GB a MemoryError? exception will be raised and the slave will fail and not be able to recover; meaning all subsequent buildbot builds will also fail due to this exception. The buildbot slave process must be restarted to resume proper execution.

System Configuration

The Windows machine it's running on has 32GB of memory with about 30GB physical memory available while buildbot is running.

BUILDBOT MASTER and BUILDBOT SLAVE Version Information

  • Buildbot: 0.7.7
  • Twisted: 8.0.1
  • Python: 2.5 (r25:51908, Sep 19 2006, 09:52:17)

[MSC v.1310 32 bit (Intel)]

  • Buildmaster platform: win32

The buildbot master and slave are running on a single machine running Windows Server 2003 Service Pack 1.

Note that buildbot is running an older 0.7.7 and we can't immediately upgrade it at this point since it's a production machine. We will try to reproduce the bug on a more recent buildbot version soon. Maybe on Monday.

Reproduction Steps

Here's the steps to reproduce the bug.

My buildbot slave has a log file that is 1GB in size named test0.log. The buildbot slave executes a command to rename it from test0.log to test.log. The ShellCommand? is configured to monitor the log file named test.log. Once the log file is renamed to test.log the LogFileWatcher? opens() this file, starts reading 10KB chunks from it and sends these chunks from the slave to the buildbot master using it's poll() method. At some point the system memory will be used up and a MemoryError? exception is raised.

Here's the ShellCommand?() configuration:

   factory.addStep(ShellCommand(
         logfiles={'testlog':'tests\test.log'},
         command=['move','test0.log','test.log'
                  ]))

Note: the reason we use the "move" command is LogFileWatcher?() only monitors log files that have been created/modified after the ShellCommand?() has executed.

Impact

This bug is a major problem for us because buildbot fails on every build that generates a large log file; which is almost every build.

Change History

comment:1 Changed 4 years ago by mike23w

Sorry the buildbot master configuration should have looked like this (removed the "tests\" part):

   factory.addStep(ShellCommand(
         logfiles={'testlog':'test.log'},
         command=['move','test0.log','test.log'
                  ]))

test0.log is 1.1GB in size; and even though the system had more than 30GB free memory a MemoryError? was still raised by the buildbot slave.

comment:2 Changed 4 years ago by mike23w

Problem exists on a buildbot slave running version 0.7.9.

I didn't have access to version 0.7.10 so haven't tested that version.

comment:3 Changed 4 years ago by mike23w

Log of the MemoryError? traceback in twistd.log on the buildbot slave:

2009-07-11 10:55:44-0700 [Broker,client] ShellCommand._startCommand
2009-07-11 10:55:44-0700 [Broker,client]  C:WINDOWSsystem32cmd.exe /c move tests	est0.log tests	est.log
2009-07-11 10:55:44-0700 [Broker,client]   in dir ...

2009-07-11 10:55:44-0700 [Broker,client]   watching logfiles {'testlog': 'tests\	est.log'}
2009-07-11 10:55:44-0700 [Broker,client]   argv: ['C:\WINDOWS\system32\cmd.exe', '/c', 'move', 'tests\	est0.log', 'tests\	est.log']
2009-07-11 10:55:44-0700 [Broker,client]   closing stdin
2009-07-11 10:55:44-0700 [Broker,client]   using PTY: False
2009-07-11 10:55:44-0700 [-] command finished with signal None, exit code 0, elapsedTime: 0.109000
2009-07-11 10:56:27-0700 [-] SlaveBuilder.commandComplete <buildbot.slave.commands.SlaveShellCommand instance at 0x011C8BE8>
2009-07-11 10:56:27-0700 [Broker,client] Unhandled Error
        Traceback (most recent call last):
          File "python-2.5libsite-packages	wistedpythonlog.py", line 51, in callWithLogger
            return callWithContext({"system": lp}, func, *args, **kw)
          File "python-2.5libsite-packages	wistedpythonlog.py", line 36, in callWithContext
            return context.call({ILogContext: newCtx}, func, *args, **kw)
          File "python-2.5libsite-packages	wistedpythoncontext.py", line 59, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "python-2.5libsite-packages	wistedpythoncontext.py", line 37, in callWithContext
            return func(*args,**kw)
        --- <exception caught here> ---
          File "python-2.5libsite-packages	wistedinternetwin32eventreactor.py", line 194, in _runWrite
            closed = fd.doWrite()
          File "python-2.5libsite-packages	wistedinternetabstract.py", line 101, in doWrite
            self.dataBuffer = buffer(self.dataBuffer, self.offset) + "".join(self._tempDataBuffer)
        exceptions.MemoryError: 

comment:4 Changed 4 years ago by dustin

  • Cc dustin added
  • Milestone changed from undecided to 0.7.12

comment:5 Changed 4 years ago by mike23w

The problem also appears on a buildbot 0.7.10p1 master and slave running in Redhat Linux Enterprise 5:

  • Buildbot: 0.7.10p1
  • Twisted: 8.0.1
  • Python: 2.5 (r25:51908, Mar 5 2007, 23:18:19) [GCC 4.1.1]
  • Buildmaster platform: linux2

Tried the same configuration:

factory.addStep(ShellCommand?(

logfiles={'testlog':'tests\ est.log'}, command=['move','test0.log','test.log'

]))

on a Linux buidlbot master and slave and got the same MemoryError? exception from the slave:

2009-07-15 18:31:59-0700 [Broker,client] Unhandled Error

Traceback (most recent call last):

File "/bldmnt/toolchain/lin32/python-2.5/lib/python2.5/site-packages/Twisted-8.0.1-py2.5-linux-i686.egg/twisted/python/log.py", line 51, in callWithLogger

return callWithContext({"system": lp}, func, *args, kw)

File "/bldmnt/toolchain/lin32/python-2.5/lib/python2.5/site-packages/Twisted-8.0.1-py2.5-linux-i686.egg/twisted/python/log.py", line 36, in callWithContext

return context.call({ILogContext: newCtx}, func, *args, kw)

File "/bldmnt/toolchain/lin32/python-2.5/lib/python2.5/site-packages/Twisted-8.0.1-py2.5-linux-i686.egg/twisted/python/context.py", line 59, in callWithContext

return self.currentContext().callWithContext(ctx, func, *args, kw)

File "/bldmnt/toolchain/lin32/python-2.5/lib/python2.5/site-packages/Twisted-8.0.1-py2.5-linux-i686.egg/twisted/python/context.py", line 37, in callWithContext

return func(*args,kw)

--- <exception caught here> ---

File "/bldmnt/toolchain/lin32/python-2.5/lib/python2.5/site-packages/Twisted-8.0.1-py2.5-linux-i686.egg/twisted/internet/selectreactor.py", line 146, in _doReadOrWrite

why = getattr(selectable, method)()

File "/bldmnt/toolchain/lin32/python-2.5/lib/python2.5/site-packages/Twisted-8.0.1-py2.5-linux-i686.egg/twisted/internet/abstract.py", line 101, in doWrite

self.dataBuffer = buffer(self.dataBuffer, self.offset) + "".join(self._tempDataBuffer)

exceptions.MemoryError?:

comment:6 Changed 4 years ago by dustin

  • Status changed from new to closed
  • Resolution set to wontfix

Looks like this is exactly  http://twistedmatrix.com/trac/ticket/3682

I think that the answer to your problem of a logfile suddenly growing by 1GB is: don't do that. Consider using UploadFile instead, or watching the original file instead of the destination file.

comment:7 Changed 4 years ago by mike23w

'Don't do that' is not really an option unfortunately.

I don't write the tests; the other developers and QA folks in another team write them. I can ask them to decrease their log file sizes but they don't have to listen if they don't want to. They claim they need lots of logging data to be able to debug without having to rerun their tests.

There have discussion of using another continuous build system because of the instability caused by this MemoryError?. That's how serious the issue is for the buildbot users. We have another continuous build system built in-house by another team that is being considered. I've been a very big advocate of buildbot; it's designed well and easy to understand and use but if it's not reliable or stable it's use is limited.

I'll look at a work around for this issue but feel it should be fixed. Why can't we have large log files? From what I can tell the support for buildbot has been quite exception to this point. So I admit I am a little surprised and disappointed.

comment:8 Changed 4 years ago by dustin

Look more closely: having a large log file is not a problem. Having a large log file that appears all at once *is* a problem -- it looks to the slave like the test script wrote 1G of data all in one go, and it tries to send that 1G of data to the master. Can you remove the rename step, or watch test0.log instead of test.log.

Alternately, as I mentioned you can use a FileUpload step to upload the completed log to the master.

What you're doing is akin to using 'tail -f' for something to which 'cp' or 'cat' is more suited. Yes, it'd be nice if the Twisted guys fixed this, and you should comment on the ticket I mentioned above to that effect, but until that happens, you should apply the right tool for the job.

Note: See TracTickets for help on using tickets.