The conventional wisdom that you are making progress if you are solving problems as they arise is plain wrong.
Look at your list of problems and see it grow, and grow faster than this approach can solve them!
Real progress comes from solving problem causes first, then cleaning up the problems they have caused afterward, so the causes don't have time to add to your troubles while you are addressing the troubles they've already given you.

Here are some problems seen at eBay, some possible causes, and some possible solutions.
Free advice is always worth at least what it costs, which means you get nothing out of it unless you spend something, usually time and analysis, to figure out if it makes any sense for you in your situation.

By its very nature, this document invites comments and continual quality improvements from reader suggestions.
By its very nature, this document will never be "done".
Continual quality improvement is a journey, not a destination, but this document will be useful in many stages along the journey.
Feel free to return as time allows to check for changes.

FIRES (problems), FIREBUGS (causes), EXTINGUISHERS (solutions)
Generic fires, generic firebugs, generic extinguishers:

Problems escape detection in development and QA, show up first in production.

There is too little development level testing before software is delivered to QA.


There is too much complexity per change for easy testing.
There is too much complexity per change for easy fault isolation.
There is too little live testing prior to production release.
Use of formal testing methodologies is not evident.

Make changes in smaller chunks.


Cut down per-change overhead to make it feasible to make many small changes instead of one big change.
Cut down dependencies between changes, to enable making them separately rather than as a mass.
Add an extra level of testing.
Rig up an el cheapo beta level service for willing real users to bang on with real auctions.
Use better testing technology: branch coverage analysis for example.
Improve realism of development testing, perhaps by playing recorded user activity into test, certainly by frequent DBMS and other data content and format updates, to keep the development testing environment in synch with the production environment and with production-soon-to-be.
Standardize environments so that development looks exactly like QA, and QA looks exactly like production, both to software and to developer / tester / operator, so that moving from one environment to the next is not itself a cause of defects from confusing software or staff.

Problems once detected are difficult to isolate to causes.

Bugs are sometimes subtle by deliberate design choice.
Executables are monolithic general purpose tools rather than small, task specific, focused tools, making the starting search area for a problem very large.
The eBay preparation for problem troubleshooting is reactive rather than proactive.

Design the areas that lead to frequent bugs so that the bugs are blatantly visible and noisy to developer and QA eyes and ears.
"Break it hard, not soft."
For example, an unlocalized message should break in the compiler, not in production.
Research, document, and impose software "design for maintainability" standards, processes, and procedures.
Plan in advance for problems to occur, the usual case, and put as much of the scaffolding in place for their solution as possible before they are even seen.
Prepare special purpose debug versions of executables before they are needed, not after a crisis has already arisen.
For example, executables instrumented with Purify are slow to compile, bulky once compiled, and run too slowly for normal production use.
However, they can be built without interfering with the schedule of a train if they are built in parallel with the testing, debugging, and rolling into production of the fast, trim code. They will then be ready for use the moment a problem indicating a possible memory leak or memory access violation is seen, not half a day or more later.

The general level of software quality is low; the level of software problems is high.

There are too many changes, too fast, for the resources available to manage change.
There is too little attention to generalizing software away from choices that were appropriate under the business's startup conditions, as operations and markets change.
Tired, overstressed workers do bad work, managers included.
There is high software complexity as measured by standard metrics.
There is limited attention to software standards to counter the trend to complexity.
There is no visible software maintainability enforcement mechanism.

Strike a better balance between MRD frequency and development staffing levels.

Read The Mythical Man-Month again.
The more complex the process and the product, the less productive each worker becomes, rapidly, and that needs to be factored into the staffing calculations.
This is indeed a fantastically complex development environment; the productivity levels of workers in a startup should not be expected here, nor anything within a factor of twenty of those levels.

Ethnocentric problems plague eBay software.

There is too little eBay attention to process, as opposed to product, to be noticed from the ranks, which is where it should first be seen and take most effect.
Much of eBay software carries as unrecognized burdens the baggage of assumptions appropriate to a small software enterprise in the San Francisco Bay area.
  • the American English language and alphabet and direction of writing and size of distinguishable type
  • a single stable currency system based on US dollars
  • the C++ (American) locale defaults for dates, number punctuation, currency symbols
  • a set of cultural taboos understood at gut level
  • the local legal environment with strong, well understood tort law and consumer protection mechanisms
  • the local level of legal and personal regard for privacy
  • the local telephone and telecommunications rate arrangements and convenience of use
  • the local tax laws and hierarchies
  • the readily available mix of sophisticated monetary instruments
  • competitive and easily arranged shipping availability
  • insurance mechanisms universally portable within the trading region
  • well understood cultural goods exchange habits

Each and every one of those assumptions is inappropriate in our present business environment.
We "enjoy" a mix of languages, a mix of locales, a mix of currencies, a mix of laws, a mix of taxing bodies, a mix of cultural trading habits, a mix of nations.
We are now subject to exchange rates, tariffs, monetary transfer barriers, rules unenforceable across national boundaries, rules based on unfamiliar cultural taboos.
We find whole functionalities gone missing, like portable insurance and dependable delivery, and too much other added complexity to enumerate.

As a priority, weed inappropriate assumptions out of the code before they are seen as failures in production, preferably as a separate project of code review against checklists and crosschecked among several reviewers.

Internationalize English like any other language, make the US just another locale like any other country, treat dollars like any other currency and not the metric from which all prices are set, put the sites on Universal Time, and all the similar de-jingoisms needed.
As a priority, design and put in place more general and extensible mechanisms where each inappropriate assumption is identified.
As a lower priority, add in extensions whose need is eventually seen, discovered as part of this same review process.
The list of countries and their currency symbols, two character ISO 3166 nation identifiers, handedness of writing, width in bytes of alphabet characters, and many other regularizable and identifiable differences are known now.
Add them to enumerations as a kit, with a single coding style, to be built and tested once and for all, rather than one by one as each nation is added to eBay's coverage, with different styles and means and repeated testing requirements.
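
As a sketch only, such a kit might be a single table with one row per nation, written here in C++. The type, field, and function names are hypothetical, the four rows are merely samples, and ISO 4217 currency codes stand in for currency symbols as an assumption.

```cpp
#include <cstring>

// Hypothetical sketch: every name below is illustrative, not existing eBay code.
enum class WritingDirection { LeftToRight, RightToLeft };

struct LocaleInfo {
    const char* iso3166;         // two character ISO 3166 nation identifier
    const char* currencyCode;    // ISO 4217 code (assumed in place of a symbol)
    WritingDirection direction;  // handedness of the primary script
};

// One table, one coding style, built and tested once.
// The US is just another row, not a special case.
const LocaleInfo kLocales[] = {
    {"US", "USD", WritingDirection::LeftToRight},
    {"GB", "GBP", WritingDirection::LeftToRight},
    {"JP", "JPY", WritingDirection::LeftToRight},
    {"IL", "ILS", WritingDirection::RightToLeft},
};

// Returns the matching row, or nullptr when the nation is not in the kit.
// There is deliberately no fallback row.
const LocaleInfo* findLocale(const char* iso3166) {
    for (const LocaleInfo& entry : kLocales) {
        if (std::strcmp(entry.iso3166, iso3166) == 0) {
            return &entry;
        }
    }
    return nullptr;
}
```

A lookup for an unknown identifier returns a null pointer rather than quietly falling back to the US row, in keeping with "break it hard, not soft".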

The same problems, and the same classes of problems, recur throughout the eBay software enterprise.

There is too little learning at the enterprise level.

Learning in general is more formally called "time binding" of knowledge.
The mechanism of learning by an entity as opposed to a person goes by the formal name of "process improvement".
Process improvement is the standard mechanism for capturing knowledge about avoiding repetitions of mistakes; use it.
Read Deming, read SW-CMM, believe what you read, and implement it at a higher priority than today's marketing demands, so that we can be in better shape to meet the same demands tomorrow and in the years after.
Any software enterprise needs by design to be quality driven, not marketing driven.
Better quality will contribute to marketing success faster than better marketing will contribute to product quality improvements.
Put specific individuals in charge of process improvement as their entire job description, to focus the responsibility; grow and nourish that team until real improvement becomes readily apparent to all.

Employees choose to find other jobs, taking valuable hard won skills and knowledge away from the enterprise and leaving expensive recruiting and retraining holes in their places, reducing eBay's ability to stay competitive with its rivals.

Prevention of employee burnout receives too little enterprise level attention.
The employee utilization habits at eBay are still structured to cause "startup burnout", long after eBay has stopped being a startup that fairly trades a fair chance of employee riches at IPO for the high lifestyle sacrifice needed to get a business launched.
These habits are no longer appropriate for an enterprise which has survived startup and started working for the longer term, where such tradeoffs no longer make sense either to the employee or to the enterprise.

Employee off-time needs to be treated primarily as the employee's own as an essential and valuable part of retaining employees, not subject to casual and unnecessary disruption.

  • Circumstances requiring disruption of employee off-time must be known in detail to the employee at hire time; "must be willing to work overtime as required" does not suffice.
  • Employees must be given as much advance warning as possible when the need for extra effort is foreseen.
  • Employees must not be treated as "readily on call" when there is no pressing reason to annoy them during personal time.
  • Employees must not be contacted during personal time for "status updates" and "heads up messages"; these are appropriate only during business hours.
  • The eBay emphasis needs to shift dramatically to having people already on site capable of dealing with usual "emergencies" around the clock.
    This is a 24x7 operation and should be run that way, software troubleshooting and remediation included.
    Employees must not be called out at burnout generating hours as a substitute for staffing three shift software maintenance or educating onsite employees to handle usual problems without day shift help.
  • Any problem that occurs at least once a month, or whose arising can be reasonably foreseen a day in advance, is a usual problem, not an emergency justifying calling out employees in off hours; treating such problems otherwise is a management planning failure.
  • Standard procedures for handling problems must become a part of well documented eBay process, readily available to those onsite to take care of the problems, not a casual and unreviewed note made a part of Remedy tickets.

Help management set good examples, learn to delegate, and lead real lives: lock them all out of the buildings, at gunpoint if necessary, at 5PM.

Practice emergency prevention management instead of emergency provoking management.

  • Evaluate the need for better testing to prevent 2AM callups.
  • Evaluate the need for better training of onsite staff to cope at 2AM without added help.
  • Ponder the real value of fixing problems immediately whose solution could be delayed by a rollback and a reboot to normal business hours.
  • The cost of immediate fixes may be high employee turnover and low quality fixes.
    Evaluate the balance between real site impact and employee retention issues on a case by case basis, as a primary concern, not days later at a management get together.
  • Don't allow "call me" as a rollback instruction.
    If the callee can't think up and document a rollback method when well rested, what chance is there at 2AM of a better working method being invented?

Developing software at eBay is unnecessarily time consuming and painful.

Lots of attention is paid to "big tools" that require whole managed entities to maintain one tool, but the "little tools" that speed every operation of the day suffer from both neglect and active resistance to maintenance requirements.

Replace a habit of reactive neglect and resistance to change with a model of proactive quality improvement and seeking out the best available software for each task.

Whenever possible, use good, free software in common use elsewhere.
Replace existing SunOS tools with their decade newer GNU equivalents whose improved functionality is the general expectation both of arriving software developers and also of newly installed scripts.

Replace a habit of requiring bothersome, delay facilitating paperwork to do simple maintenance with a habit of fixing things that are broken just because that is part of the responsible group's job description.

Be open to use of old tools in new ways.
Most of our software lifecycle could be driven by Makefiles put to wider use.
It is a bit mind boggling to learn, for example, that distribution tarballs are still created by hand rather than by a command such as

    gmake dist <executable1> <executable2> <executable3>

provided by a build script maintainer.

Be open to use of new tools.
The trusty vi() Unix editor has long been superseded by nvi() and also by vim(). Both provide added functionality without breaking the old look and feel of vi(), yet neither is installed at eBay for general use.

Keep up better with changes to already installed software versions.
The gcc 2.7.2.3 compiler is still the default one pointed to by /usr/local/bin/gcc, even though the gcc 2.95.2 compiler is the only one still in use for new builds and should be the default.

Specific fires, specific firebugs, specific extinguishers:

English pages are repeatedly found linked from other locale language page URLs.

Style differences between core and international URLs make parameterizing URLs difficult.
Cascades of related pages are not (effectively?) controlled by a single locale parameter.
Default values are supplied that work, rather than fail, in development and in QA, so problems remain undetected for too long.

Standardize URL styles. Get rid of eBay core; replace it with us.ebay.
Use parameters, not defaults, to control which locale version of a page is selected to be the target of a URL.
Make sure that default values not overridden with proper values break immediately, calling attention to problems before they enter or leave QA, rather than disguising problems and becoming problems themselves.
Get rid of English as a "default", internationalize it like any other language, so that spoken language specialists, not programmers, decide the final wording.
This will allow use of the same processes and procedures for sets of pages in each language, making it harder to put English pages where other language pages can see them to link to them.
With English no longer a default, URL sanity check tools should report such pages as linked to nothing, rather than retaining default links to English pages.
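
A minimal sketch, in C++, of a locale parameter that has no working default. The function name and the example.com host pattern are placeholders chosen for illustration, not eBay's actual URL scheme.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: builds the URL of a page for one locale.
// The locale is a required parameter. An unset locale is treated as a
// defect and reported at once, never papered over with an English page.
std::string localizedUrl(const std::string& locale, const std::string& path) {
    if (locale.empty()) {
        throw std::invalid_argument("locale parameter not set for " + path);
    }
    // One URL style for every locale; "us" is formed the same way as any other.
    return "http://" + locale + ".example.com/" + path;
}
```

Called with "us" and "help/index.html" this yields http://us.example.com/help/index.html, while an empty locale throws in development and QA instead of silently linking to an English page.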

English text or fragments of text are repeatedly found in documents intended for sites using primary languages other than English.

English messages are used by default when pages are programmed, then either escape inclusion in text that is marked for internationalization, or escape internationalization.

Use the "break it hard" model of preventing bugs from escaping detection.
For example, in the original source code, pass string fragments of text, as well as non-text data elements needing "stringification", to the internationalization process by macros.
Modify each code buddy check to include checking each quoted string for need for internationalization using a search tool such as a text editor to guarantee that each string is noticed.
Discard the current mechanism for bulk exclusion from internationalization attention so that each string actually receives individual attention, especially as we anticipate internationalizing for maintainers and operators as well as users, which will change the set of strings needing internationalization.
Modify the string fragment passing macros to abut each default English message text fragment as passed to internationalization with an appropriate preprocessor macro that will break compilation when set non-null.
Internationalizers are responsible for removing this macro as part of the internationalization process for each language's resources, separately for each internationalization task.
Set this macro non-null as a check for all passed strings having received internationalization attention when building pages supposedly internationalized.
If the macro is still present, the build will break.
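
The macro arrangement described above might look roughly like the following C++ sketch. All macro and function names (I18N_STRICT, I18N_PENDING, I18N_TEXT, bidPrompt, helpLabel) are invented for illustration; the real string passing macros would do more than pass the text through.

```cpp
#include <string>

#ifdef I18N_STRICT
// Build of pages that are supposedly internationalized: the tag expands to
// a stray identifier, so any string still carrying it stops the compiler.
#define I18N_PENDING default_english_text_not_yet_internationalized
#else
// Ordinary development build: the tag expands to nothing.
#define I18N_PENDING
#endif

// Hypothetical single choke point through which every string fragment passes.
// Here it only parenthesizes its argument.
#define I18N_TEXT(s) (s)

// Still tagged: compiles normally, breaks when built with -DI18N_STRICT.
std::string bidPrompt() {
    return I18N_TEXT("Enter your maximum bid" I18N_PENDING);
}

// Tag already removed by the internationalizer: compiles in both builds.
std::string helpLabel() {
    return I18N_TEXT("Help");
}
```

Compiled normally, both functions return their English text. Compiled with I18N_STRICT defined, bidPrompt() no longer compiles, which is the intended hard break.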

Users repeatedly encounter HTTP page not found errors.

Link checking is not thorough.
Link checking is not done in a representative environment.
Links that work in QA do not work in production.
Links that work in one locale break on a parallel set of pages at a different locale.
Link checking does not follow user usage patterns.

Use existing automatic link checking web crawler tools, with testing passwords as needed, to check that all links lead somewhere.
Regularize environments so that no code need change from development to QA to locale to production.
Put processes in place that assure pages are copied and installed, and checked to be copied and installed, as working sets, not individual entities.
Design interfaces using industry standard "use case" methodology, and employ the same "use cases" from the design stage to test the interfaces once implemented.
Since we are building pages dynamically anyway, consider simplifying the problem by replacing separate sets of idiosyncratic pages for each locale with a single set of standardized generic pages parameterized by locale.

Code contains too many logic errors.
Executables take too long to load.
Debugging problems is slow and unreliable.
Executables push the limits of hardware and software technology.
Code maintenance is too expensive, changes too hard to implement.
Code fights with the tools used to create and maintain it.
Executables are too hard to build.
Programs fail too often in execution.

Software complexity at eBay has run amok.

Round up the usual list of suspects.
Treat complexity as an insidious and resource gobbling enemy, to be met and countered head on at every point of the compass.
Remove processes that laxly allow complexity to leak into the system and replace them with processes that proactively push complexity out of the system.
Limit sizes of source files ("compilation units").
Replace simple directory structures containing huge source files with subdivided directory structures containing many and smaller source files.
Limit sizes of functions.
Limit widths of source code lines.
Limit lengths of identifiers.
Hide complexity at every level from higher levels.
Hide complexity at every level from lower levels.
Limit functionality of executables, use more, simpler executables instead of fewer, more complex ones.
We are trying to accomplish with a few dozen applications what most organizations with this much complexity to handle choose to accomplish with a few thousand applications.
Limit complexity of data structures.
Reduce coupling between modules.
Limit complexity of classes.
Layer libraries, so calls need only ever go one level down.
Remove from libraries code that is under regular and ongoing maintenance changes.
Libraries are for stuff that is stable on the scale of years, not weeks.
Decouple libraries, so a single pass will accomplish linking.
Reject ugly source code even when it has no other problem; ugly code indicates the lack of an ethic of quality.
Require code to explain why it is doing what it is doing.
Refactor obsessively; treat every piece of code as a prototype of a better piece of code.
Simplify, regularize, and standardize build technology.
In fact, regularize everything capable of regularization; it makes coding easier and less error prone.
Problems labeled "not worth fixing" become complex nests of problems interacting in complicated ways.
Fix problems one by one as encountered in normal circumstances, not in incomprehensibly crosslinked groups during crises.
Where possible, remove usage of proprietary code not available in source code form to be fixed and understood by eBay developers.
Reduce use of interprocess communication where network shared files will suffice.
By deliberate choice, use minimally complex technology for each task that will accomplish that task.
Limit task assignments to persons demonstrated capable of doing them well already or of following the lead of others who will teach while helping.
Reread Walden and follow the recommendations made famous there.
Put process in place to standardize, review, and enforce the above.
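
As one small illustration of mechanical enforcement, a limit on source line width can be checked by a tool rather than by eye. This C++ sketch is hypothetical: the function name is invented, and the limit is a parameter because no specific number is set above.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical checker: returns the 1-based numbers of all lines in
// `source` that are wider than `maxWidth` characters.
std::vector<std::size_t> overlongLines(const std::string& source,
                                       std::size_t maxWidth) {
    std::vector<std::size_t> offenders;
    std::istringstream in(source);
    std::string line;
    std::size_t lineNumber = 0;
    while (std::getline(in, line)) {
        ++lineNumber;
        if (line.size() > maxWidth) {
            offenders.push_back(lineNumber);
        }
    }
    return offenders;
}
```

A check of this kind could run as part of a build or a code buddy check, with a non-empty result treated as a failure. Similar checks could cover file size or identifier length.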