SharePoint can prove to be a pain to troubleshoot if you are new to it or do not yet have a good understanding of how it all fits together. I intend to give some basic information on how to get started. Use this as a springboard to build your knowledge of the system.
I think it is safe to say that all admins know where the Event Viewer is and how to use it. The bad news is that I do not find the Event Viewer nearly as useful with SharePoint as with other apps. It definitely cannot be used alone to resolve most issues unless you have previously encountered the issue(s) and know the fix.
Unified Logging Service (ULS)
The ULS logs are vital to the troubleshooting effort. They contain all kinds of information for both success and failures relating to all of the SharePoint services. They can also be tuned to get additional data when troubleshooting.
Some things to know…
ULS versus Event Logs – The ULS logs will contain all of the SharePoint messages that bubble up to the Event Viewer and a whole lot more. This is why I believe it is difficult to troubleshoot based on the Event Viewer alone. What the Event Viewer will tell you that is not located here are the non-SharePoint messages dealing with things like general server health, connectivity, etc.
ULS Location – The logs are stored in the “LOGS” folder in what is commonly known as the 12 hive. The 12 hive has a default location of “C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\12”
Multiple Servers – It is important to note that these logs are stored on each server in the farm running a SharePoint service. If you have multiple Web Front Ends (WFEs) you will need to check all instances of the logs to be sure that you have all of the information. For example, if a user generates an error, it will only show up in the logs of the server that user interacted with.
Log Viewer – I have previously written a review of Stefan Gordon’s ULS Viewer available on CodePlex. It really makes it easier to filter and read the logs. http://ulsviewer.codeplex.com/
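If a viewer is not an option (on a locked-down server, say), a few lines of script can do a rough first pass over a log. This is a minimal sketch, assuming the tab-delimited column layout commonly seen in WSS 3.0 / MOSS 2007 ULS logs (Timestamp, Process, TID, Area, Category, EventID, Level, Message) — check the header row of your own files before relying on it:

```python
import csv

# Column layout assumed for WSS 3.0 / MOSS 2007 ULS logs (tab-delimited);
# verify against the header line of your own log files.
COLUMNS = ["Timestamp", "Process", "TID", "Area", "Category",
           "EventID", "Level", "Message"]

def filter_uls(lines, levels=frozenset({"Critical", "Unexpected", "High"})):
    """Yield one dict per log entry whose Level matches one of `levels`."""
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < len(COLUMNS):
            continue  # header row or a wrapped continuation line
        entry = dict(zip(COLUMNS, (field.strip() for field in row)))
        if entry["Level"] in levels:
            yield entry
```

Point it at an open log file, e.g. `filter_uls(open(path, errors="replace"))`, and you get just the entries at the severities you care about.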
Configuring Diagnostics – You can configure the diagnostic levels by navigating to the Central Administration website, Operations, Diagnostic logging.
Adjust the levels as appropriate to try and get the information you need. As the name implies, Verbose will give you the most detailed diagnostics. I would not recommend setting all services to Verbose though; I would try and target specific areas in order to reduce the amount of noise extra data creates.
How big are the logs? – The size of the logs will vary depending on the level of diagnostics set for each service, the amount of usage on the system, and of course what issues you are experiencing. For example, I have seen a database server go offline and generate over 1,000 errors a minute until the database was available again.
Within the diagnostic configuration screen, there are settings to help manage both the number of files and the period of time each log will cover. By setting these two fields to an appropriate value you can help to minimize the storage requirements, and also make each log easier to read.
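If you want to keep an eye on how much disk the logs are actually consuming, a quick size check is easy to script. A minimal sketch — the folder you pass in would be the LOGS folder under your 12 hive, wherever that lives on your install:

```python
import os

def log_folder_size(path):
    """Total bytes used by *.log files directly under `path`
    (e.g. the LOGS folder in the 12 hive)."""
    return sum(
        entry.stat().st_size
        for entry in os.scandir(path)
        if entry.is_file() and entry.name.lower().endswith(".log")
    )
```

Run it before and after a change to the diagnostic levels and you can see exactly what that change costs you in storage.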
What to look for? – My first course of action is to look for the first error in the log relating to the issue I’m trying to resolve. I then try and look at the information directly preceding that to try and determine if it is related. In many instances the actual error reporting is sort of a generic dump that does not specify the condition or point where the failure occurred.
I also look for patterns throughout the file to try and see what might be related, what might be a symptom, and what else is going on that maybe is not related.
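The "find the first error, then read backwards" approach above can be sketched as a small helper. The `marker` string and the amount of context are placeholders; substitute whatever severity text or message fragment you are hunting for:

```python
from collections import deque

def first_error_with_context(lines, marker="Unexpected", context=5):
    """Return the first line containing `marker` plus the `context`
    lines immediately preceding it, or None if there is no match."""
    recent = deque(maxlen=context)  # rolling window of prior lines
    for line in lines:
        if marker in line:
            return list(recent) + [line]
        recent.append(line)
    return None
```

The lines just before the first error are often where the real story is, since the error itself is frequently a generic dump.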
It should go without saying that if the databases are not available, the system will not be either. Errors like “Cannot connect to Configuration database” will ruin your day in a hurry.
Validate Connectivity – I normally start my database troubleshooting by validating connectivity using the main SharePoint service account. An overzealous DBA may have made a change that prevents access to the database.
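A basic reachability test can rule out network- and firewall-level causes before you start chasing permissions. A sketch, assuming SQL Server's default port 1433 — note it only proves the port answers, not that the service account can log in, so a credential test still has to follow:

```python
import socket

def can_reach(host, port, timeout=3.0):
    """True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical farm database server; 1433 is SQL Server's default port.
# can_reach("sqlserver01", 1433)
```

If this comes back False from a WFE, look at the network and firewall first; if True, move on to testing the actual service-account login.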
Automated Updates – In some rare cases, I believe primarily with WSS, I have seen Windows Update push a patch or update that requires further administration before the system is available again.
Basic Install – In cases where a basic install was originally done, you may experience some additional hurdles in maintaining and troubleshooting the database.
The WSS install uses what is called the Windows Internal Database, which has some interesting limitations. The biggest limitation is that you cannot administer it using your regular toolset like SQL Server Management Studio; you need to download and use a command-line tool called SQLCMD. http://www.microsoft.com/downloads/details.aspx?FamilyID=d09c1d60-a13c-4479-9b91-9e8b9d835cdc&displaylang=en
The databases are located at C:\WINDOWS\SYSMSI\SSEE\MSSQL.2005\MSSQL\Data
Further info on the database and how to work with it is available here: http://www.wssdemo.com/Pages/db.aspx
Roles and Services
Early on in your admin career you will probably start with broad moves like restarting the server(s) to clear up issues. As you get a better understanding of the system and when you start to relate the error with a specific server role or service you can start to zero in on the specific sub systems. At that point you can target restarting specific services instead of rebooting one or all of the servers in the farm.
One key to maintaining a healthy farm is early detection of issues, and being able to interpret the early symptoms before they lead to a real problem. One example of this that I struggled with for a while pertained to issues doing a Full Crawl on a medium farm with about 150 GB of content. I started to see a pattern where a seemingly unimportant message in the logs would signal that the indexing service was going to lock up. It is good to know this early in the process, and not 5 hours in. Initially the fix included restarting the server, but in this case simply restarting the indexing service was enough to clear up the issue. We then set up a monitor to look for the initial warning message, saving us time in getting updated content into the index.
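A monitor like the one described can be as simple as scanning the newest ULS log for the known precursor message on a schedule. A sketch — the PRECURSOR text here is purely hypothetical, so substitute the exact message your own logs emit before trouble starts:

```python
# Hypothetical precursor message seen before the indexer locks up;
# replace it with whatever pattern your own logs actually show.
PRECURSOR = "Content index on Portal_Content could not be initialized"

def check_for_precursor(lines, marker=PRECURSOR):
    """Return the first log line containing the known precursor message,
    or None. A scheduled task can run this against the newest ULS log
    and restart the indexing service (or page someone) when it fires."""
    for line in lines:
        if marker in line:
            return line
    return None
```

Wire it to a scheduled task and an alert, and you catch the lock-up at minute one instead of hour five.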
There are also a number of paid support options.
Other Vendors – A number of other vendors offer per-incident or contracted support.
Backup & Recovery needs to be thought about before you have any issues. I cannot count the number of times I have run into people that are trying to fix a major issue but have no current or meaningful backup. This should be a top priority.
Disaster Recovery is also vital, but it should take a different approach than normal Backup & Recovery since it covers the whole platform: features and content.
Patches and Maintenance should be handled with care and ideally should be tested in a non-production environment first. The time they take and the issues that may result vary quite a bit. When issues arise, and they will at some point, you need to know how to proceed and/or contact a qualified technician ASAP.