Expert Data Miner Project

When interpreting your web statistics becomes
an easy task with the proper data mining tools

by Jean-François Beaulieu

 

   If you have invested a lot of time and money in your web site, getting the proper tools to learn who your visitors are and what they do should be the next step. Many statistical packages exist on the Internet, but their quality varies greatly. Most free packages will give you a rough picture, but in general they fail to correctly separate spiders and other scripts from human visitors. Often your Internet Service Provider gives you results produced by a predefined package. However, your provider is locked in a difficult situation: to attract customers he must slash his prices, and paying for a very expensive package could harm his own business if only a fraction of his customers really use it. Furthermore, processing mountains of log files to get much more than a rough picture puts an extra strain on the server and slows it down. Eventually you penalize those visitors who wish to fetch your pages quickly and, in the end, you penalize yourself if the visitor never comes back.

 One alternative is to download your log files from the server (or main computer) to your own computer. In such an environment, you can analyze your log files as much as you want without penalizing potential visitors to your web site. Furthermore, the Windows environment offers a wide range of popup and graphical tools that make the analysis of such data much easier.

Among several packages, Expert Data Miner is likely to shed new light on web statistics for you. Produced by ASCO IT, a Montreal-based company, Expert Data Miner has four main qualities that distinguish it from its competitors: accuracy, speed, configurability and originality.

Accuracy:

   A great deal of effort has gone into delivering accurate statistics, not 'junk' or bogus ones.

You may be surprised to learn that spiders can make up as much as 40% of your 'visitors' in some cases. Those scripts and programs will certainly help your site get indexed by search engines, but if you wish to track how much time a typical visitor spends on a page or where he goes, removing spiders from your sample is a necessity. However, most packages fail to do it properly. With EDM you can configure virtually every report and choose whether to include or exclude spiders from your statistics.
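To make the idea of excluding spiders concrete, here is a minimal sketch in Python. It is not EDM's method, which is not described here. It assumes that each hit carries a user-agent string and that spiders can be recognized by a few substrings; the marker list and the function names are hypothetical.

```python
# Hypothetical substrings that suggest a spider; real detection needs a
# much larger, regularly updated list and is not limited to user agents.
SPIDER_MARKERS = ("bot", "crawler", "spider", "slurp")


def is_spider(user_agent: str) -> bool:
    """Return True if the user-agent string contains a known spider marker."""
    ua = user_agent.lower()
    return any(marker in ua for marker in SPIDER_MARKERS)


def split_hits(hits):
    """Split (ip, user_agent) tuples into (human_hits, spider_hits)."""
    humans, spiders = [], []
    for hit in hits:
        if is_spider(hit[1]):
            spiders.append(hit)
        else:
            humans.append(hit)
    return humans, spiders


def spider_share(hits) -> float:
    """Fraction of hits attributed to spiders (0.0 for an empty log)."""
    if not hits:
        return 0.0
    _, spiders = split_hits(hits)
    return len(spiders) / len(hits)
```

A report that excludes spiders would be built from the first list returned by `split_hits`, and a report that includes them from the full list of hits.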

Counting the real number of visitors is also not as easy as it seems, mostly because some providers, like AOL, constantly recycle IP addresses. This leads many programs to drastically overestimate your number of visitors. Furthermore, if you cannot properly identify visitors with an ID, many related statistics will also be incorrect; there is a 'snowball' effect here.
Most other providers will, upon connection, assign a unique IP number to each of their Internet clients. No other person in the world should have the same IP number at the same time. This IP address can be used to determine whether a request for a page or an image was made by user 'X' or user 'Y', and so to count your visitors. But when a major player like AOL reassigns a predefined pool of IP addresses in real time, things become much harder.
With visitors from AOL, your log file could contain 600 different IP numbers like 152.153.12.77, 152.153.211.88, etc., all with the same prefix, yet they could belong to only 11 different visitors.
EDM uses a special algorithm to identify user sessions correctly and build relatively accurate statistics. It first looks for a permanent cookie; then it checks whether a user ID was assigned to the visitor when he filled in a form; then it checks for a session cookie; and if none of these is present, it falls back on the IP address. If your visitor comes from a provider like AOL, an extra calculation based on the number of hits from AOL and non-AOL customers is performed to readjust the number of visitors.
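The order of checks described above can be sketched as a simple fallback chain. The Python below is only an illustration under stated assumptions, not EDM's actual code: the `Hit` record, its field names and the key prefixes are hypothetical, and the extra AOL adjustment is left out because its formula is not given here.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    """One log line, reduced to the fields used for identification (hypothetical layout)."""
    ip: str
    permanent_cookie: Optional[str] = None
    form_user_id: Optional[str] = None
    session_cookie: Optional[str] = None


def visitor_key(hit: Hit) -> str:
    """Pick the most reliable identifier available, in the order given in the text."""
    if hit.permanent_cookie:
        return "cookie:" + hit.permanent_cookie
    if hit.form_user_id:
        return "user:" + hit.form_user_id
    if hit.session_cookie:
        return "session:" + hit.session_cookie
    return "ip:" + hit.ip  # last resort: the IP address


def count_visitors(hits) -> int:
    """Count distinct visitors using the fallback chain."""
    return len({visitor_key(hit) for hit in hits})


def count_distinct_ips(hits) -> int:
    """Naive count that treats every IP address as a separate visitor."""
    return len({hit.ip for hit in hits})
```

With a visitor whose requests arrive from several pooled addresses but carry the same permanent cookie, `count_distinct_ips` counts one visitor per address while `count_visitors` counts a single visitor.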

 

Speed:

   Among the web statistics packages designed for Windows, EDM is one of the fastest, and possibly the fastest in the world. This matters especially if you have a web site with thousands or even tens of thousands of visitors per day. Several tests were performed with log files of hundreds of thousands of lines, and 1.5 million lines in one case. EDM is 12 to 15 times faster than Deep Tracker and, in spite of the large amount of data it produces, it can process a log 3 to 10 times faster than most other packages (often with more reports or more columns per report).

Even FastStats, whose reputation rests heavily on speed, was left behind by EDM. The two programs were compared using log files of hundreds of thousands of lines from a web site containing thousands of pages and hundreds of external referrers. With its default configuration, which keeps only the 100 most visited pages or referrers, FastStats did not perform badly and in some cases even did the job in 5% less time than EDM. But the latter always outputs all your pages and referrers, not just the top 5 or 10%. The configuration of FastStats was then changed to include all the pages and referrers that EDM was already processing. This time EDM did the job more than twice as fast as FastStats. This test was performed against the fastest of the most important competitors according to their benchmarks... imagine the results with the others!
 

Configurability:

   Most other packages let you define 'projects' but do not offer database support. With their projects, you scan a couple of logs, save the results, and that is it. Should you wish to make a bigger study based on months of activity, or add new log files a week later, you can't. A handful of companies offer some level of database support, but always at the expense of speed. With Expert Data Miner, you can define cumulative or non-cumulative projects. The database was also designed to be fast, so switching to database mode adds only a fraction to the processing time.

Most other packages give you some predefined columns in each report. A few of them will allow you to hide or discard some predefined columns. With EDM, you can not only swap two columns or insert or delete a predefined column in your reports, you can also create columns from scratch. This is a statistician's dream. You define a target, an operand and the column header, and bingo! EDM will produce the newly defined reports and columns from your log files. Combining such customizable columns with customizable filters on rows gives you a powerful tool when it comes to isolating the behavior of some users.
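As a rough illustration of what a user-defined column combined with a row filter can look like, here is a short Python sketch. How EDM defines a target and an operand internally is not described here, so the function names, the operator table and the report layout (a list of dictionaries) are all assumptions made for the example.

```python
import operator

# Hypothetical set of operators a user-defined column may apply.
OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def add_column(rows, header, left, op, right):
    """Return new report rows with an extra column: row[left] <op> row[right]."""
    fn = OPERATORS[op]
    result = []
    for row in rows:
        new_row = dict(row)  # leave the original report untouched
        try:
            new_row[header] = fn(row[left], row[right])
        except ZeroDivisionError:
            new_row[header] = None  # undefined value, e.g. a page with no visits
        result.append(new_row)
    return result


def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]
```

For example, a "Hits per visit" column could be derived from existing "hits" and "visits" columns, and the rows could then be filtered to keep only the pages above a chosen threshold.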

If you are already using cookies to count your returning visitors, you may not wish to change your system. EDM can be configured to accept your cookies and spot returning visitors, even cookies with alphanumeric values.

If you own a specialized web site and your visitors do not follow the classical incoming path of Google, Yahoo or Ask Jeeves, you can also add new search engines to the predefined database.
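A search engine database of this kind can be pictured as a table that maps a referrer's domain to the query parameter holding the search words. The Python sketch below is an assumption about how such a table might work, not a description of EDM's database; the domains, parameter names and function names are illustrative only.

```python
from urllib.parse import parse_qs, urlsplit

# Illustrative table: referrer domain -> query parameter carrying the search words.
SEARCH_ENGINES = {
    "google.com": "q",
    "yahoo.com": "p",
    "ask.com": "q",
}


def add_search_engine(domain: str, query_param: str) -> None:
    """Register an additional search engine in the table."""
    SEARCH_ENGINES[domain.lower()] = query_param


def search_terms(referrer: str):
    """Return (engine_domain, search_words) for a known search engine referrer, else None."""
    parts = urlsplit(referrer)
    host = (parts.hostname or "").lower()
    for domain, param in SEARCH_ENGINES.items():
        # Match the domain itself or any of its subdomains (www., search., ...).
        if host == domain or host.endswith("." + domain):
            values = parse_qs(parts.query).get(param)
            if values:
                return domain, values[0]
    return None
```

A specialized site would call `add_search_engine` once for each niche engine that sends it traffic, after which referrers from that engine are recognized like the predefined ones.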

Originality:

   Many reports and concepts in EDM are original creations that do not exist in any other package. In addition to the classical (and mandatory) reports on daily traffic, hourly traffic, etc., you will find several reports that have no equivalent elsewhere. EDM relies heavily on concepts like entry pages and referrers. Tracking where your visitors come from and what they do in each case forms the core of the application.
