Earn upto Rs. 9,000 pm checking Emails. Join now!

Enter your email address:

Delivered by FeedBurner

Friday, November 23, 2007

Implementation and loading

Database Implementation
The database schema was designed prefering more tables in the database in favour of less
columns in each table. By spreading data over more tables, it is hoped that more intricate cor-relations
are possible. All table names start with the sf_* prefix in order to avoid cluttering
other tables already defined in the current namespace. Figure 1 shows the physical model of
the set of tables which depend on the daily 1 interval I 1 .
The physical model is based on a concrete database product, as can be seen with the types of
the attributes. During development Oracle 7.3 was used. Other database products were also
considered during the design of the tables and relationships. The schema is easily converted
to different databases, if you follow the guidelines set below.
Figure 1 shows a circular arrangement of tables around the central table st_stamp1. All
tables on the ring depend with part of their primary key on I1ID, which is the primary key of
st_stamp1. In each outer ring table, the I1ID is tagged as foreign key.
The primary key fragment I1ID is an abstract number. Table st_stamp1 uses an alternate
key consisting of the combination of the cache host name and the start time stamp of interval
I 1 , the real key for the central table. The I1ID is a means of an efficient glue between the
tables.
Due to the fact that there does not exist generic support for sequences, row identifiers, or trig-gers,
the interval id generation is provided by the insertion script. If the database product
allows, the dependency of the outer tables’ interval id on the central id should be modelled
using a foreign key or a references constraint on the outer tables’ id column.
The central table st_stamp1 contains the timestamp information about the daily interval
for which seafood was run. It also uses the name of the cache it was run for, or the logical
name of the cache group which seafood did sum up. Part of the table is also the offset to UTC
in order to watch the daylight savings transitions. Finally, the interval for which the sum was
run, and the configured interval are part of the table.If one is using a different database product, e.g. one which supports native arithmetic types,
R and S should be assigned an eight byte integers, and T an eight byte float. Even though sea-food
only uses a four byte integer for its internal request counters, the database may view a
much larger interval. For this reason, the sum of a year’s worth of requests might overflow
some internal database registers, and some DB products report the wrong results instead of
an error.
Some tables contain the related (HITR,HITS, HITT) triples. These triples refer to the number
HITs 1 perceived by seafood. Please note that a HIT triple always contains values less or
equal to the plain (R,S,T) triple in the same table and row. Furthermore, the sf_tcp_client table contains a (MISSR,MISSS,MISST) triple and an (ERRR,ERRS,ERRT) for
errors and those non-HITs with a hierarchy code of NONE.
Starting on the left side of the ring, the UDP and TCP counts are collected in five tables
sf_udp_* and sf_tcp_*. Besides the interval id as a glue, the primary key contains also
the status code from the fourth column of the access.log file, e.g. TCP_MEM_HIT.
The top of figure 1 shows the tables dealing with the server side traffic of the squid. The hier-archy
code is part of the primary key. For a sibling and a parent, the host contacted is also
part of the key. When going directly to the origin site, the destination host is neglected, even
though seafood knows about them internally.
On the right side of the ring, the request method, the scheme, the top-level, the 2nd-level
domain, and the mime types are counted. All five tables contain the HIT count besides the
plain count. Part of the primary key of the method table, the scheme table, the top-level
domains and the mime types are only those items which were configured in sea-food.
conf by the time seafood was run. Furthermore, the number of entries in the domain
tables for a particular cache host and timestamp are limited to the top N domains 1 as configu-red
with seafood.
The direct hosts can be combined into networks and autonomous systems, if a whois service
is available, and seafood was thus configured. The sf_as table contains the destination AS
number of origin servers which were visited directly. Additionally, the description for the AS
as seen during the seafood run is part of the database.
Finally, the client tables sf_udp_client and sf_tcp_client deal with the client side
of the cache. Part of the primary key is the client host address. The format of the client host,
symbolic or numeric, is taken as-is from the seafood provided output. Note that the number
of clients is limited to the top N clients 2 as configured in seafood. Especially for the TCP cli-ents,
is seems feasible to provide a cut-off threshold of some percentage of the requests in
order to weed out clients whose access is denied. Please note that such an alternative is not
part of the current project.

No comments:

 
Thanks

Total Pageviews