Notes on MySQL Usage

From NSMWiki
Jump to: navigation, search

This page is a temporary repository for some notes on how Sguil uses MySQL. Specifically, the notes are probably going to be related to performance of the SANCP table.

Database Schema

Here's a fragment of Tcl code from Sguild that describes the schema of the main SANCP merge table:

CREATE TABLE sancp                                 \
(                                                  \
 sid           INT UNSIGNED            NOT NULL,    \
 sancpid       BIGINT UNSIGNED         NOT NULL,    \
 start_time    DATETIME                NOT NULL,    \
 end_time      DATETIME                NOT NULL,    \
 duration      INT UNSIGNED            NOT NULL,    \
 ip_proto      TINYINT UNSIGNED        NOT NULL,    \
 src_ip        INT UNSIGNED,                        \
 src_port      SMALLINT UNSIGNED,                   \
 dst_ip        INT UNSIGNED,                        \
 dst_port      SMALLINT UNSIGNED,                   \
 src_pkts      INT UNSIGNED            NOT NULL,    \
 src_bytes     INT UNSIGNED            NOT NULL,    \
 dst_pkts      INT UNSIGNED            NOT NULL,    \
 dst_bytes     INT UNSIGNED            NOT NULL,    \
 src_flags     TINYINT UNSIGNED        NOT NULL,    \
 dst_flags     TINYINT UNSIGNED        NOT NULL,    \
 INDEX p_key (sid,sancpid),                         \
 INDEX src_ip (src_ip),                             \
 INDEX dst_ip (dst_ip),                             \
 INDEX dst_port (dst_port),                         \
 INDEX src_port (src_port),                         \
 INDEX start_time (start_time)                      \
) TYPE=MERGE UNION=([join $tmpTables ,])      \

The merge table is made up of identically defined sancp tables, one for each day and each sensor. For example, for sensors named "internet", "business" and "education", you'd have tables:

sancp_internet_20080101
sancp_business_20080101
sancp_business_20080101

sancp_internet_20080102
sancp_business_20080102
sancp_business_20080102

Note that sguild takes care of automatically creating the individual tables whenever the first new data arrives for each sensor at the beginning of the day. However, these tables are never searched directly. Sguild only search the main "sancp" MERGE table.

Misc Notes

  • it's not uncommon to have 500 million rows, though, or more. Some have more like 2 billion.
  • MERGE really helps with data management. The data isn't stagnant. Rows are constantly being inserted, sometimes by the tens of thousand per 5/mins.
  • Most of the searches involve a combination of time based searches(start_time) or IP (src_ip, dst_ip), though there are some port basedqueries from time-to-time
  • At some point old data has to be purged. Being able to quickly drop the oldest data is a requirement.
  • So, our main issue is with speeds of the queries and INSERTs(which is really a LOAD).
  • the JOIN we do for queries might be an issue. The sancp table only has a sensor ID (sid). But all queryresults require a sensor name. So we JOIN the sensor table to get thatinfo. Perhaps it'd be better to simply insert the sensor name as well?
  • I'm pretty sure the load problem is really that it takes a long time to update the indices, and it's being done on each table about once per minute or so. I would say every search has a timestamp requirement. Most also have the either the src or dst ip. Port searches are far less common, but not unheard of.
  • indices (.MYI files) are about 2x the size of the data (.MYD files)