BLOG.LENKOV.COM

SF LinuxWorld 2004

This year's LinuxWorld (http://www.linuxworldexpo.com) was much better than the previous three. The focus was mainly on clustering and scalability, with some desktop.

 

I visited the following sessions:

 

1. How the Linux Kernel Gets Built

Moderator: Andrew Morton (OSDL)

The guys were selling the idea that 10,000 developers around the world with close to zero budget can take on Microsoft. I didn't hear anything I didn't already know. Half of the discussion was about why nVidia doesn't open-source its drivers. Nothing to write home about.

 

2. Disaster Recovery on Linux

Speaker: James Bottomley (SteelEye Technology, Inc)

This session was about building a disaster recovery solution based on disk replication over TCP.

a/ Bandwidth needed: it should be somewhere between the average and the sustained bandwidth. The closer to real time we want the replica to be, the closer to the sustained bandwidth we should reserve for it (provisioning).
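To make the provisioning trade-off concrete, here is a minimal Python sketch (my own illustration, not from the talk). The per-second write trace and the link rates are hypothetical numbers; the function just tracks how much written data is still waiting to be replicated.

```python
def replication_backlog(writes_mb_per_sec, link_mb_per_sec):
    """Return the backlog (MB not yet replicated) after each second.

    writes_mb_per_sec: hypothetical trace of MB written locally per second.
    link_mb_per_sec:   bandwidth reserved for replication, in MB/s.
    """
    backlog = 0.0
    history = []
    for written in writes_mb_per_sec:
        # New writes queue up; the link drains at most link_mb_per_sec.
        backlog = max(0.0, backlog + written - link_mb_per_sec)
        history.append(backlog)
    return history


# Hypothetical trace with a burst in the middle; the average is 4 MB/s.
trace = [2, 2, 10, 10, 2, 2, 2, 2]
at_average = replication_backlog(trace, 4)   # link sized to the average
at_burst = replication_backlog(trace, 10)    # link sized to the burst rate
```

With the link sized to the average, the replica falls behind during the burst and needs several seconds to catch up; with the link sized to the burst rate, the backlog stays at zero.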

b/ Type of replications:

- Transactional (each write to disk goes to a transaction log, and this log is replayed on the remote system in the same order it was recorded). The problem with this approach is that the log becomes very big.

- Intent log (keep a bitmask of updated sectors and update only those). The problem with this one is that it doesn't keep the order of updates, so it may end up with inconsistent data if replication breaks.

- Hybrid (keep only the last transactions in the "transactional" log and the rest in the intent log). Currently there is no commercial implementation of this type.
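The difference between the first two approaches can be sketched in a few lines of Python. This is a toy model under my own assumptions (a disk is a dict of sector number to data, and the class and method names are made up for illustration); it is not how md, nbd or drbd implement it.

```python
class TransactionalLog:
    """Keeps every write in order; replay preserves ordering, but the log grows."""

    def __init__(self):
        self.entries = []

    def record(self, sector, data):
        self.entries.append((sector, data))

    def replay(self, replica):
        # Apply the writes in exactly the order they were recorded.
        for sector, data in self.entries:
            replica[sector] = data
        self.entries.clear()


class IntentLog:
    """Keeps only the set of dirty sectors; bounded size, but write order is lost."""

    def __init__(self):
        self.dirty = set()

    def record(self, sector, data):
        self.dirty.add(sector)  # the data itself is not logged

    def resync(self, primary, replica):
        # Copy the current contents of each dirty sector, in sector order
        # rather than in the order the writes happened.
        for sector in sorted(self.dirty):
            replica[sector] = primary[sector]
        self.dirty.clear()


primary, replica_t, replica_i = {}, {}, {}
tlog, ilog = TransactionalLog(), IntentLog()
for sector, data in [(7, "a"), (3, "b"), (7, "c")]:
    primary[sector] = data
    tlog.record(sector, data)
    ilog.record(sector, data)

log_sizes = (len(tlog.entries), len(ilog.dirty))
tlog.replay(replica_t)
ilog.resync(primary, replica_i)
```

Three writes to two sectors leave three entries in the transactional log but only two dirty bits in the intent log. Both replicas end up identical to the primary once the sync completes; the intent log is only at risk if the resync is interrupted halfway.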

c/ Type of communication:

- Asynchronous (fire and forget - fast, but you can't be sure that the replica is accurate)

- Synchronous (slow, but reliable -- nobody is using it)

d/ products available:

- md+nbd (better)

- drbd

 

 

3. MySQL Replication and Clustering

Speaker: Brian Aker, Director of Architecture, MySQL

Very nice guy. He explained the different scenarios for MySQL high availability (replication, etc.).

a/ When you have replication, you can convert the binary log to SQL with mysqlbinlog and then run it on any server.

 

b/ When replication is broken, you can skip the next n events in it with SET GLOBAL SQL_SLAVE_SKIP_COUNTER = n. See more:

http://dev.mysql.com/doc/mysql/en/SET_GLOBAL_SQL_SLAVE_SKIP_COUNTER.html

 

c/ You can see the current replication status with "show {slave|master} status"

 

d/ You can purge the old logs with PURGE MASTER LOGS
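As a rough illustration of items b/ to d/, here is a small Python sketch that only builds the SQL statements as strings (it does not connect to a server). The function names and the log file name in the usage are my own placeholders; wrapping the skip counter in STOP SLAVE / START SLAVE is my assumption about the usual procedure, not something from the talk.

```python
import re


def skip_events_statements(n):
    """Statements to skip the next n events on a slave whose replication stopped."""
    if not isinstance(n, int) or n < 1:
        raise ValueError("n must be a positive integer")
    return [
        "STOP SLAVE",  # assumption: the slave is stopped before the counter is set
        f"SET GLOBAL SQL_SLAVE_SKIP_COUNTER = {n}",
        "START SLAVE",
        "SHOW SLAVE STATUS",  # check that replication is running again
    ]


def purge_statement(oldest_log_to_keep):
    """Statement to purge master binary logs older than the given log file."""
    # Accept only plain file names so the value can be quoted safely.
    if not re.fullmatch(r"[A-Za-z0-9._-]+", oldest_log_to_keep):
        raise ValueError("unexpected characters in log name")
    return f"PURGE MASTER LOGS TO '{oldest_log_to_keep}'"


statements = skip_events_statements(1)
purge = purge_statement("mysql-bin.000010")  # hypothetical log file name
```

Before purging, it is worth checking SHOW SLAVE STATUS on every slave to make sure none of them still needs the logs you are about to delete.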

 

e/ When you create a mysql cluster using shared disk to keep DB (without replication):

- you should use only MyISAM tables (the only type that supports concurrent access)

- use "enable external logs" (this most likely refers to MySQL's external locking option, --external-locking)

- don't use NFS for shared disk (it's extremely slow)

- recommended to use iSCSI (over 1G or 10G).

 

f/ type of replications:

- simple uni-directional (what we use)

- write cluster (bi-directional replication version 4+)

- star cluster

- chain cluster (it's easy to add/remove nodes)

- parallel replication (version 5.1+)

 

g/ If you implement text search on your site, you can protect against a DoS attack made by simply pressing the search button many times: replicate your DB to a separate node and run the search against it (this way the main site won't slow down at all).

 

h/ If you are using a write cluster, don't use auto-increment.
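The reason is easy to see in a small Python simulation (my own sketch, not from the talk): two masters that each run their own plain counter hand out the same ids. Interleaving the counters with a step equal to the number of masters is one common workaround; as far as I know, MySQL 5.0 and later expose this as the auto_increment_increment and auto_increment_offset variables, but treat that as my assumption.

```python
from itertools import count, islice


def generate_ids(start, step, n):
    """First n ids produced by a counter that starts at `start` and adds `step`."""
    return list(islice(count(start, step), n))


# Two masters, each with its own plain auto-increment counter.
node_a = generate_ids(1, 1, 5)
node_b = generate_ids(1, 1, 5)
collisions = set(node_a) & set(node_b)

# Interleaved counters: step = number of masters, a different offset per node.
node_a_interleaved = generate_ids(1, 2, 5)
node_b_interleaved = generate_ids(2, 2, 5)
collisions_interleaved = set(node_a_interleaved) & set(node_b_interleaved)
```

With plain counters every id collides once the rows are replicated in both directions; with interleaved counters the two sets never overlap.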

 

i/ If you want persistent but very fast access to a table, you can use a HEAP table and replicate it to some persistent storage.

 

More info on Mysql cluster here:

http://www.mysql.com/products/cluster/

 

 

 

4. Lustre

Speaker: Robert Read, Cluster File Systems

This distributed file system is designed to handle a very large number of nodes (10K). The system consists of 3 components: clients (nodes), management nodes (responsible for metadata and resource location), and storage devices (NAS).

a/ it's using heartbeat for fail-over.

b/ it's patching the kernel on the clients and the servers (management nodes).

 

More info on http://www.lustre.org/

 

 

5. IDS

Speaker: David Allen

It was an overview of the IDS market. The current open source players on the market are:

- snort (of course)

- lids.org

- grsecurity.net

- acid (PHP reporting tool)

- swatch, logcheck - log monitoring

- PSAD - port scanner, attack detector

- tripwire.org

 

 

6. Using OpenSSI

Speaker: Bruce Walker, Fellow, Hewlett-Packard

This guy from HP is the architect and project manager of the OpenSSI project (he works full time for HP).

 

(a) The current players on the SSI market are:

- mosix/Qluster - home-node SSI

- openSSI - aggregate the resources of all nodes (FS, Mem, CPU, sync, etc)

- PolyServ

- Redhat GFS (only the FS is SSI)

 

(b) OpenSSI provides:

- single HA root fs

- single view of all the systems (devices, processes, IPC, etc)

- process migration

- load balancing of incoming connections.

 

(c) details on implementation:

- moves just the process, not the shared memory (SM), semaphores, etc. If a request is made, it is IPCed to the node where the resource was created. (Yes, if you open a file and the process is then migrated, each request is forwarded to the node where the file was opened, and the process is never moved back to that node, no matter how many requests there are.)

- doesn't scale threads, because threads use SM and they will suck if they run on multiple nodes. (Yes, if you run a multithreaded application, it stays on just one node.)

- all the system calls are executed on the node where the process currently resides (unless they access a resource created on a different node, in which case they get IPCed to the "home" node)

- provide custom API (node UP/Down, run process on specific node)

- Use LVS for IP load balancing

- Use linux-ha.org

- it has a custom cluster file system (CFS), but it's just a layer that presents a common namespace (it doesn't scale well)

- integrates with RedHat cluster manager
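The migration behaviour described above can be modelled with a toy Python sketch. This is only my reading of the talk, with made-up class names and node names; it says nothing about how OpenSSI actually implements it in the kernel.

```python
class Resource:
    """A file, semaphore, etc. It stays on the node where it was created."""

    def __init__(self, name, home_node):
        self.name = name
        self.home_node = home_node


class Process:
    def __init__(self, node):
        self.node = node
        self.forwarded_calls = 0

    def open(self, name):
        # The resource is created on the node where the process runs right now.
        return Resource(name, self.node)

    def migrate(self, node):
        # Only the process moves; its resources stay on their home nodes.
        self.node = node

    def access(self, resource):
        if resource.home_node == self.node:
            return "local"
        # The call is forwarded, and the process is never pulled back.
        self.forwarded_calls += 1
        return f"forwarded to {resource.home_node}"


proc = Process("node1")
log_file = proc.open("app.log")
before = proc.access(log_file)
proc.migrate("node2")
after = [proc.access(log_file) for _ in range(1000)]
```

After the migration every one of the 1000 accesses is forwarded to node1, and the process still sits on node2, which is exactly why I/O-heavy processes are poor candidates for migration.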

 

(d) during the Q&A he was asked about running Samba under openSSI, and the answer was:

It could be a killer app, but because it uses SM openSSI can't scale it.

 

More info on http://openssi.org

 

 

7. Tuning 2.6 for Network Performance

Speaker: Stephen Hemminger, OSDL.

There are many ways to improve TCP performance. TCP needs improvement especially when you try to run it over a 10+G pipe.

 

(a) Different configurations for Congestion control algorithm:

- Reno 

- Vegas

- Westwood (better on wireless)

- Binary Increase Congestion Control (BIC)

 

(b) How to squeeze out more performance

- Large MTU (4k): +63%

- LAN driver built into the kernel (not as a module): up to +10%

- Turn off timestamps: +4%

- Bind IRQ to processor (varies)

- increase the tcp_rmem
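On the tcp_rmem point: a common rule of thumb (my addition, not from the slides) is that the maximum receive buffer should be at least the bandwidth-delay product of the path, otherwise a single connection cannot keep the pipe full. A minimal Python sketch, with a hypothetical 10 Gbit/s link and 50 ms round-trip time:

```python
def bdp_bytes(bandwidth_bits_per_sec, rtt_ms):
    """Bandwidth-delay product: bytes in flight needed to keep the pipe full.

    Uses integer arithmetic: bits/s * ms / 1000 gives bits, / 8 gives bytes.
    """
    return bandwidth_bits_per_sec * rtt_ms // 8000


# Hypothetical path: 10 Gbit/s with a 50 ms round-trip time.
needed = bdp_bytes(10_000_000_000, 50)

# The same RTT on a 100 Mbit/s link needs a much smaller buffer.
needed_100m = bdp_bytes(100_000_000, 50)
```

For the 10 Gbit/s example this comes to about 62.5 MB, so the third (maximum) value of tcp_rmem would have to be raised to at least that for one connection to fill the link.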

(c) you can simulate delays/packet loss (inside the kernel) with netem (see http://lwn.net/Articles/92475/)

 

(d) testing TCP speed

- iperf

- netperf

 

The slides from the conference are available here:

http://developer.osdl.org/shemminger/

 

8. Exploring the Use of Linux in High Performance Computing Environments.

Speaker: Ron Gordon, Program Director, IBM Systems Group

Very funny guy, but he was just selling his IBM hardware, which is supposed to run faster than anyone else's (sure, yeah).

 

 

From the Expo:

 

Hardware

 

http://www.coraid.com - produces the EtherDrive board, which converts an ATA interface to Ethernet using the open-standard ATA-over-Ethernet (AoE) protocol. You can connect many drives in one system, and then install a driver on your Linux box that can mount them in any RAID configuration (SAN-like). The ATA-to-Ethernet board costs $230 and the shelf costs $1000.

 

http://netezza.com - a hardware-based approach to solving DB performance problems. It's a RAID-like subsystem, where the RAID is not just on the disk level but on the DB level. Each HDD has its own small PC (266MHz iPaq-like CPU), and the system stripes the DB processing among the HDD+CPU pairs where the data is stored. They claim it outperforms Oracle on Sun by 30 times. Cost $1.5M.

 

 

Software - System Level

 

http://www.radiantdata.com - peerfs - a distributed file system with automatic online replication. Each node has a full copy of all files, and each write is replicated to all nodes with conflict resolution (i.e. locking).

 

http://steeleye.com - a company offering disk-level replication over TCP using standard open-source tools such as a RAID 1 (md) device and a Network Block Device (nbd).

 

http://opmanager.com - JAMS (Just another monitoring system) like WatchCub.

 

http://www.macroimpact.com - JACFS (Just another Cluster File System) like Lustre, GFS, Coda, InterMezzo, etc.

 

http://www.netiq.com - JAMS + some management

 

 

 

Software - Application/Desktop

 

http://gluecode.com - a BPM suite that lets you build a business process graphically and then monitor its progress, again graphically. The thing I liked about it was that you can set a timeout for each phase of the project (i.e. if tisho doesn't fix this bug within 2 days, give the task to paco). Cost $20K for the business process builder and another $20 for the web portlet-based interface (a Java-based solution).

 

http://opengroupware.org - a substitute for MS Exchange Server, but free. It has a shared calendar, shared folders, etc.

 

http://scalix.com - a very nice looking email system (especially the web-based interface). It also allows BlackBerry integration. It has the basic groupware stuff like a shared calendar, contacts, tasks, and public folders.

 

 

Jargon Watch:

google (v) - search on the web (you can google for it)