Monday, 17 March 2008

Simplified molecular input line entry specification

The whole name is almost as catchy as "SMILES". I used to think it was a strange way of representing molecules for computers, but it actually seems more straightforward than IUPAC nomenclature. It's also shorter, and it's possible to generate a unique name for each molecule. (It's more difficult to pronounce, though.)

You can try out SMILES strings at this page; it's kind of fun. How to write them is described on Wikipedia, for example.

Ethane is just CC.

Add double and triple bonds like this:
C#CC=C for butenyne.

Add a branch in parentheses:
CC(C)CCC for 2-Methyl-n-pentane

If you want a ring, add the same number after each of the two atoms to be joined together:
C1C(C)CCC1 for Methyl-cyclo-pentane

Add a pyridyl group to the C next to the methyl group (aromatic atoms are written in lower case, and you have to include a second ring closure):
C1(c2ncccc2)C(C)CCC1 for 1-(2-pyridyl)-2-methylcyclopentane
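
The rules above (bond symbols, branches in parentheses, ring-closure numbers, lower case for aromatic atoms) can be sketched as a small parser. This is only a minimal illustration and not how any real toolkit does it: the function name `parse_smiles` is made up for this example, it accepts only C, N, O (and c, n, o), single-digit ring closures and the bond symbols `-`, `=`, `#`, and it simply assumes that an unmarked bond between two lower-case atoms is aromatic (recorded as 1.5). Charges, brackets, hydrogens and stereochemistry are ignored.

```python
import re

# Tokens of a small SMILES subset: one-letter atoms (upper case aliphatic,
# lower case aromatic), bond symbols, branch parentheses, ring digits.
TOKEN = re.compile(r"[CNOcno]|[-=#]|[()]|\d")
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def parse_smiles(smiles):
    """Return (atoms, bonds) for a SMILES string in the supported subset.

    atoms is a list of atom symbols as written; bonds is a list of
    (i, j, order) tuples, where order 1.5 marks an assumed aromatic bond.
    """
    if TOKEN.sub("", smiles):
        raise ValueError(f"unsupported characters in {smiles!r}")
    atoms, bonds = [], []
    prev = None         # index of the atom the next atom attaches to
    pending = None      # explicit bond order waiting for its second atom
    branch_stack = []   # attachment points saved at each "("
    open_rings = {}     # ring digit -> (atom index, explicit order or None)

    def order_for(i, j, explicit):
        # Simplifying assumption: unmarked bond between two lower-case
        # atoms is aromatic, any other unmarked bond is single.
        if explicit is not None:
            return explicit
        return 1.5 if atoms[i].islower() and atoms[j].islower() else 1

    for tok in TOKEN.findall(smiles):
        if tok in BOND_ORDER:
            pending = BOND_ORDER[tok]
        elif tok == "(":
            branch_stack.append(prev)
        elif tok == ")":
            if not branch_stack:
                raise ValueError(f"unmatched ')' in {smiles!r}")
            prev = branch_stack.pop()
        elif tok.isdigit():
            if prev is None:
                raise ValueError(f"ring digit before any atom in {smiles!r}")
            if tok in open_rings:
                # Second occurrence of the digit: close the ring.
                start, start_order = open_rings.pop(tok)
                explicit = pending if pending is not None else start_order
                bonds.append((start, prev, order_for(start, prev, explicit)))
            else:
                # First occurrence: remember where the ring starts.
                open_rings[tok] = (prev, pending)
            pending = None
        else:
            atoms.append(tok)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, idx, order_for(prev, idx, pending)))
            prev, pending = idx, None
    if open_rings or branch_stack:
        raise ValueError(f"unclosed ring or branch in {smiles!r}")
    return atoms, bonds


for s in ["CC", "C#CC=C", "CC(C)CCC", "C1C(C)CCC1", "C1(c2ncccc2)C(C)CCC1"]:
    atoms, bonds = parse_smiles(s)
    print(s, "->", len(atoms), "heavy atoms,", len(bonds), "bonds")
```

A quick sanity check is that an open chain has one bond fewer than it has atoms, and each ring closure adds one more bond.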

You can add an extra oxirane ring:

You can mess with stereochemistry (using @ and @@)

If you still haven't had enough, you can add a double bond in E configuration to the pyridyl ring:

or Z configuration



Ψ*Ψ said...

Cool! I always kinda wondered how that worked.

Egon Willighagen said...

It might interest you that is working on an open standard for SMILES.

Felix said...

Ψ*Ψ: I am glad they forced me into that cheminformatics class, and told me what it's about

egon: it seems cool that so much is open source these days. maybe eventually I will use SMILES, this was just for curiosity

Lightnir said...

You find drawing structures by using SMILES cool? Yeah, right... When you want to do something really cool, try superimposing molecules by using SMILES Arbitrary Target Specification (SMARTS). It's a kind of query language based on SMILES that allows you to search for molecule fragments within SMILES strings. Check obfit from the Open Babel package for more info.

Felix said...

I am waiting until they teach me that in class ...

GMC2007 said...

One of the coolest things about SMILES is that there are algorithms for generating the canonical SMILES for a molecule. In general you can write a number of equally valid SMILES for a molecule. The canonicalisation algorithms identify one of these SMILES as the canonical SMILES. You can identify duplicate molecules in databases by simple string matching. SMILES notation also makes it very easy to build molecular models using obscure modelling tools such as the emacs editor.
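
The idea of spotting duplicates by comparing one canonical form per molecule can be illustrated with a toy sketch. To be clear about the assumptions: this is not the canonicalisation algorithm of any real toolkit, and the names `to_graph`, `canonical_key` and `find_duplicates` are invented for this example. It handles only aliphatic C, N, O with branches, single-digit ring closures and `-`, `=`, `#` bonds, and it produces a canonical key (not a canonical SMILES string) by brute force over every atom ordering, which is only practical for very small molecules.

```python
import re
from itertools import permutations

TOKEN = re.compile(r"[CNO]|[-=#]|[()]|\d")
ORDER = {"-": 1, "=": 2, "#": 3}


def to_graph(smiles):
    """Parse a tiny SMILES subset into (atoms, {(i, j): bond order})."""
    if TOKEN.sub("", smiles):
        raise ValueError(f"unsupported characters in {smiles!r}")
    atoms, bonds = [], {}
    prev, pending, stack, rings = None, 1, [], {}
    for tok in TOKEN.findall(smiles):
        if tok in ORDER:
            pending = ORDER[tok]
        elif tok == "(":
            stack.append(prev)
        elif tok == ")":
            prev = stack.pop()
        elif tok.isdigit():
            if tok in rings:
                start, start_order = rings.pop(tok)
                # The bond order may be written at either end of the closure.
                bonds[(start, prev)] = max(start_order, pending)
            else:
                rings[tok] = (prev, pending)
            pending = 1
        else:
            atoms.append(tok)
            if prev is not None:
                bonds[(prev, len(atoms) - 1)] = pending
            prev, pending = len(atoms) - 1, 1
    return atoms, bonds


def canonical_key(smiles, max_atoms=8):
    """Smallest (labels, edges) description over all atom renumberings."""
    atoms, bonds = to_graph(smiles)
    n = len(atoms)
    if n > max_atoms:
        raise ValueError("brute force is only practical for tiny molecules")
    best = None
    for perm in permutations(range(n)):
        # perm[old] is the new index given to atom number old.
        labels = [None] * n
        for old, new in enumerate(perm):
            labels[new] = atoms[old]
        edges = sorted(
            (min(perm[i], perm[j]), max(perm[i], perm[j]), order)
            for (i, j), order in bonds.items()
        )
        key = (tuple(labels), tuple(edges))
        if best is None or key < best:
            best = key
    return best


def find_duplicates(smiles_list):
    """Group input strings that describe the same molecule."""
    groups = {}
    for s in smiles_list:
        groups.setdefault(canonical_key(s), []).append(s)
    return [group for group in groups.values() if len(group) > 1]


# Three spellings of 2-methylpentane and one of 3-methylpentane.
print(find_duplicates(["CC(C)CCC", "CCCC(C)C", "CCC(C)CC", "C(C)(C)CCC"]))
```

Once every entry has been reduced to its key, finding duplicates is just a dictionary lookup, which is the string-matching trick described above. Real canonicalisers avoid the factorial blow-up by ranking atoms with graph invariants instead of trying every ordering.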

Lightnir said...

Felix: Why wait if you can learn it faster by yourself? You may be interested in this short article.
GMC2007: The emacs part sounds like fun to me ]:)

GMC2007 said...

A couple of responses to Lightnir's comment.

Although SMARTS notation can be used to specify how molecules should be overlaid, it goes well beyond that. The SMARTS language enables powerful and general definition of substructures. And that is the basis of chemistry. BTW good to see the use of recursive SMARTS in your example.

Although canonicalisation of SMILES is very useful, you are not obliged to write SMILES in their canonical form. If you need to build structures from scratch, it's usually quicker to type them in as SMILES (hence my reference to emacs). This makes it easy to force a particular ordering of atoms since many 3D structure generators will maintain SMILES order. If you're doing covalent docking with something like GOLD, you will typically need to have ordered the molecules so that the covalent link atom always occurs at a fixed point (e.g. first atom) in the molecule.

Shawn Wilkinson said...

Emacs isn't obscure. It's the best thing ever. Ever.