Field Extraction Good Practice Guide
Author: Liam
Posted: 18 Jan 2012 at 8:43pm
Field Extraction Good Practice Guide

During my working week I get a lot of problems with Email2DB that end up being solved by following some good field extraction practices, so I have been motivated to write this FAQ forum thread to assist those in need. Please feel free to post your own best practices, so that we can build up a fairly comprehensive FAQ just for this subject.

This guide is here to help you build field extractions that are more likely to work when moving from the Run Test button to actual emails.

Some background information: almost all emails have two formats, plain text and HTML. Most email clients will show the HTML version of the email, whereas Email2DB by default uses the plain text. With some emails the plain text and HTML versions may be formatted differently. This means that copying and pasting what you see in Outlook will not give an accurate representation of the plain text version of the email, and can often lead to field extractions not working as expected when you start processing real emails. The usual culprit is that new lines do not appear in the plain text at the same point as they do in the HTML, or that the spacing between words or phrases is not the same. This is the most common cause of problems when moving from Run Test to actual emails.

Because of that, my number one best practice is to not build
your field extractions based on new lines or spaces unless you are absolutely sure that the message is plain text only and that its format is fixed. In those circumstances line feeds in particular are extremely valuable.

My next best practice is to use regular expressions wherever possible. Provided the expression is built correctly for what you want, aesthetic formatting discrepancies in the email won’t make a difference and the field will be extracted as expected, even if the email's format is changed aesthetically in the future.

You can couple a look for and a regular expression together. For example, you can look for the phrase “Serial No.” and then look for [0-9][0-9][0-9][A-Z][0-9]. This would extract a serial number made up of three digits followed by an uppercase letter and then a final digit, e.g. 000A0, 999Z9 or 123V8. This can be useful if there are several pieces of text in the email that match the regular expression.

At the end of the day, the best way to extract a specific field depends entirely on the environment; there is no one-size-fits-all solution to field extractions. As always, if the formatting of the messages changes, or if information is removed or new information is added, you will need to go back to the drawing board, then retest and rebuild those extractions.

This post is here to help people find the best method for them, or to introduce new ways of building extractions that you may not have thought about before. Enjoy!

Edited by Liam - 19 Jan 2012 at 12:57pm
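To illustrate the idea of coupling a look-for phrase with a regular expression, here is a minimal Python sketch. It is not how Email2DB works internally and it is not Email2DB configuration. The function name `extract_serial` and both sample message bodies are made up for the example, and the optional colon after the label is an assumption about how such an email might be laid out. The point is only that anchoring on the phrase and letting the pattern absorb any spaces or line breaks gives the same result whether the text looks like the HTML view or like a differently wrapped plain text part.

```python
import re
from typing import Optional

# The bare pattern from the post: three digits, an uppercase letter, a digit.
# [0-9]{3} is just a shorter way of writing [0-9][0-9][0-9].
BARE_SERIAL = re.compile(r"[0-9]{3}[A-Z][0-9]")

# The same pattern anchored on the phrase "Serial No.".
# \s* and [\s:]* soak up any spaces, line breaks or a colon (assumed layout),
# so the match does not depend on where the lines happen to wrap.
SERIAL_AFTER_LABEL = re.compile(r"Serial\s*No\.[\s:]*([0-9]{3}[A-Z][0-9])")


def extract_serial(body: str) -> Optional[str]:
    """Return the serial number that follows "Serial No.", or None."""
    match = SERIAL_AFTER_LABEL.search(body)
    return match.group(1) if match else None


# Hypothetical message bodies: what you might copy out of a mail client,
# and the same information wrapped differently in the plain text part.
as_seen_in_client = "Serial No. 123V8\nOrder ref 000A0"
plain_text_part = "Order ref 000A0 Serial\nNo.:\n\n   123V8"

print(extract_serial(as_seen_in_client))
print(extract_serial(plain_text_part))

# Without the anchor phrase, the first thing that looks like a serial wins,
# which here is the order reference rather than the serial number.
print(BARE_SERIAL.search(plain_text_part).group(0))
```

Both calls to `extract_serial` pick out 123V8 even though the spacing and line breaks differ, while the bare pattern on its own grabs the order reference 000A0 from the second sample. That is the situation where pairing the look-for phrase with the expression earns its keep.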