Sunday, October 24, 2010

SPIRE 2010 Los Cabos

I recently attended the String Processing and Information Retrieval conference (SPIRE 2010). In terms of content, this conference far exceeded my expectations. Excellent presentations on the latest research. Stuff I'd *never* even heard of: Wavelet Trees for example. Very friendly presenters, who enthusiastically spent time with me between presentations to thoroughly teach some of the concepts.










 
 Yahoo! Research presents Time Explorer











Microsoft Research Keynote








No complaints about the venue either. Met so many cool and interesting researchers in the field of IR (which is basically "search").

Java SimpleDateFormat: simply a waste of Time

I've now sunk probably 8 hours  into trying to make a minimally robust date parser that can handle dates in a variety of formats. Sadly, I'm afraid this is the best I've been able to do, given the fragility of the Java SimpleDateFormat class. I've composed a little framework in two pieces. The first piece is a properties file that defines the formats that the user will be allowed to enter dates in. The second piece is the code that jumps through an absurd series of hoops in order to work around bugs and shortcomings in SimpleDateFormat. If anyone knows a better way, that's been *tested*, please, for the love of GOD, post a comment here.

DateFormats.properties

format1=yyyy-MM-dd HH:mm:ss.SSS z
format2=yyyy-MM-dd HH:mm:ss.SSS
format3=yyyy-MM-dd HH:mm:ss
format4=yyyy-MM-dd HH:mm
format5=E, MMM dd, yyyy
format6=MMM dd yyyy
format7=MMM dd yyyy HH:mm
format8=MMM dd, yyyy
format9=MMM. dd, yyyy
format10=MM/dd/yy
format11=MM/dd/yyyy
format12=M/d/yy
format13=yyyy-MM-dd
format14=yyyy-MM-dd, E
format15=yyyy MMM. dd, HHh 1mmm ss's'
format16=yyyy/M/dd/HH:mm:ss
format17=yyyy-M-dd'T'HH:mm:ss


Notice in the formats above that all months and days are defined with two format chars. E.g. "MM" or "DD". Why not "M" or "D"?  SimpleDateFormat format strings are very rigid. We want the user to be able to enter 10/1/07 or 10/01/07. But I don't want to clutter the properties file with single and double format chars. For this reason, the code automatically pads any single digits with zero. So "1" in isolation is converted to "01". So if the user enters "Jul 7, 2001" the string is automatically zero-padded to "Jul 07, 2001". This allows the properties file to include only the double-digit variants of date numbers without having to account for every variation in which a digit could be either single or double digit. With regard to the code below, I haven't taken time to extract it as a library, so please forgive the unexplained parameters such as "id" and "paramName".This code convers the time provided by the user into the current Greenwich Mean Time.

Java Source Fragments

private static final Pattern ISOLATED_DIGIT = Pattern.compile("(^|\\D|\\s)(\\d)(\\D|$|\\s)"); //digit can be surrounded by non-digit but also begining or end of line or whitespace


public static Object parseDate(String dateString, String id, String paramName) {
        if(null == dateString){
            return dateString;
        }
        if(dateString.equalsIgnoreCase("NOW")){
            //default time stamp is Zulu time (time at prime meridian, Greenwich)
            long currentTime = System.currentTimeMillis();
            Timestamp ts = new java.sql.Timestamp(currentTime-TimeZone.getDefault().getOffset(currentTime));
            return ts;
        }else{
            dateString = zeroPadIsolatedDigits(dateString);
            if(log.isDebugEnabled())log.debug("input zero-padded to " + dateString);
            for(Map.Entry entry:DATE_FORMATS.entrySet()){
                String key = entry.getKey().toString();
                DateFormat dateFormat = new SimpleDateFormat(entry.getValue().toString());
                dateFormat.setLenient(false);
                try{
                    java.util.Date date = dateFormat.parse(dateString);
                    
                    //I have found bugs in the parse method, in which strings that *don't* actually match the format template
                    //get parsed without Exception, thus truncating trailing parts of the date (like the timezone). For this reason
                    //I am forced to regurgitate the date string, and compare it to the input, and if they don't match, skip forward in the loop
                    String regurgitated = dateFormat.format(date);//...and no, we can't just compare the regurgitated string b/c PST and get adjusted to PDT
                    if(regurgitated.length() != dateString.length()){ //compare their lengths to make sure nothing got truncated
                        if(log.isDebugEnabled())log.debug(dateString + " was parsed with format " + entry.getValue() + " but was apparently parsed incorrectly as " + regurgitated + " length difference was "+(regurgitated.length()-dateString.length()));
                        continue;
                    }else{
                        //the length of the regurgitated string matches the length of the input.
                        //We still need to eat the regurgitated string, and compare the resulting  ms since epoch to the Date we originally parsed to make sure we didn't
                        //encounter a SimpleDateFormat parsing bug.
                         //Example: 2010-9-23 12:22:23.000 PST gets regurgintated as 2010-09-23 13:22:23.000 PDT, which is different text, but same ms since epoch
                        //So the above illustrates why we cannot just compare the regurgitated text to the input dateString, because Java may decide on an equivalent but
                        //textually different representation from the input (PST is the same as PDT in the half of the year when daylight savings time isn't in effect).
                        java.util.Date reparsed = dateFormat.parse(regurgitated);
                        if(reparsed.getTime() != date.getTime()){
                            if(log.isDebugEnabled())log.debug(dateString+" produces different ms since epoch than  " +regurgitated );
                            continue;
                        }
                    }
                    if(log.isDebugEnabled())log.debug("handled date" + dateString +" using format template: " + entry.getValue().toString() + " which regurgitated " + dateString);
                    TimeZone timeZone = dateFormat.getTimeZone();
                    //if(log.isDebugEnabled())log.debug(timeZone);
                    //convert to GMT time and account for timezone
                    return new java.sql.Timestamp(date.getTime()-timeZone.getOffset(date.getTime()));
                }catch(ParseException pe){
                    if(log.isDebugEnabled()){
                        log.debug(pe.getMessage());
                        log.debug(dateString + "couldn't be parsed with format " + entry.getValue());
                    }
                }
            }
            throw Util.newApplicationException(id, "date "+ dateString+" could not be parsed.","DataFormatException", 130, paramName);
        }
    }

     private static String zeroPadIsolatedDigits(String s) {
        //Java date parsing will prepend leading zeros to islolated digits. For this reason, I prepend a zero to the isolated digits before parsing the string, so that the regurgitation step
        //produces a string identical to the input
        boolean done = false;
        //this is an obscure case in which the matching regions overlap, therefore m.find() only finds one of the overlapping instance.
        //For instance, 100-1-2. Find() won't find "-2 " because it overalps "-1-". Consequently we have to repeatedly call find on a new matcher after each replacement has been made
        while (!done) {
            Matcher m = ISOLATED_DIGIT.matcher(s);
            StringBuffer buf = new StringBuffer();
            if (m.find()) {
                if(log.isDebugEnabled())log.debug("found isolated digit:" + m.group(2));
                m.appendReplacement(buf, m.group(1) + "0" + m.group(2) + m.group(3)); //need to stuff back in the stuff before the isolated digit, then 0 (zero-pad), then the digit, then stuff after)
            }else {
                done = true;
            }
            m.appendTail(buf); //OK, al isolated digits have been replaced by leading-zero padded digits
            s = buf.toString(); //replace input with zero-padded input
        }
        return s;
    }

Saturday, October 23, 2010

Data Storm

This morning's project was installing a Davis Vantage Pro weather station. It's really an amazing piece of equipment. It logs windspeed and direction, solar radiation, and dozens of other variabls. It even has a rain collector. The data comes off the unit in real-time, and is transmitted to a nice LCD console that shows graphs, weather forecasts etc. I hope to log enough data on sun and wind in order to make a determination as to whether solar panels or wind power will be my best alterative energy source. I'm also interested in quantifying just how bizarre the microclimate of western san francisco actually is. How many days is it foggy? What time does the fog typically roll in or out in a given month? With a bit of free time, I hope to be able to load the data into nextdb to create a detailed data history. 
 


 I'm already getting some valuable siting data on windspeed. There are several interesting vertical-axis helical wind-power generators that might be appropriate. Hmmm.