ApacheConEU - part 10
November 19, 2012
In the next session Jukka introduced Tika, a toolkit for parsing content from files. It includes a heuristics-based component for guessing the file type: based on the file extension, the magic bytes, and certain patterns in the file, the type can be guessed rather reliably. Some anecdotes: not all MIME types are registered with IANA, there are of course conflicting file extensions, Microsoft Word localises not only its interface but also the magic in the file, ...
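To make the idea of combining magic bytes with the file extension a bit more concrete, here is a minimal sketch in Python. It is not Tika's implementation or API: the function name `guess_type`, the tiny `MAGIC` and `EXTENSIONS` tables, and the rule that the extension may only refine a container type such as ZIP are all assumptions made for illustration.

```python
import os

# Illustrative only: a toy detector, not Tika's code or its type database.

DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

# Hypothetical magic table: (leading bytes, MIME type).
MAGIC = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"PK\x03\x04", "application/zip"),
]

# Hypothetical extension table.
EXTENSIONS = {
    ".pdf": "application/pdf",
    ".png": "image/png",
    ".zip": "application/zip",
    ".docx": DOCX,
    ".txt": "text/plain",
}

# Container types whose magic is shared by more specific formats.
# The extension is only trusted to pick one of these refinements.
SPECIALISATIONS = {
    "application/zip": {DOCX},
}


def guess_type(name: str, head: bytes) -> str:
    """Guess a MIME type from a file name and the first bytes of the file."""
    by_magic = next(
        (mime for prefix, mime in MAGIC if head.startswith(prefix)), None
    )
    ext = os.path.splitext(name)[1].lower()
    by_ext = EXTENSIONS.get(ext)

    if by_magic is None:
        # No magic matched: fall back to the extension, then to a generic type.
        return by_ext or "application/octet-stream"
    if by_ext in SPECIALISATIONS.get(by_magic, ()):
        # The extension refines a generic container type (e.g. ZIP -> DOCX).
        return by_ext
    # Otherwise the content wins over a possibly misleading extension.
    return by_magic
```

Calling `guess_type("image.txt", head)` with a PNG header in `head` shows the point of the heuristic: the content overrides the conflicting extension, while a `.docx` name on a ZIP header is accepted as a refinement. A real detector needs a far larger type database and, as the anecdotes suggest, has to cope with unregistered types and magic that varies between files.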