Semantic web and privacy issues

Whoa! This is something interesting.

On one side are people talking about the interesting analyses that can be made from information on the web; on the other side are people talking about its potential threat to privacy.

Well, I am talking about collecting data from various sources and then making interesting analyses from it. And this data could be about facts, things or 'people'.

Entity analytics is not something new to the Semantic Web. There is work going on in the fields of identity resolution (who is who), relationship resolution (who knows who) and anonymous resolution (who is who and who knows who, anonymously). And this is really important because it helps organizations combat fraud and threats.
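
To make the first of these concrete, here is a minimal sketch of identity resolution: deciding that two records refer to the same person by comparing normalized attributes. The PersonRecord type and the matching rule are hypothetical illustrations, not any real product's algorithm; real systems use far richer rules.

```java
import java.util.*;

// A hypothetical sketch of identity resolution ("who is who"):
// two records are considered the same person if their normalized
// names and dates of birth match.
class PersonRecord {
    final String name;
    final String dateOfBirth; // e.g. "1980-01-15"

    PersonRecord(String name, String dateOfBirth) {
        this.name = name;
        this.dateOfBirth = dateOfBirth;
    }

    // Normalize so that "John  SMITH" and "john smith" compare equal.
    String key() {
        return name.trim().toLowerCase().replaceAll("\\s+", " ")
                + "|" + dateOfBirth;
    }
}

class IdentityResolver {
    // Groups records that appear to refer to the same real-world person.
    static Map<String, List<PersonRecord>> resolve(List<PersonRecord> records) {
        Map<String, List<PersonRecord>> identities = new HashMap<>();
        for (PersonRecord r : records) {
            identities.computeIfAbsent(r.key(), k -> new ArrayList<>()).add(r);
        }
        return identities;
    }
}
```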

But the concern raised in this BBC article cannot be ignored. The most striking statement here, made by Hugh Glaser of Southampton University with reference to the web, is: “All of this data is public data already. The problem comes when it is processed”.

You had better leave the needle in the hay. Don't try to analyze and find out where I was last Friday!

Ok, so what is the solution? Role-based security at the data-source level is something that I can think of: build security into the core of the system, so that no data can get out unless people have the proper access permissions.
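
A minimal sketch of what I mean, with hypothetical roles and record types (an illustration, not any particular product's security model): the access check lives inside the data source itself.

```java
import java.util.*;

// Hypothetical roles, ordered from least to most privileged.
enum Role { PUBLIC, ANALYST, ADMIN }

// A record tagged with the minimum role allowed to read it.
class SecureRecord {
    final String data;
    final Role requiredRole;

    SecureRecord(String data, Role requiredRole) {
        this.data = data;
        this.requiredRole = requiredRole;
    }
}

class SecureDataSource {
    private final List<SecureRecord> records = new ArrayList<>();

    void add(SecureRecord r) { records.add(r); }

    // Because the check is built into the source, no record leaves
    // without the caller holding a sufficient role.
    List<String> query(Role caller) {
        List<String> visible = new ArrayList<>();
        for (SecureRecord r : records) {
            if (caller.ordinal() >= r.requiredRole.ordinal()) {
                visible.add(r.data);
            }
        }
        return visible;
    }
}
```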

Another solution is to make sure users 'mark' their data as available for analysis, and if so, for what kind of analysis. Using the data for sampling (with individuals remaining totally anonymous) might not be really bad.
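
Again as a rough sketch (the field names are hypothetical): each record carries a flag its owner sets, and the analysis only ever sees anonymous aggregates of the records that opted in.

```java
import java.util.List;

// Hypothetical user record: the owner decides whether it may be sampled.
class UserRecord {
    final String userId;        // never exposed to the analysis below
    final int age;              // an example attribute worth aggregating
    final boolean allowSampling;

    UserRecord(String userId, int age, boolean allowSampling) {
        this.userId = userId;
        this.age = age;
        this.allowSampling = allowSampling;
    }
}

class AnonymousSampler {
    // Aggregates only opted-in records, and drops identity entirely.
    static double averageAge(List<UserRecord> records) {
        int sum = 0, count = 0;
        for (UserRecord r : records) {
            if (r.allowSampling) {
                sum += r.age;
                count++;
            }
        }
        return count == 0 ? 0.0 : (double) sum / count;
    }
}
```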

Well, these are some solutions that I feel might be considered for this problem. Time will tell.

Thoughts about Google Notebook, Google Co-op and people tagging in the enterprise

Time advances, and so does technology. Although a lot of ideas have been hovering in my mind and I have been keeping up with the happenings in the software world, I somehow could not find the time to compose a blog entry and share my views. Work has kept me busy like never before.

So let me try and consolidate everything into one entry here:

First and foremost, Google. Whew! These guys never stop (Yahoo, wake up!).

Google released Google Notebook some time back. I have been trying it for about a week now, and I am quite satisfied.

Let me start with the pros and then go to the cons.

The tool is a quickie. Clip something, click Add Note, and you are done. It could not be simpler (unless they provide a keyboard shortcut like Ctrl-Shift-C to copy and paste into Google Notebook). You can add your own notes or edit existing ones. You can clip images too! The search is there as always (almost taken for granted when it is Google 🙂 ).

It also allows us to keep notes private or make notebooks public.

And now to the cons…

The first is a security issue. As some people have been mentioning, the ease of use of this tool may tempt users to clip private data from intranets and store it on Google's servers. And Google has the right to index it.

There is absolutely no meta-data attachment. No tagging! :O (How can people forget tagging in the Web 2.0 world?!)

It is not easy to relate articles. The best way to do this is to create a new section and put everything under it, but that gets tiring soon.

There is no export feature. This is a big problem. You start clipping things and you are tied to Google, possibly forever!

Ok, we now proceed to the next application Google released: Google Co-op.

Google Co-op allows users to customize the search results that Google generates (does that sound like Eurekster Swicki?).

The interesting feature here is the extensibility that Google provides in specifying topics of interest, keywords, links and so on.

And what does Google get in return? Lots of meta-information. How nice would it be if people gave you lists of words that fall into particular categories? Google will definitely relish this!

With the hopes that Google does not turn bad, let us enjoy the cool features that they provide and the competition that they face. Competition enables innovation and that is good news for end users.

Some other things that I heard recently: People tagging in the enterprise. This reminds me of a discussion that I had with my mentor some time back.

Let us suppose that I have a set of contacts in my Sametime list. How will I categorize these people? By their teams? Well, maybe so.

But someday I might want to send a mail to all the people who are active in some particular community. Or I might want to know the set of people whom I have contacted for a particular purpose, which is not necessarily related to their present team. Now, is it possible for me to get this view of the users?

People tagging is all about this. Here is a paper from IBM that talks about people tagging in the enterprise.

The concept is simple, but extremely powerful. The idea is to tag people, the way you tag links in a bookmarking tool. Once you do that, you can find all people who belong to a particular tag.
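
A toy sketch of the idea in code (the names are mine, and this has nothing to do with the IBM paper's actual implementation): a tag maps to the set of people carrying it, so "find everyone tagged X" becomes a single lookup.

```java
import java.util.*;

// A toy people-tagging store: tag -> set of people who carry that tag.
class PeopleTags {
    private final Map<String, Set<String>> tagToPeople = new HashMap<>();

    // Tag a person the way you would tag a link in a bookmarking tool.
    void tag(String person, String tag) {
        tagToPeople.computeIfAbsent(tag, t -> new TreeSet<>()).add(person);
    }

    // Find all people who belong to a particular tag.
    Set<String> peopleWith(String tag) {
        return tagToPeople.getOrDefault(tag, Collections.emptySet());
    }
}
```

With something like this in place, "mail everyone active in community X" is just a call to peopleWith("community-x"), independent of anyone's team.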

Tagging is central to almost all resources today and will soon form part of the filesystem. (Heard of semantic filesystems?) The line between the functions and services provided by the operating system and the services provided on the internet will blur, resulting in the emergence of the first generation of Web O/Ses. Soon, Web O/Ses will be THE O/Ses.

A parting thought. Today I saw an alert in my mailbox that talked about the next-generation web. Wonder where the article is from? Deccan Herald! I don't know how many people noticed it, but it is a sign that the semantic web is catching on. The article talked about how Google throws unexpected results for (mostly technical) words that have more than one meaning, and how the semantic web can help solve this.

Whoa. Enough for today. 🙂

Wikipedia in RDF

This blog entry is not so much about what Wikipedia in RDF is as about the kind of problems that I faced in using it.

When I initially read about the Wikipedia in RDF initiative, I was excited. Imagine being able to download the meta-information of ALL the articles of Wikipedia and then being able to query it, analyze it and do anything you would want to do with it.

I loyally downloaded the gzip of the RDF/XML format. The compressed file is 397 MB and the uncompressed size is supposed to be 3.7 GB (supposed to be, because I did not have enough space on a single partition to extract the whole archive. I initially doubted whether XP supports files of this size, but I saw a page which said that on NTFS partitions the maximum file size is the size of the volume).

Ok, here comes a host of problems. I conducted my experiments on a system with 256 MB of RAM. I guess the processor is not bad; it is a 1.7 GHz Celeron.

In order to analyze this file, I first had to extract it. I extracted the archive partially (about 800 MB) and then tried to open it in my text editor, SciTE. I was disappointed: the file did not open. I then tried WordPad (I did not dare to try Notepad!), Vim (for Windows), Edit (from cmd.exe) and Mozilla Firefox.

The best response I got was from Edit (I am not surprised; I have done some tests before and found that Edit is the best text editor on Windows!), which clearly said it cannot handle files of that size and would show only the first 65,000-odd lines. Decent. I at least get to view 65,000 lines!

The second-best response was from Mozilla Firefox. I had some problems here: Firefox tried to parse the file, since it was RDF. I changed the extension to .txt so as to avoid the parsing and tried again. Firefox immediately started loading the file. It occupied about 150 MB of memory just before it stopped responding.

Vim was bad too. 😦 The file just did not open, and Vim exited abnormally.

So I am left with a host of problems before I can start playing with this file.

Is there any text editor that I can use to open this file? I guess there should be SOME editor that does caching and is written specifically to load huge files.
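
While hunting for such an editor, a small program can at least peek at the file. A sketch of the idea: a BufferedReader streams the file line by line, so memory use stays flat no matter how big the file is (the path and line count below are placeholders).

```java
import java.io.*;

// Print the first N lines of an arbitrarily large file without
// ever loading the whole thing into memory.
public class Peek {
    public static void main(String[] args) throws IOException {
        String path = "wikipedia.rdf"; // placeholder for the extracted dump
        int limit = 100;               // how many lines to show
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            for (int i = 0; i < limit && (line = in.readLine()) != null; i++) {
                System.out.println(line);
            }
        }
    }
}
```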

Ok, now on to the second problem. I am thinking of doing some analysis using this RDF document. In order to do that, I would seemingly have to 'load' the entire file into memory (because it requires XML parsing of the RDF), or else I cannot use it. I guess I should use FileChannel to create a map of the file and a pull parser to parse it.
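
For what it is worth, a pull parser should make the "load the entire file" step unnecessary. A sketch along these lines (assuming a StAX implementation is available on the classpath; the file path and the element name being counted are just examples) reads the RDF/XML as a stream of events, touching only a small buffer at a time:

```java
import java.io.FileInputStream;
import javax.xml.stream.*;

// Stream an RDF/XML file with a pull parser (StAX) instead of
// building the whole document in memory.
public class CountDescriptions {
    public static void main(String[] args) throws Exception {
        String path = "wikipedia.rdf"; // placeholder for the extracted dump
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader =
                factory.createXMLStreamReader(new FileInputStream(path));
        long count = 0;
        while (reader.hasNext()) {
            // next() advances to the following event; we only look at
            // element starts, e.g. every rdf:Description in the dump.
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "Description".equals(reader.getLocalName())) {
                count++;
            }
        }
        reader.close();
        System.out.println(count + " descriptions seen");
    }
}
```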

I have not tried this, but I am cent per cent sure that I will face problems. Size does matter!

Wish me luck. 🙂

ISL Project Interview

I went to college today for the ISL internship interview. Three of us (Kulki, Suresh and I) made the trip, each with only four hours of sleep the previous night.

We left early in the morning, at around 6:30, and reached by 10:30. The process consisted of a written test followed by interviews for short-listed candidates.

We ended up selecting 6 students. It was not an easy job!

I felt really bad looking at the sullen faces of the students who were not selected, especially when I realized I was part of the decision. But then destiny plays its role and I guess we made the right choices.

Thanks to all the co-ordinators. In all, it was a nice (although tiring) experience.